Retrieving millions of rows from MySQL

There are times when a query returns a very large number of rows. If you use the default cursor, chances are your process will be killed while retrieving them. The reason is that, by default, MySQL clients (e.g. the Java connector, the Python driver) retrieve all rows and buffer them in memory before passing the result set to your code. If you run out of memory while doing that, your process will almost certainly be killed.
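For contrast, here is roughly what the memory-hungry default usage looks like (a minimal sketch; the connection parameters and table name are placeholders):

import MySQLdb

# Placeholder connection parameters; substitute your own.
conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="bigdata")

# The default (buffered) cursor pulls the whole result set into client memory
# during execute(), before your code ever sees a row.
cursor = conn.cursor()
cursor.execute("SELECT id, payload FROM huge_table")  # hypothetical table
rows = cursor.fetchall()  # millions of rows now live in this Python process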

The fix is to use a streaming result set. In Python, you can use MySQLdb.cursors.SSCursor for this purpose.

import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(...)
# SSCursor lives in MySQLdb.cursors, not at the top level of the MySQLdb package
cursor = MySQLdb.cursors.SSCursor(conn)
cursor.execute(...)
while True:
    row = cursor.fetchone()
    if not row:
        break
    ...

There are two important things to remember here:
  1. Use an SSCursor instead of the default cursor. This can be done as shown above, or by passing the class to the cursor() call, e.g. conn.cursor(MySQLdb.cursors.SSCursor) (see the sketch after this list).
  2. Use fetchone to fetch rows from the result set, one row at a time. Do not use fetchall. You can use fetchmany, but it is effectively the same as calling fetchone that many times.
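Here is a minimal sketch of both construction styles, plus fetchmany-based batching (the connection parameters, table name, and process() helper are placeholders):

import MySQLdb
import MySQLdb.cursors

# Make every cursor on this connection unbuffered by default.
conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="bigdata",
                       cursorclass=MySQLdb.cursors.SSCursor)

# Equivalent per-cursor form on an ordinary connection:
# cursor = conn.cursor(MySQLdb.cursors.SSCursor)

cursor = conn.cursor()
cursor.execute("SELECT id, payload FROM huge_table")  # hypothetical table

# fetchmany() is only a convenience wrapper; rows still stream in one by one underneath.
while True:
    batch = cursor.fetchmany(1000)
    if not batch:
        break
    for row in batch:
        process(row)  # hypothetical per-row handler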
One common misconception is to treat SSCursor as a server-side cursor. It is not! This class is in fact just an unbuffered cursor. It does not read the whole result set into memory the way the default cursor does (which is a buffered cursor). Instead, it reads from the response stream in chunks and returns records to you one by one. A more appropriate name for this is a streaming result set.

Because SSCursor is only an unbuffered cursor (I repeat, not a real server-side cursor), several restrictions apply to it:
  1. You must read ALL records. The rationale is that you send one query and the server replies with one answer, albeit a really long one. Therefore, before you can do anything else on that connection, even a simple ping, you must completely finish reading this response.
  2. This brings another restriction: you must process each row quickly. If your processing takes even half a second per row, you will find your connection dropped unexpectedly with error 2013, "Lost connection to MySQL server during query." The reason is that, by default, MySQL waits only 60 seconds for a socket write to finish. The server is trying to push a large amount of data down the wire, yet the client is taking its time processing chunk by chunk, so the server is likely to just give up. You can increase this timeout by issuing the query SET NET_WRITE_TIMEOUT = xx, where xx is the number of seconds MySQL will wait for a socket write to complete. But please do not rely on that as a workable remedy; fix your processing instead. Or, if you cannot reduce the processing time any further, quickly dump the rows somewhere local so the query completes first, and then read them back later at a more leisurely pace (see the first sketch after this list).
  3. The first restriction also means that your connection is completely tied up while you are retrieving the rows. There is no way around it. If you need to run another query in parallel, do it on another connection (see the second sketch after this list). Otherwise, you will get error 2014, "Commands out of sync; you can't run this command now."
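If the per-row work genuinely cannot be made fast, here is one way to combine the two remedies from restriction 2: raise the write timeout and spool the rows to a local file so the query finishes quickly (a sketch; the connection parameters and table name are placeholders, and 600 seconds is an arbitrary example value):

import csv
import tempfile

import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="bigdata")

# Give the server more patience for a slow reader; 600 is just an example value.
setup = conn.cursor()
setup.execute("SET SESSION net_write_timeout = 600")
setup.close()

cursor = conn.cursor(MySQLdb.cursors.SSCursor)
cursor.execute("SELECT id, payload FROM huge_table")  # hypothetical table

# Spool rows to disk as fast as possible so the server can finish sending,
# then do the slow per-row processing from the local file afterwards.
with tempfile.NamedTemporaryFile("w", newline="", suffix=".csv", delete=False) as spool:
    writer = csv.writer(spool)
    for row in cursor:  # SSCursor streams row by row
        writer.writerow(row)
    spool_path = spool.name
cursor.close()

# ... later: read spool_path back at a leisurely pace and process each row ...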
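And a second sketch for restriction 3: keep the streaming query on its own connection and run any side queries on a separate one (connection parameters, table names, and the lookup query are placeholders):

import MySQLdb
import MySQLdb.cursors

# One connection does nothing but stream the big result set...
stream_conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="bigdata")
stream_cur = stream_conn.cursor(MySQLdb.cursors.SSCursor)
stream_cur.execute("SELECT id, payload FROM huge_table")  # hypothetical table

# ...while a completely separate connection handles any queries you need along the way,
# so the streaming connection is never asked to do two things at once.
side_conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="bigdata")
side_cur = side_conn.cursor()

for row in stream_cur:
    side_cur.execute("SELECT status FROM jobs WHERE id = %s", (row[0],))  # hypothetical lookup
    status = side_cur.fetchone()
    # ... combine row and status as needed ...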

I hope this post will help some of you.

Reposted from: https://techualization.blogspot.sg/2011/12/retrieving-million-of-rows-from-mysql.html (Iterating over a huge MySQL table with tens of millions of rows in Python)


Reposted from blog.csdn.net/my_xxh/article/details/80173303