What's the most efficient way to convert a MySQL result set to a NumPy array?

2023-03-28 16:28 问答作者：

I'm using MySQLdb and Python. I have some basic queries such as this:

c=db.cursor()
c.execute("SELECT id, rating from video")
results = c.fetchall()

I need "results" to be a NumPy array, and I'm looking to be economical with my memory consumption. It seems like copying the data row by row would be incredibly inefficient (double the memory would be required). Is there a better way to convert MySQLdb query results into the NumPy array format?

The reason I'm looking to use the NumPy array f开发者_StackOverflowormat is because I want to be able to slice and dice the data easily, and it doesn't seem like python is very friendly to multi-dimensional arrays in that regard.

e.g. b = a[a[:,2]==1]

Thanks!

This solution uses Kieth's fromiter technique, but handles the two dimensional table structure of SQL results more intuitively. Also, it improves on Doug's method by avoiding all the reshaping and flattening in python data types. Using a structured array we can read pretty much directly from the MySQL result into numpy, cutting out python data types almost entirely. I say 'almost' because the fetchall iterator still produces python tuples.

There is one caveat though, but it's not a biggie. You must know the data type of your columns and the number of rows in advance.

Knowing the column types should be obvious, since you know what the query is presumably, otherwise you can always use curs.description, and a map of the MySQLdb.FIELD_TYPE.* constants.

Knowing the row count means you have to use client side cursor (which is the default). I don't know enough about the internals of MySQLdb and the MySQL client libraries, but my understanding is that the entire result is fetched into client side memory when using client side cursors, although I suspect there's actually some buffering and caching involved. This would mean using double memory for the result, once for the cursor copy and once for the array copy, so it's probably a good idea to close the cursor as soon as possible to free up the memory if the result set is large.

Strictly speaking, you don't have to provide the number of rows in advance, but doing so means the array memory is allocated once off in advance, and not continuously resized as more rows come in from the iterator which is meant to provide a huge performance boost.

And with that, some code

import MySQLdb
import numpy

conn = MySQLdb.connect(host='localhost', user='bob', passwd='mypasswd', db='bigdb')
curs = conn.cursor() #Use a client side cursor so you can access curs.rowcount
numrows = curs.execute("SELECT id, rating FROM video")

#curs.fetchall() is the iterator as per Kieth's answer
#count=numrows means advance allocation
#dtype='i4,i4' means two columns, both 4 byte (32 bit) integers
A = numpy.fromiter(curs.fetchall(), count=numrows, dtype=('i4,i4'))

print A #output entire array
ids = A['f0'] #ids = an array of the first column
              #(strictly speaking it's a field not column)
ratings = A['f1'] #ratings is an array of the second colum

See the numpy documentation for dtype and the link above about structured arrays for how to specify column data types, and column names.

The fetchall method actually returns an iterator, and numpy has the fromiter method to initialize an array from an interator. So, depending on what data is in the table you could combine the two easily, or use an adapter generator.

NumPy's fromiter method seems best here (as in Keith's answer, which preceded this one).

Using fromiter to recast a result set, returned by a call to a MySQLdb cursor method, to a NumPy array is simple, but there are a couple of details perhaps worth mentioning.

import numpy as NP
import MySQLdb as SQL

cxn = SQL.connect('localhost', 'some_user', 'their_password', 'db_name')
c = cxn.cursor()
c.execute('SELECT id, ratings from video')

# fetchall() returns a nested tuple (one tuple for each table row)
results = cursor.fetchall()

# 'num_rows' needed to reshape the 1D NumPy array returend by 'fromiter' 
# in other words, to restore original dimensions of the results set
num_rows = int(c.rowcount)

# recast this nested tuple to a python list and flatten it so it's a proper iterable:
x = map(list, list(results))              # change the type
x = sum(x, [])                            # flatten

# D is a 1D NumPy array
D = NP.fromiter(iterable=x, dtype=float, count=-1)  

# 'restore' the original dimensions of the result set:
D = D.reshape(num_rows, -1)

Note that fromiter returns a 1D NumPY array,

(This makes sense, of course, because you can use fromiter to return just a portion of a single MySQL Table row, by passing a parameter for count).

Still, you'll have to restore the 2D shape, hence the predicate call to the cursor method rowcount. and the subsequent call to reshape in the final line.

Finally, the default argument for the parameter count is '-1', which just retrieves the entire iterable

继续阅读：mysql-python numpy python

What's the most efficient way to convert a MySQL result set to a NumPy array?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？