Creating big Excel sheets programmatically
We are using OpenPyxl to export MySQL content to Microsoft Excel in XSLX format
https://bitbucket.org/ericgazoni/openpyxl/overview
However, the amount of data we are dealing with is big. We are running to out of memory situation. Tables may contain up to 400 columns in 50000+ rows. Even the files are big, they are not that big that Microsoft Excel or OpenOffice should have problems with them. We are assuming our issues mainly stem from the fact that Python keeps XML DOM structure in memory in not efficient enough manner.
EDIT: Eric, the author of OpenPyxl, pointed out that there is an option to make OpenPyxl write with fixed memory usage. However, this didn't solve our problem completely, as we still have issues with raw speed and something else taking up too much memory in Python.
Now we are looking for more efficient ways to create Excel files. With Python preferably, but if we cannot find a good solution we might want to look other programming languages as well.
Options, not in any specific order, include
1) Using OpenOffice and PyUno and hope their memory structures are more efficient than with OpenPyxl and the TCP/IP call bridge is efficient enough
2) Openpyxl uses xml.etree. Would Python lxml (libxml2 native extension) be more efficient wit XML memo开发者_如何转开发ry structures and is it possible to replace xml.etree directly with lxml drop-in e.g. with monkey-patching? (later the changes could be contributed back to Openpyxl if there is a clear benefit)
3) Export from MySQL to CSV and then post-process CSV files directly to XSLX using Python and file iteration
4) Use other programming languages and libraries (Java)
Pointers:
http://dev.lethain.com/handling-very-large-csv-and-xml-files-in-python/
http://enginoz.wordpress.com/2010/03/31/writing-xlsx-with-java/
If you're going to use Java, you will want to use Apache POI, but likely not the regular UserModel as you're wanting to keep your memory footprint down.
Instead, take a look at BigGridDemo, which shows you how to write a very large xlsx file using POI, with most of the work not happening in memory.
You might also find that the technique used in the BigGridDemo could equally be used in Python?
Have you tried to look at the optimized writer for openpyxl ? It's a recent feature (2 months old), but it's quite robust (used in production in several corporate projects) and can handle almost indefinite amount of data with steady memory consumption (around 7Mb)
http://packages.python.org/openpyxl/optimized.html#optimized-writer
精彩评论