Fast conversion of numeric data into fixed width format file in Python
What is the fastest way of converting records holding only numeric data into fixed-width format strings and writing them to a file in Python? For example, suppose record is a huge list consisting of objects with attributes id, x, y, and wt, and we frequently need to flush them to an external file. The flushing can be done with the following snippet:
with open(serial_fname(), "w") as f:
    for r in records:
        f.write("%07d %11.5e %11.5e %7.5f\n" % (r.id, r.x, r.y, r.wt))
However, my code spends so much time generating the external files that there is too little time left for doing what it is supposed to do between the flushes.
Amendment to the original question:
I ran into this problem while writing server software that keeps track of a global record set by pulling information from several "producer" systems and relays any changes to the record set to "consumer" systems, in real time or near real time, in preprocessed form. Many of the consumer systems are Matlab applications.
I have listed below some suggestions I have received so far (thanks) with some comments:
- Dump only the changes, not the whole data set: I'm actually doing this already, and the resulting change sets are still huge.
- Use binary (or some other more efficient) file format: I'm pretty much constrained by what Matlab can read reasonably efficiently, and in addition the format should be platform independent.
- Use a database: I am actually trying to bypass the current database solution, which is deemed both too slow and cumbersome, especially on Matlab's side.
- Divide the task into separate processes: at the moment the dumping code runs in its own thread, but because of the GIL it still consumes the same core. I guess I could move it to a completely separate process (a minimal sketch of that follows this list).
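A minimal sketch of that hand-off, using the standard multiprocessing module (in the standard library from Python 2.6); the names dump_worker and record_queue are invented for illustration, and the worker does all the formatting and disk I/O so the main process only pays for putting a batch on the queue:

import multiprocessing

FMT = "%07d %11.5e %11.5e %7.5f\n"

def dump_worker(queue):
    # Runs in a separate process: formatting and disk I/O happen here,
    # outside the main process and its GIL.
    while True:
        item = queue.get()
        if item is None:               # sentinel: shut down
            break
        fname, rows = item             # rows is a list of (id, x, y, wt) tuples
        f = open(fname, "w")
        try:
            f.writelines(FMT % row for row in rows)
        finally:
            f.close()

if __name__ == "__main__":
    record_queue = multiprocessing.Queue()
    writer = multiprocessing.Process(target=dump_worker, args=(record_queue,))
    writer.start()

    # In the main loop, hand off a snapshot instead of writing it yourself:
    rows = [(1, 0.5, 0.25, 0.125), (2, 1.5, 2.25, 0.5)]    # toy data
    record_queue.put(("snapshot_000.txt", rows))

    record_queue.put(None)             # tell the worker to finish
    writer.join()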
I was trying to check if numpy.savetxt could speed things up a bit so I wrote the following simulation:
import sys
import numpy as np

fmt = '%7.0f %11.5e %11.5e %7.5f'
records = 10000

np.random.seed(1234)
aray = np.random.rand(records, 4)

def writ(f, aray=aray, fmt=fmt):
    fw = f.write
    for row in aray:
        fw(fmt % tuple(row))

def prin(f, aray=aray, fmt=fmt):
    for row in aray:
        print>>f, fmt % tuple(row)

def stxt(f, aray=aray, fmt=fmt):
    np.savetxt(f, aray, fmt)

nul = open('/dev/null', 'w')

def tonul(func, nul=nul):
    func(nul)

def main():
    print 'looping:'
    writ(sys.stdout)
    print 'savetxt:'
    stxt(sys.stdout)
I found the results (on my 2.4 GHz Core Duo MacBook Pro, with Mac OS X 10.5.8, Python 2.5.4 from the DMG on python.org, numpy 1.4 rc1 built from sources) slightly surprising, but they're quite repeatable, so I thought they might be of interest:
$ py25 -mtimeit -s'import ft' 'ft.tonul(ft.writ)'
10 loops, best of 3: 101 msec per loop
$ py25 -mtimeit -s'import ft' 'ft.tonul(ft.prin)'
10 loops, best of 3: 98.3 msec per loop
$ py25 -mtimeit -s'import ft' 'ft.tonul(ft.stxt)'
10 loops, best of 3: 104 msec per loop
So, savetxt seems to be a few percent slower than a loop calling write... but good old print (also in a loop) seems to be a few percent faster than write (I guess it's avoiding some kind of call overhead). I realize that a difference of 2.5% or so isn't very important, but it's not in the direction I intuitively expected it to be, so I thought I'd report it. (BTW, using a real file instead of /dev/null only uniformly adds 6 or 7 milliseconds, so it doesn't change things much one way or the other.)
I don't see anything about your snippet of code that I could really optimize. So, I think we need to do something completely different to solve your problem.
Your problem seems to be that you are chewing large amounts of data, and it's slow to format the data into strings and write the strings to a file. You said "flush" which implies you need to save the data regularly.
Are you saving all the data regularly, or just the changed data? If you are dealing with a very large data set, changing just some data, and writing all of the data... that's an angle we could attack to solve your problem.
If you have a large data set, and you want to update it from time to time... you are a candidate for a database. A real database, written in C for speed, will let you throw lots of data updates at it, and will keep all the records in a consistent state. Then you can, at intervals, run a "report" which will pull the records and write your fixed-width text file from them.
In other words, I'm proposing you divide the problem into two parts: updating the data set piecemeal as you compute or receive more data, and dumping the entire data set into your fixed-width text format, for your further processing.
Note that you could actually generate the text file from the database without stopping the Python process that is updating it. You would get an incomplete snapshot, but if the records are independent, that should be okay.
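As a sketch of what that "report" step could look like, here the fixed-width file is written straight from a database query; sqlite3 only stands in to keep the example self-contained, and the table and column names are invented:

import sqlite3

FMT = "%07d %11.5e %11.5e %7.5f\n"

def dump_report(db_path, out_path):
    # Pull the current state of every record and write the fixed-width file.
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute("SELECT id, x, y, wt FROM records ORDER BY id")
        f = open(out_path, "w")
        try:
            f.writelines(FMT % row for row in cur)
        finally:
            f.close()
    finally:
        conn.close()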
If your further processing is in Python also, you could just leave the data in the database forever. Don't bother round-tripping the data through a fixed-width text file. I'm assuming you are using a fixed-width text file because it's easy to extract the data again for future processing.
If you use the database idea, try PostgreSQL. It's free and it's a real database. For using a database from Python, you should use an ORM; one of the best is SQLAlchemy.
Another thing to consider: if you are saving the data in a fixed-width text file format for future parsing and use of the data in another application, and if that application can read JSON as well as fixed-width, maybe you could use a C module that writes JSON. It might not be any faster, but it might; you could benchmark it and see.
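If the JSON route is worth a benchmark, a minimal sketch could look like this; it uses the json module (standard from Python 2.6, with a C accelerator) or simplejson on older versions, and writes one record per line so the consumer can stream it:

try:
    import json                        # standard from Python 2.6
except ImportError:
    import simplejson as json          # earlier Pythons

def dump_json(records, fname):
    # One JSON object per record, one record per line.
    f = open(fname, "w")
    try:
        for r in records:
            f.write(json.dumps({"id": r.id, "x": r.x, "y": r.y, "wt": r.wt}))
            f.write("\n")
    finally:
        f.close()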
Other than the above, my only other idea is to split your program into a "worker" part and an "updater" part, where the worker generates updated records and the updater part saves the records to disk. Perhaps have them communicate by having the worker put the updated records, in text format, to the standard output; and have the updater read from standard input and update its record of the data. Instead of an SQL database, the updater could use a dictionary to store the text records; as new ones arrived, it could simply update the dictionary. Something like this:
for line in sys.stdin:
    id = line[:7]        # fixed width: id is 7 wide
    records[id] = line   # will insert or update as needed
You could actually have the updater keep two dictionaries, and keep updating one while the other one is written out to disk.
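A rough sketch of that double-buffering idea, assuming the snapshot is written by a background thread while the reading loop keeps filling the other dictionary; the flush interval and file names are made up for illustration:

import sys
import threading

FLUSH_EVERY = 10000                    # flush after this many input lines (illustrative)

def write_snapshot(snapshot, fname):
    # Runs in a background thread so the reading loop keeps consuming stdin.
    f = open(fname, "w")
    try:
        for key in sorted(snapshot):
            f.write(snapshot[key])     # lines already end with '\n'
    finally:
        f.close()

active, pending = {}, {}
writer = None
count = 0
n = 0

for line in sys.stdin:
    id = line[:7]                      # fixed width: id is 7 wide
    active[id] = line                  # insert or update as needed
    count += 1
    if count % FLUSH_EVERY == 0:
        if writer is not None:
            writer.join()              # previous snapshot must be finished
        active, pending = pending, active     # swap the two dictionaries
        active.update(pending)         # new active buffer starts from the full state
        writer = threading.Thread(target=write_snapshot,
                                  args=(pending, "snapshot_%05d.txt" % n))
        writer.start()
        n += 1

if writer is not None:
    writer.join()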
Dividing into a worker and an updater is a good way to make sure the worker doesn't spend all its time updating, and a great way to balance the work across multiple CPU cores.
I'm out of ideas for now.
You could try to build all of the output in memory, e.g. join the formatted lines into one long string, and then write that long string to the file in a single call.
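For example, a small sketch that formats everything first and then hits the file with a single write; the format and attribute names are the ones from the question:

def flush_all(records, fname):
    # Build the whole block in memory, then write it in one call.
    lines = ["%07d %11.5e %11.5e %7.5f\n" % (r.id, r.x, r.y, r.wt)
             for r in records]
    f = open(fname, "w")
    try:
        f.write("".join(lines))
    finally:
        f.close()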
Faster still: you may want to use binary files rather than text files for logging the information. But then you need another tool to view the binary files.
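A possible sketch with numpy, dumping the columns as raw doubles, which Matlab can read back with fread; the id/x/y/wt layout is just one choice:

import numpy as np

def flush_binary(records, fname):
    # Pack the records into an (N, 4) float64 array and dump it raw.
    # Matlab side (one possibility):  fread(fid, [4, inf], 'double')'
    data = np.array([(r.id, r.x, r.y, r.wt) for r in records],
                    dtype=np.float64)
    data.tofile(fname)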
Now that you updated your question, I have a slightly better idea of what you are facing.
I don't know what the "current database solution that is deemed both too slow and cumbersome" is, but I still think a database would help if used correctly.
Run the Python code to collect data, and use an ORM module to insert/update the data into the database. Then run a separate process to make a "report", which would be the fixed-width text files. The database would be doing all the work of generating your text file. If necessary, put the database on its own server, since hardware is pretty cheap these days.
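A hedged sketch of what that could look like with SQLAlchemy's declarative style; the table and column names mirror the attributes from the question but are otherwise invented, and SQLite stands in for PostgreSQL just to keep the example self-contained:

from sqlalchemy import create_engine, Column, Integer, Float
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Record(Base):
    __tablename__ = "records"
    id = Column(Integer, primary_key=True)
    x = Column(Float)
    y = Column(Float)
    wt = Column(Float)

engine = create_engine("sqlite:///records.db")     # swap for a PostgreSQL URL
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

def upsert(session, rec_id, x, y, wt):
    # Insert the record if it is new, otherwise update it in place.
    rec = session.query(Record).get(rec_id)
    if rec is None:
        session.add(Record(id=rec_id, x=x, y=y, wt=wt))
    else:
        rec.x, rec.y, rec.wt = x, y, wt

session = Session()
upsert(session, 1, 0.5, 0.25, 0.125)
session.commit()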
You could try to push your loop into C using ctypes.
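One runnable way to experiment in that direction is to let the C library's sprintf do the per-record formatting through ctypes; whether it actually beats Python's own C-implemented % operator is something to measure, and the sketch assumes a Unix-like libc:

import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"))
buf = ctypes.create_string_buffer(128)             # reusable output buffer

def format_record(rec_id, x, y, wt):
    # Format one record with C's sprintf and return the resulting line.
    n = libc.sprintf(buf, b"%07d %11.5e %11.5e %7.5f\n",
                     ctypes.c_int(rec_id), ctypes.c_double(x),
                     ctypes.c_double(y), ctypes.c_double(wt))
    return buf.raw[:n]

line = format_record(1234, 0.5, 0.25, 0.125)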