Optimized processing of very large files
My task is relatively simple: for each line in an input file, test whether the line satisfies a given set of conditions, and if so, write specific columns of that line to a new file. I've written a Python script that does this, but I'd like some help on 1) improving speed, 2) the best way to work in terms of column names (as column numbers can vary from file to file), and 3) the best way to specify my filtering conditions and desired output columns.
1) The files I work with contain photometry for astronomical images. Each file is around 1e6 lines by 150 columns of floats, typically over 1GB in size. I have an old AWK script that will process files like this in about 1 minute; my Python script takes between 5 and 7 minutes. I often need to tweak the filtering conditions and rerun several times until the output file is what I want, so speed is definitely desirable. I've found that the for loop is plenty fast; it's how I do things inside the loop that slows it down. Using itemgetter to pick out just the columns I want was a big improvement over reading the entire line into memory, but I'm unsure of what I can do to further increase speed. Can this ever be as fast as AWK?
2) I'd like to work in terms of column names instead of column numbers since the column number of a particular quantity (photon counts, background, signal-to-noise, etc) can change between files. In my AWK script, I always need to check that the column numbers are correct where conditions and output columns are specified, even if the filtering and output apply to the same quantities. My solution in python has been to create a dictionary that assigns a column number to each quantity. When a file has different columns, I only need to specify a new dictionary. Perhaps there is a better way to do this?
3) Ideally, I would only need to specify the names of the input and output files, the filtering conditions, and desired columns to output, and they would be found at the top of my script so I wouldn't need to go searching through the code just to tweak something. My main issue has been with undefined variables. For example, a typical condition is 'SNR > 4', but 'SNR' (signal-to-noise) isn't actually assigned a value until lines start being read from the photometry file. My solution has been to use a combination of strings and eval/exec. Again, maybe there is a better way?
I'm not at all trained in computer science (I'm a grad student in astronomy) - I typically just hack something together and debug until it works. However, optimization with regard to my three points above has become extremely important for my research. I apologize for the lengthy post, but I felt that the details would be helpful. Any and all advice you have for me, in addition to just cleaning things up/coding style, would be greatly appreciated.
Thanks so much, Jake
#! /usr/bin/env python2.6
from operator import itemgetter
infile = 'ugc4305_1.phot'
outfile = 'ugc4305_1_filt.phot'
# names must belong to the columns dictionary below
conditions = 'OBJ <= 2 and SNR1 > 4 and SNR2 > 4 and FLAG1 < 8 and FLAG2 < 8 and (SHARP1 + SHARP2)**2 < 0.075 and (CROWD1 + CROWD2) < 0.1'
input = 'OBJ, SNR1, SNR2, FLAG1, FLAG2, SHARP1, SHARP2, CROWD1, CROWD2'
# should contain all quantities used in conditions
output = 'X, Y, OBJ, COUNTS1, BG1, ACS1, ERR1, CHI1, SNR1, SHARP1, ROUND1, CROWD1, FLAG1, COUNTS2, BG2, ACS2, ERR2, CHI2, SNR2, SHARP2, ROUND2, CROWD2, FLAG2'
# dictionary of col. numbers for the more important quantities
columns = dict(EXT=0, CHIP=1, X=2, Y=3, CHI_GL=4, SNR_GL=5, SHARP_GL=6, ROUND_GL=7, MAJAX_GL=8, CROWD_GL=9, OBJ=10, COUNTS1=11, BG1=12, ACS1=13, STD1=14, ERR1=15, CHI1=16, SNR1=17, SHARP1=18, ROUND1=19, CROWD1=20, FWHM1=21, ELLIP1=22, PSFA1=23, PSFB1=24, PSFC1=25, FLAG1=26, COUNTS2=27, BG2=28, ACS2=29, STD2=30, ERR2=31, CHI2=32, SNR2=33, SHARP2=34, ROUND2=35, CROWD2=36, FWHM2=37, ELLIP2=38, PSFA2=39, PSFB2=40, PSFC2=41, FLAG2=42)
f = open(infile)
g = open(outfile, 'w')
# make string that extracts values for testing
input_items = []
for i in input.replace(',', ' ').split():
    input_items.append(columns[i])
input_items = ', '.join(str(i) for i in input_items)
var_assign = '%s = [eval(i) for i in itemgetter(%s)(line.split())]' % (input, input_items)
# make string that specifies values for writing
output_items = []
for i in output.replace(',', ' ').split():
    output_items.append(columns[i])
output_items = ', '.join(str(i) for i in output_items)
output_values = 'itemgetter(%s)(line.split())' % output_items
# make string that specifies format for writing
string_format = []
for i in output.replace(',', ' ').split():
    string_format.append('%s')
string_format = ' '.join(string_format)+'\n'
# main loop
for line in f:
    exec(var_assign)
    if eval(conditions):
        g.write(string_format % tuple(eval(output_values)))
f.close()
g.close()
I don't think you mentioned it, but it looks like your data is in csv. You might get a lot out of using csv.DictReader. You can iterate over files 1 line at a time (avoiding loading the whole thing into memory) and refer to columns by their names.
You should also take a look at cProfile, Python's profiler, if you haven't already. It will tell you what bits of your program are taking the most time to execute.
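For example, a minimal way to run the profiler from inside Python (main() here stands for whatever function drives your loop):

import cProfile
import pstats

# Profile the driver function, save the stats to a file, and print the 20
# most expensive calls sorted by cumulative time.
cProfile.run('main()', 'filter.prof')
pstats.Stats('filter.prof').sort_stats('cumulative').print_stats(20)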
My first step here would be to get rid of the exec() and eval() calls. Each time you eval a string, it has to be compiled and then executed, adding to the overhead of your function call on every line of your file. Not to mention, eval tends to lead to messy, hard-to-debug code and should generally be avoided.
You can start refactoring by putting your logic into small, easily understandable functions. For example, you can replace eval(conditions) with a function, e.g.:
def conditions(d):
    return (d[OBJ] <= 2 and
            d[SNR1] > 4 and
            d[SNR2] > 4 and
            d[FLAG1] < 8 and ...
Tip: if some of your conditionals have a higher probability of failing, put them first, and Python will skip evaluating the rest.
I would get rid of the dictionary of column names and simply set a bunch of variables at the top of your file, then refer to columns by line[COLNAME]. This may help you simplify some parts like the conditions function, and you can refer to the columns by name without having to assign each variable.
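A rough sketch of that layout, abbreviated to just a few of the conditions (the indices come from the columns dictionary in the question):

OBJ, SNR1, SNR2 = 10, 17, 33   # column numbers, set once at the top of the script

def conditions(fields):
    # fields is line.split(); convert only the values the test needs
    return (float(fields[OBJ]) <= 2 and
            float(fields[SNR1]) > 4 and
            float(fields[SNR2]) > 4)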
Here's how I would go about something like this...
This runs in ~35 secs vs. ~3 minutes for your original script on my machine. It is possible to add a few more optimizations (we only need to convert a few of the columns to floats, for example), but that only shaves a few seconds off of the run time.
You could also easily use csv.DictReader here, as several people have suggested. I'm avoiding it, as you'd have to define a custom dialect, and it's only a couple of extra lines to do the same thing without it. (The various csv module classes also check for more complex behavior (e.g. quoted strings, etc.) that you don't need to worry about in this particular case. They're very, very handy in many cases, but they're slight overkill in this case.)
Note that you can also easily add your infile and outfile names as arguments when you call the script instead of hardcoding them in (i.e. infile = sys.argv[1], etc.). This would also allow you to pipe data in or out easily... (You can check the length of sys.argv and set infile or outfile to sys.stdin and/or sys.stdout accordingly.)
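A sketch of that argument handling; note that process_data() below takes file names, so you would either pass the names through or adapt it to accept open file objects:

import sys

# Fall back to stdin/stdout when file names aren't given on the command line.
infile = open(sys.argv[1]) if len(sys.argv) > 1 else sys.stdin
outfile = open(sys.argv[2], 'w') if len(sys.argv) > 2 else sys.stdout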
def main():
    infile = 'ugc4305_1.phot'
    outfile = 'ugc4305_1_filt.phot'
    process_data(infile, outfile)

def filter_conditions(row):
    for key, value in row.iteritems():
        row[key] = float(value)

    cond = (row['OBJ'] <= 2 and row['SNR1'] > 4
            and row['SNR2'] > 4 and row['FLAG1'] < 8
            and row['FLAG2'] < 8
            and (row['SHARP1'] + row['SHARP2'])**2 < 0.075
            and (row['CROWD1'] + row['CROWD2']) < 0.1
            )
    return cond

def format_output(row):
    output_columns = ('X', 'Y', 'OBJ', 'COUNTS1', 'BG1', 'ACS1', 'ERR1', 'CHI1',
                      'SNR1', 'SHARP1', 'ROUND1', 'CROWD1', 'FLAG1', 'COUNTS2',
                      'BG2', 'ACS2', 'ERR2', 'CHI2', 'SNR2', 'SHARP2', 'ROUND2',
                      'CROWD2', 'FLAG2')
    delimiter = '\t'
    return delimiter.join((row[name] for name in output_columns))

def process_data(infilename, outfilename):
    column_names = ('EXT', 'CHIP', 'X', 'Y', 'CHI_GL', 'SNR_GL', 'SHARP_GL',
                    'ROUND_GL', 'MAJAX_GL', 'CROWD_GL', 'OBJ', 'COUNTS1',
                    'BG1', 'ACS1', 'STD1', 'ERR1', 'CHI1', 'SNR1', 'SHARP1',
                    'ROUND1', 'CROWD1', 'FWHM1', 'ELLIP1', 'PSFA1', 'PSFB1',
                    'PSFC1', 'FLAG1', 'COUNTS2', 'BG2', 'ACS2', 'STD2',
                    'ERR2', 'CHI2', 'SNR2', 'SHARP2', 'ROUND2', 'CROWD2',
                    'FWHM2', 'ELLIP2', 'PSFA2', 'PSFB2', 'PSFC2', 'FLAG2')
    with open(infilename) as infile:
        with open(outfilename, 'w') as outfile:
            for line in infile:
                line = line.strip().split()
                row = dict(zip(column_names, line))
                if filter_conditions(row.copy()):
                    outfile.write(format_output(row) + '\n')

if __name__ == '__main__':
    main()
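For reference, the float-conversion optimization mentioned above might look roughly like this; it converts only the columns the test actually uses, and since it no longer mutates the row, the row.copy() call can be dropped:

def filter_conditions(row):
    # Convert only the columns the conditions use; everything else stays a
    # string for output.
    return (float(row['OBJ']) <= 2 and float(row['SNR1']) > 4
            and float(row['SNR2']) > 4 and float(row['FLAG1']) < 8
            and float(row['FLAG2']) < 8
            and (float(row['SHARP1']) + float(row['SHARP2']))**2 < 0.075
            and (float(row['CROWD1']) + float(row['CROWD2'])) < 0.1)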
Like nmichaels said, you can use the fieldnames and dialect parameters of csv.DictReader to read this file. Then, for each line you will have a dictionary. With the dictionary, you won't have to use eval, and can use statements like
if float(data_item['OBJ']) <= 2 and float(data_item['SNR1']) > 4:
    g.write('%s %s %s\n' % (data_item['X'], data_item['Y'], data_item['OBJ']))
The way you're doing it now is slow and complicated because of all the evals. There's no need for that complexity.
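A sketch of the reader setup (assuming single-space-separated columns with no header row; column_names is the same tuple of names used in the answer above, and g is the output file handle from the question):

import csv

with open(infile) as f:
    reader = csv.DictReader(f, fieldnames=column_names, delimiter=' ',
                            skipinitialspace=True)
    for data_item in reader:
        if float(data_item['OBJ']) <= 2 and float(data_item['SNR1']) > 4:
            g.write('%s %s %s\n' % (data_item['X'], data_item['Y'], data_item['OBJ']))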
If profiling shows that a lot of time is spent on the actual reading and parsing of the files, and you will process the same raw file many times, you can try to create an intermediate file format optimized for reading with Python.
One thing to try is to read the file once, parse it, and write the result out with pickle/cPickle. Then read the intermediate file back with cPickle in your filter script.
I don't know Python well enough to tell if this will be faster than reading each line and splitting it. (In C# I would use a binary serializer, but I don't know if that is available in Python.)
If disk IO is a bottleneck, you may also try to gzip your input files and read them with the gzip module.
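A rough sketch of the cPickle idea (note that the whole parsed table has to fit in memory this way, which may be a problem for 1 GB files, so treat it as a starting point):

import cPickle as pickle
import gzip

def build_cache(rawfile, cachefile):
    # Parse the raw photometry once and dump the split rows to a binary cache.
    with open(rawfile) as f:
        rows = [line.split() for line in f]
    with open(cachefile, 'wb') as out:
        pickle.dump(rows, out, pickle.HIGHEST_PROTOCOL)

def load_cache(cachefile):
    # Subsequent filter runs read the cache instead of re-parsing the text.
    with open(cachefile, 'rb') as f:
        return pickle.load(f)

# If disk IO is the bottleneck, a gzipped input can be read like a plain file:
# for line in gzip.open('ugc4305_1.phot.gz'):
#     ...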
Have you tried pandas?
I believe OBJ, SNR1, ... are column names, and I assume you are applying the same condition to all your rows. If that's the case, I suggest you go with pandas.
Your code snippet would go something like this...
import pandas as pd
infile = 'ugc4305_1.phot'
outfile = 'ugc4305_1_filt.phot'
df = pd.read_csv(infile, sep=r'\s+')  # the photometry file is whitespace-delimited, not comma-separated
condition = (df['OBJ'] <= 2) & (df['SNR1'] > 4) & (df['SNR2'] > 4) & (df['FLAG1'] < 8) & (df['FLAG2'] < 8) & ((df['SHARP1'] + df['SHARP2'])**2 < 0.075) & ((df['CROWD1'] + df['CROWD2']) < 0.1)
newDf = df[condition]
columnNames = ['col1', 'col2', ...]  # column names you want in the result
newDf = newDf[columnNames]
newDf.to_csv(outfile)