Optimized processing of very large files
My task is relatively simple: for each line in an input file, test whether the line satisfies a given set of conditions, and if so, write specific columns of that line to a new file. I've written a Python script that does this, but I'd like some help on 1) improving speed, 2) the best way to work in terms of column names (as column numbers can vary from file to file), and 3) the best way to specify my filtering conditions and desired output columns.
1) The files I work with contain photometry for astronomical images. Each file is around 1e6 lines by 150 columns of floats, typically over 1GB in size. I have an old AWK script that will process files like this in about 1 minute; my Python script takes between 5 and 7 minutes. I often need to tweak the filtering conditions and rerun several times until the output file is what I want, so speed is definitely desirable. I've found that the for loop is plenty fast; it's how I do things inside the loop that slows it down. Using itemgetter to pick out just the columns I want was a big improvement over reading the entire line into memory, but I'm unsure of what I can do to further increase speed. Can this ever be as fast as AWK?
2) I'd like to work in terms of column names instead of column numbers since the column number of a particular quantity (photon counts, background, signal-to-noise, etc) can change between files. In my AWK script, I always need to check that the column numbers are correct where conditions and output columns are specified, even if the filtering and output apply to the same quantities. My solution in python has been to create a dictionary that assigns a column number to each quantity. When a file has different columns, I only need to specify a new dictionary. Perhaps there is a better way to do this?
3) Ideally, I would only need to specify the names of the input and output files, the filtering conditions, and desired columns to output, and they would be found at the top of my script so I wouldn't need to go searching through the code just to tweak something. My main issue has been with undefined variables. For example, a typical condition is 'SNR > 4', but 'SNR' (signal-to-noise) isn't actually assigned a value until lines start being read from the photometry file. My solution has been to use a combination of strings and eval/exec. Again, maybe there is a better way?
I'm not at all trained in computer science (I'm a grad student in astronomy) - I typically just hack something together and debug until it works. However, optimization with regard to my three points above has become extremely important for my research. I apologize for the lengthy post, but I felt that the details would be helpful. Any and all advice you have for me, in addition to just cleaning things up/coding style, would be greatly appreciated.
Thanks so much, Jake
#! /usr/bin/env python2.6
from operator import itemgetter
infile = 'ugc4305_1.phot'
outfile = 'ugc4305_1_filt.phot'
# names must belong to the columns dictionary below
conditions = 'OBJ <= 2 and SNR1 > 4 and SNR2 > 4 and FLAG1 < 8 and FLAG2 < 8 and (SHARP1 + SHARP2)**2 < 0.075 and (CROWD1 + CROWD2) < 0.1'
input = 'OBJ, SNR1, SNR2, FLAG1, FLAG2, SHARP1, SHARP2, CROWD1, CROWD2'
# should contain all quantities used in conditions
output = 'X, Y, OBJ, COUNTS1, BG1, ACS1, ERR1, CHI1, SNR1, SHARP1, ROUND1, CROWD1, FLAG1, COUNTS2, BG2, ACS2, ERR2, CHI2, SNR2, SHARP2, ROUND2, CROWD2, FLAG2'
# dictionary of col. numbers for the more important quantities
columns = dict(EXT=0, CHIP=1, X=2, Y=3, CHI_GL=4, SNR_GL=5, SHARP_GL=6, ROUND_GL=7, MAJAX_GL=8, CROWD_GL=9, OBJ=10, COUNTS1=11, BG1=12, ACS1=13, STD1=14, ERR1=15, CHI1=16, SNR1=17, SHARP1=18, ROUND1=19, CROWD1=20, FWHM1=21, ELLIP1=22, PSFA1=23, PSFB1=24, PSFC1=25, FLAG1=26, COUNTS2=27, BG2=28, ACS2=29, STD2=30, ERR2=31, CHI2=32, SNR2=33, SHARP2=34, ROUND2=35, CROWD2=36, FWHM2=37, ELLIP2=38, PSFA2=39, PSFB2=40, PSFC2=41, FLAG2=42)
f = open(infile)
g = open(outfile, 'w')
# make string that extracts values for testing
input_items = []
for i in input.replace(',', ' ').split():
    input_items.append(columns[i])
input_items = ', '.join(str(i) for i in input_items)
var_assign = '%s = [eval(i) for i in itemgetter(%s)(line.split())]' % (input, input_items)
# make string that specifies values for writing
output_items = []
for i in output.replace(',', ' ').split():
    output_items.append(columns[i])
output_items = ', '.join(str(i) for i in output_items)
output_values = 'itemgetter(%s)(line.split())' % output_items
# make string that specifies format for writing
string_format = []
for i in output.replace(',', ' ').split():
    string_format.append('%s')
string_format = ' '.join(string_format)+'\n'
# main loop
for line in f:
    exec(var_assign)
    if eval(conditions):
        g.write(string_format % tuple(eval(output_values)))
f.close()
g.close()
I don't think you mentioned it, but it looks like your data is in csv. You might get a lot out of using csv.DictReader. You can iterate over files 1 line at a time (avoiding loading the whole thing into memory) and refer to columns by their names.
You should also take a look at cProfile, Python's profiler, if you haven't already. It will tell you what bits of your program are taking the most time to execute.
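For example, a minimal way to run the profiler from inside Python (main() here stands for whatever function drives your loop):

import cProfile
import pstats

# Profile the driver function, save the stats to a file, and print the 20
# most expensive calls sorted by cumulative time.
cProfile.run('main()', 'filter.prof')
pstats.Stats('filter.prof').sort_stats('cumulative').print_stats(20)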
My first step here would be to get rid of the exec() and eval() calls. Each time you eval a string, it has to be compiled and then executed, adding to the overhead of your function call on every line of your file. Not to mention, eval tends to lead to messy, hard-to-debug code and should generally be avoided.
You can start refactoring by putting your logic into small, easily understandable functions. For example, you can replace eval(conditions) with a function, e.g.:
def conditions(d):
    return (d[OBJ] <= 2 and
            d[SNR1] > 4 and
            d[SNR2] > 4 and
            d[FLAG1] < 8 and ...
Tip: if some of your conditionals have a higher probability of failing, put them first, and Python will skip evaluating the rest.
I would get rid of the dictionary of column names and simply set a bunch of variables at the top of your file, then refer to columns by line[COLNAME]. This may help you simplify some parts like the conditions function, and you can refer to the columns by name without having to assign each variable.
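A rough sketch of that layout, abbreviated to just a few of the conditions (the indices come from the columns dictionary in the question):

OBJ, SNR1, SNR2 = 10, 17, 33   # column numbers, set once at the top of the script

def conditions(fields):
    # fields is line.split(); convert only the values the test needs
    return (float(fields[OBJ]) <= 2 and
            float(fields[SNR1]) > 4 and
            float(fields[SNR2]) > 4)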
Here's how I would go about something like this...
This runs in ~35 secs vs. ~3 minutes for your original script on my machine. It is possible to add a few more optimizations (we only need to convert a few of the columns to floats, for example), but that only shaves a few seconds off of the run time.
You could also easily use csv.DictReader here, as several people have suggested. I'm avoiding it, as you'd have to define a custom dialect, and it's only a couple of extra lines to do the same thing without it. (The various csv module classes also check for more complex behavior (e.g. quoted strings, etc.) that you don't need to worry about in this particular case. They're very, very handy in many cases, but they're slight overkill in this case.)
Note that you can also easily add your infile and outfile names as arguments when you call the script instead of hardcoding them in (i.e. infile = sys.argv[1], etc.). This would also allow you to pipe data in or out easily... (You can check the length of sys.argv and set infile or outfile to sys.stdin and/or sys.stdout accordingly.)
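A sketch of that argument handling; note that process_data() below takes file names, so you would either pass the names through or adapt it to accept open file objects:

import sys

# Fall back to stdin/stdout when file names aren't given on the command line.
infile = open(sys.argv[1]) if len(sys.argv) > 1 else sys.stdin
outfile = open(sys.argv[2], 'w') if len(sys.argv) > 2 else sys.stdout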
def main():
    infile = 'ugc4305_1.phot'
    outfile = 'ugc4305_1_filt.phot'
    process_data(infile, outfile)

def filter_conditions(row):
    for key, value in row.iteritems():
        row[key] = float(value)

    cond = (row['OBJ'] <= 2 and row['SNR1'] > 4
            and row['SNR2'] > 4 and row['FLAG1'] < 8
            and row['FLAG2'] < 8
            and (row['SHARP1'] + row['SHARP2'])**2 < 0.075
            and (row['CROWD1'] + row['CROWD2']) < 0.1
            )
    return cond

def format_output(row):
    output_columns = ('X', 'Y', 'OBJ', 'COUNTS1', 'BG1', 'ACS1', 'ERR1', 'CHI1',
                      'SNR1', 'SHARP1', 'ROUND1', 'CROWD1', 'FLAG1', 'COUNTS2',
                      'BG2', 'ACS2', 'ERR2', 'CHI2', 'SNR2', 'SHARP2', 'ROUND2',
                      'CROWD2', 'FLAG2')
    delimiter = '\t'
    return delimiter.join((row[name] for name in output_columns))

def process_data(infilename, outfilename):
    column_names = ('EXT', 'CHIP', 'X', 'Y', 'CHI_GL', 'SNR_GL', 'SHARP_GL',
                    'ROUND_GL', 'MAJAX_GL', 'CROWD_GL', 'OBJ', 'COUNTS1',
                    'BG1', 'ACS1', 'STD1', 'ERR1', 'CHI1', 'SNR1', 'SHARP1',
                    'ROUND1', 'CROWD1', 'FWHM1', 'ELLIP1', 'PSFA1', 'PSFB1',
                    'PSFC1', 'FLAG1', 'COUNTS2', 'BG2', 'ACS2', 'STD2',
                    'ERR2', 'CHI2', 'SNR2', 'SHARP2', 'ROUND2', 'CROWD2',
                    'FWHM2', 'ELLIP2', 'PSFA2', 'PSFB2', 'PSFC2', 'FLAG2')
    with open(infilename) as infile:
        with open(outfilename, 'w') as outfile:
            for line in infile:
                line = line.strip().split()
                row = dict(zip(column_names, line))
                if filter_conditions(row.copy()):
                    outfile.write(format_output(row) + '\n')

if __name__ == '__main__':
    main()
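For reference, the float-conversion optimization mentioned above might look roughly like this; it converts only the columns the test actually uses, and since it no longer mutates the row, the row.copy() call can be dropped:

def filter_conditions(row):
    # Convert only the columns the conditions use; everything else stays a
    # string for output.
    return (float(row['OBJ']) <= 2 and float(row['SNR1']) > 4
            and float(row['SNR2']) > 4 and float(row['FLAG1']) < 8
            and float(row['FLAG2']) < 8
            and (float(row['SHARP1']) + float(row['SHARP2']))**2 < 0.075
            and (float(row['CROWD1']) + float(row['CROWD2'])) < 0.1)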
Like nmichaels said, you can use the fieldnames and dialect parameters of csv.DictReader to read this file. Then, for each line you will have a dictionary. With the dictionary, you won't have to use eval, and can use statements like
if float(data_item['OBJ']) <= 2 and float(data_item['SNR1']) > 4:
    g.write('%s %s %s\n' % (data_item['X'], data_item['Y'], data_item['OBJ']))
The way you're doing it now is slow and complicated because of all the evals. There's no need for that complexity.
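A sketch of the reader setup (assuming single-space-separated columns with no header row; column_names is the same tuple of names used in the answer above, and g is the output file handle from the question):

import csv

with open(infile) as f:
    reader = csv.DictReader(f, fieldnames=column_names, delimiter=' ',
                            skipinitialspace=True)
    for data_item in reader:
        if float(data_item['OBJ']) <= 2 and float(data_item['SNR1']) > 4:
            g.write('%s %s %s\n' % (data_item['X'], data_item['Y'], data_item['OBJ']))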
If profiling shows that a lot of time is spent on the actual reading and parsing of the files, and you will process the same raw file many times, you can try to create an intermediate file format optimized for reading with Python.
One thing to try is to read the file once, parse it, and write the result out with pickle/cPickle. Then read the intermediate file back with cPickle in your filter script.
I don't know Python well enough to tell if this will be faster than reading each line and splitting it. (In C# I would use a binary serializer, but I don't know if that is available in Python.)
If disk IO is a bottleneck, you may also try to gzip your input files and read them with the gzip module.
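A rough sketch of the cPickle idea (note that the whole parsed table has to fit in memory this way, which may be a problem for 1 GB files, so treat it as a starting point):

import cPickle as pickle
import gzip

def build_cache(rawfile, cachefile):
    # Parse the raw photometry once and dump the split rows to a binary cache.
    with open(rawfile) as f:
        rows = [line.split() for line in f]
    with open(cachefile, 'wb') as out:
        pickle.dump(rows, out, pickle.HIGHEST_PROTOCOL)

def load_cache(cachefile):
    # Subsequent filter runs read the cache instead of re-parsing the text.
    with open(cachefile, 'rb') as f:
        return pickle.load(f)

# If disk IO is the bottleneck, a gzipped input can be read like a plain file:
# for line in gzip.open('ugc4305_1.phot.gz'):
#     ...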
Have you tried pandas?
I believe OBJ, SNR1, ... are column names, and I assume you are applying the same condition to all your rows. If that's the case, I suggest you go with pandas.
Your code snippet would go something like this...
import pandas as pd
infile = 'ugc4305_1.phot'
outfile = 'ugc4305_1_filt.phot'
df = pd.read_csv(infile, sep=r'\s+')  # the photometry file is whitespace-delimited, not comma-separated
condition = (df['OBJ'] <= 2) & (df['SNR1'] > 4) & (df['SNR2'] > 4) & (df['FLAG1'] < 8) & (df['FLAG2'] < 8) & ((df['SHARP1'] + df['SHARP2'])**2 < 0.075) & ((df['CROWD1'] + df['CROWD2']) < 0.1)
newDf = df[condition]
columnNames = ['col1', 'col2', ...]  # column names you want in the result
newDf = newDf[columnNames]
newDf.to_csv(outfile)