Python parsing files

I need to know the best approach for the following scenario.

Let's say we have a huge file that logs the output of a compilation, and there are a couple of error patterns I want to test against this file. For example, the error patterns could be:

 - : error:
 - : error [A-Z]*[\d ]*
 - [A-Z]*[\d]* [Ee]rror:
 -  " Cannot open include file " 
 -  " Could not find " 
 - "is not a member of"
 - "has not been declared"

Let me know if this would be efficient:

  • read the whole file into a variable and close the file
  • grep for each error from the list
  • or create a regular expression for each error and search through the variable

Thank you


If the log file is large, it may not be a good idea to load it into memory. Instead, you can precompile all the regular expressions and test each line against them, e.g.:

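# MY_REGEXPS is assumed to be a list of patterns compiled once with re.compile().
def has_error(filename):
    with open(filename, 'r') as logfile:  # file() exists only in Python 2
        for line in logfile:
            for regexp in MY_REGEXPS:
                if regexp.search(line):
                    return True  # stop at the first matching line
        return False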
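
For a self-contained version, `MY_REGEXPS` could be built once at module level from the patterns in the question. The sketch below is one reading of those patterns (the third is assumed to mean `[Ee]rror:`, and the quoted phrases are treated as plain text without the quotes), so adjust it to your compiler's actual output:

```python
import re

# Hypothetical pattern list derived from the question; compiled once, reused per line.
MY_REGEXPS = [re.compile(p) for p in (
    r": error:",
    r": error [A-Z]*[\d ]*",
    r"[A-Z]*\d* [Ee]rror:",
    r"Cannot open include file",
    r"Could not find",
    r"is not a member of",
    r"has not been declared",
)]


def has_error(filename):
    """Return True as soon as any line matches any precompiled pattern."""
    with open(filename, "r") as logfile:
        for line in logfile:
            if any(regexp.search(line) for regexp in MY_REGEXPS):
                return True
    return False
```

Because the function returns on the first hit, a log with an early error is only partly read.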


Given that the log file is large, the (more) efficient way to check for errors would be to iterate through the file one line at a time and check each line against your patterns. You wouldn't want to be holding a huge file in memory unnecessarily.

In Python, probably something like this:

import re

err = re.compile(r': error(?::| [A-Z]*[\d ]*)|[A-Z]*\d* [Ee]rror:|'
                 r'" (?:Cannot open include file|Could not find) "|'
                 r'"(?:is not a member of|has not been declared)"')
with open('file.log') as f:
    for line in f:
        m = err.search(line)
        if m is not None:
            print(line.rstrip())  # this line indicates an error

though you might have to change the regular expression to suit your needs. An alternative would be to have a list of static strings, e.g.

err_list = ['error', 'Cannot open include file', 'Could not find', 'is not a member of', 'has not been declared']

and just search for each string in each line:

with open('file.log') as f:
    for line in f:
        # use "in" here: str.find returns -1 (truthy) when nothing is found
        if any(e in line for e in err_list):
            print(line.rstrip())  # this line indicates an error
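
One detail worth spelling out for the static-string approach: `str.find` returns `-1` (which is truthy) when the substring is absent and `0` (which is falsy) when it sits at the very start of the line, so a membership test with `in` is the safer check. A minimal sketch, where `matching_lines` is a hypothetical helper and the list is copied from above:

```python
err_list = ['error', 'Cannot open include file', 'Could not find',
            'is not a member of', 'has not been declared']


def matching_lines(lines, needles=err_list):
    """Yield (line_number, line) for every line containing any needle."""
    for lineno, line in enumerate(lines, start=1):
        if any(needle in line for needle in needles):
            yield lineno, line.rstrip('\n')


# Works on any iterable of lines, including an open file object.
sample = ["compiling foo.c\n", "error: missing header\n", "done\n"]
for lineno, text in matching_lines(sample):
    print(lineno, text)
```

Returning the line number alongside the text makes it easier to jump to the failure in a large log.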


Reading the whole file into a variable really wouldn't be efficient: you'd be loading a huge amount of data into memory and only then operating on it. Unless you have plenty of memory to spare, it's probably not a good idea.

Use a generator instead:

def parser(filename):
    # the with statement needs Python 2.6+ (or 2.5 with a __future__ import)
    with open(filename, 'r') as f:
        for line in f:
            if anymatches(line):  # anymatches: your own test returning True/False
                yield line

This will have the benefit of not loading the whole file into memory, and also only producing the matches as you ask for them - so if you want only the first N matches you can do this:

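# Python 3: zip and range are lazy. On Python 2, use itertools.izip and xrange.
for i, match in zip(range(N), parser('mylogfile')):
    print(match.rstrip())  # do something with match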
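
`xrange` exists only on Python 2; on Python 3, `itertools.islice` is one way to take the first N matches lazily. The following is a sketch only: `anymatches` is a stand-in predicate with a single assumed pattern, and `first_matches` is a hypothetical helper.

```python
import re
from itertools import islice

ERROR_RE = re.compile(r"[Ee]rror")  # assumed pattern; substitute your own


def anymatches(line):
    """Stand-in predicate: True if the line looks like an error."""
    return ERROR_RE.search(line) is not None


def parser(filename):
    """Lazily yield only the lines that satisfy anymatches."""
    with open(filename, "r") as f:
        for line in f:
            if anymatches(line):
                yield line


def first_matches(filename, n):
    """Return at most n matching lines; iteration stops after the nth match."""
    return list(islice(parser(filename), n))
```

Calling `first_matches('mylogfile', 5)` would stop pulling lines from the generator once five matches have been collected.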