
How to apply regex to the content of a file?

I would like to apply regex to the content of a file without loading the entire file into a string. The RegexObject takes as its first argument a string or a buffer. Is there any way to turn the file into a buffer?


Yes! Try mmap:

you can use the re module to search through a memory-mapped file
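A minimal sketch of that idea, written for Python 3. The helper name `findall_mmap` and the demo file contents are made up for illustration; `mmap.mmap`, `re.compile` and `findall` are the standard library calls. The pattern has to be a bytes pattern because the mapping exposes bytes, not text.

```python
import mmap
import os
import re
import tempfile

def findall_mmap(path, pattern):
    """Run a bytes regex over a file through a read-only memory map."""
    regex = re.compile(pattern)
    with open(path, "rb") as f:
        # Length 0 maps the whole file. Note: mmap raises ValueError on an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The pattern must be bytes because the mapping exposes bytes.
            return regex.findall(mm)

# Small demo on a throwaway file (contents are hypothetical).
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"This is a test sentence 01\nThis is a flop sentence 02\n")

matches = findall_mmap(demo_path, rb"This is a (\w+) sentence (\d\d)")
print(matches)
os.remove(demo_path)
```

The operating system pages the file in on demand, so the whole content never has to be copied into a Python string first.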


Quote from the Python docs:

Buffer objects are not directly supported by Python syntax, but can be created by calling the built-in function buffer().

And another interesting part:

buffer(object[, offset[, size]])

The object argument must be an object that supports the buffer call interface (such as strings, arrays, and buffers).[...]

File objects do not implement the buffer interface, so you have to turn the file's content into either a string (f.read()) or an array-like object (use mmap for that).


Read the file in a line at a time and apply your regexp to that line. re seems to be designed to work on strings. http://docs.python.org/library/re.html contains more detail, but I was unable to find anything there with regard to buffers.
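A rough sketch of the line-at-a-time approach, in Python 3. The helper `search_lines` and the sample text are hypothetical; an `io.StringIO` stands in for an open file here, and any object that yields lines (such as the result of `open(...)`) would work the same way.

```python
import io
import re

def search_lines(lines, pattern):
    """Yield (line_number, match) for every line the pattern matches."""
    regex = re.compile(pattern)
    for lineno, line in enumerate(lines, start=1):
        m = regex.search(line)
        if m:
            yield lineno, m

# A file object iterates line by line, so only one line is in memory at a time.
sample = io.StringIO(
    "This is a test sentence 01\nnothing here\nThis is a pork sentence 03\n"
)
hits = [
    (n, m.group(1), m.group(2))
    for n, m in search_lines(sample, r"This is a (\w+) sentence (\d\d)")
]
print(hits)
```

This only works when a match can never span a line break; a pattern that crosses lines needs one of the other approaches.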


Do the buffering yourself. Load in a chunk; if the regex matches a portion of the chunk, delete that portion from the chunk, carry over the unused portion, read the next chunk, and repeat.

If the regex is designed to match at most a specific theoretical maximum length, then in the event that nothing matched and the buffer is at least that big, clear the buffer and read in the next chunk. Regexes in general are NOT designed to handle very large chunks of data. The more complex the regex is, the more backtracking it has to do.
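One possible way to sketch this chunk-and-carry-over scheme in Python 3 follows. The function name `iter_matches_chunked` and the parameters `max_len` and `chunk_size` are assumptions for illustration. It also assumes that no match is longer than `max_len` characters and that the pattern uses no anchors or lookbehind. Instead of clearing the buffer outright when nothing matched, this variant keeps the last `max_len` characters so a match straddling a chunk boundary is not lost.

```python
import io
import re

def iter_matches_chunked(stream, pattern, max_len, chunk_size=64 * 1024):
    """Yield (offset, text) for matches, reading the stream chunk by chunk.

    Assumes no match is longer than max_len characters (hypothetical bound).
    """
    regex = re.compile(pattern)
    buf = ""
    base = 0  # offset in the stream of buf[0]
    eof = False
    while not eof:
        chunk = stream.read(chunk_size)
        eof = not chunk
        buf += chunk
        # Matches starting before `limit` have max_len characters of context.
        limit = len(buf) if eof else len(buf) - max_len
        pos = 0
        while True:
            m = regex.search(buf, pos)
            if m is None or m.start() >= limit:
                break
            yield base + m.start(), m.group(0)
            # Step past the match; move one further on an empty match.
            pos = m.end() if m.end() > m.start() else m.end() + 1
        # Drop what is fully processed and carry the rest into the next round.
        keep_from = min(len(buf), max(pos, limit))
        base += keep_from
        buf = buf[keep_from:]

# Demo with a deliberately tiny chunk size so matches straddle chunk boundaries.
sample = "".join(
    "This is a %s sentence %02d\n" % (w, i)
    for i, w in enumerate(["test", "flop", "bork"])
)
pattern = r"This is a \w+ sentence \d\d"
found = list(iter_matches_chunked(io.StringIO(sample), pattern, max_len=40, chunk_size=7))
print(found)
```

The buffer never grows much beyond `chunk_size + max_len` characters plus any unconsumed tail, which is the point of doing the buffering by hand.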


The code below demonstrates:

  • Opening a file
  • Seeking in the file
  • Reading only a portion of the file
  • Using regular expressions to match patterns

Assumption: All sentences are the same length

# import random for randomly choosing in a list
import random
# import re for regular expression matching
import re

# open a new file for reading and writing
# ("w+" creates the file; "r+" fails if it does not exist yet)
file = open("TEST", "w+")

# some strings to put in the sentence
typesOfSentences = ["test", "flop", "bork", "flat", "pork"]
# number of types of sentences
numTypes = len(typesOfSentences)

# for i values 0 to 99
for i in range(100):
   # Create a random sentence for example
   # "This is a test sentence 01"
   sentence = "This is a %s sentence %02d\n" % (random.choice(typesOfSentences), i)
   # write the sentence to the file
   file.write(sentence)

# Go back to beginning of file
file.seek(0)

# print out the whole file
for line in file:
   # strip the trailing newline so the output is not double-spaced
   print(line.rstrip("\n"))

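# Determine the length of the sentence (every sentence has the same length;
# this assumes one byte per character and "\n" line endings on disk)
length = len(sentence)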

# go to 20th sentence from the beginning
file.seek(length * 20)

# create a regex matching the type and the number at the end
# (raw string so that \d reaches the regex engine unchanged)
pathPattern = re.compile(r"This is a (.*?) sentence (\d\d)")

# print the next ten types and numbers
for i in range(10):
   # read the next line
   line = file.readline()
   # match the regex
   match = pathPattern.match(line)
   # if there was a match
   if match:
      # NOTE: match.group(0) is always the entire sentence
      # Print the type of sentence it was and its number
      print("Sentence %02d is of type %s" % (int(match.group(2)), match.group(1)))

# close the file when done
file.close()