How to apply regex to the content of a file?

2023-02-08 21:10

I would like to apply regex to the content of a file without loading the entire file into a string. The RegexObject takes as its first argument a string or a buffer. Is there any way to turn the file into a buffer?

Yes! Try mmap:

you can use the re module to search through a memory-mapped file
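A minimal sketch of the mmap approach, assuming Python 3 and a bytes pattern (a memory-mapped file is bytes-like, so a str pattern will not work on it). The helper name find_all_in_file and the sample data are hypothetical, chosen only for illustration. Note that mmap raises ValueError when asked to map an empty file.

```python
import mmap
import os
import re
import tempfile


def find_all_in_file(path, pattern):
    """Return every match of a bytes pattern in the file at `path`, using mmap."""
    regex = re.compile(pattern)
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Copy each match out as bytes before the mapping is closed.
            return [m.group(0) for m in regex.finditer(mm)]


# Build a small throwaway file to search (hypothetical sample data).
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"id=17 name=foo\nid=42 name=bar\n")
    sample_path = tmp.name

ids = find_all_in_file(sample_path, rb"id=\d+")
print(ids)
os.remove(sample_path)
```

The operating system pages the file in on demand, so the whole content never has to be copied into a Python string first.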

Quote from Python doc:

Buffer objects are not directly supported by Python syntax, but can be created by calling the built-in function buffer().

And another interesting part:

buffer(object[, offset[, size]])

The object argument must be an object that supports the buffer call interface (such as strings, arrays, and buffers).[...]

File objects do not implement the buffer interface, so you have to turn the file's content into either a string (f.read()) or an array (use mmap for that).
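The point can be checked with a short sketch. It assumes Python 3, where the buffer() built-in quoted above no longer exists and re accepts bytes-like objects instead. The sample file and variable names are hypothetical.

```python
import os
import re
import tempfile

# Write a small throwaway file (hypothetical sample data).
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"key=value\n")
    sample_path = tmp.name

pattern = re.compile(rb"(\w+)=(\w+)")

with open(sample_path, "rb") as f:
    try:
        pattern.search(f)  # a file object is not bytes-like
        file_object_accepted = True
    except TypeError:
        file_object_accepted = False

    data = f.read()  # bytes are accepted by re
    from_bytes = pattern.search(data).groups()
    # bytearray is another bytes-like object that re can search directly
    from_bytearray = pattern.search(bytearray(data)).groups()

os.remove(sample_path)
print(file_object_accepted, from_bytes, from_bytearray)
```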

Read the file one line at a time and apply your regular expression to each line. re seems to be built to work on strings. http://docs.python.org/library/re.html contains more detail, but I was unable to find anything with regard to buffers.
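A rough sketch of the line-at-a-time idea, assuming Python 3 and a pattern that never needs to match across a line break. The function name grep_lines is hypothetical, and io.StringIO stands in for a real file object here.

```python
import io
import re


def grep_lines(lines, pattern):
    """Yield (line_number, match) for each line that the pattern matches."""
    regex = re.compile(pattern)
    # Iterating over a file object yields one line at a time,
    # so only the current line is held in memory.
    for number, line in enumerate(lines, start=1):
        match = regex.search(line)
        if match:
            yield number, match


# io.StringIO stands in for a real file; an object returned by open() works the same way.
sample = io.StringIO("alpha 1\nbeta\ngamma 22\n")
hits = [(number, match.group(0)) for number, match in grep_lines(sample, r"\d+")]
print(hits)
```

This only reports the first match on each line; use regex.finditer(line) instead of regex.search(line) if every match is needed.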

Do the buffering yourself. Load a chunk; if the regex matches a portion of the chunk, remove that portion from the chunk, carry over the unused portion, read the next chunk, and repeat.

If the regex is designed so that a match has a specific theoretical maximum length, then whenever nothing matched and the buffer is at least that big, you can discard the buffer, keeping only a tail shorter than that maximum in case a match straddles the chunk boundary, and read in the next chunk. Regexes in general are NOT designed to handle very large chunks of data. The more complex the regex is, the more backtracking it has to do.
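The chunked approach could look roughly like the sketch below. It assumes Python 3, a bytes stream, a known upper bound on match length (max_match_len), and a pattern without anchors or lookarounds that depend on text outside the buffer. The name iter_matches and all parameters are hypothetical.

```python
import io
import re


def iter_matches(stream, pattern, max_match_len, chunk_size=64 * 1024):
    """Yield (absolute_offset, matched_bytes) while holding only a small buffer.

    Assumes no match is longer than max_match_len bytes.
    """
    regex = re.compile(pattern)
    buffer = b""
    base = 0  # absolute offset in the stream of buffer[0]
    eof = False
    while not eof:
        chunk = stream.read(chunk_size)
        eof = not chunk
        buffer += chunk
        # A match starting after this index could be cut off by the buffer end.
        last_safe_start = len(buffer) if eof else len(buffer) - max_match_len
        pos = 0
        while pos <= len(buffer):
            match = regex.search(buffer, pos)
            if match is None or match.start() > last_safe_start:
                break
            yield base + match.start(), match.group(0)
            # Step past the match; advance at least one byte on an empty match.
            pos = max(match.end(), match.start() + 1)
        # Drop what has been fully scanned and carry the unused tail over.
        keep_from = min(max(pos, last_safe_start + 1, 0), len(buffer))
        base += keep_from
        buffer = buffer[keep_from:]


# Tiny chunks force matches to straddle chunk boundaries (hypothetical data).
data = b"x" * 10 + b"ID:1234" + b"y" * 9 + b"ID:987" + b"z" * 5 + b"ID:5"
found = list(iter_matches(io.BytesIO(data), rb"ID:\d{1,4}", max_match_len=7, chunk_size=8))
print(found)
```

The buffer never grows beyond roughly chunk_size + max_match_len bytes, which is the reason for requiring a bounded match length.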

The code below demonstrates:

Opening a file
Seeking in the file
Reading only a portion of the file
Using regular expressions to match patterns

Assumption: All sentences are the same length

# import random for randomly choosing in a list
import random
# import re for regular expression matching
import re

# open a new file for reading and writing ("w+" creates it if it is missing)
file = open("TEST", "w+")

# some strings to put in the sentence
typesOfSentences = ["test", "flop", "bork", "flat", "pork"]
# number of types of sentences
numTypes = len(typesOfSentences)

# for i values 0 to 99
for i in range(100):
   # Create a random sentence for example
   # "This is a test sentence 01"
   sentence = "This is a %s sentence %02d\n" % (random.choice(typesOfSentences), i)
   # write the sentence to the file
   file.write(sentence)

# Go back to beginning of file
file.seek(0)

# print out the whole file (strip the newline so lines are not double-spaced)
for line in file:
   print(line.rstrip("\n"))

# Determine the length of the sentence
length = len(sentence)

# go to sentence number 20 (sentences are numbered from 00)
file.seek(length * 20)

# create a regex matching the type and the number at the end
# (raw string so that \d reaches the regex engine unchanged)
pathPattern = re.compile(r"This is a (.*?) sentence (\d\d)")

# print the next ten types and numbers
for i in range(10):
   # read the next line
   line = file.readline()
   # match the regex
   match = pathPattern.match(line)
   # if there was a match
   if match:
      # NOTE: match.group(0) is always the entire sentence
      # Print the type of sentence it was and its number
      print("Sentence %02d is of type %s" % (int(match.group(2)), match.group(1)))

# close the file when finished
file.close()
