Python: Faster regex replace
I have a large set of large files and a set of "phrases" that need to be replaced in each file.
The "business logic" imposes several restrictions:
- Matching must be case-insensitive
- The whitespace, tabs and new lines in the regex cannot be ignored
My solution (see below) is a bit on the slow side. How could it be optimised, both in terms of IO and string replacement?
data = open("INPUT__FILE").read()
o = open("OUTPUT_FILE","w")
for phrase in phrases: # these are the set of words I am talking about
    b1, b2 = str(phrase).strip().split(" ")
    regex = re.compile(r"%s\ *\t*\n*%s"%(b1,b2), re.IGNORECASE)
    data = regex.sub(b1+"_"+b2,data)
o.write(data)
UPDATE: 4x speed-up by converting all text to lower case and dropping re.IGNORECASE
You could avoid recompiling your regexp for every file:
import re

# Compile each phrase's regex once, paired with its replacement text.
precompiled = []
for phrase in phrases:
    b1, b2 = str(phrase).strip().split(" ")
    # append() takes a single argument, so store the pair as a tuple
    precompiled.append((b1+"_"+b2, re.compile(r"%s\ *\t*\n*%s"%(b1,b2), re.IGNORECASE)))

for (in_path, out_path) in ...:
    with open(out_path, "w") as o:
        with open(in_path) as i:
            data = i.read()
        for (replacement, regex) in precompiled:
            data = regex.sub(replacement, data)
        o.write(data)
It's the same for one file, but if you're repeating over many files then you are re-using the regexes.
Disclaimer: untested, may contain typos.
[Update] Also, you can simplify the regexp a little by replacing the various space characters with \s*. I suspect you have a bug there: the current pattern only accepts spaces, then tabs, then newlines, in that order, so you would want to match " \t " (space, tab, space) and currently don't.
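A minimal sketch of the \s* suggestion, assuming each phrase is exactly two plain words. The helper name compile_phrase is hypothetical, and the re.escape calls are an addition that the original code did not have, to guard against regex metacharacters inside the words:

```python
import re

def compile_phrase(phrase):
    """Return (replacement, regex) for a two-word phrase such as "foo bar"."""
    b1, b2 = phrase.strip().split()
    # \s* matches any mix of spaces, tabs and newlines, in any order.
    # re.escape (an addition here) keeps the words from being read as regex syntax.
    regex = re.compile(re.escape(b1) + r"\s*" + re.escape(b2), re.IGNORECASE)
    return b1 + "_" + b2, regex

replacement, regex = compile_phrase("foo bar")
print(regex.sub(replacement, "Foo \t bar and FOO\nBAR"))  # foo_bar and foo_bar
```

Like the original pattern, \s* also matches when there is no whitespace at all between the two words; use \s+ if at least one whitespace character should be required.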
You can do this in one pass by using a B-Tree data structure to store your phrases. This is the fastest way of doing it, with a time complexity of O(N log h), where N is the number of characters in your input file and h is the length of your longest word. However, Python does not offer an out-of-the-box implementation of a B-Tree.
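Without a B-Tree to hand, a different technique that still makes a single pass over the text is to join all phrases into one alternation regex and pick the replacement in a callback. This is only a sketch, not the B-Tree approach itself: the names build_replacer and replace_all are hypothetical, and it assumes every phrase is exactly two words.

```python
import re

def build_replacer(phrases):
    """Build a function that replaces every phrase in one pass over the text."""
    replacements = []
    branches = []
    for phrase in phrases:
        b1, b2 = phrase.strip().split()
        replacements.append(b1 + "_" + b2)
        # One capturing group per phrase, so match.lastindex tells us
        # which phrase matched.
        branches.append(r"(%s\s*%s)" % (re.escape(b1), re.escape(b2)))
    combined = re.compile("|".join(branches), re.IGNORECASE)

    def replace(text):
        # lastindex is 1-based; the replacements list is 0-based.
        return combined.sub(lambda m: replacements[m.lastindex - 1], text)

    return replace

replace_all = build_replacer(["foo bar", "hello world"])
print(replace_all("Foo  BAR says HELLO\n\tworld"))  # foo_bar says hello_world
```

Branches are tried left to right, so if one phrase is a prefix of another, the order of the phrase list decides which one wins. Whether this is actually faster than looping over separate regexes depends on the number of phrases, so it is worth timing on real data.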
You can also use a hash table (dictionary) and a replacement function to speed things up. This is easy to implement if the words you wish to replace are alphanumeric and single words only.
import re

replace_data = {}
# Populate the lookup table: lowercased word -> replacement
for phrase in phrases:
    key, value = phrase.strip().split(' ')
    replace_data[key.lower()] = value

def replace_func(matchObj):
    # Return the replacement for a matched word, or the word unchanged
    # (dict.has_key no longer exists in Python 3, so use dict.get)
    word = matchObj.group(0)
    return replace_data.get(word.lower(), word)

# Original code flow
data = open("INPUT_FILE").read()
output = re.sub("[a-zA-Z0-9]+", replace_func, data)
o = open('OUTPUT_FILE', 'w')
o.write(output)
o.close()