Fastest Python method for search and replace on a large string

2023-02-08 05:42 Q&A author:

I'm looking for the fastest way to replace a large number of sub-strings inside a very large string. Here are two examples I've used.

findall() feels simpler and more elegant, but it takes an astounding amount of time.

finditer() blazes through a large file, but I'm not sure this is the right way to do it.

Here's some sample code. Note that the actual text I'm interested in is a single string around 10MB in size, and there's a huge difference in these two methods.

import re

def findall_replace(text, reg, rep):
    for match in reg.findall(text):
        output = text.replace(match, rep)
    return output

def finditer_replace(text, reg, rep):
    cursor_pos = 0
    output = ''
    for match in reg.finditer(text):
        output += "".join([text[cursor_pos:match.start(1)], rep])
        cursor_pos = match.end(1)
    output += "".join([text[cursor_pos:]])
    return output

reg = re.compile(r'(dog)')
rep = 'cat'
text = 'dog cat dog cat dog cat'

finditer_replace(text, reg, rep)

findall_replace(text, reg, rep)

UPDATE: Added the re.sub() method to the tests:

def sub_replace(reg, rep, text):
    output = re.sub(reg, rep, text)
    return output

Results

re.sub() - 0:00:00.031000

finditer() - 0:00:00.109000

findall() - 0:01:17.260000
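The script that produced these timings isn't shown. Here is a rough sketch of how such a comparison might be set up with timeit. The synthetic text, the repeat count, and the adjusted findall_replace() (which reassigns text so every match is replaced) are assumptions, and the absolute numbers will differ from the ones above.

```python
import re
import timeit

def findall_replace(text, reg, rep):
    # Adjusted variant: reassign text so every replacement is kept
    for match in reg.findall(text):
        text = text.replace(match, rep)
    return text

def finditer_replace(text, reg, rep):
    cursor_pos = 0
    output = ''
    for match in reg.finditer(text):
        output += text[cursor_pos:match.start(1)] + rep
        cursor_pos = match.end(1)
    output += text[cursor_pos:]
    return output

def sub_replace(reg, rep, text):
    return re.sub(reg, rep, text)

reg = re.compile(r'(dog)')
rep = 'cat'
# Synthetic input (an assumption); the original 10 MB text is not available
text = 'dog cat ' * 2000

# Time each approach over a few runs; number=3 is an arbitrary choice
timings = {
    're.sub()': timeit.timeit(lambda: sub_replace(reg, rep, text), number=3),
    'finditer()': timeit.timeit(lambda: finditer_replace(text, reg, rep), number=3),
    'findall()': timeit.timeit(lambda: findall_replace(text, reg, rep), number=3),
}
for name, seconds in timings.items():
    print(f'{name} - {seconds:.6f} s')
```

Increasing the multiplier on the synthetic text should make the gap between the three approaches more visible.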

The standard method is to use the built-in

re.sub(reg, rep, text)

Incidentally, the reason for the performance difference between your versions is that each replacement in your first version (findall_replace) rescans and recopies the entire string. A single copy is fast, but when every copy is 10 MB, enough of them add up to a slow run.
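As a minimal sketch on the sample data from the question, the call can be written either at module level or as a method on the compiled pattern. The variable names result_module and result_method are only illustrative.

```python
import re

reg = re.compile(r'(dog)')
rep = 'cat'
text = 'dog cat dog cat dog cat'

# Module-level form: re.sub(pattern, repl, string)
result_module = re.sub(reg, rep, text)

# Equivalent method form on the already compiled pattern
result_method = reg.sub(rep, text)

print(result_module)  # cat cat cat cat cat cat
```

Both forms make a single pass over the text and build the result once, which is why they avoid the repeated 10 MB copies.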

You can, and I think you should since it is certainly an optimized function, use

re.sub(pattern, repl, string[, count, flags])

The reason why your findall_replace() function is slow is that at each match, a new string object is created, as you will see by executing the following code:

ch = '''qskfg qmohb561687ipuygvnjoihi2576871987uuiazpoieiohoihnoipoioh
opuihbavarfgvipauhbi277auhpuitchpanbiuhbvtaoi541987ujptoihbepoihvpoezi 
abtvar473727tta aat tvatbvatzeouithvbop772iezubiuvpzhbepuv454524522ueh'''

import re

def findall_replace(text, reg, rep):
    for match in reg.findall(text):
        # each replace() that changes the text returns a new string object
        text = text.replace(match, rep)
        print(id(text))
    return text

pat = re.compile(r'\d+')
rep = 'AAAAAAA'

print(id(ch))
print()
print(findall_replace(ch, pat, rep))

Note that in this code I replaced output = text.replace(match, rep) with text = text.replace(match, rep), otherwise only the last occurrence is replaced.

finditer_replace() is also slowed by repeated creation of string objects, since each output += can build a new string. But it walks the text once with the iterator returned by finditer() and only copies the pieces between matches, whereas findall_replace() first builds a list of all matches and then calls replace() on the whole text for each of them, so it takes far longer.
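If you want to keep the finditer() approach, one common way to avoid growing a string piece by piece is to collect the pieces in a list and join them once at the end. This is only a sketch: finditer_join_replace is a hypothetical name, and it uses the span of the whole match (match.start() and match.end()) rather than group 1, so it also works with patterns that have no capturing group.

```python
import re

def finditer_join_replace(text, reg, rep):
    # Hypothetical helper: gather the pieces in a list, join once at the end
    pieces = []
    cursor_pos = 0
    for match in reg.finditer(text):
        # Text between the previous match and this one, then the replacement
        pieces.append(text[cursor_pos:match.start()])
        pieces.append(rep)
        cursor_pos = match.end()
    # Whatever follows the last match
    pieces.append(text[cursor_pos:])
    return ''.join(pieces)

reg = re.compile(r'(dog)')
print(finditer_join_replace('dog cat dog cat dog cat', reg, 'cat'))
```

For a plain replacement string without backreferences this should give the same result as reg.sub(rep, text), which remains the simpler choice.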

By the way, your code with findall_replace() isn't safe; it can return unexpected results:

ch = 'sea sun ABC-ABC-DEF bling ranch micABC-DEF fish'

import re

def findall_replace(text, reg, rep):
    for gr in reg.findall(text):
        # replace() rewrites every occurrence of gr, including ones
        # that only exist because of an earlier replacement
        text = text.replace(gr, rep)
        print('group==', gr)
        print('text==', text)
    return '\nresult is : ' + text

pat = re.compile('ABC-DE')
rep = 'DEFINITION'

print('ch==', ch)
print()
print(findall_replace(ch, pat, rep))

Output:

ch== sea sun ABC-ABC-DEF bling ranch micABC-DEF fish

group== ABC-DE
text== sea sun ABC-DEFINITIONF bling ranch micDEFINITIONF fish
group== ABC-DE
text== sea sun DEFINITIONFINITIONF bling ranch micDEFINITIONF fish

result is : sea sun DEFINITIONFINITIONF bling ranch micDEFINITIONF fish
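For comparison, here is a short sketch of the same input run through the compiled pattern's sub() method. Each match found in the original text is replaced exactly once, and already replaced text is never scanned again. The variable name safe is only illustrative.

```python
import re

ch = 'sea sun ABC-ABC-DEF bling ranch micABC-DEF fish'
pat = re.compile('ABC-DE')
rep = 'DEFINITION'

# One pass over the original text; replacements are not re-examined
safe = pat.sub(rep, ch)
print(safe)  # sea sun ABC-DEFINITIONF bling ranch micDEFINITIONF fish
```

Here the leading "ABC-" survives because it was never part of a match in the original string.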
