Efficient way to do a large number of search/replaces in Python?

2023-02-04 04:26 问答作者：

I'm fairly new to Python, and am writing a series of script to convert between some proprietary markup formats. I'm iterating line by line over files and then basically doing a large number (100-200) of substitutions that basically fall into 4 categories:

line = line.replace("-","<EMDASH>")  # Replace single character with tag
line = line.replace("<\\@>","@")     # tag with single character
line = line.replace("<\\n>","") 开发者_Python百科     # remove tag
line = line.replace("\xe1","&bull;") # replace non-ascii character with entity

the str.replace() function seems to be pretty efficient (fairly low in the numbers when I examine profiling output), but is there a better way to do this? I've seen the re.sub() method with a function as an argument, but am unsure if this would be better? I guess it depends on what kind of optimizations Python does internally. Thought I would ask for some advice before creating a large dict that might not be very helpful!

Additionally I do some parsing of tags (that look somewhat like HTML, but are not HTML). I identify tags like this:

m = re.findall('(<[^>]+>)',line)

And then do ~100 search/replaces (mostly removing matches) within the matched tags as well, e.g.:

m = re.findall('(<[^>]+>)',line)
for tag in m:
    tag_new = re.sub("\*t\([^\)]*\)","",tag)
    tag_new = re.sub("\*p\([^\)]*\)","",tag_new)

    # do many more searches...

if tag != tag_new:
    line = line.replace(tag,tag_new,1) # potentially problematic

Any thoughts of efficiency here?

Thanks!

str.replace() is more efficient if you're going to do basic search and replaces, and re.sub is (obviously) more efficient if you need complex pattern matching (because otherwise you'd have to use str.replace several times).

I'd recommend you use a combination of both. If you have several patterns that all get replaced by one thing, use re.sub. If you just have some cases where you just need to replace one specific tag with another, use str.replace.

You can also improve efficiency by using larger strings (call re.sub once instead of once for each line). Increases memory use, but shouldn't be a problem unless the file is HUGE, but also improves execution time.

If you don't actually need the regex and are just doing literal replacing, string.replace() will almost certainly be faster. But even so, your bottleneck here will be file input/output, not string manipulation.

The best solution though would probably be to use cStringIO

Depending on the ratio of relevant-to-not-relevant portions of the text you're operating on (and whether or not the parts each substitution operates on overlap), it might be more efficient to try to break down the input into tokens and work on each token individually.

Since each replace() in your current implementation has to examine the entire input string, that can be slow. If you instead broke down that stream into something like...

[<normal text>, <tag>, <tag>, <normal text>, <tag>, <normal text>]
# from an original "<normal text><tag><tag><normal text><tag><normal text>"

...then you could simply look to see if a given token is a tag, and replace it in the list (and then ''.join() at the end).

You can pass a function object to re.sub instead of a substitution string, it takes the match object and returns the substitution, so for example

>>> r = re.compile(r'<(\w+)>|(-)')
>>> r.sub(lambda m: '(%s)' % (m.group(1) if m.group(1) else 'emdash'), '<atag>-<anothertag>')
'(atag)(emdash)(anothertag)'

Of course you can use a more complex function object, this lambda is just an example.

Using a single regex that does all the substitution should be slightly faster than iterating the string many times, but if a lot of substitutions are perfomed the overhead of calling the function object that computes the substitution may be significant.

继续阅读：python regex

Efficient way to do a large number of search/replaces in Python?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？