Comparing two .txt files in Python and saving exact and similar matches to .txt file

2023-03-18 05:24

What I need is:

text_file_1.txt:
apple
orange
ice
icecream

text_file_2.txt:
apple
pear
ice

When I use "set", the output is:

apple
ice

("equivalent of re.match")

but I want to get:

apple
ice
icecream

("equivalent of re.search")

Is there any way to do this? The files are large, so I can't just iterate over them and use a regex.

You might want to check out difflib.

If all you want is to extract the words that are a substring of a word in the other file, or that contain one (identical words included), you could do:

fone = {'apple', 'orange', 'ice', 'icecream'}
ftwo = {'apple', 'pear', 'ice'}
# Sets drop duplicate words, so the same pair is never checked twice.

result = set()
for wone in fone:
    for wtwo in ftwo:
        # Keep both words when either one is a substring of the other.
        if wone in wtwo or wtwo in wone:
            result.update((wone, wtwo))
for w in sorted(result):
    print(w)
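
The question asks for the matches to be read from two files and saved to a third, which the loop above does not cover. A minimal sketch of that step, assuming one word per line and UTF-8 files; the function names and the output file name are hypothetical, and the comparison is still quadratic, so it may be slow on very large inputs:

```python
from pathlib import Path


def read_words(path):
    """Return the set of non-empty, stripped lines in a text file."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def substring_matches(words_one, words_two):
    """Return every word that contains, or is contained in, a word of the other set."""
    matches = set()
    for word_one in words_one:
        for word_two in words_two:
            if word_one in word_two or word_two in word_one:
                matches.update((word_one, word_two))
    return matches


def save_matches(path_one, path_two, out_path):
    """Compare two word files and write the matches, sorted, one per line."""
    matches = sorted(substring_matches(read_words(path_one), read_words(path_two)))
    Path(out_path).write_text("".join(word + "\n" for word in matches), encoding="utf-8")
    return matches
```

With the two files from the question, save_matches("text_file_1.txt", "text_file_2.txt", "matches.txt") should write apple, ice and icecream to matches.txt.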

Alternatively, if you want similarity based on how closely the letter sequences of two strings match, you could use one of the classes provided by difflib, as Paul suggested in his answer:

import difflib as dl

fone = {'apple', 'orange', 'ice', 'icecream'}
ftwo = {'apple', 'pear', 'ice'}

result = set()
for wone in fone:
    for wtwo in ftwo:
        s = dl.SequenceMatcher(None, wone, wtwo)
        # 0.6 is the default cutoff that difflib.get_close_matches uses for "close matches"
        if s.ratio() > 0.6:
            result.update((wone, wtwo))
for w in sorted(result):
    print(w)
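
difflib also ships a helper, get_close_matches, that wraps the same ratio test, so the inner loop does not have to be written by hand. A minimal sketch, assuming the word lists fit in memory; the function name close_matches is hypothetical. Note that the helper keeps candidates whose ratio is at or above the cutoff (>=), while the loop above uses a strict >. Also note that a ratio threshold is not a substring test: 'ice' against 'icecream' scores 6/11, about 0.55, so the pair is dropped at the 0.6 cutoff.

```python
import difflib


def close_matches(words_one, words_two, cutoff=0.6):
    """Return words from either collection that have a close match in the other."""
    candidates = sorted(words_two)
    matches = set()
    for word in words_one:
        # n must be positive; asking for len(candidates) returns every
        # candidate whose similarity ratio reaches the cutoff.
        found = difflib.get_close_matches(
            word, candidates, n=max(1, len(candidates)), cutoff=cutoff
        )
        if found:
            matches.add(word)
            matches.update(found)
    return matches
```

Lowering the cutoff to 0.5 would bring 'icecream' back in for the sample words, at the cost of admitting looser matches on real data.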

I did not time either of the two samples, but I would guess the second runs much slower, since it has to instantiate a SequenceMatcher object for each pair of words...
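
The guess is easy to check with the standard timeit module. A rough sketch, assuming the sample words from above; the function names are hypothetical, and absolute timings will vary by machine and input size:

```python
import difflib
import timeit

FONE = {"apple", "orange", "ice", "icecream"}
FTWO = {"apple", "pear", "ice"}


def by_substring():
    """Collect the words of every pair where one is a substring of the other."""
    result = set()
    for one in FONE:
        for two in FTWO:
            if one in two or two in one:
                result.update((one, two))
    return result


def by_ratio():
    """Collect the words of every pair whose similarity ratio exceeds 0.6."""
    result = set()
    for one in FONE:
        for two in FTWO:
            if difflib.SequenceMatcher(None, one, two).ratio() > 0.6:
                result.update((one, two))
    return result


def time_both(repeats=1000):
    """Return (substring_seconds, ratio_seconds) for `repeats` runs of each."""
    return (
        timeit.timeit(by_substring, number=repeats),
        timeit.timeit(by_ratio, number=repeats),
    )
```

Calling time_both() returns the two totals in seconds, which can be compared directly.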
