Comparing two .txt files in Python and saving exact and similar matches to .txt file
What I need is:
text_file_1.txt:
apple
orange
ice
icecream
text_file_2.txt:
apple
pear
ice
When I use "set", the output will be:
apple
ice
("equivalent of re.match")
but I want to get:
apple
ice
icecream
("equivalent of re.search")
Is there any way to do this? The files are large, so I can't just iterate over them and use a regex.
You might want to check out difflib.
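As a rough illustration of that suggestion, here is a minimal sketch using difflib.get_close_matches from the standard library. The word lists are the sample data from the question, hard-coded rather than read from the files, and the names words_one, words_two and close are made up for this example.

```python
import difflib

# Sample words from the question; in practice they would be read from the two files.
words_one = ["apple", "orange", "ice", "icecream"]
words_two = ["apple", "pear", "ice"]

# For each word of the second list, collect the words of the first list whose
# similarity ratio reaches the cutoff (0.6 is the library default).
close = {}
for word in words_two:
    close[word] = difflib.get_close_matches(word, words_one, n=3, cutoff=0.6)

for word, matches in close.items():
    print(word, matches)
```

Note that with these sample words "icecream" scores about 0.55 against "ice", just under the default cutoff, so it is not reported. A lower cutoff or a plain substring test would be needed to pick it up.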
If all you want is to extract the words from the two files where one word is a substring of the other (including identical words), you could do:
fone = set(['apple', 'orange', 'ice', 'icecream'])
ftwo = set(['apple', 'pear', 'ice'])
# Using sets removes duplicate words, so the same combination is never checked twice
result = []
for wone in fone:
    for wtwo in ftwo:
        # keep both words if either one is a substring of the other
        if wtwo in wone or wone in wtwo:
            result.append(wone)
            result.append(wtwo)
for w in set(result):
    print(w)
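Since the question is about files rather than hard-coded sets, the same substring check could be wrapped in a few helpers. This is only a sketch: it assumes one word per line and UTF-8 text, and the function names load_words, substring_matches and save_words are invented for illustration.

```python
def load_words(path):
    """Read one word per line into a set, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def substring_matches(words_one, words_two):
    """Return every word that contains, or is contained in, a word of the other set."""
    matches = set()
    for wone in words_one:
        for wtwo in words_two:
            if wone in wtwo or wtwo in wone:
                matches.update((wone, wtwo))
    return matches


def save_words(words, path):
    """Write the words to a text file, one per line, in sorted order."""
    with open(path, "w", encoding="utf-8") as handle:
        for word in sorted(words):
            handle.write(word + "\n")
```

For example, save_words(substring_matches(load_words("text_file_1.txt"), load_words("text_file_2.txt")), "matches.txt") would write apple, ice and icecream for the sample data, where matches.txt is a hypothetical output name. The comparison is still quadratic in the number of distinct words, so it may be slow on very large files.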
Alternatively, if you want a similarity measure based on how closely the letter sequences of two strings agree, you could use one of the classes provided by difflib, as Paul suggested in his answer:
import difflib as dl
fone = set(['apple', 'orange', 'ice', 'icecream'])
ftwo = set(['apple', 'pear', 'ice'])
result = []
for wone in fone:
    for wtwo in ftwo:
        s = dl.SequenceMatcher(None, wone, wtwo)
        if s.ratio() > 0.6:  # 0.6 is the conventional threshold to define "close matches"
            result.append(wone)
            result.append(wtwo)
for w in set(result):
    print(w)
I did not time either of the two samples, but I would guess the second will run much slower, since for each pair you have to instantiate an object...
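One way to reduce that per-pair cost, offered as an untimed sketch, is to reuse a single SequenceMatcher. The difflib documentation notes that the matcher caches information about its second sequence, so set_seq2 can be called once per word and set_seq1 for each candidate. The cheaper upper bounds real_quick_ratio() and quick_ratio() can also rule out pairs before the full ratio() is computed. The function name similar_pairs and its threshold parameter are assumptions made for this example.

```python
import difflib


def similar_pairs(words_one, words_two, threshold=0.6):
    """Return (wone, wtwo) pairs whose similarity ratio exceeds the threshold."""
    pairs = []
    matcher = difflib.SequenceMatcher()
    for wtwo in words_two:
        # The matcher caches details of the second sequence, so set it once per word.
        matcher.set_seq2(wtwo)
        for wone in words_one:
            matcher.set_seq1(wone)
            # The two quick ratios are upper bounds on ratio(), checked cheapest first.
            if (matcher.real_quick_ratio() > threshold
                    and matcher.quick_ratio() > threshold
                    and matcher.ratio() > threshold):
                pairs.append((wone, wtwo))
    return pairs
```

Whether this is actually faster on the real files would have to be measured, since the comparison remains quadratic in the number of words.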