Comparing strings in Python to find errors
I have a string that is the correct spelling of a word:
FOO
I would like to allow someone to mistype the word in ways such as:
FO, F00, F0O, FO0
Is there a nice way to check for this? Lowercase input should also be accepted as correct, or the input can be converted to uppercase first, whichever is prettier.
One approach is to calculate the edit distance between the strings. You can, for example, use the Levenshtein distance, or write your own distance function that treats 0 and O as closer to each other than 0 and P.
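As a rough sketch of the custom distance idea, the function below is a Levenshtein-style distance in which swapping look-alike characters costs less than any other substitution. The names (`CONFUSABLE`, `substitution_cost`, `weighted_distance`, `is_close`), the 0.25 cost, and the `max_distance` threshold are all assumptions made for illustration; tune them to your own data.

```python
# Pairs of characters that are easy to confuse. This table is an assumption
# made for illustration; extend it as needed.
CONFUSABLE = {frozenset("0O"), frozenset("1I")}


def substitution_cost(a, b):
    """Cost of replacing character a with character b."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in CONFUSABLE:
        return 0.25  # assumed cost for look-alike characters
    return 1.0


def weighted_distance(s, t):
    """Levenshtein-style distance, case-insensitive, with cheap look-alike swaps."""
    s, t = s.upper(), t.upper()
    # previous[j] is the distance between the processed prefix of s
    # and the first j characters of t.
    previous = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        current = [float(i)]
        for j, b in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1.0,                          # delete a
                current[j - 1] + 1.0,                       # insert b
                previous[j - 1] + substitution_cost(a, b),  # substitute
            ))
        previous = current
    return previous[-1]


def is_close(typed, correct, max_distance=1.0):
    """True if typed is within max_distance of correct (threshold is an assumption)."""
    return weighted_distance(typed, correct) <= max_distance
```

With these costs, all four of your examples (FO, F00, F0O, FO0) are within distance 1.0 of FOO, while an unrelated word such as BAR is at distance 3.0.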
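Another is to transform each word into a canonical form and compare the canonical forms. You can, for example, convert the string to uppercase, replace all 0s with Os, 1s with Is, etc., then collapse runs of repeated letters. Note that the correct word FOO itself also becomes FO under this transformation, so you compare canonical forms on both sides.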
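>>> import itertools
>>> def canonical_form(s):
...     s = s.upper()
...     # map look-alike digits to letters
...     s = s.replace('0', 'O')
...     s = s.replace('1', 'I')
...     # collapse runs of the same character: 'FOO' -> 'FO'
...     s = ''.join(k for k, g in itertools.groupby(s))
...     return s
...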
>>> canonical_form('FO')
'FO'
>>> canonical_form('F00')
'FO'
>>> canonical_form('F0O')
'FO'
The standard library module difflib has a get_close_matches function.
You can use it like this:
>>> import difflib
>>> difflib.get_close_matches('FO', ['FOO', 'BAR', 'BAZ'])
['FOO']
>>> difflib.get_close_matches('F00', ['FOO', 'BAR', 'BAZ'])
[]
>>> difflib.get_close_matches('F0O', ['FOO', 'BAR', 'BAZ'])
['FOO']
>>> difflib.get_close_matches('FO0', ['FOO', 'BAR', 'BAZ'])
['FOO']
Notice that it doesn't match one of your cases ('F00'). You could lower the cutoff
parameter (the default is 0.6) to get a match:
>>> difflib.get_close_matches('F00', ['FOO', 'BAR', 'BAZ'], cutoff=0.3)
['FOO']
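To see why 'F00' is rejected, it can help to look at the similarity score itself. The sketch below uses difflib.SequenceMatcher, which get_close_matches relies on internally, and also shows an alternative to lowering the cutoff: normalizing look-alike digits before matching. The helper names `similarity`, `normalize` and `close_matches_normalized` are made up for this example.

```python
import difflib


def similarity(typed, correct):
    """Similarity ratio in [0, 1] as computed by difflib.SequenceMatcher."""
    return difflib.SequenceMatcher(None, correct, typed).ratio()


def normalize(word):
    """Uppercase and map look-alike digits to letters (an assumed mapping)."""
    return word.upper().replace('0', 'O').replace('1', 'I')


def close_matches_normalized(typed, candidates):
    """Run get_close_matches on the normalized input, keeping the default cutoff."""
    return difflib.get_close_matches(normalize(typed), candidates)


for typed in ['FO', 'F00', 'F0O', 'FO0']:
    print(typed, round(similarity(typed, 'FOO'), 2),
          close_matches_normalized(typed, ['FOO', 'BAR', 'BAZ']))
```

'F00' scores about 0.33 against 'FOO' because only the F matches, which is below the default cutoff of 0.6; the other three inputs score 0.67 or higher. After normalization, 'F00' becomes 'FOO' and matches exactly, so the cutoff does not need to be loosened.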
You can also use the re module:
import re
re.compile(r'f(o|0)+', re.I)  # re.I ignores case
You can use curly braces to limit the number of occurrences too. You can also get 'fancy', define your own 'leet' sets, and add them in with %s, as in:
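ay = r'(a|4|\$)'  # '$' must be escaped, otherwise it is an end-of-string anchor
oh = r'(o|0|\))'  # alternatives are separated by '|', not ','
re.compile(r'f%s+' % oh, re.I)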
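A minimal sketch of the curly-brace variant, assuming you want to accept an F followed by one or two O-like characters and nothing else. The names `OH`, `FOO_PATTERN` and `looks_like_foo` are invented for this example, and a character class is used in place of the alternation group; fullmatch makes sure the whole input matches, not just a prefix.

```python
import re

# Character class of "O-like" characters; extend it with other look-alikes.
OH = r"[o0]"

# 'f' followed by one or two O-like characters, case-insensitive.
FOO_PATTERN = re.compile(r"f%s{1,2}" % OH, re.IGNORECASE)


def looks_like_foo(typed):
    """True if the entire string is an accepted spelling of FOO."""
    return FOO_PATTERN.fullmatch(typed) is not None


for typed in ["FOO", "foo", "FO", "F00", "F0O", "FO0", "FOOO", "FOOD", "BAR"]:
    print(typed, looks_like_foo(typed))
```

With re.match or re.search instead of fullmatch, an input such as FOOD would also be accepted, because the pattern only has to match the beginning of the string.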