Python: how many similar words in string?
I have some ugly strings similar to these:
string1 = "Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour)"
string2 = 'Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting)'
I would like a library or algorithm that will give me a percentage of how many words they have in common, while excluding special characters such as ',', ':', "'", '{', etc.
I know of the Levenshtein algorithm. However, it compares the number of similar CHARACTERS, whereas I would like to compare how many WORDS the two strings have in common.
Regex could easily give you all the words:
import re
s1 = "Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour)"
s2 = "Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting)"
s1w = re.findall(r'\w+', s1.lower())
s2w = re.findall(r'\w+', s2.lower())
collections.Counter (Python 2.7+) can quickly count the number of times each word occurs.
from collections import Counter
s1cnt = Counter(s1w)
s2cnt = Counter(s2w)
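To turn those counters into a percentage, a rough sketch that weights repeated words by how often they occur (the & operator on two Counters keeps the minimum count of each word):
shared = s1cnt & s2cnt                     # multiset intersection of the two counters
total_words = sum(s1cnt.values()) + sum(s2cnt.values())
print '%.1f%% of words in common (counting duplicates).' % (200.0 * sum(shared.values()) / total_words)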
A very crude comparison could be done through set.intersection or difflib.SequenceMatcher, but it sounds like you want to implement a Levenshtein algorithm that deals with words, in which case you could use those two word lists.
common = set(s1w).intersection(s2w)
# returns set(['c'])
import difflib
common_ratio = difflib.SequenceMatcher(None, s1w, s2w).ratio()
print '%.1f%% of words common.' % (100*common_ratio)
Prints: 3.4% of words common.
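If you want a percentage straight from the set intersection instead, a crude Jaccard-style sketch over unique words (repeats are ignored here):
# Share of unique words that occur in both strings.
unique_total = len(set(s1w) | set(s2w))
print '%.1f%% of unique words shared.' % (100.0 * len(common) / unique_total)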
n = 0
words1 = set(sentence1.split())
for word in sentence2.split():
    # strip some chars here, e.g. as in [1]
    if word in words1:
        n += 1
(1: How to remove symbols from a string with Python?)
Edit: Note that this considers a word common to both sentences if it appears anywhere in both. To compare positions instead, omit the set conversion (just call split() on both) and use something like:
n = 0
for word_from_1, word_from_2 in zip(sentence1.split(), sentence2.split()):
    # strip some chars here, e.g. as in [1]
    if word_from_1 == word_from_2:
        n += 1
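Both snippets leave the character stripping as a comment; one possible way to fill it in, assuming you just want to drop anything that is not a letter, digit, or underscore:
import re

def clean(word):
    # Strip punctuation such as , : ' { } and lowercase before comparing.
    return re.sub(r'\W+', '', word.lower())
Then build words1 from the cleaned words and test clean(word) in words1, or compare clean(word_from_1) == clean(word_from_2) in the positional version.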
The Levenshtein algorithm itself isn't restricted to comparing characters; it can compare arbitrary objects. The fact that the classical form uses characters is an implementation detail: they could be any symbols or constructs that can be compared for equality.
In Python, convert the strings into lists of words and then apply the algorithm to those lists. Maybe someone else can help you with cleaning up unwanted characters, presumably using some regular expression magic.
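A rough sketch of that idea, applied to the question's string1 and string2: a plain dynamic-programming edit distance run over word lists instead of characters (word_levenshtein is just an illustrative name):
import re

def word_levenshtein(words1, words2):
    # Classic dynamic-programming edit distance, computed over words instead of characters.
    prev = list(range(len(words2) + 1))
    for i, w1 in enumerate(words1, 1):
        curr = [i]
        for j, w2 in enumerate(words2, 1):
            cost = 0 if w1 == w2 else 1
            curr.append(min(prev[j] + 1,          # delete a word
                            curr[j - 1] + 1,      # insert a word
                            prev[j - 1] + cost))  # substitute a word
        prev = curr
    return prev[-1]

# \w+ extracts the words and drops the punctuation in one pass.
words1 = re.findall(r'\w+', string1.lower())
words2 = re.findall(r'\w+', string2.lower())
print '%d word-level edits between the two titles.' % word_levenshtein(words1, words2)
Dividing the distance by the length of the longer word list and subtracting from 1 gives a similarity ratio comparable to the percentage asked for.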