Compensating for "variance" in a survey

2023-02-13 13:19 问答作者：

The title for this one was quite tricky.

I'm trying to solve a scenario, Imagine a survey was sent out to XXXXX amount of people, asking them what their favourite football club was. From the response back, it's obvious that while many are favourites of the same club, they all "expressed" it in different ways. For example,

For Manchester United, some variations include...

Man U
Man Utd.
Man Utd.
Manchester U
Manchester Utd

All are obviously the same club however, if using a simple technique, of just trying to get an extract string match, each would be a separate result.

Now, if we further complication the scenario, let's say that because of the sheer volume of different clubs (eg. Man City, as M. City, Manchester City, etc), again plagued with this problem, its impossible to manually "enter" these variances and use that to create a custom filter such that converters all Man U -> Manchester United, Man Utd. > Manchester United, etc. But instead we want to automate this filter, to look for the most likely match and converter the data accordingly.

I'm trying to do this in Python (from a .cvs file) however welc开发者_如何学Come any pseudo answers that outline a good approach to solving this.

Edit: Some additional information This isn't working off a set list of clubs, the idea is to "cluster" the ones we have together. The assumption is there are no spelling mistakes. There is no assumed length of how many clubs And the survey list is long. Long enough that it doesn't warranty doing this manually (1000s of queries)

Google Refine does just this, but I'll assume you want to roll your own.

Note, difflib is built into Python, and has lots of features (including eliminating junk elements). I'd start with that.

You probably don't want to do it in a completely automated fashion. I'd do something like this:

# load corrections file, mapping user input -> output
# load survey
import difflib

possible_values = corrections.values()

for answer in survey:
    output = corrections.get(answer,None)
    if output = None:
        likely_outputs = difflib.get_close_matches(input,possible_values)
        output = get_user_to_select_output_or_add_new(likely_outputs)
        corrections[answer] = output
        possible_values.append(output)
save_corrections_as_csv

Please edit your question with answers to the following:

You say "we want to automate this filter, to look for the most likely match" -- match to what?? Do you have a list of the standard names of all of the possible football clubs, or do the many variations of each name need to be clustered to create such a list?

How many clubs?

How many survey responses?

After doing very light normalisation (replace . by space, strip leading/trailing whitespace, replace runs of whitespace by a single space, convert to lower case [in that order]) and counting, how many unique responses do you have?

Your focus seems to be on abbreviations of the standard name. Do you need to cope with nicknames e.g. Gunners -> Arsenal, Spurs -> Tottenham Hotspur? Acronyms (WBA -> West Bromwich Albion)? What about spelling mistakes, keyboard mistakes, SMS-dialect, ...? In general, what studies of your data have you done and what were the results?

You say """its impossible to manually "enter" these variances""" -- is it possible/permissible to "enter" some "variances" e.g. to cope with nicknames as above?

What are your criteria for success in this exercise, and how will you measure it?

It seems to me that you could convert many of these into a standard form by taking the string, lower-casing it, removing all punctuation, then comparing the start of each word.

If you had a list of all the actual club names, you could compare directly against that as well; and for strings which don't match first-n-letters to any actual team, you could try lexigraphical comparison against any of the returned strings which actually do match.

It's not perfect, but it should get you 99% of the way there.

import string

def words(s):
    s = s.lower().strip(string.punctuation)
    return s.split()

def bestMatchingWord(word, matchWords):
    score,best = 0., ''
    for matchWord in matchWords:
        matchScore = sum(w==m for w,m in zip(word,matchWord)) / (len(word) + 0.01)
        if matchScore > score:
            score,best = matchScore,matchWord
    return score,best

def bestMatchingSentence(wordList, matchSentences):
    score,best = 0., []
    for matchSentence in matchSentences:
        total,words = 0., []
        for word in wordList:
            s,w = bestMatchingWord(word,matchSentence)
            total += s
            words.append(w)
        if total > score:
            score,best = total,words
    return score,best

def main():
    data = (
        "Man U",
        "Man. Utd.",
        "Manch Utd",
        "Manchester U",
        "Manchester Utd"
    )

    teamList = (
        ('arsenal',),
        ('aston', 'villa'),
        ('birmingham', 'city', 'bham'),
        ('blackburn', 'rovers', 'bburn'),
        ('blackpool', 'bpool'),
        ('bolton', 'wanderers'),
        ('chelsea',),
        ('everton',),
        ('fulham',),
        ('liverpool',),
        ('manchester', 'city', 'cty'),
        ('manchester', 'united', 'utd'),
        ('newcastle', 'united', 'utd'),
        ('stoke', 'city'),
        ('sunderland',),
        ('tottenham', 'hotspur'),
        ('west', 'bromwich', 'albion'),
        ('west', 'ham', 'united', 'utd'),
        ('wigan', 'athletic'),
        ('wolverhampton', 'wanderers')
    )

    for d in data:
        print "{0:20} {1}".format(d, bestMatchingSentence(words(d), teamList))

if __name__=="__main__":
    main()

run on sample data gets you

Man U                (1.9867767507647776, ['manchester', 'united'])
Man. Utd.            (1.7448074166742613, ['manchester', 'utd'])
Manch Utd            (1.9946817328797555, ['manchester', 'utd'])
Manchester U         (1.989100008901989, ['manchester', 'united'])
Manchester Utd       (1.9956787398647866, ['manchester', 'utd'])

继续阅读：fuzzy-comparison python

Compensating for "variance" in a survey

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？