
Parsing Human Names and matching them in Ruby

I'm looking for a gem or project that would let me identify that two names are the same person. For example

J.R. Smith == John R. Smith == John Smith == John Roy Smith == Johnny Smith

I think you get the idea. I know nothing is going to be 100% accurate, but I'd like something that at least handles the majority of cases. I know that last one is probably going to need a database of nicknames.


I think one option would be to use a Ruby implementation of the Levenshtein distance:

The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.

Then you could decide that names with a distance less than X (X being a number you will have to tweak) belong to the same person.
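To make the idea concrete, here is a minimal pure-Ruby sketch of the distance and the threshold check. The names levenshtein, same_person? and MAX_DISTANCE are made up for illustration, and the value 3 is only a placeholder for the X you would have to tune.

```ruby
# Classic dynamic-programming Levenshtein distance
# (single-character insertion, deletion, substitution).
def levenshtein(a, b)
  return b.length if a.empty?
  return a.length if b.empty?

  # prev[j] = distance between the processed prefix of a and the first j chars of b.
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical threshold (the "X" above); tune it on your own data.
MAX_DISTANCE = 3

# Treats two names as the same person when their distance is below the threshold.
def same_person?(a, b, max = MAX_DISTANCE)
  levenshtein(a.downcase, b.downcase) < max
end

p levenshtein('John Smith', 'Johnny Smith')
p same_person?('John Smith', 'Johnny Smith')
```

A fixed threshold treats short and long names alike, so you may prefer to divide the distance by the length of the longer name before comparing.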

EDIT: After a little searching I found another algorithm, Metaphone, which is based on phonetics.

This approach still has a lot of holes in it, but in this case I think the best anyone can do is offer alternatives for you to test and see what works best.


This is a little late (and a shameless plug to boot), but for what it's worth, I wrote a human name parser during a GSoC project, which you can install with gem install namae. Obviously it will not detect your duplicates reliably on its own, but it helps with this kind of task.

For instance, you can parse the names in your example and use a display form using initials to detect names whose initials are identical, and so on and so forth:

names = Namae.parse('J.R. Smith and John R. Smith and John Smith and John Roy Smith and Johnny Smith ')
names.map { |n| [n.given, n.family] }
#=> [["J.R.", "Smith"], ["John R.", "Smith"], ["John", "Smith"], ["John Roy", "Smith"], ["Johnny", "Smith"]]
names.map { |n| n.initials expand: true }
#=> ["J.R. Smith", "J.R. Smith", "J. Smith", "J.R. Smith", "J. Smith"]


Something like:

1: Convert names to arrays:

irb> names.map!{|n|n.scan(/[^\s.]+\.?/)}
["J.", "R.", "Smith"]
["John", "R.", "Smith"]
["John", "Smith"]
["John", "Roy", "Smith"]
["Johnny", "Smith"]

2: Some function of identity:

for a,b in names.combination(2)
    p [(a&b).size,a,b]
end
[2, ["J.", "R.", "Smith"], ["John", "R.", "Smith"]]
[1, ["J.", "R.", "Smith"], ["John", "Smith"]]
[1, ["J.", "R.", "Smith"], ["John", "Roy", "Smith"]]
[1, ["J.", "R.", "Smith"], ["Johnny", "Smith"]]
[2, ["John", "R.", "Smith"], ["John", "Smith"]]
[2, ["John", "R.", "Smith"], ["John", "Roy", "Smith"]]
[1, ["John", "R.", "Smith"], ["Johnny", "Smith"]]
[2, ["John", "Smith"], ["John", "Roy", "Smith"]]
[1, ["John", "Smith"], ["Johnny", "Smith"]]
[1, ["John", "Roy", "Smith"], ["Johnny", "Smith"]]

Or, instead of &, you may use .permutation + .zip + .max to apply a custom function that determines whether two parts of the names are identical.


UPD:

aim = 'Rob Bobbie Johnson'
candidates = [
    "Bob Robbie John",
    "Bobbie J. Roberto",
    "R.J.B.",
]

$synonyms = Hash[ [
    ["bob",["bobbie"]],
    ["rob",["robbie","roberto"]],
] ]

def prepare name
    name.scan(/[^\s.]+\.?/).map &:downcase
end

def mf a,b # magick function
    a.zip(b).map do |i,j|
        next 1 if i == j
        next 0.9 if $synonyms[i].to_a.include?(j) || $synonyms[j].to_a.include?(i)
        next 0.5 if i[/\.$/] && j.start_with?(i.chomp '.')
        next 0.5 if j[/\.$/] && i.start_with?(j.chomp '.')
        -10 # if some part of name appears to be different -
            # it's bad even if another two parts were good
    end.inject :+
end

for c in candidates
    results = prepare(c).permutation.map do |per|
        [mf(prepare(aim),per),per]
    end
    p [results.transpose.first.max,c]
end

[-8.2, "Bob Robbie John"]  # 0.9 + 0.9 - 10 # Johnson != John # I think ..)
[2.4, "Bobbie J. Roberto"] # 1 + 0.9 + 0.5 # Rob == Roberto, Bobbie == Bobbie, Johnson ~~ J.
[1.5, "R.J.B."]            # 0.5 + 0.5 + 0.5


For anyone who has to try to match human names from different data sources, this is a VERY hard problem to address. Using a combination of 3 gems seems to do pretty well.

We have an application where we have a million people in List A, and need to match them with dozens of different data sources. (And despite what some of the more pedantic comments claim, that is not a 'design flaw'; it is the nature of dealing with messy 'real world' data.)

The only thing we have found to work reasonably well thus far is a combination of the namae gem (for parsing names into a standardized first, middle, last, suffix representation), the text gem to calculate levenshtein, soundex, metaphone, and porter scores, and fuzzy-string-match, which calculates the JaroWinkler score (often the best of the lot).

  1. parse into a standard format separating last, first, middle, suffix using namae. We pre-process with a regex to extract nicknames when formatted John "JJ" Doe or Samuel (Sammy) Smith
  2. calculate ALL scores on a sanitized version of the full name (all caps, remove punctuation, last name first) ... jarowinkler, soundex, levenshtein, metaphone, white, porter. (JaroWinkler and Soundex often do the best.)
  3. declare a match if N scores exceed individually set thresholds. (We use any 2 that pass as a pass)
  4. if no match, try again using only last name, first name, and middle initial, with higher thresholds (i.e., stricter matching).
  5. Still no match, replace first name with nick name (if any) and try again.
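The nickname pre-processing mentioned in step 1 could look roughly like the sketch below. The regex and the helper name extract_nickname are assumptions for illustration, not the pattern the poster actually used, and it only handles the two formats shown (double quotes and parentheses).

```ruby
# Matches a nickname in double quotes or parentheses, plus surrounding whitespace.
NICKNAME = /\s*(?:"([^"]+)"|\(([^)]+)\))\s*/

# Returns [name_without_nickname, nickname_or_nil].
def extract_nickname(raw)
  m = raw.match(NICKNAME)
  return [raw.strip, nil] unless m

  [raw.sub(NICKNAME, ' ').strip, m[1] || m[2]]
end

p extract_nickname('John "JJ" Doe')
p extract_nickname('Samuel (Sammy) Smith')
```

The cleaned name can then be handed to the name parser, with the nickname kept aside for the final retry in step 5.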

With some tweaking of score thresholds for each scoring method, we get pretty good results. YMMV.
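Steps 2 to 5 can be sketched as follows. This is only one reading of the procedure: the two scorers are deliberately simple stand-ins for the gem scores (so the example runs without any gems), the threshold values are invented, and the names sanitize, pass?, match?, SCORERS, LOOSE and STRICT are all hypothetical.

```ruby
# Step 2: all caps, punctuation removed; callers pass the last name first.
# Assumes ASCII names; anything outside A-Z and space is dropped.
def sanitize(*parts)
  parts.compact.join(' ').upcase.gsub(/[^A-Z ]/, '').squeeze(' ').strip
end

# Stand-in scorers returning 0.0..1.0 (replace with the real gem scores).
SCORERS = {
  # Length of the shared prefix relative to the longer string.
  prefix: ->(a, b) {
    shared = a.chars.zip(b.chars).take_while { |x, y| x == y }.size
    shared.to_f / [a.length, b.length, 1].max
  },
  # Jaccard overlap of the sets of characters used.
  charset: ->(a, b) {
    ca = a.chars.uniq
    cb = b.chars.uniq
    (ca & cb).size.to_f / [(ca | cb).size, 1].max
  }
}.freeze

LOOSE  = { prefix: 0.7, charset: 0.7 }.freeze # invented values
STRICT = { prefix: 0.8, charset: 0.8 }.freeze # invented values

# Step 3: pass when at least `required` scores exceed their own threshold.
def pass?(a, b, thresholds, required)
  SCORERS.count { |name, fn| fn.call(a, b) > thresholds.fetch(name) } >= required
end

# x and y are hashes with :first, :last and optional :middle, :nick.
def match?(x, y, required: 2)
  full  = ->(n) { sanitize(n[:last], n[:first], n[:middle]) }
  short = ->(n) { sanitize(n[:last], n[:first], n[:middle].to_s[0]) }
  nick  = ->(n) { sanitize(n[:last], n[:nick] || n[:first], n[:middle].to_s[0]) }

  return true if pass?(full.(x), full.(y), LOOSE, required)    # step 3
  return true if pass?(short.(x), short.(y), STRICT, required) # step 4
  pass?(nick.(x), nick.(y), STRICT, required)                  # step 5
end

a = { first: 'Tommy', last: 'Smithe' }
b = { first: 'Thomas', middle: 'A.', last: 'Smithe', nick: 'Tommy' }
p match?(a, b)
```

Keeping the thresholds in plain hashes makes it easy to tune each scoring method separately.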

BTW, putting the last name first is very important, at least for JaroWinkler, since there is generally less variation in last names (Smithe is almost always Smithe, but the first name might be Tom, Tommy, or Thomas in different data sources), and the beginning of the string is the most 'sensitive' part in JaroWinkler. For "ROB SMITHE" / "ROBIN SMITHE", the JaroWinkler score is 0.91 if you put the first name first, but 0.99 if you put the last name first.
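The prefix sensitivity can be seen in a plain-Ruby sketch of the textbook Jaro-Winkler formula (prefix capped at 4 characters, scaling factor 0.1). This is not the fuzzy-string-match implementation, so the exact scores may differ from the figures quoted above, but the ordering effect is the same.

```ruby
# Jaro similarity: matching characters within a sliding window, minus transpositions.
def jaro(s1, s2)
  return 1.0 if s1 == s2
  return 0.0 if s1.empty? || s2.empty?

  window = [[s1.length, s2.length].max / 2 - 1, 0].max
  matched1 = Array.new(s1.length, false)
  matched2 = Array.new(s2.length, false)
  matches = 0

  s1.each_char.with_index do |c, i|
    lo = [i - window, 0].max
    hi = [i + window, s2.length - 1].min
    (lo..hi).each do |j|
      next if matched2[j] || s2[j] != c

      matched1[i] = matched2[j] = true
      matches += 1
      break
    end
  end
  return 0.0 if matches.zero?

  # Matched characters in order; half the mismatched positions are transpositions.
  m1 = s1.chars.select.with_index { |_, i| matched1[i] }
  m2 = s2.chars.select.with_index { |_, j| matched2[j] }
  transpositions = m1.zip(m2).count { |x, y| x != y } / 2

  m = matches.to_f
  (m / s1.length + m / s2.length + (m - transpositions) / m) / 3
end

# Winkler's adjustment: boost the score for a shared prefix of up to 4 characters.
def jaro_winkler(s1, s2, prefix_scale = 0.1)
  j = jaro(s1, s2)
  prefix = 0
  [4, s1.length, s2.length].min.times do |i|
    break unless s1[i] == s2[i]

    prefix += 1
  end
  j + prefix * prefix_scale * (1 - j)
end

puts jaro_winkler('ROB SMITHE', 'ROBIN SMITHE')
puts jaro_winkler('SMITHE ROB', 'SMITHE ROBIN')
```

With the last name first, the shared prefix is longer and the characters line up without transpositions, so the second score comes out higher than the first.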


The best pre-built option you will probably find for this is the gem simply called "text".

https://github.com/threedaymonk/text

It has a number of matching algorithms: Levenshtein Distance, Metaphone, Soundex, and more.


I don't think such a library exists.

I don't mean to offend, but this problem seems like it arises from poor design. Maybe if you post more details about the general problem you are trying to solve, people can suggest a better way.


Ruby has a very nice gem called text. I've found Text::WhiteSimilarity to be very good myself, but the gem also implements a number of other measures.
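For a feel of what a letter-pair similarity measures, here is a small stand-alone sketch in the spirit of Simon White's "How to Strike a Match" approach, which Text::WhiteSimilarity is presumably named after. It is not the gem's code, and letter_pairs and pair_similarity are hypothetical names.

```ruby
# Adjacent letter pairs of each word, upper-cased ("France" -> FR RA AN NC CE).
def letter_pairs(str)
  str.upcase.split(/\s+/).flat_map { |word| word.each_char.each_cons(2).map(&:join) }
end

# Twice the number of shared pairs divided by the total number of pairs.
def pair_similarity(a, b)
  pairs_a = letter_pairs(a)
  pairs_b = letter_pairs(b)
  total = pairs_a.size + pairs_b.size
  return 0.0 if total.zero?

  hits = 0
  pairs_a.each do |pair|
    idx = pairs_b.index(pair)
    next unless idx

    pairs_b.delete_at(idx) # each pair in b may be matched only once
    hits += 1
  end
  2.0 * hits / total
end

p pair_similarity('France', 'French')
p pair_similarity('John Roy Smith', 'Smith John')
```

Because pairs are collected per word, word order does not affect the score, which suits names that may be written first-last or last-first.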


One initial attempt at a robust human name matcher / clustering solution in Ruby: https://github.com/adrianomitre/match_author_names
