Parsing a full name into its constituents
We are in need of developing a back end application that can parse a full name into
Prefix (Dr. Mr. Ms. etc)
First Name
Last Name
Middle Name
etc
Challenge here is that it has to support names of multiple countries and languages. One assumption that we have is we will always get a country and language along with the full name as input.
The full name m开发者_如何学运维ay come in any format. For the same country / language combination, it may come in with first name last name or the reverse. Comma will not be a part of the Full Name.
Is is feasible? We are also open to any commercially available software.
I think this is impossible. Consider Ralph Vaughan Williams. His family name is "Vaughan Williams" and his first name is "Ralph". Contrast this with Charles Villiers Stanford, whose family name is "Stanford", with first name "Charles" and middle name "Villiers".
Both are English-speaking composers from England, so country and language information is not sufficient to establish the correct parsing logic.
Since the OP was open to any commercially available offering...
The "IBM InfoSphere Global Name Analytics" appears to be a commercial solution satisfying the original request for the parsing of a [free-form unstructured] personal name [full name]; apparently with a degree of certainty in regards to resolving some of the name ambiguity issues alluded to in other responses.
Note: I have no personal experience nor association with the product, I had merely encountered this discussion and the following reference links while re-investigating effectively the same concern as described by the OP. HTH.
A general product documentation link:
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gna_con_gnaoverview.html
Refer to the "Parsing names using NameParser" at
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_np_con_parsingnamesusingnameparser.html
The NameParser is a component API for the product per
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_logicalarchitecturecapis.html
Refer to the "Parsing names using IBM NameWorks" at
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_parsingnamesusingnameworks.html
"IBM NameWorks combines the individual IBM InfoSphere Global Name Recognition components into a single, unified, easy-to-use application programming interface (API), and also extends this functionality to Java applications and as a Web service"
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_logicalarchitecturenwapis.html
To clarify why I think this answers the question, ameliorating some of the previous alluded difficulties in accomplishing the task... If I understood correctly what I read, the APIs use the "NameHunter Server" to search the "IBM InfoSphere Global Name Data Archive (NDA)" which is described as "a collection of nearly one billion names from around the world, along with gender and country of association for each name. This large repository of name information powers the algorithms and rules that IBM InfoSphere Global Name Recognition products use to categorize, classify, parse, genderize , and match names."
FWiW I also ran across a "Name Parser" which uses a database of ~140K names as noted at:
http://www.melissadata.com/dqt/websmart-web-services.htm
The only reasonable approach is to avoid having to do so in the first place. The most obvious (and common) way to do that is to have the user enter the title, first/given name, last/family name, suffix, etc., separately from each other, rather than attempting to parse them out of a single string.
Ask yourself: do you really need the different parts of a name? Parsing names is inherently un-doable, since different cultures use different conventions (e.g. "middle name" is a typical USA-ism) and some small percentage of names will always be treated wrongly.
It is much preferable to treat a name as an "atomic" not-splittable entity.
Here are two free PHP name parsing libraries for those on a budget:
https://code.google.com/p/php-name-parser/
http://jasonpriem.org/human-name-parse/
And here is a Javasript library in Node package manager:
https://npmjs.org/package/name-parser
I wrote a simple human name parser in javascript as an npm module:
https://www.npmjs.org/package/humanparser
humanparser
Parse a human name string into salutation, first name, middle name, last name, suffix.
Install
npm install humanparser
Usage
var human = require('humanparser');
var fullName = 'Mr. William R. Jenkins, III'
, attrs = human.parseName(fullName);
console.log(attrs);
//produces the following output
{ saluation: 'Mr.',
firstName: 'William',
suffix: 'III',
lastName: 'Jenkins',
middleName: 'R.',
fullName: 'Mr. William R. Jenkins, III' }
A basic algorithm could do the following:
- First see if incoming string starts with a title such as Mrs and remove it if it does, checking against a fixed list of titles.
- If there is one space left and one space exactly, assume first word is first name and second word is surname (which will be incorrect at times)
To go beyond that would be lots of work, see How to parse full names to identify avenues for improvement and see these involved IBM docs for further implementation clues
"Ashton Jordan" "Jordan Ashton" -- u can't tell which is the surname and which is the give name. Also people in South India apparently don't have a surname. The same with Sherpas in the Himalayas.
But say you have a huge list of all surnames (which are never used as given names) then maybe you can use that to identify other parts of the name (Salutations/Given/Middle/Jr/Sr/I/II/...) And if there is ambiguity your name-parser could ask for human input.
As others have explained, the problem is not solvable. The best approach I can think of to storing names is storing the full name, followed by the start (and potentially also ending) offsets into a "primary collating subfield" which the person entering the name could have indicated by highlighting it or such. For example
John Robert Miller, Jr.
where the boldface is indicating what was marked as the "primary collating subfield". This range would then be moved to the beginning of the string when generating the collating key.
Of course this approach alone may not be sufficient if you also want to support titles (and ignoring them for collation purposes)...
精彩评论