Find names with Regular Expression
To find names in a large text I have the following regex:
([A-Z][a-z]*)[\s-]([A-Z][a-z]*)
This works fine for normal names like "Jack Oneill" or "John Guidetti". But there are a few kinds of names that I want to find but cannot, such as:
Chandler Murial Bing
Gandalf the Gray
Pieter van den Woude
I cannot seem to get this right with my limited knowledge of Regular Expressions. Can anyone help me (and please provide a good website/book for this) :)
The best way to approach a regular expression problem is to first describe the matches you are looking for (usually called a grammar).
For example, from your question, I might describe it like the following:
- A capitalized word is defined as one capital letter and 1+ letters/dashes, or one capital letter and a . (an initial).
- An uncapitalized word is defined as 1 letter and 1+ letters/dashes (not perfect, because that could allow ending in a dash).
- First word starts with a capital letter
- Last word starts with a capital letter
- 0+ capitalized words between first and last word
- Then 0-2 uncapitalized words between the leading capitalized words and the last word
- At least two words.
- Words are broken by whitespace
If this provides a reasonably close match to the desired result set (and to be clear, for names, there are so many variations that you will either have false positives or false negatives), then you begin constructing the expression:
- Capitalized word:
[A-Z]([a-z]+|\.)
- Uncapitalized word:
[a-z][a-z\-]+
Result:
[A-Z]([a-z]+|\.)(?:\s+[A-Z]([a-z]+|\.))*(?:\s+[a-z][a-z\-]+){0,2}\s+[A-Z]([a-z]+|\.)
Sample text:
Hello my name is Chandler Muriel Bing. I have a friend who is named Pieter van den Woude and he has another friend, A. A. Milne. Gandalf the Gray joins us. Together, we make up the Friends Cast and Crew.
Matches:
- Chandler Muriel Bing
- Pieter van den Woude
- A. A. Milne
- Gandalf the Gray
- Friends Cast and Crew
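If you want to check the expression yourself, here is a minimal Python sketch that assembles it from the two building blocks and runs it over the sample text. The names CAP, LOWER, NAME and find_names are made up for this illustration, and the capturing groups are swapped for non-capturing ones so that each match comes back as one string; otherwise the pattern is the one given above.

```python
import re

# Building blocks from the grammar. As written, a capitalized word accepts
# only lowercase letters (or a single dot) after the capital letter.
CAP = r"[A-Z](?:[a-z]+|\.)"   # capitalized word or initial: "Bing", "A."
LOWER = r"[a-z][a-z\-]+"      # uncapitalized word: "van", "den", "the"

# first word, 0+ capitalized words, 0-2 uncapitalized words, last word
NAME = re.compile(rf"{CAP}(?:\s+{CAP})*(?:\s+{LOWER}){{0,2}}\s+{CAP}")

TEXT = (
    "Hello my name is Chandler Muriel Bing. I have a friend who is named "
    "Pieter van den Woude and he has another friend, A. A. Milne. "
    "Gandalf the Gray joins us. Together, we make up the Friends Cast and Crew."
)


def find_names(text):
    """Return every non-overlapping match of NAME in text, left to right."""
    return [m.group(0) for m in NAME.finditer(text)]


if __name__ == "__main__":
    for name in find_names(TEXT):
        print(name)
```

Note that "Friends Cast and Crew" comes back along with the real names, which is the false positive discussed under Problems.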
Problems:
- Because you want to match Gandalf the Gray and Pieter van den Woude you will inevitably match other sets that consist of names with uncapitalized words in between (Friends Cast and Crew). The above grammar attempts to limit the problem by limiting it to 2 uncapitalized words. You could also create a set of allowed uncapitalized words instead ("van", "der", "the"), and only match those words.
- Doesn't allow for non-Latin-alphabet letters, ligatures, diacritics, etc.
- As I and others have pointed out, regular expressions will never be perfect for this situation, but as you said, you want something to get you most of the way there. In this case, the above expression should do a pretty good job, but consider it a blunt instrument! You've been warned.
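The "set of allowed uncapitalized words" idea from the first point could look roughly like the sketch below. The word list in PARTICLES is only an example and would need to be extended for real data; PARTICLES, NAME_STRICT and find_names_strict are names invented for this illustration.

```python
import re

CAP = r"[A-Z](?:[a-z]+|\.)"   # capitalized word or initial, as above

# Assumption: a small hand-picked list of lowercase words allowed inside a name.
PARTICLES = ("van", "den", "der", "de", "the")
PARTICLE = r"(?:" + "|".join(PARTICLES) + r")\b"

# Same shape as the general expression, but the 0-2 lowercase words
# must come from PARTICLES instead of being any lowercase word.
NAME_STRICT = re.compile(
    rf"{CAP}(?:\s+{CAP})*(?:\s+{PARTICLE}){{0,2}}\s+{CAP}"
)


def find_names_strict(text):
    """Return every non-overlapping match of NAME_STRICT in text."""
    return [m.group(0) for m in NAME_STRICT.finditer(text)]


if __name__ == "__main__":
    print(find_names_strict("Pieter van den Woude met Gandalf the Gray."))
    print(find_names_strict("Together, we make up the Friends Cast and Crew."))
```

With this list, "Friends Cast and Crew" is no longer matched as a whole because "and" is not an allowed word, but the shorter "Friends Cast" still is: any two capitalized words in a row look like a name to this expression, so the whitelist narrows the false positives rather than removing them.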
In your case, just add another word group to the end of your expression:
[\s-]([A-Z][a-z]*)
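As a rough Python sketch of that suggestion: the extra group is appended to the expression from the question. Wrapping it in an optional non-capturing group is my assumption, so that two-word names keep matching; NAME3 and first_name_match are names invented for this example.

```python
import re

# The question's expression with one more word appended.
# Assumption: the third word is optional, so "Jack Oneill" still matches.
NAME3 = re.compile(
    r"([A-Z][a-z]*)[\s-]([A-Z][a-z]*)(?:[\s-]([A-Z][a-z]*))?"
)


def first_name_match(text):
    """Return the first match of NAME3 in text, or None if there is none."""
    m = NAME3.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    for sample in ("Jack Oneill", "Chandler Murial Bing",
                   "Gandalf the Gray", "Pieter van den Woude"):
        print(sample, "->", first_name_match(sample))
```

This covers "Chandler Murial Bing", but names with lowercase words in the middle ("Gandalf the Gray", "Pieter van den Woude") still produce no match.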
Generally speaking, regex is not well suited to this problem: there are too many special cases, and you will need to build a list of those names.
For complex names, you may want to look into natural language processing: http://en.wikipedia.org/wiki/Natural_language_processing