
Select capitalized & all-caps words using RegEx

I'm trying to find names of people and companies (everything that is capitalized but not at the beginning of a sentence) in a large body of text. The purpose is to find as many instances as possible so that they can be XML-tagged properly.

This is what I've come up with so far:

[^\W](\s\b[\p{Lu}][\p{Lu}|\p{Ll}]+\b)+

It has two problems:

  1. It selects two characters too many in front of the hit. In the sentence "Is this Beetle ugly?" it finds s Beetle which complicates the subsequent tagging.
  2. When a capitalized word is preceded by an apostrophe or a colon, it isn't found. If possible, I'd like to limit the characters that mark the end of a sentence to just !?.
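
The first problem can be reproduced with a small Java sketch, assuming jEdit's search follows java.util.regex syntax. The class name `OriginalRegexDemo` and the helper `findAll` are made up for this illustration; the regex and the sentence are the ones quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OriginalRegexDemo {
    // The regex from the question, unchanged. [^\W] consumes one word
    // character and \s consumes the space, so both end up in the match.
    static final Pattern ORIGINAL = Pattern.compile(
        "[^\\W](\\s\\b[\\p{Lu}][\\p{Lu}|\\p{Ll}]+\\b)+");

    // Hypothetical helper: collect every full match in order of appearance.
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = ORIGINAL.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("Is this Beetle ugly?")); // [s Beetle]
    }
}
```

The two extra characters are there because the leading [^\W] and \s are part of the match itself rather than a condition on what precedes it.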

Here's the sample text I'm using to test it out:

John Adams is my hero. There's just no limits to his imagination! Is this Beetle ugly? It sings at the: La Scala opera house. I have a dream that I will find work at' Frame Store but not in the USA! This way ILM could do whatever they pleased. ILM was very sweet. Visual Effects did a good job... Neither did Animatronix?

I'm using jEdit http://jedit.org since I need something that works on both Windows and OS X.


Update: this now avoids matching at the start of the string.

(?<!(?:[!?\.]\s|^))(\b[\p{Lu}][\p{Lu}\p{Ll}]+\b)+

(?<!(?:[!?\.]\s|^)) is a negative lookbehind that ensures the match is not preceded by one of !?. followed by a whitespace character, or by the start of a line.
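
As a rough illustration of the lookbehind, here is a minimal Java sketch, assuming jEdit's regex search follows java.util.regex syntax. The class name `LookbehindDemo` and the helper `findCapitalized` are made up for this example; the sentences come from the sample text in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindDemo {
    // Regex from the update above: a capitalized word that is neither
    // preceded by "!", "?" or "." plus one whitespace character,
    // nor located at the start of the input.
    static final Pattern CAPITALIZED = Pattern.compile(
        "(?<!(?:[!?\\.]\\s|^))(\\b[\\p{Lu}][\\p{Lu}\\p{Ll}]+\\b)+");

    // Hypothetical helper: collect every match in order of appearance.
    static List<String> findCapitalized(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = CAPITALIZED.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String sample = "Is this Beetle ugly? It sings at the: La Scala opera house.";
        System.out.println(findCapitalized(sample)); // [Beetle, La, Scala]
    }
}
```

"Is" is rejected by the ^ branch and "It" by the [!?\.]\s branch, while "La" is found even though a colon precedes it. In plain Java, ^ only matches at the start of the input unless Pattern.MULTILINE is set; an editor searching line by line may behave differently.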

I tested it with jEdit.

Update to cover names consisting of multiple words:

(?<!(?:[!?\.]\s|^))(\b[\p{Lu}][\p{Lu}\p{Ll}]*\b(?:\s\b[\p{Lu}][\p{Lu}\p{Ll}]*\b)*)+
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (added)
                                            ^ (changed)

I added the group (?:\s\b[\p{Lu}][\p{Lu}\p{Ll}]*\b)* to match optional following words that start with an uppercase letter. I also changed the + to a * so that the A in your example "My company's called A Few Good Men" is matched. The downside of this change is that the regex now also matches I as a name.
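
A minimal Java sketch of the multi-word version, again assuming java.util.regex syntax. The class name `MultiWordDemo`, the helper `findNames`, and the second sentence of the sample string are invented for illustration; they are not from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiWordDemo {
    // Multi-word regex from the update above: one capitalized word,
    // optionally followed by further capitalized words separated by
    // a single whitespace character.
    static final Pattern NAMES = Pattern.compile(
        "(?<!(?:[!?\\.]\\s|^))"
        + "(\\b[\\p{Lu}][\\p{Lu}\\p{Ll}]*\\b"
        + "(?:\\s\\b[\\p{Lu}][\\p{Lu}\\p{Ll}]*\\b)*)+");

    // Hypothetical helper: collect every match in order of appearance.
    static List<String> findNames(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = NAMES.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String sample = "My company's called A Few Good Men. But I like ILM.";
        System.out.println(findNames(sample)); // [A Few Good Men, I, ILM]
    }
}
```

The output shows both effects of the change: "A Few Good Men" is now found as one hit, and the pronoun "I" is picked up as well.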

See tchrist's comment. Names are not simple, and it gets really difficult if you want to cover the more complex cases.

This also works:

(?<!\p{P}\s)(\b[\p{Lu}][\p{Lu}|\p{Ll}]+\b)+

But \p{P} covers all punctuation, which I understand is not what you want. Maybe you can find a property that fits your needs at regular-expressions.info/unicode.html.

Another mistake in your expression is the | in the character class. It's not needed: inside a character class it is just a literal character, so the class will also match words like U|S|A. Just remove it:

(?<![!?\.]\s)(\b[\p{Lu}][\p{Lu}\p{Ll}]+\b)+
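
To make the effect of the stray | visible, here is a small Java sketch that runs the expression with and without it, assuming java.util.regex syntax. The class name `PipeInClassDemo`, the helper `findAll`, and the sample string are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PipeInClassDemo {
    // Same expression twice; the only difference is the literal '|'
    // inside the second character class.
    static final Pattern WITH_PIPE = Pattern.compile(
        "(?<![!?\\.]\\s)(\\b[\\p{Lu}][\\p{Lu}|\\p{Ll}]+\\b)+");
    static final Pattern WITHOUT_PIPE = Pattern.compile(
        "(?<![!?\\.]\\s)(\\b[\\p{Lu}][\\p{Lu}\\p{Ll}]+\\b)+");

    // Hypothetical helper: collect every match of p in text.
    static List<String> findAll(Pattern p, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String sample = "made in the U|S|A today";
        System.out.println(findAll(WITH_PIPE, sample));    // [U|S|A]
        System.out.println(findAll(WITHOUT_PIPE, sample)); // []
    }
}
```

Note that this shorter lookbehind has no ^ branch, so a capitalized word at the very start of the input would still match; the sample string starts with a lowercase word to keep that out of the picture.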