Select capitalized & all-caps words using RegEx

2023-03-27 09:23

I'm trying to find names of people and companies (everything that is capitalized but not at the beginning of a sentence) in a large body of text. The purpose is to find as many instances as possible so that they can be XML-tagged properly.

This is what I've come up with so far:

[^\W](\s\b[\p{Lu}][\p{Lu}|\p{Ll}]+\b)+

It has two problems:

It selects two characters too many in front of the hit. In the sentence "Is this Beetle ugly?" it finds "s Beetle", which complicates the subsequent tagging.
When a capitalized word is preceded by an apostrophe or a colon, it isn't found. If possible, I'd like to limit the characters that mark the end of a sentence to just "!", "?" and ".".
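Both problems can be reproduced outside the editor. The following is a minimal Java sketch, assuming jEdit's search behaves like java.util.regex; the class name OriginalPatternDemo and the helper findAll are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OriginalPatternDemo {
    // The pattern from the question, unchanged apart from Java string escaping.
    static final Pattern ORIGINAL =
            Pattern.compile("[^\\W](\\s\\b[\\p{Lu}][\\p{Lu}|\\p{Ll}]+\\b)+");

    // Collects every full match (group 0) of the pattern in the text.
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = ORIGINAL.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Problem 1: the leading word character and the whitespace are
        // consumed, so they become part of the hit.
        System.out.println(findAll("Is this Beetle ugly?")); // prints [s Beetle]

        // Problem 2: ':' is not a word character, so a hit cannot start
        // in front of "La"; it starts at the 'a' of "La" instead.
        System.out.println(findAll("It sings at the: La Scala opera house.")); // prints [a Scala]
    }
}
```

The extra characters come from the pattern consuming its context. A lookbehind checks the context without consuming it.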

Here's the sample text I'm using to test it out:

John Adams is my hero. There's just no limits to his imagination! Is this Beetle ugly? It sings at the: La Scala opera house. I have a dream that I will find work at' Frame Store but not in the USA! This way ILM could do whatever they pleased. ILM was very sweet. Visual Effects did a good job... Neither did Animatronix?

I'm using jEdit (http://jedit.org) since I need something that works on both Windows and OS X.

Update: this now avoids matching at the start of the string.

(?<!(?:[!?\.]\s|^))(\b[\p{Lu}][\p{Lu}\p{Ll}]+\b)+

(?<!(?:[!?\.]\s|^)) is a negative lookbehind. It ensures that the match is not preceded by one of !?. followed by a whitespace character, or by the start of a line.

I tested it with jEdit.
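For anyone who wants to check this outside jEdit, here is a small Java sketch that runs the pattern over the sample text from the question. It assumes java.util.regex semantics; LookbehindDemo and findAll are illustrative names. Without Pattern.MULTILINE, ^ only matches at the very start of the input, which is enough for the single-line sample.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookbehindDemo {
    // Pattern from the update above, unchanged apart from Java string escaping.
    static final Pattern NOT_AT_SENTENCE_START = Pattern.compile(
            "(?<!(?:[!?\\.]\\s|^))(\\b[\\p{Lu}][\\p{Lu}\\p{Ll}]+\\b)+");

    // Sample text from the question.
    static final String SAMPLE =
            "John Adams is my hero. There's just no limits to his imagination! "
            + "Is this Beetle ugly? It sings at the: La Scala opera house. "
            + "I have a dream that I will find work at' Frame Store but not in the USA! "
            + "This way ILM could do whatever they pleased. ILM was very sweet. "
            + "Visual Effects did a good job... Neither did Animatronix?";

    // Collects every full match of the pattern in the text.
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = NOT_AT_SENTENCE_START.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // prints [Adams, Beetle, La, Scala, Frame, Store, USA, ILM, Effects, Animatronix]
        System.out.println(findAll(SAMPLE));
    }
}
```

"La" and "Frame" are now found, and nothing is consumed in front of a hit. The price is that a name at the start of a sentence is skipped ("John", "Visual"), and each word is a separate hit.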

Update to cover names consisting of multiple words

(?<!(?:[!?\.]\s|^))(\b[\p{Lu}][\p{Lu}\p{Ll}]*\b(?:\s\b[\p{Lu}][\p{Lu}\p{Ll}]*\b)*)+
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (added)
                                            ^ (changed)

I added the group (?:\s\b[\p{Lu}][\p{Lu}\p{Ll}]*\b)* to match optional following words that start with an uppercase letter. I also changed the + to a * so that the A in your example "My company's called A Few Good Men" is matched. However, this change now causes the regex to match "I" as a name.
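A short Java sketch of this trade-off, again assuming java.util.regex semantics (MultiWordDemo and findAll are illustrative names):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MultiWordDemo {
    // Multi-word pattern from the update above, with Java string escaping.
    static final Pattern MULTI_WORD = Pattern.compile(
            "(?<!(?:[!?\\.]\\s|^))"
            + "(\\b[\\p{Lu}][\\p{Lu}\\p{Ll}]*\\b"
            + "(?:\\s\\b[\\p{Lu}][\\p{Lu}\\p{Ll}]*\\b)*)+");

    // Collects every full match of the pattern in the text.
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = MULTI_WORD.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Consecutive capitalized words are joined into a single hit,
        // including the one-letter word "A".
        // prints [A Few Good Men]
        System.out.println(findAll("My company's called A Few Good Men."));

        // Side effect of allowing one-letter words: the pronoun "I" is a hit.
        // prints [I, Frame Store, USA]
        System.out.println(findAll(
                "I have a dream that I will find work at' Frame Store but not in the USA!"));
    }
}
```

The "I" at the start of the second sentence is still skipped because of the ^ in the lookbehind; only the mid-sentence "I" is reported.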

See tchrist's comment. Names are not simple, and it gets really difficult if you want to cover the more complex cases.

This also works:

(?<!\p{P}\s)(\b[\p{Lu}][\p{Lu}|\p{Ll}]+\b)+

But \p{P} covers all punctuation, which, as I understand it, is not what you want. Maybe you can find a property that fits your needs on regular-expressions.info/unicode.html.
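To make the difference visible, here is a Java sketch that compares the \p{P} lookbehind with one restricted to the three sentence-ending characters. It assumes java.util.regex semantics; PunctuationLookbehindDemo and findAll are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PunctuationLookbehindDemo {
    // Lookbehind on any punctuation character, as in the pattern above.
    static final Pattern ANY_PUNCTUATION = Pattern.compile(
            "(?<!\\p{P}\\s)(\\b[\\p{Lu}][\\p{Lu}|\\p{Ll}]+\\b)+");

    // Lookbehind limited to the sentence-ending characters ! ? and .
    static final Pattern SENTENCE_END_ONLY = Pattern.compile(
            "(?<![!?\\.]\\s)(\\b[\\p{Lu}][\\p{Lu}\\p{Ll}]+\\b)+");

    // Collects every full match of the given pattern in the text.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String colon = "It sings at the: La Scala opera house.";
        // ':' is punctuation, so "La" is rejected by the \p{P} lookbehind.
        System.out.println(findAll(ANY_PUNCTUATION, colon));   // prints [It, Scala]
        System.out.println(findAll(SENTENCE_END_ONLY, colon)); // prints [It, La, Scala]

        String apostrophe = "I will find work at' Frame Store";
        // The apostrophe is punctuation too, so "Frame" is rejected as well.
        System.out.println(findAll(ANY_PUNCTUATION, apostrophe));   // prints [Store]
        System.out.println(findAll(SENTENCE_END_ONLY, apostrophe)); // prints [Frame, Store]
    }
}
```

Note that neither of these two patterns excludes the start of the input, which is why "It" shows up in the first example.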

Another mistake in your expression is the | in the character class. It's not needed: inside a character class it is a literal character, so you are just adding | to the class, and the expression will then match words like U|S|A. Just remove it:

(?<![!?\.]\s)(\b[\p{Lu}][\p{Lu}\p{Ll}]+\b)+
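The effect of the stray | can be checked in isolation. This is a minimal Java sketch that tests only the word part of the pattern, assuming java.util.regex semantics; PipeInClassDemo and its method names are illustrative.

```java
import java.util.regex.Pattern;

final class PipeInClassDemo {
    // Word part of the pattern with the stray '|' inside the character class.
    static final String WITH_PIPE = "[\\p{Lu}][\\p{Lu}|\\p{Ll}]+";
    // The same word part with the '|' removed.
    static final String WITHOUT_PIPE = "[\\p{Lu}][\\p{Lu}\\p{Ll}]+";

    // True if the whole word matches the variant that contains '|'.
    static boolean matchesWithPipe(String word) {
        return Pattern.matches(WITH_PIPE, word);
    }

    // True if the whole word matches the corrected variant.
    static boolean matchesWithoutPipe(String word) {
        return Pattern.matches(WITHOUT_PIPE, word);
    }

    public static void main(String[] args) {
        // Inside [...] the '|' is a literal character, not an alternation.
        System.out.println(matchesWithPipe("U|S|A"));    // prints true
        System.out.println(matchesWithoutPipe("U|S|A")); // prints false
        // Ordinary all-caps words match either way.
        System.out.println(matchesWithPipe("USA"));      // prints true
        System.out.println(matchesWithoutPipe("USA"));   // prints true
    }
}
```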
