Regexp to pull capitalized words not at the beginning of a sentence, plus the two adjacent words

I want to pull out capitalized words that don't start a sentence, along with the previous and following word.

I'm using:

(\w*)\b([A-Z][a-z]\w*)\b(\w*)

replace with:

$1 -- $2 -- $3

Edit: It's only returning the $2. Will try suggestions.

And regarding natural language processing? I don't need it for this. I just want to see where capitals show up in a sentence so I can figure out whether they're proper nouns or not.


How about this?

([a-zA-Z]+)\s([A-Z][a-z]*)\s([a-zA-Z]+)

This doesn't take into account anything non-alphabetic though. It also assumes that all words are separated by a single whitespace character. You will need to modify it if you want more complex support.
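To see what this pattern does and does not catch, here is a minimal sketch in Python. The question doesn't name a regex engine, so Python and the sample sentences are assumptions made purely for illustration.

```python
import re

# Pattern from the answer above: word, ONE whitespace char,
# capitalized word, ONE whitespace char, word.
pattern = re.compile(r"([a-zA-Z]+)\s([A-Z][a-z]*)\s([a-zA-Z]+)")

# Made-up sample sentence (not from the question).
text = "We met in Paris last year."
triples = [m.groups() for m in pattern.finditer(text)]
print(triples)  # [('in', 'Paris', 'last')]

# The limitations mentioned above: a double space or punctuation
# next to the capitalized word prevents a match.
print(pattern.findall("We met in  Paris last year."))  # []
print(pattern.findall("We met in Paris, last year."))  # []
```

The sentence-initial "We" is not reported because nothing precedes it for the first group to match.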


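Right now your regex only ever fills $2. \b matches only between a word character and a non-word character (or at the start or end of the string), so it can never match between a non-empty \w* and [A-Z], or between the capitalized word and a non-empty trailing \w*. The only way for the pattern to match is for both outer \w* groups to match the empty string, which is why $1 and $3 come back empty.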
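A quick check of the original pattern, sketched in Python (an assumption, since the question doesn't say which engine is in use; the sample text is made up):

```python
import re

# The pattern from the question, unchanged.
original = re.compile(r"(\w*)\b([A-Z][a-z]\w*)\b(\w*)")

# Made-up sample text.
text = "the Quick brown fox"
m = original.search(text)

# Both outer groups are forced to match the empty string.
print(m.groups())  # ('', 'Quick', '')
```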

So, you need some other (=non-alphanumeric) characters between your words:

Try

(\w*)\W+([A-Z][a-z]\w*)\W+(\w*)

although (if your regex engine allows using Unicode properties), you might be happier with

(\w*)\W+(\p{Lu}\p{Ll}\w*)\W+(\w*)

As written, this matches only capitalized words of two or more letters, i.e. "I" (the pronoun) will not be matched. I suppose you inserted the [a-z] to avoid matches like "IBM"? Or what was your intention?
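Here is a rough sketch of the \W+ version in Python, again only as an illustration since the engine isn't stated; the sample text is invented. Python's built-in re module does not support \p{Lu} / \p{Ll}, so the ASCII variant is used.

```python
import re

# The \W+ pattern from above: the words must be separated by
# at least one non-word character.
fixed = re.compile(r"(\w*)\W+([A-Z][a-z]\w*)\W+(\w*)")

# Made-up sample text.
text = "Yesterday we flew to Berlin, then drove on. It rained."

matches = [m.groups() for m in fixed.finditer(text)]
for before, word, after in matches:
    # Same layout as the "$1 -- $2 -- $3" replacement in the question.
    print(f"{before} -- {word} -- {after}")
# to -- Berlin -- then
# on -- It -- rained
```

"Yesterday" is skipped because no non-word character precedes it, but "It" is still reported: \W+ happily spans the ". " between sentences, so sentence-initial words inside the text need to be filtered separately if they are unwanted.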
