Regexp to pull capitalized words not at the beginning of a sentence, plus the two adjacent words
I want to pull out capitalized words that don't start a sentence, along with the previous and following word.
I'm using:
(\w*)\b([A-Z][a-z]\w*)\b(\w*)
replace with:
$1 -- $2 -- $3
Edit: It's only returning the $2. Will try suggestions.
As for natural-language processing: I don't need it here. I just want to see where capitals show up in a sentence so I can figure out whether they're proper nouns or not.
How about this?
([a-zA-Z]+)\s([A-Z][a-z]*)\s([a-zA-Z]+)
This only handles purely alphabetic words, though, and it assumes that words are separated by exactly one whitespace character. You will need to modify it if you want to handle punctuation, digits, or runs of whitespace.
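To see what that pattern does and doesn't catch, here is a small Python sketch. The function name and the sample sentence are made up for illustration; only the regex comes from the answer.

```python
import re

# Pattern from the answer: word, ONE whitespace char, capitalized word,
# ONE whitespace char, word. Letters only.
PATTERN = re.compile(r"([a-zA-Z]+)\s([A-Z][a-z]*)\s([a-zA-Z]+)")

def capitalized_with_neighbours(text):
    """Return 'previous -- Capitalized -- next' for every match in text."""
    return [" -- ".join(m.groups()) for m in PATTERN.finditer(text)]

sample = "We flew to Paris last week, then on to Rome."
print(capitalized_with_neighbours(sample))
# → ['to -- Paris -- last']
```

"Rome" is missed because it is followed by a period rather than whitespace, which is the limitation described above.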
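Right now your regex fails because of the \b anchors. \b matches only between a word character (\w) and a non-word character (or the start/end of the string). If the first \w* consumes anything, the position before [A-Z] has word characters on both sides, so \b fails there; likewise, the \b after the capitalized word demands a non-word character next, which leaves nothing for the final \w*. The regex therefore matches only when both \w* groups are empty, which is why you only get $2.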
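A quick way to check this is to run the original pattern and look at the captured groups. This is a Python sketch with a made-up helper name and sample sentence; other engines should treat \b the same way for plain ASCII text.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r"(\w*)\b([A-Z][a-z]\w*)\b(\w*)")

def show_groups(text):
    """Return the (group1, group2, group3) tuple for every match."""
    return [m.groups() for m in ORIGINAL.finditer(text)]

print(show_groups("we flew to Paris last week"))
# → [('', 'Paris', '')]
```

Groups 1 and 3 do take part in the match, but only as empty strings.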
So, you need some other (=non-alphanumeric) characters between your words:
Try
(\w*)\W+([A-Z][a-z]\w*)\W+(\w*)
although (if your regex engine allows using Unicode properties), you might be happier with
(\w*)\W+(\p{Lu}\p{Ll}\w*)\W+(\w*)
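Here is the \W+ version sketched in Python. Python's built-in re module does not support \p{Lu}, so this uses the [A-Z] form; the helper name and sample text are only for illustration.

```python
import re

# \W+ requires at least one non-word character between the words,
# so spaces and punctuation both work as separators.
FIXED = re.compile(r"(\w*)\W+([A-Z][a-z]\w*)\W+(\w*)")

def neighbours(text):
    """Apply the '$1 -- $2 -- $3' replacement template to every match."""
    return [m.expand(r"\1 -- \2 -- \3") for m in FIXED.finditer(text)]

print(neighbours("we flew to Paris last week, then on to Rome."))
# → ['to -- Paris -- last', 'to -- Rome -- ']
```

Two things to watch for: a word right after a full stop still matches, because \W+ happily consumes the ". ", and matches never overlap, so a word captured as $3 cannot also show up as the next $2.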
As written, only capitalized words of length 2 or more are matched, i.e. "I" (as in "me") will not be matched by this. I suppose you inserted the [a-z] to avoid matches like "IBM"? Or what was your intention?
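To make the effect of that [a-z] concrete, this Python sketch tests the capitalized-word part of the pattern on its own. The word list and function name are just examples.

```python
import re

# Only the middle group of the pattern: one uppercase letter, one lowercase
# letter, then any further word characters.
CAPITALIZED = re.compile(r"[A-Z][a-z]\w*")

def is_capitalized_word(word):
    """True if the whole word fits the [A-Z][a-z]\\w* shape."""
    return CAPITALIZED.fullmatch(word) is not None

for word in ["Paris", "I", "IBM", "Ok", "iPhone"]:
    print(word, is_capitalized_word(word))
# → Paris True, I False, IBM False, Ok True, iPhone False
```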