Nested Groups in Regex
I'm constructing a regex that looks for dates. I would like to return the date found and the sentence it was found in. In the code below, the strings on either side of date_string should check for the conditions of a sentence. For your sake, I've omitted the regex for date_string; suffice it to say, it works for picking out dates. While the inside of date_string isn't important, it is grouped as one entire regex.
"((?:[^.|?|!]*)"+date_string+"(?:[^.|?|!]*[.|?|!]\s*))"
The problem is that date_string is only matching the last number of any given date, presumably because the regex in front of date_string is matching too far and overrunning the date regex. For example, if I say "Independence Day is July 4.", I will get the sentence and 4, even though it should match 'July 4'. In case you're wondering, the alternatives inside date_string are ordered in such a way that 'July 4' should match first. Is there any way to do this all in one regex? Or do I need to split it up somehow (i.e. split up all text into sentences, and then check each sentence)?
There are several things wrong with your regex.
- There is no alternation in character classes. You want [^.?!], not [^.|?|!].
- You don't need the non-capturing groups at all.
- You probably don't need any "outer" grouping, since the entire match is what you look for.
- Your match part preceding the date is greedy where it should not be (this runs over part of your date).
- You make assumptions about what resembles a sentence that do not match reality. Your own example proves that, if you try.
Putting that last point aside for the moment, you end up with this version:
[^.?!]*?(July 4)[^.?!]*[.?!]\s*
where the literal July 4 stands in for your date regex. Run against your question text, this matches:
' For example, if I say "Independence Day is July 4.'
'", I will get the sentence and 4, even though it should match 'July 4'. '
which pretty much proves my point #5.
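To illustrate the first point in the list above, here is a minimal Python sketch (assuming Python's re module, which the question does not actually name). Inside a character class the pipe is an ordinary character, so the negated class in the question also refuses to match a literal |. The sample text is made up for the demonstration.

```python
import re

# Character class from the question: '|' is a literal member here,
# so the negated class excludes '.', '|', '?' and '!'.
with_pipes = re.compile(r"[^.|?|!]*")

# Corrected class: excludes only '.', '?' and '!'.
without_pipes = re.compile(r"[^.?!]*")

text = "a|b. c"  # hypothetical sample text

stops_at_pipe = with_pipes.match(text).group(0)
stops_at_period = without_pipes.match(text).group(0)

print(repr(stops_at_pipe))    # 'a'
print(repr(stops_at_period))  # 'a|b'
```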
You can make the repetition operator non-greedy by adding a question mark. In your case it would be
[^.?!]*?
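The difference can be reproduced with a small Python sketch. The date_string below is a hypothetical stand-in (the real one was omitted from the question); it is assumed to try 'July 4' first and to fall back to a bare number, which is enough to show the behaviour described.

```python
import re

# Hypothetical stand-in for the omitted date regex: the month-and-day
# alternative comes first, a bare number is the fallback.
date_string = r"(July \d{1,2}|\d{1,2})"

# Greedy prefix: swallows as much as it can, then backtracks only far
# enough for date_string to match something, which is the bare "4".
greedy = re.compile(r"[^.?!]*" + date_string + r"[^.?!]*[.?!]\s*")

# Lazy prefix: grows one character at a time, so date_string gets its
# first chance at the "J" of "July".
lazy = re.compile(r"[^.?!]*?" + date_string + r"[^.?!]*[.?!]\s*")

text = "Independence Day is July 4."

greedy_date = greedy.search(text).group(1)
lazy_date = lazy.search(text).group(1)

print(greedy_date)  # 4
print(lazy_date)    # July 4
```

In both cases the whole match (group 0) is the full sentence; only the captured date differs.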
And yes, splitting the text into sentences first (preferably excluding the last character) would make this much easier.
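A rough sketch of the split-first approach in Python follows. The sentence splitter is deliberately naive (it breaks after ., ? or ! followed by whitespace, so abbreviations and quoted text will trip it up), and both date_re and find_dates_with_sentences are hypothetical names introduced for illustration. This version keeps the terminating punctuation on each sentence; use sentence.rstrip(".?!") if you want it excluded.

```python
import re

# Hypothetical stand-in for the omitted date regex.
date_re = re.compile(r"July \d{1,2}")


def find_dates_with_sentences(text):
    """Return (date, sentence) pairs using a naive sentence split."""
    # Split on whitespace that follows a sentence terminator; the
    # lookbehind keeps the terminator attached to its sentence.
    sentences = re.split(r"(?<=[.?!])\s+", text)
    results = []
    for sentence in sentences:
        for m in date_re.finditer(sentence):
            results.append((m.group(0), sentence))
    return results


text = "Independence Day is July 4. Nothing happens here! Is July 14 a holiday?"
results = find_dates_with_sentences(text)
for date, sentence in results:
    print(date, "|", sentence)
```

Because each sentence is searched on its own, the date regex never has to compete with a sentence-matching prefix, so the greedy/lazy issue does not arise.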
(Seems like I didn't look at what was in the character class. Replaced it with tloflin's.)