Regex to find all sentences of text?
I have been trying to teach myself Regexes in python and I decided to print out all the sentences of a text. I have been tinkering with the regular expressions for the past 3 hours to no avail.
I just tried the following but couldn't do anything.
p = open('anan.txt')
process = p.read()
regexMatch = re.findall('^[A-Z].+\s+[.!?]$',process,re.I)
print regexMatch
p.close()
My input file is like this:
OMG is this a question ! Is this a sentence ? My.
name is.
This prints no outputs. But when I remove "My. name is.", it prints OMG is this a question and Is this a sentence together as if it only reads the first line.
What is the best solution of regex that can find all sentences in a text file - regardless if the sentence carries to new line or so - and also 开发者_StackOverflow中文版reads the entire text? Thanks.
Something like this works:
## pattern: Upercase, then anything that is not in (.!?), then one of them
>>> pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M)
>>> pat.findall('OMG is this a question ! Is this a sentence ? My. name is.')
['OMG is this a question !', 'Is this a sentence ?', 'My.']
Notice how name is.
is not in the result because it does not start with a uppercase letter.
Your problem comes from the use of the ^$
anchors, they work on the whole text.
There are two issues in your regex:
- Your expression is anchored by
^
and$
, which are the "start of line" and "end of line" anchors, respectively. That means that your pattern is looking to match an entire line of your text. - You are searching for
\s+
before your punctuation character, which specifies one or more whitespace character. If you don't have whitespace before your punctuation, the expression will not match.
Edited: now it will work with multiline sentences too.
>>> t = "OMG is this a question ! Is this a sentence ? My\n name is."
>>> re.findall("[A-Z].*?[\.!?]", t, re.MULTILINE | re.DOTALL )
['OMG is this a question !', 'Is this a sentence ?', 'My\n name is.']
Only one thing left to explain - re.DOTALL
makes .
match newline as described here
Thank you cji and Jochen Ritzel.
sentence=re.compile("[A-Z].*?[\.!?] ", re.MULTILINE | re.DOTALL )
I think this is the best, just add a space at the end.
SampleReport='I image from 08/25 through 12. The patient image 1.2, 23, 34, 45 and 64 from serise 34. image look good to have a tumor in this area. It has been resected during the interval between scans. The'
if use
pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M)
pat.findall(SampleReport)
The result will be:
['I image from 08/25 through 12.',
'The patient image 1.',
'It has been resected during the interval between scans.']
The bug is it can't handle digit like 1.2. But this one works perfectly.
sentence.findall(SampleReport)
Result
['I image from 08/25 through 12. ',
'The patient image 1.2, 23, 34, 45 and 64 from serise 34. ',
'It has been resected during the interval between scans. ']
Try the other way around: Split the text at sentence boundaries.
lines = re.split(r'\s*[!?.]\s*', text)
If that doesn't work, add a \
before the .
.
You can try:
p = open('a')
process = p.read()
print process
regexMatch = re.findall('[^.!?]+[.!?]',process)
print regexMatch
p.close()
The regex used here is [^.!?]+[.!?]
which tries to match one or more non-sentence delimiter followed by a sentence delimiter.
I tried on Notepad++, and I got this :
.*$
And activate the multiline option :
re.MULTILINE
Cheers
精彩评论