Regex to find all sentences of text?

2023-01-12 22:37 Q&A author:

I have been trying to teach myself regexes in Python, and I decided to print out all the sentences of a text. I have been tinkering with the regular expressions for the past 3 hours to no avail.

I just tried the following, but couldn't get it to work.

import re

# Read the whole file, then look for sentences with the pattern under test.
with open('anan.txt') as p:
    process = p.read()
regexMatch = re.findall(r'^[A-Z].+\s+[.!?]$', process, re.I)
print(regexMatch)

My input file is like this:

OMG is this a question ! Is this a sentence ? My.
name is.

This prints nothing. But when I remove "My. name is.", it prints "OMG is this a question" and "Is this a sentence" together as one match, as if it only reads the first line.

What is the best regex for finding all sentences in a text file - regardless of whether a sentence carries over to a new line - that also reads the entire text? Thanks.

Something like this works:

## pattern: Uppercase, then anything that is not in (.!?), then one of them
>>> pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M)
>>> pat.findall('OMG is this a question ! Is this a sentence ? My. name is.')
['OMG is this a question !', 'Is this a sentence ?', 'My.']

Notice how "name is." is not in the result because it does not start with an uppercase letter.

Your problem comes from the use of the ^ and $ anchors. Without re.MULTILINE, they match only at the start and end of the whole text.

There are two issues in your regex:

Your expression is anchored by ^ and $. Without the re.MULTILINE flag these match at the start and end of the whole string (with it, at the start and end of each line). Either way, your pattern can only match an entire string or line, never a single sentence inside one.
You are searching for \s+ before your punctuation character, which specifies one or more whitespace characters. If you don't have whitespace before your punctuation, the expression will not match.
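Both issues can be reproduced in isolation. The following is only a small sketch; the "Bob" strings are made-up test inputs, not part of the question.

```python
import re

pattern = r'^[A-Z].+\s+[.!?]$'  # the pattern from the question

# Issue 1: the anchors force a single match that covers the whole text.
one_line = "OMG is this a question ! Is this a sentence ?"
whole_text = re.findall(pattern, one_line, re.I)
print(whole_text)  # ['OMG is this a question ! Is this a sentence ?']

# Issue 2: \s+ demands whitespace right before the final punctuation mark.
no_space = re.findall(pattern, "My name is Bob.", re.I)
with_space = re.findall(pattern, "My name is Bob .", re.I)
print(no_space)    # []
print(with_space)  # ['My name is Bob .']
```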

Edited: now it will work with multiline sentences too.

>>> t = "OMG is this a question ! Is this a sentence ? My\n name is."
>>> re.findall(r"[A-Z].*?[.!?]", t, re.MULTILINE | re.DOTALL)
['OMG is this a question !', 'Is this a sentence ?', 'My\n name is.']

Only one thing left to explain: re.DOTALL makes . match newline characters as well. (re.MULTILINE has no effect on this pattern, since it contains no ^ or $.)
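To see what the flag changes, here is a quick comparison on the same string with and without re.DOTALL (a sketch, using the pattern above written as a raw string):

```python
import re

t = "OMG is this a question ! Is this a sentence ? My\n name is."

# Without DOTALL, . stops at "\n", so the sentence spanning two lines is lost.
without_dotall = re.findall(r"[A-Z].*?[.!?]", t)
# With DOTALL, . also matches "\n", so the match can continue onto the next line.
with_dotall = re.findall(r"[A-Z].*?[.!?]", t, re.DOTALL)

print(without_dotall)  # ['OMG is this a question !', 'Is this a sentence ?']
print(with_dotall)     # ['OMG is this a question !', 'Is this a sentence ?', 'My\n name is.']
```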

Thank you cji and Jochen Ritzel.

sentence = re.compile(r"[A-Z].*?[\.!?] ", re.MULTILINE | re.DOTALL)

I think this is the best: just add a space at the end of the pattern.

SampleReport = 'I image from 08/25 through 12. The patient image 1.2, 23, 34, 45 and 64 from serise 34. image look good to have a tumor in this area.  It has been resected during the interval between scans.  The'

If you use

pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M)
pat.findall(SampleReport)

The result will be:

['I image from 08/25 through 12.',
 'The patient image 1.',
 'It has been resected during the interval between scans.']

The bug is that it can't handle decimals like 1.2. This one handles them, because the punctuation mark must be followed by a space:

sentence.findall(SampleReport)

Result

['I image from 08/25 through 12. ',
 'The patient image 1.2, 23, 34, 45 and 64 from serise 34. ',
 'It has been resected during the interval between scans. ']
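One caveat with the trailing space: the last sentence of a text is missed when nothing follows it. A possible variation, offered only as a sketch, swaps the space for a lookahead that accepts either whitespace or the end of the text. The name sentence_end and the shortened report string are my own, not from the answers above.

```python
import re

# Assumption: a sentence ends at . ! or ? followed by whitespace or end of text.
sentence_end = re.compile(r"[A-Z].*?[.!?](?=\s|$)", re.DOTALL)

report = "The patient image 1.2, 23 and 34. It has been resected."

# The trailing-space version needs a space after the final period.
with_space = re.findall(r"[A-Z].*?[.!?] ", report, re.DOTALL)
# The lookahead version does not consume the space and also accepts end of text.
with_lookahead = sentence_end.findall(report)

print(with_space)      # ['The patient image 1.2, 23 and 34. ']
print(with_lookahead)  # ['The patient image 1.2, 23 and 34.', 'It has been resected.']
```

Like the other patterns here, this still breaks on abbreviations such as "Dr. Smith".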

Try the other way around: Split the text at sentence boundaries.

lines = re.split(r'\s*[!?.]\s*', text)

The . needs no backslash here: inside a character class it is already a literal period.
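For reference, this is what splitting gives on the sample text from the question. The second pattern is my own variation (not from the answer) that splits on whitespace after a delimiter, via a lookbehind, so the punctuation stays attached.

```python
import re

text = "OMG is this a question ! Is this a sentence ? My.\nname is."

# Splitting consumes the delimiters, so the punctuation is dropped and a
# trailing empty string is left after the final period.
parts = re.split(r'\s*[!?.]\s*', text)
print(parts)  # ['OMG is this a question', 'Is this a sentence', 'My', 'name is', '']

# Variation: split only on whitespace that directly follows a delimiter.
kept = re.split(r'(?<=[.!?])\s+', text)
print(kept)   # ['OMG is this a question !', 'Is this a sentence ?', 'My.', 'name is.']
```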

You can try:

import re

# Read the whole file and print it, then print every delimiter-terminated chunk.
with open('a') as p:
    process = p.read()
print(process)
regexMatch = re.findall(r'[^.!?]+[.!?]', process)
print(regexMatch)

The regex used here is [^.!?]+[.!?], which matches one or more characters that are not sentence delimiters, followed by a sentence delimiter.
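Run on the sample text from the question (as an inline string rather than a file), this pattern also picks up sentences that start in lowercase, but each match keeps the whitespace that preceded it, so a strip() pass helps. A minimal sketch:

```python
import re

text = "OMG is this a question ! Is this a sentence ? My.\nname is."

raw = re.findall(r'[^.!?]+[.!?]', text)
print(raw)
# ['OMG is this a question !', ' Is this a sentence ?', ' My.', '\nname is.']

# Remove the leading whitespace (including the newline) from each match.
sentences = [s.strip() for s in raw]
print(sentences)
# ['OMG is this a question !', 'Is this a sentence ?', 'My.', 'name is.']
```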

I tried it in Notepad++ and came up with this:

.*$

And activate the multiline option:

re.MULTILINE

Cheers
