Limit the number of sentences in a string
A beginner's Python question:
I have a string with x number of sentences. How do I extract the first 2 sentences (each may end with . or ? or !)?
Ignoring considerations such as when a . actually constitutes the end of a sentence:
import re
' '.join(re.split(r'(?<=[.?!])\s+', phrase, 2)[:-1])
EDIT: Another approach that just occurred to me is this:
re.match(r'(.*?[.?!](?:\s+.*?[.?!]){0,1})', phrase).group(1)
Notes:
- Whereas the first solution lets you replace the 2 with some other number to choose a different number of sentences, in the second solution you change the 1 in {0,1} to one less than the number of sentences you want to extract.
- The second solution isn't quite as robust in handling, e.g., empty strings or strings with no punctuation. It could be made so, but the regex would be even more complex than it already is, and I would favour the slightly less efficient first solution over an unreadable mess.
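As a rough sketch of the two notes above, both solutions can be wrapped in functions that take the sentence count as a parameter. The names first_sentences_split and first_sentences_match are hypothetical and not part of the original answer. The split version also assumes you want every sentence returned when the string contains n or fewer, where the bare one-liner would drop the last one.

```python
import re

# Hypothetical helper names, introduced only for illustration.
_SENTENCE_GAP = re.compile(r'(?<=[.?!])\s+')

def first_sentences_split(text, n=2):
    """First solution: split on whitespace that follows . ? or !"""
    parts = _SENTENCE_GAP.split(text.strip(), maxsplit=n)
    # Only when more than n pieces come back is the last one the
    # unsplit remainder; otherwise every piece is a wanted sentence.
    if len(parts) > n:
        parts = parts[:-1]
    return ' '.join(parts)

def first_sentences_match(text, n=2):
    """Second solution: one regex match, with {0,n-1} repetitions."""
    pattern = r'(.*?[.?!](?:\s+.*?[.?!]){0,%d})' % (n - 1)
    m = re.match(pattern, text)
    # Assumption: return an empty string when nothing matches
    # (empty input, or no terminating punctuation at all).
    return m.group(1) if m else ''

phrase = "One. Two? Three! Four."
print(first_sentences_split(phrase))     # One. Two?
print(first_sentences_match(phrase, 3))  # One. Two? Three!
```

The same caveat applies as before: neither version knows when a . is not really the end of a sentence.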
I solved it like this: "Separating sentences". A comment on that post also points to NLTK, though I don't know how to find the sentence segmenter on their site...
Here's how you could do it:
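text = "Sentence one? Sentence two. Sentence three? Sentence four. Sentence five."
# split on "." first, then split each piece again on "?"
sentences = text.split(".")
allSentences = []
for sentence in sentences:
    allSentences.extend(sentence.split("?"))
# the first two sentences (punctuation is lost by split)
print(allSentences[0:2])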
There are probably better ways, I look forward to seeing them.
Here is a step-by-step explanation of how to disassemble the string, choose the first two sentences, and reassemble them. As noted by others, this does not take into account that not all dot/question/exclamation characters are really sentence separators.
import re
testline = "Sentence 1. Sentence 2? Sentence 3! Sentence 4. Sentence 5."
# split the first two sentences by the dot/question/exclamation.
# the capturing group keeps each separator as its own list item.
sentences = re.split('([.?!])', testline, 2)
print("result of split:", sentences)
# toss everything else (the last item in the list)
firstTwo = sentences[:-1]
print(firstTwo)
# put the first two sentences back together
finalLine = ''.join(firstTwo)
print(finalLine)
A generator alternative, using my utility function that yields each piece of the string up to the next item of the search sequence:
from itertools import islice
testline = "Sentence 1. Sentence 2? Sentence 3! Sentence 4. Sentence 5."
def multis(search_sequence, text, start=0):
    """ multisearch by given search sequence values from text, starting from position start
    yielding tuples of text before found item and found sequence item"""
    x = ''
    for ch in text[start:]:
        if ch in search_sequence:
            # a separator with no text before it is yielded on its own
            if x: yield (x, ch)
            else: yield ch
            x = ''
        else:
            x += ch
    else:
        # leftover text with no closing separator is yielded on its own
        if x: yield x
# split the first two sentences by the dot/question/exclamation.
two_sentences = list(islice(multis('.?!', testline), 2))  ## must save the result of generation
print("result of split:", two_sentences)
print('\n'.join(sentence.strip() + sep for sentence, sep in two_sentences))
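For comparison, here is a minimal sketch of the same lazy idea built on re.finditer instead of a character loop. The name iter_sentences is hypothetical, and it assumes you want a (sentence, separator) tuple every time, with an empty separator for trailing text, so that unpacking never fails.

```python
import re
from itertools import islice

def iter_sentences(text):
    """Lazily yield (sentence, separator) tuples from text.

    Hypothetical helper: trailing text without a separator is
    yielded as (text, ''), an assumption made for this sketch.
    """
    pos = 0
    for m in re.finditer(r'[.?!]', text):
        sentence = text[pos:m.start()].strip()
        if sentence:  # skip empty pieces, e.g. between "?" and "!" in "?!"
            yield (sentence, m.group())
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        yield (tail, '')

testline = "Sentence 1. Sentence 2? Sentence 3! Sentence 4. Sentence 5."
# islice stops the generator after two sentences; the rest is never scanned.
first_two = list(islice(iter_sentences(testline), 2))
result = ' '.join(sentence + sep for sentence, sep in first_two)
print(result)  # Sentence 1. Sentence 2?
```

Because every item is a tuple of the same shape, the join at the end works for any input, including strings with no punctuation.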