
Regex to extract paragraph

I am attempting to write a regex in Python to extract part of a paragraph.

In the paragraph below, the part I wish to extract runs from 'boost bailout' through 'debt'.

Proposal will boost bailout fund, inject cash into banks and cut Greek debt says reports.

My regex and its output are as follows:

>>> import re
>>> text = 'Proposal will boost bailout fund, inject cash into banks and cut Greek debt says reports.'
>>> pattern = re.compile(r'(boost bailout)+?([\s\S]*?)(debt)+?')
>>> print(re.findall(pattern, text))

[('boost bailout', ' fund, inject cash into banks and cut Greek ', 'debt')]

Although it does extract the correct section, is it right that the extraction is separated into 3 parts in a tuple and not just a single line such as the below?

[('boost bailout fund, inject cash into banks and cut Greek debt')]


From the documentation:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

-- http://docs.python.org/library/re.html
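
The rule quoted above can be checked directly on the sentence from the question. This is only a small sketch in Python 3 syntax; the variable names (no_groups, one_group, three_groups) are made up for illustration. It runs findall with zero, one, and three capturing groups:

```python
import re

text = 'Proposal will boost bailout fund, inject cash into banks and cut Greek debt says reports.'

# No groups: findall returns a list of whole-match strings.
no_groups = re.findall(r'boost bailout[\s\S]*?debt', text)

# One group: findall returns a list holding only that group's text.
one_group = re.findall(r'boost bailout([\s\S]*?)debt', text)

# Three groups: findall returns a list of 3-tuples, one tuple per match.
three_groups = re.findall(r'(boost bailout)([\s\S]*?)(debt)', text)

print(no_groups)
print(one_group)
print(three_groups)
```

Only the first form gives the single string asked for in the question; the other two return what the groups captured rather than the whole match.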

If you want one match, do:

#!/usr/bin/env python
import re
text = 'Proposal will boost bailout fund, inject cash into banks and cut Greek debt says reports.'
pattern = re.compile(r'boost bailout[\s\S]*?debt')
print(re.findall(pattern, text))


Use:

re.search(reg, text).group(0)

or (your case):

pattern.search(text).group(0)
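
One thing to keep in mind with this approach: search returns None when nothing matches, so calling .group(0) on the result raises AttributeError in that case. A minimal sketch of a guard, assuming a hypothetical helper named first_match (not part of the re module):

```python
import re

text = 'Proposal will boost bailout fund, inject cash into banks and cut Greek debt says reports.'
pattern = re.compile(r'boost bailout[\s\S]*?debt')


def first_match(compiled, string):
    """Return the whole first match as a string, or None if nothing matches."""
    m = compiled.search(string)
    # group(0) is the entire match, regardless of how many groups exist.
    return m.group(0) if m else None


print(first_match(pattern, text))
print(first_match(pattern, 'no such phrase here'))
```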


You get a tuple because, as the Python documentation for the re module explains, parentheses create capture groups, which can then be retrieved separately. To avoid this, use a non-capturing group: (?: ... )
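
As a rough illustration, here is the question's pattern with its parentheses turned into non-capturing groups (Python 3 syntax; the +? quantifiers are dropped because they are not needed here). With no capturing groups left, findall returns the whole match as one string:

```python
import re

text = 'Proposal will boost bailout fund, inject cash into banks and cut Greek debt says reports.'

# (?:...) groups the sub-pattern without capturing it.
pattern = re.compile(r'(?:boost bailout)(?:[\s\S]*?)(?:debt)')
matches = pattern.findall(text)

print(matches)
```

In this particular pattern the groups serve no purpose at all, so removing the parentheses entirely gives the same result.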


Your pattern is incorrect:

(boost bailout)+ means the string 'boost bailout' repeated one or more times,
which is certainly not what is wanted. Each pair of parentheses in the pattern also creates a capturing group, so several pairs give you several groups. If you only want to extract all the text from 'boost bailout' to the LAST occurrence of 'debt', the correct pattern is:

pattern = r'boost bailout.+debt'

and the regex is

reg = re.compile(r'boost bailout.+debt',re.DOTALL)  

re.DOTALL is a flag that makes the dot match every character, including newlines, so it can replace [\s\S].
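
The effect of the flag only shows when the text contains a newline. The sketch below uses a hypothetical two-line variant of the sentence (the line break is added purely for illustration) and compares the dot with and without re.DOTALL against [\s\S]:

```python
import re

# Hypothetical two-line variant of the sentence, to show newline handling.
sample = 'Proposal will boost bailout fund,\ninject cash into banks and cut Greek debt says reports.'

# Without re.DOTALL the dot stops at the newline, so nothing matches.
plain = re.search(r'boost bailout.+debt', sample)

# With re.DOTALL the dot also matches '\n'.
dotall = re.search(r'boost bailout.+debt', sample, re.DOTALL)

# [\s\S] matches any character regardless of flags.
charclass = re.search(r'boost bailout[\s\S]+debt', sample)

print(plain)
print(repr(dotall.group()))
print(repr(charclass.group()))
```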

But if you want to extract the text between 'boost bailout' and the FIRST appearance of 'debt', the pattern must be:

pattern = r'boost bailout.+?debt'

Also, use reg.search(text).group() instead of reg.findall(text), which returns a list of one element here.
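
The difference between .+ and .+? only appears when 'debt' occurs more than once. The sentence from the question contains it a single time, so this sketch uses a made-up sample string with two occurrences:

```python
import re

# Hypothetical sample with two occurrences of 'debt'.
sample = 'boost bailout fund and cut Greek debt, while other debt stays.'

# Greedy .+ runs as far as possible and stops at the LAST 'debt'.
greedy = re.search(r'boost bailout.+debt', sample, re.DOTALL).group()

# Lazy .+? stops as early as possible, at the FIRST 'debt'.
lazy = re.search(r'boost bailout.+?debt', sample, re.DOTALL).group()

print(greedy)
print(lazy)
```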

Note that pattern, defined by pattern = r'boost bailout.+?debt', is a string object,
and that reg, defined by reg = re.compile(pattern), is a RegexObject (a compiled pattern object; recent Python 3 versions call the type re.Pattern).

Strictly speaking, the RegexObject is what deserves the name regex, and the string is what deserves the name pattern.
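
To make the distinction concrete, here is a short sketch (Python 3 syntax; via_string and via_object are illustrative names). The compiled object keeps the source string in its .pattern attribute, and the module-level functions accept the string directly:

```python
import re

text = 'Proposal will boost bailout fund, inject cash into banks and cut Greek debt says reports.'

pattern = r'boost bailout.+?debt'      # a plain string
reg = re.compile(pattern, re.DOTALL)   # a compiled regex object

# The compiled object remembers the string it was built from.
print(type(pattern).__name__)
print(reg.pattern == pattern)

# Module-level functions take the string; the compiled object has methods.
via_string = re.search(pattern, text, re.DOTALL).group()
via_object = reg.search(text).group()

print(via_string == via_object)
```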
