Regex to extract paragraph

2023-04-08 12:24

I am attempting to write a regex in Python to extract part of a paragraph.

In the paragraph below, the part I wish to extract runs from 'boost bailout' through 'debt'.

Proposal will boost bailout fund, inject cash into banks and cut Greek debt says reports.

My regex and its output are as follows:

>>> import re
>>> text = 'Proposal will boost bailout fund, inject cash into banks and cut Greek debt says reports.'
>>> pattern = re.compile(r'(boost bailout)+?([\s\S]*?)(debt)+?')
>>> print(re.findall(pattern, text))

[('boost bailout', ' fund, inject cash into banks and cut Greek ', 'debt')]

Although it does extract the correct section, is it expected that the result is split into three parts in a tuple rather than returned as a single string, as below?

[('boost bailout fund, inject cash into banks and cut Greek debt')]

From the documentation:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

-- http://docs.python.org/library/re.html
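To make the quoted rule concrete, here is a small sketch that runs re.findall on the question's text with zero, one, and three capture groups. The variable names are only illustrative.

```python
import re

text = 'Proposal will boost bailout fund, inject cash into banks and cut Greek debt says reports.'

# No groups: each list item is the whole match, as a string.
no_groups = re.findall(r'boost bailout[\s\S]*?debt', text)

# One group: each list item is the text of that group, as a string.
one_group = re.findall(r'boost bailout([\s\S]*?)debt', text)

# Several groups: each list item is a tuple with one entry per group.
three_groups = re.findall(r'(boost bailout)([\s\S]*?)(debt)', text)

print(no_groups)
print(one_group)
print(three_groups)
```

Only the first form gives back the single string the question asks for.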

If you want one match, do:

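#!/usr/bin/env python
import re
text = 'Proposal will boost bailout fund, inject cash into banks and cut Greek debt says reports.'
# No parentheses, so there are no capture groups and findall returns whole matches.
pattern = re.compile(r'boost bailout[\s\S]*?debt')
# print() with a single argument works in both Python 2 and Python 3.
print(re.findall(pattern, text))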

Use:

re.search(reg, text).group(0)

or, in your case:

pattern.search(text).group(0)

You get a tuple because, as the Python documentation for the re module explains, parentheses create capture groups, which can then be retrieved separately. To avoid this, use a non-capturing group: (?: ... )
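As a rough illustration, this sketch keeps the structure of the question's pattern but turns every group into a non-capturing one, so findall returns the whole match as a single string. The grouping is not really needed for this particular pattern; it is kept only to show the syntax.

```python
import re

text = 'Proposal will boost bailout fund, inject cash into banks and cut Greek debt says reports.'

# Same shape as the pattern in the question, but (?:...) does not capture,
# so findall has no groups to report and returns whole matches instead.
pattern = re.compile(r'(?:boost bailout)+?(?:[\s\S]*?)(?:debt)+?')
matches = pattern.findall(text)

print(matches)
```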

Your pattern is incorrect:

(boost bailout)+ means the string 'boost bailout' repeated one or more times, which is certainly not what is wanted. If you put several pairs of parentheses in the pattern, you'll obtain several capturing groups. The correct pattern, if you only want to extract all the text between 'boost bailout' and the LAST string 'debt', is:

pattern = r'boost bailout.+debt'

and the regex is

reg = re.compile(r'boost bailout.+debt',re.DOTALL)

re.DOTALL is a flag that makes the dot match every character, newlines included, so it replaces [\s\S].
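A minimal sketch of the difference, assuming a made-up two-line input (the multiline string is not from the question):

```python
import re

# Hypothetical input with a newline between the two markers.
multiline = 'boost bailout fund,\ninject cash and cut Greek debt'

# Without the flag, the dot stops at the newline, so nothing matches.
without_flag = re.search(r'boost bailout.+debt', multiline)

# With re.DOTALL the dot also matches the newline.
with_flag = re.search(r'boost bailout.+debt', multiline, re.DOTALL)

# [\s\S] matches any character regardless of flags.
char_class = re.search(r'boost bailout[\s\S]+debt', multiline)

print(without_flag)
print(with_flag.group() == char_class.group())
```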

But if you want to extract the text between 'boost bailout' and the FIRST appearance of 'debt', it must be:

pattern = r'boost bailout.+?debt'
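The two patterns only differ when 'debt' occurs more than once. Here is a short sketch with a made-up sentence (not the one from the question) that contains it twice:

```python
import re

# Hypothetical input containing 'debt' twice.
sample = 'boost bailout fund and cut Greek debt, then restructure other debt later.'

# Greedy .+ runs as far as possible, so the match ends at the LAST 'debt'.
greedy = re.search(r'boost bailout.+debt', sample, re.DOTALL).group()

# Lazy .+? stops as soon as possible, so the match ends at the FIRST 'debt'.
lazy = re.search(r'boost bailout.+?debt', sample, re.DOTALL).group()

print(greedy)
print(lazy)
```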

Also, use reg.search(text).group() instead of reg.findall(text), which produces a list containing a single element.
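One thing to watch for: search returns None when nothing matches, so calling .group() directly raises AttributeError in that case. A possible guard is sketched below; the helper name extract is just an illustration.

```python
import re

reg = re.compile(r'boost bailout.+?debt', re.DOTALL)

def extract(text):
    """Return the first matching span, or None when the pattern is absent."""
    match = reg.search(text)
    return match.group() if match else None

text = 'Proposal will boost bailout fund, inject cash into banks and cut Greek debt says reports.'
print(extract(text))
print(extract('no such phrase here'))
```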

Note that pattern, defined by pattern = r'boost bailout.+?debt', is a string object, and that reg, defined by reg = re.compile(pattern), is a RegexObject (the Python 2 name; Python 3 calls it re.Pattern).

Strictly speaking, the RegexObject is what deserves the name regex, and the string is what deserves the name pattern.
