Finding strings in a text using regular expressions with Python

2023-01-21 03:26

I have a text in which only <b> and </b> tags have been used, for example <b>abcd efg-123</b>. How can I extract the string between these tags? I also need to extract the 3 words before and after this <b>abcd efg-123</b> chunk. How can I do that? What would be a suitable regular expression for this?

This will get what's in between the tags:

>>> s="1 2 3<b>abcd efg-123</b>one two three"
>>> for i in s.split("</b>"):
...   if "<b>" in i:
...      print(i.split("<b>")[-1])
...
abcd efg-123
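
The same split-based idea can be wrapped in a small function for Python 3. This is only a sketch: the name extract_bold is made up for illustration, and it assumes the tags are written exactly as <b> and </b>, with no attributes and no nesting.

```python
def extract_bold(text):
    """Return the text of every <b>...</b> chunk, in order."""
    chunks = []
    # The last piece comes after the final </b>, so it is skipped.
    for piece in text.split("</b>")[:-1]:
        if "<b>" in piece:
            # Keep only what follows the last opening tag in this piece.
            chunks.append(piece.split("<b>")[-1])
    return chunks


print(extract_bold("1 2 3<b>abcd efg-123</b>one two three"))
```

Unlike the interactive session above, this version ignores an opening <b> that is never closed.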

This handles tags inside the <b>, unless they are <b> tags themselves, of course.

import re
sometext = 'blah blah 1 2 3<b>abcd efg-123</b>word word2 word3 blah blah'
result = re.findall(
      r'(((?:(?:^|\s)+\w+){3}\s*)'                  # Match 3 words before
      r'<b>((?:[^<]|<[^/]|</[^b]|</b[^>])*)</b>'    # Match <b>...</b>, repeating so inner tags are allowed
      r'(\s*(?:\w+(?:\s+|$)){3}))', sometext)       # Match 3 words after

result == [(' 1 2 3<b>abcd efg-123</b>word word2 word3 ',
    ' 1 2 3',
    'abcd efg-123',
    'word word2 word3 ')]

This should work and perform well, but if it gets any more advanced than this, you should consider using an HTML parser.

This is actually a very dumb version and doesn't allow nested tags.

re.search(r"(\w+)\s+(\w+)\s+(\w+)\s+<b>([^<]+)</b>\s+(\w+)\s+(\w+)\s+(\w+)", text)
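
To show how the groups come back, here is a sketch of that pattern in use. The sample text is an assumption made for illustration. Note that the pattern as written needs whitespace on both sides of the tags, so it will not match text such as 3<b>abcd where a tag touches the neighbouring word; replacing the \s+ next to the tags with \s* would relax that.

```python
import re

pattern = re.compile(
    r"(\w+)\s+(\w+)\s+(\w+)\s+<b>([^<]+)</b>\s+(\w+)\s+(\w+)\s+(\w+)"
)

# Sample text with spaces around the tags (an assumption for this sketch).
text = "blah blah 1 2 3 <b>abcd efg-123</b> word word2 word3 blah blah"

match = pattern.search(text)
if match:
    before = match.groups()[:3]   # groups 1-3: the three words before
    chunk = match.group(4)        # group 4: the text between the tags
    after = match.groups()[4:]    # groups 5-7: the three words after
    print(before, chunk, after)
```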

See the Python documentation for the re module.

You should not use regexes for HTML parsing. That way madness lies.

The above-linked article actually provides a regex for your problem -- but don't use it.
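
For anyone who wants to follow that advice, a parser-based approach might look like the sketch below. It uses html.parser from the standard library; the names BoldContext and bold_with_context are made up for illustration. It also assumes that a "word" is any whitespace-separated token outside the <b> element, which is a looser definition than \w+.

```python
from html.parser import HTMLParser


class BoldContext(HTMLParser):
    """Record each <b> chunk and where it falls among the outside words."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside <b>...</b>
        self.words = []   # every word seen outside <b>
        self.chunks = []  # (position in self.words, text pieces) per <b>

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            if self.depth == 0:
                self.chunks.append((len(self.words), []))
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            # Text inside <b>, including text inside tags nested in it.
            self.chunks[-1][1].append(data)
        else:
            self.words.extend(data.split())


def bold_with_context(html, n=3):
    """Return (words before, chunk text, words after) for each <b> chunk."""
    parser = BoldContext()
    parser.feed(html)
    parser.close()  # flush any text still buffered after the last tag
    results = []
    for pos, pieces in parser.chunks:
        before = parser.words[max(0, pos - n):pos]
        after = parser.words[pos:pos + n]
        results.append((before, "".join(pieces), after))
    return results


text = "blah blah 1 2 3<b>abcd efg-123</b>word word2 word3 blah blah"
print(bold_with_context(text))
```

Because the context words are collected across the whole text, the words "after" one chunk can include words that follow a later <b> chunk; whether that is acceptable depends on the input.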
