Regex matching items following a header in HTML

2023-01-21 16:47 Q&A author:

What should be a fairly simple regex extraction is confounding me. Couldn't find a similar question on SO, so happy to be pointed to one if it exists. Given the following HTML:

<h1 class="title">Title One</h1><p><a href="#">40.5</a><a href="#">31.3</a></p>

<h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>

(amongst a larger document - the extracts will most probably run across multiple lines)

How can I construct a regular expression that finds the text within the A tags, within the first P following an H1? The regex will go in a loop, such that I can pass in the header, in order to retrieve the items that follow.

<a[^>]*>([0-9.]+?)</a> obviously matches every item inside an A tag (and should be fine, as A tags cannot be nested), but I can't tie the matches to an H1.

.+Title One.+<a[^>]*>([0-9.]+?)</a></p> fails.

I had tried to use look behind as so:

(?<=Title One.+)<a[^>]*>([0-9.]+?)</a></p> and some variations but it is only allowed for fixed width matches (which won't be the case here).

For context, this will be using Python's regex engine. I know regex isn't necessarily the best solution for this, so alternative suggestions using DOM or something else also gratefully received :)
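One way to tie the A tags to a header using only Python's re module is to work in two passes: first capture the body of the P that follows the matching H1, then run the simple A-tag pattern on that captured text alone. The following is just a sketch under some assumptions: the function name items_after_header is made up for illustration, the P is assumed to follow the H1 directly (only whitespace in between), and the markup is assumed to be as regular as in the sample.

```python
import re

html = """
<h1 class="title">Title One</h1><p><a href="#">40.5</a>
<a href="#">31.3</a></p>
<h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
"""

# Same idea as the pattern in the question: one numeric value per A tag.
ANCHOR_RE = re.compile(r"<a[^>]*>([0-9.]+)</a>")


def items_after_header(document, title):
    """Return the A-tag texts from the P that directly follows the given H1.

    Hypothetical helper for illustration; assumes well-formed, regular markup.
    """
    # Pass 1: capture the body of the <p>...</p> right after the matching <h1>.
    # re.DOTALL lets ".*?" run across line breaks inside the paragraph.
    block_re = re.compile(
        r"<h1[^>]*>\s*" + re.escape(title) + r"\s*</h1>\s*<p[^>]*>(.*?)</p>",
        re.DOTALL,
    )
    match = block_re.search(document)
    if match is None:
        return []
    # Pass 2: extract the values from that paragraph only.
    return ANCHOR_RE.findall(match.group(1))


results = {t: items_after_header(html, t) for t in ("Title One", "Title Two")}
print(results)
```

Splitting the work this way avoids the variable-width look-behind altogether, because the header is matched as ordinary leading context rather than as an assertion.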

Update

To clarify from the above, I'd like to get back the following:

{"Title One": ["40.5", "31.3"], "Title Two": ["12.1", "82.0"]}

(not that I need help composing the dictionary, but it does demonstrate how I need the values to be related to the title).

So far BeautifulSoup looks like the best shot. LXML will also probably work as the source HTML isn't really tag-soup - it's pretty well-structured, at least in the places I'm interested in.

Is this the kind of thing you're after?

>>> from lxml import etree
>>>
>>> data = """
... <h1 class="title">Title One</h1><p><a href="#">40.5</a><a href="#">31.3</a></p>
... <h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
... """
>>>
>>> d = etree.HTML(data)
>>> d.xpath('//h1/following-sibling::p[1]/a/text()')
['40.5', '31.3', '12.1', '82.0']

This solution uses lxml.etree and an xpath expression.

Update

>>> from lxml import etree
>>> from pprint import pprint
>>>
>>> data = """
... <h1 class="title">Title One</h1><p><a href="#">40.5</a><a href="#">31.3</a></p>
... <h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
... """
>>>
>>> d = etree.HTML(data)
>>> #d.xpath('//h1[following-sibling::*[1][local-name()="p"]]') 
...
>>> results = {}
>>> for h in d.xpath('//h1[following-sibling::*[1][local-name()="p"]]'):
...   r = results.setdefault(str(h.text),[])
...   r += [ str(x) for x in h.xpath('./following-sibling::*[1][local-name()="p"]/a/text()') ]
...
>>> pprint(results)
{'Title One': ['40.5', '31.3'], 'Title Two': ['12.1', '82.0']}

This version uses predicates to look ahead: it iterates over the <h1> tags that are immediately followed by a <p> tag. (I cast tag.text to str explicitly because, as I recall, the values lxml returns aren't plain strings, and you can run into trouble pickling them, for example.)

You're right, regex is absolutely the wrong tool for HTML matching.

Your question, however, sounds exactly like a problem for Beautiful Soup, an HTML parser that can deal with less-than-perfect HTML.

The other obvious answer to solve this problem is BeautifulSoup -- I like that it handles the kind of crappy html that you often run into out in the wild as sensibly and gracefully as you can hope.
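For reference, a BeautifulSoup version might look roughly like the sketch below. It assumes the bs4 package (BeautifulSoup 4) is installed and uses the built-in "html.parser" backend. Note that find_next_sibling("p") picks the next P sibling even if other elements sit in between, which is slightly looser than "immediately followed by".

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed

html = """
<h1 class="title">Title One</h1><p><a href="#">40.5</a>
<a href="#">31.3</a></p>
<h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
"""

soup = BeautifulSoup(html, "html.parser")

results = {}
for h1 in soup.find_all("h1"):
    # First <p> sibling that comes after this header, if any.
    p = h1.find_next_sibling("p")
    if p is None:
        continue
    title = h1.get_text(strip=True)
    results[title] = [a.get_text(strip=True) for a in p.find_all("a")]

print(results)
```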

Don't use regex to parse HTML. HTML is not a regular language, so a regular expression cannot parse it in the general case. Use an HTML parser instead. I suggest lxml.html.

lxml.html deals with badly formed html better than BeautifulSoup, is actively maintained (BeautifulSoup isn't) and is a lot faster since it uses libxml2 internally.
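If installing a third-party parser is not an option, the same idea can be sketched with the standard library's html.parser module (Python 3) in place of lxml.html. This is a stand-in for illustration rather than the approach suggested above: the class name HeaderItemsParser and its attributes are invented here, and it collects the A-tag texts from the first P encountered after each H1.

```python
from html.parser import HTMLParser


class HeaderItemsParser(HTMLParser):
    """Collect A-tag texts from the first <p> after each <h1> (illustrative)."""

    def __init__(self):
        super().__init__()
        self.results = {}
        self._in_h1 = False
        self._title_parts = []
        self._current = None  # value list for the latest <h1>, until its <p> closes
        self._in_p = False
        self._in_a = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            # A new header resets all state.
            self._in_h1 = True
            self._title_parts = []
            self._current = None
            self._in_p = self._in_a = False
        elif tag == "p" and self._current is not None:
            self._in_p = True
        elif tag == "a" and self._in_p:
            self._in_a = True

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            title = "".join(self._title_parts).strip()
            self._current = self.results.setdefault(title, [])
        elif tag == "p" and self._in_p:
            # Only the first <p> after each header counts.
            self._in_p = self._in_a = False
            self._current = None
        elif tag == "a":
            self._in_a = False

    def handle_data(self, data):
        if self._in_h1:
            self._title_parts.append(data)
        elif self._in_a and self._current is not None and data.strip():
            self._current.append(data.strip())


html = """
<h1 class="title">Title One</h1><p><a href="#">40.5</a>
<a href="#">31.3</a></p>
<h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
"""

parser = HeaderItemsParser()
parser.feed(html)
print(parser.results)
```

The event-driven style is more verbose than an XPath query, but it has no dependencies and makes the "first P after the H1" rule explicit in the state flags.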

Here's a way using just normal string manipulation

html='''
<h1 class="title">Title One</h1><p><a href="#">40.5</a>
<a href="#">31.3</a></p>
<h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
'''

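# Each piece ends just before a closing </a>, so the value is its trailing text.
for i in html.split("</a>"):
    if "<a href" in i:
        # Drop everything up to the end of the opening tag.
        print(i.split("<a href")[-1].split(">")[-1])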

output

$ python test.py
40.5
31.3
12.1
82.0

I don't fully understand what you want to get, but if your requirement is simple, then yes, a regex or a bit of string manipulation can do it. You don't necessarily need a parser for that.
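If the values also need to be grouped by title, as the question's update asks, the same string-splitting idea can probably be stretched a little further. The sketch below is an assumption-laden extension, not part of the snippet above: it splits on "<h1", takes the text up to "</h1>" as the title, and then looks only at the first paragraph that follows. It is brittle by design and would break on markup such as "<pre" or nested tags.

```python
html = '''
<h1 class="title">Title One</h1><p><a href="#">40.5</a>
<a href="#">31.3</a></p>
<h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
'''

results = {}
# Every chunk after the first begins just inside an <h1 ...> opening tag.
for chunk in html.split("<h1")[1:]:
    header, _, rest = chunk.partition("</h1>")
    # Text after the end of the opening tag is the title.
    title = header.split(">", 1)[1].strip()
    # Keep only the first paragraph after the header.
    paragraph = rest.partition("<p")[2].partition("</p>")[0]
    values = []
    for piece in paragraph.split("</a>"):
        if "<a " in piece:
            # The value is whatever follows the last ">" in the piece.
            values.append(piece.rsplit(">", 1)[1].strip())
    results[title] = values

print(results)
```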
