Python regex: retrieve only one group
I have only a little experience with regex, and now I have a small problem.
I need to retrieve the strings between the <a> and </a> tags.
Here is a sample:
Categories: <a href="/car/2/page1.html">2</a>, <a href="/car/nissan/">nissan</a>,<a href="/car/all/page1.html">all</a>
And this is my little regex:
re.findall("""<a href=".*">.*</a>""",string)
Well, it works, but I just want the strings between the tags, not the href. How can I do this?
Thanks.
Use parentheses to form a capturing group:
'<a href=".*">(.*)</a>'
You probably also want non-greedy quantifiers (.*? instead of .*). A greedy .* runs as far as it can, so on a line with several links it swallows everything up to the last one.
'<a href=".*?">(.*?)</a>'
Result:
['2', 'nissan', 'all']
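To see the difference between the two patterns, here is a small self-contained sketch run against the sample from the question. The variable names (html, greedy, lazy) are only for illustration.

```python
import re

# Sample markup from the question.
html = ('Categories: <a href="/car/2/page1.html">2</a>, '
        '<a href="/car/nissan/">nissan</a>,'
        '<a href="/car/all/page1.html">all</a>')

# Greedy: the first .* runs to the last '">' on the line,
# so only the final link text is captured.
greedy = re.findall(r'<a href=".*">(.*)</a>', html)

# Non-greedy: each .*? stops at the first possible match,
# so every link is matched separately.
lazy = re.findall(r'<a href=".*?">(.*?)</a>', html)

print(greedy)
print(lazy)
```

When a pattern contains exactly one capturing group, re.findall returns only what that group matched, which is why the href no longer shows up in the result.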
Or even better, consider using an HTML parser, such as BeautifulSoup.
Regex is never a good idea for parsing HTML. There are too many edge cases that make crafting a robust regular expression difficult. Consider the following perfectly browser-viewable links:
< a href="/car/all/page1.html">all</a>
<a  href="/car/all/page1.html">all</a>
<a href= "/car/all/page1.html">all</a>
<a id="foo" href="/car/all/page1.html">all</a>
<a
href="/car/all/page1.html">all</a>
None of these is matched by the given regular expression. I highly recommend an HTML parser, such as Beautiful Soup or lxml. Here's an lxml example:
from lxml import etree
html = """
Categories: <a href="/car/2/page1.html">2</a>, <a href="/car/nissan/">nissan</a>,<a href="/car/all/page1.html">all</a>
"""
doc = etree.HTML(html)
result = doc.xpath('//a[@href]/text()')
Result:
['2', 'nissan', 'all']
This works even if the HTML is formatted differently or is somewhat malformed.
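If installing lxml or Beautiful Soup is not an option, roughly the same thing can be done with the standard library's html.parser module. This is only a minimal sketch, not a full replacement for a real parser: the class name LinkTextParser and the helper link_texts are made up for this example, and nested <a> tags are not handled.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collects the text inside every <a> tag that has an href attribute."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; attribute order does not matter.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self._in_link = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current entry.
        if self._in_link:
            self.texts[-1] += data


def link_texts(html):
    """Hypothetical helper: return the link texts found in an HTML string."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


# Variation of the question's sample with an extra attribute and a line break
# inside a tag, two of the cases the regex does not match.
sample = '''Categories: <a href="/car/2/page1.html">2</a>, <a id="foo" href="/car/nissan/">nissan</a>,<a
href="/car/all/page1.html">all</a>'''

print(link_texts(sample))
```

The parser tokenizes tags and attributes itself, so an extra attribute or a line break inside the tag does not change which link texts are collected.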