Complicated regex to extract author name in python

Posted 2023-03-17 18:14

I'm trying, quite unsuccessfully, to create a regex that gets the content of any HTML element that has a class of (author|byline|writer).

Here is what I have so far:

<([A-Z][A-Z0-9]*)class=\"(byLineTag|byline|author|by)\"[^>]*>(.*?)</\1>

Examples of what I need to match:

  <h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6>

<div class="noindex"><span class="by">By </span><span class="byline"><a href="javascript:NewWindow(575,480,'/apps/pbcs.dll/personalia?ID=sshemkus',0)" title="Email Reporter">Sarah Shemkus</a></span></div>

Any help would be appreciated a lot. -Stefan

Regex is not particularly well-suited to parsing HTML.
Thankfully there are tools specifically created for parsing HTML, e.g. BeautifulSoup and lxml; the latter of which is demonstrated below:

markup = '''<h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6><div class="noindex"><span class="by">By </span><span class="byline"><a href="javascript:NewWindow(575,480,'/apps/pbcs.dll/personalia?ID=sshemkus',0)" title="Email Reporter">Sarah Shemkus</a></span></div>'''

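import lxml.html  # doc.cssselect() also needs the separate cssselect package

doc = lxml.html.fromstring(markup)
# Select every element carrying one of the candidate classes
for a in doc.cssselect('.author, .by, .byline, .byLineTag'):
    print(a.text_content())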
# By JACK EWING and LANDON THOMAS Jr.
# By 
# Sarah Shemkus

I strongly suggest not using a regexp to parse the HTML, for the reasons already mentioned. Use an existing HTML parser. To show how easy it can be, I've included an example using lxml and its CSS selector.

from lxml import etree
from lxml.cssselect import CSSSelector

## Your html string
html_string = '''<h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6>'''

## lxml html parser
html = etree.HTML(html_string)

## lxml CSS selector
sel = CSSSelector('.author, .byline, .writer')

## Call the selector to get matches
matching_elements = sel(html)

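for elem in matching_elements:
    # elem.text holds only the text before the first child tag ("By "),
    # so join itertext() to get the full byline.
    print(''.join(elem.itertext()))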

Try this:

<([A-Z][A-Z0-9]*).*?class=\"(byLineTag|byline|author|by)\"[^>]*?>(.*?)</\1>

What I have added:
- .*? after the tag name, in case the class attribute doesn't appear right after the tag name.
- *? in place of * in [^>]*, making it non-greedy when looking for the closing >.
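For what it's worth, here is a minimal sketch of running that pattern through Python's re module. The re.IGNORECASE and re.DOTALL flags, the helper name extract_bylines, the crude tag stripper and the shortened href="#" values are my own assumptions, not part of the answer. Without IGNORECASE, [A-Z][A-Z0-9]* would not match lowercase tag names such as h6.

```python
import re

# Pattern from the answer above. IGNORECASE is an added assumption so that
# lowercase tag names (<h6>, <span>) match [A-Z][A-Z0-9]*; DOTALL lets "."
# cross line breaks.
BYLINE_RE = re.compile(
    r'<([A-Z][A-Z0-9]*).*?class="(byLineTag|byline|author|by)"[^>]*?>(.*?)</\1>',
    re.IGNORECASE | re.DOTALL,
)
TAG_RE = re.compile(r'<[^>]+>')  # crude tag stripper, for illustration only


def extract_bylines(markup):
    """Return the tag-stripped content of every element the pattern matches."""
    return [TAG_RE.sub('', m.group(3)).strip() for m in BYLINE_RE.finditer(markup)]


# Shortened versions of the two examples from the question (hrefs replaced).
sample_h6 = (
    '<h6 class="byline">By <a rel="author" href="#">JACK EWING</a> and '
    '<a rel="author" href="#">LANDON THOMAS Jr.</a></h6>'
)
sample_div = (
    '<div class="noindex"><span class="by">By </span>'
    '<span class="byline"><a href="#">Sarah Shemkus</a></span></div>'
)

print(extract_bylines(sample_h6))   # ['By JACK EWING and LANDON THOMAS Jr.']
print(extract_bylines(sample_div))  # ['By Sarah Shemkus']
```

Note the second result: the lazy .*? is free to run past the end of the opening div tag, so the match is the outer div rather than either span. That kind of surprise is one reason the parser-based answers are the safer route.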

You forgot to account for the space between the tag name and the first attribute name. Also, unless you're sure that class will always be the first attribute, your expression should allow for other attributes before it. The \1 back-reference is fine as it is, if you really care about the closing tag: groups are numbered from 1, and in Python's re module \0 is not a back-reference. As I've noted in my comment, you should also include lower-case characters in your character classes.

Here is a better expression (I've disregarded the closing tag to make it simpler):

<[A-Za-z][A-Za-z0-9]*.*? class=["'](byLineTag|byline|author|by)["'][^>]*>
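As a rough check of how the expression above behaves, here is a small sketch using Python's re module. The helper name byline_classes, the join_lines switch and the sample strings are hypothetical, made up for illustration. Joining the lines with a space is just one way to cope with tags that are split across several lines.

```python
import re

# The simplified expression from above: it matches only the opening tag and
# captures which of the candidate class names was found.
OPEN_TAG_RE = re.compile(
    r'''<[A-Za-z][A-Za-z0-9]*.*? class=["'](byLineTag|byline|author|by)["'][^>]*>'''
)


def byline_classes(markup, join_lines=True):
    """Return the class name captured by each matching opening tag."""
    if join_lines:
        # Put the markup on one line so "." can reach attributes that were
        # written on a later line of the same tag.
        markup = ' '.join(markup.splitlines())
    return [m.group(1) for m in OPEN_TAG_RE.finditer(markup)]


one_line = '<h6 class="byline">By someone</h6>'
split_tag = '<span\nid="x"\nclass=\'author\'>Someone</span>'

print(byline_classes(one_line))                      # ['byline']
print(byline_classes(split_tag))                     # ['author']
print(byline_classes(split_tag, join_lines=False))   # []
```

The last call shows the failure mode: without joining (or re.DOTALL), "." stops at the line break and the split tag is missed.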

Remember to join all lines together first, to avoid missed matches when tags are split across several lines. Of course, you would probably save yourself a lot of work if you used Python's HTML parser instead.
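A minimal sketch of the parser route, using only html.parser from the standard library. The class name BylineParser, the helper extract_bylines, the TARGET_CLASSES set and the shortened href="#" values are assumptions made for illustration. It also assumes reasonably well-formed markup, since it tracks nesting by counting start and end tags.

```python
from html.parser import HTMLParser

# Class names to look for: the ones from the question plus those in its regex.
TARGET_CLASSES = {"author", "byline", "writer", "by", "byLineTag"}
# Tags that never get a closing tag, so they must not affect the depth count.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class BylineParser(HTMLParser):
    """Collect the text inside every element whose class is in TARGET_CLASSES."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a matching element
        self.chunks = []    # text pieces of the element being collected
        self.bylines = []   # finished results, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or TARGET_CLASSES.intersection(classes):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:  # the matching element has just been closed
            self.bylines.append("".join(self.chunks).strip())
            self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_bylines(markup):
    parser = BylineParser()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return parser.bylines


sample = (
    '<h6 class="byline">By <a rel="author" href="#">JACK EWING</a> and '
    '<a rel="author" href="#">LANDON THOMAS Jr.</a></h6>'
    '<div class="noindex"><span class="by">By </span>'
    '<span class="byline"><a href="#">Sarah Shemkus</a></span></div>'
)
print(extract_bylines(sample))
# ['By JACK EWING and LANDON THOMAS Jr.', 'By', 'Sarah Shemkus']
```

Matching on whole class tokens (via split()) means an element with several classes, such as class="meta byline", is still picked up, which the regex versions would miss.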
