
Complicated regex to extract author name in Python

I'm trying to create a regex, so far unsuccessfully. What I want is the content of any HTML element that has a class of (author|byline|writer).

Here is what I have so far:

<([A-Z][A-Z0-9]*)class=\"(byLineTag|byline|author|by)\"[^>]*>(.*?)</\1>

Examples of what I need to match:

  <h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6>

or

<div class="noindex"><span class="by">By </span><span class="byline"><a href="javascript:NewWindow(575,480,'/apps/pbcs.dll/personalia?ID=sshemkus',0)" title="Email Reporter">Sarah Shemkus</a></span></div>

Any help would be appreciated a lot. -Stefan


Regex is not particularly well-suited to parsing HTML.
Thankfully there are tools specifically created for parsing HTML, e.g. BeautifulSoup and lxml; the latter of which is demonstrated below:

markup = '''<h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6><div class="noindex"><span class="by">By </span><span class="byline"><a href="javascript:NewWindow(575,480,'/apps/pbcs.dll/personalia?ID=sshemkus',0)" title="Email Reporter">Sarah Shemkus</a></span></div>'''

import lxml.html
doc = lxml.html.fromstring(markup)
for a in doc.cssselect('.author, .by, .byline, .byLineTag'):
    print(a.text_content())
# By JACK EWING and LANDON THOMAS Jr.
# By 
# Sarah Shemkus


I strongly suggest not using a regexp to parse the HTML, for the reasons already mentioned. Use an existing HTML parser. As an example of how easy it can be, here is lxml with its CSS selector support.

from lxml import etree
from lxml.cssselect import CSSSelector

## Your html string
html_string = '''<h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6>'''

## lxml html parser
html = etree.HTML(html_string)

## lxml CSS selector
sel = CSSSelector('.author, .byline, .writer')

## Call the selector to get matches
matching_elements = sel(html)

for elem in matching_elements:
    # elem.text alone stops at the first child tag ("By "), so join all text nodes
    print(''.join(elem.itertext()))


Try this:

<([A-Z][A-Z0-9]*).*?class=\"(byLineTag|byline|author|by)\"[^>]*?>(.*?)</\1>

What I have added:
- .*? after the tag name, in case the class attribute doesn't appear right after it.
- *? in [^>]*?, which makes the * operator non-greedy when looking for the closing >.
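
A quick way to check a pattern like this is to run it on one of the samples. The sketch below uses a shortened copy of the second example (the href is replaced with "#" for readability) and adds re.IGNORECASE, since [A-Z] on its own does not match lower-case tag names. The "strict" pattern is a hypothetical variant, not part of the answer above: it swaps .*? for [^>]*? so the search for class= cannot run past the end of the opening tag.

```python
import re

# Shortened copy of the second sample from the question (href replaced by "#").
sample = ('<div class="noindex"><span class="by">By </span>'
          '<span class="byline"><a href="#">Sarah Shemkus</a></span></div>')

CLASSES = r'(byLineTag|byline|author|by)'

# Pattern from the answer above, compiled case-insensitively.
loose = re.compile(
    r'<([A-Z][A-Z0-9]*).*?class="' + CLASSES + r'"[^>]*?>(.*?)</\1>',
    re.IGNORECASE | re.DOTALL)

# Hypothetical tighter variant: [^>]*? keeps the class lookup inside one tag.
strict = re.compile(
    r'<([A-Z][A-Z0-9]*)\b[^>]*?\sclass="' + CLASSES + r'"[^>]*>(.*?)</\1>',
    re.IGNORECASE | re.DOTALL)

loose_match = loose.search(sample)
# The lazy .*? skips from <div ...> into the next tag to find class="by".
print(loose_match.group(1), loose_match.group(2))  # div by

strict_hits = [m.groups() for m in strict.finditer(sample)]
print(strict_hits)
# [('span', 'by', 'By '), ('span', 'byline', '<a href="#">Sarah Shemkus</a>')]
```

Even the tighter variant still returns raw inner markup and would stop at the first closing tag of the same name, so nested elements with the same tag name remain a problem for any regex of this shape.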


You forgot to account for the space between the tag name and the first attribute name. Also, unless you're sure that class will always be the first attribute, your expression should allow other attributes to come before it. The \1 back-reference is fine if you really care about the closing tag: groups are numbered from 1, and group 0 is the whole match. As I've noted in my comment, you should also include lower-case characters in your character classes.
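
On the back-reference point specifically, here is a small check of how Python's re module treats \1 and \0. As far as I know, group numbers start at 1, group 0 is only available on the match object as the whole match, and \0 inside a pattern is read as an octal escape for the NUL character. The input string is a made-up example.

```python
import re

text = '<h6>By Jane Doe</h6>'  # made-up input for illustration

tag_pair = re.compile(r'<(\w+)>(.*?)</\1>')  # \1 repeats whatever group 1 captured
nul_pair = re.compile(r'<(\w+)>(.*?)</\0>')  # \0 is an octal escape for chr(0)

m = tag_pair.search(text)
print(m.group(0))  # <h6>By Jane Doe</h6>  (group 0 = whole match)
print(m.group(1))  # h6
print(m.group(2))  # By Jane Doe

# No NUL character in the text, so this pattern cannot match.
print(nul_pair.search(text))  # None
```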

Here is a better expression (I've disregarded the closing tag to make it simpler):

<[A-Za-z][A-Za-z0-9]*.*? class=["'](byLineTag|byline|author|by)["'][^>]*>

Remember to join all the lines into one string first, to avoid errors when a tag is split across several lines. Of course, you would probably save yourself a lot of work if you used Python's HTML parser instead.
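
As a rough idea of what the parser route could look like with only the standard library, here is a minimal sketch built on html.parser.HTMLParser. The class BylineParser and the helper extract_bylines are names made up for this example, and the class list is the one from the question (author, byline, writer); adjust it to the classes your pages actually use.

```python
from html.parser import HTMLParser

TARGET_CLASSES = {"author", "byline", "writer"}  # class names from the question

# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BylineParser(HTMLParser):
    """Collect the text inside every element that has one of TARGET_CLASSES."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while we are inside a matching element
        self.parts = []     # text fragments of the element being collected
        self.bylines = []   # finished results, one string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or TARGET_CLASSES.intersection(classes):
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.bylines.append("".join(self.parts).strip())
                self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_bylines(html):
    """Hypothetical helper: return the text of all matching elements."""
    parser = BylineParser()
    parser.feed(html)
    parser.close()
    return parser.bylines


sample = ('<h6 class="byline">By <a rel="author" href="#" class="meta-per">'
          'JACK EWING</a> and <a rel="author" href="#" class="meta-per">'
          'LANDON THOMAS Jr.</a></h6>')
print(extract_bylines(sample))  # ['By JACK EWING and LANDON THOMAS Jr.']
```

The depth counter assumes reasonably well-formed markup. For messy real-world pages, lxml or BeautifulSoup as shown in the other answers will likely be more forgiving.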
