
Python: search for HTML tags inside a file using regex

I am doing some data analysis that requires extracting the page title, breadcrumb, and h1 heading from hundreds of HTML and SHTML files.

Those tags are in the following format (that is, the content inside <title>, <h1>, and the breadcrumb):

<title>Mapping a Drive: Macintosh OSX &lt; Mapping a Drive &lt; eHelp &lt; Cal Poly Pomona</title>

<p><!-- InstanceBeginEditable name="breadcrumb" --><a href="../index.html">eHelp</a> &raquo; <a href="index.shtml">Mapping a Drive</a> &raquo; Mac OS X<!-- InstanceEndEditable --></p>


<h1><a name="contentstart" id="contentstart"></a><!-- InstanceBeginEditable name="page_heading" --><a name="top" id="top"></a>Mapping a Drive:<span class="goldletter"> Macintosh </span>OS X  <!-- InstanceEndEditable --></h1>

After getting those tags, I want to further extract the first part of the title ("Mapping a Drive: Macintosh OSX"), the last part of the breadcrumb ("Mac OS X"), and the whole h1 ("Mapping a Drive: Macintosh OS X").

Any idea how that can be accomplished?


Use a real HTML parser, not a regex. You will be happier. lxml.html is highly regarded, as is BeautifulSoup.
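As a rough sketch of what the parser route looks like for the three items in the question, here is a version built on the standard library's html.parser. That is a stand-in chosen so the example runs without installing anything; it is not lxml.html or BeautifulSoup, which would be shorter. The class name PageInfoParser and the helper extract_page_info are made up for illustration, and the code assumes the breadcrumb is always wrapped in the InstanceBeginEditable name="breadcrumb" comment shown in the question.

```python
from html.parser import HTMLParser


class PageInfoParser(HTMLParser):
    """Collects the text of <title>, <h1>, and the breadcrumb editable region."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &lt;, &raquo;, etc.
        self.parts = {"title": [], "h1": [], "breadcrumb": []}
        self.in_tag = None          # "title" or "h1" while inside that element
        self.in_breadcrumb = False  # True between the breadcrumb comment markers

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self.in_tag = tag

    def handle_endtag(self, tag):
        if tag == self.in_tag:
            self.in_tag = None

    def handle_comment(self, data):
        text = data.strip()
        if text.startswith("InstanceBeginEditable") and 'name="breadcrumb"' in text:
            self.in_breadcrumb = True
        elif text.startswith("InstanceEndEditable"):
            self.in_breadcrumb = False

    def handle_data(self, data):
        if self.in_tag:
            self.parts[self.in_tag].append(data)
        if self.in_breadcrumb:
            self.parts["breadcrumb"].append(data)


def extract_page_info(html_text):
    parser = PageInfoParser()
    parser.feed(html_text)
    parser.close()
    # Collapse runs of whitespace left behind by nested tags.
    clean = {k: " ".join("".join(v).split()) for k, v in parser.parts.items()}
    return {
        "title": clean["title"].split(" < ")[0],                        # first part
        "breadcrumb": clean["breadcrumb"].split("\u00bb")[-1].strip(),  # after last »
        "h1": clean["h1"],
    }


sample = """<title>Mapping a Drive: Macintosh OSX &lt; Mapping a Drive &lt; eHelp &lt; Cal Poly Pomona</title>
<p><!-- InstanceBeginEditable name="breadcrumb" --><a href="../index.html">eHelp</a> &raquo; <a href="index.shtml">Mapping a Drive</a> &raquo; Mac OS X<!-- InstanceEndEditable --></p>
<h1><a name="contentstart" id="contentstart"></a><!-- InstanceBeginEditable name="page_heading" --><a name="top" id="top"></a>Mapping a Drive:<span class="goldletter"> Macintosh </span>OS X  <!-- InstanceEndEditable --></h1>"""

info = extract_page_info(sample)
print(info)
```

To run it over hundreds of files, you would read each file's text and pass it to extract_page_info; pages that lack the breadcrumb markers simply come back with an empty breadcrumb string.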


Since well-formed HTML is close to XML (or can often be trimmed to be compatible with most XML parsers), I would suggest using an XML parser. Several Python HTML parsers expose an XML-style tree API anyway.

Check out: Python and XML.

Here is a good tutorial: Python XML Parser Tutorial.

Also, the xml.dom.minidom module has been very useful for me personally.

Another similar method is explained here: xml.etree.ElementTree.

This is a good example from the xml.dom.minidom reference page:

import xml.dom.minidom

document = """\
<slideshow>
<title>Demo slideshow</title>
<slide><title>Slide title</title>
<point>This is a demo</point>
<point>Of a program for processing slides</point>
</slide>

<slide><title>Another demo slide</title>
<point>It is important</point>
<point>To have more than</point>
<point>one slide</point>
</slide>
</slideshow>
"""

dom = xml.dom.minidom.parseString(document)

def getText(nodelist):
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)

def handleSlideshow(slideshow):
    # Emit the whole slideshow as a minimal HTML page.
    print("<html>")
    handleSlideshowTitle(slideshow.getElementsByTagName("title")[0])
    slides = slideshow.getElementsByTagName("slide")
    handleToc(slides)
    handleSlides(slides)
    print("</html>")

def handleSlides(slides):
    for slide in slides:
        handleSlide(slide)

def handleSlide(slide):
    handleSlideTitle(slide.getElementsByTagName("title")[0])
    handlePoints(slide.getElementsByTagName("point"))

def handleSlideshowTitle(title):
    print("<title>%s</title>" % getText(title.childNodes))

def handleSlideTitle(title):
    print("<h2>%s</h2>" % getText(title.childNodes))

def handlePoints(points):
    print("<ul>")
    for point in points:
        handlePoint(point)
    print("</ul>")

def handlePoint(point):
    print("<li>%s</li>" % getText(point.childNodes))

def handleToc(slides):
    # Table of contents: one paragraph per slide title.
    for slide in slides:
        title = slide.getElementsByTagName("title")[0]
        print("<p>%s</p>" % getText(title.childNodes))

handleSlideshow(dom)
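For comparison, a shortened version of the same slideshow can be read with xml.etree.ElementTree, the other standard-library option mentioned above. This is only a sketch: the names show_title and slides are illustrative, and note that findtext and findall with a bare tag name look at direct children only, unlike getElementsByTagName, which searches all descendants.

```python
import xml.etree.ElementTree as ET

document = """\
<slideshow>
<title>Demo slideshow</title>
<slide><title>Slide title</title>
<point>This is a demo</point>
</slide>
<slide><title>Another demo slide</title>
<point>It is important</point>
</slide>
</slideshow>
"""

root = ET.fromstring(document)

# Direct child <title> of <slideshow>, not the slide titles.
show_title = root.findtext("title")

# One (title, [points]) pair per <slide>.
slides = [
    (slide.findtext("title"), [point.text for point in slide.findall("point")])
    for slide in root.findall("slide")
]

print(show_title)
for title, points in slides:
    print(title, points)
```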

If you absolutely must use regex instead of a parser, check out the re module:

In [1]: import re
In [2]: grps = re.search(r"<([^>]+)>([^<]+)</\1>", "<abc>123</abc>")
In [3]: if grps:
   ...:     print(grps.groups())
   ...:
('abc', '123')
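Applied to the title line from the question, a regex-only sketch might look like the following. It assumes there is exactly one title element with no nested markup, and that the title parts are separated by an escaped "<" as in the sample; the variable names are illustrative. It will not cope with the nested tags and comments inside the h1 or the breadcrumb, which is where a parser earns its keep.

```python
import html
import re

page = "<title>Mapping a Drive: Macintosh OSX &lt; Mapping a Drive &lt; eHelp &lt; Cal Poly Pomona</title>"

first_part = None
# Non-greedy match so we stop at the first closing tag; DOTALL allows line breaks.
match = re.search(r"<title[^>]*>(.*?)</title>", page, re.IGNORECASE | re.DOTALL)
if match:
    full_title = html.unescape(match.group(1)).strip()  # turn &lt; back into <
    first_part = full_title.split(" < ")[0]

print(first_part)
```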


html5lib is a very reliable HTML parser. Since your XHTML is somewhat broken, an XML parser will reject it. Fortunately, html5lib has lxml integration, so you can still use the full power of lxml and XPath to extract your data.
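To see the rejection concretely, here is a small check that feeds the breadcrumb markup from the question (with the comments dropped for brevity) to the standard library's xml.etree.ElementTree. The &raquo; entity is defined by HTML but not by plain XML, so a strict XML parser should raise a ParseError. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

breadcrumb = (
    '<p><a href="../index.html">eHelp</a> &raquo; '
    '<a href="index.shtml">Mapping a Drive</a> &raquo; Mac OS X</p>'
)

try:
    ET.fromstring(breadcrumb)
    rejected = False
except ET.ParseError as exc:
    # &raquo; is an HTML entity that XML does not know without a DTD.
    rejected = True
    print("XML parser rejected the fragment:", exc)
```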
