Python HTMLParser
I'm parsing an HTML document using HTMLParser and I want to print the contents between the start and end of a p tag.
See my code snippet
def handle_starttag(self, tag, attrs):
    if tag == 'p':
        print "TODO: print the contents"
Based on what @tauran posted, you probably want to do something like this:
from HTMLParser import HTMLParser  # Python 2 module name

class MyHTMLParser(HTMLParser):
    def print_p_contents(self, html):
        self.tag_stack = []  # open tags, innermost last
        self.feed(html)

    def handle_starttag(self, tag, attrs):
        self.tag_stack.append(tag.lower())

    def handle_endtag(self, tag):
        if self.tag_stack:  # ignore a stray closing tag
            self.tag_stack.pop()

    def handle_data(self, data):
        # Text outside any tag arrives with an empty stack.
        if self.tag_stack and self.tag_stack[-1] == 'p':
            print data

p = MyHTMLParser()
p.print_p_contents('<p>test</p>')
Now, you might want to push all <p> contents into a list and return that as a result, or something else like that.
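A minimal sketch of that idea, written for Python 3 (hence the html.parser import). The class name PContentCollector and the method name collect are hypothetical, chosen only for illustration:

```python
from html.parser import HTMLParser


class PContentCollector(HTMLParser):
    """Collects the text found directly inside <p> tags (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.tag_stack = []
        self.p_contents = []

    def collect(self, html):
        # Reset parser and state so the same instance can be reused.
        self.reset()
        self.tag_stack = []
        self.p_contents = []
        self.feed(html)
        self.close()  # flush any text still buffered by the parser
        return self.p_contents

    def handle_starttag(self, tag, attrs):
        self.tag_stack.append(tag)

    def handle_endtag(self, tag):
        if self.tag_stack:
            self.tag_stack.pop()

    def handle_data(self, data):
        if self.tag_stack and self.tag_stack[-1] == 'p':
            self.p_contents.append(data)


collector = PContentCollector()
print(collector.collect('<p>first</p><div>skipped</div><p>second</p>'))
# prints ['first', 'second']
```

Returning the list instead of printing inside handle_data leaves it to the caller to decide what to do with the text.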
TIL: when working with libraries like this, you need to think in stacks!
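Thinking in stacks also matters for nested markup: checking only the top of the stack skips text inside inline tags such as <b> within a paragraph. A hedged sketch for Python 3 (the class name NestedPText is hypothetical) that instead tests whether a p is anywhere on the stack:

```python
from html.parser import HTMLParser


class NestedPText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tag_stack = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.tag_stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one,
        # so an unclosed tag such as <br> does not desynchronise the stack.
        if tag in self.tag_stack:
            while self.tag_stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Keep text at any depth below an open <p>.
        if 'p' in self.tag_stack:
            self.chunks.append(data)


parser = NestedPText()
parser.feed('<p>plain <b>bold</b> tail</p><div>outside</div>')
parser.close()
print(''.join(parser.chunks))  # prints: plain bold tail
```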
I extended the example from the docs:
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered the beginning of a %s tag" % tag

    def handle_endtag(self, tag):
        print "Encountered the end of a %s tag" % tag

    def handle_data(self, data):
        print "Encountered data %s" % data

p = MyHTMLParser()
p.feed('<p>test</p>')
Output:
Encountered the beginning of a p tag
Encountered data test
Encountered the end of a p tag
It did not seem to work for my code, so I defined tag_stack = [] outside the class, as a sort of global variable.
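from html.parser import HTMLParser

tag_stack = []  # module-level stack shared by the handlers

class MONanalyseur(HTMLParser):
    def handle_starttag(self, tag, attrs):
        tag_stack.append(tag.lower())

    def handle_endtag(self, tag):
        if tag_stack:  # ignore a stray closing tag
            tag_stack.pop()

    def handle_data(self, data):
        if tag_stack and tag_stack[-1] == 'head':
            print(data)

parser = MONanalyseur()
# feed() requires the HTML text as its argument; this string is a placeholder.
parser.feed('<head>test</head>')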
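One possible reason the earlier version did not work, assuming feed() was called directly: there, tag_stack is only created inside print_p_contents, so it does not exist yet when the handlers run. A sketch that keeps the stack on the instance by initialising it in __init__, rather than in a module-level list (the class name HeadTextParser and the found attribute are hypothetical):

```python
from html.parser import HTMLParser


class HeadTextParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tag_stack = []  # per-instance state, ready before feed()
        self.found = []

    def handle_starttag(self, tag, attrs):
        self.tag_stack.append(tag)

    def handle_endtag(self, tag):
        if self.tag_stack:
            self.tag_stack.pop()

    def handle_data(self, data):
        if self.tag_stack and self.tag_stack[-1] == 'head':
            self.found.append(data)


head_parser = HeadTextParser()
head_parser.feed('<html><head>inside head</head><body>body text</body></html>')
head_parser.close()
print(head_parser.found)  # prints ['inside head']
```

With the stack stored on the instance, two parsers can run side by side without sharing state.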