How do I use BeautifulSoup to strip the <p> tags and just deliver the text back into the soup?
I'm trying to replace any <p> tags with just their contents in my soup. This is in the middle of other processing that I'm doing using BeautifulSoup.
This is slightly different to a similar question on extracting the text.
Example input:
... </p> ... <p>Here is some text</p> ... and some more
Desired output:
... ... Here is some text ... and some more
And what would I do if I only want to do that processing in, say, a div of class="content"?
I don't seem to have my BeautifulSoup head on yet!
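For reference, here is roughly the sort of thing I've been attempting. I'm assuming unwrap() is the right method (I believe it's called replaceWithChildren() in BeautifulSoup 3), so treat this as a sketch rather than known-working code:

from bs4 import BeautifulSoup

html = '<div class="content"><p>Here is some text</p> and some more</div>'
soup = BeautifulSoup(html, 'html.parser')

# Only touch <p> tags inside <div class="content">
for div in soup.find_all('div', class_='content'):
    for p in div.find_all('p'):
        p.unwrap()   # replaceWithChildren() in BeautifulSoup 3, I think

print(soup)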
I didn't use BeautifulSoup, but you can do something similar with the built-in HTMLParser library. This is a class I built to parse input HTML and convert the tags into a different markup that I needed.
from HTMLParser import HTMLParser
import htmlentitydefs

class BaseHTMLProcessor(HTMLParser):
    def reset(self):
        # extend (called by HTMLParser.__init__)
        self.pieces = []
        HTMLParser.reset(self)

    def handle_starttag(self, tag, attrs):
        # called for each start tag
        # attrs is a list of (attr, value) tuples
        # e.g. for <pre class="screen">, tag="pre", attrs=[("class", "screen")]
        # Ideally we would like to reconstruct the original tag and attributes, but
        # we may end up quoting attribute values that weren't quoted in the source
        # document, or we may change the type of quotes around the attribute value
        # (single to double quotes).
        # Note that improperly embedded non-HTML code (like client-side Javascript)
        # may be parsed incorrectly by the ancestor class, causing runtime script
        # errors. All non-HTML code must be enclosed in HTML comment tags
        # (<!-- code -->) to ensure that it will pass through this parser unaltered
        # (in handle_comment).
        if tag == 'b':
            v = r'%b[1]'
        elif tag == 'li':
            v = r'%f[1]'
        elif tag == 'strong':
            v = r'%b[1]%i[1]'
        elif tag == 'u':
            v = r'%u[1]'
        elif tag == 'ul':
            v = r'%n%'
        else:
            v = ''
        self.pieces.append(v)

    def handle_endtag(self, tag):
        # called for each end tag, e.g. for </pre>, tag will be "pre"
        # Map the end tag to the corresponding closing markup code.
        if tag == 'li':
            v = r'%f[0]'
        elif tag == 'b':
            v = r'%b[0]'
        elif tag == 'strong':
            v = r'%b[0]%i[0]'
        elif tag == 'u':
            v = r'%u[0]'
        elif tag == 'ul':
            v = ''
        elif tag == 'br':
            v = r'%n%'
        else:
            v = ''  # it matched but we don't know what it is! assume it's invalid html and strip it
        self.pieces.append(v)

    def handle_charref(self, ref):
        # called for each character reference, e.g. for "&#160;", ref will be "160"
        # Reconstruct the original character reference.
        self.pieces.append("&#%(ref)s;" % locals())

    def handle_entityref(self, ref):
        # called for each entity reference, e.g. for "&copy;", ref will be "copy"
        # Reconstruct the original entity reference.
        self.pieces.append("&%(ref)s" % locals())
        # standard HTML entities are closed with a semicolon; other entities are not
        if ref in htmlentitydefs.entitydefs:
            self.pieces.append(";")

    def handle_data(self, text):
        # called for each block of plain text, i.e. outside of any tag and
        # not containing any character or entity references
        # Normalise the text: replace UTF-8 right single quotes with plain
        # apostrophes, strip each line and join the lines back together.
        output = text.replace("\xe2\x80\x99", "'").split('\r\n')
        for count, item in enumerate(output):
            output[count] = item.strip()
        self.pieces.append(''.join(output))

    def handle_comment(self, text):
        # called for each HTML comment, e.g. <!-- insert Javascript code here -->
        # Reconstruct the original comment.
        # It is especially important that the source document enclose client-side
        # code (like Javascript) within comments so it can pass through this
        # processor undisturbed; see the comments in handle_starttag for details.
        self.pieces.append("<!--%(text)s-->" % locals())

    def handle_pi(self, text):
        # called for each processing instruction, e.g. <?instruction>
        # Reconstruct the original processing instruction.
        self.pieces.append("<?%(text)s>" % locals())

    def handle_decl(self, text):
        # called for the DOCTYPE, if present, e.g.
        # <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        #   "http://www.w3.org/TR/html4/loose.dtd">
        # Reconstruct the original DOCTYPE.
        self.pieces.append("<!%(text)s>" % locals())

    def output(self):
        """Return processed HTML as a single string"""
        return "".join(self.pieces)
To use the class, just include it in your script (or import it from a module). Then in your code use these lines:
parser = BaseHTMLProcessor()
for line in input:
    parser.feed(line)
parser.close()
output = parser.output()
parser.reset()
print output
It works by tokenizing the input stream: each piece of HTML it comes to is handled by the appropriate method. So <p><b>This is bold text!</b></p> would trigger handle_starttag twice, then handle_data once, then handle_endtag twice. Finally, when the output method is called, it returns the stream contents joined back together.