Using Python, remove HTML tags/formatting from a string [duplicate]

2023-01-09 21:07

This question already has answers here: Strip HTML from strings in Python (28 answers) Closed 5 years ago.

I have a string that contains HTML markup like links, bold text, etc.

I want to strip all the tags so I just have the raw text.

What's the best way to do this? regex?

If you are going to use regex:

import re
def striphtml(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

>>> striphtml('<a href="foo.com" class="bar">I Want This <b>text!</b></a>')
'I Want This text!'

As far as I know, using regex to parse HTML is a bad idea; you would be better off using an HTML/XML parser such as Beautiful Soup.
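To make the concern concrete, here is a small sketch of inputs where the `<.*?>` approach goes wrong. The sample strings are made up for illustration; `striphtml` is the function from the regex answer above, rewritten with `re.sub`.

```python
import re

def striphtml(data):
    # Same regex approach as above: delete anything that looks like <...>.
    return re.sub(r'<.*?>', '', data)

# A '>' inside an attribute value ends the match too early.
attr_case = striphtml('<a title="a > b">link</a>')

# '.' does not match a newline by default, so a tag split across lines survives.
newline_case = striphtml('<a\nhref="foo.com">link</a>')

# Character references such as &lt; are left undecoded.
entity_case = striphtml('<b>1 &lt; 2</b>')

print(repr(attr_case))     # ' b">link'
print(repr(newline_case))  # '<a\nhref="foo.com">link'
print(repr(entity_case))   # '1 &lt; 2'
```

Passing `re.DOTALL` would handle the newline case, but the other two need something that actually understands HTML.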

Use lxml.html. It's much faster than BeautifulSoup, and getting the raw text takes a single call.

>>> import lxml.html
>>> page = lxml.html.document_fromstring('<!DOCTYPE html>...</html>')
>>> page.cssselect('body')[0].text_content()
'...'

Use SGMLParser. Regex works in simple cases, but HTML has a lot of intricacies you would rather not have to deal with.

>>> from sgmllib import SGMLParser
>>>
>>> class TextExtracter(SGMLParser):
...     def __init__(self):
...         self.text = []
...         SGMLParser.__init__(self)
...     def handle_data(self, data):
...         self.text.append(data)
...     def getvalue(self):
...         return ''.join(self.text)
...
>>> ex = TextExtracter()
>>> ex.feed('<html>hello &gt; world</html>')
>>> ex.getvalue()
'hello > world'
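Note that `sgmllib` exists only in Python 2; it was removed in Python 3. A roughly equivalent sketch for Python 3 can be built on the standard library's `html.parser.HTMLParser`. The names `TextExtractor` and `strip_tags` are chosen here for illustration and are not part of any library.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        # convert_charrefs=True (the default) decodes references such as
        # &gt; before the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.text = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.text.append(data)

    def getvalue(self):
        return ''.join(self.text)

def strip_tags(html):
    ex = TextExtractor()
    ex.feed(html)
    ex.close()  # flush any text still buffered by the parser
    return ex.getvalue()

print(strip_tags('<html>hello &gt; world</html>'))  # hello > world
```

This sketch also keeps the contents of script and style elements, since the parser reports them as ordinary data; filter those out separately if they matter for your input.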

If the text itself will never contain a literal '<' or '>', I would just write a function that removes everything between those characters; otherwise, use a parsing library.

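def cleanString(inStr):
  # Repeatedly cut out the first '<...>' span until none is left.
  while True:
    a = inStr.find('<')
    b = inStr.find('>', a + 1)
    if a < 0 or b < 0:
      return inStr  # no complete tag remains
    inStr = inStr[:a] + inStr[b + 1:]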
