How to remove all HTML tags from a downloaded page [duplicate]

2023-01-09 09:16

This question already has answers here: Strip HTML from strings in Python (28 answers) Closed 9 months ago.

I have downloaded a page using urlopen. How do I remove all HTML tags from it? Is there a regexp that replaces all <*> tags?

I can also recommend BeautifulSoup, which is an easy-to-use HTML parser. With it you would do something like:

from bs4 import BeautifulSoup  # BeautifulSoup 4 (the bs4 package)

soup = BeautifulSoup(html, 'html.parser')
all_text = ''.join(soup.find_all(string=True))

This way you get all the text from an HTML document.

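There's a great Python library called bleach. The call below removes all HTML tags and leaves everything else in place. Note that this includes content that is not normally visible, such as the body of a script element, which is kept as text. Older bleach releases also accepted a styles=[] argument; it was removed in bleach 5.0.

import bleach
bleach.clean(thestring, tags=[], attributes={}, strip=True)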

Try this:

import re

def remove_html_tags(data):
    # Non-greedy match: everything from "<" up to the next ">"
    p = re.compile(r'<.*?>')
    return p.sub('', data)
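
One caveat with the pattern above: by default "." does not match a newline, so a tag that is split across lines is left in place, and entities such as &amp; stay encoded. A minimal sketch that handles both, assuming Python 3; the name strip_tags_multiline is made up for illustration and is not part of the original answer:

```python
import html
import re

# re.DOTALL lets "." match newlines, so a tag broken across lines is matched too.
TAG_RE = re.compile(r'<.*?>', re.DOTALL)

def strip_tags_multiline(data):
    """Remove tags, then decode entities such as &amp; into plain characters."""
    return html.unescape(TAG_RE.sub('', data))

sample = '<a\n href="x">Tom &amp; Jerry</a>'
print(strip_tags_multiline(sample))
```

Entities are decoded only after the tags are gone, so an encoded &lt; in the text is not mistaken for the start of a tag.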

You could use html2text, which is meant to produce a readable plain-text equivalent of an HTML source (programmatically from Python or as a command-line tool). I am extrapolating your needs from your question, though.

If you need HTML parsing, Python has a module for you!
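
The answer does not say which module it means. Assuming it is the standard library's html.parser (Python 3), a text extractor might be sketched as follows. The class name TextExtractor and the helper strip_tags are hypothetical names chosen for this example:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {'script', 'style'}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)

def strip_tags(html_text):
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)

print(strip_tags('<p>Hello <b>world</b></p><script>var x = 1;</script>'))
```

Unlike a regexp, the parser knows which element it is in, so dropping script and style bodies takes only a counter.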

There are multiple options for filtering HTML tags out of data. You can use a regex, or remove_tags from w3lib, which is a third-party package (it is not part of the Python standard library).

from w3lib.html import remove_tags
data_to_remove = '<p>hello\t\t, \tworld\n</p>'
print(remove_tags(data_to_remove))

OUTPUT: the string 'hello\t\t, \tworld\n' (only the <p> and </p> tags are removed; the tabs and the newline are kept)

Note: remove_tags accepts a string object. If your data is some other type, you can pass remove_tags(str(data_to_remove)).

A very simple regexp would be:

import re
notag = re.sub("<.*?>", " ", html)

The drawback of this solution is that it removes only the tags themselves: the JavaScript or CSS between script and style tags stays in the output.
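
One way around that drawback while staying with regular expressions is to remove script and style elements together with their contents before stripping the remaining tags. A rough sketch, assuming reasonably well-formed markup; the function name strip_tags_and_scripts is made up here, and a real HTML parser remains the more robust choice:

```python
import re

# Matches a whole <script>...</script> or <style>...</style> element, body included.
SCRIPT_STYLE_RE = re.compile(r'<(script|style)\b.*?>.*?</\1\s*>',
                             re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r'<.*?>', re.DOTALL)

def strip_tags_and_scripts(html):
    without_code = SCRIPT_STYLE_RE.sub(' ', html)
    no_tags = TAG_RE.sub(' ', without_code)
    # Collapse the whitespace runs left behind by the substitutions.
    return ' '.join(no_tags.split())

page = ('<style>p {color: red}</style><p>Hello</p>'
        '<script>alert("<b>")</script><p>world</p>')
print(strip_tags_and_scripts(page))
```

Removing the script and style elements first matters: once their tags are stripped, their bodies can no longer be told apart from ordinary text.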
