
Best way to strip out everything but text from a webpage?

I'm looking to take an HTML page and just extract the pure text on that page. Anyone know of a good way to do that in Python?

I want to strip out literally everything and be left with just the text of the articles and whatever other text is between tags. JS, CSS, etc. all gone.

thanks!


The first answer here doesn't remove the contents of <style> or <script> tags if they are inline in the page (not linked). This might get closer:

import re

def stripTags(text):
    # re.DOTALL lets .*? match across the newlines inside multi-line
    # script/style blocks; re.IGNORECASE catches <SCRIPT> and friends.
    scripts = re.compile(r'<script.*?/script>', re.DOTALL | re.IGNORECASE)
    css = re.compile(r'<style.*?/style>', re.DOTALL | re.IGNORECASE)
    tags = re.compile(r'<.*?>', re.DOTALL)

    text = scripts.sub('', text)
    text = css.sub('', text)
    text = tags.sub('', text)

    return text


You could try the rather excellent Beautiful Soup:

import BeautifulSoup  # Beautiful Soup 3; under bs4 the import is `from bs4 import BeautifulSoup`

with open("my_source.html", "r") as f:
    s = f.read()

soup = BeautifulSoup.BeautifulSoup(s)
txt = soup.body.getText()

But be warned: what you get back from any parsing attempt will be subject to "mistakes": bad HTML, bad parsing, and just generally unexpected output. If your source documents are well known and well formed you should be OK, or at least able to work around their idiosyncrasies, but if it's just general stuff found "out on the internet", then expect all kinds of weird and wonderful outliers.
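To illustrate, a small sketch (using the modern bs4 package, where the import is `from bs4 import BeautifulSoup`; the broken snippet is made up) of how the parser copes with markup that is missing closing tags:

```python
from bs4 import BeautifulSoup

# Beautiful Soup repairs malformed markup on a best-effort basis:
# both <p> tags and the <b> here are unclosed.
broken = "<html><body><p>First paragraph<p>Second <b>bold text</body>"
soup = BeautifulSoup(broken, "html.parser")
print(soup.get_text(separator=" ", strip=True))
```

The repaired tree still yields all the text, but exactly where the parser decides to close each tag is guesswork, which is the kind of "mistake" to expect.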


As per here:

import re

def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

As he notes in the article, "the re module needs to be imported in order to use regular expressions."
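A quick check of the limitation the first answer points out: the regex drops the tags themselves but keeps whatever text sits between them, including script bodies.

```python
import re

def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

# Tags are removed, but their text content survives:
print(remove_html_tags('<p>Hello <b>world</b></p>'))    # Hello world
# ...which is exactly why this falls short for script/style blocks:
print(remove_html_tags('<script>var x = 1;</script>'))  # var x = 1;
```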


The lxml.html module is worth considering. However, it takes a bit of massaging to remove the CSS and JavaScript:

from lxml import html

def stripsource(page):
    source = html.fromstring(page)

    # Drop the elements whose text we never want, along with comments.
    for item in source.xpath("//style|//script|//comment()"):
        item.getparent().remove(item)

    # Yield only the non-blank text nodes.
    for line in source.itertext():
        if line.strip():
            yield line

The yielded lines can simply be concatenated, but that can lose significant word boundaries if there is no literal whitespace around block-level tags.
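A minimal sketch of that point, reusing the stripsource() generator above on a made-up page: joining on a space instead of plain concatenation preserves the boundary between adjacent block elements.

```python
from lxml import html

def stripsource(page):
    source = html.fromstring(page)
    for item in source.xpath("//style|//script|//comment()"):
        item.getparent().remove(item)
    for line in source.itertext():
        if line.strip():
            yield line

page = "<html><body><script>var x;</script><p>one</p><p>two</p></body></html>"
# "".join() would produce "onetwo"; a space separator keeps the words apart:
print(" ".join(s.strip() for s in stripsource(page)))  # one two
```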

You might also want to iterate over just the <body> tag, depending on your requirements.


I would also recommend BeautifulSoup, but using something like the approach from the answer to this question, which I'll copy here for those who don't want to look there:

import re
import BeautifulSoup

soup = BeautifulSoup.BeautifulSoup(html)
texts = soup.findAll(text=True)

def visible(element):
    # Skip text inside tags that never produce visible output.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments.
    elif re.match('<!--.*-->', str(element)):
        return False
    return True

visible_texts = filter(visible, texts)

I tried it on this page for instance and it worked quite well.


This was the cleanest and simplest solution I found to strip CSS and JavaScript:

''.join(BeautifulSoup(content).findAll(
    text=lambda text: text.parent.name != "script" and
                      text.parent.name != "style"))

https://stackoverflow.com/a/3002599/1203188 by Matthew Flaschen
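For reference, a direct translation of that one-liner to Beautiful Soup 4, where findAll/text= are spelled find_all/string= (the sample document here is made up):

```python
from bs4 import BeautifulSoup

content = ("<html><head><style>p { color: red }</style></head>"
           "<body><p>Visible text</p><script>hidden()</script></body></html>")

soup = BeautifulSoup(content, "html.parser")
# Keep every text node whose parent is not a <script> or <style> tag.
text = ''.join(t for t in soup.find_all(string=True)
               if t.parent.name not in ("script", "style"))
print(text)  # Visible text
```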
