
Best way to strip out everything but text from a webpage?

I'm looking to take an HTML page and just extract the pure text on that page. Anyone know of a good way to do that in Python?

I want to strip out literally everything and be left with just the text of the articles and whatever other text is between tags. JS, CSS, etc. all gone.

thanks!


The first answer here doesn't remove the contents of <style> or <script> tags if they are inline in the page (not linked). This might get closer:

import re

def stripTags(text):
    # re.DOTALL lets .*? match across the newlines inside multi-line
    # script/style blocks; re.IGNORECASE catches <SCRIPT> and friends.
    scripts = re.compile(r'<script.*?/script>', re.DOTALL | re.IGNORECASE)
    css = re.compile(r'<style.*?/style>', re.DOTALL | re.IGNORECASE)
    tags = re.compile(r'<.*?>', re.DOTALL)

    text = scripts.sub('', text)
    text = css.sub('', text)
    text = tags.sub('', text)

    return text


You could try the rather excellent Beautiful Soup:

import BeautifulSoup  # Beautiful Soup 3; under bs4 the import is `from bs4 import BeautifulSoup`

with open("my_source.html", "r") as f:
    s = f.read()

soup = BeautifulSoup.BeautifulSoup(s)
txt = soup.body.getText()

But be warned: what you get back from any parsing attempt will be subject to "mistakes": bad HTML, bad parsing, and just generally unexpected output. If your source documents are well known and well formed you should be OK, or at least able to work around their idiosyncrasies, but if it's just general stuff found "out on the internet", then expect all kinds of weird and wonderful outliers.
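To illustrate, a small sketch (using the modern bs4 package, where the import is `from bs4 import BeautifulSoup`; the broken snippet is made up) of how the parser copes with markup that is missing closing tags:

```python
from bs4 import BeautifulSoup

# Beautiful Soup repairs malformed markup on a best-effort basis:
# both <p> tags and the <b> here are unclosed.
broken = "<html><body><p>First paragraph<p>Second <b>bold text</body>"
soup = BeautifulSoup(broken, "html.parser")
print(soup.get_text(separator=" ", strip=True))
```

The repaired tree still yields all the text, but exactly where the parser decides to close each tag is guesswork, which is the kind of "mistake" to expect.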


As per here:

import re

def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

As he notes in the article, "the re module needs to be imported in order to use regular expressions."
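A quick check of the limitation the first answer points out: the regex drops the tags themselves but keeps whatever text sits between them, including script bodies.

```python
import re

def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

# Tags are removed, but their text content survives:
print(remove_html_tags('<p>Hello <b>world</b></p>'))    # Hello world
# ...which is exactly why this falls short for script/style blocks:
print(remove_html_tags('<script>var x = 1;</script>'))  # var x = 1;
```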


The lxml.html module is worth considering. However, it takes a bit of massaging to remove the CSS and JavaScript:

from lxml import html

def stripsource(page):
    source = html.fromstring(page)

    # Drop the elements whose text we never want, along with comments.
    for item in source.xpath("//style|//script|//comment()"):
        item.getparent().remove(item)

    # Yield only the non-blank text nodes.
    for line in source.itertext():
        if line.strip():
            yield line

The yielded lines can simply be concatenated, but that can lose significant word boundaries if there is no literal whitespace around block-level tags.
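A minimal sketch of that point, reusing the stripsource() generator above on a made-up page: joining on a space instead of plain concatenation preserves the boundary between adjacent block elements.

```python
from lxml import html

def stripsource(page):
    source = html.fromstring(page)
    for item in source.xpath("//style|//script|//comment()"):
        item.getparent().remove(item)
    for line in source.itertext():
        if line.strip():
            yield line

page = "<html><body><script>var x;</script><p>one</p><p>two</p></body></html>"
# "".join() would produce "onetwo"; a space separator keeps the words apart:
print(" ".join(s.strip() for s in stripsource(page)))  # one two
```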

You might also want to iterate over just the <body> tag, depending on your requirements.


I would also recommend BeautifulSoup, but using something like the approach from the answer to this question, which I'll copy here for those who don't want to look there:

import re
import BeautifulSoup

soup = BeautifulSoup.BeautifulSoup(html)
texts = soup.findAll(text=True)

def visible(element):
    # Skip text inside tags that never produce visible output.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments.
    elif re.match('<!--.*-->', str(element)):
        return False
    return True

visible_texts = filter(visible, texts)

I tried it on this page for instance and it worked quite well.


This was the cleanest and simplest solution I found to strip CSS and JavaScript:

''.join(BeautifulSoup(content).findAll(
    text=lambda text: text.parent.name != "script" and
                      text.parent.name != "style"))

https://stackoverflow.com/a/3002599/1203188 by Matthew Flaschen
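For reference, a direct translation of that one-liner to Beautiful Soup 4, where findAll/text= are spelled find_all/string= (the sample document here is made up):

```python
from bs4 import BeautifulSoup

content = ("<html><head><style>p { color: red }</style></head>"
           "<body><p>Visible text</p><script>hidden()</script></body></html>")

soup = BeautifulSoup(content, "html.parser")
# Keep every text node whose parent is not a <script> or <style> tag.
text = ''.join(t for t in soup.find_all(string=True)
               if t.parent.name not in ("script", "style"))
print(text)  # Visible text
```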
