开发者

How to get the content of a Html page in Python

I have downloaded the web page into an html file. I am wondering what's the simplest way to get开发者_JAVA技巧 the content of that page. By content, I mean I need the strings that a browser would display.

To be clear:

Input:

<html><head><title>Page title</title></head>
       <body><p id="firstpara" align="center">This is paragraph <b>one</b>.
       <p id="secondpara" align="blah">This is paragraph <b>two</b>.
       </html>

Output:

Page title This is paragraph one. This is paragraph two.

putting together:

from BeautifulSoup import BeautifulSoup
import re

def removeHtmlTags(page):
    p = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
    return p.sub('', page)

def removeHtmlTags2(page):
    soup = BeautifulSoup(page)
    return ''.join(soup.findAll(text=True))

Related

  • Python HTML removal
  • Extracting text from HTML file using Python
  • What is a light python library that can eliminate HTML tags? (and only text)
  • Remove HTML tags in AppEngine Python Env (equivalent to Ruby’s Sanitize)
  • RegEx match open tags except XHTML self-contained tags (famous don't use regex to parse html rant)


Parse the HTML with Beautiful Soup.

To get all the text, without the tags, try:

''.join(soup.findAll(text=True))


Personally, I use lxml because it's a swiss-army knife...

from lxml import html

print html.parse('http://someurl.at.domain').xpath('//body')[0].text_content()

This tells lxml to retrieve the page, locate the <body> tag then extract and print all the text.

I do a lot of page parsing and a regex is the wrong solution most of the time, unless it's a one-time-only need. If the author of the page changes their HTML you run a good risk of your regex breaking. A parser is a lot more likely to continue working.

The big problem with a parser is learning how to access the sections of the document you are after, but there are a lot of XPATH tools you can use inside your browser that simplify the task.


You want to look at Extracting data from HTML documents - Dive into Python because HERE it does (almost)exactly what you want.


The best modules for this task are lxml or html5lib; Beautifull Soap is imho not worth to use anymore. And for recursive models regular expressions are definitly the wrong method.


If I am getting your question correctly, this can simply be done by using urlopen function of urllib. Just have a look at this function to open an url and read the response which will be the html code of that page.


The quickest way to get a usable sample of what a browser would display is to remove any tags from the html and print the rest. This can, for example, be done using python's re.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜