How to get the content of a Html page in Python

2022-12-22 23:00 问答作者：

I have downloaded the web page into an html file. I am wondering what's the simplest way to get开发者_JAVA技巧 the content of that page. By content, I mean I need the strings that a browser would display.

To be clear:

Input:

<html><head><title>Page title</title></head>
       <body><p id="firstpara" align="center">This is paragraph <b>one</b>.
       <p id="secondpara" align="blah">This is paragraph <b>two</b>.
       </html>

Output:

Page title This is paragraph one. This is paragraph two.

putting together:

from BeautifulSoup import BeautifulSoup
import re

def removeHtmlTags(page):
    p = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
    return p.sub('', page)

def removeHtmlTags2(page):
    soup = BeautifulSoup(page)
    return ''.join(soup.findAll(text=True))

Python HTML removal
Extracting text from HTML file using Python
What is a light python library that can eliminate HTML tags? (and only text)
Remove HTML tags in AppEngine Python Env (equivalent to Ruby’s Sanitize)
RegEx match open tags except XHTML self-contained tags (famous don't use regex to parse html rant)

Parse the HTML with Beautiful Soup.

To get all the text, without the tags, try:

''.join(soup.findAll(text=True))

Personally, I use lxml because it's a swiss-army knife...

from lxml import html

print html.parse('http://someurl.at.domain').xpath('//body')[0].text_content()

This tells lxml to retrieve the page, locate the <body> tag then extract and print all the text.

I do a lot of page parsing and a regex is the wrong solution most of the time, unless it's a one-time-only need. If the author of the page changes their HTML you run a good risk of your regex breaking. A parser is a lot more likely to continue working.

The big problem with a parser is learning how to access the sections of the document you are after, but there are a lot of XPATH tools you can use inside your browser that simplify the task.

You want to look at Extracting data from HTML documents - Dive into Python because HERE it does (almost)exactly what you want.

The best modules for this task are lxml or html5lib; Beautifull Soap is imho not worth to use anymore. And for recursive models regular expressions are definitly the wrong method.

If I am getting your question correctly, this can simply be done by using urlopen function of urllib. Just have a look at this function to open an url and read the response which will be the html code of that page.

The quickest way to get a usable sample of what a browser would display is to remove any tags from the html and print the rest. This can, for example, be done using python's re.

继续阅读：parsing python

How to get the content of a Html page in Python

Related

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？