BeautifulSoup, Python and HTML automatic page truncating?

I'm using Python and BeautifulSoup to parse HTML pages. Unfortunately, for some pages (> 400K), BeautifulSoup truncates the HTML content.

I use the following code to get the set of "div"s:

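from BeautifulSoup import BeautifulSoup, SoupStrainer  # BeautifulSoup 3

findSet = SoupStrainer('div')
divs = BeautifulSoup(htmlSource, parseOnlyThese=findSet)  # parse only the div tags
for it in divs:
    print(it)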

At a certain point, the output looks like:

correct string, correct string, incomplete/truncated string ("So, I")

even though htmlSource contains the string "So, I am bored", and many others. I should also mention that when I prettify() the tree, the HTML source it shows is truncated as well.

Do you have any idea how I can fix this issue?

Thanks!


Try using lxml.html. It is a faster, better HTML parser, and it handles broken HTML better than the latest BeautifulSoup. It works fine on your example page and parses the entire page.

import lxml.html

doc = lxml.html.parse('http://voinici.ceata.org/~sana/test.html')
print(len(doc.findall('.//div')))  # count every div in the parsed document

The code above finds 131 divs.
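To check whether a parser really dropped part of a page, one option is to count the div start tags independently and compare that number with what the parser reports. The following is only a minimal sketch in Python 3 using the standard library's html.parser; the names DivCounter and count_div_tags and the sample string are made up for illustration and are not part of lxml or BeautifulSoup.

```python
from html.parser import HTMLParser


class DivCounter(HTMLParser):
    """Hypothetical helper: counts <div> start tags without building a tree."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        if tag == "div":
            self.count += 1


def count_div_tags(html_source):
    """Return the number of <div> start tags found in html_source."""
    counter = DivCounter()
    counter.feed(html_source)
    counter.close()
    return counter.count


# Illustrative input with an unclosed div at the end.
sample = "<div>one<div>two</div></div><p>So, I am bored</p><div>three"
print(count_div_tags(sample))
```

If this count is much higher than the number of divs your tree-building parser returns for the same source, the tree was probably truncated during parsing rather than during download.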


I found a solution to this problem that stays with BeautifulSoup, at beautifulsoup-where-are-you-putting-my-html. I prefer it because I think it is easier than switching to lxml.

The only thing you need to do is to install:

pip install html5lib

and add it as a parameter to BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html5lib')
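For a slightly fuller picture, here is a rough sketch of how the parser choice might be wired in, assuming bs4 is installed. The helper parse_with_fallback and the sample markup are hypothetical; the sketch tries html5lib first and falls back to other parsers only if bs4 reports that one is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_fallback(html, parsers=("html5lib", "lxml", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first installed parser."""
    for name in parsers:
        try:
            return BeautifulSoup(html, name), name
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            continue
    raise RuntimeError("no usable parser found")


# Illustrative markup, not the page from the question.
html = "<html><body><div>So, I am bored</div><div>second</div></body></html>"
soup, used = parse_with_fallback(html)
divs = soup.find_all("div")
print(used, len(divs))
```

Naming the parser explicitly also keeps results consistent across machines, since bs4 otherwise picks whichever parser happens to be installed.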