BeautifulSoup, Python and HTML automatic page truncating?

2023-01-16 13:52

I'm using Python and BeautifulSoup to parse HTML pages. Unfortunately, for some pages (> 400K) BeautifulSoup is truncating the HTML content.

I use the following code to get the set of "div"s:

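# BeautifulSoup 3 import (parseOnlyThese is the BeautifulSoup 3 keyword)
from BeautifulSoup import BeautifulSoup, SoupStrainer

findSet = SoupStrainer('div')
divs = BeautifulSoup(htmlSource, parseOnlyThese=findSet)  # renamed from "set" to avoid shadowing the builtin
for it in divs:
    print(it)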

At a certain point, the output looks like:

correct string, correct string, incomplete/truncated string ("So, I")

even though htmlSource contains the string "So, I am bored", and many others. I should also mention that when I prettify() the tree, the HTML source it shows is truncated as well.

Do you have any idea how I can fix this issue?

Thanks!

Try using lxml.html. It is a faster HTML parser and copes with broken HTML better than the latest BeautifulSoup. It works fine on your example page and parses the entire page.

import lxml.html

# Parse the page straight from the URL and count every <div> in the tree
doc = lxml.html.parse('http://voinici.ceata.org/~sana/test.html')
print(len(doc.findall('.//div')))

The code above reports 131 divs.
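
If you want to check whether a parser is dropping content, one option is to cross-check its div count against an independent count. Here is a rough sketch using only the standard library, assuming Python 3. DivCounter and count_divs are names I made up for illustration, and the check only catches truncation that shows up as missing divs or a cut-off final string.

```python
from html.parser import HTMLParser


class DivCounter(HTMLParser):
    """Counts <div> start tags and remembers the last non-blank text chunk."""

    def __init__(self):
        super().__init__()
        self.div_count = 0
        self.last_text = ""

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased from HTMLParser
        if tag == "div":
            self.div_count += 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.last_text = text


def count_divs(html_source):
    """Return (number of <div> start tags, last non-blank text seen)."""
    counter = DivCounter()
    counter.feed(html_source)
    counter.close()  # flush any buffered trailing text
    return counter.div_count, counter.last_text
```

If the number returned here is clearly higher than what your BeautifulSoup tree contains, or the last text differs (for example "So, I" instead of "So, I am bored"), the tree is being cut short somewhere.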

I found a solution that keeps BeautifulSoup in the question "beautifulsoup-where-are-you-putting-my-html", and I think it is easier than switching to lxml.

The only thing you need to do is to install:

pip install html5lib

and add it as a parameter to BeautifulSoup:

soup = BeautifulSoup(html, 'html5lib')
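
For completeness, here is a slightly fuller sketch of the same idea, assuming the bs4 package (BeautifulSoup 4) on Python 3. The helper name parse_html and the fallback order are my own choices, not something from the linked question. It tries html5lib first and falls back to other parsers only if html5lib is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_html(html, parsers=("html5lib", "lxml", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    parse_html is a hypothetical helper; bs4 raises FeatureNotFound when
    the requested parser library is missing.
    """
    for name in parsers:
        try:
            return BeautifulSoup(html, name), name
        except FeatureNotFound:
            continue
    raise RuntimeError("no usable HTML parser found")


html = "<div>correct string</div><div>So, I am bored</div>"
soup, used = parse_html(html)
divs = soup.find_all("div")
print(used, len(divs))
```

Printing the parser name makes it obvious when you have silently fallen back to something other than html5lib, which matters because the parsers can build different trees from the same broken HTML.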
