HTML parser for GAE

Generally I use lxml for my HTML parsing needs, but that isn't available on Google App Engine. The obvious alternative is BeautifulSoup, but I find it chokes too easily on malformed HTML. Currently I am testing libxml2dom and have been getting better results.

Which pure Python HTML parser have you found performs best? My priority is the ability to handle bad HTML over speed.


From the BeautifulSoup documentation:

Version 3.1.0 of Beautiful Soup does significantly worse on real-world HTML than version 3.0.8 does

So it might help to use the earlier version, which is precisely what the author himself recommends:

You can pretend that Beautiful Soup version 3.1.0 was never released. Version 3.0.8 still works fine on Python 2.3 through 2.6.


No longer a problem - lxml is supported: https://developers.google.com/appengine/docs/python/tools/libraries27
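
For anyone who wants to try this on deliberately broken markup, here is a minimal sketch. It assumes Python 3 and is not specific to App Engine. The helper `extract_links`, the class `_LinkCollector`, and the sample string `BAD_HTML` are made up for illustration. It uses `lxml.html.fromstring` when lxml can be imported, and otherwise falls back to the standard library's `html.parser`, which is a swapped-in pure Python alternative and not one of the parsers discussed above.

```python
from html.parser import HTMLParser

try:
    import lxml.html  # third-party; assumed to be installed or enabled in the runtime
except ImportError:
    lxml = None  # fall back to the pure Python parser below


class _LinkCollector(HTMLParser):
    """Hypothetical fallback: records href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href of every <a> tag, tolerating malformed markup."""
    if lxml is not None:
        doc = lxml.html.fromstring(html_text)
        return [a.get("href") for a in doc.iter("a") if a.get("href") is not None]
    collector = _LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


# Deliberately broken markup: unclosed <p>, <b>, and <a> tags.
BAD_HTML = '<html><body><p>One<p>Two <a href="/x">x<b>bold <a href="/y">y</body>'

if __name__ == "__main__":
    print(extract_links(BAD_HTML))
```

Both code paths should recover the two links from the sample even though none of the inline tags are closed. Feeding in real pages that previously tripped up your parser is a better test than a toy string like this one.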
