HTML parser for GAE
Generally I use lxml for my HTML parsing needs, but that isn't available on Google App Engine. The obvious alternative is BeautifulSoup, but I find it chokes too easily on malformed HTML. Currently I am testing libxml2dom and have been getting better results.
Which pure Python HTML parser have you found performs best? My priority is the ability to handle bad HTML over speed.
From the BeautifulSoup documentation:
Version 3.1.0 of Beautiful Soup does significantly worse on real-world HTML than version 3.0.8 does
So, it might help you to use this earlier version. That is precisely what the author himself recommends.
You can pretend that Beautiful Soup version 3.1.0 was never released. Version 3.0.8 still works fine on Python 2.3 through 2.6.
No longer a problem - lxml is supported: https://developers.google.com/appengine/docs/python/tools/libraries27
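If the same code has to run both where lxml is available and where it is not, one option is to guard the import and fall back to the standard library. The following is only a rough sketch of that idea, not something taken from the App Engine docs: the helper name extract_links and the class _HrefCollector are made up for illustration, and it assumes Python 3 naming for the stdlib module (on Python 2.7 the module is called HTMLParser).

```python
from html.parser import HTMLParser

try:
    import lxml.html  # preferred: tolerant of malformed markup
except ImportError:
    lxml = None  # lxml not installed; use the pure-Python fallback


class _HrefCollector(HTMLParser):
    """Hypothetical helper: records the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def extract_links(html):
    """Return the href values of all <a> tags, in document order."""
    if lxml is not None:
        root = lxml.html.fromstring(html)
        return [a.get("href") for a in root.iter("a")
                if a.get("href") is not None]
    # Fallback: the stdlib parser just reports tags as it sees them,
    # so unclosed elements do not stop it from finding the links.
    collector = _HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs
```

For example, extract_links('<p><a href="/a">one<a href="/b">two') should give both hrefs even though none of the tags are closed. The fallback does no tree repair at all, so it is only a reasonable substitute when you need to pull out tags or attributes rather than navigate a corrected document.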