Best Python Module for HTML parsing [closed]
I have a website updater (people can update the text content, not the look of the website) that uses HTML and JavaScript on the front end and Python on the back end/server side.
I am finding that updating HTML from the front end is very difficult, because grabbing the updated HTML with ele.innerHTML or $(ele).html() gives altered HTML depending on the browser (DAMN IE).
So I have decided to update my HTML from the back end, i.e., in Python.
What do you think is the best Python module to parse HTML and grab information?
My requirements are:
- The module must work with Python 2.5 or earlier (because of my web host).
- I will be parsing HTML and finding all the HTML elements that are of the class "updatable".
- For each element of the class "updatable": extract the innerText (only the text content, not the HTML).

Which Python module would you suggest is best for this?
- HTMLParser.py
- htmllib.py
- Do you know of any other Python 2.5 compatible modules?

For parsing HTML I would suggest you take a look at Beautiful Soup. It's pretty powerful and can deal with some messed-up markup as well.
http://www.crummy.com/software/BeautifulSoup/
Check this out and see if it helps you out! Hope it does.
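Here is a rough sketch of how the "updatable" requirement from the question might look with Beautiful Soup. It assumes the current bs4 package on Python 3 with the built-in "html.parser" backend; an old interpreter such as Python 2.5 would need an older Beautiful Soup release, and the import line would differ. The function name updatable_texts and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def updatable_texts(html):
    """Return the text of every element whose class list includes "updatable"."""
    # "html.parser" is the pure-Python backend from the standard library,
    # so no extra parser has to be installed for this sketch.
    soup = BeautifulSoup(html, "html.parser")
    # class_="updatable" also matches elements that carry other classes too.
    # get_text() drops the tags and keeps only the text, like innerText.
    return [el.get_text() for el in soup.find_all(class_="updatable")]


# Hypothetical sample markup, not taken from the question.
sample = (
    '<div>'
    '<p class="updatable">Hello <b>world</b></p>'
    '<p>static</p>'
    '<span class="note updatable">Bye</span>'
    '</div>'
)

print(updatable_texts(sample))  # ['Hello world', 'Bye']
```

Note that if one "updatable" element is nested inside another, the inner text will show up in both results.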
I've been using lxml (http://lxml.de/lxmlhtml.html). It is relatively fast for normal-sized HTML documents and has support for using BeautifulSoup. As I understand it, BeautifulSoup is no longer supported, so I've used lxml for all new projects.
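For comparison, a minimal sketch of the same task with lxml.html, assuming a recent lxml on Python 3. The XPath expression is one common way to test for a single class name among several; the names UPDATABLE_XPATH and updatable_texts and the sample page are invented for this example.

```python
import lxml.html

# Matches any element whose space-separated class attribute contains the
# exact token "updatable" (and not, say, "not-updatable").
UPDATABLE_XPATH = (
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' updatable ')]"
)


def updatable_texts(html):
    """Return the text content of every element of class "updatable"."""
    root = lxml.html.fromstring(html)
    # text_content() returns the element's text with all markup removed.
    return [el.text_content() for el in root.xpath(UPDATABLE_XPATH)]


# Hypothetical sample page, not taken from the question.
sample = (
    '<html><body>'
    '<p class="updatable">Hello <b>world</b></p>'
    '<p class="not-updatable">static</p>'
    '<span class="note updatable">Bye</span>'
    '</body></html>'
)

print(updatable_texts(sample))  # ['Hello world', 'Bye']
```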