How can/should I break an html document into parts using Python? (Techno- and logically)
I've an HTML document I'm trying to break into se开发者_如何学Pythonparate, smaller chunks. Say, take each < h3 > header and turn into its own separate file, using only the HTML encoded within that chunk (along with html, head, body, tags).
I am using Python's Beautiful Soup which I am new to, but seems easy to use for easy tasks such as this (Any better suggestions like lxml or Mini-dom?). So:
1) How do I go, 'parse all < h3 >s and turn each into a separate doc'? Anything from pointers to the right direction to code snippets to online documentation (found quite little for Soup) will be appreciated.
2) Logically, finding the tag won't be enough - I need to physically 'cut it out' and put it in a separate file (and remove it from original). Perhaps parsing the text lines instead of nodes would be easier (albeit super-ugly, parsing raw text from a formed structure...?)
3) Similarly related - suppose I want to delete a certain attribute from all tags of a type (like, delete the alignment attribute of all images). This seems easy but I've failed - any help will be appreciated! Thanks for any help!
Yes, you use BeautifulSoup or lxml. Both have methods to find the nodes you want to extract. You can then also recreate HTML from the node objects, and hence save that HTML to new files.
精彩评论