How can/should I break an html document into parts using Python? (Techno- and logically)

2023-02-02 11:35 问答作者：

I've an HTML document I'm trying to break into se开发者_如何学Pythonparate, smaller chunks. Say, take each < h3 > header and turn into its own separate file, using only the HTML encoded within that chunk (along with html, head, body, tags).

I am using Python's Beautiful Soup which I am new to, but seems easy to use for easy tasks such as this (Any better suggestions like lxml or Mini-dom?). So:

1) How do I go, 'parse all < h3 >s and turn each into a separate doc'? Anything from pointers to the right direction to code snippets to online documentation (found quite little for Soup) will be appreciated.

2) Logically, finding the tag won't be enough - I need to physically 'cut it out' and put it in a separate file (and remove it from original). Perhaps parsing the text lines instead of nodes would be easier (albeit super-ugly, parsing raw text from a formed structure...?)

3) Similarly related - suppose I want to delete a certain attribute from all tags of a type (like, delete the alignment attribute of all images). This seems easy but I've failed - any help will be appreciated! Thanks for any help!

Yes, you use BeautifulSoup or lxml. Both have methods to find the nodes you want to extract. You can then also recreate HTML from the node objects, and hence save that HTML to new files.

继续阅读：python

How can/should I break an html document into parts using Python? (Techno- and logically)

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？