Searching for specific HTML string using Python

2022-12-25 20:55 问答作者：

What modules would be the best to write a python program that searches through hundreds of html documents and deletes a certain string of html that is given. For instance, if I have an html doc that has <a href="test.html">Test</a> and I want to delete this out of every html page that has it.

Any help is much appreciated, and I don't need someone to write the program for me, just a help开发者_如何学编程ful point in the right direction.

If the string you are searching for will be in the HTML literally, then simple string replacement will be fine:

old_html = open(html_file).read()
new_html = old_html.replace(my_string, "")
if new_html != old_html:
    open(html_file, "w").write(new_html)

As an example of the string not being in the HTML literally, suppose you are looking for "Test" as you said. Do you want it to match these snippets of HTML?:

<a href='test.html'>Test</a>
<A HREF='test.html'>Test</A>
<a href="test.html" class="external">Test</a>
<a href="test.html">Tes&#116;</a>

and so on: the "same" HTML can be expressed in many different ways. If you know the precise characters used in the HTML, then simple string replacement is fine. If you need to match at an HTML semantic level, then you'll need to use more advanced tools like BeautifulSoup, but then you'll also have potentially very different HTML output than you started with, even in the sections not affected by the deletion, because the entire file will have been parsed and reconstituted.

To execute code over many files, you'll find os.path.walk useful for finding files in a tree, or glob.glob for matching filenames to shell-like wildcard patterns.

BeautifulSoup or lxml.

htmllib

This module defines a class which can serve as a base for parsing text files formatted in the HyperText Mark-up Language (HTML). The class is not directly concerned with I/O — it must be provided with input in string form via a method, and makes calls to methods of a “formatter” object in order to produce output. The HTMLParser class is designed to be used as a base class for other classes in order to add functionality, and allows most of its methods to be extended or overridden. In turn, this class is derived from and extends the SGMLParser class defined in module sgmllib. The HTMLParser implementation supports the HTML 2.0 language as described in RFC 1866.

继续阅读：python

Searching for specific HTML string using Python

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？