Python html2text regex performance

2023-02-17 02:04 Q&A author:

I have built a sequence of regexes that converts HTML to plain text. I run it in up to 100 threads to clean up HTML files. I want to get all the visible text of a given HTML file.

    self.content = re.sub(r'<!--(.|\n)*?-->', '', self.content)
    self.content = re.sub(r'<script (.|\n)*?>(.|\n)*?</script>', '', self.content)
    self.content = re.sub(r'<style (.|\n)*?>(.|\n)*?</style>', '', self.content)
    self.content = re.sub(r'(<[^>]*?>+)', ' ', self.content)

I am not really a regex pro. Could the performance of these regexes be improved?

I don't want to use BeautifulSoup, Django, or the C++ html2text distribution. In my tests they were slower than my regexes. I just need a space-separated string, not a tree or links etc.

Thanks for helping. I know there are some really smart people on Stack Overflow.

Use a tool like BeautifulSoup or htmllib and don't try to be smarter than the rest of the world. Parsing HTML with regular expressions is the worst thing you can do! There will always be one more HTML file on which your regexes fail.
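For what it's worth, htmllib only exists in Python 2. In Python 3 the standard library offers html.parser instead, and it can produce a space-separated string without building a tree. The following is only a minimal sketch of that idea, not a drop-in replacement: the names TextExtractor and visible_text are made up for illustration, and I have not measured it against the regex version.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self._chunks = []      # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never arrive here; they go to handle_comment, which is a no-op.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join fragments with spaces, then collapse runs of whitespace.
        return ' '.join(' '.join(self._chunks).split())


def visible_text(html):
    # Hypothetical helper: one parser instance per document, so it is safe
    # to call from several threads at once.
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Like the regex version, this puts a space wherever a tag was, so a word split by inline markup comes out as separate pieces.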

There is a common credo that HTML and XML must ne-e-ever be processed with regex tools. The risks are real, and they become impossible to manage when the goal is too ambitious. HTML and XML are too complicated as markup languages to be analysed by regexes.

However, I don't entirely share this credo. In my opinion, the method is not so absurd if it is used lucidly, under conditions that can reasonably be considered to justify it because the risks appear minimal.

I believe that regexes can be used for limited and simple processing of HTML or XML. I have really understood, here on stackoverflow.com, that it is impracticable to parse HTML/XML with regexes. But when the task does not involve parsing (extracting all or part of a markup tree), why reject regexes so religiously (I allude to the cited link)?
A good safety measure, it seems to me, is to use regex-based code only on texts from a constant origin, and not to try to make it analyse varied HTML or XML texts.

After these warnings, I dare to propose the following improvements to your REs:

    self.content = re.sub(r'<!--.*?-->', '', self.content, flags=re.DOTALL)

and

    self.content = re.sub(r'<(script|style)\b.*?</\1>', '', self.content, flags=re.DOTALL)
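Putting the steps from the question and the suggestions above together, the whole clean-up could look roughly like the sketch below. It rests on a few assumptions of mine: the function name html_to_text is made up, the patterns are precompiled at module level so that all threads share them, the closing tag is matched explicitly as `</\1` (otherwise the lazy match can stop early inside an attribute such as type="text/javascript"), and a final whitespace-collapsing step is added that the question does not have.

```python
import re

# Compiled once at import time; compiled pattern objects can be shared by threads.
_COMMENT = re.compile(r'<!--.*?-->', re.DOTALL)
# \b lets the pattern match both <script> and <script type="...">.
_SCRIPT_STYLE = re.compile(r'<(script|style)\b.*?</\1\s*>', re.DOTALL | re.IGNORECASE)
_TAG = re.compile(r'<[^>]*>')
_WHITESPACE = re.compile(r'\s+')


def html_to_text(html):
    """Return the visible text of `html` as one space-separated string."""
    text = _COMMENT.sub('', html)        # 1. drop comments
    text = _SCRIPT_STYLE.sub('', text)   # 2. drop script/style elements with their content
    text = _TAG.sub(' ', text)           # 3. replace every remaining tag with a space
    return _WHITESPACE.sub(' ', text).strip()  # 4. collapse whitespace (my addition)
```

Using `.*?` with re.DOTALL instead of `(.|\n)*?` avoids an alternation and a capture group at every character, which is where I would expect most of the gain to come from. I have not benchmarked it, so measure on your own files.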
