Use Python to parse HTML data that contains "&"

Asked 2023-04-04 15:46

I'm using the Python library SGMLParser to parse some HTML. I encounter an HTML tag of the form

<td class="school">Texas A&amp;M</td>

I'd like to read out "Texas A&M". But when handle_data gets called, it gets called with "Texas A", and then, separately, "M" (quotes for clarity).

How do I replace the

&amp;

string with an & before the call, without replacing all special ampersands in the whole string (some of which I may need)?

Thanks!

If you switch from the deprecated SGMLParser to a modern alternative such as lxml (which also handles HTML), this becomes trivial:

>>> from lxml import etree
>>> etree.fromstring('''<td class="school">Texas A&amp;M</td>''').text
'Texas A&M'

SGMLParser has a convert_entityref() method, but rather than the deprecated SGMLParser I would recommend lxml or Beautiful Soup, which have better parser APIs.

Entity references like &amp; are handled by handle_entityref. Check that this method knows how to translate &amp;. The default implementation should call handle_data('&'), but you may have accidentally overridden it.

Also, if possible, consider using the far more advanced lxml instead.
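
The chunked callbacks described above can be handled by buffering the pieces and joining them when the cell closes. The sketch below is only an illustration: sgmllib was removed in Python 3, so it assumes the standard library's html.parser.HTMLParser as a stand-in (with convert_charrefs=False, so entities arrive through handle_entityref as they do in SGMLParser). The names SchoolParser, schools, and _chunks are hypothetical and not part of any library.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class SchoolParser(HTMLParser):
    """Collects the text of every <td class="school"> cell (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=False makes the parser report entities separately,
        # which reproduces the split handle_data calls from the question.
        super().__init__(convert_charrefs=False)
        self.schools = []
        self._chunks = None  # None while outside a matching cell

    def handle_starttag(self, tag, attrs):
        if tag == "td" and ("class", "school") in attrs:
            self._chunks = []

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_entityref(self, name):
        # Translate named entities such as &amp;; keep unknown ones verbatim.
        if self._chunks is not None:
            codepoint = name2codepoint.get(name)
            self._chunks.append(chr(codepoint) if codepoint else "&%s;" % name)

    def handle_endtag(self, tag):
        if tag == "td" and self._chunks is not None:
            self.schools.append("".join(self._chunks))
            self._chunks = None


parser = SchoolParser()
parser.feed('<td class="school">Texas A&amp;M</td>')
parser.close()
print(parser.schools)  # ['Texas A&M']
```

Because only entities inside the matching cell are translated, ampersands elsewhere in the document are left untouched. With the default convert_charrefs=True, html.parser decodes entities itself and the handle_entityref override is never called.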
