Use Python to parse HTML data which contains "&amp;"
I'm using the Python library SGMLParser to parse some HTML. I've encountered an HTML tag of the form
<td class="school">Texas A&amp;M</td>
I'd like to read out "Texas A&M". But when handle_data gets called, it is called with "Texas A" and then, separately, "M" (quotes for clarity).
How do I replace the "&amp;" string with an "&" before the call, without replacing all special ampersands in the whole string (some of which I may need)?
Thanks!
If you switch from the deprecated SGMLParser to a modern alternative such as lxml (which also handles HTML), this becomes trivial:
>>> from lxml import etree
>>> etree.fromstring('''<td class="school">Texas A&amp;M</td>''').text
'Texas A&M'
SGMLParser has a convert_entityref() method, but rather than the deprecated SGMLParser I would recommend lxml or Beautiful Soup, which have better parser APIs.
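If a third-party parser is not an option, a rough equivalent can be sketched with the standard library. This assumes Python 3.5 or later, where sgmllib no longer exists and html.parser.HTMLParser decodes entity references before calling handle_data (convert_charrefs=True). The names CellTextParser and cell_texts are made up for this illustration.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of every <td> cell (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True is the default since Python 3.5; it decodes
        # references such as &amp; before handle_data is called.
        super().__init__(convert_charrefs=True)
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Append to the current cell so split text chunks are rejoined.
        if self._in_cell:
            self.cells[-1] += data


def cell_texts(html):
    """Return the decoded text of each <td> in the given HTML string."""
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    return parser.cells


print(cell_texts('<td class="school">Texas A&amp;M</td>'))  # ['Texas A&M']
```

Only the "&amp;" inside the cell text is decoded; attribute values and the rest of the markup are left to the parser.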
Entity references like "&amp;" are handled by handle_entityref. Check that this method knows how to translate "&amp;". The default implementation should call handle_data('&'), but you may have accidentally overridden it.
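To make the entity hook concrete, here is a minimal sketch. It does not use SGMLParser itself (sgmllib was removed in Python 3). It assumes html.parser.HTMLParser with convert_charrefs=False, which reproduces the same split between text and entity references. In that class handle_entityref does nothing by default, so the override below does the translation itself. EntityAwareParser and text_chunks are hypothetical names.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class EntityAwareParser(HTMLParser):
    """Record each text chunk, routing entity references through handle_data."""

    def __init__(self):
        # With convert_charrefs=False, "&amp;" is reported separately via
        # handle_entityref("amp") instead of being decoded automatically.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_entityref(self, name):
        # Translate known entity names; leave unknown ones as written.
        codepoint = name2codepoint.get(name)
        text = chr(codepoint) if codepoint is not None else "&%s;" % name
        self.handle_data(text)


def text_chunks(html):
    """Return the sequence of text chunks seen while parsing (illustrative)."""
    parser = EntityAwareParser()
    parser.feed(html)
    parser.close()
    return parser.chunks


chunks = text_chunks('<td class="school">Texas A&amp;M</td>')
print(chunks)           # ['Texas A', '&', 'M']
print("".join(chunks))  # Texas A&M
```

The text still arrives in three pieces, so the caller has to join the chunks collected between the start and end tag. If handle_entityref is overridden without forwarding to handle_data, the "&" is silently lost, which matches the symptom in the question.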
Also, if possible, consider using the far more advanced lxml instead.