
BeautifulSoup -- Prevent Tag From Automatically Closing

BeautifulSoup is choking on parsing the following code:

>>> soup = BeautifulSoup('<img src="#" alt="Click Here >" border="0" />')
>>> soup.prettify()
'<img src="#" alt="Click Here &gt;" />\n" border="0" />\n'

I should also note that I have no control over the input HTML. There are many different variations of the text and attributes, so I want to avoid using regex.

Anyone have a suggestion for stopping BeautifulSoup from automatically closing the img tag when it runs into the ">" symbol?

Edit 1: I have found this in the documentation. Could I control how BeautifulSoup parses the IMG tag?

Edit 2: I solved my problem. Before calling BeautifulSoup, I did a text replace:

text = text.replace('>"','&gt;"')
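The replace step above can be sketched as a small helper. This is only an illustration of the workaround from Edit 2; the function name `escape_gt_before_quote` and the variable names are made up for the example.

```python
def escape_gt_before_quote(html: str) -> str:
    """Escape every '>' that is immediately followed by a double quote.

    Hypothetical helper name; it wraps the single replace call from Edit 2.
    """
    # str.replace returns a new string, so the result must be kept.
    return html.replace('>"', '&gt;"')


raw = '<img src="#" alt="Click Here >" border="0" />'
cleaned = escape_gt_before_quote(raw)
print(cleaned)
# <img src="#" alt="Click Here &gt;" border="0" />
```

Note that this is a blunt text substitution. It only catches a `>` that sits directly before a double quote, and it will also rewrite a legitimate tag end that happens to be followed by quoted text, such as `<p>"hi"</p>`.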


BeautifulSoup4 has been updated to be context aware, which solves this issue. If you update to the latest version of BeautifulSoup4, it will no longer treat the > character as the end of the tag when it is enclosed in quotes.

from bs4 import BeautifulSoup

soup = BeautifulSoup('<img src="#" alt="Click Here >" border="0" />', 'html.parser')
print(soup.img.attrs)
# {'src': '#', 'alt': 'Click Here >', 'border': '0'}
print(soup.prettify())
# The img tag is now serialized as a single element that keeps all three attributes.

The example shows that the alt attribute correctly has the > character, and the border attribute has been recognised.
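To see what "context aware" means for a quoted `>`, here is a minimal sketch that uses only Python's standard-library `html.parser`, the module that BeautifulSoup4's `'html.parser'` option builds on. It is not BeautifulSoup's own code, and the class name `AttrCollector` is invented for the example.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Record every start tag together with its attributes (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.seen.append((tag, dict(attrs)))


collector = AttrCollector()
collector.feed('<img src="#" alt="Click Here >" border="0" />')
collector.close()
print(collector.seen)
```

The parser reports a single `img` tag whose `alt` value still contains the `>`, and `border` is picked up as its own attribute rather than being left over as stray text.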
