
Is it possible for BeautifulSoup to work in a case-insensitive manner?

I am trying to extract the meta description from fetched web pages, but I am running into BeautifulSoup's case sensitivity.

Some of the pages have <meta name="Description" and some have <meta name="description".

My problem is very similar to another question on Stack Overflow.

The only difference is that I can't use lxml; I have to stick with BeautifulSoup.


You can give BeautifulSoup a regular expression to match attributes against. Something like

soup.findAll('meta', name=re.compile("^description$", re.I))

might do the trick. Cribbed from the BeautifulSoup docs.
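To see which attribute values that pattern accepts, here is a small standard-library sketch. The candidate values are made up for illustration; BeautifulSoup applies a compiled pattern with its search() method, which is why the ^ and $ anchors matter.

```python
import re

# Same pattern as in the answer: anchored and case-insensitive.
pattern = re.compile("^description$", re.I)

# Hypothetical values a page might use in <meta name="...">.
candidates = ["description", "Description", "DESCRIPTION", "og:description", "keywords"]

# search() is used here to mirror how BeautifulSoup applies a regex filter.
matches = [value for value in candidates if pattern.search(value)]
print(matches)
```

Without the anchors, "og:description" would also match.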


A regular expression? Now we have another problem.

Instead, you can pass in a lambda:

soup.findAll(lambda tag: tag.name.lower()=='meta',
    attrs={'name': lambda x: x and x.lower()=='description'})

(The "x and" part avoids an exception when the name attribute isn't defined for the tag, because x is None in that case.)
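The guard can be checked in isolation, without BeautifulSoup. The function name is_description below is hypothetical; it is just the attribute lambda given a name.

```python
def is_description(value):
    # value is None when the tag has no name attribute.
    # "value and ..." short-circuits, so .lower() is never called on None.
    return value and value.lower() == "description"

print(bool(is_description("Description")))
print(bool(is_description("keywords")))
print(bool(is_description(None)))
```

Without the guard, the None case would raise AttributeError instead of simply not matching.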


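With minor changes it works. name is also the first positional parameter of findAll, so passing it as a keyword argument next to 'meta' raises a TypeError. Put the attribute filter in attrs instead: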

soup.findAll('meta', attrs={'name':re.compile("^description$", re.I)})


With bs4 use the following:

soup.find('meta', attrs={'name': lambda x: x and x.lower()=='description'})


Better still, use a CSS attribute = value selector with the i flag for case insensitivity:

soup.select('meta[name="description" i]')
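A minimal end-to-end sketch of the two bs4 approaches, assuming a bs4 release that bundles soupsieve (4.7 or later) and the built-in "html.parser" backend, so lxml is not needed. The sample pages and the helper name meta_description are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical pages that differ only in the case of the name attribute value.
pages = [
    '<html><head><meta name="Description" content="First page"></head></html>',
    '<html><head><meta name="description" content="Second page"></head></html>',
]

def meta_description(html):
    soup = BeautifulSoup(html, "html.parser")
    # CSS selector: the trailing i makes the value comparison case-insensitive.
    by_selector = soup.select_one('meta[name="description" i]')
    # Attribute filter: a function that lowercases the value before comparing.
    by_filter = soup.find("meta", attrs={"name": lambda x: x and x.lower() == "description"})
    return (
        by_selector["content"] if by_selector else None,
        by_filter["content"] if by_filter else None,
    )

results = [meta_description(page) for page in pages]
print(results)
```

Both lookups should return the same tag for each page, so either one can be used on its own.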


Change the case of the HTML page source before parsing it, using string methods such as lower() or upper().
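One caveat with this approach, shown as a small sketch with a made-up tag: changing the case of the whole source also changes the case of the content you are trying to extract.

```python
# Hypothetical page fragment with mixed-case markup and content.
html = '<META NAME="Description" CONTENT="Hello World">'

# Lowercasing normalizes the attribute value, but also the description text.
lowered = html.lower()
print(lowered)
```

If the original capitalization of the description matters, one of the matching-based answers is probably the safer choice.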
