Is it possible for BeautifulSoup to work in a case-insensitive manner?
I am trying to extract the meta description from fetched webpages, but I am running into BeautifulSoup's case sensitivity: some of the pages have <meta name="Description" and some have <meta name="description".
My problem is very similar to that of another question on Stack Overflow.
The only difference is that I can't use lxml; I have to stick with BeautifulSoup.
You can give BeautifulSoup a regular expression to match attributes against. Something like
soup.findAll('meta', name=re.compile("^description$", re.I))
might do the trick. Cribbed from the BeautifulSoup docs.
A regular expression? Now we have another problem.
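Instead, you can pass in a lambda:
soup.findAll(lambda tag: tag.name.lower()=='meta',
             attrs={'name': lambda x: x and x.lower()=='description'})
(The x and guard avoids an exception when the name attribute isn't defined for the tag. The attribute goes in attrs because name is also findAll's first positional parameter and cannot be passed again as a keyword.)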
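With a minor change it works. Because name is also the first positional parameter of findAll, the attribute has to be passed through attrs rather than as a keyword argument: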
soup.findAll('meta', attrs={'name':re.compile("^description$", re.I)})
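The attrs form can be tried end to end with a small script. This is only a sketch: it assumes bs4 with the built-in "html.parser", and the two sample pages are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample pages that differ only in the case of the attribute value.
pages = [
    '<html><head><meta name="Description" content="First page"></head></html>',
    '<html><head><meta name="description" content="Second page"></head></html>',
]

# re.I makes the match case-insensitive; ^ and $ keep it from matching
# longer values such as "og:description".
pattern = re.compile(r"^description$", re.I)

descriptions = []
for html in pages:
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", attrs={"name": pattern})
    # Fall back to None when a page has no matching meta tag.
    descriptions.append(tag["content"] if tag is not None else None)

print(descriptions)
```

Both pages should yield their content value regardless of how "description" is capitalised.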
With bs4 use the following:
soup.find('meta', attrs={'name': lambda x: x and x.lower()=='description'})
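As a rough illustration of the function-based filter, here is a self-contained sketch. It assumes bs4 with "html.parser"; the sample markup and the helper name is_description are invented for the example. The first meta tag has no name attribute, which is the case the x and guard is there for.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <meta> has no name attribute at all.
html = """
<html><head>
<meta charset="utf-8">
<meta name="DESCRIPTION" content="An example page">
</head></html>
"""

def is_description(value):
    # value may be None for tags without a name attribute,
    # so check it before calling .lower().
    return bool(value) and value.lower() == "description"

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("meta", attrs={"name": is_description})
content = tag["content"] if tag is not None else None
print(content)
```

A named function is used instead of a lambda only to leave room for the comment; the behaviour is meant to be the same.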
Better still, use a CSS attribute = value selector with the i flag for case insensitivity:
soup.select('meta[name="description" i]')
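A minimal sketch of the selector approach, assuming a bs4 version recent enough to delegate select() to the soupsieve package (4.7 or later) and an invented sample fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with a capitalised attribute value.
html = '<head><meta name="Description" content="Selector demo"></head>'

soup = BeautifulSoup(html, "html.parser")

# The trailing "i" inside the brackets asks for a case-insensitive
# comparison of the attribute value.
tag = soup.select_one('meta[name="description" i]')
description = tag.get("content") if tag is not None else None
print(description)
```

select_one returns the first match or None, which is convenient when a page is expected to have at most one description tag; select returns a list of all matches.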
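Another option is to change the case of the HTML page source before parsing it, using string methods such as str.lower() or str.upper(). Note that this changes the case of the whole page, including the content you extract.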
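A short sketch of what lowercasing the source does in practice, assuming bs4 with "html.parser" and a made-up fragment. The lookup becomes a plain string comparison, but the extracted text loses its original capitalisation.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; note the mixed case in the content value.
html = '<head><meta name="Description" content="Hello World"></head>'

# Lowercase the entire source before parsing.
lowered = BeautifulSoup(html.lower(), "html.parser")
tag = lowered.find("meta", attrs={"name": "description"})

# The tag is found, but its content was lowercased along with everything else.
lowered_content = tag["content"]
print(lowered_content)
```

If the original capitalisation of the description matters, one of the case-insensitive matching approaches is likely the safer choice.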