
Dict Indexes in BeautifulSoup and "if x in Dict"

I don't think I understand how to check if an array index exists...

for tag in soup.findAll("input"):
    print tag['type']
    if 'type' in tag:
        print "b"

Outputs:

2255
text
hidden
text
text
text
Traceback (most recent call last):
  File "/home//workspace//src/x.py", line 268, in <module>
    print tag['type']
  File "/home//workspace//src/BeautifulSoup.py", line 601, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'type'

Why is it not outputting 'b' ever?


A BeautifulSoup Tag is not a dict. It acts like one in some ways (the [] notation, as you discovered, gets the value of an attribute), but not in others. Using in on a Tag checks whether something is a direct child of that tag; it does not check the tag's attributes.

Instead, you could do something like this:

if not tag.get('type', None):
    pass # type is empty or nonexistent
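
Applied to the loop from the question, the check above might look like the sketch below. The helper name input_type and the "(no type)" placeholder are made up for illustration, and plain dicts stand in for the parsed tags only so that the snippet runs without the library installed; the one thing it relies on is a dict-style get(), which is what the answer above uses on a Tag.

```python
def input_type(tag, default="(no type)"):
    """Return the tag's type attribute, or default if it is missing or empty."""
    # Only assumes the object has a dict-style get(); nothing else dict-like.
    return tag.get("type", None) or default


# Hypothetical stand-ins for the <input> tags; a real Tag is NOT a dict.
fake_tags = [{"type": "text"}, {"type": "hidden"}, {"name": "q"}, {"type": ""}]

for tag in fake_tags:
    # No KeyError here, even for the entries that have no usable type
    print(input_type(tag))
```

With real tags you would pass each item from soup.findAll("input") to the helper instead of the stand-in dicts.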


Why is it not outputting 'b' ever?

You're assuming that the tags returned from findAll are dicts, when in fact they're not. The BeautifulSoup library that you're using has its own custom classes, in this case BeautifulSoup.Tag, which may work a lot like a dict, but isn't.

Here, check this out:

    >>> from BeautifulSoup import BeautifulSoup
    >>> doc = ['<html><head><title>Page title</title></head>',
    ...        '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.', 
    ...        '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
    ...        '</html>']
    >>> soup = BeautifulSoup(''.join(doc))
    >>> tag = soup.findAll("p")[0]
    >>> type(tag)
    <class 'BeautifulSoup.Tag'>
    >>> isinstance(tag, dict)
    False

Since it's not actually a dict, you get different (domain-specific) behavior. In this case, in tests against the tag's list of immediate children (the nodes directly contained within the tag), not against its attribute names, so 'type' in tag is False even when the tag has a type attribute.
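
If it helps to see the mismatch in isolation, here is a toy model of it. This is not BeautifulSoup's actual source: the class ToyTag and its fields are invented for illustration, and it simply assumes what was described above, namely that [] reads attributes while in looks at child nodes.

```python
class ToyTag(object):
    """Hypothetical stand-in for a parsed tag; not the real BeautifulSoup.Tag."""

    def __init__(self, name, attrs=None, contents=None):
        self.name = name
        self.attrs = dict(attrs or {})        # attribute name -> value
        self.contents = list(contents or [])  # immediate children

    def __getitem__(self, key):
        # tag['type'] looks in the attributes; KeyError if it is absent
        return self.attrs[key]

    def __contains__(self, item):
        # 'x in tag' looks in the children, never in the attributes
        return item in self.contents

    def get(self, key, default=None):
        # dict-style attribute lookup that does not raise
        return self.attrs.get(key, default)


text_input = ToyTag("input", attrs={"type": "text"})
bare_input = ToyTag("input")

print(text_input["type"])      # attribute lookup succeeds
print("type" in text_input)    # False: "type" is not a child node
print(bare_input.get("type"))  # None rather than a KeyError
```

Two separate special methods back the two operations, which is why one can succeed while the other reports False for the same key.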

It looks like you want to know whether the input tag has a type attribute. According to the BeautifulSoup documentation, you can list the attributes of a tag using tag.attrs and tag.attrMap, and then test for the attribute name there.

    >>> tag.attrs
    [(u'id', u'firstpara'), (u'align', u'center')]
    >>> tag.attrMap
    {u'align': u'center', u'id': u'firstpara'}
    >>> 'id' in tag.attrMap
    True

BeautifulSoup is a really helpful library, but it's one that you have to play with a bit to get the results you want. Make sure to spend time in the interactive console playing with the classes, and remember to use the help(someobject) syntax to see what you're playing with and what methods it has.
