Using output from BeautifulSoup in Python

Hey all, I am using BeautifulSoup (after struggling unsuccessfully with Scrapy for two days) to scrape StarCraft 2 league data, but I have run into a problem.

I have a table of results, and I want the string content of every a tag in each row, which I get like this:

from BeautifulSoup import *
from urllib import urlopen

def parseWithSoup(url):
    print "Reading:" , url
    html = urlopen(url).read().lower()
    bs = BeautifulSoup(html)
    table = bs.find(lambda tag: tag.name=='table' and tag.has_key('id') and tag['id']=="tblt_table") 
    rows = table.findAll(lambda tag: tag.name=='tr')

    rows.pop(0) #first row is header
    for row in rows:
        tags = row.findAll(lambda tag: tag.name=='a')
        content = []
        for tagcontent in tags:
            content.append(tagcontent.string)
        print content

if __name__ == '__main__':
    content = "http://www.teamliquid.net/tlpd/sc2-international/games#tblt-5018-1-1-DESC"
    metSoup = parseWithSoup(content)

However, the output is as follows:

[u'+', u'gadget show live i..', u'crevasse', u'naniwa', u'socke']
[u'+', u'gadget show live i..', u'metalopolis 1.1', u'naniwa', u'socke']
[u'+', u'gadget show live i..', u'shakuras plateau 2.0', u'socke', u'select']
etc...

My question is: where does the u'' come from (is it from Unicode?), and how can I remove it? I just need the strings that are inside the u''.


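The u means the value is a Unicode string. It shows up only because printing a list displays the repr of each element; the u is not part of the string's content. It doesn't change anything for you as a programmer, so you can disregard it and treat these like normal strings. You actually want the u there.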

Be aware that all Beautiful Soup output is Unicode. That's a good thing, because if you run across any non-ASCII characters in your scraping, you won't have any problems. If you really want to get rid of the u (I don't recommend it), you can use the unicode string's encode() method, for example encode('utf-8'), which gives you a byte string.
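
To illustrate the point, here is a minimal sketch. It is not part of the original script: the row is copied from the question's output, and the names row, joined and encoded are made up for the example. It is written so that it should run under both Python 2 and Python 3.

```python
# One row as returned by the scraper in the question (unicode strings).
row = [u'+', u'gadget show live i..', u'crevasse', u'naniwa', u'socke']

# Printing a list shows the repr() of each element. Under Python 2 that
# repr includes the u prefix; under Python 3 it does not.
print(row)

# Printing the strings themselves shows only their text, with no prefix.
for cell in row:
    print(cell)

# Joining the cells gives one plain line of text, which is usually what
# you want for display.
joined = u", ".join(row)
print(joined)

# Only if byte strings are really required: encode each cell explicitly.
# Under Python 2 these print without the u; under Python 3 they are
# bytes objects and print with a b prefix instead.
encoded = [cell.encode("utf-8") for cell in row]
print(encoded)
```

So in most cases nothing needs to be removed: print or join the strings instead of printing the list.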


What you see are Python unicode strings. Check the Python documentation at http://docs.python.org/howto/unicode.html to learn how to deal correctly with unicode strings.
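
One common way to deal with them, sketched here under assumptions, is to keep the strings as unicode inside the program and pick an encoding only when writing them out. The helper names write_rows and read_rows, the file name games.tsv and the tab-separated layout are all hypothetical and do not come from the question.

```python
import io
import os
import tempfile


def write_rows(rows, path):
    """Write each row as one tab-separated line, encoded as UTF-8."""
    # io.open takes an encoding on both Python 2 and Python 3.
    with io.open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(u"\t".join(row) + u"\n")


def read_rows(path):
    """Read the file back into lists of unicode strings."""
    with io.open(path, "r", encoding="utf-8") as handle:
        return [line.rstrip(u"\n").split(u"\t") for line in handle]


# Sample rows taken from the output shown in the question.
rows = [
    [u"crevasse", u"naniwa", u"socke"],
    [u"metalopolis 1.1", u"naniwa", u"socke"],
]

# Hypothetical output location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "games.tsv")
write_rows(rows, path)
round_trip = read_rows(path)
print(round_trip == rows)
```

The file on disk contains only the text, so the u prefix never appears there either.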
