Working with strings in Python produces strange quotation marks

Currently I am working with Scrapy, a web crawling framework based on Python. The data is extracted from HTML using XPath. (I am new to Python.) To wrap the data, Scrapy uses items, e.g.

item = MyItem()

item['id'] = obj.select('div[@class="id"]').extract()

When the id is printed with print item['id'], I get the following output:

[u'12346']

My problem is that this output is not always in the same form. Sometimes I get an output like

"[u""someText""]"

This happens only with text, but there is actually nothing special about that text compared to other text that is handled correctly, just like the ID.

Does anyone know what the quotation marks mean? Like I said the someText was crawled like all other text data, e.g. from

<a>someText</a>

Any ideas?

Edit:

My spider crawls all pages of a blog. Here is the exact output

[u'41039'];[u'title]

[u'40942'];"[u""title""]"]

...

Extracted with

item['title'] = site.select('div[@class="header"]/h2/a/@title').extract()

I noticed that it is always the same blog posts that have these quotation marks, so they don't appear randomly. But there is nothing special about the text. E.g. this title produces quotation marks:

<a title="Xtra Pac Telekom web'n'walk Stick Basic für 9,95" href="someURL">
    Xtra Pac Telekom web'n'walk Stick Basic für 9,95</a>

So my first thought was that this is because of some special chars, but there aren't any.

This happens only when the items are written to CSV; when I print them in cmd there are no quotation marks.

Any ideas?


Python can use both single ' and double " quotes as quotation marks. When it prints the representation of a string (as it does for strings inside a list), it normally chooses single quotes, but it switches to double quotes if the text contains a single quote and no double quotes (to avoid having to escape the quote in the string).

So normally it prints [u'....'], but sometimes you have text that contains a ' character, and then it prints [u"...."].
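A minimal sketch of that quote-switching behaviour. It is written for Python 3, so the `u` prefix seen in the question (Python 2) does not appear; the sample strings are taken from the question and the variable names are only illustrative.

```python
# Python 3 shown here; Python 2 would print a u prefix before each string.
plain = ["12346"]
with_apostrophe = ["web'n'walk Stick"]

# Printing a list shows the repr() of each element.
plain_repr = repr(plain)
apostrophe_repr = repr(with_apostrophe)

print(plain_repr)       # ['12346']
print(apostrophe_repr)  # ["web'n'walk Stick"]
```

The second list is shown with double quotes only because its text contains an apostrophe.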

Then there is an extra complication when writing to CSV. If a string written to CSV contains only ' characters, it is written as it is, so [u'....'] is written as [u'....'].

But if it contains double quotes, then (1) the whole field is put inside double quotes and (2) every double quote inside it is doubled. So [u"..."] is written as "[u""...""]". If you read the CSV data back with a CSV library, this is detected and undone, so it will not cause any problems.

So it's a combination of the text containing a single quote (making Python use double quotes) and the CSV quoting rules (which apply to double quotes, but not single quotes).
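The combination can be reproduced with the standard `csv` module alone. This is a sketch under some assumptions: Python 3 (so no `u` prefix), a `;` delimiter as in the output shown in the question, and an in-memory buffer instead of a real file.

```python
import csv
import io

# Fields are the string form of lists, as in the question's output.
row_plain = [str(["41039"]), str(["title"])]
row_apostrophe = [str(["40942"]), str(["web'n'walk"])]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
writer.writerow(row_plain)
writer.writerow(row_apostrophe)

csv_text = buffer.getvalue()
print(csv_text)
# ['41039'];['title']
# ['40942'];"[""web'n'walk""]"

# Reading the data back with the csv module undoes the quoting.
rows = list(csv.reader(io.StringIO(csv_text), delimiter=";"))
print(rows[1][1])  # ["web'n'walk"]
```

Only the field whose text contains a double quote gets wrapped and doubled; the reader restores the original field value.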

If this is a problem, the csv library has various options to change the behaviour - http://docs.python.org/library/csv.html

The Wikipedia page explains the quoting rules in more detail - the behaviour here is shown by the example with "Super, ""luxurious"" truck"
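Two possible ways to get consistent output, sketched with the standard `csv` module in Python 3. The helper `write_rows` and the sample values are hypothetical, and how (or whether) these options can be passed through Scrapy's own CSV export is not covered here. The first variant uses the `quoting=csv.QUOTE_ALL` option so that every field is quoted the same way; the second is not a csv option at all but simply writes the extracted string instead of the list around it.

```python
import csv
import io

def write_rows(rows, **writer_options):
    """Hypothetical helper: write rows to a string using ';' as delimiter."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=";", lineterminator="\n",
                        **writer_options)
    writer.writerows(rows)
    return buffer.getvalue()

extracted_id = ["40942"]
extracted_title = ["web'n'walk Stick"]

# Variant 1: quote every field, so all rows look alike.
quoted_all = write_rows([[str(extracted_id), str(extracted_title)]],
                        quoting=csv.QUOTE_ALL)
print(quoted_all)  # "['40942']";"[""web'n'walk Stick""]"

# Variant 2: write the first extracted value rather than the whole list.
unwrapped = write_rows([[extracted_id[0], extracted_title[0]]])
print(unwrapped)   # 40942;web'n'walk Stick
```

The second variant avoids the list brackets and quotes entirely, at the cost of assuming each extraction returned at least one value.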
