Scrapy Python spider: Storing results in Latin-1, not in unicode

Currently my spider fetches results as needed but encodes them in unicode (UTF-8, I believe). When I save these results to a csv, I have a ton of cleaning to do as a result, with all the [u' & other characters that Scrapy inserts.

How exactly would I store the results as Latin-1 characters, & not unicode? Where exactly would I need to make the change?

Thanks. -TM


The item_extracted is of type unicode. You can encode it to Latin-1 where it's extracted (in the parse function), in an item pipeline, or in an output processor.

The easiest way is to add this line to your parse function:

item_to_be_stored = item_extracted.encode('latin-1','ignore')
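
To see what the 'ignore' error handler actually does, here is a minimal sketch. The sample string is made up for illustration; characters that have no Latin-1 equivalent are silently dropped with 'ignore', while 'replace' substitutes a question mark so the loss stays visible.

```python
# Sample text (hypothetical): e-acute exists in Latin-1,
# the en dash and the euro sign do not.
text = u'caf\xe9 \u2013 \u20ac5'

# 'ignore' drops characters that cannot be encoded.
ignored = text.encode('latin-1', 'ignore')

# 'replace' substitutes '?' for characters that cannot be encoded.
replaced = text.encode('latin-1', 'replace')

print(repr(ignored))
print(repr(replaced))
```

If losing characters silently is a concern, 'replace' (or the default 'strict', which raises an exception) may be a safer choice than 'ignore'.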

Or you could define a function and attach it to a field of your item class:

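from scrapy.item import Item, Field
from scrapy.utils.python import unicode_to_str

def u_to_str(text):
    # Return the encoded value (the result must not be discarded).
    return unicode_to_str(text, 'latin-1', 'ignore')

class YourItem(Item):
    # Pass the function itself, not the result of calling it.
    # Field processors only take effect when the item is filled
    # through an ItemLoader, which hands the output processor the
    # list of collected values, so u_to_str may need to be mapped
    # over that list.
    name = Field(output_processor=u_to_str)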
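
The item pipeline option mentioned above could look roughly like the sketch below. The names Latin1Pipeline and to_latin1 are hypothetical; the only Scrapy convention relied on is that a pipeline is a plain class with a process_item(self, item, spider) method that returns the item. A plain dict stands in for the item so the example runs without Scrapy installed. It also handles list values, since extracted fields are often lists of strings.

```python
TEXT = type(u'')  # unicode on Python 2, str on Python 3


def to_latin1(value):
    """Encode text (or each element of a list of text) to Latin-1 bytes."""
    if isinstance(value, TEXT):
        return value.encode('latin-1', 'ignore')
    if isinstance(value, (list, tuple)):
        return [to_latin1(v) for v in value]
    return value  # leave numbers, None, etc. untouched


class Latin1Pipeline(object):
    """Hypothetical pipeline that encodes every text field of an item."""

    def process_item(self, item, spider):
        for key in list(item.keys()):
            item[key] = to_latin1(item[key])
        return item


# Stand-in for a scraped item; a real spider would yield an Item instance.
item = {'name': [u'caf\xe9'], 'price': 5}
result = Latin1Pipeline().process_item(item, spider=None)
print(result)
```

In a real project the pipeline would also have to be enabled through the ITEM_PIPELINES setting.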


If your problem is what you say it is, the solution is as simple as casting to a string, as long as the text contains only ASCII characters.

>>> a = u'spam and eggs'
>>> a
u'spam and eggs'
>>> type(a)
<type 'unicode'>
>>> b = str(a)
>>> b
'spam and eggs'
>>> type(b)
<type 'str'>

EDIT: Since str() raises an exception when the text contains non-ASCII characters, it might be a good idea to wrap the call in a try/except:

try:
    b = str(a)
except UnicodeError:
    # %r prints an ASCII-safe representation of the offending string
    print "Skipping string %r" % a
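
For context on why the cast can fail: on Python 2, str() on a unicode object encodes it with the default ASCII codec. The sketch below makes that step explicit with encode('ascii'), so it behaves the same on Python 2 and 3, and contrasts it with an explicit Latin-1 encode. The sample string is made up for illustration.

```python
a = u'sp\xe4m and eggs'  # contains a-umlaut, which is not ASCII

# What str(a) does implicitly on Python 2: encode with the ASCII codec.
try:
    a.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly to Latin-1 succeeds, because a-umlaut is in Latin-1.
latin = a.encode('latin-1')

print(ascii_failed)
print(repr(latin))
```

So rather than skipping strings that fail the cast, encoding explicitly to Latin-1 keeps every character that Latin-1 can represent.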