Scrapy Python spider: Storing results in Latin-1, not in unicode
Currently my spider fetches results as needed but encodes them in unicode (UTF-8, I believe). When I save these results to a CSV, I have a ton of cleaning to do as a result, with all the [u' and other characters that Scrapy inserts.
How exactly would I store the results as Latin-1 characters, and not unicode? Where exactly would I need to make the change?
Thanks. -TM
The item_extracted is of type unicode. You can encode it to Latin-1 where it is extracted (in the parse function), in an item pipeline, or in an output processor.
The easiest way is to add this line to your parse function:
item_to_be_stored = item_extracted.encode('latin-1','ignore')
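To see what the 'ignore' argument actually does, here is a minimal sketch with a made-up sample value standing in for item_extracted (the data and the extra variable names are illustrative, not from the spider). Characters that Latin-1 cannot represent are silently dropped with 'ignore', whereas 'replace' leaves a visible '?' in their place.

```python
# Sample value standing in for item_extracted (hypothetical data).
# \xe9 is e-acute (in Latin-1); \u20ac is the euro sign (not in Latin-1).
item_extracted = u'caf\xe9 \u20ac5'

# 'ignore' drops characters that Latin-1 cannot represent.
ignored = item_extracted.encode('latin-1', 'ignore')

# 'replace' substitutes '?' for them instead, so the loss stays visible.
replaced = item_extracted.encode('latin-1', 'replace')

# Strict encoding (the default) raises on the first unmappable character.
try:
    item_extracted.encode('latin-1')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

If losing characters silently is a concern, 'replace' or the strict default may be the safer choice.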
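Or you could define a function and attach it to a field in your item class (field processors are applied when the item is filled through an ItemLoader):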
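from scrapy.item import Item, Field
from scrapy.utils.python import unicode_to_str

def u_to_str(text):
    # Encode unicode text to Latin-1, dropping characters it cannot represent
    return unicode_to_str(text, 'latin-1', 'ignore')

class YourItem(Item):
    # Pass the function itself (no parentheses) so it is called later, not at class definition
    name = Field(output_processor=u_to_str)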
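The item pipeline option mentioned above could look roughly like the sketch below. The class name Latin1Pipeline is made up for illustration, and the code assumes the item's fields hold either unicode strings or lists of them (selector extraction returns lists). It relies only on the process_item(item, spider) method that Scrapy calls on every pipeline.

```python
class Latin1Pipeline(object):
    """Hypothetical item pipeline that encodes every text field to Latin-1."""

    def process_item(self, item, spider):
        for key in list(item.keys()):
            value = item[key]
            # Extracted values are often lists of strings, so handle both cases.
            if isinstance(value, (list, tuple)):
                item[key] = [self._encode(v) for v in value]
            else:
                item[key] = self._encode(value)
        return item

    @staticmethod
    def _encode(value):
        # type(u'') is the text type; leave numbers and other types untouched.
        if isinstance(value, type(u'')):
            return value.encode('latin-1', 'ignore')
        return value
```

The pipeline would need to be enabled through the ITEM_PIPELINES setting of your project before Scrapy runs it.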
If your problem is what you say it is, the solution is as simple as casting to a string.
>>> a = u'spam and eggs'
>>> a
u'spam and eggs'
>>> type(a)
<type 'unicode'>
>>> b = str(a)
>>> b
'spam and eggs'
>>> type(b)
<type 'str'>
EDIT: In Python 2, str() encodes with the ASCII codec by default, so it raises a UnicodeEncodeError on any non-ASCII character. It is therefore a good idea to wrap the call in a try/except:
try:
    str(a)
except UnicodeError:
    print "Skipping string %s" % a
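A slightly fuller sketch of the same idea, assuming you want to process a whole list of scraped values and keep track of the ones that cannot be converted. It calls encode() explicitly with the ASCII codec (the default that str() uses in Python 2); the function name split_convertible and the sample data are made up for illustration.

```python
def split_convertible(values, codec='ascii'):
    """Return (converted, skipped) for a list of text values."""
    converted, skipped = [], []
    for value in values:
        try:
            converted.append(value.encode(codec))
        except UnicodeError:
            # UnicodeEncodeError is a subclass of UnicodeError.
            skipped.append(value)
    return converted, skipped

# With ASCII, the accented string is skipped.
converted, skipped = split_convertible([u'spam and eggs', u'caf\xe9'])

# With Latin-1, both strings convert.
latin1, latin1_skipped = split_convertible(
    [u'spam and eggs', u'caf\xe9'], codec='latin-1')
```

Passing codec='latin-1' ties this back to the original question: anything representable in Latin-1 is converted, and only the rest ends up in the skipped list.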