Problem with encode/decode in Python, Django and BeautifulSoup
In this code:
soup=BeautifulSoup(program.Description.encode('utf-8'))
name=soup.find('div',{'class':'head'})
print name.string.decode('utf-8')
an error happens when I try to print the value or save it to the database.
It doesn't matter what I do:
print name.string.encode('utf-8')
or just
print name.string
Traceback (most recent call last):
File "./manage.py", line 16, in <module>
execute_manager(settings)
File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/__init__.py", line 362, in execute_manager
utility.execute()
File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/__init__.py", line 303, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/base.py", line 195, in run_from_argv
self.execute(*args, **options.__dict__)
File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/base.py", line 222, in execute
output = self.handle(*args, **options)
File "/usr/local/cluster/dynamic/website/video/remmedia/management/commands/remmedia.py", line 50, in handle
self.FirstTimeLoad()
File "/usr/local/cluster/dynamic/website/video/remmedia/management/commands/remmedia.py", line 115, in FirstTimeLoad
print name.string.decode('utf-8')
File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-5: ordinal not in range(128)
This is repr(name.string):
u'\u0412\u044b\u043f\u0443\u0441\u043a \u043e\u0442 27 \u0434\u0435\u043a\u0430\u0431\u0440\u044f'
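I don't know what you are trying to do with name.string.decode('utf-8'). As the BeautifulSoup documentation eloquently points out, "BeautifulSoup gives you Unicode, dammit". So name.string is already decoded: it is a unicode object. You can encode it back to UTF-8 if you want to, but you can't decode it any further. In Python 2, calling .decode() on a unicode object first encodes it implicitly with the default 'ascii' codec, which is why the failure is reported as a UnicodeEncodeError.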
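To make the direction of each operation concrete, here is a minimal sketch. It uses a shortened version of the value shown in the question's repr(name.string); the variable names are only illustrative, and the snippet avoids print so that it does not depend on the output encoding.

```python
# A shortened version of the unicode value from the question.
title = u'\u0412\u044b\u043f\u0443\u0441\u043a \u043e\u0442 27'

# encode: text -> bytes. This is the only direction that applies to title.
utf8_bytes = title.encode('utf-8')

# decode: bytes -> text. Decoding the bytes gives the original text back.
round_trip = utf8_bytes.decode('utf-8')

# Encoding to ASCII is what Python 2 attempts implicitly before
# unicode.decode(); it cannot succeed for Cyrillic characters.
try:
    title.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

Here round_trip equals title, and ascii_failed ends up True, which mirrors the error in the traceback.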
You can try:
print name.string.encode('ascii', 'replace')
The output should be accepted whatever the encoding of sys.stdout is (including None), although every non-ASCII character is printed as a question mark.
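As a rough illustration of what the 'replace' error handler does, and of two other standard handlers that lose less information, consider the sketch below. The sample word is an arbitrary choice.

```python
word = u'h\xe9risson'  # the word "herisson" with an e-acute, written as an escape

# 'replace' substitutes '?' for every character ASCII cannot represent.
replaced = word.encode('ascii', 'replace')

# 'backslashreplace' keeps the code point as a Python-style escape sequence.
escaped = word.encode('ascii', 'backslashreplace')

# 'xmlcharrefreplace' emits an XML/HTML numeric character reference.
xml_ref = word.encode('ascii', 'xmlcharrefreplace')
```

All three results are plain ASCII byte strings, so they can be printed regardless of the encoding of sys.stdout. Which handler is preferable depends on whether the output is meant for a person, a log, or an HTML page.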
In fact, the file-like object that you are printing to might not accept UTF-8. Here is an example: if you have the apparently benign program
# -*- coding: utf-8 -*-
print u"hérisson"
then running it in a terminal that can print accented characters works fine:
lebigot@weinberg /tmp % python2.5 test.py
hérisson
but printing to a standard output connected to a Unix pipe does not:
lebigot@weinberg /tmp % python2.5 test.py | cat
Traceback (most recent call last):
File "test.py", line 3, in <module>
print u"hérisson"
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
because sys.stdout has encoding None in this case: Python assumes that the program reading from the pipe should receive ASCII, and printing fails because ASCII cannot represent the word that we want to print. A solution like the one above solves the problem.
Note: You can check the encoding of your standard output with:
import sys
print sys.stdout.encoding
This can help you debug encoding problems.
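One way to combine this check with the 'replace' approach is a small helper that uses the stream's encoding when it is known and falls back to ASCII otherwise. This is only a sketch: encode_for_stream, FakePipe and FakeTerminal are hypothetical names introduced for illustration, and the two fake classes merely stand in for a piped and an interactive stdout.

```python
import sys

def encode_for_stream(text, stream=None, fallback='ascii'):
    """Encode text for a stream, falling back when its encoding is unknown."""
    if stream is None:
        stream = sys.stdout
    # A pipe or redirected file in Python 2 reports an encoding of None.
    encoding = getattr(stream, 'encoding', None) or fallback
    return text.encode(encoding, 'replace')

class FakePipe(object):
    """Stand-in for a stdout whose encoding could not be determined."""
    encoding = None

class FakeTerminal(object):
    """Stand-in for a terminal that expects UTF-8."""
    encoding = 'utf-8'

piped = encode_for_stream(u'h\xe9risson', FakePipe())
on_terminal = encode_for_stream(u'h\xe9risson', FakeTerminal())
```

In the Python 2 script from the question, the call would look like print encode_for_stream(name.string), which degrades to question marks when the output goes through a pipe instead of raising an exception.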
Edit: name.string comes from BeautifulSoup, so it is presumably already a unicode string.
However, your error message mentions 'ascii':
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-5:
ordinal not in range(128)
According to the PrintFails page on the Python wiki, if Python does not know or
cannot determine which encoding your output device expects, it sets
sys.stdout.encoding to None, and print attempts to encode its arguments with
the 'ascii' codec.
I believe this is the cause of your problem. You can confirm this by checking
whether print sys.stdout.encoding prints None.
According to the same page, you can circumvent the problem by
explicitly telling Python which encoding to use. You do that by wrapping
sys.stdout in an instance of StreamWriter.
For example, you could try adding
import codecs
import sys
import locale
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)
to your script before the print statement. You may have to change
locale.getpreferredencoding() to an explicit encoding (e.g. 'utf-8',
'cp1252', etc.). The right encoding depends on your output device:
it should be set to whatever encoding the device expects. If
you are outputting to a terminal, the terminal may have a menu setting that lets
you choose which encoding it should expect.
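To see what such a StreamWriter does without touching the real sys.stdout, the sketch below wraps an in-memory byte buffer instead. It assumes Python 2.6 or later (the io module is not available in 2.5), and 'utf-8' is an arbitrary choice of explicit encoding.

```python
import codecs
import io

# An in-memory byte stream standing in for a byte-oriented stdout.
buffer = io.BytesIO()

# codecs.getwriter returns a StreamWriter class for the named encoding;
# instantiating it wraps the underlying byte stream.
writer = codecs.getwriter('utf-8')(buffer)

# Unicode text written to the wrapper is encoded before reaching the buffer.
writer.write(u'h\xe9risson\n')

written = buffer.getvalue()
```

The buffer ends up holding the UTF-8 bytes of the text, which is what a terminal or pipe expecting UTF-8 would receive once sys.stdout is wrapped the same way.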
Original answer: Try:
print name.string
or
print name.string.encode('utf-8')
Try:
text = text.decode("utf-8", "replace")