Problem with encode decode. Python. Django. BeautifulSoup

2023-01-11 06:27

In this code:

   soup=BeautifulSoup(program.Description.encode('utf-8'))
   name=soup.find('div',{'class':'head'})
   print name.string.decode('utf-8')

An error occurs when I try to print the result or save it to the database.

It doesn't matter which of these I do:

   print name.string.encode('utf-8')

or just

   print name.string


Traceback (most recent call last):
  File "./manage.py", line 16, in <module>
    execute_manager(settings)
  File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/__init__.py", line 362, in execute_manager
    utility.execute()
  File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/__init__.py", line 303, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/base.py", line 195, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/base.py", line 222, in execute
    output = self.handle(*args, **options)
  File "/usr/local/cluster/dynamic/website/video/remmedia/management/commands/remmedia.py", line 50, in handle
    self.FirstTimeLoad()
  File "/usr/local/cluster/dynamic/website/video/remmedia/management/commands/remmedia.py", line 115, in FirstTimeLoad
    print name.string.decode('utf-8')
  File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-5: ordinal not in range(128)

This is repr(name.string):

u'\u0412\u044b\u043f\u0443\u0441\u043a \u043e\u0442 27 \u0434\u0435\u043a\u0430\u0431\u0440\u044f'

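I don't know what you are trying to do with name.string.decode('utf-8'). As the BeautifulSoup documentation eloquently points out, "BeautifulSoup gives you Unicode, dammit". So name.string is already decoded: it is a unicode object. You can encode it back to UTF-8 if you want to, but you can't decode it any further. In Python 2, calling decode() on a unicode object first encodes it implicitly with the default 'ascii' codec, which is why the traceback ends in a UnicodeEncodeError rather than a decode error.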
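To make the distinction concrete, here is a minimal sketch. It is written to run on both Python 2.6+ and Python 3, and the variable names are mine, not from the question. It takes the string shown in the question's repr output, encodes it to UTF-8 bytes, decodes those bytes back, and shows that it is the 'ascii' codec that rejects the text:

```python
# The value of repr(name.string) from the question, as a unicode string.
text = u'\u0412\u044b\u043f\u0443\u0441\u043a \u043e\u0442 27 \u0434\u0435\u043a\u0430\u0431\u0440\u044f'

# Encoding turns unicode text into bytes; decoding those bytes gives the text back.
utf8_bytes = text.encode('utf-8')
round_trip = utf8_bytes.decode('utf-8')

# The 'ascii' codec cannot represent Cyrillic letters, so encoding with it fails.
try:
    text.encode('ascii')
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc
```

After this runs, round_trip equals text again, and ascii_error holds the same kind of exception as in the traceback.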

You can try:

print name.string.encode('ascii', 'replace')

This replaces every non-ASCII character with '?', so the output should be accepted whatever the encoding of sys.stdout is (including None).
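'replace' is only one of the standard error handlers that encode() accepts. As a rough illustration (the variable names are made up for this example, and the code is meant to run on Python 2.6+ and Python 3), here is what a few of them produce for a word with one accented letter:

```python
# u"hérisson", written with an escape so the source file itself stays ASCII.
word = u'h\xe9risson'

replaced = word.encode('ascii', 'replace')           # unencodable character becomes '?'
ignored = word.encode('ascii', 'ignore')             # unencodable character is dropped
escaped = word.encode('ascii', 'backslashreplace')   # becomes a \xe9 escape sequence
as_xml = word.encode('ascii', 'xmlcharrefreplace')   # becomes an &#233; character reference
```

Which handler is appropriate depends on who reads the output: 'replace' is fine for a quick debug print, while 'xmlcharrefreplace' keeps the information if the text ends up in HTML.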

In fact, the file-like object that you are printing to might not accept UTF-8. Here is an example: if you have the apparently benign program

# -*- coding: utf-8 -*-
print u"hérisson"

then running it in a terminal that can print accented characters works fine:

lebigot@weinberg /tmp % python2.5 test.py 
hérisson

but printing to a standard output connected to a Unix pipe does not:

lebigot@weinberg /tmp % python2.5 test.py | cat
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    print u"hérisson"
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)

because sys.stdout.encoding is None in this case: Python assumes that the program reading from the pipe expects ASCII, and printing fails because ASCII cannot represent the word we want to print. Encoding explicitly, as in the encode('ascii', 'replace') call above, solves the problem.

Note: You can check the encoding of your standard output with:

import sys
print sys.stdout.encoding

This can help you debug encoding problems.
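Building on that check, one possible approach is a small helper that picks the stream's encoding when there is one and falls back to ASCII otherwise. This is only a sketch: encode_for_stream and FakePipe are hypothetical names invented for the example, not part of Python or Django, and the code is written to run on Python 2.6+ and Python 3.

```python
def encode_for_stream(text, stream, fallback='ascii'):
    """Encode unicode text for a stream whose encoding may be missing or None."""
    # Objects such as a piped stdout report encoding None; use the fallback then.
    encoding = getattr(stream, 'encoding', None) or fallback
    # 'replace' guarantees that encoding never raises UnicodeEncodeError.
    return text.encode(encoding, 'replace')


class FakePipe(object):
    """Stand-in for sys.stdout when it is connected to a pipe (hypothetical)."""
    encoding = None


class FakeUtf8Terminal(object):
    """Stand-in for sys.stdout on a UTF-8 terminal (hypothetical)."""
    encoding = 'utf-8'


piped = encode_for_stream(u'h\xe9risson', FakePipe())
on_terminal = encode_for_stream(u'h\xe9risson', FakeUtf8Terminal())
```

In a Python 2 script you would pass sys.stdout as the stream and print the returned byte string.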

Edit: name.string comes from BeautifulSoup, so it is presumably already a unicode string.

However, your error message mentions 'ascii':

UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-5:
ordinal not in range(128)

According to the PrintFails Python wiki page, if Python does not know or cannot determine what encoding your output device expects, it sets sys.stdout.encoding to None, and print attempts to encode its arguments with the 'ascii' codec.

I believe this is the cause of your problem. You can confirm this by checking whether print sys.stdout.encoding prints None.

According to the same page, you can circumvent the problem by explicitly telling Python what encoding to use. You do that by wrapping sys.stdout in an instance of StreamWriter.

For example, you could try adding

import codecs
import locale
import sys
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)

to your script before the print statement. You may have to change locale.getpreferredencoding() to an explicit encoding (e.g. 'utf-8', 'cp1252', etc.). The right encoding depends on your output device: set it to whatever encoding that device expects. If you are outputting to a terminal, the terminal may have a menu setting that lets you choose the encoding it expects.
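If you want to see what such a wrapper does without touching the real sys.stdout, here is a small sketch that wraps an in-memory byte buffer instead. The buffer and variable names are assumptions made for the illustration; codecs.getwriter and io.BytesIO are standard library calls, and the code should run on Python 2.6+ and Python 3.

```python
import codecs
import io

# A byte-oriented buffer standing in for an output device that expects UTF-8.
raw = io.BytesIO()
writer = codecs.getwriter('utf-8')(raw)   # StreamWriter that encodes on write
writer.write(u'h\xe9risson\n')

# The same idea for a device that only accepts ASCII: pass an error handler
# so unencodable characters are replaced instead of raising an exception.
lenient_raw = io.BytesIO()
lenient = codecs.getwriter('ascii')(lenient_raw, 'replace')
lenient.write(u'h\xe9risson\n')
```

The first buffer ends up holding the UTF-8 bytes of the word, the second holds the word with '?' in place of the accented letter.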

Original answer: Try:

 print name.string

 print name.string.encode('utf-8')

Try:

text = text.decode("utf-8", "replace")
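Keep in mind that decode() is meant for byte strings. A hedged sketch of a guard around this call follows; to_unicode is a hypothetical helper name, and the code is written to run on Python 2.6+ and Python 3. It decodes only when it is handed bytes and leaves unicode text, such as the name.string from the question, untouched:

```python
def to_unicode(value, encoding='utf-8'):
    """Return unicode text, decoding only when value is a byte string."""
    if isinstance(value, bytes):
        # 'replace' swaps undecodable bytes for U+FFFD instead of raising.
        return value.decode(encoding, 'replace')
    # Already unicode: decoding it again is what triggers the implicit
    # ASCII encode (and the UnicodeEncodeError) under Python 2.
    return value


from_bytes = to_unicode(b'h\xc3\xa9risson')    # valid UTF-8 bytes
from_text = to_unicode(u'h\xe9risson')         # already unicode, returned as-is
from_bad_bytes = to_unicode(b'h\xe9risson')    # not valid UTF-8
```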
