Python encoding problems

2023-03-30 17:43 Q&A author:

I've read a lot about Python encoding, maybe not enough, but I've been working on this for two days and I'm still stuck. I'll try to be as clear as I can. The goal is to remove all accents and characters such as #, !, %, &...

I query the Twitter Search API with this call:

query = urllib2.urlopen(settings.SEARCH_URL + '?%s' % params)

Then, I call a method (avaliar_pesquisa()) to evaluate the results I've got, based on the tags (or terms) of the input:

dados = avaliar_pesquisa(simplejson.loads(query.read()), str(tags))

In avaliar_pesquisa(), the following happens:

def avaliar_pesquisa(dados, tags):
    resultados = []
    # Iterate over the results
    for i in dados['results']:
        resultados.append({'texto'          : i['text'],
                           'imagem'         : i['profile_image_url'],
                           'classificacao'  : avaliar_texto(i['text'], tags),
                           'timestamp'      : i['created_at'],
                         })

Note the call to avaliar_texto(), which evaluates the tweet text. The problem is exactly in the following lines:

def avaliar_texto(texto, tags):
    # Remove accents
    from unicodedata import normalize
    def strip_accents(txt):
        return normalize('NFKD', txt.decode('utf-8'))

    # Split
    texto_split = strip_accents(texto)
    texto_split = texto.lower().split()

    # Remove non-alpha characters
    import re
    pattern = re.compile('[\W_]+')
    texto_aux = []
    for i in texto_split:
        texto_aux.append(pattern.sub('', i))
    texto_split = texto_aux

The split doesn't really matter here. If I print the type of the variable texto in this last method, I may get either str or unicode. If the text contains any accent, it comes in as unicode. So, I get this error when running the application, which receives at most 100 tweets per response:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 17: ordinal not in range(128)

For the following text:

Text: Agora o problema é com o speedy. <type 'unicode'>

Any ideas?

See this page.

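The decode() method is meant to be applied to a str object, not a unicode object. Given a unicode string as input, Python 2 first implicitly encodes it to a str using the ascii codec and only then decodes the result as utf-8. The implicit encode step is what fails on a non-ASCII character such as é, which is why the traceback shows a UnicodeEncodeError rather than a decode error.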
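The implicit step can be made explicit to see where it breaks. The following is only a sketch written for Python 3, where the implicit conversion no longer exists, so the helper python2_style_decode is a hypothetical name that spells out what Python 2 does behind the scenes when decode('utf-8') is called on a unicode object:

```python
def python2_style_decode(text):
    """Mimic Python 2's unicode.decode('utf-8') (hypothetical helper).

    Python 2 first encodes the text with the ascii codec, then decodes
    the resulting bytes as utf-8. Only the first step can fail here.
    """
    as_bytes = text.encode('ascii')   # raises UnicodeEncodeError on non-ASCII text
    return as_bytes.decode('utf-8')


# Pure ASCII text survives the round trip unchanged.
print(python2_style_decode(u'speedy'))

# Text with an accent fails in the encode step, as in the question.
try:
    python2_style_decode(u'Agora o problema \xe9 com o speedy.')
except UnicodeEncodeError as err:
    print(type(err).__name__, 'at position', err.start)
```

This also explains why only some tweets trigger the error: text without accents passes through the ascii step untouched.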

Try return normalize('NFKD', unicode(txt)) instead.
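A more defensive variant decodes only when the value really is a byte string. The sketch below assumes Python 3, where bytes and str play the roles of Python 2's str and unicode; to_text is a hypothetical helper name, and utf-8 is assumed as the encoding of any byte input. It also drops the combining marks after normalizing, since NFKD by itself only splits é into e plus a combining accent and does not remove anything:

```python
import unicodedata


def to_text(value, encoding='utf-8'):
    """Return text, decoding only when given bytes (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def strip_accents(value):
    """Decompose characters, then drop the combining accent marks."""
    decomposed = unicodedata.normalize('NFKD', to_text(value))
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


# Works the same for text and for utf-8 encoded bytes.
print(strip_accents(u'problema \xe9'))
print(strip_accents(u'problema \xe9'.encode('utf-8')))
```

With this shape, the caller does not need to know which of the two types the JSON decoder returned.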

This is what I used in my code to discard accents, etc.

text = unicodedata.normalize('NFD', text).encode('ascii','ignore')
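As a self-contained illustration of that one-liner, here is a sketch for Python 3 (where encode() returns bytes, so the result is decoded back to text). The function name ascii_fold is made up for this example. Note the trade-off: characters that have no decomposition, such as ø, are dropped entirely instead of being replaced by a base letter:

```python
import unicodedata


def ascii_fold(text):
    """Decompose accented characters, then discard everything non-ASCII."""
    decomposed = unicodedata.normalize('NFD', text)
    # 'ignore' silently drops the combining marks and any other non-ASCII character.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(ascii_fold(u'Agora o problema \xe9 com o speedy.'))
print(ascii_fold(u'S\xf8ren'))  # the undecomposable letter is dropped
```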

Try placing:

# -*- coding: utf-8 -*-

at the beginning of your python script containing the code.
