开发者

Writing UTF-8 String to MySQL with Python

I am trying to push user account data from an Active Directory to our MySQL-Server. This works flawlessly but somehow the strings end up showing an encoded version of umlauts and other special characters.

The Active Directory returns a string using this sample format: M\xc3\xbcller

This actually is the UTF-8 encoding for Müller, but I want to write Müller to my database not M\xc3\xbcller.

I tried converting the string w开发者_开发问答ith this line, but it results in the same string in the database: tempEntry[1] = tempEntry[1].decode("utf-8")

If I run print "M\xc3\xbcller".decode("utf-8") in the python console the output is correct.

Is there any way to insert this string the right way? I need this specific format for a web developer who wants to have this exact format, I don't know why he is not able to convert the string using PHP directly.

Additional info: I am using MySQLdb; The table and column encoding is utf8_general_ci


As @marr75 suggests, make sure you set charset='utf8' on your connections. Setting use_unicode=True is not strictly necessary as it is implied by setting the charset.

Then make sure you are passing unicode objects to your db connection as it will encode it using the charset you passed to the cursor. If you are passing a utf8-encoded string, it will be doubly encoded when it reaches the database.

So, something like:

conn = MySQLdb.connect(host="localhost", user='root', password='', db='', charset='utf8')
data_from_ldap = 'M\xc3\xbcller'
name = data_from_ldap.decode('utf8')
cursor = conn.cursor()
cursor.execute(u"INSERT INTO mytable SET name = %s", (name,))

You may also try forcing the connection to use utf8 by passing the init_command param, though I'm unsure if this is required. 5 mins testing should help you decide.

conn = MySQLdb.connect(charset='utf8', init_command='SET NAMES UTF8')

Also, and this is barely worth mentioning as 4.1 is so old, make sure you are using MySQL >= 4.1


Assuming you are using MySQLdb you need to pass use_unicode=True and charset="utf8" when creating your connection.

UPDATE: If I run the following against a test table I get -

>>> db = MySQLdb.connect(host="localhost", user='root', passwd='passwd', db='sandbox', use_unicode=True, charset="utf8")
>>> c = db.cursor()
>>> c.execute("INSERT INTO last_names VALUES(%s)", (u'M\xfcller', ))
1L
>>> c.execute("SELECT * FROM last_names")
1L
>>> print c.fetchall()
(('M\xc3\xbcller',),)

This is "the right way", the characters are being stored and retrieved correctly, your friend writing the php script just isn't handling the encoding correctly when outputting.

As Rob points out, use_unicode and charset combined is being verbose about the connection, but I have a natural paranoia about even the most useful python libraries outside of the standard library so I try to be explicit to make bugs easy to find if the library changes.


import MySQLdb

# connect to the database
db = MySQLdb.connect("****", "****", "****", "****") #don't use charset here

# setup a cursor object using cursor() method
cursor = db.cursor()

cursor.execute("SET NAMES utf8mb4;") #or utf8 or any other charset you want to handle

cursor.execute("SET CHARACTER SET utf8mb4;") #same as above

cursor.execute("SET character_set_connection=utf8mb4;") #same as above

# run a SQL question
cursor.execute("****")

#and make sure the MySQL settings are correct, data too


I found the solution to my problems. Decoding the String with .decode('unicode_escape').encode('iso8859-1').decode('utf8') did work at last. Now everything is inserted as it should. The full other solution can be found here: Working with unicode encoded Strings from Active Directory via python-ldap


Recently I had the same issue with field value being a byte string instead of unicode. Here's a little analysis.

Overview

In general all one needs to do to have unicode values from a cursor, is to pass charset argument to connection constructor and have non-binary table fields (e.g. utf8_general_ci). Passing use_unicode is useless because it is set to true whenever charset has a value.

MySQLdb respects cursor description field types, so if you have a DATETIME column in cursor the values will be converted to Python datatime.datetime instances, DECIMAL to decimal.Decimal and so on, but binary values will be represented as is, by byte strings. Most of decoders are defined in MySQLdb.converters, and one can override them on instance basis by providing conv argument to connection constructor.

But unicode decoders are an exception here, which is likely a design shortcoming. They are appended directly to connection instance converters in its constructor. So it's only possible to override them on instance-basic.

Workaround

Let's see the issue code.

import MySQLdb

connection = MySQLdb.connect(user = 'guest', db = 'test', charset = 'utf8')
cursor     = connection.cursor()

cursor.execute(u"SELECT 'abcdё' `s`, ExtractValue('<a>abcdё</a>', '/a') `b`")

print cursor.fetchone() 
# (u'abcd\u0451', 'abcd\xd1\x91')
print cursor.description 
# (('s', 253, 6, 15, 15, 31, 0), ('b', 251, 6, 50331648, 50331648, 31, 1))
print cursor.description_flags 
# (1, 0)

It shows that b field is returned as a byte string instead of unicode. However it is not binary, MySQLdb.constants.FLAG.BINARY & cursor.description_flags[1] (MySQLdb field flags). It seems like bug in the library (opened #90). But the reason for it I see as MySQLdb.constants.FIELD_TYPE.LONG_BLOB (cursor.description[1][1] == 251, MySQLdb field types) just hasn't a converter at all.

import MySQLdb
import MySQLdb.converters as conv
import MySQLdb.constants as const

connection = MySQLdb.connect(user = 'guest', db = 'test', charset = 'utf8')
connection.converter[const.FIELD_TYPE.LONG_BLOB] = connection.converter[const.FIELD_TYPE.BLOB]
cursor = connection.cursor()

cursor.execute(u"SELECT 'abcdё' `s`, ExtractValue('<a>abcdё</a>', '/a') `b`")

print cursor.fetchone()
# (u'abcd\u0451', u'abcd\u0451')
print cursor.description
# (('s', 253, 6, 15, 15, 31, 0), ('b', 251, 6, 50331648, 50331648, 31, 1))
print cursor.description_flags
# (1, 0)

Thus by manipulating connection instance converter dict, it is possible to achieve desired unicode decoding behaviour.

If you want to override the behaviour here's how a dict entry for possible text field looks like after constructor.

import MySQLdb
import MySQLdb.constants as const

connection = MySQLdb.connect(user = 'guest', db = 'test', charset = 'utf8')
print connection.converter[const.FIELD_TYPE.BLOB]
# [(128, <type 'str'>), (None, <function string_decoder at 0x7fa472dda488>)]

MySQLdb.constants.FLAG.BINARY == 128. This means that if a field has binary flag it will be str, otherwise unicode decoder will be applied. So you want to try to convert binary values as well, you can pop the first tuple.


(Would like to reply to above answer but do not have enough reputation...)

The reason why you don't get unicode results in this case:

>>> print c.fetchall()
(('M\xc3\xbcller',),)

is a bug from MySQLdb 1.2.x with *_bin collation, see:

http://sourceforge.net/tracker/index.php?func=detail&aid=1693363&group_id=22307&atid=374932 http://sourceforge.net/tracker/index.php?func=detail&aid=2663436&group_id=22307&atid=374932

In this particular case (collation utf8_bin - or [anything]_bin...) you have to expect the "raw" value, here utf-8 (yes, this sucks as there is no generic fix).


there is another situation maybe a little rare.

if you create a schema in mysqlworkbench firstly,you will get the encoding error and can't solve it by add charset configuration.

it is because mysqlworkbench create schema by latin1 by default, so you should set the charset at first!

Writing UTF-8 String to MySQL with Python


and db.set_character_set('utf8'), imply that use_unicode=True ?

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜