Working with unicode encoded Strings from Active Directory via python-ldap

I already asked about this problem, but after some testing I decided to create a new question with more specific information:

I am reading user accounts with python-ldap (and Python 2.7) from our Active Directory. This works well, but I have problems with special characters. They look like UTF-8 encoded strings when printed on the console. The goal is to write them into a MySQL DB, but I can't get those strings into proper UTF-8 in the first place.

Example (fullentries is my array with all the AD entries):

fullentries[23][1].decode('utf-8', 'ignore')    
print fullentries[23][1].encode('utf-8', 'ignore')
print fullentries[23][1].encode('latin1', 'ignore')
print repr(fullentries[23][1])

A second test with a string inserted by hand as follows:

testentry = "M\xc3\xbcller"
testentry.decode('utf-8', 'ignore')
print testentry.encode('utf-8', 'ignore')
print testentry.encode('latin1', 'ignore')
print repr(testentry)

The output of the first example is:

M\xc3\xbcller
M\xc3\xbcller
u'M\\xc3\\xbcller'

Edit: If I try to replace the double backslashes with .replace('\\\\', '\\'), the output remains the same.

The output of the second example:

Müller
M�ller
'M\xc3\xbcller'

Is there any way to get the AD output properly encoded? I already read a lot of documentation, but it all states that LDAPv3 gives you strictly UTF-8 encoded strings. Active Directory uses LDAPv3.

My older question on this topic is here: Writing UTF-8 String to MySQL with Python

Edit: Added repr(s) output


First, know that printing to a Windows console is often the step that garbles data, so for your tests, you should print repr(s) to see the precise bytes you have in your string.

You need to find out how the data from AD is encoded. Again, print repr(s) will let you see the content of the data.
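To illustrate why repr is the more reliable check, here is a minimal sketch comparing the two strings from the question. The variable names are made up for illustration; the values are taken from the repr output shown above. Real UTF-8 bytes and a string containing literal backslash escapes can look alike when printed, but they differ in type and length:

```python
# -*- coding: utf-8 -*-
# Hypothetical names; values copied from the repr output in the question.
real_bytes = b'M\xc3\xbcller'    # 7 bytes: 0xC3 0xBC is the UTF-8 encoding of u-umlaut
escaped = u'M\\xc3\\xbcller'     # 13 characters: the backslashes are literal text

print(repr(real_bytes))
print(repr(escaped))
print(len(real_bytes))
print(len(escaped))
```

Only the first value decodes directly with .decode('utf-8'); the second is plain ASCII text that merely spells out the escape sequences.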

UPDATED:

OK, it looks like you're getting strange strings somehow. There might be a way to get them better, but you can adapt in any case, though it isn't pretty:

u.decode('unicode_escape').encode('iso8859-1').decode('utf8')
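The one-liner above can be unpacked step by step. This is only a sketch: the helper name unescape_utf8 is hypothetical, and it assumes the input contains nothing but ASCII characters (as in the repr output from the question). It makes the implicit ASCII encoding step explicit so the same code also runs on Python 3:

```python
# -*- coding: utf-8 -*-
def unescape_utf8(text):
    """Convert text holding literal '\\xNN' escapes of UTF-8 bytes into real text.

    Assumes `text` is pure ASCII; .encode('ascii') raises otherwise.
    """
    # 1. Interpret the backslash escapes: each \xNN becomes one code point 0-255.
    code_points = text.encode('ascii').decode('unicode_escape')
    # 2. Latin-1 maps code points 0-255 to identical byte values,
    #    which recovers the original raw bytes.
    raw_bytes = code_points.encode('latin-1')
    # 3. Those raw bytes are UTF-8, so decode them into proper text.
    return raw_bytes.decode('utf-8')


escaped = u'M\\xc3\\xbcller'      # value from the question's repr output
fixed = unescape_utf8(escaped)
print(repr(fixed))
```

The result is a unicode string; encoding it with .encode('utf-8') just before writing to the database should yield the bytes MySQL expects, assuming the connection is configured for UTF-8.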

You might want to look into whether you can get the data in a more natural format.
