Working with unicode encoded Strings from Active Directory via python-ldap

2023-03-25 22:52

I already asked about this problem, but after some testing I decided to open a new question with more specific information:

I am reading user accounts with python-ldap (and Python 2.7) from our Active Directory. This works well, but I have problems with special characters. They look like UTF-8 encoded strings when printed on the console. The goal is to write them into a MySQL DB, but I can't get those strings into proper UTF-8 in the first place.

Example (fullentries is my array with all the AD entries):

fullentries[23][1].decode('utf-8', 'ignore')    
print fullentries[23][1].encode('utf-8', 'ignore')
print fullentries[23][1].encode('latin1', 'ignore')
print repr(fullentries[23][1])

A second test with a string inserted by hand as follows:

testentry = "M\xc3\xbcller"
testentry.decode('utf-8', 'ignore')
print testentry.encode('utf-8', 'ignore')
print testentry.encode('latin1', 'ignore')
print repr(testentry)

The output of the first example is:

M\xc3\xbcller
M\xc3\xbcller
u'M\\xc3\\xbcller'

Edit: If I try to replace the double backslashes with .replace('\\\\','\\') the output remains the same.

The output of the second example:

Müller
M�ller
'M\xc3\xbcller'

Is there any way to get the AD output properly encoded? I have already read a lot of documentation, but it all states that LDAPv3 gives you strictly UTF-8 encoded strings, and Active Directory uses LDAPv3.

My older question on this topic is here: Writing UTF-8 String to MySQL with Python

Edit: Added repr(s) output

First, know that printing to a Windows console is often the step that garbles data, so for your tests, you should print repr(s) to see the precise bytes you have in your string.

You need to find out how the data from AD is encoded. Again, print repr(s) will let you see the content of the data.
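
As a rough illustration of what repr(s) can reveal, the sketch below contrasts a value holding real UTF-8 bytes with one holding the escape sequences as literal text. The helper name looks_escaped is hypothetical and not part of python-ldap; it only checks for a literal backslash followed by "x", which is an assumption about how the damage shows up.

```python
def looks_escaped(value):
    """Return True if value contains a literal backslash followed by 'x'.

    Hypothetical diagnostic helper: it only detects the pattern seen in
    the question, i.e. escape sequences stored as plain text.
    """
    # Pick a marker of the same type as the value (text or bytes).
    marker = u'\\x' if isinstance(value, type(u'')) else b'\\x'
    return marker in value


raw_bytes = b'M\xc3\xbcller'      # 7 bytes: real UTF-8 data
escaped = u'M\\xc3\\xbcller'      # 13 characters: escapes as literal text

print(repr(raw_bytes), len(raw_bytes), looks_escaped(raw_bytes))
print(repr(escaped), len(escaped), looks_escaped(escaped))
```

If the second pattern shows up, the value needs to be unescaped before it can be decoded as UTF-8.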

UPDATED:

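OK, it looks like you're somehow getting strings in which the UTF-8 bytes are written out as literal backslash escapes (the text \xc3\xbc instead of the two bytes it stands for). There might be a way to get them in a better form, but you can adapt in any case, though it isn't pretty: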

u.decode('unicode_escape').encode('iso8859-1').decode('utf8')
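
The one-liner above can be unpacked step by step. The following is a minimal sketch, assuming the value contains only ASCII characters plus literal \xNN escapes, as in the repr output shown in the question. The function name fix_escaped_utf8 is made up for illustration, and codecs.decode is used so the same code runs on Python 2.7 and Python 3.

```python
import codecs


def fix_escaped_utf8(value):
    """Turn text like u'M\\xc3\\xbcller' into u'M\xfcller'.

    Assumes the input is pure ASCII with literal backslash escapes;
    anything else raises UnicodeEncodeError in the first step.
    """
    # 1. The escaped text is plain ASCII, so get it as a byte string.
    ascii_bytes = value.encode('ascii')
    # 2. Interpret the \xNN sequences; each becomes one code point
    #    in the range 0-255.
    unescaped = codecs.decode(ascii_bytes, 'unicode_escape')
    # 3. Map those code points one-to-one back to bytes (Latin-1 does this).
    utf8_bytes = unescaped.encode('iso8859-1')
    # 4. Now decode the bytes as the UTF-8 they really are.
    return utf8_bytes.decode('utf-8')


print(repr(fix_escaped_utf8(u'M\\xc3\\xbcller')))
```

The Latin-1 step is only a trick to turn code points 0-255 back into the bytes with the same values; it does not mean the data was ever Latin-1.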

You might want to look into whether you can get the data in a more natural format.
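
For comparison, here is a sketch of what handling the "natural" format could look like, assuming the entries are (dn, attrs) tuples where attrs maps attribute names to lists of raw byte strings, which is the shape python-ldap search results normally have. The helper decode_entry is hypothetical, and leaving undecodable values untouched is a design choice made here because some Active Directory attributes (objectGUID, for example) are binary rather than text.

```python
def decode_entry(attrs, encoding='utf-8'):
    """Decode the byte-string values of one LDAP entry to unicode text.

    Hypothetical helper: attrs is assumed to be a dict mapping
    attribute names to lists of byte strings.
    """
    decoded = {}
    for name, values in attrs.items():
        try:
            decoded[name] = [v.decode(encoding) for v in values]
        except UnicodeDecodeError:
            # Binary attribute: keep the raw bytes unchanged.
            decoded[name] = list(values)
    return decoded


# Made-up sample entry for illustration only.
sample = {'sn': [b'M\xc3\xbcller'], 'objectGUID': [b'\xff\xfe']}
print(decode_entry(sample))
```

If the values decode cleanly at this point, the resulting unicode objects can be handed to the database layer, and no unescaping workaround is needed.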
