Creating dictionary from a list of special characters

I'm working on a small script that maps each element of a list of characters (including special characters) to its index in order to build a dictionary.

#!/usr/bin/env python
#-*- coding: latin-1 -*-

ln1 = '?0>9<8~7|65"4:3}2{1+_)'
ln2 = "(*&^%$£@!/`'\][=-#¢"

refStr = ln2+ln1

keyDict = {}
for i in range(0,len(refStr)):
    keyDict[refStr[i]] = i


print "-" * 32
print "Originl: ",refStr
print "KeyDict: ", keyDict

# added just to test a few special characters
tsChr = ['£','%','\\','¢']

for k in tsChr:
    if k in keyDict:
        print k, "\t", keyDict[k]
    else: print k, "\t", "not in the dic."

It returns the result like this:

Originl:  (*&^%$£@!/`'\][=-#¢?0>9<8~7|65"4:3}2{1+_)
KeyDict:  {'!': 9, '\xa3': 7, '\xa2': 20, '%': 4, '$': 5, "'": 12, '&': 2, ')': 42, '(': 0, '+': 40, '*': 1, '-': 17, '/': 10, '1': 39, '0': 22, '3': 35, '2': 37, '5': 31, '4': 33, '7': 28, '6': 30, '9': 24, '8': 26, ':': 34, '=': 16, '<': 25, '?': 21, '>': 23, '@': 8, '\xc2': 19, '#': 18, '"': 32, '[': 15, ']': 14, '\\': 13, '_': 41, '^': 3, '`': 11, '{': 38, '}': 36, '|': 29, '~': 27}

This is all good, except that the characters £, ¢ and \ are shown as \xa3, \xa2 and \\ respectively. Does anyone know why printing ln1/ln2 works fine but printing the dictionary does not? How can I fix this? Any help greatly appreciated. Cheers!!


Update 1

I've added extra special characters - # and ¢ and then this is what I get following @Duncan's suggestion:

! 9
? 7
? 20
% 4
$ 5
....
....
8 26
: 34
= 16
< 25
? 21
> 23
@ 8
? 19
....
....

Notice that the 7th, 19th and 20th elements are not printing correctly at all. The 21st element is the actual ? character. Cheers!!


Update 2

Just added this loop to my original post to actually test my purpose:

tsChr = ['£','%','\\','¢']
for k in tsChr:
    if k in keyDict:
        print k, "\t", keyDict[k]
    else: print k, "\t", "not in the dic."

and this is what I get as a result:

£   not in the dic.
%   4
\   13
¢   not in the dic.

Whilst running the script, it thinks that £ and ¢ are not actually in the dictionary, and that's my problem. Does anyone know how to fix that, or what I am doing wrong and where?

Eventually, I'll be checking whether characters from a file (or a line of text) exist in the dictionary, and there is a chance of having characters like é or £ in the text. Cheers!!


When you print a dictionary or list that contains strings, Python displays the repr() of each string. If you print repr(ln2) you'll see that nothing has changed: your dictionary keys are just the latin-1 encoding of characters such as '£'.

If you do:

for k in keyDict:
    print k, keyDict[k]

then the characters will display as you expect.
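The difference between the two displays can be reproduced in a few lines. This is only a minimal sketch, and it assumes Python 3 syntax (the script in the question is Python 2, where a plain str behaves like Python 3 bytes); the names pound_bytes, backslash and key_dict are made up for illustration.

```python
# Assumption: Python 3. Latin-1 encodes the pound sign (U+00A3)
# as the single byte 0xA3.
pound_bytes = u'\u00a3'.encode('latin-1')
backslash = '\\'  # a single backslash character

key_dict = {backslash: 13}

# Printing the container shows repr() of every key, escapes included.
print(key_dict)

# repr() of a key on its own gives the same escaped form.
print(repr(backslash))
print(repr(pound_bytes))

# Printing the string itself shows the actual character.
print(backslash)
```

The escaped form is only how the key is displayed; the key itself is still a single character (or a single byte).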


In my humble opinion it would be useful to learn about Unicode in general and its use in Python.

If you are not interested in why people had to mess things up so that you have to deal with '\xa3' instead of just a plain £, then Duncan's answer above is perfect and tells you everything you want to know.

Update (regarding your Update #2)

Please make sure your file is saved with latin-1 encoding and not UTF-8 as it is now, and your test will pass (or change #-*- coding: latin-1 -*- to #-*- coding: utf-8 -*- and use unicode literals such as u'£', as in the code in the P.S.).

This is something you could easily understand by reading the contents of my link above:

Your file is saved as UTF-8, which means the character £ takes 2 bytes. Since you tell the Python interpreter that the encoding is latin-1, it treats each of the 2 UTF-8 bytes of £ as a separate character and uses each one as a key.

In fact, I can count 19 characters in ln2, but len(ln2) returns 21.

When you test for '£' in keyDict you are looking for a 2-byte string, while each of the 2 bytes got its own key in the dictionary; that's why it is not found.

You can also check len(keyDict) and find that it is larger than you expect.
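The byte counting above can be checked with a short sketch. It assumes Python 3 and writes the two non-ASCII characters as escapes so the result does not depend on how the snippet itself is saved; byte_keys and pound_utf8 are illustrative names, not part of the original script.

```python
# ln2 from the question; \u00a3 is the pound sign, \u00a2 the cent sign.
ln2 = u"(*&^%$\u00a3@!/`'\\][=-#\u00a2"
utf8_bytes = ln2.encode('utf-8')

print(len(ln2))         # 19 characters
print(len(utf8_bytes))  # 21 bytes, since each non-ASCII character takes 2

# Index every byte separately, which is what the byte-string loop
# in the question effectively does.
byte_keys = {}
for index in range(len(utf8_bytes)):
    byte_keys[utf8_bytes[index:index + 1]] = index

# The pound sign is the 2-byte sequence 0xC2 0xA3 in UTF-8, so it
# never matches any of the single-byte keys.
pound_utf8 = u'\u00a3'.encode('utf-8')
print(pound_utf8 in byte_keys)
```

Both £ and ¢ start with the byte 0xC2 in UTF-8, so that key is written twice and keeps the later index (19), which matches the '\xc2': 19 entry in the output shown in the question.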

I guess this explains everything. Please understand that the whole story cannot easily be explained on a single web page, but the link above is, in my humble opinion, a nice starting point, mixing some history and some coding examples.

Cheers

P.S.: I'm using this code, saving it as UTF-8 and it works flawlessly:

#!/usr/bin/env python
#-*- coding: utf-8 -*-

ln1 = u'?0>9<8~7|65"4:3}2{1+_)'
ln2 = u"(*&^%$£@!/`'\][=-#¢"

refStr = u"%s%s" % (ln2, ln1)

keyDict = {}
for idx, chr_ in enumerate(refStr):
    print chr_,
    keyDict[chr_] = idx

print u"-" * 32
print u"Originl: ", refStr
print u"KeyDict: ", keyDict

tsChr = [u'£', u'%', u'\\', u'¢']
for k in tsChr:
    if k in keyDict:
        print k, "\t", keyDict[k]
    else:
        print k, repr(k), "\t", "not in the dic."