How do I specify extended ASCII (i.e. range(256)) in the Python magic encoding specifier line?
I'm using mako templates to generate specialized config files. Some of these files contain extended ASCII chars (>127), but mako chokes saying that the chars are out of range when I use:
## -*- coding: ascii -*-
So I'm wondering if perhaps there's something like:
## -*- coding: eascii -*-
that I can use and that will be OK with the range(128, 256) chars.
EDIT:
Here's the dump of the offending section of the file:
000001b0 39 c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce |9...............|
000001c0 cf d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 da db dc dd de |................|
000001d0 df e0 e1 e2 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee |................|
000001e0 ef f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 fa fb fc fd fe |................|
000001f0 ff 5d 2b 28 27 73 29 3f 22 0a 20 20 20 20 20 20 |.]+('s)?". |
00000200 20 20 74 6f 6b 65 6e 3a 20 57 4f 52 44 20 20 20 | token: WORD |
00000210 20 20 22 5b 41 2d 5a 61 2d 7a 30 2d 39 c0 c1 c2 | "[A-Za-z0-9...|
00000220 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce cf d0 d1 d2 |................|
00000230 d3 d4 d5 d6 d7 d8 d9 da db dc dd de df e0 e1 e2 |................|
00000240 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee ef f0 f1 f2 |................|
00000250 f3 f4 f5 f6 f7 f8 f9 fa fb fc fd fe ff 5d 2b 28 |.............]+(|
The first character that mako complains about is 000001b4. If I remove this section, everything works fine. With the section inserted, mako complains:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
It's the same complaint whether I use 'ascii' or 'latin-1' in the magic comment line.
Thanks!
Greg
Short answer
Use cp437 as the encoding for some retro DOS fun. All byte values greater than or equal to 32 decimal, except 127, are mapped to displayable characters in this encoding. Then use cp037 as the encoding for a truly trippy time. And then ask yourself how do you really know which of these, if either of them, is "correct".
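To make the point concrete, here is a minimal Python 3 sketch that decodes the same bytes under several codecs. The four sample bytes are taken from the dump in the question; the variable names are only for illustration.

```python
# Four bytes copied from the start of the offending run in the dump.
raw = bytes([0xC0, 0xC1, 0xC2, 0xC3])

# Each codec maps the same bytes to different characters. None of them is
# "the" right answer until you know how the file was actually produced.
decoded = {codec: raw.decode(codec) for codec in ("cp437", "cp037", "latin-1")}
for codec, text in decoded.items():
    print(codec, repr(text))

# ASCII has no mapping at all for byte values above 0x7f.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError as exc:
    ascii_ok = False
    print("ascii:", exc)
```

All three non-ASCII codecs succeed, which is exactly why a successful decode on its own tells you nothing about which encoding is the intended one.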
Long answer
There is something you must unlearn: the absolute equivalence of byte values and characters.
Many basic text editors and debugging tools today, and also the Python language specification, imply an absolute equivalence between bytes and characters when in reality none exists. It is not true that 74 6f 6b 65 6e is "token". Only for ASCII-compatible character encodings is this correspondence valid. In EBCDIC, which is still quite common today, "token" corresponds to byte values a3 96 92 85 95.
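The byte values quoted above can be checked with a short Python 3 sketch. It assumes that cp037, one of the EBCDIC code pages shipped with Python, is a fair stand-in for "EBCDIC" here; the helper name `hex_bytes` is made up for the example.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated two-digit hex values."""
    return " ".join(f"{value:02x}" for value in data)

word = "token"
ascii_form = word.encode("ascii")
ebcdic_form = word.encode("cp037")  # assumption: cp037 as the EBCDIC variant

print("ascii:", hex_bytes(ascii_form))   # 74 6f 6b 65 6e
print("cp037:", hex_bytes(ebcdic_form))  # a3 96 92 85 95

# Reading the ASCII bytes with the EBCDIC codec does not give "token" back.
misread = ascii_form.decode("cp037")
print(repr(misread))
```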
So while the Python 2.6 interpreter happily evaluates 'text' == u'text' as True, it shouldn't, because they are only equivalent under the assumption of ASCII or a compatible encoding, and even then they should not be considered equal. (At least '\xfd' == u'\xfd' is False and gets you a warning for trying.) Python 3.1 evaluates 'text' == b'text' as False. But even the acceptance of this expression by the interpreter implies an absolute equivalence of byte values and characters, because the expression b'text' is taken to mean "the byte-string you get when you apply the ASCII encoding to 'text'" by the interpreter.
As far as I know, every programming language in widespread use today carries an implicit use of ASCII or ISO-8859-1 (Latin-1) character encoding somewhere in its design. In C, the char data type is really a byte. I saw one Java 1.4 VM where the constructor java.lang.String(byte[] data) assumed ISO-8859-1 encoding. Most compilers and interpreters assume ASCII or ISO-8859-1 encoding of source code (some let you change it). In Java, string length is really the UTF-16 code unit length, which is arguably wrong for characters U+10000 and above. In Unix, filenames are byte-strings interpreted according to terminal settings, allowing you to open('a\x08b', 'w').write('Say my name!').
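The UTF-16 remark can be illustrated from Python 3 (3.3 or later, where len() counts code points). This is only a sketch of the counting difference, not a statement about any particular Java VM.

```python
# U+10000 is the first code point outside the Basic Multilingual Plane.
ch = "\U00010000"

code_points = len(ch)                           # what Python 3.3+ reports
utf16_units = len(ch.encode("utf-16-le")) // 2  # what a UTF-16 length counts
utf8_bytes = len(ch.encode("utf-8"))

print(code_points, utf16_units, utf8_bytes)  # 1 2 4
```

One character, three different "lengths", depending on which representation you count.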
So we have all been trained and conditioned by the tools we have learned to trust, to believe that 'A' is 0x41. But it isn't. 'A' is a character and 0x41 is a byte and they are simply not equal.
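In Python 3 the distinction is at least partly enforced, as this small sketch shows. The variable names are illustrative, and cp037 is used only as an example of a non-ASCII-compatible encoding.

```python
char = "A"
byte_value = 0x41

# A character never compares equal to a number or to a byte string.
print(char == byte_value)          # False
print(char == bytes([byte_value])) # False
print("text" == b"text")           # False

# They only line up once an encoding is named explicitly.
ascii_match = char.encode("ascii") == bytes([byte_value])
ebcdic_match = char.encode("cp037") == bytes([byte_value])
print(ascii_match)   # True
print(ebcdic_match)  # False: "A" is byte 0xc1 in cp037
```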
Once you have become enlightened on this point, you will have no trouble resolving your issue. You have simply to decide what component in the software is assuming the ASCII encoding for these byte values, and how to either change that behavior or ensure that different byte values appear instead.
PS: The phrases "extended ASCII" and "ANSI character set" are misnomers.
Try
## -*- coding: UTF-8 -*-
or
## -*- coding: latin-1 -*-
or
## -*- coding: cp1252 -*-
depending on what you really need. The last two are similar except:
The Windows-1252 codepage coincides with ISO-8859-1 for all codes except the range 128 to 159 (hex 80 to 9F), where the little-used C1 controls are replaced with additional characters. Windows-28591 is the actual ISO-8859-1 codepage.
where ISO-8859-1 is the official name for latin-1.
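If you want to see exactly where the last two differ, the following Python 3 sketch walks all 256 byte values. It relies on Python's own codec tables, in which a few cp1252 positions are undefined and raise on decoding; the list names are made up for the example.

```python
differing = []            # bytes that decode to different characters
undefined_in_cp1252 = []  # bytes that Python's cp1252 codec rejects

for value in range(256):
    byte = bytes([value])
    as_latin1 = byte.decode("latin-1")  # latin-1 accepts every byte value
    try:
        as_cp1252 = byte.decode("cp1252")
    except UnicodeDecodeError:
        undefined_in_cp1252.append(value)
        continue
    if as_latin1 != as_cp1252:
        differing.append(value)

print([hex(v) for v in differing])
print([hex(v) for v in undefined_in_cp1252])
```

Every value in both lists falls in the range 0x80 to 0x9f, so for the bytes 0xc0 to 0xff shown in the question the two codecs give identical results.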
Try examining your data with a critical eye:
000001b0 39 c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce |9...............|
000001c0 cf d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 da db dc dd de |................|
000001d0 df e0 e1 e2 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee |................|
000001e0 ef f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 fa fb fc fd fe |................|
000001f0 ff 5d 2b 28 27 73 29 3f 22 0a 20 20 20 20 20 20 |.]+('s)?". |
00000200 20 20 74 6f 6b 65 6e 3a 20 57 4f 52 44 20 20 20 | token: WORD |
00000210 20 20 22 5b 41 2d 5a 61 2d 7a 30 2d 39 c0 c1 c2 | "[A-Za-z0-9...|
00000220 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce cf d0 d1 d2 |................|
00000230 d3 d4 d5 d6 d7 d8 d9 da db dc dd de df e0 e1 e2 |................|
00000240 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee ef f0 f1 f2 |................|
00000250 f3 f4 f5 f6 f7 f8 f9 fa fb fc fd fe ff 5d 2b 28 |.............]+(|
The non-ASCII part of the dump is two runs of every byte from 0xc0 to 0xff inclusive (offsets 0x1b1 to 0x1f0 and 0x21d to 0x25c). You appear to have a binary file (perhaps a dump of compiled regexes), not a text file. I suggest that you read it as a binary file rather than paste it into your Python source file. You should also read the Mako docs to find out what it is expecting.
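One way to inspect such a file as bytes is to locate the runs of non-ASCII values without decoding anything. The sketch below is Python 3; `high_byte_runs` is a hypothetical helper, and the sample is rebuilt in memory from the first run in the dump rather than read from your actual file.

```python
def high_byte_runs(data: bytes):
    """Return (start, end) offsets of each run of bytes above 0x7f (end exclusive)."""
    runs = []
    start = None
    for offset, value in enumerate(data):
        if value > 0x7F:
            if start is None:
                start = offset
        elif start is not None:
            runs.append((start, offset))
            start = None
    if start is not None:
        runs.append((start, len(data)))
    return runs

# Stand-in for the file contents: "9", then c0..ff, then the ASCII tail.
sample = b"9" + bytes(range(0xC0, 0x100)) + b"]+('s)?\"\n"
runs = high_byte_runs(sample)
print(runs)  # [(1, 65)]
```

For a real file you would pass in the result of open(path, "rb").read(), where path is whatever your template or config file is called.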
Update after eyeballing the text part of your dump: You may well be able to express this in ASCII-only regexes e.g. you would have a line containing
token: WORD "[A-Za-z0-9\xc0-\xff]+(etc)etc"
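As a rough check that the escaped form behaves like the original character class, here is a Python 3 sketch using the re module. It assumes the consumer of the config file interprets \xc0-\xff the way Python's regex engine does, which you would need to confirm for your tool; the ('s)? suffix is copied from the dump.

```python
import re

# The pattern source is pure ASCII; \xc0-\xff is expanded by the regex
# engine into the range U+00C0 to U+00FF.
pattern = re.compile(r"[A-Za-z0-9\xc0-\xff]+('s)?")
source_is_ascii = all(ord(c) < 128 for c in pattern.pattern)

matches_accented = pattern.fullmatch("Zo\xeb's") is not None  # U+00EB is in range
matches_outside = pattern.fullmatch("\u0142") is not None     # U+0142 is not

print(source_is_ascii, matches_accented, matches_outside)  # True True False
```

Note that the range U+00C0 to U+00FF also includes the multiplication and division signs (U+00D7 and U+00F7), just as the original run of bytes does.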