Pyparsing - parse jascii text from mixed jascii/ascii text file?
I have a text file with mixed jascii/shift-jis and ascii text. I'm using pyparsing
and am unable to tokenize such strings.
Here is some example code:
from pyparsing import *
subrange = r"[\0x%x40-\0x%x7e\0x%x80-\0x%xFC]"
shiftJisChars = u''.join(srange(subrange % (i,i,i,i)) for i in range(0x81,0x9f+1) + range(0xe0,0xfc+1))
jasciistring = Word(shiftJisChars)
jasciistring.parseString(open('shiftjis.txt').read())
I get:
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    jasciistring.parseString(open('shiftjis.txt').read())
  File "C:\python\lib\site-packages\pyparsing.py", line 1100, in parseString
    raise exc
pyparsing.ParseException
This is the content of the text file:
"‚s‚ˆ‚‰‚“@‚‰‚“@‚@‚“‚ˆ‚‰‚†‚”[‚Š‚‰‚“@‚“‚”‚’‚‰‚Ž‚‡B"
(no quotation marks)
When you have a problem with non-ASCII characters/bytes, it is rather unhelpful to print them to your console and then copy/paste that into your question. What you see is quite often NOT what you have got. You should use the built-in repr() function [Python 3.x: ascii()] to show your data as unambiguously as possible.
Do this:
python -c "print repr(open('shiftjis.txt', 'rb').read())"
and copy/paste the results into an edit of your question.
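Under Python 3.x the equivalent would be (assuming the interpreter is invoked as python3):
python3 -c "print(ascii(open('shiftjis.txt', 'rb').read()))"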
Reverse-engineering your data while awaiting enlightenment: a Windows code page would have to be a good suspect, with cp1252 the most usual. As @Mark Tolonen has shown, cp1252 almost fits, with one error. Further investigation shows that the other cp125x encodings produce 2, 3, or 5 errors. AFAIK only the cp125x encodings would map something that looks like a comma (actually U+201A SINGLE LOW-9 QUOTATION MARK) to the shift-jis lead byte \x82. I conclude that the offender is cp1252, and that the error is caused by damage in transit.
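As a rough check of that last claim (just a sketch; it encodes U+201A with a few of the Windows code pages and shows the resulting byte):
# U+201A is the comma-like character visible in the pasted text;
# every cp125x code page maps it to \x82, a shift-jis lead byte.
for codec in ('cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254'):
    print codec, repr(u'\u201a'.encode(codec))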
Another possibility is that the underlying original encoding is not shift-jis but its superset, Microsoft's cp932, as used on Japanese Windows. However, the problematic sequence '\x82@' is not valid in cp932 either. In any case, if the file(s) that you want to process came from a Japanese Windows machine, it would be better to use cp932 than shift-jis.
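A quick sanity check of that statement (a sketch; strict decoding should raise for the damaged two-byte sequence under both codecs):
# '\x82@' is the byte pair \x82\x40, which neither codec maps to a character.
for codec in ('shift_jis', 'cp932'):
    try:
        print codec, repr('\x82@'.decode(codec))
    except UnicodeDecodeError as e:
        print codec, 'invalid:', e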
It is not obvious from your question and your code what you want to do, nor why you want to do it with byte ranges instead of just decoding your data to Unicode. I don't use pyparsing, but it seems highly likely that the subranges that you are feeding it are malformed.
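One quick way to see the problem (plain string formatting, so it makes no assumptions about how srange handles escapes) is to print what your format string actually expands to:
subrange = r"[\0x%x40-\0x%x7e\0x%x80-\0x%xFC]"
print subrange % (0x81, 0x81, 0x81, 0x81)
# prints [\0x8140-\0x817e\0x8180-\0x81FC] -- each "escape" ends up with four hex digits glued together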
Below is an example of how you could tokenise your input using regular expressions. Note that the pyparsing syntax is slightly different (\0xff instead of Python's \xff).
Code:
import re, unicodedata
input_bytes = '\x82s\x82\x88\x82\x89\x82\x93@\x82\x89\x82\x93@\x82@\x82\x93\x82\x88\x82\x89\x82\x86\x82\x94[\x82\x8a\x82\x89\x82\x93@\x82\x93\x82\x94\x82\x92\x82\x89\x82\x8e\x82\x87B'
p_ascii = r'[\x00-\x7f]'
p_hw_katakana = r'[\xa1-\xdf]' # half-width Katakana
p_jis208 = r'[\x81-\x9f\xe0-\xef][\x40-\x7e\x80-\xfc]'
p_bad = r'.' # anything else
kinds = ['jis208', 'ascii', 'hwk', 'bad']
re_matcher = re.compile("(" + ")|(".join([p_jis208, p_ascii, p_hw_katakana, p_bad]) + ")")
for mobj in re_matcher.finditer(input_bytes):
    s = mobj.group()
    us = s.decode('shift-jis', 'replace')
    print ("%-6s %-9s %-10r U+%04X %s"
        % (kinds[mobj.lastindex - 1], mobj.span(), s, ord(us), unicodedata.name(us, '<no name>'))
        )
Output:
jis208 (0, 2) '\x82s' U+FF34 FULLWIDTH LATIN CAPITAL LETTER T
jis208 (2, 4) '\x82\x88' U+FF48 FULLWIDTH LATIN SMALL LETTER H
jis208 (4, 6) '\x82\x89' U+FF49 FULLWIDTH LATIN SMALL LETTER I
jis208 (6, 8) '\x82\x93' U+FF53 FULLWIDTH LATIN SMALL LETTER S
ascii (8, 9) '@' U+0040 COMMERCIAL AT
jis208 (9, 11) '\x82\x89' U+FF49 FULLWIDTH LATIN SMALL LETTER I
jis208 (11, 13) '\x82\x93' U+FF53 FULLWIDTH LATIN SMALL LETTER S
ascii (13, 14) '@' U+0040 COMMERCIAL AT
jis208 (14, 16) '\x82@' U+FFFD REPLACEMENT CHARACTER
jis208 (16, 18) '\x82\x93' U+FF53 FULLWIDTH LATIN SMALL LETTER S
jis208 (18, 20) '\x82\x88' U+FF48 FULLWIDTH LATIN SMALL LETTER H
jis208 (20, 22) '\x82\x89' U+FF49 FULLWIDTH LATIN SMALL LETTER I
jis208 (22, 24) '\x82\x86' U+FF46 FULLWIDTH LATIN SMALL LETTER F
jis208 (24, 26) '\x82\x94' U+FF54 FULLWIDTH LATIN SMALL LETTER T
ascii (26, 27) '[' U+005B LEFT SQUARE BRACKET
jis208 (27, 29) '\x82\x8a' U+FF4A FULLWIDTH LATIN SMALL LETTER J
jis208 (29, 31) '\x82\x89' U+FF49 FULLWIDTH LATIN SMALL LETTER I
jis208 (31, 33) '\x82\x93' U+FF53 FULLWIDTH LATIN SMALL LETTER S
ascii (33, 34) '@' U+0040 COMMERCIAL AT
jis208 (34, 36) '\x82\x93' U+FF53 FULLWIDTH LATIN SMALL LETTER S
jis208 (36, 38) '\x82\x94' U+FF54 FULLWIDTH LATIN SMALL LETTER T
jis208 (38, 40) '\x82\x92' U+FF52 FULLWIDTH LATIN SMALL LETTER R
jis208 (40, 42) '\x82\x89' U+FF49 FULLWIDTH LATIN SMALL LETTER I
jis208 (42, 44) '\x82\x8e' U+FF4E FULLWIDTH LATIN SMALL LETTER N
jis208 (44, 46) '\x82\x87' U+FF47 FULLWIDTH LATIN SMALL LETTER G
ascii (46, 47) 'B' U+0042 LATIN CAPITAL LETTER B
Note 1: You DON'T need to loop around and join O(N**2) character ranges.
If "jascii" just means "FULLWIDTH LATIN (CAPITAL|SMALL) LETTER [A-Z]", then (a) your net is far too wide, and (b) you can do that easily using UNICODE character ranges instead of BYTE ranges (after, of course, decoding your data), as sketched below.
The first thing that jumps out at me is that you're not opening the file as a binary file; I recommend using code like open('shiftjis.txt', 'rb'). You know that the file contains characters outside of the normal ASCII range, so it's usually best to open the file in binary mode and then decode the contents to Unicode. Perhaps something like the following will work (assuming that 'shift-jis' is the correct codec name):
text = open('shiftjis.txt', 'rb').read().decode('shift-jis')
jasciistring.parseString(text)
If parseString() is expecting a str object (as opposed to a unicode object), then you could change the last line to encode text using UTF-8:
jasciistring.parseString(text.encode('utf-8'))
The only other recommendation I have is to verify that jasciistring contains the correct grammar; since you're constructing it using hex ranges, I would expect you need to first treat it as a binary str and then decode it into a unicode object.
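A sketch of that idea (it assumes "jascii" means the fullwidth Latin letters U+FF21-U+FF3A and U+FF41-U+FF5A, and uses 'replace' because the sample data contains a damaged byte sequence):
from pyparsing import Word

# Build the allowed character set from Unicode code points, not raw bytes
fullwidth_letters = u''.join(unichr(c) for c in
                             range(0xFF21, 0xFF3A + 1) + range(0xFF41, 0xFF5A + 1))
jasciiword = Word(fullwidth_letters)

text = open('shiftjis.txt', 'rb').read().decode('shift-jis', 'replace')
for tokens in jasciiword.searchString(text):
    print tokens[0].encode('utf-8')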
You "text file content" is mojibake (garbage displayed from using the wrong codec to decode the file). I guessed at the wrong codec, re-encoded the text, decoded with ShiftJIS and got:
# coding: utf8
import codecs
s = u'‚s‚ˆ‚‰‚“@‚‰‚“@‚@‚“‚ˆ‚‰‚†‚”[‚Š‚‰‚“@‚“‚”‚’‚‰‚Ž‚‡B'
s = s.encode('cp1252').decode('shift-jis','replace')
print s
Output
Ｔｈｉｓ@ｉｓ@�ｓｈｉｆｔ[ｊｉｓ@ｓｔｒｉｎｇB
So the default US Windows codec isn't quite the right one :^)
Very likely all you need to do is read the original file with the shift_jis codec:
import codecs
f = codecs.open('shiftjis.txt','rb','shift_jis')
data = f.read()
f.close()
data will be a Unicode string containing the decoded characters.
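Note that with the default ('strict') error handling, the damaged \x82\x40 sequence discussed above would raise a UnicodeDecodeError; codecs.open also takes an errors argument if you would rather substitute U+FFFD and keep going, for example:
import codecs
f = codecs.open('shiftjis.txt', 'rb', 'shift_jis', errors='replace')
data = f.read()   # Unicode, with U+FFFD where the damaged bytes were
f.close()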