Pyparsing - parse jascii text from mixed jascii/ascii text file?

I have a text file with mixed jascii/shift-jis and ascii text. I'm using pyparsing and am unable to tokenize such strings.

Here is some example code:

from pyparsing import *

subrange = r"[\0x%x40-\0x%x7e\0x%x80-\0x%xFC]"
shiftJisChars = u''.join(srange(subrange % (i,i,i,i)) for i in range(0x81,0x9f+1) + range(0xe0,0xfc+1))
jasciistring = Word(shiftJisChars)

jasciistring.parseString(open('shiftjis.txt').read())

I get:

Traceback (most recent call last):
  File "test.py", line 7, in <module>
    jasciistring.parseString(open('shiftjis.txt').read())
  File "C:\python\lib\site-packages\pyparsing.py", line 1100, in parseString
    raise exc
pyparsing.ParseException

This is the content of the text file:

"‚s‚ˆ‚‰‚“@‚‰‚“@‚@‚“‚ˆ‚‰‚†‚”[‚Š‚‰‚“@‚“‚”‚’‚‰‚Ž‚‡B"

(no quotation marks)


When you have a problem with non-ASCII characters/bytes, it is rather unhelpful to print them to your console and then copy/paste that into your question. What you see is quite often NOT what you have got. You should use the built-in repr() function [Python 3.x: ascii()] to show your data as unambiguously as possible.

Do this:

python -c "print repr(open('shiftjis.txt', 'rb').read())"

and copy/paste the results into an edit of your question.
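For readers on Python 3, the same idea might look like the sketch below. It is written for Python 3 (unlike the Python 2 snippets in this thread); the helper name dump_bytes and the temporary sample file are made up for illustration and stand in for your real shiftjis.txt.

```python
import os
import tempfile

def dump_bytes(path):
    """Return an ASCII-only representation of the raw bytes in a file."""
    with open(path, 'rb') as f:   # binary mode: no decoding takes place
        return ascii(f.read())

# Hypothetical sample file holding two Shift-JIS encoded characters.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\x82s\x82\x88')
    sample_path = tmp.name

dumped = dump_bytes(sample_path)
print(dumped)
os.remove(sample_path)
```

Because the file is read in binary mode, what gets printed is the bytes themselves, not your console's interpretation of them.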

Reverse-engineering your data while awaiting enlightenment: A Windows code page would have to be a good suspect, with cp1252 the most usual. As @Mark Tolonen has shown, cp1252 almost fits, with one error. Further investigation shows that the other cp125x encodings produce 2, 3, or 5 errors. AFAIK only the cp125x encodings would map something that looks like a comma (actually U+201A SINGLE LOW-9 QUOTATION MARK) to the shift-jis lead byte \x82. I conclude that the offender is cp1252, and that the error is caused by damage in transit.
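A rough way to test this kind of guess is sketched below (Python 3, not the original investigation). It takes only the first few characters of the mojibake from the question, written as escapes, asks which cp125x code pages can encode them at all, and then undoes the suspected cp1252 step. The helper name codecs_that_encode is an assumption made up for this example.

```python
# First few characters of the mojibake from the question, written as escapes
# so that the source file's own encoding cannot interfere.
mojibake = '\u201as\u201a\u02c6\u201a\u2030\u201a\u201c@'

def codecs_that_encode(text, codec_names):
    """Return the codecs that can encode every character of text."""
    ok = []
    for name in codec_names:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue
        ok.append(name)
    return ok

candidates = ['cp125%d' % n for n in range(9)]   # cp1250 .. cp1258
fits = codecs_that_encode(mojibake, candidates)

# Undo the suspected wrong decoding, then decode the bytes as Shift-JIS.
raw = mojibake.encode('cp1252')
recovered = raw.decode('shift_jis')
print(fits, raw, recovered)
```

This only checks a short fragment; the full string from the question still contains the one sequence that does not survive the round trip.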

Another possibility is that the underlying original encoding is not shift-jis but its superset, Microsoft's cp932 as used on Japanese Windows. However the problematic sequence '\x82@' is not valid in cp932 either. In any case, if the file(s) that you want to process came from a Japanese Windows machine, it would be better to use cp932 than shift-jis.
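The claim about '\x82@' can be checked with a strict decode, as in this Python 3 sketch. The helper name decodes_cleanly is hypothetical; shift_jis and cp932 are the standard-library codec names.

```python
def decodes_cleanly(data, codec):
    """True if data decodes under codec with the default strict error handling."""
    try:
        data.decode(codec)
    except UnicodeDecodeError:
        return False
    return True

# Compare a known-good pair (b'\x82s') with the problematic one (b'\x82@').
results = {
    (seq, codec): decodes_cleanly(seq, codec)
    for seq in (b'\x82s', b'\x82@')
    for codec in ('shift_jis', 'cp932')
}

for (seq, codec), ok in sorted(results.items()):
    print(seq, codec, 'ok' if ok else 'invalid')
```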

It is not obvious from your question and your code what you want to do nor why you want to do it with byte ranges instead of just decoding your data to Unicode. I don't use pyparsing but it seems highly likely that the subranges that you are feeding it are malformed.

Below is an example of how you could tokenise your input using regular expressions. Note that the pyparsing syntax is slightly different (\0xff instead of Python's \xff).

Code:

import re, unicodedata

input_bytes = '\x82s\x82\x88\x82\x89\x82\x93@\x82\x89\x82\x93@\x82@\x82\x93\x82\x88\x82\x89\x82\x86\x82\x94[\x82\x8a\x82\x89\x82\x93@\x82\x93\x82\x94\x82\x92\x82\x89\x82\x8e\x82\x87B'

p_ascii = r'[\x00-\x7f]'
p_hw_katakana = r'[\xa1-\xdf]' # half-width Katakana
p_jis208 = r'[\x81-\x9f\xe0-\xef][\x40-\x7e\x80-\xfc]'
p_bad = r'.' # anything else

kinds = ['jis208', 'ascii', 'hwk', 'bad']

re_matcher = re.compile("(" + ")|(".join([p_jis208, p_ascii, p_hw_katakana, p_bad]) + ")")

for mobj in re_matcher.finditer(input_bytes):
    s = mobj.group()
    us = s.decode('shift-jis', 'replace')
    print ("%-6s %-9s %-10r U+%04X %s"
        % (kinds[mobj.lastindex - 1], mobj.span(), s, ord(us), unicodedata.name(us, '<no name>'))
        )

Output:

jis208 (0, 2)    '\x82s'    U+FF34 FULLWIDTH LATIN CAPITAL LETTER T
jis208 (2, 4)    '\x82\x88' U+FF48 FULLWIDTH LATIN SMALL LETTER H
jis208 (4, 6)    '\x82\x89' U+FF49 FULLWIDTH LATIN SMALL LETTER I
jis208 (6, 8)    '\x82\x93' U+FF53 FULLWIDTH LATIN SMALL LETTER S
ascii  (8, 9)    '@'        U+0040 COMMERCIAL AT
jis208 (9, 11)   '\x82\x89' U+FF49 FULLWIDTH LATIN SMALL LETTER I
jis208 (11, 13)  '\x82\x93' U+FF53 FULLWIDTH LATIN SMALL LETTER S
ascii  (13, 14)  '@'        U+0040 COMMERCIAL AT
jis208 (14, 16)  '\x82@'    U+FFFD REPLACEMENT CHARACTER
jis208 (16, 18)  '\x82\x93' U+FF53 FULLWIDTH LATIN SMALL LETTER S
jis208 (18, 20)  '\x82\x88' U+FF48 FULLWIDTH LATIN SMALL LETTER H
jis208 (20, 22)  '\x82\x89' U+FF49 FULLWIDTH LATIN SMALL LETTER I
jis208 (22, 24)  '\x82\x86' U+FF46 FULLWIDTH LATIN SMALL LETTER F
jis208 (24, 26)  '\x82\x94' U+FF54 FULLWIDTH LATIN SMALL LETTER T
ascii  (26, 27)  '['        U+005B LEFT SQUARE BRACKET
jis208 (27, 29)  '\x82\x8a' U+FF4A FULLWIDTH LATIN SMALL LETTER J
jis208 (29, 31)  '\x82\x89' U+FF49 FULLWIDTH LATIN SMALL LETTER I
jis208 (31, 33)  '\x82\x93' U+FF53 FULLWIDTH LATIN SMALL LETTER S
ascii  (33, 34)  '@'        U+0040 COMMERCIAL AT
jis208 (34, 36)  '\x82\x93' U+FF53 FULLWIDTH LATIN SMALL LETTER S
jis208 (36, 38)  '\x82\x94' U+FF54 FULLWIDTH LATIN SMALL LETTER T
jis208 (38, 40)  '\x82\x92' U+FF52 FULLWIDTH LATIN SMALL LETTER R
jis208 (40, 42)  '\x82\x89' U+FF49 FULLWIDTH LATIN SMALL LETTER I
jis208 (42, 44)  '\x82\x8e' U+FF4E FULLWIDTH LATIN SMALL LETTER N
jis208 (44, 46)  '\x82\x87' U+FF47 FULLWIDTH LATIN SMALL LETTER G
ascii  (46, 47)  'B'        U+0042 LATIN CAPITAL LETTER B

Note 1: You DON'T need to loop around and join O(N**2) character ranges.

If "jascii" just means "FULLWIDTH LATIN (CAPITAL|SMALL) LETTER [A-Z]", then (a) your net is far too large, and (b) you can do that easily using UNICODE character ranges instead of BYTE ranges (after decoding your data, of course).
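As a minimal sketch of point (b), assuming "jascii" really does mean only the fullwidth Latin letters: decode first, then match the two Unicode ranges U+FF21..U+FF3A and U+FF41..U+FF5A. This is Python 3, uses the re module rather than pyparsing, and the sample bytes are made up for illustration.

```python
import re

# Fullwidth Latin capital letters (U+FF21..U+FF3A) and small letters (U+FF41..U+FF5A).
FULLWIDTH_LATIN = re.compile('[\uFF21-\uFF3A\uFF41-\uFF5A]+')

# Hypothetical sample: Shift-JIS fullwidth letters mixed with plain ASCII.
raw = b'\x82s\x82\x88\x82\x89\x82\x93 abc \x82\x89\x82\x93'

text = raw.decode('shift_jis')          # bytes -> Unicode first
tokens = FULLWIDTH_LATIN.findall(text)  # then match on characters, not bytes
print(tokens)
```

The same character class should be usable wherever a pattern over decoded text is accepted, so no per-lead-byte loop is needed.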


The first thing that jumps out at me is that you're not opening the file as a binary file. I recommend using code like open('shiftjis.txt', 'rb'). You know that the file contains characters outside of the normal ASCII range, so it's usually best to open the file as a binary file and then decode the contents to Unicode. Perhaps something like the following will work (assuming that 'shift-jis' is the correct codec name):

text = open('shiftjis.txt', 'rb').read().decode('shift-jis')
jasciistring.parseString(text)

If parseString() is expecting a str object (as opposed to a unicode object) then you could change the last line to encode text using UTF-8:

jasciistring.parseString(text.encode('utf-8'))

The only other recommendation I have is to verify that jasciistring contains the correct grammar; since you're constructing it using hex ranges, I would expect you need to first treat it as a binary str and then decode it into a unicode object.


Your "text file content" is mojibake (garbage displayed from using the wrong codec to decode the file). I guessed which wrong codec had been used, re-encoded the text with it, decoded the result with ShiftJIS, and got:

# coding: utf8
import codecs
s = u'‚s‚ˆ‚‰‚“@‚‰‚“@‚@‚“‚ˆ‚‰‚†‚”[‚Š‚‰‚“@‚“‚”‚’‚‰‚Ž‚‡B'
s = s.encode('cp1252').decode('shift-jis','replace')
print s

Output:

This@is@�shift[jis@stringB

So the default US Windows codec isn't quite right :^)

Very likely all you need to do is read the original file with the shift_jis codec:

import codecs
f = codecs.open('shiftjis.txt','rb','shift_jis')
data = f.read()
f.close()

data will be a Unicode string containing the decoded characters.
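On Python 3 the built-in open() accepts an encoding argument, so a roughly equivalent sketch needs no codecs import. The temporary file below is a made-up stand-in for shiftjis.txt; if the data came from a Japanese Windows machine, 'cp932' may be the better codec name to try.

```python
import os
import tempfile

# Hypothetical stand-in for shiftjis.txt: four fullwidth letters plus ASCII.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\x82s\x82\x88\x82\x89\x82\x93 ok')
    path = tmp.name

# Text mode with an explicit encoding; the with-block closes the file.
with open(path, encoding='shift_jis') as f:
    data = f.read()

os.remove(path)
print(ascii(data))
```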
