Reference encoding error byte in Python

Suppose I type line = line.decode('gb18030') and get the error

UnicodeDecodeError: 'gb18030' codec can't decode bytes in position 142-143: illegal multibyte sequence

Is there a nice way to automatically get the error bytes? That is, is there a way to get 142 & 143 or line[142:144] from a built-in command or module? Since I'm fairly confident that there will be only one such error, at most, per line, my first thought was along the lines of:

for i in range(len(line)):
    try:    
        line[i].decode('gb18030')
    except UnicodeDecodeError:
        error = i

I'm not sure of the right terminology, but gb18030 is a variable-width encoding, so this method fails as soon as it reaches a multibyte character (for example, a 2-byte Chinese character).


try:
    line = line.decode('gb18030')
except UnicodeDecodeError as e:
    # e.end is exclusive, so the last offending byte is at index e.end - 1
    print("Error in bytes %d through %d" % (e.start, e.end - 1))
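
To get the offending bytes themselves rather than just their positions, the exception also carries the input in its object attribute. Here is a minimal sketch, assuming Python 3 (where line is a bytes object); the helper name find_undecodable is made up for illustration and is not part of any library:

```python
def find_undecodable(data, encoding="gb18030"):
    """Return (start, end, bad_bytes) for the first decode error, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.object is the bytes that were being decoded.
        # exc.start is inclusive and exc.end is exclusive, so this slice
        # is exactly the span the error message complains about.
        return exc.start, exc.end, exc.object[exc.start:exc.end]
    return None


# Two valid Chinese characters, then a byte that cannot start a gb18030 sequence.
line = "中文".encode("gb18030") + b"\xff" + b"abc"
print(find_undecodable(line))
```

Because the slice uses e.start and e.end directly, it corresponds to line[142:144] for the error message quoted in the question.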


Access the start and end attributes of the caught exception object. Note that end is exclusive, so the offending bytes are line[e.start:e.end].

u = u'áiuê©'
try:
    l = u.encode('latin-1')
    print(repr(l))
    # These Latin-1 bytes are not valid UTF-8, so decoding raises
    l.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)
    print("%d %d" % (e.start, e.end))
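
If a line might contain more than one bad sequence after all, the same attributes can be used to resume decoding just past each error. This is only a sketch, assuming Python 3 and a codec that can safely restart at e.end (the same point where the built-in 'replace' and 'ignore' handlers resume); undecodable_spans is a hypothetical helper name:

```python
def undecodable_spans(data, encoding):
    """Collect (start, end) for every undecodable span in data; end is exclusive."""
    spans = []
    offset = 0
    while offset < len(data):
        try:
            data[offset:].decode(encoding)
            break  # the rest of the data decoded cleanly
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice, so shift them.
            spans.append((offset + exc.start, offset + exc.end))
            offset += exc.end  # skip the bad bytes and try again
    return spans


print(undecodable_spans(b"ok\xffok\xfe", "utf-8"))
```

Each pass re-decodes the remaining tail, which is fine for single lines but would be slow on large inputs.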