
Why is the first line longer?

I'm using Python to read a text file with:

f = open(path,"r")
for line in f:
    line = line.decode('utf8').strip()
    length = len(line)
    firstLetter = line[:1]

It seems to work, but the length of the first line is always 1 longer than it should be.

For example, if the first line is "XXXX", where each X denotes a Chinese character, then length is 5 rather than 4, and firstLetter appears to be empty.

From the second line onward, it works properly.

Thanks!


You have a UTF-8 BOM at the start of your file. Don't faff about inspecting the first character. Instead of the utf8 encoding, use the utf_8_sig encoding with either codecs.open() or your_byte_string.decode() ... this sucks up the BOM if it exists and you don't see it in your code.

>>> bom8 = u'\ufeff'.encode('utf8')
>>> bom8
'\xef\xbb\xbf'
>>> bom8.decode('utf8')
u'\ufeff'
>>> bom8.decode('utf_8_sig')
u'' # removes the BOM
>>> 'abcd'.decode('utf_8_sig')
u'abcd' # doesn't care if no BOM
>>>
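Applied to the loop in the question, this could look roughly like the sketch below. It uses io.open() so the file is decoded while it is read, and the same code should run on Python 2 and Python 3. The read_lines helper and the temporary sample file are made up for illustration and are not part of the original code.

```python
import codecs
import io
import os
import tempfile


def read_lines(path):
    """Return the stripped lines of a UTF-8 file, ignoring a leading BOM if present."""
    # utf_8_sig drops a leading BOM and otherwise behaves like utf8.
    with io.open(path, "r", encoding="utf_8_sig") as f:
        return [line.strip() for line in f]


# Hypothetical sample file: a UTF-8 BOM followed by two short lines.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(codecs.BOM_UTF8 + u"\u4e2d\u6587\u5b57\u7b26\nabcd\n".encode("utf8"))

lines = read_lines(sample_path)
os.remove(sample_path)

first = lines[0]
length = len(first)        # counts the four characters only, no BOM
first_letter = first[:1]   # the first real character of the line
print(length)
```

Note that strip() does not help with the original problem: U+FEFF is not treated as whitespace, so the BOM has to be removed by the decoder.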


You are probably getting the Byte Order Mark (BOM) as the first character on the first line.

Information about dealing with it is here
