Why is the first line longer?
I'm using Python to read a text file with:
f = open(path, "r")
for line in f:
    line = line.decode('utf8').strip()
    length = len(line)
    firstLetter = line[:1]
It seems to work, but the first line's length is always longer by 1.
For example, if the first line is "XXXX", where X denotes a Chinese character, then length will be 5 rather than 4, and firstLetter shows up as nothing.
From the second line onward it works properly.
tks~
You have a UTF-8 BOM at the start of your file. Don't faff about inspecting the first character. Instead of the utf8 encoding, use the utf_8_sig encoding with either codecs.open() or your_byte_string.decode(). This sucks up the BOM if it exists, so you never see it in your code.
>>> bom8 = u'\ufeff'.encode('utf8')
>>> bom8
'\xef\xbb\xbf'
>>> bom8.decode('utf8')
u'\ufeff'
>>> bom8.decode('utf_8_sig')
u'' # removes the BOM
>>> 'abcd'.decode('utf_8_sig')
u'abcd' # doesn't care if no BOM
>>>
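Applied to the loop from the question, the same idea might look like the sketch below. It is only an illustration: it uses io.open() (available in Python 2.6+ and Python 3) rather than codecs.open(), and the helper name first_letters and the throwaway sample file are made up for the demo.

```python
import io
import os
import tempfile


def first_letters(path):
    """Return (length, first character) for each stripped line of a UTF-8 file."""
    results = []
    # "utf-8-sig" drops a leading BOM if present and otherwise behaves like "utf-8"
    with io.open(path, "r", encoding="utf-8-sig") as f:
        for line in f:
            line = line.strip()
            results.append((len(line), line[:1]))
    return results


# Hypothetical sample file that starts with a UTF-8 BOM, just for the demo
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"\xef\xbb\xbf" + u"中文字符\n第二行\n".encode("utf-8"))
    sample_path = tmp.name

rows = first_letters(sample_path)
os.remove(sample_path)
print(rows)
```

With the plain utf-8 encoding, the first tuple would report a length of 5 and an invisible U+FEFF as the first character, which matches the symptom in the question.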
You are probably getting the Byte Order Mark (BOM) as the first character on the first line.
Information about dealing with it is here