Removing unknown characters from a text file

I have a large number of files containing data I am trying to process using a Python script.

The files are in an unknown encoding, and if I open them in Notepad++ they contain numerical data separated by a load of 'null' characters (represented as NULL in white on black background in Notepad++).

In order to handle this, I separate the file by the null character \x00 and retrieve only numerical values using the following script:

import os

stripped_data = []
for root, dirs, files in os.walk(PATH):
    for rawfile in files:
        (dirName, fileName) = os.path.split(rawfile)
        (fileBaseName, fileExtension) = os.path.splitext(fileName)
        h = open(os.path.join(root, rawfile), 'r')
        line = h.read()
        h.close()
        # Fields are separated by null characters; keep only the numeric ones.
        for raw_value in line.split('\x00'):
            try:
                test = float(raw_value)
                stripped_data.append(raw_value.strip())
            except ValueError:
                pass

However, there are sometimes other unrecognised characters in the file (as far as I have found, always at the very beginning) - these show up in Notepad++ as 'EOT', 'SUB' and 'ETX'. They seem to interfere with the processing of the file in Python - the file appears to end at those characters, even though there is clearly more data visible in Notepad++.

How can I remove all non-ASCII characters from these files prior to processing?


You are opening the file in text mode. On Windows, that means the first Ctrl-Z character (0x1A, which Notepad++ shows as SUB) is treated as an end-of-file marker. Specify 'rb' instead of 'r' in open().
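A minimal sketch of that change, assuming Python 3 and that the numeric fields are plain ASCII text separated by null bytes. The helper name read_values is hypothetical and not part of the original script:

```python
def read_values(path):
    """Return the numeric fields of a null-separated file as strings."""
    values = []
    # 'rb' reads raw bytes, so control characters such as SUB (0x1A, Ctrl-Z)
    # are ordinary data and cannot cut the read short.
    with open(path, 'rb') as handle:
        data = handle.read()
    for chunk in data.split(b'\x00'):
        # Assumption: fields are ASCII; bytes that cannot be decoded are dropped.
        text = chunk.decode('ascii', errors='ignore').strip()
        try:
            float(text)
        except ValueError:
            continue  # skip empty or non-numeric fields
        values.append(text)
    return values
```

Inside the os.walk loop from the question, this would be called as read_values(os.path.join(root, rawfile)) and the result appended to stripped_data.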


I don't know if this will work for sure, but you could try using the I/O functions in the codecs module:

import codecs

# Same path and mode as the original open() call, plus the encoding.
inFile = codecs.open(os.path.join(root, rawfile), 'r', 'utf-8')
for line in inFile:
    do_stuff()

You can treat inFile just like a normal file object.

This may or may not help you, but it probably will.

[EDIT]

Basically you'll replace: h=open(os.path.join(root, rawfile),'r') with h=codecs.open(os.path.join(root, rawfile),'r', 'utf-8')


The file.read() function reads until it reaches EOF. Since you say it stops too early, you want to keep reading past that EOF while still making sure you stop once you have read the entire file. You can do this by getting the file size before reading, checking the position with file.tell() whenever you hit an EOF, and stopping only when the position equals the file size.

As this is rather complex, you may prefer to use file.next and iterate over the bytes.

To remove non-ASCII characters you can either use a white-list of specific characters or check each byte you read against a range you define. For example, if the byte is between 0x30 and 0x39 (a digit), keep it: save it somewhere or append it to a string. See an ASCII table.
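The white-list idea can be sketched as follows, assuming Python 3 and data read in binary mode. The set of allowed characters and the name keep_allowed are illustrative choices, not something given in this answer:

```python
# Assumed white-list: digits, sign, decimal point, exponent marker,
# plus the null byte so the fields can still be split afterwards.
ALLOWED = frozenset(b'0123456789+-.eE\x00')

def keep_allowed(data, allowed=ALLOWED):
    """Return a copy of `data` containing only white-listed bytes."""
    # Iterating over a bytes object yields integers in the range 0-255.
    return bytes(b for b in data if b in allowed)

raw = b'\x04\x1a\x0312.5\x00\x007e-3\x00'
cleaned = keep_allowed(raw)
# Split on the null separator and drop the empty fields.
fields = [f for f in cleaned.split(b'\x00') if f]
print(fields)  # [b'12.5', b'7e-3']
```

A white-list like this also removes the EOT, SUB and ETX control characters, which are technically ASCII and would survive a plain "non-ASCII" filter.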
