Removing unknown characters from a text file

2023-01-27 03:09

I have a large number of files containing data I am trying to process using a Python script.

The files are in an unknown encoding, and if I open them in Notepad++ they contain numerical data separated by a load of 'null' characters (represented as NULL in white on black background in Notepad++).

In order to handle this, I separate the file by the null character \x00 and retrieve only numerical values using the following script:

import os

stripped_data = []
for root, dirs, files in os.walk(PATH):
    for rawfile in files:
        (dirName, fileName) = os.path.split(rawfile)
        (fileBaseName, fileExtension) = os.path.splitext(fileName)
        # Opened in text mode ('r')
        h = open(os.path.join(root, rawfile), 'r')
        line = h.read()
        # Values are separated by null characters
        for raw_value in line.split('\x00'):
            try:
                test = float(raw_value)
                stripped_data.append(raw_value.strip())
            except ValueError:
                pass

However, there are sometimes other unrecognised characters in the file (as far as I have found, always at the very beginning) - these show up in Notepad++ as 'EOT', 'SUB' and 'ETX'. They seem to interfere with the processing of the file in Python - the file appears to end at those characters, even though there is clearly more data visible in Notepad++.

How can I remove all non-ASCII characters from these files prior to processing?

You are opening the file in text mode. On Windows, that means the first Ctrl-Z character (SUB, \x1a) is treated as an end-of-file marker, so read() stops there. Specify 'rb' instead of 'r' in open() to read the file in binary mode.
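A minimal sketch of the binary-mode approach, written for Python 3. The function name read_numeric_fields is hypothetical, and decoding each field as ASCII while ignoring other bytes is an assumption about the data, not something the question confirms.

```python
def read_numeric_fields(path):
    """Return the null-separated fields of a file that parse as numbers."""
    # 'rb' reads raw bytes, so Ctrl-Z (0x1A) is ordinary data, not end-of-file
    with open(path, 'rb') as handle:
        raw = handle.read()

    values = []
    for field in raw.split(b'\x00'):
        # Assumption: the numbers are plain ASCII; bytes above 0x7F are dropped
        text = field.decode('ascii', errors='ignore').strip()
        try:
            float(text)
        except ValueError:
            continue  # not a number (control characters, empty field, ...)
        values.append(text)
    return values
```

Calling read_numeric_fields(os.path.join(root, rawfile)) inside the os.walk loop would take the place of the open/read/split steps in the question.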

I don't know if this will work for sure, but you could try using the I/O functions in the codecs module:

import codecs

# Same arguments as open(), followed by the encoding
inFile = codecs.open(filename, 'r', 'utf-8')
for line in inFile:
    do_stuff()

You can treat inFile just like a normal file object.

This may or may not help you, but it probably will.

[EDIT]

Basically, you'll replace: h=open(os.path.join(root, rawfile),'r') with h=codecs.open(os.path.join(root, rawfile),'r','utf-8')
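A small sketch of that replacement as a helper function. The name read_decoded is hypothetical, 'utf-8' is only this answer's guess at the encoding, and errors='replace' is an added assumption so that undecodable bytes become U+FFFD instead of raising an exception.

```python
import codecs

def read_decoded(path, encoding='utf-8'):
    """Read a whole file through codecs.open and return the decoded text."""
    # codecs.open opens the underlying file in binary mode, then decodes;
    # errors='replace' (an assumption) substitutes U+FFFD for invalid bytes
    with codecs.open(path, 'r', encoding, errors='replace') as handle:
        return handle.read()
```

The returned string can then be split on '\x00' as in the question. If the real encoding is not UTF-8, the decoded text may contain many replacement characters.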

The file.read() function reads until EOF. Since you say it stops too early, you want to continue reading the file even after hitting an EOF, and stop only once you have read the entire file. You can do this by checking the position in the file via file.tell() when you hit an EOF, and stopping when the position reaches the file size (read the file size before you start reading).

As this is rather complex, you may prefer to iterate over the file's contents byte by byte (note that file.next itself yields lines, not bytes).
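The position check can be sketched roughly as follows. The function name read_entire_file and the chunk size are hypothetical, and the sketch opens the file in binary mode, which is an assumption on top of this answer.

```python
import os

def read_entire_file(path, chunk_size=4096):
    """Read in chunks until the file position reaches the file size."""
    size = os.path.getsize(path)  # file size, taken before reading
    chunks = []
    with open(path, 'rb') as handle:
        # Keep reading until tell() reports that the whole file was consumed
        while handle.tell() < size:
            chunk = handle.read(chunk_size)
            if not chunk:
                break  # nothing more can be read; avoid looping forever
            chunks.append(chunk)
    return b''.join(chunks)
```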

To remove non-ASCII characters, you can either use a whitelist of specific characters or check each byte you read against a range you define. For example: is the byte between 0x30 and 0x39 (a digit)? Then keep it, save it somewhere, or add it to a string. See an ASCII table.
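Both variants, the range check and the whitelist, might look like this in Python 3. The names is_digit, KEEP and keep_whitelisted are hypothetical, and the exact set of characters to keep (digits, sign, decimal point, exponent letter and the null separator) is an assumption about what the numbers in these files look like.

```python
def is_digit(byte_value):
    """Range check: True if the byte is an ASCII digit (0x30 to 0x39)."""
    return 0x30 <= byte_value <= 0x39

# Whitelist (an assumption): characters that can occur in a number,
# plus the null byte so the fields can still be split afterwards
KEEP = frozenset(bytearray(b'0123456789+-.eE\x00'))

def keep_whitelisted(data):
    """Return a copy of data containing only the whitelisted bytes."""
    return bytes(bytearray(b for b in bytearray(data) if b in KEEP))
```

The filtered bytes can then be split on b'\x00'. Note that a whitelist like this also keeps stray 'e' or '-' characters from non-numeric text, so the float() test from the question is still needed afterwards.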

Tags: character-encoding python windows
