How to clean a data file from binary junk?

2023-03-12 05:18 Q&A author:

I have this data file, which is supposed to be a normal ASCII file. However, it has some junk at the end of the first line. It only shows when I look at it with vi or less:

  y mon d  h XX11 XX22 XX33 XX44 XX55 XX66^@
2011  6 6 10 14.0 15.5 14.3 11.3 16.2 16.1

grep is also saying that it's a binary file: Binary file data.dat matches

This is causing some trouble in my parsing script. I split each line and put the fields into an array. The last element (XX66) of the first array is somehow corrupted by the junk, so I can't match against it.

How can I clean that line or the array? I have tried running dos2unix on the file and stripping the array members with s/\s+$//. What is that junk anyway? Unfortunately I have no control over the data; it comes from a third party.

Any ideas?

Grep is trying to be smart and, when it sees an unprintable character, switches to "binary" mode. Add "-a" or "--text" to force grep to stay in "text" mode.

As for sed, try sed -e 's/[^ -~]//g', which says, "change every character not between space and tilde (chars 0x20 and 0x7E, respectively) into nothing". That'll strip tabs, too, but you can insert a tab character before the space inside the brackets to keep them (or any other special character you want to keep). Because the range depends on the collation order, running it as LC_ALL=C sed ... is the safer choice.
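The same whitelist idea can also be sketched with tr, which names the bytes to keep by octal code. This is only an illustration: the sample file is made up, and keeping tab, newline and carriage return alongside the printable range is an assumption about what the data needs.

```shell
#!/bin/sh
# Hypothetical sample: a header line with a trailing NUL and one data line.
sample=$(mktemp)
printf 'y mon\tXX66\0\n2011 6 16.1\n' > "$sample"

# -c complements the set and -d deletes, so only the listed bytes survive:
# tab (\011), newline (\012), carriage return (\015) and space through tilde.
LC_ALL=C tr -cd '\011\012\015\040-\176' < "$sample" > "$sample.clean"

cat "$sample.clean"
```

Writing to a new file rather than editing in place leaves the third-party original untouched, which makes it easy to compare the two afterwards.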

The "^@" is one way to represent an NUL (aka "ascii(0)" or "\0"). Some programs may also see that as an end-of-file if they were implemented in a naive way.
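To confirm which byte the ^@ really is, a byte dump is more reliable than an editor display. The sketch below builds a made-up sample with a NUL before the newline, dumps it with od, and counts the NUL bytes; the file and variable names are placeholders.

```shell
#!/bin/sh
# Hypothetical sample reproducing the symptom: a NUL byte before the newline.
sample=$(mktemp)
printf 'XX55 XX66\0\n' > "$sample"

# od -c prints one token per byte; the NUL appears as \0 right before \n.
od -c "$sample"

# Count the NUL bytes by deleting everything else and measuring what is left.
nul_count=$(tr -cd '\000' < "$sample" | wc -c | tr -d ' ')
echo "NUL bytes found: $nul_count"
```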

If it's always the same codes (eg ^@ or related) then you can find/replace them.

In Vim for example:

:%s/^@//g typed at the command line (press : in normal mode) will clear out all of those characters.

To enter a character such as ^@, press Ctrl-V and then the key combination for the character you need. In this case that is Ctrl-@, so remember to hold Shift as well to reach the @ key. Keep Ctrl held down until the end.

The ^@ looks like it's a control character. I can't figure out what character it should be, but I guess that's not important.

You can use s/^@//g to get rid of them, but you have to enter the actual control character (for example by copying it); just typing ^ and @ one after the other won't work.
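Outside an editor, a byte can be named by its octal code instead of being typed or copied. A minimal sketch, assuming the junk is a NUL and using made-up temporary files:

```shell
#!/bin/sh
# Hypothetical file names; the sample mimics the header line from the question.
src=$(mktemp)
dst=$(mktemp)
printf 'XX55 XX66\0\n14.0 16.1\n' > "$src"

# '\000' is the octal escape for NUL, so no literal control character
# has to be typed or copied into the command.
tr -d '\000' < "$src" > "$dst"

# The last field of the header now compares equal to the plain string.
last_field=$(head -n 1 "$dst" | awk '{ print $NF }')
[ "$last_field" = "XX66" ] && echo "header field matches"
```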


I created this small script to remove all binary, non-ASCII and some annoying characters from a file. Note that the character codes are written in octal:

#!/usr/bin/perl
use strict;
use warnings;

my $filename = $ARGV[0];
defined $filename or die "Usage: $0 FILE\n";
my $cleaned = "$filename.clean.txt";

# Read and write raw bytes so that no encoding layer alters the data.
open my $in,  '<:raw', $filename or die "Cannot open $filename: $!";
open my $out, '>:raw', $cleaned  or die "Cannot write $cleaned: $!";

while (my $line = <$in>) {
    # Delete, by octal code: control characters except newline (\12) and
    # carriage return (\15); the punctuation ! " # $ % & ' ( ) * + , -
    # (\41-\55); tilde and DEL (\176-\177); every non-ASCII byte (\200-\377).
    $line =~ tr/\0-\11\13\14\16-\37\41-\55\176-\377//d;
    print {$out} $line;
}

close $out or die "Cannot close $cleaned: $!";

Stripping individual characters with sed can be very slow, perhaps several minutes for a 100 MB file.

As an alternative, if you know the format/structure of the file, e.g. a log file where the "good" lines of the file start with a timestamp, then you can grep out the good lines and redirect those to a new file.

For example, if we know that all good lines start with a timestamp with the year 2021, we can use this expression to only output those lines to a new file:

grep -a "^2021" mylog.log > mylog2.log

Note that you must use the -a or --text option with grep to force grep to output lines when it detects that the file is binary.
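The line-filtering approach can be tried on a small made-up log first. In this sketch the file names and contents are hypothetical; only the grep -a call mirrors the command above.

```shell
#!/bin/sh
# Hypothetical log: two good lines starting with 2021 and one junk line.
log=$(mktemp)
printf '2021-01-01 start\n\0\001junk\n2021-01-02 stop\n' > "$log"

# -a makes grep treat the input as text, so matching lines are printed
# instead of the "Binary file ... matches" notice.
grep -a '^2021' "$log" > "$log.good"

cat "$log.good"
```

This only drops whole lines, so it would not repair a good line that carries junk at its end, such as the header line in the question.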
