How to clean a data file from binary junk?

I have a data file that is supposed to be a plain ASCII file. However, it has some junk at the end of the first line. The junk only shows up when I look at the file with vi or less:

  y mon d  h XX11 XX22 XX33 XX44 XX55 XX66^@
2011  6 6 10 14.0 15.5 14.3 11.3 16.2 16.1

grep is also saying that it's a binary file: Binary file data.dat matches

This is causing trouble in my parsing script. I split each line and put the fields into an array. The last element (XX66) of the first array is corrupted by the junk, so I can't match against it.

How can I clean that line or the array? I have tried running dos2unix on the file and stripping the array members with s/\s+$//. What is that junk anyway? Unfortunately I have no control over the data; it comes from a third party.

Any ideas?


Grep is trying to be smart: when it sees a NUL byte (or other data that does not look like text), it switches to "binary" mode. Add "-a" or "--text" to force grep to stay in "text" mode.
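As a minimal sketch of the -a option, assuming GNU grep and a made-up file name sample.dat that mimics the data from the question:

```shell
# Build a small sample file (hypothetical name) whose first line ends in a NUL byte.
printf '  y mon d  h XX11 XX66\000\n2011  6 6 10 14.0 16.1\n' > sample.dat

# -a (--text) makes grep treat the file as text, so it prints the matching
# line itself instead of a "binary file matches" notice.
grep -a '^2011' sample.dat
```

This only changes how grep reports matches; the NUL byte is still in the file.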

As for sed, try sed -e 's/[^ -~]//g', which says, "change every character not between space and tilde (chars 0x20 and 0x7E, respectively) into nothing". That will strip tabs, too, but you can put a literal tab character before the space inside the brackets to keep them (the same goes for any other special character you want to preserve).
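The sed approach can be tried on a throwaway file first. This is a sketch that assumes GNU sed (which passes NUL bytes through its pattern space); the file names are invented for the example, and LC_ALL=C is added so the range is interpreted byte by byte:

```shell
# Sample input: a NUL at the end of line 1 and a tab inside line 2.
printf 'XX55 XX66\000\n2011\t6 6\n' > sed_sample.dat

# Delete every byte outside the space..tilde range. The newline is the
# line terminator, so sed never sees it and it is kept.
LC_ALL=C sed 's/[^ -~]//g' sed_sample.dat > stripped.dat

# Variant that keeps tabs: a literal tab is placed inside the brackets.
TAB=$(printf '\t')
LC_ALL=C sed "s/[^${TAB} -~]//g" sed_sample.dat > stripped_tabs.dat
```

In stripped.dat the tab is gone along with the NUL, so "2011" and "6" run together; stripped_tabs.dat keeps the column separator.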

The "^@" is one way to represent a NUL byte (aka "ascii(0)" or "\0"). Some programs may also treat it as the end of the string or file if they were implemented in a naive way.
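To confirm that the junk really is a NUL, and to remove only that byte, something like the following should work. It is a sketch with invented file names; od and tr are standard POSIX tools. NUL is not whitespace, which would explain why stripping trailing whitespace leaves it in place.

```shell
# A line that ends in a NUL byte, like the header in the question.
printf 'XX66\000\n' > nul_sample.dat

# od -c prints one byte at a time; a NUL byte is shown as \0.
od -c nul_sample.dat

# Delete NUL bytes only; every other byte is copied unchanged.
tr -d '\000' < nul_sample.dat > nul_clean.dat
```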


If it's always the same codes (e.g. ^@ or related), then you can find and replace them.

In Vim for example:

Running :%s/^@//g from normal mode will clear out all of those characters.

To enter a control character such as ^@, press Ctrl-V and then the key combination for that character, here Ctrl-@ (on many keyboards you also need to hold Shift to reach the @ key). Keep the Ctrl key held down until the end.


The ^@ looks like it's a control character. I can't figure out what character it should be, but I guess that's not important.

You can use s/^@//g to get rid of them, but the pattern has to contain the actual control character (copied or entered as one character); just typing ^ and @ next to each other won't do it.


I created this small script to remove all binary, non-ASCII and some annoying characters from a file. Note that the character codes in the pattern are written in octal:

#!/usr/bin/perl
use strict;
use warnings;

@ARGV or die "Usage: $0 FILE\n";
my $filename = $ARGV[0];
open my $in,  '<', $filename    or die "Cannot open $filename: $!";
open my $out, '>', 'report.txt' or die "Cannot write report.txt: $!";
binmode($in);
binmode($out);

# Read one byte at a time until end of file.
my $byte;
while (read($in, $byte, 1)) {
    # Skip control characters (LF \012 and CR \015 are kept), the
    # punctuation range \041-\055 (! through -), and ~ and DEL (\176, \177).
    next if $byte =~ /[\000-\011\013\014\016-\037\041-\055\176\177]/;
    print $out $byte;
}
close $in;
close $out or die "Cannot close report.txt: $!";

# Finally, remove all the characters that are not ASCII.
system("perl -plne 's/[^[:ascii:]]//g' report.txt > $filename.clean.txt");
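For comparison, roughly the same filter can be written as a single tr call. This is a sketch rather than a drop-in replacement for the script: the file names are invented, and it assumes a tr that accepts octal ranges (POSIX and GNU tr do). It deletes the same octal ranges as the script and treats every byte from \200 upward as non-ASCII.

```shell
# Sample input: '-' and ',' (inside \041-\055), a NUL, '~', a tab
# and one high byte (\351) that is not ASCII.
printf 'a-b,c\000 ~x\tY\351\n' > tr_sample.txt

# Delete control characters (except LF and CR), the range ! through -,
# and everything from ~ (\176) up to \377.
LC_ALL=C tr -d '\000-\011\013\014\016-\037\041-\055\176-\377' \
    < tr_sample.txt > tr_filtered.txt

cat tr_filtered.txt
```

Note that the \041-\055 range also removes commas and minus signs, which matters if the data contains negative numbers or comma-separated fields.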


Stripping individual characters using sed can be very slow on large files, perhaps several minutes for a 100 MB file.

As an alternative, if you know the format/structure of the file, e.g. a log file where the "good" lines of the file start with a timestamp, then you can grep out the good lines and redirect those to a new file.

For example, if we know that all good lines start with a timestamp with the year 2021, we can use this expression to only output those lines to a new file:

grep -a "^2021" mylog.log > mylog2.log

Note that you must use the -a or --text option with grep to force grep to output lines when it detects that the file is binary.
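A small end-to-end run of this idea, assuming GNU grep; the log contents below are made up for the example:

```shell
# Hypothetical log: two good lines that start with 2021, plus two junk lines.
printf 'junk header\000\n2021-01-05 started\n\001\002garbage\n2021-01-06 stopped\n' > mylog.log

# Keep only the lines that start with the expected timestamp prefix.
grep -a '^2021' mylog.log > mylog2.log

cat mylog2.log
```

This drops whole lines only. Junk bytes that sit inside an otherwise good line are kept, so it does not replace a character-level cleanup when the junk is attached to a line you need.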
