Parsing irregular text files in Perl
I am new to Perl programming and would like to know how to parse text files with Perl. I have a text file with irregular formatting, and I would like to parse each record in it into three parts.
Basically, the file contains text similar to this:
;out;asoljefsaiouerfas'pozsirt'z
mysql_query("SELECT * FROM Table WHERE (value='true') OR (value2='true') OR (value3='true') ");
1234 434 3454
4if[9put[e]9sd=09q]024s-q]3-=04i
select ta.somefield, tc.somefield
from TableA ta INNER JOIN TableC tc on tc.somefield=ta.somefield
INNER JOIN TableB tb on tb.somefield=ta.somefield
ORDER by tb.somefield
234 4536 234
and the list goes on in this format.
So what I need to do is parse each record into three parts: the top line, which holds the hash check; the MySQL query; and the three numbers at the bottom. I cannot work out how to do this. I use Perl's 'open' function to read the data from the text file, and then I tried using 'split' on the line breaks, but the queries do not fit on a single line or follow a fixed pattern, so as far as I can tell that approach does not work.
Assumptions:
- There will be a blank line between chunks of data.
- That blank line will consist of only a newline.
- In these chunks the hash checks will be the top single line, and the three numbers will be the bottom single line.
With that in mind:
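use strict;
use warnings;
use English qw<$RS $OS_ERROR>;

my $path_to_file = 'data.txt';    # hypothetical path; point this at your file

local $RS = "\n\n";    # read one blank-line-separated chunk at a time
open( my $fh, '<', $path_to_file )
    or die "Could not open $path_to_file! - $OS_ERROR";

while ( <$fh> ) {    # read from the handle opened above, not from <>
    chomp;           # strips the trailing "\n\n" record separator
    my ( $hash_check_line, @inner_lines ) = split /\n/;
    my @numbers = split /\D+/, pop @inner_lines;
    my $sql     = join( "\n", @inner_lines );
    ...
}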
By changing the $RS ( $/ or $INPUT_RECORD_SEPARATOR ) to double newlines, we change how records are read in.
This is not unusual. In my years with Perl I have had to set the record separator to some pretty interesting strings, and sometimes that is all it takes to read in exactly the chunk you want.
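As a rough illustration of the record-separator idea, here is a small self-contained sketch. It swaps in Perl's paragraph mode ($/ set to the empty string) instead of "\n\n", which additionally treats a run of several blank lines as a single separator. The sample data and the @records structure are made up for the example, and an in-memory filehandle stands in for the real file.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the real file.
# Note the two blank lines between the records.
my $sample = join "\n",
    'abc123checksum',
    'select a',
    'from t',
    '1 2 3',
    '',
    '',
    'def456checksum',
    'select b from u',
    '4 5 6',
    '';

# Open the string as an in-memory file so the sketch needs no file on disk.
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = '';    # paragraph mode: one or more blank lines end a record
    while ( my $chunk = <$fh> ) {
        chomp $chunk;    # in paragraph mode, chomp removes all trailing newlines
        my ( $hash_check, @inner ) = split /\n/, $chunk;
        my @numbers = split ' ', pop @inner;    # last line holds the numbers
        push @records, {
            hash    => $hash_check,
            sql     => join( "\n", @inner ),    # whatever is left is the query
            numbers => \@numbers,
        };
    }
}
close $fh;

print "$_->{hash}: @{ $_->{numbers} }\n" for @records;
```

Whether paragraph mode or a literal "\n\n" is the better fit depends on whether the real file can contain more than one blank line between chunks.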
Oh, oh GOD.
The algorithm I see is:
- Cache the first line.
- Read all the lines until a blank line.
- The 'last' line will be numbers.
- All the rest will be the query.
With that in mind, I present the following code:
open my $fh, '<', $path_to_file
    or die "Can't open $path_to_file: $!";

while (my ($checksum, $query, $numbers) = read_record($fh) ) {
    # do something with record
}
close $fh or warn "$!";

sub read_record {
    my $fh = shift;
    my @lines;
    LINE: while (my $line = <$fh>) {
        chomp $line;
        last LINE if $line eq q{};   # if empty, we're done with the record!
        push @lines, $line;          # store it :)
    }
    return unless @lines;            # if we didn't get anything, eof!
    my $checksum = shift @lines;     # first was checksum.
    my $numbers  = pop @lines;       # last thing read was numbers.
    my $query    = join ' ', @lines; # everything else, query.
    return ($checksum, $query, $numbers);
}
Modify, of course, to suit boundary conditions.
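One boundary condition worth noting: with read_record as written, two blank lines in a row produce an empty record, which ends the calling loop early. The following is only a sketch of one possible adjustment. The name read_record_lenient, the whitespace handling, the three-line minimum, and the sample string are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Variant of read_record: skips blank lines that come before a record and
# treats whitespace-only lines as blank. Returns an empty list at end of file.
sub read_record_lenient {
    my $fh = shift;
    my @lines;
    while ( my $line = <$fh> ) {
        $line =~ s/\s+\z//;        # strip newline, "\r", and trailing spaces
        if ( $line eq q{} ) {
            next unless @lines;    # blank line before any data: keep looking
            last;                  # blank line after data: record is complete
        }
        push @lines, $line;
    }
    return unless @lines;          # nothing read: end of file

    # Assumption: a well-formed record has checksum, query, and numbers lines.
    die "Malformed record starting with '$lines[0]'\n" if @lines < 3;

    my $checksum = shift @lines;
    my $numbers  = pop @lines;
    return ( $checksum, join( ' ', @lines ), $numbers );
}

# Hypothetical input: a leading blank line, plus a separator made of one
# empty line and one line containing only spaces.
my $sample = "\nsum1\nselect x\nfrom y\n1 2 3\n\n   \nsum2\nselect z\n4 5 6\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

my @parsed;
while ( my @record = read_record_lenient($fh) ) {
    push @parsed, \@record;
}
close $fh;

print join( ' | ', @$_ ), "\n" for @parsed;
```

Dying on a short record is just one choice; skipping it with a warning may suit messy input better.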
The following seems to work:
while ($file_content =~ /\s*^(.+?)^(.*?)^(\d+\s+\d+\s+\d+)$/smg) {
my $checksum = $1;
my $query = $2;
my $numbers = $3;
# do stuff
}
Here is an explanation for the regex:
\s*                   # eat up empty lines
^(.+?)                # save the checksum line to group 1
^(.*?)                # save one or multiple query lines to group 2
^(\d+\s+\d+\s+\d+)$   # save number line to group 3
The first group will only ever contain one line: because it is lazy, the regex tries to start matching the second group as soon as the next line begins. If the rest of the match can be completed from that point, the second group will contain all subsequent lines before the numbers. Note that groups 1 and 2 each keep their trailing newline.
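To see the regex behave as described, here is a small sketch that runs it over made-up sample data. The heredoc contents and the @records array are assumptions for the example; in real use $file_content would hold the whole file, slurped into one string.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the slurped file contents.
my $file_content = <<'END';
;out;asoljef
select a
from t
1234 434 3454

4if[9put[e]9
select b from u
234 4536 234
END

my @records;
while ( $file_content =~ /\s*^(.+?)^(.*?)^(\d+\s+\d+\s+\d+)$/smg ) {
    my ( $checksum, $query, $numbers ) = ( $1, $2, $3 );
    chomp $checksum;    # group 1 ends with the newline of the checksum line
    chomp $query;       # group 2 ends with the newline of the last query line
    push @records, [ $checksum, $query, $numbers ];
}

for my $record (@records) {
    my ( $checksum, $query, $numbers ) = @$record;
    print "checksum: $checksum\nquery: $query\nnumbers: $numbers\n\n";
}
```

This approach relies on the query never containing a line made up of exactly three whitespace-separated numbers, since such a line would be taken as the end of the record.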