Parsing irregular text files in Perl

Posted 2023-03-07 07:24

I am new to Perl programming and would like to know about parsing text files with Perl. I have a text file with irregular formatting, and I would like to parse each block of it into three parts.

Basically the file includes text similar to these:

;out;asoljefsaiouerfas'pozsirt'z
mysql_query("SELECT * FROM Table WHERE (value='true') OR (value2='true') OR (value3='true') ");
1234 434 3454

4if[9put[e]9sd=09q]024s-q]3-=04i
select ta.somefield, tc.somefield 
from TableA ta INNER JOIN TableC tc on tc.somefield=ta.somefield 
INNER JOIN TableB tb on tb.somefield=ta.somefield 
ORDER by tb.somefield
234 4536 234

and the list goes on with this format.

So what I need to do is parse each block into three parts: the top line, which holds the hash checks; the MySQL query in the middle; and the three numbers at the bottom. For some reason I cannot work out how to do this. I use Perl's 'open' function to read the data from the text file and then try to use 'split' on the line breaks, but the queries are not confined to a single line and follow no fixed pattern, so as far as I can tell that approach does not work.

Assumptions:

There will be a blank line between chunks of data.
That blank line will consist of only a newline.
In these chunks the hash checks will be the top single line, and the three numbers will be the bottom single line.

With that in mind:

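use strict;
use warnings;
use English qw<$RS $OS_ERROR>;

# Assumption: the path to the input file is passed on the command line.
my $path_to_file = shift @ARGV;

# Read one blank-line-delimited chunk per iteration instead of one line.
local $RS = "\n\n";

open( my $fh, '<', $path_to_file ) 
    or die "Could not open $path_to_file! - $OS_ERROR"
    ;
while ( <$fh> ) {    # read from the opened handle, not from STDIN/ARGV
    chomp;           # removes the trailing "\n\n" record separator
    my ( $hash_check_line
       , @inner_lines 
       )
       = split /\n/
       ;
    # The last line of the chunk holds the three numbers ...
    my @numbers = split /\D+/, pop @inner_lines;
    # ... and whatever remains in between is the SQL query.
    my $sql     = join( "\n", @inner_lines );

    ...    # process $hash_check_line, $sql and @numbers here
}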

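By setting $RS (also known as $/ or $INPUT_RECORD_SEPARATOR) to a double newline, we change how records are read in: each read from the filehandle now returns a whole blank-line-delimited chunk rather than a single line.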

This is not such a bizarre trick. In my years with Perl I have had to set the record separator to some pretty interesting strings, and sometimes that is all it takes to read in exactly the chunk you want.
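As a small illustration of the record-separator idea, here is a minimal sketch that uses Perl's paragraph mode ($/ set to the empty string), which treats any run of blank lines as a single separator. This is a variation on the "\n\n" setting above, not the same code; the helper name read_chunks and the sample data are made up for the demonstration.

```perl
use strict;
use warnings;

# read_chunks is a hypothetical helper: it returns every
# blank-line-separated chunk readable from the given filehandle.
sub read_chunks {
    my ($fh) = @_;
    local $/ = "";            # paragraph mode: one or more blank lines end a record
    my @chunks;
    while ( my $chunk = <$fh> ) {
        chomp $chunk;         # in paragraph mode, chomp strips all trailing newlines
        push @chunks, $chunk;
    }
    return @chunks;
}

# Made-up sample data read through an in-memory filehandle.
my $sample = "aaa\nselect 1\n1 2 3\n\n\n\nbbb\nselect 2\nfrom t\n4 5 6\n";
open( my $fh, '<', \$sample ) or die "Could not open in-memory file: $!";
my @chunks = read_chunks($fh);
close $fh;

print scalar(@chunks), " chunks read\n";
```

Because the separator is localized inside the helper, the rest of the program keeps reading line by line as usual.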

Oh, oh GOD.

The algorithm I see is:

Cache the first line.
Read all the lines until a blank line.
The last line will be the numbers.
All the rest will be the query.

With that in mind, I present the following code:

open my $fh, '<', $path_to_file
    or die "Can't open $path_to_file: $!";
while (my ($checksum, $query, $numbers) = read_record($fh) ) {
    # do something with record
}
close $fh or warn "$!";

sub read_record {
    my $fh = shift;
    my @lines;
    LINE: while (my $line = <$fh>) {
        chomp $line;
        last LINE if $line eq q{}; # if empty, we're done with the record!
        push @lines, $line;        # store it :)
    }
    return unless @lines;          # if we didn't get anything, eof!
    my $checksum = shift @lines;   # first was checksum.
    my $numbers = pop @lines;      # last thing read was numbers.
    my $query = join ' ', @lines;  # everything else, query.
    return ($checksum, $query, $numbers);
}

Modify, of course, to suit boundary conditions.
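One possible set of boundary conditions is runs of several blank lines, whitespace-only separator lines, and Windows line endings. The sketch below is a variant of read_record that tolerates those; which cases actually occur in the real file is an assumption, and the name read_record_lenient is made up here.

```perl
use strict;
use warnings;

# Hypothetical variant of read_record: skips blank lines that come before a
# record and treats whitespace-only lines as blank.
sub read_record_lenient {
    my $fh = shift;
    my @lines;
    while ( my $line = <$fh> ) {
        $line =~ s/\s+\z//;        # strip the newline, any CR, and trailing spaces
        if ( $line eq q{} ) {
            next unless @lines;    # blank line before any data: skip it
            last;                  # blank line after data: record is complete
        }
        push @lines, $line;
    }
    return unless @lines;          # nothing read: end of file
    my $checksum = shift @lines;   # first line is the checksum
    my $numbers  = pop @lines;     # last line is the numbers
    my $query    = join ' ', @lines;
    return ( $checksum, $query, $numbers );
}
```

It is called exactly like read_record in the loop above; the only difference is that an extra blank line no longer ends the loop early.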

The following seems to work, assuming $file_content holds the entire contents of the file:

while ($file_content =~ /\s*^(.+?)^(.*?)^(\d+\s+\d+\s+\d+)$/smg) {
    my $checksum = $1;
    my $query = $2;
    my $numbers = $3;
    # do stuff
}

Here is an explanation for the regex:

\s*                   # eat up empty lines
^(.+?)                # save the checksum line to group 1
^(.*?)                # save the query lines (there may be several) to group 2
^(\d+\s+\d+\s+\d+)$   # save number line to group 3

The first group will always be exactly one line: because it is lazy, it stops as soon as the start of the next line is reached, and the regex then tries to match the second group. From that point, if the rest of the match can be completed, the second group will contain all the subsequent lines before the numbers.
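To see the pattern at work, here is a minimal sketch that wraps it in a function and lays it out with the /x modifier so each part carries its own comment. The wrapper name parse_records and the returned structure are illustrative assumptions. Groups 1 and 2 end at the start of the next line, so they keep a trailing newline, which the sketch removes with chomp.

```perl
use strict;
use warnings;

# parse_records is a hypothetical wrapper around the regex shown above.
# It returns a list of [ checksum, query, [ numbers ] ] array references.
sub parse_records {
    my ($file_content) = @_;
    my @records;
    while (
        $file_content =~ /
            \s*                     # eat up empty lines
            ^(.+?)                  # group 1: the checksum line
            ^(.*?)                  # group 2: the query lines
            ^(\d+\s+\d+\s+\d+)$     # group 3: the number line
        /smgx
      )
    {
        my ( $checksum, $query, $numbers ) = ( $1, $2, $3 );
        chomp $checksum;            # drop the newline captured with the line
        chomp $query;
        push @records, [ $checksum, $query, [ split ' ', $numbers ] ];
    }
    return @records;
}

# Made-up sample data in the same three-part layout.
my @records = parse_records(
    "abc\nselect 1\nfrom t\n1 2 3\n\ndef\nselect 2\n4 5 6\n"
);
print "$_->[0]: @{ $_->[2] }\n" for @records;
```

Since \s+ also matches newlines, a stricter variant could use [ \t]+ between the numbers to keep the third group on a single line.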
