How can I filter a large file into two separate files?

I've got a huge file (500 MB) that is organized like this:

<link type="1-1" xtargets="1;1">
    <s1>bunch of text here</s1>
    <s2>some more here</s2>
</link>
<link type="1-1" xtargets="1;1">
    <s1>bunch of text here</s1>
    <s2>some more here</s2>
</link>
<link type="1-1" xtargets="1;1">
    <s1>bunch of text here</s1>
    <s2>some more here</s2>
</link>

I'd like to transform this into a new format where the content of every s1 goes to one new file, one entry per line, and the content of every s2 goes to a second new file, also one entry per line.

Is Perl the way to go here? If so, can someone let me know how I can accomplish this?


I warmly recommend using XML::Twig, since it is capable of handling streams of XML data. You can use it something like this:

use strict; use warnings;
use XML::Twig;
use Data::Dumper;   # used below to inspect the simplified structure
my $xml = XML::Twig->new( twig_handlers => { link => \&process_link } );

$xml->parsefile('Your file here');

sub process_link
{
    my($xml, $link) = @_;
    # You can now handle each individual block here..

One trick is to do something like:

my $structure = $link->simplify;

Now it's a mixture of hashrefs and arrayrefs, depending on the structure. Everything, including attributes, is there.

print Dumper $structure; exit;

You can use Data::Dumper to inspect it and take what you need.

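Just remember to purge the twig when you're done, to free up memory. (flush frees the memory too, but it also prints the XML parsed so far, which you don't want here.)

    $xml->purge;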
}
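
The same stream-and-discard idea can be sketched in Python with the standard library's XMLPullParser. This is only an illustration under stated assumptions: the function name split_links and its chunk_size parameter are made up for this sketch, and the synthetic <root> wrapper is there because the sample in the question is a bare sequence of <link> elements with no enclosing root, which a strict XML parser would otherwise reject.

```python
import io
import xml.etree.ElementTree as ET


def split_links(src, out_s1, out_s2, chunk_size=64 * 1024):
    """Stream <link> blocks from text stream `src`; write s1/s2 text one per line."""
    parser = ET.XMLPullParser(events=("start", "end"))
    root = None

    def drain():
        nonlocal root
        for event, elem in parser.read_events():
            if event == "start":
                if root is None:
                    root = elem  # the synthetic wrapper element
            elif elem.tag == "link":
                for tag, out in (("s1", out_s1), ("s2", out_s2)):
                    text = elem.findtext(tag) or ""
                    # Collapse internal whitespace so each entry stays on one line.
                    out.write(" ".join(text.split()) + "\n")
                root.clear()  # detach finished <link> blocks to keep memory flat

    # Assumption: the input has no single root element, so one is fed around it.
    parser.feed("<root>")
    for chunk in iter(lambda: src.read(chunk_size), ""):
        parser.feed(chunk)
        drain()
    parser.feed("</root>")
    parser.close()
    drain()  # pick up any events the parser delivered only at the end


# Small in-memory demonstration; real use would pass open file objects.
demo = io.StringIO("<link><s1>first</s1><s2>second</s2></link>\n" * 2)
s1_out, s2_out = io.StringIO(), io.StringIO()
split_links(demo, s1_out, s2_out)
print(s1_out.getvalue(), s2_out.getvalue(), sep="")
```

A <link> that lacks one of the two children produces an empty line in that file, so line N of one output still corresponds to line N of the other. Whether that is the desired behavior depends on how the files are used afterwards.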


Use an XML parser. This problem is well suited to an event-based parser, so I'd recommend looking into how the XML::Parser or XML::SAX modules work. You should be able to create event handlers for each of the two kinds of tag you want to process and direct the matching content to two separate files.
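
As a rough illustration of the event-handler approach, here is a minimal sketch using Python's standard xml.parsers.expat module rather than the Perl modules named above. The function name split_with_events, the outputs mapping and chunk_size are hypothetical names chosen for this sketch. It also assumes the input has no single root element, so it feeds a synthetic <root> wrapper around the data.

```python
import io
import xml.parsers.expat


def split_with_events(src, outputs, chunk_size=64 * 1024):
    """Route the text inside each tag named in `outputs` to its stream.

    `src` is a binary stream; `outputs` maps a tag name such as "s1"
    to a writable text stream.
    """
    parser = xml.parsers.expat.ParserCreate()
    current = None  # wanted tag we are currently inside, if any
    parts = []      # character data collected for that tag

    def start(name, attrs):
        nonlocal current
        if name in outputs:
            current = name
            parts.clear()

    def chars(data):
        if current is not None:
            parts.append(data)

    def end(name):
        nonlocal current
        if name == current:
            # Collapse whitespace so every entry occupies exactly one line.
            outputs[name].write(" ".join("".join(parts).split()) + "\n")
            current = None

    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end

    # Assumption: no enclosing root in the input, so wrap it in one.
    parser.Parse(b"<root>", False)
    for chunk in iter(lambda: src.read(chunk_size), b""):
        parser.Parse(chunk, False)
    parser.Parse(b"</root>", True)


# Small in-memory demonstration; real use would pass open file objects.
demo = io.BytesIO(b"<link><s1>first</s1><s2>second</s2></link>\n" * 2)
streams = {"s1": io.StringIO(), "s2": io.StringIO()}
split_with_events(demo, streams)
print(streams["s1"].getvalue(), streams["s2"].getvalue(), sep="")
```

Because the handlers only ever hold the text of the element currently being read, memory use does not grow with the size of the input file.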


You can use Perl, but it's NOT the only way. Here's one with gawk:

gawk -F">" '/<s[12]>/{o=$0;sub(/.*</,"",$1);print o > "file_"$1 }' file

Or, if your task is very simple, then:

awk '/<s1>/' file > file_s1
awk '/<s2>/' file > file_s2

or grep:

grep "<s1>" file > file_s1
grep "<s2>" file > file_s2


Yes, Perl is the (or maybe "a") way to go.

You need an XML parser. There are several choices on CPAN so have a look.

XML::LibXML::Parser looks like it has something for parsing parts of files, which sounds like what you need.


First, if you are going to ignore the fact that the input is XML, then there is no need for Perl or Python or gawk or any other language. Just use

$ grep '<s1>' input_file > s1.txt
$ grep '<s2>' input_file > s2.txt

and be done with it. This seems inefficient, but given the time it takes to write a script and then invoke it, the inefficiency is insignificant. What's worse, if you do not know how to write that particularly simple script, you have to post on SO and wait for an answer, and that wait exceeds the inefficiency of the grep solution by many orders of magnitude.

Now, if the fact that the input is XML matters in the slightest, you should use an XML parser. Contrary to the incorrect claim made elsethread, there are plenty of XML parsers which do not have to load the whole file into memory. Such a parser would have the advantage of being extensible and correct.

The example I give below is intended to replicate the structure of the answer you have already accepted to show you that it is no more complicated to use a proper solution.

Just to give fair warning, the script below is likely to be the slowest possible way. I wrote it to exactly mimic the accepted solution.

#!/usr/bin/perl

use strict; use warnings;
use autodie;

my %fh = map { open my $f, '>',  $_; $_ => $f } qw{ s1.txt s2.txt };

use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(\*DATA);
$parser->xml_mode(1);

while ( my $tag = $parser->get_tag('s1',  's2') ) {
    my $type = $tag->get_tag;
    my $text = $parser->get_text("/$type");
    print { $fh{"$type.txt"} } $text,  "\n";
}    
__DATA__
<link type="1-1" xtargets="1;1">
    <s1>bunch of text here</s1>
    <s2>some more here</s2>
</link>
<link type="1-1" xtargets="1;1">
    <s1>bunch of text here</s1>
    <s2>some more here</s2>
</link>
<link type="1-1" xtargets="1;1">
    <s1>bunch of text here</s1>
    <s2>some more here</s2>
</link>

Output:

C:\Temp> cat s1.txt
bunch of text here
bunch of text here
bunch of text here

C:\Temp> cat s2.txt
some more here
some more here
some more here


You can use one of these methods to do this task:

  1. Regular expressions
  2. HTML::TreeBuilder module
  3. HTML::TokeParser module
  4. XML::LibXML module


>> Is perl the way to go here 

Definitely not always the way to go. Here's one in Python

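# Read the input line by line so memory use stays small;
# the with block closes all three files even if an error occurs.
with open("xmlfile") as f, \
        open("file_s1", "a") as out1, \
        open("file_s2", "a") as out2:
    for line in f:
        if "<s1>" in line:
            out1.write(line)
        elif "<s2>" in line:
            out2.write(line)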
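
Note that the line filter above copies whole lines, tags and indentation included. If only the text between the tags is wanted, a variation along these lines might do. It is a sketch that assumes every <s1> and <s2> element opens and closes on a single line, as in the sample; split_lines and TAGGED are names invented for this example.

```python
import io
import re

# Matches <s1>...</s1> or <s2>...</s2> on one line and captures tag and text.
TAGGED = re.compile(r"<(s1|s2)>(.*?)</\1>")


def split_lines(src, outputs):
    """Write the text of each single-line s1/s2 element to outputs[tag]."""
    for line in src:
        match = TAGGED.search(line)
        if match:
            tag, text = match.groups()
            # Entities such as &amp; are left as-is; this is not an XML parser.
            outputs[tag].write(text + "\n")


# Small in-memory demonstration; real use would pass open file objects.
demo = io.StringIO("    <s1>first</s1>\n    <s2>second</s2>\n")
streams = {"s1": io.StringIO(), "s2": io.StringIO()}
split_lines(demo, streams)
print(streams["s1"].getvalue(), streams["s2"].getvalue(), sep="")
```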


If the file is huge, an XML parser could result in a significant slowdown or even an application crash, as XML parsers require the entire file in memory before any operations can be performed on it (something high-level fluffy cloud developers often forget about recursive structures).

Instead you can be pragmatic. It appears that your data follows fairly consistent patterns. And this is a one-time transformation.

Try something like


BEGIN {
  open( FOUT1, '>', 's1.txt' ) or die( "Cannot open s1.txt: $!" );
  open( FOUT2, '>', 's2.txt' ) or die( "Cannot open s2.txt: $!" );
}
while ( defined( my $line = <> ) ) {
  if ( $line =~ m{<s1>(.+?)</s1>} ) {
    print( FOUT1 "$1\n" );
  } elsif ( $line =~ m{<s2>(.+?)</s2>} ) {
    print( FOUT2 "$1\n" );
  }
}
END {
  close( FOUT2 );
  close( FOUT1 );
}

Then run this script as perl myscript.pl <bigfile.txt.

Update 1: corrected reference to matched section as $1 from $2.
