How do I merge specific columns from files in an array or hash of file handles, one line at a time?
I'll start by describing the files I am working with:
./groupA
./groupA/fileA.txt
./groupA/fileB.txt
./groupA/fileC.txt
./groupA/fileD.txt
./groupB
./groupB/fileA.txt
./groupB/fileB.txt
./groupB/fileC.txt
etc.
Here is what I would like to do:
1. I have a hash or array of file handles for each groupI, pointing to very large tab-delimited text files fileJ, each several hundreds of MB in size.
2. I would like to loop through the file handles, reading in one tab-delimited line at a time. I cannot read all the files' lines into memory.
3. Once I finish looping through the file handles, I then would like to split each line, grab a specific column of data from each split-array (fifth field, for example), and merge the data into a line of output.
4. Repeat step 2 (grabbing one line from each file handle) until EOF.
5. I will then end up with groupA/mergedOutput.mtx, groupB/mergedOutput.mtx, etc.
The problem is that I don't know how to do steps 2 and 3 correctly.
Here is the code I have so far:
#!/usr/bin/perl
use strict;
use warnings;
use File::Glob qw(glob);
my @groups = qw(groupA groupB groupC);
my ($mergedOutputFn, %fileHandles);
foreach my $group (@groups) {
    $mergedOutputFn = "$group/mergedOutput.mtx";
    # Step 1:
    # Make hash table of file handles
    foreach my $inputFn (<"$group/*.txt">) {
        open my $handle, '< $inputFn' or die "could not open $inputFn\n";
        $fileHandles{$inputFn} = $handle;
    }
    # Steps 2 and 3:
    # Grab a line from each file handle
    # Repeat until EOF
    while(1) {
        my @mergedOutputLineElements = ();
        foreach (sort keys %handles) {
            my $handle = $handles{$_};
            my $line = <$handle>;
            chomp($line);
            my @lineElements = split("\t", $line);
            push (@mergedOutputLineElements, $lineElements[4]);
            last if (! defined $line); # jump out of while loop
        }
        print Dumper join("\t", @mergedOutputLineElements);
    }
    # Step 4:
    # Close handles
    foreach (sort keys %handles) {
        close $handles{$_};
    }
}
One issue seems to be that the following code doesn't work:
foreach (sort keys %handles) {
    my $handle = $handles{$_};
    my $line = <$handle>;
    ...
}
If I try to print out the value of $line, then I get a GLOB value:
print Dumper $line;
...
GLOB(0x1d769f80)
How am I mishandling $line, or is there an easier way to do this within Perl?
Thanks for your advice.
EDIT
Here is the fixed code:
#!/usr/bin/perl
use strict;
use warnings;
# The built-in glob is enough here; no File::Glob import is needed.
my @groups = qw(groupA groupB groupC);
foreach my $group (@groups) {
    my $mergedOutputFn = "$group/mergedOutput.mtx";
    # Three-argument open with a lexical handle; report the OS error on failure
    open my $merge, '>', $mergedOutputFn
        or die "could not open handle to $mergedOutputFn: $!\n";
    # Step 1:
    # Make hash table of file handles (a fresh hash for each group)
    my %fileHandles;
    foreach my $inputFn (glob "$group/*.txt") {
        open my $handle, '<', $inputFn
            or die "could not open $inputFn: $!\n";
        $fileHandles{$inputFn} = $handle;
    }
    # Steps 2 and 3:
    # Grab a line from each file handle
    # Repeat until EOF (the loop is skipped if the group has no input files)
    LINE: while (%fileHandles) {
        my @mergedOutputLineElements = ();
        foreach (sort keys %fileHandles) {
            my $handle = $fileHandles{$_};
            my $line = readline $handle;
            last LINE if (! defined $line); # jump out of while loop
            chomp($line);
            my @lineElements = split("\t", $line);
            push (@mergedOutputLineElements, $lineElements[4]);
        }
        # Terminate each merged row with a newline
        print {$merge} join("\t", @mergedOutputLineElements), "\n";
    }
    # Step 4:
    # Close handles
    foreach (sort keys %fileHandles) {
        close $fileHandles{$_};
    }
    close $merge;
}
Thanks for the tips!
You can read from filehandles like this:
foreach (sort keys %handles) {
    my $line = readline $handles{$_};
    ...
}
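
Some background on why readline helps: inside angle brackets, Perl only treats a bareword or a simple scalar variable such as <$handle> as a filehandle read. Anything more complex, such as <$handles{$_}>, is interpreted as a glob() pattern, so the handle gets stringified and you get back something like GLOB(0x...). That is probably what happened in your real code. Here is a minimal sketch of the whole merge using readline. It assumes in-memory strings as stand-ins for your large files, and merge_column, %data and the sample values are hypothetical names made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data standing in for two large tab-delimited files.
my %data = (
    'fileA.txt' => "a1\ta2\ta3\ta4\ta5\nb1\tb2\tb3\tb4\tb5\n",
    'fileB.txt' => "c1\tc2\tc3\tc4\tc5\nd1\td2\td3\td4\td5\n",
);

# Open an in-memory read handle on each string (a stand-in for opening a path).
my %handles;
foreach my $name (keys %data) {
    open my $fh, '<', \$data{$name} or die "could not open $name: $!\n";
    $handles{$name} = $fh;
}

# merge_column is a hypothetical helper: on each pass it reads one line from
# every handle and joins the chosen zero-based column, stopping at the first EOF.
sub merge_column {
    my ($handles, $column) = @_;
    my @rows;
    LINE: while (%$handles) {
        my @fields;
        foreach my $name (sort keys %$handles) {
            # readline accepts any expression that yields a handle;
            # <$handles->{$name}> would be parsed as a glob() pattern instead.
            my $line = readline($handles->{$name});
            last LINE unless defined $line;
            chomp $line;
            push @fields, (split /\t/, $line)[$column];
        }
        push @rows, join("\t", @fields);
    }
    return @rows;
}

my @merged = merge_column(\%handles, 4);
print "$_\n" for @merged;
```

Only one line per file is held in memory at a time, so the same loop should scale to files of several hundred MB. Note that it stops as soon as the shortest file runs out.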