How to sort IP addresses and merge two files efficiently using Perl or *nix commands?
(*) This problem should be solved in Perl or with any *nix commands.
I'm working on a program where efficiency matters. file1 consists of IP addresses and some other data:
index ipsrc portsrc ipdest portdest
8 128.3.45.10 2122 169.182.111.161 80 (same ip src and dst)
9 128.3.45.10 2123 169.182.111.161 22 (same ip src and dst)
10 128.3.45.10 2124 169.182.111.161 80 (same ip src and dst)
19 128.3.45.128 62256 207.245.43.126 80
The other file, file2, looks like this (file1 and file2 are in different orders):
128.3.45.10 ioc-sea-lm 169.182.111.161 microsoft-ds 0 0 3 186 3 186
128.3.45.10 hypercube-lm 169.182.111.161 https 0 0 3 186 3 186
128.3.44.112 pay-per-view 148.184.171.6 netbios-ssn 0 0 3 186 3 186
128.3.45.12 cadabra-lm 148.184.171.6 microsoft-ds 0 0 3 186 3 186
1- SORT file1 using IP address in second column and SORT file2 using IP address in first column
2- Merge the 1st, 3rd and 5th columns of File1 with File 2
I need to create a new file which will look like this:
128.3.45.10 ioc-sea-lm 169.182.111.161 microsoft-ds 0 0 3 186 3 186 --> 2122 80 8
128.3.45.10 hypercube-lm 169.182.111.161 https 0 0 3 186 3 186 --> 2123 22 9
128.3.44.112 pay-per-view 148.184.171.6 netbios-ssn 0 0 3 186 3 186 --> * * *
128.3.45.12 cadabra-lm 148.184.171.6 microsoft-ds 0 0 3 186 3 186 --> * * *
Basically, the port numbers and the index number will be added.
Superficially, it seems an obvious application for sort and join:
sort -k2 file1 > sorted.1
sort -k1 file2 > sorted.2
join -1 2 -2 1 -a 2 -e '*' \
-o 2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,2.9,2.10,1.1,1.3,1.5 \
sorted.1 sorted.2
However, the output from that is:
128.3.44.112 pay-per-view 148.184.171.6 netbios-ssn 0 0 3 186 3 186 * * *
128.3.45.10 hypercube-lm 169.182.111.161 https 0 0 3 186 3 186 8 2122 80
128.3.45.10 ioc-sea-lm 169.182.111.161 microsoft-ds 0 0 3 186 3 186 8 2122 80
128.3.45.10 hypercube-lm 169.182.111.161 https 0 0 3 186 3 186 9 2123 22
128.3.45.10 ioc-sea-lm 169.182.111.161 microsoft-ds 0 0 3 186 3 186 9 2123 22
128.3.45.10 hypercube-lm 169.182.111.161 https 0 0 3 186 3 186 10 2124 80
128.3.45.10 ioc-sea-lm 169.182.111.161 microsoft-ds 0 0 3 186 3 186 10 2124 80
128.3.45.12 cadabra-lm 148.184.171.6 microsoft-ds 0 0 3 186 3 186 * * *
Close, but no dice: the problem is that the IP address 128.3.45.10 appears thrice in file1 and twice in file2, and join therefore creates 6 rows (a Cartesian product).
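The multiplication is easy to reproduce in isolation. The following is a minimal sketch using made-up data (the key k and the values a1..a3, b1, b2 are purely illustrative, not taken from the question's files); it only assumes that mktemp, join and awk are available:

```shell
# Illustrative data only: key "k" appears 3 times on the left, 2 times on the right.
tmp=$(mktemp -d)
printf '%s\n' 'k a1' 'k a2' 'k a3' > "$tmp/left"
printf '%s\n' 'k b1' 'k b2'        > "$tmp/right"

# join pairs every left line with every right line that shares the key.
join "$tmp/left" "$tmp/right"

# Count the joined rows: 3 x 2 combinations.
rows=$(join "$tmp/left" "$tmp/right" | awk 'END { print NR }')
echo "rows: $rows"
rm -rf "$tmp"
```

With three lines on one side and two on the other, join emits six rows, which is the same effect seen above for 128.3.45.10.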
This appears to be an application that needs to consume/destroy entries as they are used. This suggests we are going to need to use Perl (or a similar scripting language). It is then not clear that we need to sort file1; we need to read file1 and create a structure with a hash keyed on IP address (field 2) where the key points to an array of strings, each record containing just the three fields (1, 3, 5) that are needed.
Then we process file2 in sequence, finding the matching IP address in the hash, and using the first entry in the array - or stars if there is no such entry. We can add the '-->' requested in the question too.
This leads to the fairly simple program:
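#!/usr/bin/env perl
use strict;
use warnings;

# Map each source IP (field 2 of file1) to a queue of "index srcport dstport" strings.
my %file1 = read_file1("file1");

sub read_file1
{
    my($file) = @_;
    open my $fh, '<', $file or die "Failed to open $file for reading ($!)";
    my %file1;
    while (my $line = <$fh>)
    {
        chomp $line;                      # otherwise a 5-field line keeps its newline
        my @fields = split ' ', $line;    # split on runs of whitespace
        next if @fields < 5;              # skip blank or short lines
        my $ip = $fields[1];
        # push autovivifies the array the first time an IP is seen
        push @{$file1{$ip}}, "$fields[0] $fields[2] $fields[4]";
    }
    close $fh;
    return %file1;
}

my $file2 = "file2";
open my $f2, '<', $file2 or die "Failed to open $file2 for reading ($!)";
while (my $line = <$f2>)
{
    chomp $line;
    my($ip) = ($line =~ m/^(\S+)/);
    next unless defined $ip;              # skip blank lines
    my $aux = "* * *";
    if (defined $file1{$ip})
    {
        # Consume the oldest unused entry so it is never matched twice.
        $aux = shift @{$file1{$ip}};
        delete $file1{$ip} if scalar @{$file1{$ip}} == 0;
    }
    print "$line --> $aux\n";
}
close $f2;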
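And the output is this, matching the requested layout except that the added fields appear in the order index, source port, destination port (columns 1, 3 and 5 of file1):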
128.3.45.10 ioc-sea-lm 169.182.111.161 microsoft-ds 0 0 3 186 3 186 --> 8 2122 80
128.3.45.10 hypercube-lm 169.182.111.161 https 0 0 3 186 3 186 --> 9 2123 22
128.3.44.112 pay-per-view 148.184.171.6 netbios-ssn 0 0 3 186 3 186 --> * * *
128.3.45.12 cadabra-lm 148.184.171.6 microsoft-ds 0 0 3 186 3 186 --> * * *
There is nary a sort in sight, so it is reasonably efficient: each file is read once, and each line of file2 costs one hash lookup.
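Since the question also allows plain *nix commands, the same consume-as-you-go idea can be sketched in awk. This is only a sketch under stated assumptions, not a tested replacement for the Perl program: the function name merge_consuming is hypothetical, fields are assumed to be whitespace-separated, and file1 is assumed to be non-empty (the NR == FNR idiom cannot tell the two files apart otherwise).

```shell
# Hypothetical helper: $1 is file1 (index, source IP, source port, dest IP, dest port),
# $2 is file2 (source IP in the first column).
merge_consuming() {
    awk '
        NR == FNR {
            # First file: append "index srcport dstport" to the queue for this IP.
            queue[$2, ++tail[$2]] = $1 " " $3 " " $5
            next
        }
        {
            # Second file: take the oldest unused entry for this IP, or stars if none.
            aux = "* * *"
            if (head[$1] < tail[$1]) {
                aux = queue[$1, ++head[$1]]
                delete queue[$1, head[$1]]
            }
            print $0 " --> " aux
        }
    ' "$1" "$2"
}

# Usage (file names as in the question):
#   merge_consuming file1 file2 > merged
```

The head and tail counters per IP play the role of the Perl shift: every file1 entry is handed out at most once, so repeated IP addresses do not multiply the way they do with join.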