INTERVIEW QUESTION: Perl Log File

There is a log file where each line contains fields separated by spaces. One of the fields is the IP address of the source node. We want to find the IP addresses that have the most log entries. For example, find the top 10 IP addresses with the most log entries.

This is a Perl interview question. The interviewer wants to know how the candidate would proceed.

P.S.: This question was put to a friend of mine.


My response, Mr. or Ms. Interviewer, would be based on the answers to several questions. The first set of those questions follows. Of course, the answers to these may raise additional questions.

  1. You say "one of the fields." Do we know which field? Is it always the same or does it vary?

  2. Will the logs have only IPv4, only IPv6, or a mixture of both? Is address mapping between IPv4 and IPv6 a concern in counting, or can the mappings be treated as unique source nodes?

  3. How big is the logfile? How much memory is available to solve the problem?

  4. Are CPAN modules available for use, or is the solution limited to only core modules or some other "approved" modules list?
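
If the answer to the first question turns out to be "the position varies", one possible approach is to pick the address out by its shape rather than by its column. The following is only a sketch under that assumption: `extract_ipv4` is a hypothetical helper name, it uses core Perl only, and it handles IPv4 dotted quads only (IPv6 would need separate treatment).

```perl
use strict;
use warnings;

# Hypothetical helper: return the first whitespace-separated token that
# looks like a dotted-quad IPv4 address, or undef if the line has none.
sub extract_ipv4 {
    my ($line) = @_;
    for my $token (split ' ', $line) {
        next unless $token =~ /\A(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\z/;
        # Reject tokens such as 999.1.1.1 whose octets are out of range.
        next if grep { $_ > 255 } $1, $2, $3, $4;
        return $token;
    }
    return;
}
```

Counting then becomes a matter of calling the helper on each line and incrementing a hash entry whenever it returns a defined value.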


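Assume that the IP addresses appear in field N, counting from zero (so N => 3 means the fourth field):

use strict;
use warnings;
use constant N => 3;    # zero-based index of the IP address field

my %counts;
while (<>)
{
    # split ' ' skips leading whitespace, so indented lines do not
    # shift the field positions the way split /\s+/ would.
    my(@fields) = split ' ';
    next unless defined $fields[N];    # skip short or blank lines
    $counts{$fields[N]}++;
}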

That much gives you a hash of IP addresses and the corresponding counts.

my %iplist;
foreach my $address (keys %counts)
{
     my $count = $counts{$address};
     push @{$iplist{$count}}, $address;
}

That much gives you a hash of counts, and associated with each count, the list of IP addresses that had that count.

use constant Wanted => 10;

my $printed = 0;
foreach my $count (sort { $b <=> $a } keys %iplist)
{
    print "$count: @{$iplist{$count}}\n";
    $printed += scalar(@{$iplist{$count}});
    last if $printed >= Wanted;
}

That sorts the counts into reverse (descending) order, and prints out the count and the list of IP addresses that appeared that many times. It also counts the number of addresses printed and stops the loop when that meets or exceeds the number required.
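
To see the "meets or exceeds" behaviour concretely, the three steps above can be packaged into one sub and run on a few lines of data. This is a sketch, not part of the answer itself: `top_by_count` is a hypothetical name, the sample lines are made up, and the field index is passed in rather than taken from a constant.

```perl
use strict;
use warnings;

# Hypothetical helper: given log lines, a zero-based field index and the
# number of addresses wanted, return a list of [count, [addresses]] pairs
# in descending count order, stopping once enough addresses are covered.
sub top_by_count {
    my ($lines, $index, $wanted) = @_;

    my %counts;
    for my $line (@$lines) {
        my @fields = split ' ', $line;
        next unless defined $fields[$index];
        $counts{ $fields[$index] }++;
    }

    # Invert the hash: count => [addresses with that count].
    my %by_count;
    push @{ $by_count{ $counts{$_} } }, $_ for keys %counts;

    my @result;
    my $covered = 0;
    for my $count (sort { $b <=> $a } keys %by_count) {
        my @addresses = sort @{ $by_count{$count} };
        push @result, [ $count, \@addresses ];
        $covered += @addresses;
        last if $covered >= $wanted;
    }
    return @result;
}

# Made-up sample data: 10.0.0.1 appears three times, two others once each.
my @sample = (
    "GET 10.0.0.1", "GET 10.0.0.1", "GET 10.0.0.1",
    "GET 10.0.0.2", "GET 10.0.0.3",
);
for my $group (top_by_count(\@sample, 1, 2)) {
    print "$group->[0]: @{ $group->[1] }\n";
}
```

With two addresses requested, the sample prints two groups covering three addresses, because the two addresses tied at one entry each are reported together rather than cut arbitrarily.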


Ask if this is intended for one-off usage.

  • If no, Jonathan's answer (the full script above) is good.

  • If yes, use a one-liner.

Assuming that the first field contains the IPs:

perl -ane '$count{$F[0]}++ } END { print $_, "\n" for (sort { $count{$b} <=> $count{$a} } keys %count)[0..9]'

A good question that tests the candidate's knowledge of data structures, string-array manipulation, sorting and use of array slices.
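
As an illustration of the sorting and array-slice part, here is a minimal sketch of a sub that orders hash keys by their counts and slices off the first few. The name `top_n` and the sample counts are assumptions made for the example; the tie-break on the address is a design choice to keep the output stable, not something the question requires.

```perl
use strict;
use warnings;

# Hypothetical helper: return up to $n keys of %$counts, ordered by
# descending count, with ties broken by the key itself for stable output.
sub top_n {
    my ($counts, $n) = @_;
    my @sorted = sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b }
                 keys %$counts;
    # Clamp the slice so a short list does not yield undef entries.
    my $last = $n < @sorted ? $n - 1 : $#sorted;
    return @sorted[0 .. $last];
}

my %count = ('10.0.0.1' => 5, '10.0.0.2' => 9, '10.0.0.3' => 1);  # made-up counts
print "$_ $count{$_}\n" for top_n(\%count, 2);
```

Note that the comparison looks up the counts in the hash; comparing the keys themselves would order the addresses, not their frequencies.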
