how to get unique values set from a repeating values list
I need to parse a large log file 开发者_JAVA百科(flat file), which contains two column of values (column-A , column-B).
Values in both columns are repeating. I need to find for each unique value in column-A , I need to find a set of column-B values.
Is this can be done using unix shell command or need to write any perl or python script? What are the ways this can be done?
Example:
xxxA 2
xxxA 1
xxxB 2
XXXC 3
XXXA 3
xxxD 4
output:
xxxA - 2,1,3
xxxB - 2
xxxC - 3
xxxD - 4
Perl 'one-liner' intended/expanded out so that everything fits in the window:
$ perl -F -lane '
$hash{ $F[0] }{ $F[1] }++;
} END {
for my $columnA ( keys %hash ) {
print $columnA, " - ", join( ",", keys %$hash{$columnA} ), "\n";
}
'
Explanation will follow if I see a concerted attempt on the part of the original poster.
I would use Python dictionaries where the dictionary keys are column A values and the dictionary values are Python's built-in Set type holding column B values
def parse_the_file():
lower = str.lower
split = str.split
with open('f.txt') as f:
d = {}
lines = f.read().split('\n')
for A,B in [split(l) for l in lines]:
try:
d[lower(A)].add(B)
except KeyError:
d[lower(A)] = set(B)
for a in d:
print "%s - %s" % (a,",".join(list(d[a])))
if __name__ == "__main__":
parse_the_file()
The advantage of using a dictionary is that you'll have a single dictionary key per column A value. The advantage of using a set is that you'll have a unique set of column B values.
Efficiency notes:
- The use of try-catch is more efficient than using an if\else statement to check for initial cases.
- The evaluation and assignment of the str functions outside of the loop is more efficient than simply using them inside the loop.
- Depending on the proportion of new A values vs. reappearance of A values throughout the file, you may consider using
a = lower(A)
before the try catch statement - I used a function, as accessing local variables is more efficient in Python than accessing global variables
- Some of these performance tips are from here
Testing the code above on your input example yields:
xxxd - 4
xxxa - 1,3,2
xxxb - 2
xxxc - 3
You can use this simple multimap:
class MultiMap(object):
values = {}
def __getitem__(self, index):
return self.values[index]
def __setitem__(self, index, value):
if not self.values.has_key(index):
self.values[index] = []
self.values[index].append(value)
def __repr__(self):
return repr(self.values)
See it in action: http://codepad.org/xOOrlbnf
Simple Perl version:
#!/usr/bin/perl
use strict;
use warnings;
my (%v, @row);
foreach (<DATA>) {
chomp;
$_ = lc($_);
@row = split(/\s+/, $_);
push( @{ $v{$row[0]} }, $row[1]);
}
foreach (sort keys %v) {
print "$_ - ", join( ", ", @{ $v{$_} } ), "\n";
}
__DATA__
xxxA 2
xxxA 1
xxxB 2
XXXC 3
XXXA 3
xxxD 4
Did not focus on variable names. From example i see they are not case sensitive.
f = """xxxA 2
xxxA 1
xxxB 2
XXXC 3
XXXA 3
xxxD 4"""
d = {}
for line in f.split("\n"):
key, val = line.lower().split()
try:
d[key].append(val)
except KeyError:
d[key] = [val]
print d
Python
while() {
($key, $value) = split / /, $_;
$hash{lc($key)} = 1;
push(@array, "$key$value");
}
foreach $key (sort keys %hash) {
@arr = (grep /$key/i, @array);
chomp(@arr);
$val = join (", ", @arr);
$val =~ s#$key##gi;
print "$key\t$val\n";
}
Using Perl oneliner:
perl -lane'$F[0]=~s/.../lc$&/e;exists$s{$F[0]}and$s{$F[0]}.=",$F[1]"or push@v,$F[0]and$s{$F[0]}=$F[1]}{print"$_ $s{$_}"for@v'
You can remove $F[0]=~s/.../lc$&/e;
if your key is case sensitive (which is not true in your test data) or use $F[0]=lc$F[0];
or $F[0]=uc$F[0];
if you can unify your key to lower or upper case.
精彩评论