filter text file using script

I have a very large tab-delimited text file. Some IDs in the first column appear on several lines, while others appear only once. For example:

a   foo
a   bar
a   foo2
b   bar2
c   bar2
c   foo3
d   bar3
...

I also have a second file containing a list of IDs, which is only a subset of the IDs in the first file. For example:

a
b
d
...

I want to get the corresponding values for the IDs in that list; each ID appears in the list only once. How can I do this with a Perl script, Python, or basic bash commands? Thanks!


In perl:

use strict;
use warnings;
use autodie;

open my $id_list, '<', 'id_list_file';
my %ids = map { chomp; $_ => 1 } readline $id_list;
close $id_list;

open my $text_file, '<', 'text_file';
while ( my $line = readline $text_file ) {
    chomp $line;
    my ($id, $value) = split /\t/, $line, 2;
    if ( $ids{ $id } ) {
        print "got value $value for id $id\n";
    }
}


Quickie untested Python:

ids = set()
with open('id-list.txt') as f:
    for line in f:
        ids.add(line.strip())
with open('data.txt') as f:
    for line in f:
        parts = line.strip().split('\t', 1)
        if parts[0] in ids:
            print(line, end='')


You can also use the following code. Note that it builds a full list and a full dict in memory; if your files are too big, it should be rewritten to act on each matching item as it is read instead:

ids = [row.strip() for row in open('c:\\ids.txt','r') if row.strip()]
data = dict(row.strip().split() for row in open('c:\\data.txt','r') if row.strip())
for id in ids:
    print(data.get(id))

Sorry, I missed that there could be more than one value per ID:

output = {}
for row in open('c:\\tst.txt','r'):
    if not row.strip():
        continue
    id, datavalue = row.strip().split()
    if id not in output:
        output[id] = []
    output[id].append(datavalue)

Or using defaultdict:

from collections import defaultdict

output = defaultdict(list)
for row in open('c:\\tst.txt','r'):
    if not row.strip():
        continue
    id, datavalue = row.strip().split()
    output[id].append(datavalue)
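
The snippets above group every row of the data file but stop before the lookup against the ID list. Here is a rough sketch of one way to finish the job while keeping memory use down: load the ID list into a set first and only store values for wanted IDs. The function name collect_wanted and the sample rows are made up for illustration, and rows that do not split into exactly two fields are skipped, which is an assumption about how malformed lines should be handled.

```python
from collections import defaultdict


def collect_wanted(id_lines, data_lines):
    """Group values by ID, keeping only IDs that appear in id_lines."""
    # Hypothetical helper: both arguments are iterables of text lines,
    # so open file objects work as well as lists.
    wanted = {line.strip() for line in id_lines if line.strip()}
    grouped = defaultdict(list)
    for row in data_lines:
        fields = row.split()
        # Assumption: skip blank or malformed rows instead of raising.
        if len(fields) != 2:
            continue
        key, value = fields
        if key in wanted:
            grouped[key].append(value)
    return grouped


# Sample rows taken from the question.
sample_ids = ["a\n", "b\n", "d\n"]
sample_data = ["a\tfoo\n", "a\tbar\n", "a\tfoo2\n", "b\tbar2\n",
               "c\tbar2\n", "c\tfoo3\n", "d\tbar3\n"]

result = collect_wanted(sample_ids, sample_data)
for key in sorted(result):
    print(key, ", ".join(result[key]))
```

With real files you would pass the two open file objects instead of the sample lists, so the large file is streamed line by line and never held in memory as a whole.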


A quick look at your data file:

a foo
a bar
a foo2
b bar2
c bar2
c foo3
d bar3

It appears that a can be foo, bar, or foo2, so the first column is not unique. However, your other list looks like this:

a
b
d

This seems to say that the first column (which isn't unique) holds the keys. What exactly should I return when I read a in that list? Do I return both foo and bar, or was this a mistake?

I need to know this before we can give you an answer.


Addendum

I need to return both of them. Sorry about the confusion

Okay, in Perl, the easiest way to store keyed information is to use a hash. The problem with a hash is that you only have a single value for each key. In your file, that's not the case: a key can have several separate values. There are two ways of handling this:

Method #1: Append the value to the previous value

open (ID_FILE, "id_file.txt")
    or die qq(Can't open "id_file.txt" for reading\n);
my %idHash;
while (my $line = <ID_FILE>) {
    chomp $line;
    my ($key, $value) = split(/\s+/, $line);
    if (exists $idHash{$key}) {
        $idHash{$key} .= " " . $value;
    }
    else {
        $idHash{$key} = $value;
    }
}
close ID_FILE;

At the end of the loop, $idHash{'a'} is "foo bar foo2". Thus, in your second loop:

open (ID_LIST, 'list_of_ids.txt')
    or die qq(Can't open "list_of_ids.txt" for reading\n);
while (my $line = <ID_LIST>) {
    chomp $line;
    print qq(Values for key "$line" are "$idHash{$line}"\n);
}

Method #2: Store a Hash of Lists

This is dangerous territory. It adds confusion and I usually recommend that you think about object oriented programming when you get into list of hashes or hashes of lists, etc.

open (ID_FILE, "id_file.txt")
    or die qq(Can't open "id_file.txt" for reading\n);
my %idHash;
while (my $line = <ID_FILE>) {
    chomp $line;
    my ($key, $value) = split(/\s+/, $line);
    push(@{$idHash{$key}}, $value);    # index by the key, not the whole line
}
close ID_FILE;

The @{ ... } wrapped around the hash element treats the hash value as a reference to an array. You can split it into separate steps like this, if it's clearer:

open (ID_FILE, "id_file.txt")
    or die qq(Can't open "id_file.txt" for reading\n);
my %idHash;
while (my $line = <ID_FILE>) {
    chomp $line;
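    my ($key, $value) = split(/\s+/, $line);
    $idHash{$key} = [] unless exists $idHash{$key};    # start an empty list for a new key
    my $listRef = $idHash{$key};                       # reference to the list stored in the hash
    push(@{$listRef}, $value);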
}
close ID_FILE;

Now, when you do your lookup, you'll have to go through the list:

open (ID_LIST, 'list_of_ids.txt')
    or die qq(Can't open "list_of_ids.txt" for reading\n);
while (my $line = <ID_LIST>) {
    chomp $line;
    my @tempList = @{ $idHash{$line} || [] };    # empty list if the ID has no values
    print "The values for key $line are " . join(", ", @tempList) . "\n";
    # Or, without the temporary list: join(", ", @{ $idHash{$line} || [] })
}

Or, instead of doing a join, you could loop through the list items for each key:

open (ID_LIST, 'list_of_ids.txt')
    or die qq(Can't open "list_of_ids.txt" for reading\n);
while (my $line = <ID_LIST>) {
    chomp $line;
    foreach my $value (@{ $idHash{$line} || [] }) {
        print qq(Key: "$line"\tvalue: "$value"\n);
    }
}

By the way, I am sorry, but I haven't tested the code due to a lack of time. Therefore, I can guarantee there are syntax errors and bugs all around. However, it does give you the general idea how you can use a Perl Hash to quickly pull up a value via a key, and how you can store multiple values for a single key.

It looks like the original Python answer suffered the same issue. However, the revised one looks correct.


You can create a hash by reading the first file: use each ID as a key, and an array of its corresponding values as the value. Then, while reading the second file, just look up each ID in the hash you built from the first file.
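
For what it's worth, here is roughly how that description could look in Python, with a dict of lists standing in for the hash of arrays. This is only a sketch: the function name lookup_values, the file names and the demo rows are all made up, and IDs that never occur in the data file are reported with an empty list, which is my assumption about the desired behaviour.

```python
import os
import tempfile


def lookup_values(data_path, id_path):
    """Yield (id, values) for each ID in id_path, in ID-file order."""
    # Step 1: read the data file into a dict mapping id -> list of values.
    table = {}
    with open(data_path) as data_file:
        for row in data_file:
            fields = row.rstrip("\n").split("\t", 1)
            if len(fields) == 2:
                table.setdefault(fields[0], []).append(fields[1])
    # Step 2: read the ID list and look each ID up in the dict.
    with open(id_path) as id_file:
        for line in id_file:
            key = line.strip()
            if key:
                # Assumption: IDs missing from the data file give an empty list.
                yield key, table.get(key, [])


# Demo with throwaway files so the sketch can be run as-is.
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "data.txt")
    id_path = os.path.join(tmp, "ids.txt")
    with open(data_path, "w") as f:
        f.write("a\tfoo\na\tbar\nb\tbar2\nc\tfoo3\n")
    with open(id_path, "w") as f:
        f.write("a\nb\nz\n")
    pairs = list(lookup_values(data_path, id_path))

for key, values in pairs:
    print(key, values)
```

Note that this holds the whole data file in memory, so for a really large file it may be better to read the ID list first and only keep the rows you need.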
