Convert Perl script to Python: dedupe 2 files based on hash keys
I am new to Python and would like to know if someone would kindly convert an example of a fairly simple Perl script to Python.
The script takes 2 files and outputs only unique lines from the second file by comparing hash keys. It also outputs duplicate lines to a file. I have found that this method of deduping is extremely fast with Perl, and would like to see how Python compares.
#! /usr/bin/perl
## Compare file1 and file2 and output only the unique lines from file2.

## Open file1.txt and store its lines in a hash.
open my $file1, '<', "file1.txt" or die $!;
while ( <$file1> ) {
    my $name = $_;
    $file1hash{$name} = $_;
}

## Open file2.txt and store its lines in a hash.
open my $file2, '<', "file2.txt" or die $!;
while ( <$file2> ) {
    $name = $_;
    $file2hash{$name} = $_;
}

open my $dfh, '>', "duplicate.txt";

## Compare the keys and remove the duplicates from the file2 hash.
foreach ( keys %file1hash ) {
    if ( exists $file2hash{$_} ) {
        print $dfh $file2hash{$_};
        delete $file2hash{$_};
    }
}

open my $ofh, '>', "file2_clean.txt";
print $ofh values(%file2hash);
I have tested both the Perl and Python scripts on two files of over 1 million lines each, and total time was less than 6 seconds. For the business purpose this served, the performance is outstanding!
I modified the script Kriss offered and I am very happy with both results: 1) the performance of the script, and 2) the ease with which I modified the script to be more flexible:
#!/usr/bin/env python
import os

filename1 = raw_input("What is the first file name to compare? ")
filename2 = raw_input("What is the second file name to compare? ")

file1set = set([line for line in file(filename1)])
file2set = set([line for line in file(filename2)])

for name, results in [
        (os.path.abspath(os.getcwd()) + "/duplicate.txt", file1set.intersection(file2set)),
        (os.path.abspath(os.getcwd()) + "/" + filename2 + "_clean.txt", file2set.difference(file1set))]:
    with file(name, 'w') as fh:
        for line in results:
            fh.write(line)
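For readers on Python 3, here is a rough sketch of the same script (my adaptation, not part of the original answer): input replaces raw_input, the removed file() builtin becomes open(), and the file handles are closed via with.

#!/usr/bin/env python3
# Sketch of the script above for Python 3; assumes both files fit in memory.
import os

filename1 = input("What is the first file name to compare? ")
filename2 = input("What is the second file name to compare? ")

with open(filename1) as f1:
    file1set = set(f1)   # each element keeps its trailing newline
with open(filename2) as f2:
    file2set = set(f2)

for name, results in [
        (os.path.join(os.getcwd(), "duplicate.txt"), file1set & file2set),
        (os.path.join(os.getcwd(), filename2 + "_clean.txt"), file2set - file1set)]:
    with open(name, "w") as fh:
        fh.writelines(results)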
You can use sets in Python if you don't care about order:
file1 = set(open("file1").readlines())
file2 = set(open("file2").readlines())

intersection = file1 & file2      # common lines
non_intersection = file2 - file1  # uncommon lines (in file2 but not in file1)

for items in intersection:
    print items
for nitems in non_intersection:
    print nitems
Other approaches include the difflib and filecmp modules from the standard library.
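For example, if a line-by-line diff is wanted rather than set arithmetic, difflib can produce one; a minimal sketch (Python 3, file names assumed to match the ones above):

import difflib

# Print a unified diff; lines prefixed with '+' appear only in file2.txt.
with open("file1.txt") as f1, open("file2.txt") as f2:
    diff = difflib.unified_diff(f1.readlines(), f2.readlines(),
                                fromfile="file1.txt", tofile="file2.txt")
    for line in diff:
        print(line, end="")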
Another way, using only list membership tests:
# lines in file2 common with file1
data1 = map(str.rstrip, open("file1").readlines())
for line in open("file2"):
    line = line.rstrip()
    if line in data1:
        print line

# lines in file2 not in file1, use "not in"
data1 = map(str.rstrip, open("file1").readlines())
for line in open("file2"):
    line = line.rstrip()
    if line not in data1:
        print line
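Note that the list membership test above rescans data1 for every line of file2, which gets slow for large files; a set makes each lookup roughly constant time. A sketch of that variant (Python 3):

# Same idea, but data1 is a set so each "in" test is O(1) on average.
with open("file1") as f1:
    data1 = set(line.rstrip() for line in f1)

with open("file2") as f2:
    for line in f2:
        if line.rstrip() in data1:      # use "not in" for the uncommon lines
            print(line.rstrip())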
Yet another variant (merely syntactic changes from the other proposals; there is more than one way to do it in Python as well):
file1set = set([line for line in file("file1.txt")])
file2set = set([line for line in file("file2.txt")])

for name, results in [
        ("duplicate.txt", file1set.intersection(file2set)),
        ("file2_clean.txt", file2set.difference(file1set))]:
    with file(name, 'w') as fh:
        for line in results:
            fh.write(line)
Side note: we should also contribute another Perl version, since the one proposed is not very perlish... below is the Perl equivalent of my Python version. It does not look much like the initial one. What I want to point out is that the issue in the proposed answers is as much algorithmic and language-independent as it is Perl vs. Python.
use strict;

open my $file1, '<', "file1.txt" or die $!;
my %file1hash = map { $_ => 1 } <$file1>;

open my $file2, '<', "file2.txt" or die $!;
my %file2hash = map { $_ => 1 } <$file2>;

for (["duplicate.txt",   [grep  $file1hash{$_}, keys(%file2hash)]],
     ["file2_clean.txt", [grep !$file1hash{$_}, keys(%file2hash)]]) {
    my ($name, $results) = @$_;
    open my $fh, '>', $name or die $!;
    print $fh @$results;
}
Here's a slightly different solution that's a little more memory-friendly, should the files be very large. It only creates a set for the original file (as there doesn't seem to be a need to have all of file2 in memory at once):
with open("file1.txt", "r") as file1:
file1set = set(line.rstrip() for line in file1)
with open("file2.txt", "r") as file2:
with open("duplicate.txt", "w") as dfh:
with open("file2_clean.txt", "w") as ofh:
for line in file2:
if line.rstrip() in file1set:
dfh.write(line) # duplicate line
else:
ofh.write(line) # not duplicate
Note: if you want to include trailing whitespace and the end-of-line characters in the comparisons, you can replace the second line.rstrip() with just line, and simplify the second line of the code to file1set = set(file1).
Also, as of Python 3.1, the with statement allows multiple items, so the three with statements could be combined into one.
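For illustration, a sketch of what that combined form could look like (requires Python 2.7 or 3.1+):

with open("file1.txt") as file1:
    file1set = set(line.rstrip() for line in file1)

with open("file2.txt") as file2, \
     open("duplicate.txt", "w") as dfh, \
     open("file2_clean.txt", "w") as ofh:
    for line in file2:
        if line.rstrip() in file1set:
            dfh.write(line)  # duplicate line
        else:
            ofh.write(line)  # unique to file2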