
Most efficient way to compute uniqueness (as %) of a file compared to several other, large files

I have about 30 files of 500 MB each, one word per line. I have a script that does this, in pseudo-bash:

for i in *; do
    : > everythingButI
    for j in $(ls | grep -vx "$i"); do
        cat "$j" >> everythingButI
        sort everythingButI | uniq > tmp
        mv tmp everythingButI
    done
    comm -2 -3 "$i" everythingButI > uniqueInI

    percentUnique=$(echo "scale=2; $(wc -l < uniqueInI) * 100 / $(wc -l < "$i")" | bc)
    echo "$i is $percentUnique% unique"
done

It computes the 'uniqueness' of each file (the files are already sorted and unique within each file).

So if I had files:

file1    file2   file3
a        b       1
c        c       c
d        e       e
f        g
         h

file1 would be 75% unique (because 1/4 of its lines are found in another file), file2 would be 60% unique, and file3 would be 33.33% unique. But make it 30 files at 500 MB a pop, and it takes quite a while to run.

I'd like to write a Python script that does this much, much faster, but I'm wondering what the fastest algorithm for this would actually be. (I also only have 2 GB of RAM on the PC.)
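To make the definition concrete, the toy example above can be checked with plain sets. This is only a sanity check of the definition; at 30 files of 500 MB each, sets like this would not fit in 2 GB of RAM, which is exactly the problem:

# Sanity check of the "uniqueness" definition on the toy example above.
# Far too memory-hungry for the real 30 x 500 MB inputs.
files = {
    "file1": {"a", "c", "d", "f"},
    "file2": {"b", "c", "e", "g", "h"},
    "file3": {"1", "c", "e"},
}

for name, words in files.items():
    everything_else = set().union(*(w for n, w in files.items() if n != name))
    unique = words - everything_else
    print("{0} is {1:.2f}% unique".format(name, 100.0 * len(unique) / len(words)))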

Anyone have opinions about algorithms, or know of a faster way to do this?


EDIT: Since each of the inputs is already internally sorted and deduplicated, what you actually need is an n-way merge, and the hash-building exercise in the previous version of this post is rather pointless.

The n-way merge is kind of intricate if you're not careful. Basically, it works something like this:

  • Read in the first line of each file, and initialize its unique lines counter and total lines counter to 0.
  • Do this loop body:
    • Find the least value among the lines read.
    • If exactly one file's current line equals that least value, increment that file's unique lines counter.
    • For each file whose current line equals the least value, increment that file's total lines counter and read in its next line. If you hit end of file, you're done with that file: remove it from further consideration.
  • Loop until you have no files left under consideration. At that point, you should have an accurate unique lines counter and total lines counter for each file. Percentages are then a simple matter of multiplication and division.

I've left out the use of a priority queue that's in the full form of the merge algorithm; that only becomes significant if you have a large enough number of input files.
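Here is a minimal sketch of that merge loop in Python, using heapq as the priority queue. The file pattern and the assumption that every input is non-empty, sorted, and deduplicated are mine; treat it as an illustration rather than a finished script.

import glob
import heapq

# Sketch of the n-way merge described above, with heapq as the priority queue.
# Assumes every input is already sorted, deduplicated, non-empty, and has one
# word per line. The *.words pattern is just a placeholder.
paths = glob.glob("*.words")

def next_word(f):
    # Return the next word, or None at end of file.
    line = f.readline()
    return line.rstrip("\n") if line else None

handles = {p: open(p) for p in paths}
unique_counts = dict.fromkeys(paths, 0)
total_counts = dict.fromkeys(paths, 0)

# Heap of (current word, path), seeded with each file's first word.
heap = []
for p, f in handles.items():
    word = next_word(f)
    if word is not None:
        heapq.heappush(heap, (word, p))

while heap:
    # Pop every entry holding the current least word.
    least, p = heapq.heappop(heap)
    holders = [p]
    while heap and heap[0][0] == least:
        holders.append(heapq.heappop(heap)[1])

    if len(holders) == 1:
        # The word appears in exactly one file: it is unique to that file.
        unique_counts[holders[0]] += 1

    # Count the consumed word and advance every file that held it.
    for p in holders:
        total_counts[p] += 1
        word = next_word(handles[p])
        if word is not None:
            heapq.heappush(heap, (word, p))
        else:
            handles[p].close()

for p in paths:
    pct = 100.0 * unique_counts[p] / total_counts[p]
    print("{0} is {1:.2f}% unique".format(p, pct))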


Use a modified N/K-way merge sort algorithm that processes the entire set of files to compare in a single pass. Only counting and advancing need to be done; the merging portion itself can be skipped.

This exploits the fact that the inputs are already sorted. If they aren't, sort them and store them on disk sorted :-) Let the operating system's file buffers and read-ahead be your friend.

Happy coding.

With a little bit of cleverness I believe this could also be extended to tell the difference in percent between all the files in a single pass. Just need to keep track of the "trailing" input and counters for each set of relationships (m-m vs. 1-m).

Spoiler code that seems to work for me on the data provided in the question...

Of course, I haven't tested this on really large files or, really, at all. "It ran." The definition of "unique" above was simpler than what I initially had in mind, so some of the previous answer doesn't apply much. This code is far from perfect. Use at your own risk (of both your computer blowing up and of boredom/disgust at not cranking out something better!). Runs on Python 3.1.

import os
import itertools

# see: http://docs.python.org/dev/library/itertools.html#itertools-recipes
# modified for 3.x and eager lists
def partition(pred, iterable):
    t1, t2 = itertools.tee(iterable)
    return list(itertools.filterfalse(pred, t1)), list(filter(pred, t2))

# all files here
base = "C:/code/temp"
names = os.listdir(base)

for n in names:
    print("analyzing {0}".format(n))

# {name => file}
# files are removed from here as they are exhausted
files = {n: open(os.path.join(base, n)) for n in names}

# {name => number of shared items in any other list}
shared_counts = {}
# {name => total items this list}
total_counts = {}
for n in names:
    shared_counts[n] = 0
    total_counts[n] = 0

# [name, currentvalue] -- remains mostly sorted and is
# always a very small n so sorting should be lickity-split
vals = []
for n, f in files.items():
    # assumes no files are empty
    vals.append([n, f.readline().strip()])
    total_counts[n] += 1

while vals:
    vals = sorted(vals, key=lambda x:x[1])
    # if two low values are the same then the value is not-unique
    # adjust the logic based on definition of unique, etc.
    low_value = vals[0][1]
    lows, highs = partition(lambda x: x[1] > low_value, vals)
    if len(lows) > 1:
        for lname, _ in lows:
            shared_counts[lname] += 1
    # all lowest items discarded and refetched
    vals = highs
    for name, _ in lows:
        f = files[name]
        val = f.readline()
        if val != "":
            vals.append([name, val.strip()])
            total_counts[name] += 1
        else:
            # close files as we go. eventually we'll
            # dry-up the 'vals' and quit this mess :p
            f.close()
            del files[name]

# and what we want...
for n in names:
    unique = 1 - (shared_counts[n]/total_counts[n])
    print("{0} is {1:.2%} unique!".format(n, unique))

Retrospectively I can already see the flaws! :-) The sorting of vals is there for a legacy reason that no longer really applies. In all practicality, just a min would work fine here (and would likely be better for any relatively small set of files).
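For example, the sort-and-partition step could be reduced to one scan per iteration, something like this (untested sketch, reusing the vals list from the code above):

# Replace the per-iteration sort/partition with a single scan of vals.
low_value = min(v for _, v in vals)
lows = [pair for pair in vals if pair[1] == low_value]
highs = [pair for pair in vals if pair[1] != low_value]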


Here's some really ugly pseudo-code that does the n-way merge:

#!/usr/bin/python

import os, commands   # Python 2 only; the 'commands' module was removed in Python 3

def findmin(linesread):
    # Return the smallest non-empty line among the current lines, plus the
    # indexes of every file whose current line equals it.
    min = ""
    indexes = []
    for i in range(len(linesread)):
        if linesread[i] != "":
            min = linesread[i]
            indexes.append(i)
            break
    for i in range(indexes[0]+1, len(linesread)):
        if linesread[i] < min and linesread[i] != "":
            min = linesread[i]
            indexes = [i]
        elif linesread[i] == min:
            indexes.append(i)
    return min, indexes

def genUniqueness(path):
    wordlists = []
    linecount = []

    log = open(path + ".fastuniqueness", 'w')

    for root, dirs, files in os.walk(path):
        if root.find(".git") > -1 or root == ".":
            continue
        if root.find("onlyuppercase") > -1:
            continue

        for i in files:
            if i.find('lvl') >= 0 or i.find('trimmed') >= 0:
                wordlists.append(root + "/" + i)
                linecount.append(int(commands.getoutput("cat " + root + "/" + i + " | wc -l")))
                print root + "/" + i


    whandles = []
    linesread = []
    numlines = []
    uniquelines = []
    for w in wordlists:
        whandles.append(open(w, 'r'))
        linesread.append("")
        numlines.append(0)
        uniquelines.append(0)

    count = range(len(whandles))
    for i in count:
        linesread[i] = whandles[i].readline().strip()
        numlines[i] += 1

    while True:
        (min, indexes) = findmin(linesread)
        if len(indexes) == 1:
            uniquelines[indexes[0]] += 1
        for i in indexes:
            linesread[i] = whandles[i].readline().strip()
            numlines[i] += 1
            if linesread[i] == "":
                # End of this file: undo the count for the empty read and
                # close the handle; findmin ignores empty linesread entries.
                numlines[i] -= 1
                whandles[i].close()
                whandles[i] = None
                print "Expiring ", wordlists[i]
        if not any(linesread):
            break


    for i in count:
        log.write(wordlists[i] + "," + str(uniquelines[i]) + "," + str(numlines[i]) + "\n")
        print wordlists[i], uniquelines[i], numlines[i]