Most efficient way to compute uniqueness (as %) of a file compared to several other, large files
I have about thirty 500 MB files, one word per line. I have a script that does this, in pseudo-bash:
for i in *; do
    echo "" > everythingButI
    for j in *-except-$i; do
        cat $j >> everythingButI
        sort everythingButI | uniq > tmp
        mv tmp everythingButI
    done
    comm $i everythingButI -2 -3 > uniqueInI
    percentUnique=$(wc -l uniqueInI) / $(wc -l $i) * 100
    echo "$i is $percentUnique% Unique"
done
It computes the 'uniqueness' of each file (the files are already sorted and unique within each file).
So if I had files:
file1  file2  file3
a      b      1
c      c      c
d      e      e
f      g
       h
file1 would be 75% unique (because 1/4 of its lines are found in another file), file2 would be 60% unique, and file3 would be 33.33% unique. But make it 30 files at 500 MB a pop, and it takes quite a while to run.
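To make that definition concrete, here is a quick set-based check of those numbers in Python (just an illustration of the definition on the toy data above, not the fast approach):
import itertools
file1 = {"a", "c", "d", "f"}
file2 = {"b", "c", "e", "g", "h"}
file3 = {"1", "c", "e"}
for name, lines, others in [("file1", file1, file2 | file3),
                            ("file2", file2, file1 | file3),
                            ("file3", file3, file1 | file2)]:
    # a line is unique if it appears in no other file
    print("{0} is {1:.2%} unique".format(name, len(lines - others) / len(lines)))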
I'd like to write a Python script that does this much, much faster, but I'm wondering what the fastest algorithm for this would actually be. (I also only have 2 GB of RAM on this PC.)
Anyone have opinions about algorithms, or know of a faster way to do this?
EDIT: Since each of the inputs is already internally sorted and deduplicated, you actually need an n-way merge for this, and the hash-building exercise in the previous version of this post is rather pointless.
The n-way merge is kind of intricate if you're not careful. Basically, it works something like this:
- Read in the first line of each file, and initialize that file's unique-lines counter and total-lines counter to 0.
- Do this loop body:
  - Find the least value among the lines currently read.
  - If that least value came from exactly one file, increment that file's unique-lines counter.
  - For each file whose current line equals the least value, count that line toward the file's total-lines counter and read in its next line. If you hit end of file, you're done with that file: remove it from further consideration.
- Loop until you have no files left under consideration. At that point you should have an accurate unique-lines counter and total-lines counter for each file; percentages are then a simple matter of multiplication and division.
I've left out the use of a priority queue that's in the full form of the merge algorithm; that only becomes significant if you have a large enough number of input files.
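As a rough sketch of the idea, here is how that merge might look in Python, using heapq as the priority queue anyway (the file names and the exact uniqueness rule are assumptions taken from the question, and this is untested on anything large):
import heapq

def uniqueness(paths):
    # N-way merge over files that are already sorted and deduplicated.
    # Returns {path: (unique_lines, total_lines)}.
    files = {p: open(p) for p in paths}
    unique = dict.fromkeys(paths, 0)
    total = dict.fromkeys(paths, 0)

    # Prime the heap with the first line of each non-empty file.
    heap = []
    for p, f in files.items():
        line = f.readline()
        if line:
            total[p] += 1
            heapq.heappush(heap, (line.rstrip("\n"), p))
        else:
            f.close()

    while heap:
        # Pop every entry holding the current least value.
        value, p = heapq.heappop(heap)
        holders = [p]
        while heap and heap[0][0] == value:
            holders.append(heapq.heappop(heap)[1])

        # A value that came from exactly one file is unique to that file.
        if len(holders) == 1:
            unique[holders[0]] += 1

        # Advance each file that supplied the least value.
        for p in holders:
            line = files[p].readline()
            if line:
                total[p] += 1
                heapq.heappush(heap, (line.rstrip("\n"), p))
            else:
                files[p].close()

    return {p: (unique[p], total[p]) for p in paths}

for path, (u, t) in uniqueness(["file1", "file2", "file3"]).items():
    print("{0} is {1:.2%} unique".format(path, u / t))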
Use a modified N/K-way sort algorithm that treats the entire set of files to compare in a single pass. Only counting and advancing need to be done; the merging portion itself can be skipped.
This utilizes the fact that the input is already sorted. If the files aren't already sorted, sort them and store them on disk sorted :-) Let the operating system's file buffers and read-ahead be your friends.
Happy coding.
With a little bit of cleverness I believe this could also be extended to tell the difference in percent between all the files in a single pass. Just need to keep track of the "trailing" input and counters for each set of relationships (m-m vs. 1-m).
Spoiler code that seems to work for me on the data provided in the question...
Of course, I haven't tested this on really large files or, really, at all. "It ran." The definition of "unique" above turned out to be simpler than I was initially thinking, so some of my previous answer doesn't apply much. This code is far from perfect. Use at your own risk (of both your computer blowing up and boredom/disgust at not cranking out something better!). It runs on Python 3.1.
import os
import itertools

# see: http://docs.python.org/dev/library/itertools.html#itertools-recipes
# modified for 3.x and eager lists
def partition(pred, iterable):
    t1, t2 = itertools.tee(iterable)
    return list(itertools.filterfalse(pred, t1)), list(filter(pred, t2))

# all files here
base = "C:/code/temp"
names = os.listdir(base)
for n in names:
    print("analyzing {0}".format(n))

# {name => file}
# files are removed from here as they are exhausted
files = dict([n, open(os.path.join(base, n))] for n in names)

# {name => number of shared items in any other list}
shared_counts = {}
# {name => total items this list}
total_counts = {}
for n in names:
    shared_counts[n] = 0
    total_counts[n] = 0

# [name, currentvalue] -- remains mostly sorted and is
# always a very small n so sorting should be lickity-split
vals = []
for n, f in files.items():
    # assumes no files are empty
    vals.append([n, str.strip(f.readline())])
    total_counts[n] += 1

while len(vals):
    vals = sorted(vals, key=lambda x: x[1])
    # if two low values are the same then the value is not-unique
    # adjust the logic based on definition of unique, etc.
    low_value = vals[0][1]
    lows, highs = partition(lambda x: x[1] > low_value, vals)
    if len(lows) > 1:
        for lname, _ in lows:
            shared_counts[lname] += 1
    # all lowest items discarded and refetched
    vals = highs
    for name, _ in lows:
        f = files[name]
        val = f.readline()
        if val != "":
            vals.append([name, str.strip(val)])
            total_counts[name] += 1
        else:
            # close files as we go. eventually we'll
            # dry-up the 'vals' and quit this mess :p
            f.close()
            del files[name]

# and what we want...
for n in names:
    unique = 1 - (shared_counts[n] / total_counts[n])
    print("{0} is {1:.2%} unique!".format(n, unique))
Retrospectively I can already see the flaws! :-) The sorting of vals is in there for a legacy reason that no longer really applies. In all practicality, just a min() would work fine here (and likely be better for any relatively small set of files).
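For what it's worth, that change would look something like this inside the while loop above (an untested sketch using the same variables; it replaces the sort-plus-partition step without changing the counting logic):
    # instead of re-sorting vals on every pass:
    low_value = min(vals, key=lambda x: x[1])[1]
    lows = [v for v in vals if v[1] == low_value]
    highs = [v for v in vals if v[1] != low_value]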
Here's some really ugly pseudo-code that does the n-way merge:
#!/usr/bin/python
import sys, os, commands

# Return the smallest non-empty line currently read and the indexes
# of every file holding that line.
def findmin(linesread):
    min = ""
    indexes = []
    for i in range(len(linesread)):
        if linesread[i] != "":
            min = linesread[i]
            indexes.append(i)
            break
    for i in range(indexes[0] + 1, len(linesread)):
        if linesread[i] < min and linesread[i] != "":
            min = linesread[i]
            indexes = [i]
        elif linesread[i] == min:
            indexes.append(i)
    return min, indexes

def genUniqueness(path):
    wordlists = []
    linecount = []
    log = open(path + ".fastuniqueness", 'w')

    # collect the word-list files under 'path'
    for root, dirs, files in os.walk(path):
        if root.find(".git") > -1 or root == ".":
            continue
        if root.find("onlyuppercase") > -1:
            continue
        for i in files:
            if i.find('lvl') >= 0 or i.find('trimmed') >= 0:
                wordlists.append(root + "/" + i)
                linecount.append(int(commands.getoutput("cat " + root + "/" + i + " | wc -l")))
                print root + "/" + i

    # open each list and prime it with its first line
    whandles = []
    linesread = []
    numlines = []
    uniquelines = []
    for w in wordlists:
        whandles.append(open(w, 'r'))
        linesread.append("")
        numlines.append(0)
        uniquelines.append(0)
    count = range(len(whandles))
    for i in count:
        linesread[i] = whandles[i].readline().strip()
        numlines[i] += 1

    # n-way merge: a line held by exactly one file is unique to it
    while True:
        (min, indexes) = findmin(linesread)
        if len(indexes) == 1:
            uniquelines[indexes[0]] += 1
        for i in indexes:
            linesread[i] = whandles[i].readline().strip()
            numlines[i] += 1
            if linesread[i] == "":
                numlines[i] -= 1
                whandles[i] = 0
                print "Expiring ", wordlists[i]
        if not any(linesread):
            break

    # write per-file unique and total line counts
    for i in count:
        log.write(wordlists[i] + "," + str(uniquelines[i]) + "," + str(numlines[i]) + "\n")
        print wordlists[i], uniquelines[i], numlines[i]