Checking for Duplicate Files without Storing their Checksums

Suppose you have an application that processes files sent by different clients. The clients send tons of files every day, and you load the content of those files into your system. The files all have the same format. The only constraint you are given is that you are not allowed to run the same file twice.

To check whether you have already run a particular file, you create a checksum of the file and store it in another file. When you get a new file, you create its checksum and compare it against the stored checksums of the files you have already run.

Now, the file that contains the checksums of all the files that you have run so far is getting really, really huge. Searching and comparing is taking too much time.

NOTE: The application uses flat files as its database. Please do not suggest using an RDBMS or the like. It is simply not possible at the moment.

Do you think there could be another way to check for duplicate files?


Keep them in different places: have one directory where the client(s) upload files for processing, have another where those files are stored.

Or are you in a situation where the client can upload the same file multiple times? If that's the case, then you pretty much have to do a full comparison each time.

And checksums, while they give you confidence that two files are different (and, depending on the checksum, a very high confidence), are not 100% guaranteed. You simply can't take a practically infinite universe of possible multi-byte streams, reduce them to a 32-byte checksum, and be guaranteed uniqueness.

Also: consider a layered directory structure. For example, a file foobar.txt would be stored using the path /f/fo/foobar.txt. This will minimize the cost of scanning directories (a linear operation) for the specific file.

And if you retain checksums, this can be used for your layering: /1/21/321/myfile.txt (using least-significant digits for the structure; the checksum in this case might be 87654321).
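
A minimal sketch of that checksum-based layering, assuming the checksum is available as a string of at least three characters. The function name layered_path is made up for this example, and PurePosixPath is used only so the output looks the same on every platform:

```python
from pathlib import PurePosixPath


def layered_path(root, checksum, filename):
    """Build root/<last 1 char>/<last 2 chars>/<last 3 chars>/filename."""
    if len(checksum) < 3:
        raise ValueError("checksum is too short to build three levels")
    return (PurePosixPath(root)
            / checksum[-1:]    # least-significant character
            / checksum[-2:]    # last two characters
            / checksum[-3:]    # last three characters
            / filename)


print(layered_path("/", "87654321", "myfile.txt"))  # /1/21/321/myfile.txt
```

Each level repeats the characters of the level above it, so any single directory name still tells you which checksum suffix it holds.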


Nope. You need to compare all files. Strictly speaking, you need to compare the contents of each new file against all the files you have already seen. You can approximate this with a checksum or hash function, but if you find that a new file's hash is already listed in your index, you then need to do a full comparison to be sure, since hashes and checksums can have collisions.

So it comes down to how to store the file more efficiently.

I'd recommend you leave it to professional software such as Berkeley DB, memcached, Voldemort or the like.

If you must roll your own you could look at the principles behind binary searching (qsort, bsearch etc).

If you maintain the list of seen checksums (and the path to the full file, for that double-check I mentioned above) in sorted form, you can search for it using a binary search. However, the cost of inserting each new item in the correct order becomes increasingly expensive.
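
As a rough illustration of the sorted-list idea, here is a sketch using Python's bisect module. It assumes the hashes fit in memory as strings; the name seen_before is invented for this example, and the path to the full file is left out for brevity:

```python
import bisect


def seen_before(sorted_hashes, digest):
    """Return True if digest is already in the sorted list, else insert it."""
    i = bisect.bisect_left(sorted_hashes, digest)   # O(log n) binary search
    if i < len(sorted_hashes) and sorted_hashes[i] == digest:
        return True
    # Inserting keeps the list sorted but shifts every later element: O(n).
    sorted_hashes.insert(i, digest)
    return False
```

The lookup stays cheap as the list grows; the insert is the part that gets more expensive with every new item.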

One mitigation for a large number of hashes is to bin-sort them, e.g. have 256 bins corresponding to the first byte of the hash. You then only have to search and insert in the list of hashes that start with that byte, and you can omit the first byte from storage.

If you are managing hundreds of millions of hashes (in each bin), then you might consider a two-phase sort such that you have a main list for each hash and then a 'recent' list; once the recent list reaches some threshold, say 100000 items, then you do a merge into the main list (O(n)) and reset the recent list.
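
A sketch of that two-phase scheme for a single bin, under the assumption that everything is held in memory. The class name TwoPhaseIndex is hypothetical, and the 'recent' list is kept as a set here (my choice, not part of the description above) so that recent lookups stay cheap:

```python
import bisect
import heapq


class TwoPhaseIndex:
    def __init__(self, threshold=100_000):
        self.main = []        # large list, always sorted
        self.recent = set()   # small buffer of new hashes, unsorted
        self.threshold = threshold

    def __contains__(self, digest):
        if digest in self.recent:
            return True
        i = bisect.bisect_left(self.main, digest)
        return i < len(self.main) and self.main[i] == digest

    def add(self, digest):
        """Record digest; return False if it was already present."""
        if digest in self:
            return False
        self.recent.add(digest)
        if len(self.recent) >= self.threshold:
            # One linear merge of two sorted sequences, then reset the buffer.
            self.main = list(heapq.merge(self.main, sorted(self.recent)))
            self.recent.clear()
        return True
```

The expensive work is paid once per threshold insertions instead of once per insertion.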


You need to compare any new document against all previous documents; the efficient way to do that is with hashes.

But you don't have to store all the hashes in a single unordered list, nor does the next step up have to be a full database. Instead you can have directories based on the first digit or first 2 digits of the hash, then files based on the next 2 digits, with those files containing sorted lists of hashes. (Or any similar scheme - you can even make it adaptive, increasing the levels when the files get too big.)

That way, searching for matches involves a couple of directory lookups followed by a binary search in a file.
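
One way this could look, as a sketch only. It assumes lowercase 64-character hex digests (e.g. SHA-256) stored one per line, so every record has the same width and the file can be binary-searched with seek. The names bucket_path, contains and add, and the .idx suffix, are invented for this example:

```python
import os

RECORD = 65  # assumed record width: 64 hex characters plus "\n"


def bucket_path(root, digest):
    # Directory from the first 2 digits, file from the next 2 digits.
    return os.path.join(root, digest[:2], digest[2:4] + ".idx")


def contains(root, digest):
    """Binary search the sorted, fixed-width records of one bucket file."""
    path = bucket_path(root, digest)
    if not os.path.exists(path):
        return False
    target = digest.encode("ascii")
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path) // RECORD
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD)
            rec = f.read(RECORD - 1)   # the digest without its newline
            if rec == target:
                return True
            if rec < target:
                lo = mid + 1
            else:
                hi = mid
    return False


def add(root, digest):
    """Insert digest, keeping the bucket sorted (rewrites the whole bucket)."""
    path = bucket_path(root, digest)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    records = []
    if os.path.exists(path):
        with open(path, "r", encoding="ascii") as f:
            records = f.read().split()
    if digest in records:
        return
    records.append(digest)
    records.sort()
    with open(path, "w", encoding="ascii", newline="\n") as f:
        f.write("".join(r + "\n" for r in records))
```

Rewriting a bucket on every insert is only tolerable because each bucket holds a small slice of the hashes; if buckets grow too large, that is the point to add another level.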

If you get lots of quick repeats (the same file submitted several times in quick succession), then a look-aside cache might also be worth having.


I think you're going to have to redesign the system, if I understand your situation and requirements correctly.

Just to clarify, I'm working on the basis that clients send you files throughout the day, with filenames that we can assume are irrelevant, and when you receive a file you need to ensure its contents are not the same as another file's contents.

In which case, you do need to compare every file against every other file. That's not really avoidable, and you're doing about the best you can manage at the moment. At the very least, asking for a way to avoid the checksum is asking the wrong question - you have to compare an incoming file against the entire corpus of files already processed today, and comparing the checksums is going to be much faster than comparing entire file bodies (not to mention the memory requirements for the latter...).

However, perhaps you can speed up the checking somewhat. If you store the already-processed checksums in something like a trie, it should be a lot quicker to see if a given file (rather, checksum) has already been processed. For a 32-character hash, you'd need to do a maximum of 32 lookups to see if that file had already been processed, rather than comparing with potentially every other file. Like a binary search, this avoids a linear scan of the existing checksums; the cost depends on the length of the checksum rather than on how many checksums are stored.
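
A minimal in-memory sketch of such a trie, built from nested dicts. The class name ChecksumTrie and the end-marker object are made up for this example, and persisting the structure to flat files is not addressed here:

```python
_END = object()  # marks the end of a complete checksum


class ChecksumTrie:
    def __init__(self):
        self.root = {}

    def add(self, checksum):
        """Insert checksum; return True if it is new, False if already stored."""
        node = self.root
        for ch in checksum:               # one dict lookup per character
            node = node.setdefault(ch, {})
        if _END in node:
            return False
        node[_END] = True
        return True

    def __contains__(self, checksum):
        node = self.root
        for ch in checksum:
            node = node.get(ch)
            if node is None:
                return False
        return _END in node
```

For a 32-character checksum, both add and the membership test walk at most 32 levels, however many checksums have been stored.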


You should at the very least move the checksums file into a proper database file (assuming it isn't already) - although SQLExpress with its 4GB limit might not be enough here. Then, along with each checksum store the filename, file size and date received, add indexes to file size and checksum, and run your query against only the checksums of files with an identical size. But as Will says, your method of checking for duplicates isn't guaranteed anyway.


Despite you asking not to suggest an RDBMS, I will still suggest SQLite. If you store all the checksums in one table with an index, searches will be quite fast, and integrating SQLite is not a problem at all.


As Will pointed out in his longer answer, you should not store all hashes in a single large file, but simply split them up into several files.

Let's say the alphanumeric-formatted hash is pIqxc9WI. You store that hash in a file named pI_hashes.db (based on the first two characters).

When a new file comes in, calculate the hash, take the first 2 characters, and only do the lookup in the corresponding CHARS_hashes.db file.
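
Roughly, and only as a sketch: the code below assumes one hash per line in each file and a purely alphanumeric hash, as in the example above. The function names shard_path and check_and_record are invented here:

```python
import os


def shard_path(index_dir, digest):
    # e.g. "pIqxc9WI" goes to <index_dir>/pI_hashes.db
    return os.path.join(index_dir, digest[:2] + "_hashes.db")


def check_and_record(index_dir, digest):
    """Return True if digest was already recorded; otherwise append it."""
    path = shard_path(index_dir, digest)
    if os.path.exists(path):
        with open(path, "r", encoding="ascii") as f:
            if any(line.rstrip("\n") == digest for line in f):
                return True
    os.makedirs(index_dir, exist_ok=True)
    with open(path, "a", encoding="ascii") as f:
        f.write(digest + "\n")
    return False
```

The scan within one file is still linear, but each file only holds the hashes that share the same two leading characters.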


After creating a checksum, create a directory with the checksum as the name and then put the file in there. If there are already files in there, compare your new file with the existing ones.

That way, you only have to check one (or a few) files.

I also suggest adding a header (a single line) to the file which explains what's inside: the date it was created, the IP address of the client, some business keys. The header should be chosen in such a way that you can detect duplicates by reading this single line.

[EDIT] Some file systems bog down when you have a directory with many entries (in this case: the checksum directories). If this is an issue for you, create a second layer by using the first two characters of the checksum as the name of the parent directory. Repeat as necessary.

Don't cut off the two characters from the next level; this way, you can easily find files by checksum if something goes wrong without cutting checksums manually.
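
Here is a sketch of the directory-per-checksum idea with the extra parent layer. SHA-256 is an assumption (any checksum works), the function names are invented, and stored files are simply numbered inside their checksum directory to avoid name clashes:

```python
import filecmp
import hashlib
import os
import shutil


def file_digest(path):
    """Hash the file in chunks so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def store_if_new(incoming, store_root):
    """Return True if the file was stored, False if an identical one exists."""
    digest = file_digest(incoming)
    # Parent directory uses the first two characters; the child keeps the
    # full checksum as its name, so nothing is cut off.
    bucket = os.path.join(store_root, digest[:2], digest)
    os.makedirs(bucket, exist_ok=True)
    existing = os.listdir(bucket)
    for name in existing:
        # shallow=False forces a byte-by-byte comparison of the contents.
        if filecmp.cmp(incoming, os.path.join(bucket, name), shallow=False):
            return False
    shutil.copy2(incoming, os.path.join(bucket, "%06d" % len(existing)))
    return True
```

The full comparison only runs against files that share the same checksum, which in practice means zero or one file.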


As mentioned by others, having a different data structure for storing the checksums is the correct way to go. Anyway, although you have mentioned that you don't want to go the RDBMS way, why not try SQLite? You can use it like a file, and it is lightning fast. It is also very simple to use - most languages have SQLite support built in, too. It will take you less than 40 lines of code in, say, Python.
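
For what it's worth, a sketch with Python's built-in sqlite3 module could look like this. The table name seen, its columns, and the function names are all made up for this example:

```python
import sqlite3


def open_index(path):
    """Open (or create) the single-file checksum index."""
    conn = sqlite3.connect(path)
    # PRIMARY KEY gives an index on checksum, so lookups avoid a full scan.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen ("
        "checksum TEXT PRIMARY KEY, filename TEXT)"
    )
    return conn


def record_if_new(conn, checksum, filename):
    """Return True if the checksum was newly recorded, False if already seen."""
    with conn:  # commits on success, rolls back on an exception
        row = conn.execute(
            "SELECT 1 FROM seen WHERE checksum = ?", (checksum,)
        ).fetchone()
        if row is not None:
            return False
        conn.execute(
            "INSERT INTO seen (checksum, filename) VALUES (?, ?)",
            (checksum, filename),
        )
    return True
```

The whole index lives in one ordinary file, so it can sit next to the existing flat files without a database server.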
