Best Way To Pre-process And Search A Text File In .NET CF

I have a text file with about 100,000 lines (5 MB), which is updated once a day. It grows at a rate of about 30 lines a day. The lines are not sorted in any way. Each line is 50 hex characters long and looks like this:

ABCDE9DAF1F66C10C02F25A1685821F8428422F5870F39A3FE

Given one of these strings, I need to figure out whether it exists in this file. I am working in C# (.NET CF 2.0) on a handheld device, so memory is limited. I have the ability to process the file beforehand on a Windows server. What is the fastest way to do this? Some of my initial ideas include: sorting the file, a line-by-line string compare, creating a binary file to search, or using SQLite.

From the OP's comments (an important detail that was initially left out of the question):

The file is read-only. No changes will ever be made by my programs. I get a new version of the file each day with more strings appended to the end.


The optimal way to do this would probably be to pre-sort the file on the server and use memory-mapped files to do a binary search of the file. That said, .NET CF 2.0 has no support for memory-mapped files.

You're probably best off just pre-sorting the file and using stream access to perform a binary search on the file. It's not great because you don't get sequential reads, but since you're on CF, there's a good chance the data store on the device is flash-based, so the random access for the binary search probably won't be too bad...


Keep the file sorted on the server ((c) LorenVS), but do the binary search directly on the file by using the record length (50 hex characters + 2 for CR/LF) to seek the file pointer to the midpoint, then read the string there and compare. That should minimize the memory needed on the device.

OK, I see now that the second part of the idea is (c) LorenVS too.
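A minimal sketch of that seek-and-compare binary search, assuming the server pre-sorts the file with an ordinal comparison and every record is exactly 52 bytes (50 hex characters plus CR/LF); the class and method names are illustrative:

using System;
using System.IO;
using System.Text;

static class SortedHexFile
{
    const int RecordLength = 52;  // 50 hex chars + CR + LF
    const int KeyLength = 50;

    public static bool Contains(string path, string key)
    {
        byte[] buffer = new byte[KeyLength];
        using (FileStream fs = new FileStream(path, FileMode.Open,
                                              FileAccess.Read, FileShare.Read))
        {
            long lo = 0;
            long hi = fs.Length / RecordLength - 1;
            while (lo <= hi)
            {
                long mid = lo + (hi - lo) / 2;
                fs.Seek(mid * RecordLength, SeekOrigin.Begin);
                fs.Read(buffer, 0, KeyLength);  // read just the 50-char key
                string line = Encoding.ASCII.GetString(buffer, 0, KeyLength);
                int cmp = string.CompareOrdinal(line, key);
                if (cmp == 0) return true;
                if (cmp < 0) lo = mid + 1;
                else hi = mid - 1;
            }
        }
        return false;
    }
}

With about 100,000 records this costs at most 17 seek-and-read pairs per lookup, which should be cheap even on flash storage.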


Store the data in a base-256 DAWG (directed acyclic word graph) - you'll get a reasonably compact data representation and fast searches.
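For illustration only, here is the lookup half of such a structure under an assumed serialization: each node is 16 little-endian Int32 child offsets (one per hex digit, -1 meaning no child), with the root at byte offset 0. A real base-256 (or base-16) DAWG also merges identical suffix subtrees on the server to keep the image small, but the lookup logic is unchanged:

using System;

static class DawgLookup
{
    // Returns true if the 50-digit uppercase hex key traces a complete
    // path through the serialized graph.
    public static bool Contains(byte[] image, string key)
    {
        int node = 0;                                 // root node at byte offset 0
        for (int i = 0; i < key.Length; i++)
        {
            char c = key[i];                          // assumes uppercase hex, as in the sample
            int digit = c <= '9' ? c - '0' : c - 'A' + 10;
            int child = BitConverter.ToInt32(image, node + digit * 4);
            if (child < 0) return false;              // no edge for this digit: key absent
            node = child;                             // follow the edge
        }
        return true;                                  // all 50 digits matched
    }
}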


If your application is running and has to prevent appending a duplicate string to the existing file, you could keep the whole file's contents in memory in a Hashtable or a SortedList. When your application starts, you could load this collection on another thread to keep your UI responsive.
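A rough sketch of that approach (type and member names are made up; Hashtable, StreamReader, and Thread are all available on .NET CF 2.0):

using System.Collections;
using System.IO;
using System.Threading;

class KeyCache
{
    private readonly Hashtable keys = new Hashtable();
    private volatile bool loaded;

    public bool IsLoaded { get { return loaded; } }

    // Kick off the load on a background thread so the UI stays responsive.
    public void BeginLoad(string path)
    {
        Thread t = new Thread(delegate()
        {
            using (StreamReader reader = new StreamReader(path))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                    lock (keys) { keys[line] = true; }
            }
            loaded = true;
        });
        t.IsBackground = true;
        t.Start();
    }

    public bool Contains(string key)
    {
        lock (keys) { return keys.ContainsKey(key); }
    }
}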

Consider that even using SQLite or SQL CE you then have the footprint of the embedded database, and I think 5 MB doesn't scare anybody these days.


There are already some suggestions for sorting the file.

Another idea might be to keep the main file unsorted and use a secondary file to check for duplicates.

Use a format with a small hash value and a fixed number of offset slots. The hash value is an offset into the secondary file; at that offset sits an array of offsets into the primary file. When any hash bucket fills up, you'd need to rebuild with a bigger hash value and a bigger secondary file, or use a trick like cuckoo hashing. A sketch of one possible layout follows.
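To make that concrete, here is one possible (entirely assumed) layout: a 16-bit hash of the key selects one of 65,536 buckets in the secondary file, and each bucket is a fixed array of Int32 record indexes into the unsorted primary file, with -1 marking an empty slot. Lookup then reads one bucket and at most a handful of primary records:

using System;
using System.IO;
using System.Text;

static class HashIndex
{
    const int BucketCount = 65536;   // 16-bit hash value
    const int SlotsPerBucket = 8;    // rebuild with a bigger hash if a bucket fills
    const int RecordLength = 52;     // primary file: 50 hex chars + CR/LF

    static int HashKey(string key)
    {
        int h = 0;
        for (int i = 0; i < key.Length; i++)
            h = h * 31 + key[i];
        return (h & 0x7FFFFFFF) % BucketCount;
    }

    public static bool Contains(Stream index, Stream primary, string key)
    {
        byte[] slot = new byte[4];
        byte[] record = new byte[50];
        index.Seek((long)HashKey(key) * SlotsPerBucket * 4, SeekOrigin.Begin);
        for (int i = 0; i < SlotsPerBucket; i++)
        {
            index.Read(slot, 0, 4);
            int recordIndex = BitConverter.ToInt32(slot, 0);
            if (recordIndex < 0) return false;  // empty slot: key was never inserted
            primary.Seek((long)recordIndex * RecordLength, SeekOrigin.Begin);
            primary.Read(record, 0, 50);
            if (Encoding.ASCII.GetString(record, 0, 50) == key)
                return true;
        }
        return false;  // full bucket scanned with no match
    }
}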
