Searching / Indexing huge file amounts
I'm struggling to find an efficient way (< 0.5 sec per lookup) to search for specific files in a huge file system when I only have a small part of the desired file name.
Here's the scenario:
Consider about 15.000.000 files, all categorised by the type of information they contain and batched into numbered directories of 20.000 files each:
DATA
--TYPE_1_001
----ID_1234567_TYPE1.XML
----ID_2345678_TYPE1.XML
----[...]
--TYPE_1_002
--[...]
--TYPE_1_097
--TYPE_2_001
----ID_1234567_TYPE2.JPG
----ID_2345678_TYPE2.JPG
----ID_2345679_TYPE2.JPG
----[...]
--[...]
--TYPE_2_304
--[...]
and so on.
So, given an ID (e.g. 1234567), I'm trying to find all filenames containing that ID. This lookup has to be executed for each of the 7.000.000 IDs listed in another XML file.
The current process would take about 405 days to work through all 7.000.000 IDs, which is, unsurprisingly, unacceptable ;)
Any suggestions?
Thanks in advance!
Is there any way you can extract the data into a database or an index (such as Lucene)?
That would take some time up front, but lookups would be much faster once the index is available.
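A minimal sketch of such a one-time index, assuming the filenames follow the ID_<number>_TYPE<n>.<ext> pattern shown in the question; the DATA root and the files.db name are placeholders, and SQLite stands in for whatever database or Lucene index you pick:

    import os
    import re
    import sqlite3

    # One-time pass: walk DATA once and record (id, path) pairs in SQLite.
    ID_PATTERN = re.compile(r"ID_(\d+)_")

    conn = sqlite3.connect("files.db")
    conn.execute("CREATE TABLE IF NOT EXISTS files (id TEXT, path TEXT)")

    with conn:
        for root, _dirs, names in os.walk("DATA"):
            rows = []
            for name in names:
                match = ID_PATTERN.search(name)
                if match:
                    rows.append((match.group(1), os.path.join(root, name)))
            conn.executemany("INSERT INTO files (id, path) VALUES (?, ?)", rows)

    # Index the id column so each lookup is a B-tree search, not a table scan.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_files_id ON files (id)")

    # Per-ID lookup: all matching paths, typically well under a millisecond.
    paths = [p for (p,) in conn.execute(
        "SELECT path FROM files WHERE id = ?", ("1234567",))]
    print(paths)

The expensive directory walk happens once; the 7.000.000 lookups then hit the index instead of the file system.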
Use an SSD instead of a hard drive. A regular hard drive can only perform around 120 IOs per second, because the head has to move to the location where the information is stored. A fast SSD can perform 10,000 IO operations per second, as there are no moving parts. However, even with an SSD it is going to take about 2 seconds at best to scan the names of every directory.
If you want it to be faster than that, you need to cache/index the names and look them up from memory (a minimal sketch follows below).
BTW: If you had a SSD Raid 6 set, it could perform IO fast enough to scan 20K files in under 0.5 seconds.
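A rough in-memory variant of that idea, under the same filename assumption as above (the DATA root is a placeholder): one walk of the tree builds a dict keyed by ID, after which each of the 7.000.000 lookups is just a hash probe. With roughly 15.000.000 entries this should fit in a few GB of RAM, but that estimate depends on path lengths.

    import os
    import re
    from collections import defaultdict

    ID_PATTERN = re.compile(r"ID_(\d+)_")

    # Build the in-memory index once: id -> list of matching file paths.
    index = defaultdict(list)
    for root, _dirs, names in os.walk("DATA"):
        for name in names:
            match = ID_PATTERN.search(name)
            if match:
                index[match.group(1)].append(os.path.join(root, name)))

    # Each lookup is now a constant-time dict access.
    print(index.get("1234567", []))  # all files for one id, or [] if absent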