search algorithm on large files
I need help deciding which search algorithm to use for searching large files. Here is what I am doing. Let's say the file covers a time range t1 to t2 (t2 > t1).
I need to get file offsets (fseek) of:
- a time t3 that is greater than t1
- a time t4 that is smaller than t2
|------|---|----------------|
t1     t3  t4               t2
The naive version is to iterate through the whole file line by line: return the fseek offset when the current time reaches t3, then continue from that offset while the current time is below t4, and return the second offset.
Now let's say the file is 100GB and I need to iterate through the whole file just to get a period of 2 seconds. This logic becomes too CPU- and filesystem-expensive, so I am looking for better solutions. The language in use is C. Lines are currently fixed size, but I'd like to look into the future and deal with an algorithm that doesn't rely on fixed-size lines.
You can use a binary search if the times in the file are sorted. It's even better if the records in your file are of a fixed width, but you can probably make use of it even if they are not, with some work.
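For sorted, fixed-width records, the offset of the first record at or after a target time can be found by binary-searching over record indices and seeking directly to each probe. This is only a sketch: the record size, the `parse_time` helper, and the timestamp-at-start-of-record layout are assumptions, not details from the question.

```c
#include <stdio.h>
#include <stdlib.h>

#define REC_SIZE 64  /* assumed fixed record size, newline included */

/* Assumed format: each record starts with an ASCII integer timestamp. */
static long parse_time(const char *rec) {
    return strtol(rec, NULL, 10);
}

/* Return the byte offset of the first record whose time >= target,
 * or -1 on read error. nrecs = file size / REC_SIZE. */
long find_first_at_or_after(FILE *f, long nrecs, long target) {
    char buf[REC_SIZE + 1];
    long lo = 0, hi = nrecs;              /* search interval [lo, hi) */
    while (lo < hi) {
        long mid = lo + (hi - lo) / 2;
        fseek(f, mid * (long)REC_SIZE, SEEK_SET);
        if (fread(buf, 1, REC_SIZE, f) != (size_t)REC_SIZE)
            return -1;                    /* short read: treat as error */
        buf[REC_SIZE] = '\0';
        if (parse_time(buf) < target)
            lo = mid + 1;                 /* answer lies to the right */
        else
            hi = mid;                     /* mid is a candidate */
    }
    return lo * (long)REC_SIZE;           /* ready for fseek() */
}
```

Run it once for t3 and once for t4 to get both offsets in O(log n) reads instead of a full scan. For variable-length lines, the same idea still works with extra effort: seek to a byte offset, scan forward to the next newline to realign, then read the timestamp there.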
Since the values are fixed width, something like a binary search or an interpolation search sounds like the best option. Also, if you plan on working with files in that size class (100GB), you should consider using fgetpos/fsetpos due to the file size limits of fseek.
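On POSIX systems, another way around fseek's `long` offset limit is fseeko/ftello, which take an `off_t` and are convenient for the computed offsets a binary search produces. A minimal sketch, assuming fixed-size records; the function name is illustrative:

```c
#include <stdio.h>
#include <sys/types.h>

/* Seek to record `index` of fixed size `rec_size` with fseeko, whose
 * off_t offset does not overflow at 2 GiB the way fseek's long does
 * on 32-bit platforms. Compile with -D_FILE_OFFSET_BITS=64 there so
 * off_t is 64-bit; returns 0 on success, like fseeko itself. */
int seek_to_record(FILE *f, off_t index, off_t rec_size) {
    return fseeko(f, index * rec_size, SEEK_SET);
}
```

Unlike fgetpos/fsetpos, which can only restore positions you previously saved, fseeko lets you jump to any computed byte offset, which is exactly what a binary search over a 100GB file needs.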