How can I search a string in a very big file with a specific format in java? [duplicate]
Possible Duplicate:
do searching in a very big ARPA file in a very short time in java
my file's format:
\data\
ngram 1=19
ngram 2=234
ngram 3=1013
\1-grams:
-1.7132 puluh -3.8008
-1.9782 satu -3.8368
\2-grams:
-1.5403 dalam dua -1.0560
-3.1626 dalam ini 0.0000
\3-grams:
-1.8726 itu dan tiga
-1.9654 itu dan untuk
\end\
As you can see I have a number of lines in ngram 1,2 and 3. There is no need to read the whole file. If an input string is a one-word string, the program can just search in \1-grams: part. If an input string is a two-word string, the program can just search in \2-grams: part and so on. At last if the program finds the input string in the file, it has to return two numbers which are located at the left and right sides of the string. Also, I have to say that each part of the file has been sorted. I am sure that I do not have to read the file completely, and using the index file can not solve my problem. These ways take a lot of time, and my lecturer said that searching has to be done in less than 1 minute for such a big file. I think the best thing is to find a way to jump to a specific line not byte of the file, but I do not know how I can do it. It will be great if someone can help me to solve my problem.
My file is almost 800MB. I have found that u开发者_运维百科sing BufferedReader is a good way to read a file very fast, but when I read such a big file and put it in an array line by line, it takes more than 30 minutes.
How big is your file? A minute is a very long time. I would suggest using a BufferedReader for efficiency (and also for its readLine
method).
If that really takes too long, two approaches come to mind that don't use indexes:
Force every line in the file to be the same length. Then you can jump to a specific line by calculating its start. If you don't know the line number you need, then at least you can use this to efficiently do a binary search of the entire file.
Jump to an arbitrary position and read forward until you get to a line that starts with a
\
. That will tell you whether you've found the right part or whether you need to jump forward from there or backward from the arbitrary position that you jumped to. This can also be used to create a binary search strategy for the data you need. It relies on the\
being a reliable indicator of the start of a part.
精彩评论