Jumping to a line and reading it
I have to work with big files (many GB) and need quick lookups to retrieve specific lines on request.
The idea has been to maintain a mapping:
some_key -> byte_location
Where the byte location represents where in the file the line starts.
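A minimal sketch of how such a mapping could be built in one pass, assuming the key is simply the zero-based line number (the class and method names are hypothetical). It counts raw bytes rather than characters, which is safe for UTF-8 because the byte 0x0A never occurs inside a multi-byte sequence:

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: records the byte offset at which every line starts.
final class LineOffsetIndex {

    /** Returns the start offset of each line; list index = zero-based line number. */
    static List<Long> build(Path file) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            long pos = 0;               // offset of the byte about to be read
            boolean atLineStart = true; // true until the first byte of a line is seen
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    offsets.add(pos);
                    atLineStart = false;
                }
                pos++;
                if (b == '\n') {
                    atLineStart = true; // the next byte, if any, starts a new line
                }
            }
        }
        return offsets;
    }
}
```

For a file of many GB, a `List<Long>` holding every line may itself get large; a real index would probably store only the keys that are actually looked up.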
Edit: the question changed a little bit:
First I used:
FileInputStream stream = new FileInputStream(file);
BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
FileChannel channel = stream.getChannel();
I noticed that FileChannel.position() will not return the exact position where the reader is currently reading, because it is a "buffered" reader. It reads chunks of a given size (16k here), so what I get from the FileChannel is a multiple of 16k, not the exact position where the reader is actually reading.
PS: the file is in UTF-8
Any reason not to create a FileInputStream, call stream.skip(pos), and then create an InputStreamReader around that, and a BufferedReader around the InputStreamReader?
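A rough sketch of that suggestion, assuming the file is UTF-8 as stated in the question and that pos is the byte offset of a line start (the class and method names are made up for illustration). skip() is called in a loop because it is allowed to skip fewer bytes than requested:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the skip-then-wrap approach.
final class SkipThenRead {

    /** Reads the line starting at byte offset pos, or returns null at end of file. */
    static String readLineAt(String path, long pos) throws IOException {
        try (FileInputStream stream = new FileInputStream(path)) {
            long remaining = pos;
            while (remaining > 0) {
                long skipped = stream.skip(remaining);
                if (skipped <= 0) {
                    throw new IOException("could not skip to offset " + pos);
                }
                remaining -= skipped;
            }
            // Wrap the stream only after skipping, so the reader's buffer
            // starts exactly at the requested byte offset.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(stream, StandardCharsets.UTF_8));
            return reader.readLine();
        }
    }
}
```

This opens the file once per lookup, which keeps the code simple but may be slow when many lines are fetched in random order.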
I would have tried something like this:
RandomAccessFile raf = new RandomAccessFile(file, "r");
...
raf.seek(position);
raf.readLine();
...
The problem is that readLine() turns each byte into a character with the top 8 bits zero. That's fine if your file is ASCII or Latin-1, but problematic for UTF-8.
However, if you are prepared to use RandomAccessFile to write the file, you can use readUTF() and writeUTF() to read and write "lines" encoded as modified UTF-8 Strings.
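Something along these lines, as a sketch only (the UtfRecords class and its method names are invented for the example). Each record is stored as a 2-byte length followed by the modified UTF-8 bytes, so a single record cannot exceed 65535 encoded bytes, and the resulting file is no longer a plain text file:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper: stores each "line" as a writeUTF() record.
final class UtfRecords {

    /** Appends one record at the end of the file and returns its start offset. */
    static long append(RandomAccessFile raf, String line) throws IOException {
        long start = raf.length();
        raf.seek(start);
        raf.writeUTF(line); // 2-byte length prefix + modified UTF-8 payload
        return start;       // this is the value to keep in the key -> offset mapping
    }

    /** Reads back the record that starts at the given offset. */
    static String readAt(RandomAccessFile raf, long offset) throws IOException {
        raf.seek(offset);
        return raf.readUTF();
    }
}
```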
FOLLOWUP
dammit ...utf-8 characters are screwed
Yea ... see above.
Another idea for coping with UTF-8 with RandomAccessFile:
- seek to the desired position,
- use the readFully(byte[]) method to read a bunch of bytes into a byte[],
- locate pos == position of the end of line in the buffer,
- if not found, read more bytes, concatenate, and search again,
- if found, use new String(bytes, 0, pos, "UTF-8") to convert to a Java String.
This is more cumbersome than using readLine(), but it should be faster than using FileInputStream and skip() when reading multiple lines from the file in random order.
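The steps above could look roughly like this. It is a sketch under a few assumptions: lines end in "\n" or "\r\n", the 4096-byte chunk size is arbitrary, and the class name is invented. It also uses read(byte[]) instead of readFully(byte[]), because readFully() throws EOFException when fewer bytes remain in the file than the buffer holds:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads one UTF-8 line starting at a known byte offset.
final class Utf8LineReader {
    private static final int CHUNK = 4096; // arbitrary chunk size

    /** Returns the line starting at position, or null if position is at end of file. */
    static String readLineAt(RandomAccessFile raf, long position) throws IOException {
        raf.seek(position);
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        byte[] chunk = new byte[CHUNK];
        int n;
        while ((n = raf.read(chunk)) != -1) {
            // Look for the end of line in the bytes just read.
            for (int i = 0; i < n; i++) {
                if (chunk[i] == '\n') {
                    line.write(chunk, 0, i);
                    return decode(line);
                }
            }
            // Not found: keep these bytes and read another chunk.
            line.write(chunk, 0, n);
        }
        // End of file: return what was collected, if anything.
        return line.size() == 0 ? null : decode(line);
    }

    // Decodes the collected bytes only once the whole line is available,
    // so a multi-byte character split across two chunks is handled correctly.
    private static String decode(ByteArrayOutputStream bytes) {
        byte[] raw = bytes.toByteArray();
        int len = raw.length;
        if (len > 0 && raw[len - 1] == '\r') {
            len--; // drop the CR of a CRLF line ending
        }
        return new String(raw, 0, len, StandardCharsets.UTF_8);
    }
}
```

The same RandomAccessFile can be kept open and reused across lookups, which is where the advantage over reopening a FileInputStream for each line would come from.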