Jumping to a line and reading it

I have to work with big files (many GB) and need quick lookups to retrieve specific lines on request.

The idea has been to maintain a mapping:

some_key -> byte_location

Where the byte location represents where in the file the line starts.

Edit: the question changed a little bit:

First I used:

FileInputStream stream = new FileInputStream(file);
BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
FileChannel channel = stream.getChannel();

I noticed that FileChannel.position() does not return the exact position the reader has reached, because the reader is buffered. It reads chunks of a given size (16k here), so what I get from the FileChannel is a multiple of 16k, not the exact position where the reader is actually reading.

PS: the file is in UTF-8
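
For building the mapping itself, one option that sidesteps the buffering problem is to count bytes directly instead of asking the channel. The following is only a sketch: the class name LineOffsetIndex and the choice to index by line number (rather than by some_key) are assumptions made for illustration, and it treats '\n' as the line terminator.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: records the byte offset at which each line starts.
final class LineOffsetIndex {

    static List<Long> build(String path) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            long pos = 0;               // number of bytes consumed so far
            boolean atLineStart = true; // true when the next byte begins a line
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    offsets.add(pos);
                    atLineStart = false;
                }
                pos++;
                // 0x0A never occurs inside a multi-byte UTF-8 sequence,
                // so a byte-level scan finds real line breaks only.
                if (b == '\n') {
                    atLineStart = true;
                }
            }
        }
        return offsets;
    }
}
```

Because the count is kept in bytes, the offsets stay exact regardless of how many bytes each UTF-8 character occupies. For a file of many GB, an in-memory List<Long> may be too large, so the offsets would probably need a more compact representation.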


Any reason not to create a FileInputStream, call stream.skip(pos) and then create an InputStreamReader around that, and a BufferedReader around the InputStreamReader?
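
A minimal sketch of that suggestion, assuming the stored offset is a byte offset to the start of a line. The class and method names (SkipLineReader, readLineAt) are made up for the example. Note that skip() is allowed to skip fewer bytes than requested, hence the loop.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating skip() followed by a buffered UTF-8 reader.
final class SkipLineReader {

    /** Returns the line starting at byteOffset, or null if there is nothing to read there. */
    static String readLineAt(String path, long byteOffset) throws IOException {
        try (FileInputStream in = new FileInputStream(path)) {
            long remaining = byteOffset;
            while (remaining > 0) {
                long skipped = in.skip(remaining);
                if (skipped <= 0) {
                    return null; // could not advance any further
                }
                remaining -= skipped;
            }
            // The reader is created only after skipping, so its buffering
            // starts exactly at the requested byte.
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
            return reader.readLine();
        }
    }
}
```

This opens the file again for every lookup, which keeps the sketch simple but may matter when many lines are requested.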


I would have tried something like this:

    RandomAccessFile raf = new RandomAccessFile(file, "r");
    ...
    raf.seek(position);
    raf.readLine();
    ...

The problem is that readLine() turns each byte into a character with the top 8 bits zero. That's fine if your file is ASCII or Latin-1, but problematic for UTF-8.
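
Since readLine() maps every byte to a char in the range 0..255, the original bytes can be recovered by encoding the result as ISO-8859-1 and then decoding those bytes as UTF-8. A sketch of that workaround follows; the names RafLineReader and readUtf8LineAt are invented for the example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: repairs the output of RandomAccessFile.readLine() for UTF-8 files.
final class RafLineReader {

    /** Returns the UTF-8 decoded line starting at byteOffset, or null at end of file. */
    static String readUtf8LineAt(String path, long byteOffset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(byteOffset);
            String raw = raf.readLine(); // one char per byte, top 8 bits zero
            if (raw == null) {
                return null;
            }
            // ISO-8859-1 turns chars 0..255 back into the identical byte values.
            byte[] bytes = raw.getBytes(StandardCharsets.ISO_8859_1);
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }
}
```

RandomAccessFile.readLine() is unbuffered, so this is likely to be slow for long lines.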

However, if you are prepared to use RandomAccessFile to write the file, you can use readUTF() and writeUTF() to read and write "lines" encoded as modified UTF-8 Strings.
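
A sketch of what that could look like, assuming you control how the file is written. UtfRecordFile, append and readAt are hypothetical names. Each record is a 2-byte length prefix followed by the modified UTF-8 bytes, so the result is no longer a plain text file, and writeUTF() rejects strings whose encoded form exceeds 65535 bytes.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper: stores each "line" as a writeUTF() record.
final class UtfRecordFile {

    /** Appends a record and returns the byte offset at which it starts. */
    static long append(RandomAccessFile raf, String line) throws IOException {
        long offset = raf.length();
        raf.seek(offset);
        raf.writeUTF(line); // 2-byte length, then modified UTF-8 bytes
        return offset;
    }

    /** Reads back the record that starts at the given byte offset. */
    static String readAt(RandomAccessFile raf, long offset) throws IOException {
        raf.seek(offset);
        return raf.readUTF();
    }
}
```

The offset returned by append() is the value that would go into the some_key -> byte_location mapping.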

FOLLOWUP

dammit ...utf-8 characters are screwed

Yea ... see above.

Another idea for coping with UTF-8 with RandomAccessFile:

  1. seek to the desired position,
  2. use the readFully(byte[]) method to read a bunch of bytes into a byte[],
  3. locate pos == position of the end of line in the buffer,
  4. if not found, read more bytes, append them to the buffer and go back to step 3,
  5. if found, use new String(bytes, 0, pos, "UTF-8") to convert to a Java String.

This is more cumbersome than using readLine(), but it should be faster than using FileInputStream and skip() when reading multiple lines from the files in random order.
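
The steps above could be sketched roughly as follows. This is an illustration, not a tuned implementation: ChunkedLineReader and the 4096-byte chunk size are arbitrary choices, and it uses read(byte[]) rather than readFully(byte[]), because readFully() throws EOFException when fewer bytes remain than the buffer holds.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads one UTF-8 line from an arbitrary byte offset.
final class ChunkedLineReader {

    private static final int CHUNK_SIZE = 4096; // arbitrary

    /** Returns the line starting at byteOffset, or null if the offset is at or past end of file. */
    static String readLineAt(RandomAccessFile raf, long byteOffset) throws IOException {
        raf.seek(byteOffset);
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        byte[] chunk = new byte[CHUNK_SIZE];
        int n;
        while ((n = raf.read(chunk)) > 0) {
            for (int i = 0; i < n; i++) {
                // 0x0A cannot appear inside a multi-byte UTF-8 sequence.
                if (chunk[i] == '\n') {
                    line.write(chunk, 0, i);
                    return decode(line);
                }
            }
            line.write(chunk, 0, n); // no newline yet: keep the bytes and read more
        }
        // End of file reached without a newline.
        return line.size() == 0 ? null : decode(line);
    }

    private static String decode(ByteArrayOutputStream line) {
        byte[] bytes = line.toByteArray();
        int len = bytes.length;
        if (len > 0 && bytes[len - 1] == '\r') {
            len--; // drop the CR of a CRLF line ending
        }
        return new String(bytes, 0, len, StandardCharsets.UTF_8);
    }
}
```

The bytes are decoded only once the whole line has been collected, so a multi-byte character that straddles two chunks is still decoded correctly. The same RandomAccessFile can be reused across lookups, which is where the advantage over reopening a FileInputStream would come from.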
