GZIPInputStream end-of-file sequence in BufferedReader
I use a Java BufferedReader to read, line by line, from a GZIPInputStream over a valid GZIP archive that contains 1,000 lines of ASCII text in typical CSV format. The code looks like this:
BufferedReader reader = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new FileInputStream(file))));
where file is the actual File object pointing to the archive.
I read through the whole file by calling
int count = 0;
String line = null;
while ((line = reader.readLine()) != null)
{
count++;
}
and the reader goes over the file as expected, but at the end it continues past line #1000 and reads one more line (i.e., count = 1001 after the loop ends).
Calling line.length() on the last line reports a large number (4,000+) of characters, all of which are non-printable (Character.getNumericValue() returns -1).
Actually, if I do line.getBytes() the resulting byte[] array has an equal number of NULL characters ('\0').
Does this seem like a bug in BufferedReader?
In any case, can anyone please suggest a workaround to bypass this behavior?
EDIT: More weird behavior: The first line read is prefixed by the filename, several NULL characters ('\0') and things like username and group name, then the actual text follows!
EDIT: I have created a very simple test class that reproduces the effect I described above, at least on my platform.
EDIT: Apparently false alarm, the file I was getting was not plain GZIP but tarred GZIP, so this explains it, no need for further testing. Thanks everyone!
I think I found your problem.
I tried to reproduce it with your source in the question, and got this output:
-------------------------------------
Reading PLAIN file
-------------------------------------
Printable part of line 1: This, is, line, number, 1
Line start (<= 25 characters): This__is__line__number__1
No NULL characters in line 1
Other information on line 1:
Length: 25
Bytes: 25
First byte: 84
Printable part of line 10: This, is, line, number, 10
Line start (<= 26 characters): This__is__line__number__10
No NULL characters in line 10
Other information on line 10:
Length: 26
Bytes: 26
First byte: 84
File lines read: 10
-------------------------------------
Reading GZIP file
-------------------------------------
Printable part of line 1: This, is, line, number, 1
Line start (<= 25 characters): This__is__line__number__1
No NULL characters in line 1
Other information on line 1:
Length: 25
Bytes: 25
First byte: 84
Printable part of line 10: This, is, line, number, 10
Line start (<= 26 characters): This__is__line__number__10
No NULL characters in line 10
Other information on line 10:
Length: 26
Bytes: 26
First byte: 84
File lines read: 10
-------------------------------------
TOTAL READ
-------------------------------------
Plain: 10, GZIP: 10
I think this is not what you are seeing. Why? You are using a tar.gz file. This is the tar archive format with gzip compression applied on top. The GZIPInputStream undoes the gzip compression, but knows nothing about the tar archive format.
tar is normally used to pack multiple files together - in an uncompressed format, but together with some metadata, which is what you observe:
EDIT: More weird behavior: The first line read is prefixed by the filename, several NULL characters ('\0') and things like username and group name, then the actual text follows!
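One way to check whether a given .gz file actually contains a tar archive is to decompress its first 512 bytes and look for the "ustar" magic string, which POSIX-style tar headers store at byte offset 257. The following is only a sketch: the class name TarSniffer and its methods are made up for illustration, and very old (pre-POSIX) tar archives carry no magic string, so a negative result is not conclusive.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of the original question or answer.
class TarSniffer {
    // POSIX (ustar) tar headers store the magic "ustar" at byte offset 257.
    private static final int MAGIC_OFFSET = 257;
    private static final byte[] MAGIC = "ustar".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the decompressed bytes begin with a ustar-style tar header. */
    public static boolean looksLikeTar(byte[] head) {
        if (head.length < MAGIC_OFFSET + MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[MAGIC_OFFSET + i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads up to 512 decompressed bytes from the start of a gzip file. */
    public static byte[] readHead(String path) throws IOException {
        try (InputStream in = new GZIPInputStream(new FileInputStream(path))) {
            byte[] buf = new byte[512];
            int total = 0;
            int n;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (total < buf.length
                    && (n = in.read(buf, total, buf.length - total)) != -1) {
                total += n;
            }
            return Arrays.copyOf(buf, total);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] head = readHead(args[0]);
        System.out.println(looksLikeTar(head)
                ? "tar archive inside gzip (tar.gz)"
                : "no ustar header found");
    }
}
```

Run against the file from the question, a check like this would be expected to report a tar archive, which matches the filename, user and group fields showing up in front of the first line.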
If you have a tar file, you need to use a tar decoder. The question "How do I extract a tar file in Java?" gives some links (like using the Tar task from Ant); there is also JTar.
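To illustrate what such a decoder has to do, here is a minimal sketch that uses only the standard library. It relies on the basic tar layout: a 512-byte header block whose first 100 bytes hold the entry name and whose bytes 124 to 135 hold the entry size as octal ASCII digits, followed by the entry data. The class FirstTarEntry and its methods are hypothetical; the sketch reads only the first entry, loads it fully into memory, and ignores long names, extension headers and directory entries, so for anything serious a real tar library is the better choice.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: reads the lines of the FIRST entry of a tar stream.
class FirstTarEntry {
    static final int BLOCK = 512; // tar headers occupy one 512-byte block

    /** Entry name: header bytes 0..99, NUL-terminated. */
    public static String entryName(byte[] header) {
        int end = 0;
        while (end < 100 && header[end] != 0) {
            end++;
        }
        return new String(header, 0, end, StandardCharsets.US_ASCII);
    }

    /** Entry size: header bytes 124..135, octal ASCII digits padded with NUL or space. */
    public static long entrySize(byte[] header) {
        String field = new String(header, 124, 12, StandardCharsets.US_ASCII);
        String digits = field.replace('\0', ' ').trim();
        return digits.isEmpty() ? 0L : Long.parseLong(digits, 8);
    }

    /** Skips the header and returns the text lines of the first entry only. */
    public static List<String> readFirstEntryLines(InputStream tarStream) throws IOException {
        DataInputStream in = new DataInputStream(tarStream);
        byte[] header = new byte[BLOCK];
        in.readFully(header);
        // Assumes the entry fits in memory (fine for ~1,000 CSV lines).
        byte[] data = new byte[(int) entrySize(header)];
        in.readFully(data);

        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.US_ASCII))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // GZIPInputStream removes the gzip layer; the tar layer is handled above.
        try (InputStream in = new GZIPInputStream(new FileInputStream(args[0]))) {
            System.out.println("File lines read: " + readFirstEntryLines(in).size());
        }
    }
}
```

Because the header block is consumed before any text is read, the metadata no longer shows up in the first line, and because only the declared number of data bytes is read, the trailing NUL padding never reaches readLine().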
If you want to send only one file, better use the gzip format directly (this is what I did in my test).
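For reference, a plain gzip round trip with the standard library could look roughly like this. It is a sketch, not the test class from the question: the class PlainGzipRoundTrip and its methods are invented here, and it works in memory rather than on files to stay self-contained.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: writes text as plain gzip (no tar layer) and reads it back.
class PlainGzipRoundTrip {

    /** Compresses the text with gzip only, so no archive metadata is added. */
    public static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(
                new GZIPOutputStream(bytes), StandardCharsets.US_ASCII)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed here, so the gzip trailer has been written.
        return bytes.toByteArray();
    }

    /** Counts lines the same way as the loop in the question. */
    public static int countLines(byte[] gzipped) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(gzipped)),
                StandardCharsets.US_ASCII))) {
            int count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder text = new StringBuilder();
        for (int i = 1; i <= 10; i++) {
            text.append("This, is, line, number, ").append(i).append('\n');
        }
        System.out.println("File lines read: " + countLines(gzip(text.toString())));
    }
}
```

With data written this way, the reading loop from the question sees exactly the lines that were written, with no extra header text and no trailing padding line.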
But there is no bug anywhere, apart from you expecting the gzip-stream to read the tar format.