Issues with parsing a huge file
I'm parsing a document and writing pairs such as these to disk:
0 vs 1, true
0 vs 2, false
0 vs 3, true
1 vs 2, true
1 vs 3, false
..
and so on.
Afterwards, I balance the true and false rows for each instance by removing random lines (lines with a true value if they are in excess, and vice versa), and I end up with a file such as this one:
0 vs 1 true
0 vs 2 false
1 vs 2 true
1 vs 3 true
1 vs 4 false
1 vs 5 false
The falses usually far outnumber the trues, so in the previous example I could keep only 1 false for instance 0 and only 2 falses for instance 1.
I'm doing this process in 2 steps: first parsing, then balancing.
Now, my issue is that the unbalanced file is too big: more than 1 GB, and most of its rows will be removed by the balancing step.
My question is: can I balance the rows while parsing?
My guess is no, because I don't know which items are arriving, and I can't delete any row until all rows for a specific instance have been discovered.
I hope this is clear. Thanks.
What would happen if you used a lightweight database for this (Derby, H2, etc.)? I imagine you could write sorting and filtering queries to arrive at what you want...
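For example, here is a minimal sketch of that idea with embedded H2 over JDBC, assuming the pair file uses the "0 vs 1, true" line format from the question (the file, table, and column names are made up). It loads all rows, then for each instance emits an equally sized random sample of trues and falses:

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.*;

public class BalanceWithH2 {
    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection("jdbc:h2:./pairs");
             BufferedReader in = new BufferedReader(new FileReader("unbalanced.txt"))) {
            c.createStatement().execute("CREATE TABLE pairs(a INT, b INT, flag BOOLEAN)");
            PreparedStatement ins = c.prepareStatement("INSERT INTO pairs VALUES(?, ?, ?)");
            for (String line; (line = in.readLine()) != null; ) {
                String[] t = line.split(" vs |, ");   // assumed line format: "0 vs 1, true"
                ins.setInt(1, Integer.parseInt(t[0].trim()));
                ins.setInt(2, Integer.parseInt(t[1].trim()));
                ins.setBoolean(3, Boolean.parseBoolean(t[2].trim()));
                ins.addBatch();
            }
            ins.executeBatch();

            PreparedStatement countStmt = c.prepareStatement(
                    "SELECT COUNT(*) FROM pairs WHERE a = ? AND flag = ?");
            PreparedStatement sample = c.prepareStatement(
                    "SELECT b, flag FROM pairs WHERE a = ? AND flag = ? ORDER BY RAND() LIMIT ?");
            ResultSet instances = c.createStatement().executeQuery("SELECT DISTINCT a FROM pairs");
            while (instances.next()) {
                int a = instances.getInt(1);
                // keep min(#true, #false) rows of each label, chosen at random
                int keep = Math.min(count(countStmt, a, true), count(countStmt, a, false));
                for (boolean flag : new boolean[]{true, false}) {
                    sample.setInt(1, a);
                    sample.setBoolean(2, flag);
                    sample.setInt(3, keep);
                    try (ResultSet rs = sample.executeQuery()) {
                        while (rs.next())
                            System.out.println(a + " vs " + rs.getInt(1) + ", " + rs.getBoolean(2));
                    }
                }
            }
        }
    }

    private static int count(PreparedStatement ps, int a, boolean flag) throws SQLException {
        ps.setInt(1, a);
        ps.setBoolean(2, flag);
        try (ResultSet rs = ps.executeQuery()) {
            rs.next();
            return rs.getInt(1);
        }
    }
}

The database does the sorting, counting and random sampling, so the 1 GB of intermediate data never has to fit in memory at once.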
A couple of ideas:

1) If the file is 1 GB, you may be able to load it into a data structure, but you've probably already tried this.
2) If the data is sorted or grouped by instance, you can read each line until you hit a new instance and rebalance (see the sketch after this list).
3) If the data isn't sorted, you could sort the file in place with a random-access I/O class and then do 2).
4) If that's not possible, you could always make several passes over the file, one per instance; this will obviously be slow.
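A minimal sketch of idea 2, assuming the input is already grouped by instance and uses the "0 vs 1, true" line format from the question (the file names are made up):

import java.io.*;
import java.util.*;

public class GroupedRebalance {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("unbalanced.txt"));
             PrintWriter out = new PrintWriter(new FileWriter("balanced.txt"))) {
            List<String> trues = new ArrayList<>(), falses = new ArrayList<>();
            String current = null;
            for (String line; (line = in.readLine()) != null; ) {
                String instance = line.substring(0, line.indexOf(" vs "));
                if (current != null && !instance.equals(current))
                    flush(trues, falses, out);      // new instance: balance the previous one
                current = instance;
                (line.endsWith("true") ? trues : falses).add(line);
            }
            flush(trues, falses, out);              // last instance
        }
    }

    // Keep as many rows of the majority label as there are of the minority label.
    static void flush(List<String> trues, List<String> falses, PrintWriter out) {
        int keep = Math.min(trues.size(), falses.size());
        Collections.shuffle(trues);
        Collections.shuffle(falses);
        for (String s : trues.subList(0, keep)) out.println(s);
        for (String s : falses.subList(0, keep)) out.println(s);
        trues.clear();
        falses.clear();
    }
}

The same flush logic would work for ideas 3 and 4; only the way the rows of one instance are gathered changes.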
It sounds like you only need to load one instance's data at a time, and you only need to record a number and a boolean for each value in the instance.
I suggest you read the data until the instance number changes (or the end of the file is reached). That should be far less than 1 GB and should fit in memory.
If you use TIntArrayList (or an int[]) and BitSet, this will store the data more efficiently. You can clear them after processing each instance.
EDIT: If the data is randomly arranged, you may need to read the file once to count the number of true/false rows for each instance, and then read the file again to produce the result.
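A rough sketch of that two-pass variant, assuming the "0 vs 1, true" line format and made-up file names (for simplicity it keeps the first N rows of each label rather than a random N):

import java.io.*;
import java.util.*;

public class TwoPassBalance {
    public static void main(String[] args) throws IOException {
        // Pass 1: count true/false rows per instance.
        Map<Integer, int[]> counts = new HashMap<>();   // instance -> {trueCount, falseCount}
        try (BufferedReader in = new BufferedReader(new FileReader("unbalanced.txt"))) {
            for (String line; (line = in.readLine()) != null; ) {
                int instance = Integer.parseInt(line.substring(0, line.indexOf(" vs ")));
                int[] c = counts.computeIfAbsent(instance, k -> new int[2]);
                c[line.endsWith("true") ? 0 : 1]++;
            }
        }

        // Per instance, allow at most min(trueCount, falseCount) rows of each label.
        Map<Integer, int[]> quotas = new HashMap<>();
        for (Map.Entry<Integer, int[]> e : counts.entrySet()) {
            int keep = Math.min(e.getValue()[0], e.getValue()[1]);
            quotas.put(e.getKey(), new int[]{keep, keep});
        }

        // Pass 2: copy rows while quotas remain.
        try (BufferedReader in = new BufferedReader(new FileReader("unbalanced.txt"));
             PrintWriter out = new PrintWriter(new FileWriter("balanced.txt"))) {
            for (String line; (line = in.readLine()) != null; ) {
                int instance = Integer.parseInt(line.substring(0, line.indexOf(" vs ")));
                int label = line.endsWith("true") ? 0 : 1;
                int[] q = quotas.get(instance);
                if (q[label] > 0) {
                    q[label]--;
                    out.println(line);
                }
            }
        }
    }
}

Only the per-instance counters need to be held in memory, not the rows themselves.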
Another option is to try to load the whole file into memory in a different way. You should be able to load 1 GB of data in this format and have it use much less than 1 GB of memory.
You need to look at how you can minimise the overhead you get for each row of data; doing so can reduce memory consumption significantly.
class Row { // uses a total of 80 bytes in a 32-bit JVM
    // 16 byte header
    Integer x; // 4 + 24 bytes.
    Integer y; // 4 + 24 bytes.
    Boolean b; // 1 byte
    // 7 bytes of padding.
}

class Row { // uses a total of 32 bytes in a 32-bit JVM
    // 16 byte header
    int x; // 4 bytes.
    int y; // 4 bytes.
    boolean b; // 1 byte
    // 7 bytes of padding.
}
class Rows { // uses a total of 8-9 bytes/row
    // 16 byte header
    int[] x; // 4 bytes/row, TIntArrayList is easier to use.
    int[] y; // 4 bytes/row
    BitSet b; // 1 bit/row
}
// if your numbers are between -32,768 and 32,767
class Rows { // uses a total of 4-5 bytes/row
    // 16 byte header
    short[] x; // 2 bytes/row, TShortArrayList is easier to use.
    short[] y; // 2 bytes/row
    BitSet b; // 1 bit/row
}