Issues with parsing a huge file
I'm parsing a document and writing pairs such as these to disk:
0 vs 1, true
0 vs 2, false
0 vs 3, true
1 vs 2, true
1 vs 3, false
..
and so on.
Afterwards, I balance the true and false rows for each instance by removing random lines (lines with a true value if they are in excess, and vice versa), and I end up with a file such as this one:
0 vs 1 true
0 vs 2 false
1 vs 2 true
1 vs 3 true
1 vs 4 false
1 vs 5 false
The falses usually far outnumber the trues, so in the previous example I could keep only 1 false for instance 0 and only 2 falses for instance 1.
I'm doing this process in 2 steps: first parsing, then balancing.
Now, my issue is that the unbalanced file is too big: more than 1 GB, and most of its rows will be removed by the balancing step.
My question is: can I balance the rows while parsing?
My guess is no, because I don't know which items are arriving, and I can't delete any row until all rows for a specific instance have been discovered.
I hope this is clear. Thanks.
What would happen if you used a lightweight database for this (Derby, H2, etc.)? I imagine you could write sorting and filtering queries to arrive at what you want...
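For example, here is a minimal sketch of that idea with embedded H2 over JDBC, assuming the pair file uses the "0 vs 1, true" line format from the question (the file, table, and column names are made up). It loads all rows, then for each instance emits an equally sized random sample of trues and falses:

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.*;

public class BalanceWithH2 {
    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection("jdbc:h2:./pairs");
             BufferedReader in = new BufferedReader(new FileReader("unbalanced.txt"))) {
            c.createStatement().execute("CREATE TABLE pairs(a INT, b INT, flag BOOLEAN)");
            PreparedStatement ins = c.prepareStatement("INSERT INTO pairs VALUES(?, ?, ?)");
            for (String line; (line = in.readLine()) != null; ) {
                String[] t = line.split(" vs |, ");   // assumed line format: "0 vs 1, true"
                ins.setInt(1, Integer.parseInt(t[0].trim()));
                ins.setInt(2, Integer.parseInt(t[1].trim()));
                ins.setBoolean(3, Boolean.parseBoolean(t[2].trim()));
                ins.addBatch();
            }
            ins.executeBatch();

            PreparedStatement countStmt = c.prepareStatement(
                    "SELECT COUNT(*) FROM pairs WHERE a = ? AND flag = ?");
            PreparedStatement sample = c.prepareStatement(
                    "SELECT b, flag FROM pairs WHERE a = ? AND flag = ? ORDER BY RAND() LIMIT ?");
            ResultSet instances = c.createStatement().executeQuery("SELECT DISTINCT a FROM pairs");
            while (instances.next()) {
                int a = instances.getInt(1);
                // keep min(#true, #false) rows of each label, chosen at random
                int keep = Math.min(count(countStmt, a, true), count(countStmt, a, false));
                for (boolean flag : new boolean[]{true, false}) {
                    sample.setInt(1, a);
                    sample.setBoolean(2, flag);
                    sample.setInt(3, keep);
                    try (ResultSet rs = sample.executeQuery()) {
                        while (rs.next())
                            System.out.println(a + " vs " + rs.getInt(1) + ", " + rs.getBoolean(2));
                    }
                }
            }
        }
    }

    private static int count(PreparedStatement ps, int a, boolean flag) throws SQLException {
        ps.setInt(1, a);
        ps.setBoolean(2, flag);
        try (ResultSet rs = ps.executeQuery()) {
            rs.next();
            return rs.getInt(1);
        }
    }
}

The database does the sorting, counting and random sampling, so the 1 GB of intermediate data never has to fit in memory at once.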
A couple of ideas:

1) If the file is 1 GB, you may be able to load it into a data structure, but you've probably already tried this.
2) If the data is sorted or grouped by instance, you can read each line until you hit a new instance and rebalance (see the sketch after this list).
3) If the data isn't sorted, you could sort the file in place with a random-access I/O class and then do 2).
4) If that's not possible, you could always make several passes over the file, one per instance; this will obviously be slow.
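A minimal sketch of idea 2, assuming the input is already grouped by instance and uses the "0 vs 1, true" line format from the question (the file names are made up):

import java.io.*;
import java.util.*;

public class GroupedRebalance {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("unbalanced.txt"));
             PrintWriter out = new PrintWriter(new FileWriter("balanced.txt"))) {
            List<String> trues = new ArrayList<>(), falses = new ArrayList<>();
            String current = null;
            for (String line; (line = in.readLine()) != null; ) {
                String instance = line.substring(0, line.indexOf(" vs "));
                if (current != null && !instance.equals(current))
                    flush(trues, falses, out);      // new instance: balance the previous one
                current = instance;
                (line.endsWith("true") ? trues : falses).add(line);
            }
            flush(trues, falses, out);              // last instance
        }
    }

    // Keep as many rows of the majority label as there are of the minority label.
    static void flush(List<String> trues, List<String> falses, PrintWriter out) {
        int keep = Math.min(trues.size(), falses.size());
        Collections.shuffle(trues);
        Collections.shuffle(falses);
        for (String s : trues.subList(0, keep)) out.println(s);
        for (String s : falses.subList(0, keep)) out.println(s);
        trues.clear();
        falses.clear();
    }
}

The same flush logic would work for ideas 3 and 4; only the way the rows of one instance are gathered changes.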
It sounds like you only need to load one instance's data at a time, and you only need to record a number and a boolean for each value in the instance.
I suggest you read the data until the instance number changes (or the end of the file is reached). That should be far less than 1 GB and should fit in memory.
If you use TIntArrayList (or an int[]) and BitSet, this will store the data more efficiently. You can clear them after processing each instance.
EDIT: If the data is randomly arranged, you may need to read the file once to count the number of true/false rows for each instance, and then read the file again to produce the result.
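A rough sketch of that two-pass variant, assuming the "0 vs 1, true" line format and made-up file names (for simplicity it keeps the first N rows of each label rather than a random N):

import java.io.*;
import java.util.*;

public class TwoPassBalance {
    public static void main(String[] args) throws IOException {
        // Pass 1: count true/false rows per instance.
        Map<Integer, int[]> counts = new HashMap<>();   // instance -> {trueCount, falseCount}
        try (BufferedReader in = new BufferedReader(new FileReader("unbalanced.txt"))) {
            for (String line; (line = in.readLine()) != null; ) {
                int instance = Integer.parseInt(line.substring(0, line.indexOf(" vs ")));
                int[] c = counts.computeIfAbsent(instance, k -> new int[2]);
                c[line.endsWith("true") ? 0 : 1]++;
            }
        }

        // Per instance, allow at most min(trueCount, falseCount) rows of each label.
        Map<Integer, int[]> quotas = new HashMap<>();
        for (Map.Entry<Integer, int[]> e : counts.entrySet()) {
            int keep = Math.min(e.getValue()[0], e.getValue()[1]);
            quotas.put(e.getKey(), new int[]{keep, keep});
        }

        // Pass 2: copy rows while quotas remain.
        try (BufferedReader in = new BufferedReader(new FileReader("unbalanced.txt"));
             PrintWriter out = new PrintWriter(new FileWriter("balanced.txt"))) {
            for (String line; (line = in.readLine()) != null; ) {
                int instance = Integer.parseInt(line.substring(0, line.indexOf(" vs ")));
                int label = line.endsWith("true") ? 0 : 1;
                int[] q = quotas.get(instance);
                if (q[label] > 0) {
                    q[label]--;
                    out.println(line);
                }
            }
        }
    }
}

Only the per-instance counters need to be held in memory, not the rows themselves.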
Another option is to try to load the whole file into memory in a different way. You should be able to load 1 GB of data in this format and have it use much less than 1 GB of memory.
You need to look at how you can minimise the overhead you get for each row of data; doing so can reduce memory consumption significantly.
class Row { // uses a total of 80 bytes in a 32-bit JVM
    // 16 byte header
    Integer x; // 4 + 24 bytes.
    Integer y; // 4 + 24 bytes.
    Boolean b; // 1 byte
    // 7 bytes of padding.
}

class Row { // uses a total of 32 bytes in a 32-bit JVM
    // 16 byte header
    int x; // 4 bytes.
    int y; // 4 bytes.
    boolean b; // 1 byte
    // 7 bytes of padding.
}
class Rows { // uses a total of 8-9 bytes/row
    // 16 byte header
    int[] x; // 4 bytes/row, TIntArrayList is easier to use.
    int[] y; // 4 bytes/row
    BitSet b; // 1 bit/row
}
// if your numbers are between -32,768 and 32,767
class Rows { // uses a total of 4-5 bytes/row
    // 16 byte header
    short[] x; // 2 bytes/row, TShortArrayList is easier to use.
    short[] y; // 2 bytes/row
    BitSet b; // 1 bit/row
}