Need advice in Efficiency: Scanning 2 very large files worth of information

2022-12-19 11:30 Q&A author:

I have a relatively strange question.

I have a file that is 6 gigabytes in size. I need to scan the entire file, line by line, and find all rows whose ID number matches that of any other row in the file. Essentially, it's like analyzing a web log file in which the many session IDs are ordered by the time of each click rather than by userID.

I tried the simple (dumb) approach, which was to create 2 file readers: one that scans the file line by line to get the userID, and a second that (1) verifies that the userID has not already been processed and (2) if it hasn't, reads every line in the file that begins with that userID and stores some value X related to those rows.

Any advice or tips on how I can make this process work more efficiently?

1. Import the file into an SQL database
2. Use SQL
3. Performance!

Seriously, that's it. Databases are optimized exactly for this kind of thing. Alternatively, if you have a machine with enough RAM, just put all the data into a HashMap for easy lookup.
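The HashMap idea can be sketched roughly as follows. This is only an illustration: it assumes each line starts with the ID followed by a space, and that the value you keep per ID is small (here just a row count). The class and method names are made up for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

class IdCounter {
    // Assumption: the ID is everything before the first space on the line.
    static String extractId(String line) {
        int space = line.indexOf(' ');
        return space < 0 ? line : line.substring(0, space);
    }

    // One pass over the input, keeping a single counter per distinct ID.
    static Map<String, Integer> countById(BufferedReader reader) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            counts.merge(extractId(line), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        String sample = "u1 /home\nu2 /home\nu1 /cart\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(sample))) {
            System.out.println(countById(reader));
        }
    }
}
```

The file is read exactly once, so memory use depends on the number of distinct IDs rather than on the 6 GB file size.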

Easiest: create a data model, import the file into a database, and take advantage of the power of JDBC and SQL. If necessary (when the file format is pretty specific), you can write some Java that imports the file line by line with the help of BufferedReader#readLine() and PreparedStatement#addBatch().

Hardest: write your Java code so that it doesn't unnecessarily keep large amounts of data in memory. You're then basically reinventing what the average database already does.
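A line-by-line import along those lines might look like the sketch below. The table `clicks`, its columns `user_id` and `line`, the batch size of 1000, and the "ID before the first space" format are all assumptions made for illustration; the `Connection` is expected to come from whatever JDBC driver you use.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class LogImporter {
    static final int BATCH_SIZE = 1000; // hypothetical; tune for your database

    // Reads lines and inserts them in batches; returns the number of rows queued.
    static int importLines(BufferedReader reader, Connection conn)
            throws IOException, SQLException {
        // Hypothetical table and column names.
        String sql = "INSERT INTO clicks (user_id, line) VALUES (?, ?)";
        int rows = 0;
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;
                }
                int space = line.indexOf(' ');
                String id = space < 0 ? line : line.substring(0, space);
                ps.setString(1, id);
                ps.setString(2, line);
                ps.addBatch();
                if (++rows % BATCH_SIZE == 0) {
                    ps.executeBatch(); // flush a full batch
                }
            }
            ps.executeBatch(); // flush whatever is left
        }
        return rows;
    }
}
```

Once the rows are loaded, an index on the ID column plus a `GROUP BY` or self-join does the matching.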

For each row R in the file {

Let N be the number that you need to extract from R.
Check if there is a file called N. If not, create it.
Append R to the file called N

}
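A rough Java version of this pseudocode is sketched below. It assumes the ID is the text before the first space and is safe to use as a file name; the class name and the output directory layout are invented for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class SplitById {
    // Assumption: the ID is everything before the first space on the line.
    static String extractId(String line) {
        int space = line.indexOf(' ');
        return space < 0 ? line : line.substring(0, space);
    }

    // Appends every row of the input to a file named after the row's ID.
    static void split(Path input, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;
                }
                Path target = outDir.resolve(extractId(line));
                byte[] row = (line + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
                // CREATE makes the file on first use, APPEND adds to it afterwards.
                Files.write(target, row, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        split(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Opening a file for every row is slow, and a very large number of distinct IDs means a very large number of files, so in practice you would probably keep some writers open or group several IDs into one file.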

How much data are you storing about each line, compared with the size of the line? Do you have enough memory to maintain the state for each distinct ID (e.g. number of log lines seen, number of exceptions or whatever)? That's what I'd do if possible.

Otherwise, you'll either need to break the log file into separate chunks (e.g. split it based on the first character of the ID) and then parse each file separately, or perhaps have some way of pretending you have enough memory to maintain the state for each distinct ID: have an in-memory cache which dumps values to disk (or reads them back) only when it has to.

You don't mention whether or not this is a regular, ongoing thing or an occasional check.

Have you considered pre-processing the data? Not practical for dynamic data, but if you can sort it based on the field you're interested in, it makes solving the problem much easier. Extracting only the fields you care about may reduce the data volume to a more manageable size as well.
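To illustrate why sorting helps: once the file is sorted by ID, all rows for one ID are adjacent, so a single pass with constant memory is enough. The sketch below assumes the input is already sorted and that the ID is the text before the first space; the per-ID state is just a row count, and the callback stands in for whatever processing you need.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.function.BiConsumer;

class SortedGroups {
    // Assumption: the ID is everything before the first space on the line.
    static String extractId(String line) {
        int space = line.indexOf(' ');
        return space < 0 ? line : line.substring(0, space);
    }

    // Calls onGroup(id, rowCount) once per run of consecutive rows with the same ID.
    // The input must already be sorted by ID.
    static void summarize(BufferedReader reader, BiConsumer<String, Integer> onGroup)
            throws IOException {
        String currentId = null;
        int count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue;
            }
            String id = extractId(line);
            if (!id.equals(currentId)) {
                if (currentId != null) {
                    onGroup.accept(currentId, count); // previous group is complete
                }
                currentId = id;
                count = 0;
            }
            count++;
        }
        if (currentId != null) {
            onGroup.accept(currentId, count); // last group
        }
    }

    public static void main(String[] args) throws IOException {
        String sorted = "u1 /home\nu1 /cart\nu2 /home\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(sorted))) {
            summarize(reader, (id, n) -> System.out.println(id + " -> " + n + " rows"));
        }
    }
}
```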

A lot of the other advice here is good, but it assumes that you'll be able to load what you need into memory without running out. If you can do that, it would be better than the 'worst case' solution I'm describing.

If you have large files, you may end up needing to sort them first. In the past I've dealt with multiple large files that I needed to match up based on a key (sometimes matches were in all files, sometimes only in a couple, etc.). If this is the case, the first thing you need to do is sort your files. Hopefully you're on a box where you can easily do this (for example, there are many good Unix scripts for this). After you've sorted each file, read each file until you get matching IDs, then process.

I'd suggest:
1. Open both files and read the first record
2. See if you have matching IDs and process accordingly
3. Read the next record from the file(s) whose key was just processed, and repeat step 2 until EOF.

For example if you had a key of 1,2,5,8 in FILE1 and 2,3,5,9 in FILE2 you'd:
1. Open and read both files (FILE1 has ID 1, FILE2 has ID 2).
2. Process 1.
3. Read FILE1 (FILE1 has ID 2)
4. Process 2.
5. Read FILE1 (ID 5) and FILE2 (ID 3)
6. Process 3.
7. Read FILE2 (ID 5)
8. Process 5.
9. Read FILE1 (ID 8) and FILE2 (ID 9).
10. Process 8.
11. Read FILE1 (EOF....no more FILE1 processing).
12. Process 9.
13. Read FILE2 (EOF....no more FILE2 processing).

Make sense?
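The walk-through above can be sketched in Java as a merge over two sorted key sequences. This is only an illustration: it uses in-memory iterators of integers in place of real file readers, and it assumes each key appears at most once per file, as in the example. The labels in the output are invented for the demo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class SortedMerge {
    // Returns the next key, or null at EOF.
    static Integer next(Iterator<Integer> it) {
        return it.hasNext() ? it.next() : null;
    }

    // Walks two sorted key sequences in step and records where each key was found.
    static List<String> merge(Iterator<Integer> file1, Iterator<Integer> file2) {
        List<String> processed = new ArrayList<>();
        Integer a = next(file1);
        Integer b = next(file2);
        while (a != null || b != null) {
            // An exhausted file compares as "larger" so the other one drains.
            int cmp = (a == null) ? 1 : (b == null) ? -1 : a.compareTo(b);
            if (cmp < 0) {
                processed.add(a + " (FILE1 only)");
                a = next(file1);
            } else if (cmp > 0) {
                processed.add(b + " (FILE2 only)");
                b = next(file2);
            } else {
                processed.add(a + " (both)");
                a = next(file1);
                b = next(file2);
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> result = merge(
            Arrays.asList(1, 2, 5, 8).iterator(),
            Arrays.asList(2, 3, 5, 9).iterator());
        // prints [1 (FILE1 only), 2 (both), 3 (FILE2 only), 5 (both), 8 (FILE1 only), 9 (FILE2 only)]
        System.out.println(result);
    }
}
```

Only the current record of each file is held in memory, which is why this works for files far larger than RAM once they are sorted.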
