Dealing with large flat data files with a very big record length
I have a large data file that is created by a shell script. The next script processes it by sorting and reading it several times; that takes more than 14 hours, which is not viable. I want to replace this long-running script with a program, probably in Java, C, or COBOL, that can run on Windows or on Sun Solaris. Each pass has to read a group of records, sort and process them, write them to the sorted output file, and at the same time insert them into DB2/SQL tables.
If you are inserting the records into a database anyway, it might be much simpler not to do the sorting yourself: just read the data back from the database in sorted order once you have inserted it all.
Something that might speed up your sorting is to alter your data-producing script so that it places the data into different files based on all, or a prefix, of the key you will use to sort the entries.
Then, when you actually sort the entries, you can limit each sort to one of the smaller files, which will (pretty much) turn your sort time from O(f(N)) to O(f(n0) + f(n1) + ...), which for any f() more complex than f(x) = x should be smaller (faster).
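A minimal sketch of the partitioning idea, assuming (purely for illustration, since the real record layout is unknown) that the sort key is the first character of each record; a real version would write each bucket straight to its own file with a `BufferedWriter` per bucket instead of holding everything in memory:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class Partitioner {
    // Split the records into buckets keyed by the first character of the
    // sort key, so each bucket can later be sorted independently.
    public static Map<Character, List<String>> partition(List<String> records) {
        Map<Character, List<String>> buckets = new TreeMap<>();
        for (String rec : records) {
            char prefix = rec.isEmpty() ? ' ' : rec.charAt(0);
            buckets.computeIfAbsent(prefix, k -> new ArrayList<>()).add(rec);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> recs = Arrays.asList("bravo", "alpha", "beta", "apple");
        Map<Character, List<String>> b = partition(recs);
        System.out.println(b.get('a')); // records whose key starts with 'a'
        System.out.println(b.get('b')); // records whose key starts with 'b'
    }
}
```

With a longer key prefix you get more, smaller buckets; the same `computeIfAbsent` pattern applies unchanged.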
This also opens up the possibility of sorting your files concurrently: the disk IO wait time of one sorting thread is a great time for another thread to actually sort the records it has already loaded.
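One way to sketch that concurrency in Java, using an `ExecutorService`; the in-memory lists here stand in for the per-bucket files, and the thread count is just a placeholder:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSort {
    // Sort several independent record buckets in parallel; while one task
    // would be waiting on disk IO, another can use the CPU to sort.
    public static List<List<String>> sortAll(List<List<String>> buckets, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> bucket : buckets) {
                futures.add(pool.submit(() -> {
                    List<String> copy = new ArrayList<>(bucket);
                    Collections.sort(copy); // stand-in for sorting one bucket file
                    return copy;
                }));
            }
            List<List<String>> sorted = new ArrayList<>();
            for (Future<List<String>> f : futures) sorted.add(f.get());
            return sorted;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because each bucket covers a disjoint key range, concatenating the sorted buckets in key order yields the fully sorted output with no merge step.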
You will need to find a happy balance between too many files and files that are too big; 256 files is a good starting point.
Another thing you might want to investigate is your sorting algorithm. Merge sort is good for secondary storage sorting. Replacement selection sort is also a good algorithm to use for secondary storage sorting.
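The merge phase of an external merge sort can be sketched with a priority queue that holds only one "head" record per sorted run; here in-memory lists stand in for the sorted temporary files a real implementation would stream from:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {
    // Merge k already-sorted runs into one sorted output while keeping only
    // one record per run in memory -- the shape of the merge phase of an
    // external merge sort.
    public static List<String> merge(List<List<String>> runs) {
        // Each heap entry is {runIndex, positionInRun}, ordered by the
        // record currently at that position.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparing((int[] e) -> runs.get(e[0]).get(e[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) heap.add(new int[]{i, 0});
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            List<String> run = runs.get(head[0]);
            out.add(run.get(head[1]));
            // Advance to the next record of the run we just consumed.
            if (head[1] + 1 < run.size()) heap.add(new int[]{head[0], head[1] + 1});
        }
        return out;
    }
}
```

Replacement selection fits the other end of the same pipeline: it tends to produce initial runs about twice the size of available memory, so fewer runs reach this merge step.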
http://www.cs.auckland.ac.nz/software/AlgAnim/niemann/s_ext.htm
Doing your file IO in large chunks (chunks sized and aligned to the file system block size are best) will also help in most cases.
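In Java this mostly comes down to reading and writing through a large buffer instead of record by record; the 1 MiB size below is just an illustrative choice, to be tuned against the actual file system block size:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

public class ChunkedCopy {
    // Copy a stream in large buffered chunks rather than one record at a time.
    public static long copy(InputStream in, OutputStream out) {
        byte[] buf = new byte[1 << 20]; // 1 MiB chunk; tune to the FS block size
        long total = 0;
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

The same effect can often be had by simply wrapping file streams in `BufferedInputStream`/`BufferedOutputStream` with a large buffer size.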
If you do need to use a relational database anyway, you might as well just put everything in there to start with; RDBMSes typically have very good algorithms for handling all of this tricky stuff.