Sorting a text file with over 100,000,000 records
I have a 5 GB text file that needs to be sorted in alphabetical order. What is the best algorithm to use?
Constraints:
Speed - as fast as possible
Memory - a PC with 1 GB of RAM running Windows XP
I routinely sort text files >2 GB with the Linux sort command. It usually takes 15-30 seconds, depending on server load.
Just do it, it won't take as long as you think.
Update: Since you're using Windows XP, you can get the sort command in UnxUtils. I use that one probably more than the Linux version, and it's just as fast.
The bottleneck for huge files is really disk speed. My server above has a fast SATA RAID; if your machine is a desktop (or laptop), its 7200 RPM (or 5400 RPM) IDE drive will add a few minutes to the job.
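A minimal sketch of the approach described above; the filenames are my own illustrations, not from the original post:

```shell
# Sort a large text file alphabetically in one command.
# -o writes the result to a file (and lets sort safely reuse the
# input name as the output name if you want an in-place sort).
sort bigfile.txt -o sorted.txt
```

On Windows XP with UnxUtils, the same invocation works with its `sort.exe`, provided it shadows the built-in `sort` on your PATH.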
For text files, sort, at least the GNU Coreutils version on Linux and elsewhere, works surprisingly fast. Take a look at the --buffer-size and related options, and set --temporary-directory if your /tmp directory is too small.
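For example, a sketch combining both options mentioned above (the 700M figure and the paths are assumptions chosen for a machine with 1 GB of RAM, not values from the original answer):

```shell
# Give sort a large in-memory buffer (700 MB here, leaving headroom
# on a 1 GB machine) and point its spill/temp files at a drive with
# plenty of free space instead of a small /tmp.
sort --buffer-size=700M --temporary-directory=/mnt/bigdisk/tmp \
     bigfile.txt -o sorted.txt
```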
Alternatively, if you're really worried about how long it might take, you can split the file into smaller chunks, sort them individually, then merge them together (with sort --merge). Sorting each chunk can even be done on different systems in parallel.
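The split/sort/merge workflow above can be sketched like this (chunk size and filenames are illustrative):

```shell
# 1. Split the big file into ~1,000,000-line chunks named chunk_aa, chunk_ab, ...
split -l 1000000 bigfile.txt chunk_

# 2. Sort each chunk in place; these could run on separate machines.
for f in chunk_*; do sort "$f" -o "$f"; done

# 3. Merge the already-sorted chunks in one streaming pass.
sort --merge chunk_* -o sorted.txt
rm chunk_*
```

Because --merge only interleaves already-sorted inputs, the final step needs very little memory regardless of total file size.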
I would say take a smaller subset of the data and try a few approaches to see which works best, then go with that. This article might help you get started.
What are the parameters of the sort? Do you have time constraints or space constraints? How close to ordered is the file already? Do you have to do it in one pass?
Merge Sort is your best bet.
How about importing the data into SQL Server using the BULK INSERT command?
This gets the data into the SQL Server quite quickly and then allows you to perform all manner of efficient SQL Sorting based on the data imported.
You can also set this up as an automated task using SQL Server SSIS.
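A rough sketch of the load-then-sort idea via the sqlcmd CLI; the server, database, table, and file path are all hypothetical placeholders:

```shell
# Load the raw lines into a staging table, then let SQL Server do the
# sorting with ORDER BY. Assumes dbo.Records(Line varchar) already exists.
sqlcmd -S localhost -d StagingDb -Q "
  BULK INSERT dbo.Records FROM 'C:\data\bigfile.txt'
  WITH (ROWTERMINATOR = '\n');
  SELECT Line FROM dbo.Records ORDER BY Line;"
```

Note that loading 5 GB into a database adds substantial import time and disk usage compared to an external sort, so this mainly pays off if you need to query the data afterwards.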