
Any quick sorting for a huge CSV file?

I am looking for a Java implementation of a sorting algorithm. The file could be HUGE, say 20000*600 = 12,000,000 lines of records. Each line is comma delimited with 37 fields, and we use 5 fields as keys. Is it possible to sort it quickly, say within 30 minutes?

If you have an approach other than Java, it is welcome as long as it can be easily integrated into a Java system, for example a Unix utility.

Thanks.

Edit: The lines that need to be sorted are dispersed across 600 files, with 20,000 lines each (about 4 MB per file). Finally, I would like them combined into one big sorted file.

I am trying to time a Unix sort and will update this afterwards.

Edit:

I appended all the files into one big file and tried the Unix sort function; it is pretty good. The time to sort the 2 GB file is 12-13 minutes. The append step took 4 minutes for the 600 files.

sort -t ',' -k 1,1 -k 4,7 -k 23,23 -k 2,2r big.txt -o sorted.txt


How does the data get into CSV format? Does it come from a relational database? You could make it such that whatever process creates the file writes its entries in the right order, so you don't have to solve this problem down the line.

If you are doing a simple lexicographic sort you can try the Unix sort, but I am not sure how it will perform on a file of that size.


Calling the Unix sort program should be efficient. It does multiple passes to ensure it is not a memory hog. You can fork a process with Java's Runtime, but the output of the process is redirected, so you have to do some juggling to get the redirect to work right:

public static void sortInUnix(File fileIn, File sortedFile)
        throws IOException, InterruptedException {
    String[] cmd = {
           "cmd", "/c",
           // above should be changed to "sh", "-c" if on a Unix system
           "sort " + fileIn.getAbsolutePath() + " > "
               + sortedFile.getAbsolutePath() };

    Process sortProcess = Runtime.getRuntime().exec(cmd);

    // capture error messages (if any) and echo them to stderr
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
            sortProcess.getErrorStream()))) {
        String outputS = reader.readLine();
        while (outputS != null) {
            System.err.println(outputS);
            outputS = reader.readLine();
        }
    }

    sortProcess.waitFor();
}
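
If you would rather avoid going through a shell just for the output redirect, ProcessBuilder (Java 7+) can send stdout straight to a file. A minimal sketch, assuming a sort program is on the PATH:

public static void sortWithProcessBuilder(File fileIn, File sortedFile)
        throws IOException, InterruptedException {
    // ProcessBuilder handles the "> sortedFile" part itself, so no
    // "cmd /c" or "sh -c" wrapper is needed.
    ProcessBuilder pb = new ProcessBuilder("sort", fileIn.getAbsolutePath());
    pb.redirectOutput(sortedFile);
    pb.redirectError(ProcessBuilder.Redirect.INHERIT); // show sort's errors on stderr
    int exitCode = pb.start().waitFor();
    if (exitCode != 0) {
        throw new IOException("sort exited with code " + exitCode);
    }
}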


Use the Java library big-sorter, which is published to Maven Central and has an optional dependency on commons-csv for CSV processing. It handles files of any size by splitting into intermediate files, then sorting and merging those intermediate files repeatedly until only one is left. Note also that the maximum group size for a merge is configurable (useful when there are a large number of input files).

Here's an example:

Given the CSV file below, we will sort on the second column (the "number" column):

name,number,cost
WIPER BLADE,35,12.55
ALLEN KEY 5MM,27,3.80

Serializer<CSVRecord> serializer = Serializer.csv(
  CSVFormat.DEFAULT
     .withFirstRecordAsHeader()
     .withRecordSeparator("\n"),
  StandardCharsets.UTF_8);
Comparator<CSVRecord> comparator = (x, y) -> {
    int a = Integer.parseInt(x.get("number"));
    int b = Integer.parseInt(y.get("number"));
    return Integer.compare(a, b);
};
Sorter 
  .serializer(serializer) 
  .comparator(comparator) 
  .input(inputFile) 
  .output(outputFile)
  .sort();

The result is:

name,number,cost
ALLEN KEY 5MM,27,3.80
WIPER BLADE,35,12.55

I created a CSV file with 12 million rows and 37 columns and filled the grid with random integers between 0 and 100,000. I then sorted the resulting 2.7 GB file on the 11th column using big-sorter, and it took 8 minutes single-threaded on an i7 with an SSD and the max heap set to 512 MB (-Xmx512m).

See the project README for more details.
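
Since the question sorts on five fields rather than one, the comparator handed to Sorter can simply compare several columns in turn. A hedged sketch that roughly mirrors the keys of the sort command in the question (-k 1,1 -k 4,7 -k 23,23 -k 2,2r), comparing the fields as plain strings and using 0-based column indices:

Comparator<CSVRecord> multiKey = (x, y) -> {
    int c = x.get(0).compareTo(y.get(0));            // field 1
    for (int i = 3; c == 0 && i <= 6; i++) {         // fields 4-7
        c = x.get(i).compareTo(y.get(i));
    }
    if (c == 0) c = x.get(22).compareTo(y.get(22));  // field 23
    if (c == 0) c = y.get(1).compareTo(x.get(1));    // field 2, descending
    return c;
};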


Java Lists can be sorted; you can try starting there.
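
That only works if all 12 million lines fit in the heap at once, so treat the following as a minimal in-memory sketch (the key extraction is illustrative and compares only the first column):

// Read everything, sort by the first comma-separated field, write it back.
// Needs a heap several times the size of the file; otherwise use an
// external sort as described in the other answers.
List<String> lines = Files.readAllLines(Paths.get("big.txt"), StandardCharsets.UTF_8);
lines.sort(Comparator.comparing((String line) -> line.split(",", -1)[0]));
Files.write(Paths.get("sorted.txt"), lines, StandardCharsets.UTF_8);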


Python on a big server.

import csv
def sort_key( aRow ):
    return aRow['this'], aRow['that'], aRow['the other']
with open('some_file.csv','rb') as source:
   rdr= csv.DictReader( source )
   data = [ row for row in rdr ]
   data.sort( key=sort_key )
   fields= rdr.fieldnames
with open('some_file_sorted.csv', 'wb') as target:
   wtr= csv.DictWriter( target, fields )
   wtr.writeheader()
   wtr.writerows( data )

This should be reasonably quick. And it's very flexible.

On a small machine, break this into three passes: decorate, sort, undecorate.

Decorate:

import csv
def sort_key( aRow ):
    return aRow['this'], aRow['that'], aRow['the other']
with open('some_file.csv','rb') as source:
   rdr= csv.DictReader( source )
   with open('temp.txt','w') as target:
       for row in rdr:
           target.write( "|".join( map(str,sort_key(row)) ) + "|" + ",".join( row[f] for f in rdr.fieldnames ) + "\n" )

Part 2 is the operating system sort, using "|" as the field separator.

Undecorate:

with open('sorted_temp.txt','r') as source:
   with open('sorted.csv','w') as target:
       for row in source:
           keys, _, data = row.rpartition('|')
           target.write( data )


You don't mention your platform, so it is hard to come to terms with the time specified. 12x10^6 records isn't that many, but sorting is a pretty intensive task. With 37 fields at, say, 100 bytes per field, that would be 12,000,000 x 37 x 100 bytes, roughly 44 GB, which is a bit much for most machines. But if the records average 10 bytes per field, that drops to about 4.4 GB, and your server should be able to fit the entire file in RAM, which would be ideal.

My suggestion: Break the file into chunks that are 1/2 the available RAM, sort each chunk, then merge-sort the resulting sorted chunks. This lets you do all of your sorting in memory rather than hitting swap, which is what I suspect is causing any slow-down.

Say (1G chunks, in a directory you can play around in):

split --line-bytes=1000000000 original_file chunk
for each in chunk* 
do
  sort $each > $each.sorted
done
sort -m chunk*.sorted > original_file.sorted


As you have mentioned, your data set is huge. Sorting it all in one go will be time consuming depending on your machine (if you try quicksort). But since you would like it done within 30 minutes, I would suggest that you have a look at MapReduce using Apache Hadoop as your application server.

Please keep in mind it's not an easy approach, but in the longer run you can easily scale up depending on your data size. I would also point you to an excellent guide on Hadoop setup.

Work your way through the single-node setup and then move to a Hadoop cluster. I would be glad to help if you get stuck anywhere.
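
For a flavour of what the map side could look like, here is a hedged sketch (not a complete job: you would still need a driver, a reducer, and something like a TotalOrderPartitioner if you want a single globally sorted output; the key column indices are hypothetical):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits a composite key built from the sort columns; Hadoop's shuffle then
// sorts the records by that key on the way to the reducers.
public class CsvSortMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",", -1);
        // hypothetical 0-based indices of the five key columns
        String compositeKey = fields[0] + "," + fields[3] + "," + fields[6] + ","
                + fields[22] + "," + fields[1];
        context.write(new Text(compositeKey), line);
    }
}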


You really do need to make sure you have the right tools for the job. (Today, I am hoping to get a 3.8 GHz PC with 24 GB of memory for home use. It's been a while since I bought myself a new toy. ;)

However, if you want to sort these lines and you don't have enough hardware, you don't need to break up the data because it's in 600 files already.

Sort each file individually, then do a 600-way merge sort (you only need to keep 600 lines in memory at once). It's not as simple as doing them all at once, but you could probably do it on a mobile phone. ;)
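
A hedged sketch of that 600-way merge in plain Java, assuming every input file is already sorted with the same comparator (the comparator itself is a placeholder):

// Keeps one buffered reader and one current line per input file in a priority
// queue, so only ~600 lines are held in memory at any time.
static void mergeSortedFiles(List<File> sortedFiles, File output,
                             Comparator<String> byKeys) throws IOException {
    class Source {
        final BufferedReader reader;
        String line;
        Source(BufferedReader r) throws IOException { reader = r; line = r.readLine(); }
    }
    PriorityQueue<Source> heap =
            new PriorityQueue<>((a, b) -> byKeys.compare(a.line, b.line));
    for (File f : sortedFiles) {
        Source s = new Source(new BufferedReader(new FileReader(f)));
        if (s.line != null) heap.add(s);
    }
    try (BufferedWriter out = new BufferedWriter(new FileWriter(output))) {
        while (!heap.isEmpty()) {
            Source s = heap.poll();
            out.write(s.line);
            out.newLine();
            s.line = s.reader.readLine();
            if (s.line != null) heap.add(s); else s.reader.close();
        }
    }
}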


Since you have 600 smaller files, it could be faster to sort all of them concurrently. This will eat up 100% of the CPU. That's the point, correct?

waitlist= 
for f in ${SOURCE}/*
do 
    sort -t ',' -k 1,1 -k 4,7 -k 23,23 -k 2,2r -o ${f}.srt ${f} &
    waitlist="$waitlist $!"
done
wait $waitlist
LIST=`echo $SOURCE/*.srt`
sort --merge -t ',' -k 1,1 -k 4,7 -k 23,23 -k 2,2r -o sorted.txt ${LIST}

This will sort 600 small files all at the same time and then merge the sorted files. It may be faster than trying to sort a single large file.


Use Hadoop MapReduce to do the sorting. I recommend Spring Data Hadoop, which is Java.


Well, since you're talking about HUGE datasets, you'll need some external sorting algorithm anyhow. There are implementations for Java and pretty much any other language out there; since the result will have to be stored on disk anyhow, which language you use is fairly unimportant.
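
In case it helps, here is a hedged sketch of the "write sorted runs" half of a plain-Java external sort; the runs can then be combined with a k-way merge, as several of the other answers suggest (the chunk size and comparator are assumptions you would tune to your heap):

// Reads the input in chunks that fit in memory, sorts each chunk with the
// key comparator, and writes each sorted chunk to its own temporary file.
static List<File> writeSortedRuns(File input, int linesPerChunk,
                                  Comparator<String> byKeys) throws IOException {
    List<File> runs = new ArrayList<>();
    List<String> chunk = new ArrayList<>(linesPerChunk);
    try (BufferedReader in = new BufferedReader(new FileReader(input))) {
        String line;
        while ((line = in.readLine()) != null) {
            chunk.add(line);
            if (chunk.size() == linesPerChunk) {
                runs.add(writeRun(chunk, byKeys));
                chunk.clear();
            }
        }
    }
    if (!chunk.isEmpty()) runs.add(writeRun(chunk, byKeys));
    return runs;
}

static File writeRun(List<String> chunk, Comparator<String> byKeys) throws IOException {
    chunk.sort(byKeys);
    File run = File.createTempFile("sorted-run-", ".csv");
    try (BufferedWriter out = new BufferedWriter(new FileWriter(run))) {
        for (String l : chunk) {
            out.write(l);
            out.newLine();
        }
    }
    return run;
}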
