开发者

Sort big quantity of strings with same lengths

I hav开发者_开发百科e a very big sequence of strings. Length of each string is 50. Each string includes only chars from english ABC. What is the best(the fastest) way to sort this sequence?


If I had to code that, I'd probably make one pass that split the input into many output files depending on the first couple of characters or so; the goal being to make each output file small enough to fit in main memory. Then I would open each file in order, sort it in memory, and append it to the output. First pass is O(n), second is more or less O(n log n), and you have to do disk I/O four times per record. It might be possible to do better with some arcane algorithm, but probably not by much, and this is easy to understand and code.

If the system limits how many files you can have open at once, you might have to split up the first pass. If the strings aren't well-distributed, some intermediate files might be too large.

In pseudocode:

open input file (r)
for i in ['aa', 'ab', 'ac', ..., 'zz']:
    open output file[i] (w)
for record in input file:
    write record to output file[record[0:2]]
close all files
open main output file (w)
for i in ['aa', 'ab', 'ac', ..., 'zz']:
    open input file[i] (r)
    slurp whole file into memory
    close input file
    sort data
    append whole sorted file to main output file

EDIT: Wait, do you mean the records only contain the characters A, B, and C? No other letters? In that case you would probably have to split on an initial substring longer than 2. Splitting on the first 3 characters would divide it into 27 files, each of size 370 MB on average.


Something like this?

List<string> list = new List<string>();
/* fill the list */
list.Sort();

The Sort() method has different overloads that allow you to customize the way the string comparison is performed.

EDIT Oh, by "big" you mean 500GB worth of strings then this probably isn't going to cut it.


The algorithm you are looking for is probably the Merge Sort

http://en.wikipedia.org/wiki/Merge_sort

and this

http://en.wikipedia.org/wiki/External_sorting

BUT in your specific case, read this:

Need a way to sort a 100 GB log file by date

It could work for you!


Since 500 MB is not a lot of data, you can simply load the entire file into memory, sort it, and write the result back to disk.

I assume that the file contents are laid out like this:

ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWX\r\n
ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWX\r\n
    :
    :
ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWX\r\n

Code:

// Load
var data = File.ReadAllBytes("file.txt");
var itemCount = data.Length / 52;
var indexes = new int[itemCount];
for (int i = 0; i < itemCount; i++)
{
    indexes[i] = i;
}

// Sort
Array.Sort<int>(indexes, (x, y) =>
{
    for (int i = 0; i < 50; i++)
    {
        if (data[x * 52 + i] > data[y * 52 + i]) return 1;
        if (data[x * 52 + i] < data[y * 52 + i]) return -1;
    }
    return 0;
});

// Store
using (var stream = new Stream("result.txt"))
{
    for (int i = 0; i < itemCount; i++)
    {
        stream.Write(data, indexes[i] * 52, 52);
    }
}


Quicksort (if used correctly) can be very effective in sorting strings.

The trick is to modify the partition method. The main idea is that at every partition step, the keys in a particular partition have the same prefix. When you partition again, you donot need to compare that prefix for the keys.

Example: Lets say the input is {"hello", "world", "house", "homly" } and the first partition is around the key "world"

You get: {"hello", "house", "homly"}, {"world"}

When you want to repartition the first set, you donot have to compare the first character of the strings as you already know that the first character is same in all of them !

In general, the length of the common prefix in a set would be the number of times partitioning was done to get to the set.

If you are interested to dive deep, do read Fast Algorithms for sorting and searching strings

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜