开发者

Which sorting algorithm should I use in this scenario?

A researcher has a database of 100 million records of people. The researcher wants to study the distribution of given names according to other criteria such as zodiac sign, birth year, etc, so wants to sort by name with the option of further sorting later.

Which sort should I use?

A. selection

B. quick

C. heap

D. inser开发者_如何学编程tion

E. merge

Thanks!


It's not really my answer since you reached it yourself, but here it is for better visibility:

  1. Selection and insertion can be ruled out because they have O(n^2) average running time, which isn't going to cut it for 100M items.
  2. Heapsort and quicksort are ruled out because they are not stable. This problem needs a stable sort because the problem definition implies that when sorting further, the original order (by name) needs to be maintained.
  3. This only leaves mergesort as a suitable candidate.

Update: Exam-related advice

I have to admit that point 2 above (preserve the sort by name) is not totally clear from the problem description. However, this is an exam question and there has to be some way of trimming the options down to one. This is only made possible by demanding a stable sort, so the requirement is there even if the wording is not ironclad.

This way of practical thinking makes it IMHO much easier to reach definitive answers for some types of exam questions.


Try mapping your requirements to the comparison table at http://en.wikipedia.org/wiki/Sort_algorithms#Comparison_of_algorithms.


Someone posted a duplicate, and this was going to be my answer. Since I went through the effort to type all this, I may as well share it for future readers.

Each sorting algorithm has its best and worse use cases. This is how I try to think about it:

  • Selection Sort: I rarely / never use selection sort because almost always insertion sort out performs it. This is best on small data sets and nearly sorted lists
  • Quick Sort: Looking for the best average case senario
  • Heap Sort: Best possible worst case
  • Insertion Sort: (See Selection)
  • Merge sort: Merge sort is slightly slower than quick sort but has guaranteed O(n log n) behavior. The key point here is that merge sort is much more stable than quick sort.

Obviously that is a very breif overview. You can find a lot more info on Wikipedia and through a Google search like: "When to use [Insert Algorithm Here]"

Hope that helps!


If you want to get a histogram, I wouldn't sort the data. I would just go through all the data counting all the combinations of interest. This is an O(N) operation.

Sorting the data first is unlikely to improve speed. This is an O(N*log(N)) operation.


If wanted to sort all the record, I would use a Collection.sort() with a custom comparator which has all the fields you need to compare. You would have to load all the records into memory which will take a few GB, but once you have done this it should be pretty fast.

The only way to make this faster would be to filter down the criteria. If you do this, I would create a Collection which has a copy of the records of interest and sort that.


The most efficient sorting algorithm, wouldn't be a traditional one.

Since you are sorting based on criteria such as birth year and zodiac sign, I would do a "stack sort" (I just made that up).

It would work this way.

Create a data structure for each possible sorted value. Let's use birth year for example. In birth year, there are only going to be ~100 different values that it could be.

  1. Declare a data structure for each possible value for Birth year (100 pointer arrays, one for each year)
  2. Loop through each record, and place a pointer to the record in that array.

When you're done looping through each record, you've now got 100 arrays, each filled with records that have that particular birth year. The great part about this is that you've done it in O(n) time, so it's much faster than any other sorting algorithm. This also works for zodiac signs etc...

Think outside of the box. This approach is very useful when sorting a large dataset (n) with possible values (m), where m << n.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜