Reading streamed data into a sorted list
We know that, in general, the "smarter" comparison sorts on arbitrary data run in worst case complexity O(N * log(N)).
My question is what happens if we are asked not to sort a collection, but a stream of data. That is, values are given to us one by one with no indicator of what comes next (other than that the data is valid/in range). Intuitively, one might think that it is superior then to sort data as it comes in (like picking up a poker hand one by one) rather than gathering all of it and sorting later (sorting a poker hand after it's dealt). Is this actually the case?
Gathering and sorting would be O(N + N * log(N)) = O(N * log(N)). However if we sort it as it comes in, it is O(N * K), where K = time to find the proper index + time to insert the element. This complicates things, since the value of K now depends on our choice of data structure. An array is superior in finding the index but wastes time inserting the element. A linked list can insert more easily but cannot binary search to find the index.
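To make the two strategies concrete, here is a minimal Python sketch, assuming the stream is any iterable of comparable values. The function names `sort_stream` and `gather_then_sort` are made up for illustration. `bisect.insort` from the standard library covers the array case described above: it finds the index by binary search but still has to shift elements to make room.

```python
import bisect


def sort_stream(stream):
    """Keep a Python list sorted while values arrive one at a time."""
    sorted_values = []
    for value in stream:
        # Binary search finds the index in O(log N) comparisons,
        # but the list insertion shifts later elements, which is O(N).
        bisect.insort(sorted_values, value)
    return sorted_values


def gather_then_sort(stream):
    """Collect every value first, then sort once at the end."""
    return sorted(stream)


hand = [7, 2, 11, 5, 2, 9]
streamed = sort_stream(iter(hand))
batched = gather_then_sort(iter(hand))
```

Both approaches end with the same list; the difference is only in when the work is done and how much of it there is.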
Is there a complete discussion on this issue? When should we use one method or another? Might there be a desirable in-between strategy of sorting every once in a while?
Balanced tree sort has O(N log N) complexity and maintains the list in sorted order while elements are added.
Absolutely not!
Firstly, if I can sort streaming data, I can just accept all my data in O(N) and then stream it to myself and sort it using the quicker method. That is, the all-data problem reduces to the stream problem, which means sorting a stream cannot be asymptotically faster than sorting a collection.
Secondly, you're describing an insertion sort, which actually runs in O(N^2) time (i.e. your description of O(NK) was right, but K is not constant, rather a function of N), since it might take O(N) time to find the appropriate index. You could improve it to be a binary insertion sort, which cuts the comparisons to O(N log N), but an array would still take O(N^2) overall because each insertion may shift O(N) elements, and a linked list cannot be binary searched in the first place. Getting O(N log N) overall needs a structure such as a balanced search tree, and even then you haven't really saved anything.
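As a rough illustration of why the binary search alone does not rescue the array version, the sketch below counts comparisons and element shifts separately. The function name `binary_insertion_sort_counts` and the choice of a descending input as the bad case are assumptions made for this example.

```python
def binary_insertion_sort_counts(values):
    """Array-backed binary insertion sort.

    Returns (sorted list, number of comparisons, number of element shifts).
    """
    result = []
    comparisons = 0
    shifts = 0
    for value in values:
        # Binary search for the insertion point (rightmost valid position).
        lo, hi = 0, len(result)
        while lo < hi:
            mid = (lo + hi) // 2
            comparisons += 1
            if result[mid] <= value:
                lo = mid + 1
            else:
                hi = mid
        # Every element at or after index lo has to move one slot right.
        shifts += len(result) - lo
        result.insert(lo, value)
    return result, comparisons, shifts


n = 1000
descending = list(range(n, 0, -1))  # each new value belongs at the front
ordered, comparisons, shifts = binary_insertion_sort_counts(descending)
```

On this input the comparison count stays on the order of N log N, while the shift count is N(N-1)/2, so the quadratic term comes entirely from moving elements.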
Probably also worth mentioning the general principle: as long as you're in the comparison model (i.e. you don't have any non-trivial and helpful information about the data you're sorting, which is the general case), any sorting algorithm will be at best O(N log N). I.e. the worst-case running time of any sorting algorithm in this model is Ω(N log N). That's not a hypothesis, but a theorem. So it is impossible to find anything faster (under the same assumptions).
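For reference, the standard decision-tree argument behind that theorem can be sketched as follows. Here h is the height of the decision tree, i.e. the worst-case number of comparisons; the symbol is introduced only for this sketch. A comparison sort must distinguish all N! possible orderings of its input, and a binary tree of height h has at most 2^h leaves, so:

```latex
\begin{align*}
2^{h} &\ge N! \\
h &\ge \log_2 (N!) = \sum_{k=1}^{N} \log_2 k \\
  &\ge \sum_{k=\lceil N/2 \rceil}^{N} \log_2 k
   \;\ge\; \frac{N}{2} \log_2 \frac{N}{2}
   \;=\; \Omega(N \log N)
\end{align*}
```

The argument only counts comparisons, so it applies equally whether the values arrive all at once or one by one.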
OK, if the stream is relatively slow, you will have a completely sorted list (minus the last element) when your last element arrives. Then all that remains is a single binary search, O(log n), plus the insertion itself, rather than a complete sort, O(n log n). So there is potentially a perceived performance gain, since you are getting a head start on the other sort algorithms.
Managing, queuing, and extracting data from a stream is a completely different issue and might be counter-productive to your intentions. I would not recommend this unless you can sort the complete data set in about the same time it takes to stream one or maybe two elements (and you feel good about coding the streaming portion).
Use Heap Sort in those cases where Tree Sort will behave badly, i.e. on large data sets, since Tree Sort needs additional space to store the tree structure.
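A heap-based variant might look like the sketch below, assuming Python's standard `heapq` module and a hypothetical function name `heap_sort_stream`. This is not the classic in-place heap sort, only an illustration of the idea: the heap lives in a flat list with no per-node pointers, and each arriving value costs O(log N) to push.

```python
import heapq


def heap_sort_stream(stream):
    """Push values onto a binary heap as they arrive, then drain it in order."""
    heap = []
    for value in stream:
        heapq.heappush(heap, value)  # O(log N) per arriving value
    # Popping repeatedly yields the values in ascending order.
    return [heapq.heappop(heap) for _ in range(len(heap))]


result = heap_sort_stream(iter([7, 2, 11, 5, 2, 9]))
```

Note that the heap is only partially ordered while the stream is still running, so unlike a tree it does not give you a fully sorted view at every moment; the sorted order appears only once the heap is drained.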