2012-02-18, 21:26  #1 
∂^{2}ω=0
Sep 2002
República de California
2·5·13·89 Posts 
MergeSort: To recurse or not?
One of the very first CS-style 'classic algorithms' I implemented after moving to Silicon Valley in the late 90s and making myself more employable by learning C (later C++) and the basics of CS-fundamental data structures was a heapsort. The startup I was working for needed some decently optimized basic-infrastructure code, including a classic heap; the resulting O(n lg n) sort capability is of course a useful side effect of the heap functionality, which can be used for lots of other purposes.
More recently I finally got round to coding up one of the other classic asymptotically fast sorting algos (and probably the coding-simplest of the lot), mergesort. After getting the basic recursive version working and verifying the runtime trend up to array sizes limited by my iMac's 4GB of RAM, I proceeded to study several further optimizations.

One of these is the ability to sort in-place. Mergesort needs the sublist pairs being divide-and-conquer-sorted-and-remerged at each step to first be copied, which can be done efficiently via stdlib memcpy(), but avoiding it entirely would be even better. That theme leads in the direction of the other of the "big three" sorting algos, quicksort, but I didn't pursue it to that end yet because I was more interested in another kind of optimization, namely a non-recursive implementation.

It seems to be the general wisdom that if one can take a recursive algorithm and reimplement it non-recursively without incurring overmuch bookkeeping overhead (the FFT is a famous example of this), the non-recursive version should run faster, due to elimination of the system overhead needed for managing stack frames in the recursion, and such. I found a non-recursive scheme for mergesort which has not-egregious bookkeeping, though one does need more bookkeeping to do things via that route. (One of the beautiful things about the recursive mergesort is that the recursion itself manages all the indexing, which for general non-power-of-2 list lengths is nontrivial.)

Interestingly, I was not able to exactly duplicate the recursive-version indexing in non-recursive fashion (which is not to say it is impossible to do; I just didn't see a reasonable way of doing so), but I did come up with a non-recursive scheme which still has "nearly as good" sublist-merge properties: in the recursive version each pair of sorted sublists which gets merged differs in length by at most one element (e.g. we might merge sorted sublists of length 6 and 7 into a length-13 sorted output list), whereas in the non-recursive implementation the best I came up with (so far) is that the 2 sublists never differ in length by more than a multiplicative factor of 2. (The large comment block following the code sample I give below shows several worked examples using both versions.)

In order to get a decent "control" for my timing tests I also follow each timing of my hand-rolled implementation with a call to the C++ standard template library vector::sort, which presumably uses a very highly optimized vendor-provided version of one of (or some hybrid of) the above algos. The results can be summarized as follows:

1. My hand-rolled (but still all high-level code) recursive implementation of mergesort sorts a large-length list of quasi-random doubles roughly an order of magnitude more slowly than STL vector::sort. Here is a timing sample with n = 10^6 on my iMac (in my source the timing code surrounds just the sort() call, so none of the input-creation overhead pollutes the timing):
Code:
./merge0 1000000
Generating random inputs; RAND_MAX = 2147483647
RECURSIVE, Clocks = 00:00:00.274
STL vector, Clocks = 00:00:00.034

Code:
./merge0 100000000
RECURSIVE, Clocks = 00:00:33.978
STL vector, Clocks = 00:00:04.691

The speed difference between my code and STL is not terribly surprising ... STL surely has a very-optimized algo. What *is* surprising to me is that my non-recursive version of the algo runs consistently 10-20% slower than its recursive brother, on both my iMac (x86_64, OS X, g++) and Win32/Visual Studio builds. Again comparing the above 2 run-lengths, but now recursive vs. non-recursive:
Code:
./merge0 1000000
RECURSIVE, Clocks = 00:00:00.274
NONRECUR, Clocks = 00:00:00.340

Code:
./merge0 100000000
RECURSIVE, Clocks = 00:00:33.978
NONRECUR, Clocks = 00:00:40.478

At the bottom of the source file are some algo-design notes, in the form of a large C comment block. It's clear much time and money has been put into efficiently supporting recursion by chip/OS/compiler vendors, but even so I found the second set of the above timing results to be quite surprising.

Last fiddled with by ewmayer on 2012-02-19 at 19:51. Reason: update attachment to include shuffle-code fix 
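For readers who want a concrete picture of the two variants being compared, here is a minimal sketch of both (illustrative only, not the attached code; all names are mine). The recursive version splits so the two halves differ in length by at most one element, as described above; the bottom-up version shown here is the textbook iterative scheme, not the factor-of-2 indexing scheme from the post.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Merge two adjacent sorted runs src[lo..mid) and src[mid..hi) into dst.
static void merge_runs(const double* src, double* dst,
                       std::size_t lo, std::size_t mid, std::size_t hi)
{
    std::size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi) dst[k++] = (src[j] < src[i]) ? src[j++] : src[i++];
    while (i < mid) dst[k++] = src[i++];
    while (j < hi)  dst[k++] = src[j++];
}

static void msort_rec(double* a, double* tmp, std::size_t lo, std::size_t hi)
{
    if (hi - lo < 2) return;
    std::size_t mid = lo + (hi - lo) / 2;   // halves differ by at most 1
    msort_rec(a, tmp, lo, mid);
    msort_rec(a, tmp, mid, hi);
    merge_runs(a, tmp, lo, mid, hi);
    std::memcpy(a + lo, tmp + lo, (hi - lo) * sizeof(double));
}

void merge_sort(double* a, std::size_t n)       // recursive driver
{
    std::vector<double> tmp(n);
    msort_rec(a, tmp.data(), 0, n);
}

void merge_sort_iter(double* a, std::size_t n)  // non-recursive (bottom-up)
{
    std::vector<double> buf(n);
    double *src = a, *dst = buf.data();
    for (std::size_t width = 1; width < n; width *= 2) {
        for (std::size_t lo = 0; lo < n; lo += 2 * width) {
            std::size_t mid = std::min(lo + width, n);
            std::size_t hi  = std::min(lo + 2 * width, n);
            merge_runs(src, dst, lo, mid, hi);
        }
        std::swap(src, dst);    // ping-pong between array and scratch
    }
    if (src != a) std::memcpy(a, src, n * sizeof(double));
}
```

Note that with non-power-of-2 n the textbook bottom-up scheme can merge a full-width run with a much shorter tail run, which is exactly the imbalance the factor-of-2 scheme described above is meant to bound.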
2012-02-19, 19:49  #2 
∂^{2}ω=0
Sep 2002
República de California
2·5·13·89 Posts 
Corrigendum: When I added the C++ vector-sort to what had started as strictly-C code for my own test implementation, I simply inited the C++ vector with the data from the same array-of-doubles used by the original code ... but that array was in sorted order at the point of the init, and I neglected to re-shuffle it. In other words the C++ vector-sort was getting an already-sorted array ... for some sort implementations that doesn't matter, but here it does.
When I insert the needed random_shuffle() in between the vector init and sort, the STL std::vector sort timings change like so:

n = 10^6:
Code:
RECURSIVE, Clocks = 00:00:00.275
STL vector, Clocks = 00:00:00.117

n = 10^8:
Code:
RECURSIVE, Clocks = 00:00:36.606
STL vector, Clocks = 00:00:14.164

I've updated the gzip attachment above to include the code patch. 
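The fix described above can be sketched as follows (names are illustrative; std::shuffle is used here as the modern replacement for the random_shuffle() call mentioned in the post, since random_shuffle was removed in C++17):

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// The corrigendum in a nutshell: if the C++ vector is initialized from an
// already-sorted array, it must be re-shuffled before the timed sort, or
// an adaptive sort implementation gets an artificially easy input.
std::vector<double> prepare_input(const double* sorted, std::size_t n)
{
    std::vector<double> v(sorted, sorted + n);
    std::mt19937 rng(12345);                // fixed seed: same permutation every run
    std::shuffle(v.begin(), v.end(), rng);  // the step that was missing
    return v;                               // now time std::sort(v.begin(), v.end())
}
```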
2012-02-19, 23:11  #3 
Undefined
"The unspeakable one"
Jun 2006
My evil lair
1011011111101_{2} Posts 
IME merge sort is crappy. I've tried implementing it in many different ways on many different systems and could never get it to be anywhere near as fast as a basic binary sort.
But you don't mention some of the sorts that are faster than quicksort. Radix sort is usually a decent speed demon for data that is well dispersed. Also, I would have expected that during speed comparisons you would give each algorithm the exact same data to work with. Why not generate a data set from a PRNG and then use the same seed again for the PRNG to generate the exact same data set for the next algorithm? Doing a randomisation seems wrong to me. And what is the state of the cache after the randomisation step? 
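The same-seed suggestion above might be sketched like this (seed value and distribution are arbitrary choices of mine): regenerate the data set from a freshly re-seeded PRNG before each algorithm's timing run, so every algorithm sees byte-identical input.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// mt19937 is fully deterministic for a given seed, so re-seeding with the
// same value regenerates the exact same data set for the next algorithm.
std::vector<double> make_dataset(std::uint32_t seed, std::size_t n)
{
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::vector<double> v(n);
    for (double& x : v) x = dist(gen);
    return v;
}
```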
2012-02-19, 23:25  #4 
Jan 2008
France
3×179 Posts 
I am not surprised that a de-recursived program might be slower. I remember 20 years ago when I showed some disbelievers that a stupid recursive implementation of the FFT was faster than the iterative Cooley-Tukey variant. Back then people were not familiar with cache effects.
You are perhaps experiencing such a memory-hierarchy artefact. 
2012-02-20, 02:03  #5  
∂^{2}ω=0
Sep 2002
República de California
2·5·13·89 Posts 
2012-02-20, 07:32  #6 
Bamboozled!
"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across
19×547 Posts 
Ah, then in that case you want the spaghetti sort. It takes O(1) to perform the sorting but, of course, O(N) to read in the operands and O(N) to write the results. And, yes, it is a truly generic sort.
