mersenneforum.org Merge-Sort: To recurse or not?

2012-02-18, 21:26   #1
ewmayer
∂2ω=0

Sep 2002
República de California

10011000111111₂ Posts
Merge-Sort: To recurse or not?

One of the very first CS-style 'classic algorithms' I implemented after moving to Silicon Valley in the late 90s and making myself more employable by learning C (later C++) and the basics of CS-fundamental data structures, was a heapsort. The startup I was working for needed some decently optimized basic-infrastructure code, including a classic heap; the resulting O(n lg n) sort capability is of course a useful side effect of the heap functionality, which can be used for lots of other purposes.

More recently I finally got round to coding up one of the other classic asymptotically fast sorting algos (and probably the coding-simplest of the lot), mergesort. After getting the basic recursive version working and verifying the runtime trend up to array sizes limited by my iMac's 4GB of RAM, I proceeded to study several further optimizations. One of these is the ability to sort in place: mergesort needs the sublist pairs being divide-and-conquer-sorted-and-remerged at each step to first be copied, which can be done efficiently via stdlib memcpy(), but avoiding the copy entirely would be even better. That theme leads in the direction of the other of the "big three" sorting algos, quicksort, but I didn't pursue it to that end yet because I was more interested in another kind of optimization, namely a nonrecursive implementation.
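For readers who haven't coded one: the recursive scheme can be sketched in a few lines. The function names here only echo (and are not) the attached merge_sort_dfloat / merge_lists_dfloat:

```cpp
// Minimal recursive mergesort for doubles: split in halves differing by at
// most one element, sort each recursively, merge into scratch, copy back.
#include <cstring>  // memcpy, as in the post

// Merge two sorted runs a[0..na) and b[0..nb) into out[0..na+nb).
static void merge_runs(const double *a, int na,
                       const double *b, int nb, double *out) {
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb) out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

// Recursive divide step; scratch must have room for n doubles.
static void msort(double *x, int n, double *scratch) {
    if (n < 2) return;
    int nl = n / 2, nr = n - nl;      // sublist lengths differ by at most 1
    msort(x, nl, scratch);
    msort(x + nl, nr, scratch);
    merge_runs(x, nl, x + nl, nr, scratch);
    std::memcpy(x, scratch, n * sizeof(double));
}
```

Note how the recursion itself carries all the indexing; the non-power-of-2 cases need no special handling.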

It seems to be the general wisdom that if one can take a recursive algorithm and re-implement it nonrecursively without incurring overmuch bookkeeping overhead - the FFT is a famous example - the nonrecursive version should run faster, due to elimination of the system overhead needed for managing stack frames in the recursion. I found a nonrecursive scheme for mergesort whose bookkeeping is not egregious, though one does need more of it via that route. (One of the beautiful things about the recursive mergesort is that the recursion itself manages all the indexing, which for general non-power-of-2 list lengths is nontrivial.) Interestingly, I was not able to exactly duplicate the recursive version's indexing in nonrecursive fashion (which is not to say it is impossible - I just didn't see a reasonable way of doing so), but I did come up with a nonrecursive scheme which still has nearly-as-good sublist-merge properties. In the recursive version the two sorted sublists merged at each step differ in length by at most one element (e.g. we might merge sorted sublists of lengths 6 and 7 into a length-13 sorted output list), whereas in the best nonrecursive scheme I have come up with so far, the two sublists never differ in length by more than a multiplicative factor of 2. (The large comment block following the code sample below shows several worked examples using both versions.)
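The scheme in the attached code is not reproduced here, but the textbook bottom-up variant it refines can be sketched as follows. Note that for non-power-of-2 n this plain version can merge runs far more unbalanced than the factor-of-2 bound described above (at n = 9 the last pass merges an 8-run against a 1-run), which is presumably what the extra bookkeeping in the attachment avoids:

```cpp
// Textbook bottom-up (nonrecursive) mergesort: merge sorted runs of width
// 1, then 2, 4, ... doubling each pass. Trailing runs are clipped to n.
#include <algorithm>
#include <cstring>
#include <vector>

// Merge the sorted runs a[lo..mid) and a[mid..hi) into out[lo..hi).
static void merge_span(const double *a, int lo, int mid, int hi, double *out) {
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) out[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) out[k++] = a[i++];
    while (j < hi)  out[k++] = a[j++];
}

static void msort_iter(double *x, int n) {
    std::vector<double> scratch(n);
    for (int width = 1; width < n; width *= 2) {
        for (int lo = 0; lo < n; lo += 2 * width) {
            int mid = std::min(lo + width, n);      // clip runs at the
            int hi  = std::min(lo + 2 * width, n);  // array boundary
            merge_span(x, lo, mid, hi, scratch.data());
        }
        std::memcpy(x, scratch.data(), n * sizeof(double));
    }
}
```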

In order to get a decent "control" for my timing tests I also follow each timing of my hand-rolled implementation with a call to the C++ standard library's std::sort on a vector, which presumably uses a very highly optimized vendor-provided version of one of (or some hybrid of) the above algos.

The results can be summarized as follows:

1. My hand-rolled (but still all high-level code) recursive implementation of mergesort sorts a large list of quasi-random doubles roughly an order of magnitude more slowly than the STL sort. Here is a timing sample with n = 10^6 on my iMac (in my source the timing code surrounds just the sort() call, so none of the input-creation overhead pollutes the timing):
Code:
./merge0 1000000
Generating random inputs; RAND_MAX = 2147483647
RECURSIVE, Clocks = 00:00:00.274
STL vector, Clocks = 00:00:00.034
Thus STL sort is 8x faster. At n = 10^8, STL sort is a little over 7x faster:
Code:
./merge0 100000000
RECURSIVE, Clocks = 00:00:33.978
STL vector, Clocks = 00:00:04.691
The timing ratios for the 2 runlengths are very close to what strict O(n lg n) scaling predicts, likely because mergesort (and whatever algo the GNU compiler's STL uses) tends to be cache-friendly.

2. The speed difference between my code and the STL is not terribly surprising ... the STL surely has a very optimized algo. What *is* surprising to me is that my nonrecursive version of the algo runs consistently 10-20% slower than its recursive brother, on both my iMac (x86_64, OS X, g++) and Win32/Visual Studio builds. Again comparing the above 2 runlengths, but now recursive vs. nonrecursive:

Code:
./merge0 1000000
RECURSIVE, Clocks = 00:00:00.274
NON-RECUR, Clocks = 00:00:00.340
And at n = 10^8:
Code:
./merge0 100000000
RECURSIVE, Clocks = 00:00:33.978
NON-RECUR, Clocks = 00:00:40.478
Source code is attached, if anyone wants to play with it on their own system. The GPL comment block at top is standard boilerplate (I don't really expect anyone in their right mind would want to use this for commercial purposes); the simple 1-line build instructions are in the following comment block. The heart of the algo is the merge_sort_dfloat (divide) and merge_lists_dfloat (conquer) functions; the rest is all just test/pretty-print/util code.

At bottom of the source file are some algo-design notes, in the form of a large C commentary block.

It's clear much time and money has been put into efficiently supporting recursion by chip/OS/compiler vendors, but even so I found the second set of the above timing results to be quite surprising.
Attached Files
 merge_sort.cpp.gz (5.5 KB, 87 views)

Last fiddled with by ewmayer on 2012-02-19 at 19:51 Reason: update attachment to include shuffle-code fix

2012-02-19, 19:49   #2
ewmayer
∂2ω=0

Sep 2002
República de California

23077₈ Posts

Corrigendum: When I added the C++ vector-sort to what had started as strictly C code for my own test implementation, I simply inited the C++ vector with the data from the same array-of-doubles used by the original code ... but that array was in sorted order at the point of the init, and I neglected to reshuffle it. In other words the C++ vector-sort was getting an already-sorted array ... for some sort implementations that doesn't matter, but here it does. When I insert the needed random_shuffle() between the vector init and sort, the STL sort timings change like so:

n = 10^6:
Code:
RECURSIVE, Clocks = 00:00:00.275
STL vector, Clocks = 00:00:00.117
n = 10^8:
Code:
RECURSIVE, Clocks = 00:00:36.606
STL vector, Clocks = 00:00:14.164
Still faster than my hand-rolled test code, but now only by a factor of ~2.5x. I've updated the gzip attachment above to include the code patch.
2012-02-19, 23:11   #3
retina
Undefined

"The unspeakable one"
Jun 2006
My evil lair

5793₁₀ Posts

IME merge sort is crappy. I've tried implementing it in many different ways on many different systems and could never get it anywhere near as fast as a basic binary sort. But you don't mention some of the sorts faster than quicksort: radix sort is usually a decent speed demon for data that is well dispersed.

Also, I would have expected that during speed comparisons you would give each algorithm the exact same data to work with. Why not generate a data set from a PRNG and use the same seed again to generate the exact same data set for the next algorithm? Doing a randomisation seems wrong to me. And what is the state of the cache after the randomisation step?
2012-02-19, 23:25   #4
ldesnogu

Jan 2008
France

210₁₆ Posts

I am not surprised that a derecursived program might be slower. I remember, 20 years ago, showing some disbelievers that a naive recursive implementation of the FFT was faster than the iterative Cooley-Tukey variant. Back then people were not familiar with cache effects. You are perhaps experiencing such a memory-hierarchy artefact.
2012-02-20, 02:03   #5
ewmayer
∂2ω=0

Sep 2002
República de California

9,791 Posts

Quote:
 Originally Posted by retina IME merge sort is crappy. I've tried implementing it in many different ways on many different systems and could never get it anywhere near as fast as a basic binary sort. But you don't mention some of the sorts faster than quicksort: radix sort is usually a decent speed demon for data that is well dispersed.
I am interested here only in truly generic sorting algos, not ones which rely on assumptions about the range or statistical properties of the inputs to allow for faster than O(n lg n) best-case speed.

Quote:
 Also, I would have expected that during speed comparisons you would give each algorithm the exact same data to work with. Why not generate a data set from a PRNG and use the same seed again for the PRNG to generate the exact same data set for the next algorithm? Doing a randomisation seems wrong to me. And what is that state of the cache after the randomisation step?
As long as the data resulting from the srand/rand() sequence I use to generate the initial dataset and the random_shuffle of same are similarly well randomized the sort speed should be unaffected by the precise data sequencing. But just to be sure, I implemented your suggestion and tried both sort options using the same data ... no discernible difference in timings from the 2.5x ratio of my previous 'corrigendum' post.
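The same-seed approach retina suggests (and that this re-test used) can be sketched like so; the function name and seed are illustrative:

```cpp
// Regenerate a bit-identical dataset for each algorithm by reseeding the
// PRNG with the same value before each run.
#include <cstdlib>
#include <vector>

static std::vector<double> make_inputs(unsigned seed, int n) {
    std::srand(seed);                  // same seed => same rand() sequence
    std::vector<double> v(n);
    for (int i = 0; i < n; ++i) v[i] = (double)std::rand();
    return v;
}
// usage: feed make_inputs(42, n) to each sort under test in turn;
// every call with the same seed yields the identical data set.
```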

2012-02-20, 07:32   #6
xilman
Bamboozled!

"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

2²·3²·281 Posts

Quote:
 Originally Posted by ewmayer I am interested here only in truly generic sorting algos, not ones which rely on assumptions about the range or statistical properties of the inputs to allow for faster than O(n lg n) best-case speed.
Ah, then in that case you want the spaghetti sort. It takes O(1) to perform the sorting but, of course, O(N) to read in the operands and O(N) to write the results. And, yes, it is a truly generic sort.

