2012-02-18, 21:26   #1
ewmayer

Merge-Sort: To recurse or not?

One of the very first CS-style 'classic algorithms' I implemented after moving to Silicon Valley in the late 90s - and making myself more employable by learning C (later C++) and the basics of the fundamental CS data structures - was a heapsort. The startup I was working for needed some decently optimized basic-infrastructure code, including a classic heap; the resulting O(n lg n) sort capability is of course a useful side effect of the heap functionality, which can be put to lots of other uses.
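Just for illustration - this is not the startup's code, and it uses the STL heap primitives rather than a hand-rolled heap - the "sort falls out of the heap for free" idea in minimal form:
Code:
#include <algorithm>  // std::make_heap, std::sort_heap
#include <cstdio>
#include <cstdlib>
#include <vector>

// Heapsort via the standard heap primitives: build a max-heap in O(n),
// then sort_heap repeatedly moves the max to the end -- O(n lg n) overall.
static void heap_sort(std::vector<double>& a)
{
    std::make_heap(a.begin(), a.end());
    std::sort_heap(a.begin(), a.end());  // leaves a in ascending order
}

int main()
{
    std::vector<double> v;
    for (int i = 0; i < 16; i++)
        v.push_back((double)rand() / RAND_MAX);
    heap_sort(v);
    for (size_t i = 0; i < v.size(); i++)
        printf("%.6f\n", v[i]);
    return 0;
}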

More recently I finally got round to coding up one of the other classic asymptotically fast sorting algos (and probably the simplest of the lot to code), mergesort. After getting the basic recursive version working and verifying the runtime trend up to array sizes limited by my iMac's 4GB of RAM, I proceeded to study several further optimizations. One of these is the ability to sort in place: mergesort needs the pair of sublists being divide-and-conquer-sorted-and-remerged at each step to first be copied into scratch storage, which can be done efficiently via the stdlib memcpy(), but avoiding the copy entirely would be even better. That theme leads in the direction of the other of the "big three" sorting algos, quicksort, but I didn't pursue it to that end yet because I was more interested in another kind of optimization, namely a nonrecursive implementation.
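To make the scheme concrete, here is a minimal sketch of the recursive divide-and-merge just described - illustrative only, with made-up function names rather than the ones in the attached source - where the two sublists are memcpy()'d into a scratch buffer before being remerged:
Code:
#include <cstring>   // memcpy

// Merge the two adjacent sorted runs a[lo..mid) and a[mid..hi) back into a[lo..hi).
// The runs are first memcpy'd into the scratch buffer, then merged back into a[].
static void merge_runs(double *a, double *scratch, int lo, int mid, int hi)
{
    int nleft = mid - lo, nright = hi - mid;
    double *left  = scratch;
    double *right = scratch + nleft;
    memcpy(left,  a + lo,  nleft  * sizeof(double));
    memcpy(right, a + mid, nright * sizeof(double));
    int i = 0, j = 0, k = lo;
    while (i < nleft && j < nright)
        a[k++] = (left[i] <= right[j]) ? left[i++] : right[j++];
    while (i < nleft)  a[k++] = left[i++];
    while (j < nright) a[k++] = right[j++];
}

// Classic top-down recursion: split as evenly as possible (the two halves
// differ in length by at most one), sort each half, then merge.
static void merge_sort_recursive(double *a, double *scratch, int lo, int hi)
{
    if (hi - lo < 2) return;
    int mid = lo + (hi - lo) / 2;
    merge_sort_recursive(a, scratch, lo, mid);
    merge_sort_recursive(a, scratch, mid, hi);
    merge_runs(a, scratch, lo, mid, hi);
}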

It seems to be the general wisdom that if one can take a recursive algorithm and re-implement it nonrecursively without incurring overmuch bookkeeping overhead - the FFT is a famous example of this - the nonrecursive version should run faster, due to elimination of the overhead of managing stack frames and function calls in the recursion. I found a nonrecursive scheme for mergesort whose bookkeeping is not egregious, though it is still more than the recursive version needs. (One of the beautiful things about the recursive mergesort is that the recursion itself manages all the indexing, which for general non-power-of-2 list lengths is nontrivial.)

Interestingly, I was not able to exactly duplicate the recursive-version indexing in nonrecursive fashion (which is not to say it is impossible to do - I just didn't see a reasonable way of doing so), but I did come up with a nonrecursive scheme which still has "nearly as good" sublist-merge properties: in the recursive version each pair of sorted sublists which gets merged differs in length by at most one element (e.g. we might merge sorted sublists of length 6 and 7 into a length-13 sorted output list), whereas in the best nonrecursive implementation I have come up with so far, the 2 sublists never differ in length by more than a multiplicative factor of 2. (The large comment block following the code sample I give below shows several worked examples using both versions.)
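For contrast, here is a sketch of the textbook bottom-up (loop-based) variant, reusing the illustrative merge_runs() from the sketch above. Note this simple width-doubling version does *not* have the factor-of-2 length guarantee of my scheme (a trailing partial run can be arbitrarily short); it is here only to show the general shape of a nonrecursive mergesort:
Code:
#include <algorithm>  // std::min

// Textbook bottom-up mergesort: pass 1 merges runs of width 1 into sorted
// runs of width 2, pass 2 merges width-2 runs into width-4 runs, and so on.
// merge_runs() is the same helper as in the recursive sketch above;
// scratch must have room for n doubles.
static void merge_sort_bottom_up(double *a, double *scratch, int n)
{
    for (int width = 1; width < n; width *= 2) {
        for (int lo = 0; lo < n - width; lo += 2 * width) {
            int mid = lo + width;
            int hi  = std::min(lo + 2 * width, n);
            merge_runs(a, scratch, lo, mid, hi);
        }
    }
}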

In order to get a decent "control" for my timing tests I also follow each timing of my hand-rolled implementation with a call to the C++ standard template library vector::sort, which presumably uses a very highly optimized vendor-provided version of one of (or some hybrid of) the above algos.
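In sketch form the harness looks something like the following - illustrative only, not the attached test code - with std::sort on the vector standing in for the "STL vector sort" call, and merge_sort_recursive()/merge_runs() as in the sketches above:
Code:
#include <algorithm>  // std::sort
#include <vector>
#include <cstdio>
#include <cstdlib>
#include <ctime>

// The clock() calls bracket only the sort itself, so input generation
// does not pollute the timings.
int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 1000000;
    printf("Generating random inputs; RAND_MAX = %d\n", RAND_MAX);

    double *a       = (double*)malloc(n * sizeof(double));
    double *scratch = (double*)malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) a[i] = (double)rand() / RAND_MAX;
    std::vector<double> v(a, a + n);   // same (unsorted) data for the STL control

    clock_t t0 = clock();
    merge_sort_recursive(a, scratch, 0, n);
    clock_t t1 = clock();
    printf("RECURSIVE, %.3f sec\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    std::sort(v.begin(), v.end());     // the STL control
    t1 = clock();
    printf("STL sort,  %.3f sec\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    free(a); free(scratch);
    return 0;
}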

The results can be summarized as follows:

1. My hand-rolled (but still all high-level code) recursive implementation of mergesort sorts a large-length list of quasi-random doubles roughly an order of magnitude more slowly than STL vector::sort. Here is a timing sample with n = 10^6 on my iMac; in my source the timing code surrounds just the sort() call, so none of the input-creation overhead pollutes the timing:
Code:
./merge0 1000000
Generating random inputs; RAND_MAX = 2147483647
RECURSIVE, Clocks = 00:00:00.274
STL vector, Clocks = 00:00:00.034
Thus STL sort is 8x faster. At n = 10^8, STL sort is a little over 7x faster:
Code:
./merge0 100000000
RECURSIVE, Clocks = 00:00:33.978
STL vector, Clocks = 00:00:04.691
The timing ratios for the 2 runlengths are very close to what strict O(n lg n) scaling predicts, likely because mergesort (and whatever algo the GNU compiler's STL uses) tends to be a cache-friendly algorithm.
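(As a rough sanity check on that: strict n lg n scaling from n = 10^6 to n = 10^8 predicts a slowdown factor of (10^8 * lg 10^8)/(10^6 * lg 10^6) = 100 * 26.6/19.9 ≈ 133; the measured ratios are 33.978/0.274 ≈ 124 for my code and 4.691/0.034 ≈ 138 for the STL sort, both within about 10% of that.)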

The speed difference between my code and STL is not terribly surprising ... STL surely has a very-optimized algo. What *is* surprising to me is that my nonrecursive version of the algo runs consistently 10-20% slower than its recursive brother, on both my iMac (x86_64, OS X, g++) and Win32/Visual-Studio builds. Again comparing the above 2 runlengths, but now recursive vs non:

Code:
./merge0 1000000
RECURSIVE, Clocks = 00:00:00.274
NON-RECUR, Clocks = 00:00:00.340
And at n = 10^8:
Code:
./merge0 100000000
RECURSIVE, Clocks = 00:00:33.978
NON-RECUR, Clocks = 00:00:40.478
Source code is attached, if anyone wants to play with it on their own system. The GPL comment block at top is standard boilerplate (I don't really expect anyone in their right mind would want to use this for commercial purposes); the simple 1-line build instructions are in the following comment block. The heart of the algo is the merge_sort_dfloat (divide) and merge_lists_dfloat (conquer) functions; the rest is all just test/pretty-print/util code.

At the bottom of the source file are some algo-design notes, in the form of a large C-style comment block.

It's clear much time and money have been put into efficiently supporting recursion by chip/OS/compiler vendors, but even so I found the second set of the above timing results to be quite surprising.
Attached Files
File Type: gz merge_sort.cpp.gz (5.5 KB, 87 views)

Last fiddled with by ewmayer on 2012-02-19 at 19:51 Reason: update attachment to include shuffle-code fix
2012-02-19, 19:49   #2
ewmayer

Corrigendum: When I added the C++ vector-sort to what had started as strictly C code for my own test-implementation, I simply inited the C++ vector with the data from the same array-of-doubles used by the original code ... but that array was in sorted order at the point of the init, and I neglected to reshuffle it. In other words the C++ vector-sort was getting an already-sorted array ... for some sort implementations that doesn't matter, but here it does.

When I insert the needed random_shuffle() in between the vector init and sort, the STL std::vector::sort timings change like so:

n = 10^6:
Code:
RECURSIVE, Clocks = 00:00:00.275
STL vector, Clocks = 00:00:00.117
n = 10^8:
Code:
RECURSIVE, Clocks = 00:00:36.606
STL vector, Clocks = 00:00:14.164
Still faster than my hand-rolled test code, but now only by a factor of ~2.5x.
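In sketch form, the fix is just one extra line between the vector init and the timed sort (illustrative code and names, not the exact patch in the attachment):
Code:
#include <algorithm>  // std::sort, std::random_shuffle
#include <vector>
#include <ctime>

// x points at the n doubles already sorted by the hand-rolled code.
static double time_stl_sort(const double *x, int n)
{
    std::vector<double> v(x, x + n);          // init the C++ vector from that array...
    std::random_shuffle(v.begin(), v.end());  // ...then re-shuffle, so the STL sort sees
                                              // unsorted input like everything else did.
    clock_t t0 = clock();
    std::sort(v.begin(), v.end());            // only the sort itself is timed
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}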

I've updated the gzip attachment above to include the code patch.
2012-02-19, 23:11   #3
retina

IME merge sort is crappy. I've tried implementing it in many different ways on many different systems and could never get it to be anywhere near as fast as a basic binary sort.

But you don't mention some of the sorts that are faster than quicksort. Radix sort is usually a decent speed demon for data that is well dispersed.

Also, I would have expected that during speed comparisons you would give each algorithm the exact same data to work with. Why not generate a data set from a PRNG and use the same seed again for the PRNG to generate the exact same data set for the next algorithm? Doing a randomisation seems wrong to me. And what is the state of the cache after the randomisation step?
2012-02-19, 23:25   #4
ldesnogu

I am not surprised that a program rewritten without recursion might be slower. I remember 20 years ago showing some disbelievers that a stupid recursive implementation of the FFT was faster than the iterative Cooley-Tukey variant. Back then people were not familiar with cache effects.

You are perhaps experiencing such a memory hierarchy artefact.
2012-02-20, 02:03   #5
ewmayer

Quote:
Originally Posted by retina
IME merge sort is crappy. I've tried implementing it in many different ways on many different systems and could never get it to be anywhere near as fast as a basic binary sort.

But you don't mention some of the sorts that are faster than quicksort. Radix sort is usually a decent speed demon for data that is well dispersed.
I am interested here only in truly generic sorting algos, not ones which rely on assumptions about the range or statistical properties of the inputs to allow for faster than O(n lg n) best-case speed.

Quote:
Also, I would have expected that during speed comparisons you would give each algorithm the exact same data to work with. Why not generate a data set from a PRNG and use the same seed again for the PRNG to generate the exact same data set for the next algorithm? Doing a randomisation seems wrong to me. And what is the state of the cache after the randomisation step?
As long as the data resulting from the srand/rand() sequence I use to generate the initial dataset and the random_shuffle of same are similarly well randomized, the sort speed should be unaffected by the precise data sequencing. But just to be sure, I implemented your suggestion and tried both sort options using the same data ... no discernible difference in timings from the ~2.5x ratio of my previous 'corrigendum' post.
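(In sketch form, the same-data setup is just a matter of re-seeding before each fill - illustrative, not the exact code in the attachment:)
Code:
#include <cstdlib>   // srand, rand

// Fill a[0..n) with the identical pseudo-random dataset for each algorithm
// under test, simply by re-seeding the PRNG with the same value every time.
static void fill_same_data(double *a, int n, unsigned int seed)
{
    srand(seed);
    for (int i = 0; i < n; i++)
        a[i] = (double)rand() / RAND_MAX;
}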
2012-02-20, 07:32   #6
xilman

Quote:
Originally Posted by ewmayer
I am interested here only in truly generic sorting algos, not ones which rely on assumptions about the range or statistical properties of the inputs to allow for faster than O(n lg n) best-case speed.
Ah, then in that case you want the spaghetti sort. It takes O(1) to perform the sorting but, of course, O(N) to read in the operands and O(N) to write the results. And, yes, it is a truly generic sort.