#1
∂2ω=0
Sep 2002
República de California
103×113 Posts
One of the very first CS-style 'classic algorithms' I implemented after moving to Silicon Valley in the late 90s, while making myself more employable by learning C (later C++) and the fundamental CS data structures, was a heapsort. The startup I was working for needed some decently optimized basic-infrastructure code, including a classic heap; the resulting O(n lg n) sort capability is of course a useful side effect of the heap functionality, which can be used for lots of other purposes.
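That heap-backed sort can be sketched roughly as follows. This is a generic illustrative version, not the startup's actual code; the function names and the plain array-of-doubles interface are my own:

```cpp
#include <cstddef>

// Sift the element at index i down through the max-heap a[0..n-1]
// until the heap property is restored.
static void sift_down(double *a, std::size_t n, std::size_t i) {
    for (;;) {
        std::size_t largest = i, l = 2*i + 1, r = 2*i + 2;
        if (l < n && a[l] > a[largest]) largest = l;
        if (r < n && a[r] > a[largest]) largest = r;
        if (largest == i) return;
        double t = a[i]; a[i] = a[largest]; a[largest] = t;
        i = largest;
    }
}

// Classic in-place heapsort: build a max-heap bottom-up, then
// repeatedly swap the root (current maximum) to the end and re-sift.
void heapsort(double *a, std::size_t n) {
    if (n < 2) return;
    for (std::size_t i = n/2; i-- > 0; ) sift_down(a, n, i);
    for (std::size_t end = n - 1; end > 0; --end) {
        double t = a[0]; a[0] = a[end]; a[end] = t;
        sift_down(a, end, 0);
    }
}
```

Both the heap-build and the sort phase use the same sift-down primitive, which is what makes the O(n lg n) sort essentially free once the heap exists.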
More recently I finally got round to coding up one of the other classic asymptotically fast sorting algos (and probably the simplest of the lot to code), mergesort. After getting the basic recursive version working and verifying the runtime trend up to array sizes limited by my iMac's 4GB of RAM, I proceeded to study several further optimizations.

One of these is the ability to sort in place. Mergesort needs the sublist pairs being divide-and-conquer-sorted-and-remerged at each step to first be copied, which can be done efficiently via the stdlib memcpy(), but avoiding it entirely would be even better. That theme leads in the direction of the other of the "big three" sorting algos, quicksort, but I didn't pursue it to that end yet because I was more interested in another kind of optimization, namely a nonrecursive implementation.

It seems to be the general wisdom that if one can take a recursive algorithm and re-implement it nonrecursively without incurring overmuch bookkeeping overhead - the FFT is a famous example of this - the nonrecursive version should run faster, due to the elimination of the system overhead needed for managing stack frames in the recursion, and such. I found a nonrecursive scheme for mergesort with not-egregious bookkeeping, though one does need more of it by that route. (One of the beautiful things about the recursive mergesort is that the recursion itself manages all the indexing, which for general non-power-of-2 list lengths is nontrivial.)

Interestingly, I was not able to exactly duplicate the recursive-version indexing in nonrecursive fashion (which is not to say it is impossible - I just didn't see a reasonable way of doing so), but I did come up with a nonrecursive scheme which still has "nearly as good" sublist-merge properties: in the recursive version each pair of sorted sublists which gets merged differs in length by at most one element (e.g. we might merge sorted sublists of length 6 and 7 into a length-13 sorted output list), whereas in the nonrecursive implementation the best I came up with (so far) is that the 2 sublists never differ in length by more than a multiplicative factor of 2. (The large comment block following the code sample I give below shows several worked examples using both versions.)

In order to get a decent "control" for my timing tests I also follow each timing of my hand-rolled implementation with a call to the C++ STL vector sort, which presumably uses a very highly optimized vendor-provided version of one of (or some hybrid of) the above algos. The results can be summarized as follows:

1. My hand-rolled (but still all-high-level-code) recursive implementation of mergesort sorts a large list of quasi-random doubles roughly an order of magnitude more slowly than the STL sort. Here is a timing sample with n = 10^6 on my iMac - in my source the timing code surrounds just the sort() call, so none of the input-creation overhead pollutes the timing:

Code:
./merge0 1000000
Generating random inputs; RAND_MAX = 2147483647
RECURSIVE, Clocks = 00:00:00.274
STL vector, Clocks = 00:00:00.034

Code:
./merge0 100000000
RECURSIVE, Clocks = 00:00:33.978
STL vector, Clocks = 00:00:04.691

The speed difference between my code and STL is not terribly surprising ... STL surely has a very-optimized algo. What *is* surprising to me is that my nonrecursive version of the algo runs consistently 10-20% slower than its recursive brother, on both my iMac (x86_64, OS X, g++) and Win32/Visual Studio builds. Again comparing the above 2 runlengths, but now recursive vs. nonrecursive:

Code:
./merge0 1000000
RECURSIVE, Clocks = 00:00:00.274
NON-RECUR, Clocks = 00:00:00.340

Code:
./merge0 100000000
RECURSIVE, Clocks = 00:00:33.978
NON-RECUR, Clocks = 00:00:40.478

At the bottom of the source file are some algo-design notes, in the form of a large C comment block. It's clear much time and money has been put into efficiently supporting recursion by chip/OS/compiler vendors, but even so I found the second set of the above timing results to be quite surprising.

Last fiddled with by ewmayer on 2012-02-19 at 19:51 Reason: update attachment to include shuffle-code fix
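For reference, the nonrecursive route discussed above is in the same family as the textbook "bottom-up" mergesort, which can be sketched as follows. This is my own illustrative version, not the attached code: it merges adjacent runs of width 1, 2, 4, ..., and for a non-power-of-2 length the final run of a pass may be shorter than its partner, which is exactly where sublists differing by up to a multiplicative factor of 2 come from:

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Merge the sorted ranges a[lo..mid) and a[mid..hi) via the scratch
// buffer tmp, then copy the merged result back into a.
static void merge_runs(std::vector<double>& a, std::vector<double>& tmp,
                       std::size_t lo, std::size_t mid, std::size_t hi) {
    std::size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[j] < a[i]) ? a[j++] : a[i++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    std::copy(tmp.begin() + lo, tmp.begin() + hi, a.begin() + lo);
}

// Bottom-up (nonrecursive) mergesort: plain loop indexing replaces
// the call stack. Each pass merges pairs of runs of length `width`;
// the last run of a pass may be shorter when n is not a power of 2.
void mergesort_bottom_up(std::vector<double>& a) {
    std::size_t n = a.size();
    std::vector<double> tmp(n);
    for (std::size_t width = 1; width < n; width *= 2) {
        for (std::size_t lo = 0; lo + width < n; lo += 2 * width) {
            std::size_t mid = lo + width;
            std::size_t hi  = std::min(lo + 2 * width, n);
            merge_runs(a, tmp, lo, mid, hi);
        }
    }
}
```

Note the inner-loop guard `lo + width < n`: a leftover run with no partner is simply carried forward untouched to the next pass, so no special-case indexing is needed for odd-length tails.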
#2
∂2ω=0
Sep 2002
República de California
103×113 Posts
Corrigendum: When I added the C++ vector-sort to what had started as strictly C code for my own test implementation, I simply inited the C++ vector with the data from the same array-of-doubles used by the original code ... but that array was in sorted order at the point of the init, and I neglected to reshuffle it. In other words, the C++ vector-sort was getting an already-sorted array ... for some sort implementations that doesn't matter, but here it does.
When I insert the needed random_shuffle() in between the vector init and the sort, the STL vector-sort timings change like so:

n = 10^6: Code:
RECURSIVE, Clocks = 00:00:00.275
STL vector, Clocks = 00:00:00.117

n = 10^8: Code:
RECURSIVE, Clocks = 00:00:36.606
STL vector, Clocks = 00:00:14.164

I've updated the gzip attachment above to include the code patch.
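The fix amounts to something like the following sketch. The actual harness is in the attachment; random_shuffle() was the natural call at the time, though modern C++ deprecates it in favor of std::shuffle with an explicit engine, which is what I show here:

```cpp
#include <vector>
#include <algorithm>
#include <random>

// Benchmark skeleton: reshuffle the vector between initialization and
// the sort call, so the STL sort sees genuinely unsorted input.
// (Timer calls omitted; they would bracket only the std::sort line.)
void benchmark_stl_sort(std::vector<double>& v) {
    std::mt19937 g(12345);                // fixed seed for repeatability
    std::shuffle(v.begin(), v.end(), g);  // the step that was missing
    // ... start timer here ...
    std::sort(v.begin(), v.end());
    // ... stop timer here ...
}
```

Feeding a sort already-sorted input can wildly distort timings in either direction depending on the implementation, which is why the STL numbers above moved so much once the shuffle went in.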
#3
Undefined
"The unspeakable one"
Jun 2006
My evil lair
2×19×163 Posts
IME merge sort is crappy. I've tried implementing it in many different ways on many different systems and could never get it to be anywhere near as fast as a basic binary sort.
But you don't mention some of the sorts that are faster than quicksort. Radix sort is usually a decent speed demon for data that is well dispersed. Also, I would have expected that during speed comparisons you would give each algorithm the exact same data to work with. Why not generate a data set from a PRNG and then reuse the same seed to generate the exact same data set for the next algorithm? Doing a randomisation in between seems wrong to me. And what is the state of the cache after the randomisation step?
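The same-seed suggestion is easy to realize: regenerate the data set from a fixed PRNG seed before each algorithm's run, so every contender sees byte-identical input and no in-place randomization touches the cache between timings. A minimal sketch along those lines (my own illustration, not the thread's code):

```cpp
#include <vector>
#include <random>
#include <cstddef>

// Regenerate the exact same pseudo-random data set on demand:
// seeding the engine identically guarantees an identical sequence,
// so each sort under test can be handed a fresh, identical copy.
std::vector<double> make_dataset(std::size_t n, unsigned seed) {
    std::mt19937 g(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::vector<double> v(n);
    for (double& x : v) x = dist(g);
    return v;
}
```

Each algorithm then gets make_dataset(n, seed) with the same seed, which also leaves the cache in a comparable state before every timed run.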
#4
Jan 2008
France
2×5²×11 Posts
I am not surprised that a derecursified program might be slower. I remember, 20 years ago, showing some disbelievers that a naive recursive implementation of the FFT was faster than the iterative Cooley-Tukey variant. Back then people were not familiar with cache effects.

You are perhaps experiencing such a memory-hierarchy artefact.
#5
∂2ω=0
Sep 2002
República de California
11639₁₀ Posts
#6
Bamboozled!
"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across
2A00₁₆ Posts
Ah, then in that case you want the spaghetti sort. It takes O(1) to perform the sorting but, of course, O(N) to read in the operands and O(N) to write the results. And, yes, it is a truly generic sort.