|
|
#56 |
|
∂²ω=0
Sep 2002
República de California
2²×2,939 Posts |
David Stanfill (airsquirrels) kindly gave me a user account on his Ryzen in order to do Mlucas builds/tests using my current code snapshot which I am preparing for release. Here are the first 2 sets of timing results, unthreaded builds (what I call '0-thread' to differentiate from multithread-capable builds run with just 1 thread).
Note the radices in the rightmost columns are *complex* FFT radices, thus their product in each case equals one-half the real-vector length (in Kdoubles) in the leftmost column. There was no AMD-specific optimization involved - this is all code developed and tuned for Intel CPUs. [Edit: See ||-build notes below about the 100 iters used for these timings likely being insufficient] Code:
Ryzen, AVX/0-thread:
1024  msec/iter =  16.143230  ROE[avg,max] = [0.237048340, 0.269531250]  radices = 32 16 32 32
1152  msec/iter =  18.393270  ROE[avg,max] = [0.273577009, 0.312500000]  radices = 36 16 32 32
1280  msec/iter =  20.434270  ROE[avg,max] = [0.278939383, 0.343750000]  radices = 40 16 32 32
1408  msec/iter =  23.969040  ROE[avg,max] = [0.311523438, 0.406250000]  radices = 44 16 32 32
1536  msec/iter =  23.938600  ROE[avg,max] = [0.251722935, 0.281250000]  radices = 48 16 32 32
1664  msec/iter =  28.809070  ROE[avg,max] = [0.308928571, 0.375000000]  radices = 52 16 32 32
1792  msec/iter =  30.127000  ROE[avg,max] = [0.351534598, 0.437500000]  radices = 56 16 32 32
1920  msec/iter =  33.393400  ROE[avg,max] = [0.297321429, 0.406250000]  radices = 60 16 32 32
2048  msec/iter =  34.487110  ROE[avg,max] = [0.240848214, 0.281250000]  radices = 64 16 32 32
2304  msec/iter =  40.226720  ROE[avg,max] = [0.249302455, 0.281250000]  radices = 36 32 32 32
2560  msec/iter =  44.287860  ROE[avg,max] = [0.256849888, 0.312500000]  radices = 160 16 16 32
2816  msec/iter =  50.539970  ROE[avg,max] = [0.281724330, 0.328125000]  radices = 176 16 16 32
3072  msec/iter =  52.569620  ROE[avg,max] = [0.245962960, 0.281250000]  radices = 48 32 32 32
3328  msec/iter =  60.861210  ROE[avg,max] = [0.316964286, 0.375000000]  radices = 52 32 32 32
3584  msec/iter =  62.958160  ROE[avg,max] = [0.286432757, 0.343750000]  radices = 224 16 16 32
3840  msec/iter =  69.900850  ROE[avg,max] = [0.253655134, 0.281250000]  radices = 240 16 16 32
4096  msec/iter =  73.305030  ROE[avg,max] = [0.259765625, 0.312500000]  radices = 256 16 16 32
4608  msec/iter =  82.375850  ROE[avg,max] = [0.279478237, 0.375000000]  radices = 288 16 16 32
5120  msec/iter =  92.422200  ROE[avg,max] = [0.303348214, 0.375000000]  radices = 160 16 32 32
5632  msec/iter = 103.692050  ROE[avg,max] = [0.287374442, 0.343750000]  radices = 176 16 32 32
6144  msec/iter = 114.081960  ROE[avg,max] = [0.279017857, 0.312500000]  radices = 192 16 32 32
6656  msec/iter = 141.714380  ROE[avg,max] = [0.347767857, 0.375000000]  radices = 52 16 16 16 16
7168  msec/iter = 131.530090  ROE[avg,max] = [0.286830357, 0.328125000]  radices = 224 16 32 32
7680  msec/iter = 140.589520  ROE[avg,max] = [0.265318080, 0.312500000]  radices = 240 16 32 32
Code:
Ryzen, AVX2/0-thread:
1024  msec/iter =  14.473480  ROE[avg,max] = [0.249674770, 0.312500000]  radices = 32 16 32 32
1152  msec/iter =  16.941660  ROE[avg,max] = [0.304101562, 0.375000000]  radices = 36 16 32 32
1280  msec/iter =  18.400400  ROE[avg,max] = [0.285825893, 0.375000000]  radices = 40 16 32 32
1408  msec/iter =  21.812400  ROE[avg,max] = [0.299107143, 0.375000000]  radices = 44 16 32 32
1536  msec/iter =  22.641650  ROE[avg,max] = [0.264965820, 0.312500000]  radices = 48 16 32 32
1664  msec/iter =  26.051310  ROE[avg,max] = [0.303417969, 0.375000000]  radices = 52 16 32 32
1792  msec/iter =  27.311240  ROE[avg,max] = [0.305301339, 0.375000000]  radices = 56 16 32 32
1920  msec/iter =  30.567500  ROE[avg,max] = [0.323883929, 0.437500000]  radices = 60 16 32 32
2048  msec/iter =  31.450460  ROE[avg,max] = [0.258858817, 0.312500000]  radices = 64 16 32 32
2304  msec/iter =  35.497940  ROE[avg,max] = [0.365848214, 0.437500000]  radices = 144 16 16 32
2560  msec/iter =  39.911440  ROE[avg,max] = [0.294642857, 0.375000000]  radices = 40 32 32 32
2816  msec/iter =  46.300510  ROE[avg,max] = [0.286802455, 0.343750000]  radices = 176 16 16 32
3072  msec/iter =  48.691550  ROE[avg,max] = [0.235825893, 0.281250000]  radices = 48 32 32 32
3328  msec/iter =  55.515420  ROE[avg,max] = [0.278913225, 0.343750000]  radices = 208 16 16 32
3584  msec/iter =  55.566890  ROE[avg,max] = [0.286143276, 0.328125000]  radices = 224 16 16 32
3840  msec/iter =  62.801760  ROE[avg,max] = [0.288204520, 0.347656250]  radices = 240 16 16 32
4096  msec/iter =  64.375370  ROE[avg,max] = [0.295214844, 0.343750000]  radices = 256 16 16 32
4608  msec/iter =  72.954530  ROE[avg,max] = [0.311607143, 0.375000000]  radices = 288 16 16 32
5120  msec/iter =  82.275550  ROE[avg,max] = [0.306975446, 0.375000000]  radices = 160 16 32 32
5632  msec/iter =  95.040700  ROE[avg,max] = [0.255600412, 0.281250000]  radices = 176 16 32 32
6144  msec/iter = 103.228320  ROE[avg,max] = [0.273018973, 0.343750000]  radices = 192 16 32 32
6656  msec/iter = 115.045360  ROE[avg,max] = [0.268750000, 0.312500000]  radices = 208 16 32 32
7168  msec/iter = 114.919310  ROE[avg,max] = [0.273074777, 0.312500000]  radices = 224 16 32 32
7680  msec/iter = 128.601060  ROE[avg,max] = [0.289223807, 0.343750000]  radices = 240 16 32 32
Last fiddled with by ewmayer on 2017-05-14 at 04:47 |
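The note about complex FFT radices (their product equals one-half the real-vector length) can be checked mechanically against any row of the tables above. A minimal sketch in C follows; the function name `radices_match_length` is hypothetical and not part of Mlucas, and it assumes 1 Kdouble = 1024 doubles, which is consistent with the rows shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not Mlucas code): returns 1 if the product of the
   complex FFT radices equals half the real-vector length, else 0.
   fft_kdoubles is the leftmost table column; 1 Kdouble is assumed to be
   1024 doubles. */
int radices_match_length(unsigned long fft_kdoubles,
                         const unsigned *radices, size_t n)
{
    assert(radices != NULL && n > 0);
    unsigned long long product = 1;
    for (size_t i = 0; i < n; i++)
        product *= radices[i];
    /* complex length = real length / 2 = fft_kdoubles * 1024 / 2 */
    return product == (unsigned long long)fft_kdoubles * 512ULL;
}
```

For the 1024K row, 32·16·32·32 = 524288 = 1024·512, so the check passes; the five-radix 6656K entry (52 16 16 16 16) passes as well.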
|
|
|
|
|
#57 |
|
∂²ω=0
Sep 2002
República de California
2²·2,939 Posts |
Here are benchmark timings for multithreaded builds of Mlucas on Ryzen. Some notes:
1. My above 'unthreaded' timings were for 100-iteration runs. It seems that was insufficient on Ryzen, because when I went to 1000-iter timings to allow for the timing decreases which accompany use of more than 1 thread, even the 1-thread timings dropped significantly versus the 100-iteration ones. For example, the per-iteration time for the AVX build @7168K drops from the 131 msec in the unthreaded-build-100-iter table to just 91 msec in the 1-thread column of the threaded-build-1000-iter table which follows.
2. Again due to the deeper 1000-iter runs, the roundoff errors captured in the table are larger. It's clear that I also need to fiddle my timing-test code to omit results having ROEs appreciably > 0.4 from the best-radix-set entries that get printed to the mlucas.cfg file. 0.40625 is probably OK (though maybe not for 100-iter runs), but 0.4375 is dangerously high, and e.g. 0.46875 is "right out", as the Monty Pythons would say. [Cf. Holy Hand Grenade scene in MP & The Holy Grail.]
3. Mlucas allows non-power-of-2 threadcounts but greatly prefers the power-of-2 ones, so I only did the latter.
4. AMD apparently has a different core numbering scheme than Intel - when I ran the first 2-thread benchmarks using the '-nthread 2' option, which sets affinities to cores 0 and 1, the timings were slower than 1-thread. Using the new-in-the-coming-release -cpu option I forced affinities to cores 0 and 2 via '-cpu 0,2', and got the expected 2-thread speedup. For 4 and 8 threads I used '-cpu 0:7:2' [equivalent to '-cpu 0,2,4,6'] and '-cpu 0:15:2' [equivalent to '-cpu 0,2,4,6,8,10,12,14'], respectively.
5. The 8-thread timings, especially for the smaller FFT lengths, are likely pessimistic, since startup overhead is non-negligible for that many threads even using 1000 iterations.
Ryzen, AVX build, msec/iter vs FFT length (Kdouble) for various threadcounts:
Code:
FFTlen  1-thr  2-thr  4-thr  8-thr
1024    11.67   6.24   3.77   2.40  ROE[avg,max] = [0.242096600, 0.312500000]
1152    13.47   7.14   4.13   2.96  ROE[avg,max] = [0.275115778, 0.375000000]
1280    14.81   7.88   4.64   3.24  ROE[avg,max] = [0.284061770, 0.406250000]
1408    16.94   8.88   5.26   3.51  ROE[avg,max] = [0.310743194, 0.468750000]
1536    17.75   9.34   5.20   3.57  ROE[avg,max] = [0.252182723, 0.343750000]
1664    20.45  10.74   6.30   4.29  ROE[avg,max] = [0.310800580, 0.406250000]
1792    21.24  11.13   6.10   4.20  ROE[avg,max] = [0.348934528, 0.468750000]
1920    23.62  12.28   6.87   4.74  ROE[avg,max] = [0.295699098, 0.406250000]
2048    24.02  12.59   6.95   4.83  ROE[avg,max] = [0.248437626, 0.320312500]
2304    27.91  14.72   7.96   5.43  ROE[avg,max] = [0.248899291, 0.312500000]
2560    30.55  16.07   8.90   5.94  ROE[avg,max] = [0.302806862, 0.375000000]
2816    34.92  18.18  10.09   6.61  ROE[avg,max] = [0.284329255, 0.375000000]
3072    36.52  19.12  10.83   7.23  ROE[avg,max] = [0.244108896, 0.312500000]
3328    42.00  22.02  12.74   8.42  ROE[avg,max] = [0.316897552, 0.437500000]
3584    43.51  22.70  12.51   8.34  ROE[avg,max] = [0.289033555, 0.437500000]
3840    48.30  25.20  13.63   8.83  ROE[avg,max] = [0.301240335, 0.375000000]
4096    50.49  26.29  14.41  10.01  ROE[avg,max] = [0.293798325, 0.437500000]
4608    57.54  29.73  16.26  10.75  ROE[avg,max] = [0.301216173, 0.406250000]
5120    64.50  33.36  18.01  12.04  ROE[avg,max] = [0.321669620, 0.406250000]
5632    72.71  37.62  20.33  13.39  ROE[avg,max] = [0.284785005, 0.375000000]
6144    77.17  40.38  22.42  14.81  ROE[avg,max] = [0.254623948, 0.343750000]
6656    88.26  46.11  27.05  18.87  ROE[avg,max] = [0.353221649, 0.437500000]
7168    90.96  47.11  25.80  16.98  ROE[avg,max] = [0.289598351, 0.375000000]
7680    99.00  50.93  27.62  18.63  ROE[avg,max] = [0.267126056, 0.437500000]
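The lo:hi:stride form of the -cpu argument in note 4 ('-cpu 0:7:2' being equivalent to '-cpu 0,2,4,6') can be illustrated with a small expansion routine. This is only a sketch of the equivalence stated in the post: `expand_cpu_range` is a hypothetical name, and nothing here is taken from the actual Mlucas option parser.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not Mlucas code): expand a "lo:hi:stride" triplet
   into explicit core indices. Writes at most max_out entries to out and
   returns the number written, or -1 if the spec is malformed. */
int expand_cpu_range(const char *spec, int *out, int max_out)
{
    assert(spec != NULL && out != NULL);
    int lo, hi, stride;
    if (sscanf(spec, "%d:%d:%d", &lo, &hi, &stride) != 3)
        return -1;                      /* not of the form lo:hi:stride */
    if (lo < 0 || hi < lo || stride <= 0)
        return -1;                      /* reject empty or backward ranges */
    int n = 0;
    for (int core = lo; core <= hi && n < max_out; core += stride)
        out[n++] = core;
    return n;
}
```

Given "0:15:2" and a 16-entry buffer, this yields the eight even-numbered cores 0 through 14, matching the 8-thread setting quoted in note 4.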
|
|
|
|
|
#58 |
|
∂²ω=0
Sep 2002
República de California
2²·2,939 Posts |
Ryzen, AVX2/FMA3 build, msec/iter vs FFT length (Kdouble) for various threadcounts:
Code:
FFTlen  1-thr  2-thr  4-thr  8-thr
1024    10.42   5.34   3.36   2.20  ROE[avg,max] = [0.249404939, 0.328125000]
1152    12.14   6.40   3.72   2.80  ROE[avg,max] = [0.302253644, 0.375000000]
1280    13.23   6.84   4.07   2.88  ROE[avg,max] = [0.285753262, 0.375000000]
1408    15.40   8.01   4.87   3.09  ROE[avg,max] = [0.300879913, 0.375000000]
1536    15.96   8.31   4.80   3.11  ROE[avg,max] = [0.265940841, 0.375000000]
1664    18.57   9.60   5.64   3.92  ROE[avg,max] = [0.310388813, 0.406250000]
1792    18.67   9.77   5.47   3.83  ROE[avg,max] = [0.310203065, 0.437500000]
1920    21.53  11.29   6.25   4.26  ROE[avg,max] = [0.324257007, 0.437500000]
2048    21.68  11.34   6.39   4.39  ROE[avg,max] = [0.241334140, 0.312500000]
2304    25.47  13.26   7.37   5.02  ROE[avg,max] = [0.234688230, 0.281250000]
2560    27.57  14.42   8.03   5.35  ROE[avg,max] = [0.297289787, 0.406250000]
2816    32.14  16.67   9.26   6.24  ROE[avg,max] = [0.241656117, 0.343750000]
3072    33.18  17.27   9.93   6.80  ROE[avg,max] = [0.234802388, 0.289062500]
3328    38.69  20.04  10.95   7.34  ROE[avg,max] = [0.308062178, 0.375000000]
3584    39.13  20.18  11.10   7.49  ROE[avg,max] = [0.287800268, 0.375000000]
3840    44.07  22.50  12.36   8.48  ROE[avg,max] = [0.288700568, 0.355468750]
4096    44.67  23.29  13.33   9.33  ROE[avg,max] = [0.284906635, 0.359375000]
4608    51.83  26.58  14.70   9.78  ROE[avg,max] = [0.294995369, 0.375000000]
5120    56.91  29.36  16.57  11.11  ROE[avg,max] = [0.340822043, 0.437500000]
5632    66.01  34.16  18.99  12.39  ROE[avg,max] = [0.296337954, 0.406250000]
6144    68.74  35.72  20.63  13.88  ROE[avg,max] = [0.303176707, 0.390625000]
6656    79.48  40.54  22.12  15.02  ROE[avg,max] = [0.270511965, 0.375000000]
7168    80.03  40.97  23.16  15.74  ROE[avg,max] = [0.272298848, 0.343750000]
7680    89.57  45.75  25.08  17.17  ROE[avg,max] = [0.287253405, 0.375000000]
Off to bed ...
Last fiddled with by ewmayer on 2017-05-14 at 04:48 |
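Several entries in these tables carry a max ROE of 0.4375 or higher, which the earlier notes call dangerously high. The screening proposed there (leave out radix sets whose ROE is appreciably above 0.4 when picking the best entry for mlucas.cfg) might look roughly like the sketch below. The struct, the function name and the idea of passing the cutoff as a parameter are all assumptions made for illustration; the post only suggests that 0.40625 is probably acceptable, and this is not the timing-test code itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one candidate radix set at a given FFT length. */
struct timing {
    double msec_per_iter;   /* measured time per iteration */
    double roe_max;         /* maximum roundoff error seen in the run */
};

/* Hypothetical helper (not Mlucas code): index of the fastest candidate
   whose max ROE does not exceed roe_limit, or -1 if none qualifies. */
int best_safe_config(const struct timing *t, size_t n, double roe_limit)
{
    assert(t != NULL);
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (t[i].roe_max > roe_limit)
            continue;       /* too close to 0.5: skip even if it is fastest */
        if (best < 0 || t[i].msec_per_iter < t[best].msec_per_iter)
            best = (int)i;
    }
    return best;
}
```

Called with roe_limit = 0.40625, a candidate with max ROE 0.4375 is skipped in favor of the next-fastest radix set, at the cost of a slightly slower cfg entry.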
|
|
|
|
|
#59 | |
|
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/
2⁴×199 Posts |
Quote:
|
|
|
|
|
|
|
#60 |
|
∂²ω=0
Sep 2002
República de California
2²×2,939 Posts |
In relation to the verify runs of the new M-prime candidate, forumites Andreas Höglund [ATH] and Gord Palameta [GP2] both hit errors in building 17.1 for avx-512 - turns out some preprocessor-logic I added in relation to supporting ARMv8 SIMD (see the "ARM builds..." thread) broke an assumption implicit in several of the carry-radix files when built in avx-512 mode. Clearly, I need to do more thorough QA work going forward.
Patched 17.1 version has been successfully built by Andreas and uploaded by me. |
|
|
|
|
|
#61 |
|
If I May
"Chris Halsall"
Sep 2002
Barbados
2×11²×47 Posts |
|
|
|
|
|
|
#62 |
|
Oct 2017
++41
53 Posts |
compiling mlucas 17.1 fails on my raspberry pi 3 running raspbian stretch:
Code:
../src/util.c: In function ‘has_asimd’:
../src/util.c:1806:16: error: ‘HWCAP_ASIMD’ undeclared (first use in this function)
  if (hwcaps & HWCAP_ASIMD) {
               ^~~~~~~~~~~
../src/util.c:1806:16: note: each undeclared identifier is reported only once for each function it appears in
|
|
|
|
|
|
#63 |
|
Jan 2008
France
3×199 Posts |
Is raspbian 64-bit? If not then it's possible HWCAP_ASIMD might not be defined.
|
|
|
|
|
|
#64 |
|
Oct 2017
++41
53 Posts |
No it's 32-Bit. I've read that Raspbian sticks to 32-Bit, so no 64-Bit Raspbian in near future.
|
|
|
|
|
|
#65 | |
|
Banned
"Luigi"
Aug 2002
Team Italia
5×7×139 Posts |
Quote:
https://github.com/sakaki-/gentoo-on-rpi3-64bit It let me compile Mlucas, and it helped another forumite do the same. |
|
|
|
|
|
|
#66 | |
|
∂²ω=0
Sep 2002
República de California
2²·2,939 Posts |
Quote:
Code:
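#include <sys/auxv.h>   // getauxval(), AT_HWCAP (Linux/glibc)

int has_asimd(void)
{
    unsigned long hwcaps = getauxval(AT_HWCAP);
#ifndef HWCAP_ASIMD // This is not def'd on pre-ASIMD platforms
    // Zero mask: the test below can never succeed, so ASIMD is reported absent
    const unsigned long HWCAP_ASIMD = 0;
#endif
    if (hwcaps & HWCAP_ASIMD) {
        return 1;
    }
    return 0;
}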
Last fiddled with by ewmayer on 2017-12-29 at 23:27 |
|
|
|
|