#111
∂²ω=0
Sep 2002
República de California
2²×2,939 Posts
v17.1 is uploaded - see the readme page for details and newly-added ARM SIMD build instructions. Again, no new code for x86 in this release. The ARM inline-asm can be found inside USE_ARM_V8_SIMD-tagged preprocessor sections in the various .h files. Use a 4-column tab to view, since most of my inline-asm is in a multicolumn format (typically 2 columns of independent instructions, e.g. one for real parts of complex FFT computation, one for imaginary, let's meet up occasionally and swap dance partners, that sort of thing) which needs 4-column tabbing to line up properly.
TomW and other users of higher-end ARM cores (e.g. v9, A57, and such), I'd be appreciative if you could build both a SIMD and a scalar-double binary in separate obj-code directories, and run 4-threaded self-tests via 'Mlucas -sm -cpu 0:3' using each binary, so we can see not only how your absolute runtimes compare to my A53 ones, but also how the SIMD-associated speedup compares to mine of ~1.5x. (Non-SIMD build means just omitting the -DUSE_ARM_V8_SIMD flag from your compile line; both builds still need -DUSE_THREADS.) Also, if anyone finds any ARM-specific -march flags that give an appreciable speed boost over the non-march-using build, I'm all ears. Off to bed!
Last fiddled with by ewmayer on 2017-11-09 at 08:37
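The two-build comparison requested above could be scripted roughly as follows. This is only a sketch: the compile and link options are the ones quoted in this thread, while the helper name `build_commands`, the directory names `obj_simd` and `obj_scalar`, and the assumption that each object directory sits next to `src/` are illustrative and not taken from the Mlucas readme.

```python
def build_commands(simd):
    """Return the shell commands for one Mlucas build (hypothetical helper).

    The commands are meant to be run from inside an object directory that
    is assumed to be a sibling of src/, so the sources are ../src/*.c.
    """
    flags = ["-c", "-O3", "-DUSE_THREADS"]  # both builds need -DUSE_THREADS
    if simd:
        # Omitting this flag gives the scalar-double build.
        flags.append("-DUSE_ARM_V8_SIMD")
    return [
        "gcc " + " ".join(flags) + " ../src/*.c",
        "gcc -o Mlucas *.o -lm -lpthread -lrt",
        "./Mlucas -sm -cpu 0:3",  # 4-threaded self-test, as requested above
    ]


if __name__ == "__main__":
    # Directory names are placeholders; one directory per build.
    for obj_dir, simd in (("obj_simd", True), ("obj_scalar", False)):
        print(f"# run inside {obj_dir}/")
        for cmd in build_commands(simd):
            print(cmd)
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; the output can be pasted into a shell on the ARM box.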

#112
Sep 2003
2×5×7×37 Posts
I guess when compiling with gcc and targeting the ODROID-C2 you would use these flags:
Code:
-march=armv8-a -mtune=cortex-a53 -mcpu=cortex-a53
For platforms other than ODROID-C2,
the allowable values for -march are: armv8-a, armv8.1-a
the allowable values for -mtune and -mcpu are: generic, cortex-a35, cortex-a53, cortex-a57, cortex-a72, exynos-m1, qdf24xx, thunderx, xgene1
Last fiddled with by GP2 on 2017-11-09 at 09:24

#113
(loop (#_fork))
Feb 2006
Cambridge, England
2·7·461 Posts
On Cortex-A57, compiling with
> gcc -c -O3 -DUSE_ARM_V8_SIMD -DUSE_THREADS ../src/*.c
> gcc -o Mlucas *.o -lm -lpthread -lrt
> gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
I get a segfault at line 5051 of ../src/dft_macro.c
The faulting instruction is a load from x22, which equals zero at this stage;
Code:
=> 0x0000000000435974 <+3684>: ldr x13, [x22]
Code:
435854: 9e6602d6 fmov x22, d22
./Mlucas.72 -fftlen 192 -iters 100 -radset 0 -cpu 0:3
(I was initially getting problems with nanosleep failing, but that turned out to be because I'd built with -g -pg in order to debug the problem above)

#114
(loop (#_fork))
Feb 2006
Cambridge, England
2·7·461 Posts
17.1
Code:
1024 msec/iter = 13.05 ROE[avg,max] = [0.244866071, 0.281250000] radices = 32 16 32 32 0 0 0 0 0 0
1152 msec/iter = 16.18 ROE[avg,max] = [0.221044922, 0.250000000] radices = 288 8 16 16 0 0 0 0 0 0
1280 msec/iter = 17.13 ROE[avg,max] = [0.240101842, 0.281250000] radices = 40 32 32 16 0 0 0 0 0 0
1408 msec/iter = 20.89 ROE[avg,max] = [0.227343750, 0.265625000] radices = 176 16 16 16 0 0 0 0 0 0
1536 msec/iter = 21.52 ROE[avg,max] = [0.272656250, 0.312500000] radices = 48 16 32 32 0 0 0 0 0 0
1664 msec/iter = 25.03 ROE[avg,max] = [0.270758929, 0.312500000] radices = 208 16 16 16 0 0 0 0 0 0
1792 msec/iter = 25.81 ROE[avg,max] = [0.230078125, 0.281250000] radices = 56 16 32 32 0 0 0 0 0 0
1920 msec/iter = 29.41 ROE[avg,max] = [0.257756696, 0.312500000] radices = 240 16 16 16 0 0 0 0 0 0
2048 msec/iter = 29.59 ROE[avg,max] = [0.228236607, 0.281250000] radices = 32 32 32 32 0 0 0 0 0 0
2304 msec/iter = 35.03 ROE[avg,max] = [0.272405134, 0.343750000] radices = 144 16 16 32 0 0 0 0 0 0
2560 msec/iter = 38.10 ROE[avg,max] = [0.236383929, 0.312500000] radices = 160 16 16 32 0 0 0 0 0 0
2816 msec/iter = 43.48 ROE[avg,max] = [0.260044643, 0.312500000] radices = 176 16 16 32 0 0 0 0 0 0
3072 msec/iter = 46.82 ROE[avg,max] = [0.224818638, 0.251953125] radices = 48 32 32 32 0 0 0 0 0 0
3328 msec/iter = 51.19 ROE[avg,max] = [0.279017857, 0.343750000] radices = 208 16 16 32 0 0 0 0 0 0
3584 msec/iter = 55.20 ROE[avg,max] = [0.252566964, 0.312500000] radices = 224 16 16 32 0 0 0 0 0 0
3840 msec/iter = 59.54 ROE[avg,max] = [0.249302455, 0.343750000] radices = 240 16 16 32 0 0 0 0 0 0
4096 msec/iter = 63.40 ROE[avg,max] = [0.229129464, 0.281250000] radices = 256 16 16 32 0 0 0 0 0 0
4608 msec/iter = 71.86 ROE[avg,max] = [0.249079241, 0.281250000] radices = 288 16 16 32 0 0 0 0 0 0
5120 msec/iter = 78.94 ROE[avg,max] = [0.237137277, 0.281250000] radices = 160 32 32 16 0 0 0 0 0 0
5632 msec/iter = 90.14 ROE[avg,max] = [0.256919643, 0.312500000] radices = 176 32 32 16 0 0 0 0 0 0
6144 msec/iter = 99.56 ROE[avg,max] = [0.246651786, 0.281250000] radices = 192 32 32 16 0 0 0 0 0 0
6656 msec/iter = 106.15 ROE[avg,max] = [0.262500000, 0.312500000] radices = 208 32 32 16 0 0 0 0 0 0
7168 msec/iter = 113.98 ROE[avg,max] = [0.224874442, 0.281250000] radices = 224 32 32 16 0 0 0 0 0 0
7680 msec/iter = 123.41 ROE[avg,max] = [0.237053571, 0.281250000] radices = 240 32 32 16 0 0 0 0 0 0

#115
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
1011111111101₂ Posts
What is the clock speed of that A57? It beats mprime on a Core 2.

#116
Sep 2003
2×5×7×37 Posts
Quote:
gcc -c -O3 -DUSE_ARM_V8_SIMD -DUSE_THREADS -march=armv8-a -mtune=cortex-a57 -mcpu=cortex-a57 ../src/*.c
(or armv8.1-a if applicable)
Last fiddled with by GP2 on 2017-11-09 at 12:13

#117
(loop (#_fork))
Feb 2006
Cambridge, England
14466₈ Posts
Code:
17.1
1024 msec/iter = 6.75 ROE[avg,max] = [0.241730428, 0.343750000] radices = 32 32 32 16 0 0 0 0 0 0
1152 msec/iter = 8.60 ROE[avg,max] = [0.223343084, 0.312500000] radices = 288 8 16 16 0 0 0 0 0 0
1280 msec/iter = 9.45 ROE[avg,max] = [0.264203447, 0.375000000] radices = 160 16 16 16 0 0 0 0 0 0
1408 msec/iter = 10.83 ROE[avg,max] = [0.228562219, 0.312500000] radices = 176 16 16 16 0 0 0 0 0 0
1536 msec/iter = 11.14 ROE[avg,max] = [0.265776015, 0.343750000] radices = 48 32 32 16 0 0 0 0 0 0
1664 msec/iter = 12.98 ROE[avg,max] = [0.272265625, 0.406250000] radices = 208 16 16 16 0 0 0 0 0 0
1792 msec/iter = 13.89 ROE[avg,max] = [0.222711150, 0.312500000] radices = 224 16 16 16 0 0 0 0 0 0
1920 msec/iter = 15.09 ROE[avg,max] = [0.255116130, 0.375000000] radices = 240 16 16 16 0 0 0 0 0 0
2048 msec/iter = 15.53 ROE[avg,max] = [0.225500362, 0.281250000] radices = 64 16 32 32 0 0 0 0 0 0
2304 msec/iter = 18.08 ROE[avg,max] = [0.270892397, 0.375000000] radices = 144 32 16 16 0 0 0 0 0 0
2560 msec/iter = 19.53 ROE[avg,max] = [0.236825121, 0.312500000] radices = 160 16 16 32 0 0 0 0 0 0
2816 msec/iter = 22.30 ROE[avg,max] = [0.260065641, 0.375000000] radices = 176 16 16 32 0 0 0 0 0 0
3072 msec/iter = 24.82 ROE[avg,max] = [0.266442494, 0.375000000] radices = 192 16 16 32 0 0 0 0 0 0
3328 msec/iter = 26.32 ROE[avg,max] = [0.280374114, 0.375000000] radices = 208 16 16 32 0 0 0 0 0 0
3584 msec/iter = 28.36 ROE[avg,max] = [0.251112846, 0.375000000] radices = 224 32 16 16 0 0 0 0 0 0
3840 msec/iter = 30.63 ROE[avg,max] = [0.247425197, 0.343750000] radices = 240 16 16 32 0 0 0 0 0 0
4096 msec/iter = 32.55 ROE[avg,max] = [0.227451789, 0.281250000] radices = 256 16 16 32 0 0 0 0 0 0
4608 msec/iter = 36.98 ROE[avg,max] = [0.248973603, 0.312500000] radices = 288 16 16 32 0 0 0 0 0 0
5120 msec/iter = 41.51 ROE[avg,max] = [0.236476812, 0.312500000] radices = 160 32 32 16 0 0 0 0 0 0
5632 msec/iter = 46.98 ROE[avg,max] = [0.259536082, 0.343750000] radices = 176 32 32 16 0 0 0 0 0 0
6144 msec/iter = 52.13 ROE[avg,max] = [0.245962619, 0.343750000] radices = 192 32 32 16 0 0 0 0 0 0
6656 msec/iter = 55.55 ROE[avg,max] = [0.266108247, 0.375000000] radices = 208 32 32 16 0 0 0 0 0 0
7168 msec/iter = 59.58 ROE[avg,max] = [0.225737707, 0.312500000] radices = 224 32 32 16 0 0 0 0 0 0
7680 msec/iter = 74.81 ROE[avg,max] = [0.245511100, 0.312500000] radices = 960 16 16 16 0 0 0 0 0 0

#118
(loop (#_fork))
Feb 2006
Cambridge, England
2×7×461 Posts
I think these are 2 GHz Cortex-A57 cores, but I'm not sure.

#119
Jan 2008
France
3·199 Posts
I tested the program on an ARM server with gcc 7.2 and it works. Can't comment on the CPU and the performance, sorry.
The SIMD speedup is no better than what Ernst found on the A53.

#120
(loop (#_fork))
Feb 2006
Cambridge, England
2·7·461 Posts
A57, no SIMD code
Code:
17.1
1024 msec/iter = 73.81 ROE[avg,max] = [0.227287946, 0.250000000] radices = 16 8 16 16 16 0 0 0 0 0
1152 msec/iter = 84.95 ROE[avg,max] = [0.208060128, 0.250000000] radices = 18 8 16 16 16 0 0 0 0 0
1280 msec/iter = 94.46 ROE[avg,max] = [0.312946429, 0.375000000] radices = 20 8 16 16 16 0 0 0 0 0
1408 msec/iter = 110.92 ROE[avg,max] = [0.218861607, 0.250000000] radices = 22 8 16 16 16 0 0 0 0 0
1536 msec/iter = 120.33 ROE[avg,max] = [0.211012486, 0.250000000] radices = 12 16 16 16 16 0 0 0 0 0
1664 msec/iter = 131.91 ROE[avg,max] = [0.245089286, 0.312500000] radices = 26 8 16 16 16 0 0 0 0 0
1792 msec/iter = 146.44 ROE[avg,max] = [0.218750000, 0.281250000] radices = 14 16 16 16 16 0 0 0 0 0
1920 msec/iter = 171.90 ROE[avg,max] = [0.258705357, 0.312500000] radices = 240 16 16 16 0 0 0 0 0 0
2048 msec/iter = 164.37 ROE[avg,max] = [0.258705357, 0.281250000] radices = 16 16 16 16 16 0 0 0 0 0
2304 msec/iter = 189.44 ROE[avg,max] = [0.211614118, 0.265625000] radices = 18 16 16 16 16 0 0 0 0 0
2560 msec/iter = 212.29 ROE[avg,max] = [0.228222656, 0.312500000] radices = 20 16 16 16 16 0 0 0 0 0
2816 msec/iter = 251.39 ROE[avg,max] = [0.213448661, 0.250000000] radices = 22 16 16 16 16 0 0 0 0 0
3072 msec/iter = 256.01 ROE[avg,max] = [0.226297433, 0.281250000] radices = 24 16 16 16 16 0 0 0 0 0
3328 msec/iter = 297.24 ROE[avg,max] = [0.249005999, 0.312500000] radices = 26 16 16 16 16 0 0 0 0 0
3584 msec/iter = 317.12 ROE[avg,max] = [0.216671317, 0.250000000] radices = 28 16 16 16 16 0 0 0 0 0
3840 msec/iter = 354.41 ROE[avg,max] = [0.220647321, 0.250000000] radices = 960 8 16 16 0 0 0 0 0 0
4096 msec/iter = 363.13 ROE[avg,max] = [0.215354701, 0.250000000] radices = 1024 8 16 16 0 0 0 0 0 0
4608 msec/iter = 425.03 ROE[avg,max] = [0.233621652, 0.312500000] radices = 144 8 8 16 16 0 0 0 0 0
5120 msec/iter = 475.89 ROE[avg,max] = [0.325892857, 0.375000000] radices = 20 16 16 16 32 0 0 0 0 0
5632 msec/iter = 525.06 ROE[avg,max] = [0.216992187, 0.265625000] radices = 176 8 8 16 16 0 0 0 0 0
6144 msec/iter = 555.82 ROE[avg,max] = [0.213281250, 0.250000000] radices = 24 16 16 16 32 0 0 0 0 0
6656 msec/iter = 623.24 ROE[avg,max] = [0.319866071, 0.375000000] radices = 208 8 8 16 16 0 0 0 0 0
7168 msec/iter = 650.94 ROE[avg,max] = [0.330357143, 0.375000000] radices = 224 8 8 16 16 0 0 0 0 0
7680 msec/iter = 765.44 ROE[avg,max] = [0.227678571, 0.281250000] radices = 60 16 16 16 16 0 0 0 0 0
Code:
1024 msec/iter = 51.50 ROE[avg,max] = [0.241406250, 0.312500000] radices = 32 32 32 16 0 0 0 0 0 0
1152 msec/iter = 62.95 ROE[avg,max] = [0.221268136, 0.250000000] radices = 288 8 16 16 0 0 0 0 0 0
1280 msec/iter = 67.31 ROE[avg,max] = [0.240101842, 0.281250000] radices = 40 32 32 16 0 0 0 0 0 0
1408 msec/iter = 80.51 ROE[avg,max] = [0.244977679, 0.312500000] radices = 44 32 32 16 0 0 0 0 0 0
1536 msec/iter = 82.41 ROE[avg,max] = [0.227845982, 0.281250000] radices = 24 32 32 32 0 0 0 0 0 0
1664 msec/iter = 96.59 ROE[avg,max] = [0.222488839, 0.250000000] radices = 52 32 32 16 0 0 0 0 0 0
1792 msec/iter = 98.71 ROE[avg,max] = [0.225558036, 0.250000000] radices = 56 32 32 16 0 0 0 0 0 0
1920 msec/iter = 111.71 ROE[avg,max] = [0.234137835, 0.265625000] radices = 60 32 32 16 0 0 0 0 0 0
2048 msec/iter = 109.88 ROE[avg,max] = [0.228236607, 0.281250000] radices = 32 32 32 32 0 0 0 0 0 0
2304 msec/iter = 133.25 ROE[avg,max] = [0.250456892, 0.312500000] radices = 36 32 32 32 0 0 0 0 0 0
2560 msec/iter = 141.38 ROE[avg,max] = [0.245999581, 0.312500000] radices = 40 32 32 32 0 0 0 0 0 0
2816 msec/iter = 166.02 ROE[avg,max] = [0.263392857, 0.312500000] radices = 176 32 16 16 0 0 0 0 0 0
3072 msec/iter = 174.12 ROE[avg,max] = [0.224818638, 0.251953125] radices = 48 32 32 32 0 0 0 0 0 0
3328 msec/iter = 195.51 ROE[avg,max] = [0.280803571, 0.375000000] radices = 208 32 16 16 0 0 0 0 0 0
3584 msec/iter = 205.38 ROE[avg,max] = [0.223172433, 0.250000000] radices = 56 32 32 32 0 0 0 0 0 0
3840 msec/iter = 226.93 ROE[avg,max] = [0.249302455, 0.343750000] radices = 240 16 16 32 0 0 0 0 0 0
4096 msec/iter = 245.07 ROE[avg,max] = [0.228655134, 0.281250000] radices = 256 32 16 16 0 0 0 0 0 0
4608 msec/iter = 290.88 ROE[avg,max] = [0.249079241, 0.281250000] radices = 288 16 16 32 0 0 0 0 0 0
5120 msec/iter = 305.20 ROE[avg,max] = [0.237137277, 0.281250000] radices = 160 32 32 16 0 0 0 0 0 0
5632 msec/iter = 379.67 ROE[avg,max] = [0.232352121, 0.281250000] radices = 176 8 8 16 16 0 0 0 0 0
6144 msec/iter = 399.26 ROE[avg,max] = [0.234695871, 0.281250000] radices = 48 16 16 16 16 0 0 0 0 0
6656 msec/iter = 402.55 ROE[avg,max] = [0.262500000, 0.312500000] radices = 208 32 32 16 0 0 0 0 0 0
7168 msec/iter = 430.83 ROE[avg,max] = [0.225097656, 0.281250000] radices = 224 32 32 16 0 0 0 0 0 0
7680 msec/iter = 466.83 ROE[avg,max] = [0.237053571, 0.281250000] radices = 240 32 32 16 0 0 0 0 0 0
Last fiddled with by fivemack on 2017-11-09 at 15:57
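For anyone wanting to put two such listings side by side, a small parser along the following lines may help. It is a rough sketch, not part of Mlucas: the function names are made up here, the leading number on each row is taken to be the FFT length, and the sample rows are copied from the two tables in this post. Since the label of the second table did not survive, the resulting ratios are illustrative only, and a ratio is meaningful only when both runs used the same thread count.

```python
import re

# Matches the start of a row such as "1024 msec/iter = 73.81 ROE[avg,max] = ..."
ROW = re.compile(r"^\s*(\d+)\s+msec/iter\s*=\s*([\d.]+)")


def parse_timings(text):
    """Map FFT length -> msec/iter for every timing row found in text."""
    timings = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:  # lines such as "17.1" or "Code:" are skipped
            timings[int(m.group(1))] = float(m.group(2))
    return timings


def ratios(slow, fast):
    """Per-FFT-length ratio slow/fast for lengths present in both tables."""
    return {n: slow[n] / fast[n] for n in sorted(slow) if n in fast}


# Three rows from each table in this post (trailing columns omitted).
table_a = """
1024 msec/iter = 73.81
1152 msec/iter = 84.95
2048 msec/iter = 164.37
"""
table_b = """
1024 msec/iter = 51.50
1152 msec/iter = 62.95
2048 msec/iter = 109.88
"""

if __name__ == "__main__":
    for n, r in ratios(parse_timings(table_a), parse_timings(table_b)).items():
        print(f"{n:5d}  {r:.2f}x")
```

Pasting the full listings into `table_a` and `table_b` works unchanged, because the pattern ignores everything after the msec/iter value.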

#121
(loop (#_fork))
Feb 2006
Cambridge, England
2×7×461 Posts
Compare:
(e5/2690v3, AVX2)
Code:
1c 1024 msec/iter = 10.73 ROE[avg,max] = [0.231110491, 0.281250000] radices = 32 16 32 32 0 0 0 0 0 0
4c 1024 msec/iter = 4.33 ROE[avg,max] = [0.234014452, 0.312500000] radices = 32 16 32 32 0 0 0 0 0 0
8c 1024 msec/iter = 2.44 ROE[avg,max] = [0.234014452, 0.312500000] radices = 32 16 32 32 0 0 0 0 0 0
16c 1024 msec/iter = 1.28 ROE[avg,max] = [0.234014452, 0.312500000] radices = 32 16 32 32 0 0 0 0 0 0
Last fiddled with by fivemack on 2017-11-09 at 16:02
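The multi-core numbers above can be turned into speedup and parallel-efficiency figures with a few lines of Python. This is just a sketch: it assumes the "1c/4c/8c/16c" labels mean the number of cores used, and the function name `scaling` is made up for illustration.

```python
# msec/iter at FFT length 1024 from the listing above, keyed by the
# number in the "1c/4c/8c/16c" label (assumed to be the core count).
timings = {1: 10.73, 4: 4.33, 8: 2.44, 16: 1.28}


def scaling(timings):
    """Return {cores: (speedup, efficiency)} relative to the 1-core time."""
    base = timings[1]
    return {n: (base / t, base / t / n) for n, t in sorted(timings.items())}


if __name__ == "__main__":
    for n, (speedup, eff) in scaling(timings).items():
        print(f"{n:2d} cores: speedup {speedup:.2f}x, efficiency {eff:.0%}")
```

Efficiency here is simply speedup divided by core count, so a value of 1.0 would mean perfect linear scaling.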