#23
Aug 2010
Republic of Belarus
2×89 Posts
Hello! Benchmark for v18 on Ampere eMAG 32-Core @ 3.3GHz using pre-built Mlucas_v18_c2simd.
Code:
root@lorenzoArm:~/mersenne/arm8# lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  1
Core(s) per socket:  32
Socket(s):           1
NUMA node(s):        1
CPU max MHz:         3300.0000
CPU min MHz:         363.9700
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
NUMA node0 CPU(s):   0-31
Code:
root@lorenzoArm:~/mersenne/arm8# cat /proc/cpuinfo
processor        : 0
BogoMIPS         : 90.00
Features         : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer  : 0x50
CPU architecture : 8
CPU variant      : 0x3
CPU part         : 0x000
CPU revision     : 2
Code:
root@lorenzoArm:~/mersenne/arm8# cat mlucas.cfg
18.0
2048 msec/iter = 19.63 ROE[avg,max] = [0.000307249, 0.375000000] radices = 128 32 16 16 0 0 0 0 0 0
2304 msec/iter = 19.88 ROE[avg,max] = [0.000272423, 0.375000000] radices = 144 32 16 16 0 0 0 0 0 0
2560 msec/iter = 22.07 ROE[avg,max] = [0.000281943, 0.375000000] radices = 160 8 8 8 16 0 0 0 0 0
2816 msec/iter = 22.07 ROE[avg,max] = [0.000260572, 0.312500000] radices = 176 16 16 32 0 0 0 0 0 0
3072 msec/iter = 22.24 ROE[avg,max] = [0.000265834, 0.375000000] radices = 192 16 16 32 0 0 0 0 0 0
3328 msec/iter = 23.63 ROE[avg,max] = [0.000281118, 0.375000000] radices = 208 16 16 32 0 0 0 0 0 0
3584 msec/iter = 25.02 ROE[avg,max] = [0.000250660, 0.343750000] radices = 224 32 16 16 0 0 0 0 0 0
3840 msec/iter = 26.60 ROE[avg,max] = [0.000222911, 0.312500000] radices = 60 32 32 32 0 0 0 0 0 0
4096 msec/iter = 25.42 ROE[avg,max] = [0.000244299, 0.312500000] radices = 64 32 32 32 0 0 0 0 0 0
4608 msec/iter = 30.73 ROE[avg,max] = [0.000298148, 0.375000000] radices = 144 8 8 16 16 0 0 0 0 0
5120 msec/iter = 31.50 ROE[avg,max] = [0.000235369, 0.312500000] radices = 160 32 32 16 0 0 0 0 0 0
5632 msec/iter = 33.74 ROE[avg,max] = [0.000257523, 0.343750000] radices = 176 32 32 16 0 0 0 0 0 0
6144 msec/iter = 36.94 ROE[avg,max] = [0.000247058, 0.312500000] radices = 192 32 32 16 0 0 0 0 0 0
6656 msec/iter = 36.74 ROE[avg,max] = [0.000313628, 0.406250000] radices = 208 8 8 16 16 0 0 0 0 0
7168 msec/iter = 36.94 ROE[avg,max] = [0.000233152, 0.312500000] radices = 224 8 8 16 16 0 0 0 0 0
7680 msec/iter = 36.94 ROE[avg,max] = [0.000246354, 0.312500000] radices = 240 8 8 16 16 0 0 0 0 0
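For anyone who wants to compare cfg files like the one above by script, here is a minimal Python sketch. It assumes only the line layout visible in this post; the function names (`parse_cfg`, `radices_consistent`, `slower_than_a_larger_fft`) are made up for illustration and are not part of Mlucas. The radix check relies on an observation from the posted numbers, namely that the nonzero radices multiply to half the FFT length (e.g. 64 × 32 × 32 × 32 = 4096K / 2).

```python
import re
from math import prod

# One timing line of the cfg file shown above, e.g.
# "4096 msec/iter = 25.42 ROE[avg,max] = [0.000244299, 0.312500000] radices = 64 32 32 32 0 ..."
CFG_LINE = re.compile(
    r"^\s*(\d+)\s+msec/iter\s*=\s*([\d.]+)\s+"
    r"ROE\[avg,max\]\s*=\s*\[\s*([\d.]+)\s*,\s*([\d.]+)\s*\]\s+"
    r"radices\s*=\s*([\d\s]+)$"
)

def parse_cfg(text):
    """Return one dict per timing line; other lines (e.g. the '18.0' header) are skipped."""
    rows = []
    for line in text.splitlines():
        m = CFG_LINE.match(line)
        if not m:
            continue
        fft_k, msec, roe_avg, roe_max, radices = m.groups()
        rows.append({
            "fft_k": int(fft_k),
            "msec_per_iter": float(msec),
            "roe_avg": float(roe_avg),
            "roe_max": float(roe_max),
            # Trailing zeros are padding; keep only the radices actually used.
            "radices": [int(r) for r in radices.split() if r != "0"],
        })
    return rows

def radices_consistent(row):
    """True if the radices multiply to half the FFT length (1K = 1024 doubles)."""
    return prod(row["radices"]) == row["fft_k"] * 1024 // 2

def slower_than_a_larger_fft(rows):
    """FFT lengths (in K) for which some larger FFT length has a strictly lower msec/iter."""
    return [
        r["fft_k"] for r in rows
        if any(s["fft_k"] > r["fft_k"] and s["msec_per_iter"] < r["msec_per_iter"]
               for s in rows)
    ]

# Three lines copied from the cfg file above.
SAMPLE = """18.0
3584 msec/iter = 25.02 ROE[avg,max] = [0.000250660, 0.343750000] radices = 224 32 16 16 0 0 0 0 0 0
3840 msec/iter = 26.60 ROE[avg,max] = [0.000222911, 0.312500000] radices = 60 32 32 32 0 0 0 0 0 0
4096 msec/iter = 25.42 ROE[avg,max] = [0.000244299, 0.312500000] radices = 64 32 32 32 0 0 0 0 0 0
"""

if __name__ == "__main__":
    rows = parse_cfg(SAMPLE)
    for r in rows:
        print(r["fft_k"], r["msec_per_iter"], r["radices"], radices_consistent(r))
    print(slower_than_a_larger_fft(rows))
```

On the three sample lines this flags 3840K, which in the posted timings is slower than 4096K.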
#24
Aug 2010
Republic of Belarus
2·89 Posts
Just FYI
Code:
root@lorenzoArm:~/mersenne/arm8# ./Mlucas_v18_c2simd -fftlen 18432 -iters 100 -cpu 0:7
Mlucas 18.0
http://www.mersenneforum.org/mayer/README.html
INFO: testing qfloat routines...
CPU Family = ARM Embedded ABI, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 5.4.0 20160609.
INFO: Build uses ARMv8 advanced-SIMD instruction set.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: MLUCAS_PATH is set to ""
INFO: using 53-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
Setting DAT_BITS = 10, PAD_BITS = 2
INFO: testing IMUL routines...
INFO: System has 32 available processor cores.
INFO: testing FFT radix tables...
Set affinity for the following 8 cores: 0.1.2.3.4.5.6.7.
Mlucas selftest running.....
/****************************************************************************/
NTHREADS = 8
M337615261: using FFT length 18432K = 18874368 8-byte floats, initial residue shift count = 49407158
this gives an average 17.887500180138481 bits per digit
Using complex FFT radices 288 32 32 32
mers_mod_square: Init threadpool of 8 threads
radix16_dif_dit_pass pfetch_dist = 32
radix16_wrapper_square: pfetch_dist = 1024
Using 8 threads in carry step
100 iterations of M337615261 with FFT length 18874368 = 18432 K, final residue shift count = 321038982
Res64: 69FF742497F16902. AvgMaxErr = 0.003191964. MaxErr = 0.375000000. Program: E18.0
Res mod 2^36 = 19729049858
Res mod 2^35 - 1 = 20161851329
Res mod 2^36 - 1 = 1044285462
Clocks = 00:00:21.067
NTHREADS = 8
M337615261: using FFT length 18432K = 18874368 8-byte floats, initial residue shift count = 321038982
this gives an average 17.887500180138481 bits per digit
Using complex FFT radices 144 16 16 16 16
mers_mod_square: Init threadpool of 8 threads
Using 8 threads in carry step
100 iterations of M337615261 with FFT length 18874368 = 18432 K, final residue shift count = 171176556
Res64: 2258A7342961B652. AvgMaxErr = 0.002428013. MaxErr = 0.281250000. Program: E18.0
Res mod 2^36 = 17874138706
Res mod 2^35 - 1 = 28069471175
Res mod 2^36 - 1 = 53816329185
Clocks = 00:00:21.009
NTHREADS = 8
ERROR: at line 1540 of file ../src/Mlucas.c
Assertion failed: Return value of shift_word(): unpadded-array-index out of range!
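A quick arithmetic check on two numbers in the log above, written as a sketch and not taken from the Mlucas source: the reported "average bits per digit" matches the exponent divided by the number of 8-byte floats (18432K = 18432 × 1024 = 18874368), and the "Clocks" line divided by the iteration count gives a msec/iter figure of the same kind as the cfg-file entries. The function names are hypothetical.

```python
def bits_per_digit(exponent, fft_len_k):
    """Average bits per FFT word for M(exponent), with the FFT length given in K (1K = 1024 doubles)."""
    return exponent / (fft_len_k * 1024)

def msec_per_iter(clocks, iters):
    """Convert a 'Clocks = HH:MM:SS.mmm' value and an iteration count to msec/iter."""
    hours, minutes, seconds = clocks.split(":")
    total_sec = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
    return total_sec * 1000.0 / iters

if __name__ == "__main__":
    # Values copied from the self-test log above (first radix set, 8 threads).
    print(bits_per_digit(337615261, 18432))
    print(msec_per_iter("00:00:21.067", 100))
```

So the two radix sets that completed ran at roughly 210 msec/iter on 8 threads at 18432K.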
#25
∂²ω=0
Sep 2002
República de California
2²·2,939 Posts
Thanks, Lorenzo: Could you also try the '-s m' tests using just -cpu 0:3 on that 32-core system and post the resulting cfg-file here? I'd like to see what kind of degradation of parallelism results from using more than one socket on that system.
I will look into the residue-shift assertion issue you hit in your 18432K FFT length test - first I need to see if I can reproduce it on any of the hardware I have. By way of a workaround, '-shift 0' should bypass all the new residue-shift code and allow you to get a best-radix-set timing at 18432K.
Edit: I was able to reproduce the assertion on my 2-core Macbook using an x86 SSE2 build, so the bug appears to be code-logic-related rather than anything platform-specific.
Last fiddled with by ewmayer on 2019-03-29 at 19:29
#26
Aug 2010
Republic of Belarus
10110010₂ Posts
Quote:
#27
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
3²·11·79 Posts
Quote:
Last fiddled with by kriesel on 2019-03-30 at 19:22
#28
∂²ω=0
Sep 2002
República de California
2²×2,939 Posts
Quote:
1. All 4 cores on 1 socket: '-s m -iters 100 -cpu 0:3'
2. If there are differences between the CPUs on various sockets (use /proc/cpuinfo as your guide here), run the same self-tests on each distinct-CPU-type socket. If it's e.g. a BIG socket with a high-perf CPU having just 2 cores, fiddle the -cpu args to use just those 2 cores;
3. All cores across all sockets: Like the 32-core test you did above;
4. One program instance per socket: This can get tricky in self-test mode if the runspeed varies appreciably between sockets. Better is to create a rundir for each socket, e.g. run0-run7 on an 8-socket 32-core system, copy the mlucas.cfg files from your -cpu 0:3 self-test to each rundir, create a worktodo.ini file containing one exponent of the size range of interest (you can use a single-shot invocation of the primenet.py script to grab such an assignment), then copy that to each rundir. Then cd to run0 and fire up a production run using -cpu 0:3, let that get to the first 10000-iter checkpoint (you will see a pair of p|q-named binary savefiles get created, and the p*.stat file updated with a checkpoint entry), that gives you a production-run timing for 1 socket. At that point cd to each of the other rundirs in turn and start up an instance in each. Let those runs get through a couple checkpoints and average the last-line-of-statfile timings, compare that average to the 1-socket-used timing.

I have found and fixed the bug your 18432K self-test exposed, will post update on that once I finish creating new ARM binaries from the updated source tarball and uploading to the server.
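The one-instance-per-socket setup described above can be planned with a small Python sketch. Two things here are assumptions and not from the post: that each further rundir gets the next consecutive block of cores (the post only gives '-cpu 0:3' for run0), and that core numbers map onto sockets in that order, which should be checked against lscpu or /proc/cpuinfo first. The function names are hypothetical, and the timings are whatever you read off the last line of each statfile.

```python
def rundir_plan(n_cores, cores_per_instance):
    """Map run directories run0, run1, ... to '-cpu lo:hi' arguments.

    Assumption: consecutive core blocks, one block per instance.
    """
    if cores_per_instance <= 0 or n_cores % cores_per_instance:
        raise ValueError("n_cores must be a positive multiple of cores_per_instance")
    plan = {}
    for i in range(n_cores // cores_per_instance):
        lo = i * cores_per_instance
        hi = lo + cores_per_instance - 1
        plan[f"run{i}"] = f"-cpu {lo}:{hi}"
    return plan

def multi_instance_ratio(single_instance_msec, all_instances_msec):
    """Average msec/iter with all instances running, divided by the
    timing of one instance running alone. 1.0 means no slowdown."""
    avg = sum(all_instances_msec) / len(all_instances_msec)
    return avg / single_instance_msec

if __name__ == "__main__":
    # The 8-rundir, 32-core layout used as the example in the post.
    for rundir, cpu_arg in rundir_plan(32, 4).items():
        print(rundir, cpu_arg)
```

A ratio well above 1.0 would indicate that the instances are competing for something shared, such as memory bandwidth.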
#29
∂²ω=0
Sep 2002
República de California
2²·2,939 Posts
Quote:
But manycore tests are always interesting because we hope to see signs of one manufacturer or another achieving a breakthrough in parallelism. Though in that regard, nearly all the action over the past 5 years has been on the GPU side of the ledger.
#30
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7821₁₀ Posts
Quote:
Another way to go at it would be to make 1-core, 4-core, and 8-core benchmark runs and compare to the 32.
Last fiddled with by kriesel on 2019-03-31 at 00:24
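The comparison suggested above comes down to speedup and parallel efficiency relative to the 1-core run. A minimal Python sketch, with a hypothetical function name and no timings filled in, since none of the 1-, 4-, or 8-core cfg timings have been posted in this thread:

```python
def scaling_table(msec_per_iter_by_threads):
    """Given {thread_count: msec/iter} at one FFT length, return a list of
    (threads, speedup, efficiency) tuples relative to the 1-thread timing.

    speedup    = t(1) / t(n)
    efficiency = speedup / n   (1.0 would be perfect scaling)
    """
    if 1 not in msec_per_iter_by_threads:
        raise ValueError("need a 1-thread timing as the baseline")
    base = msec_per_iter_by_threads[1]
    table = []
    for n in sorted(msec_per_iter_by_threads):
        speedup = base / msec_per_iter_by_threads[n]
        table.append((n, speedup, speedup / n))
    return table
```

Feed it the msec/iter values for the same FFT length from each run's mlucas.cfg, e.g. `scaling_table({1: t1, 4: t4, 8: t8, 32: t32})`.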
#31
Jan 2008
France
3·199 Posts
Quote:
For the record, the CPU from Ampere is not that great from a performance point of view; in particular, its FP performance is lower than that of Amazon's Cortex-A72 chip despite running at 3.3 GHz vs 2.3 GHz:
http://browser.geekbench.com/v4/cpu/...eline=11678329
It's not even that much faster than an S7:
http://browser.geekbench.com/v4/cpu/...eline=12621230
BTW Ernst, I'm afraid I don't get why you're talking about multiple sockets. That system has a single socket.
#32
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
3²·11·79 Posts
Quote:
I have no reason to expect that figure to be optimal among cpu choices. It's just one of the better among my little fleet. (Then there's curtisc's and others' $0/exponent, when the participant is using someone else's hardware and electricity.)
#33
∂²ω=0
Sep 2002
República de California
2²×2,939 Posts