mersenneforum.org  

Old 2018-03-03, 13:34   #155
VictordeHolland
 
"Victor de Hollander"
Aug 2011
the Netherlands

3²·131 Posts

Odroid-U2 (4x Cortex-A9, armv7)

17.1 compiled with GCC 6.3.0
4 thread (scalar)
Code:
17.1
      1024  msec/iter =   90.68  ROE[avg,max] = [0.247321429, 0.312500000]  radices = 128 16 16 16  0  0  0  0  0  0
      1152  msec/iter =  102.31  ROE[avg,max] = [0.225341797, 0.265625000]  radices = 144 16 16 16  0  0  0  0  0  0
      1280  msec/iter =  119.48  ROE[avg,max] = [0.245872280, 0.312500000]  radices = 160 16 16 16  0  0  0  0  0  0
      1408  msec/iter =  128.28  ROE[avg,max] = [0.228543527, 0.281250000]  radices = 176 16 16 16  0  0  0  0  0  0
      1536  msec/iter =  157.27  ROE[avg,max] = [0.238232422, 0.281250000]  radices = 192 16 16 16  0  0  0  0  0  0
      1664  msec/iter =  151.83  ROE[avg,max] = [0.252692522, 0.312500000]  radices = 208 16 16 16  0  0  0  0  0  0
      1792  msec/iter =  164.11  ROE[avg,max] = [0.231361607, 0.281250000]  radices = 224 16 16 16  0  0  0  0  0  0
      1920  msec/iter =  182.04  ROE[avg,max] = [0.234856306, 0.281250000]  radices = 240 16 16 16  0  0  0  0  0  0
      2048  msec/iter =  188.62  ROE[avg,max] = [0.250000000, 0.281250000]  radices = 256 16 16 16  0  0  0  0  0  0
      2304  msec/iter =  214.94  ROE[avg,max] = [0.227878244, 0.250000000]  radices = 144  8  8  8 16  0  0  0  0  0
      2560  msec/iter =  252.43  ROE[avg,max] = [0.276897321, 0.343750000]  radices = 160  8  8  8 16  0  0  0  0  0
      2816  msec/iter =  267.55  ROE[avg,max] = [0.242633929, 0.281250000]  radices = 176  8  8  8 16  0  0  0  0  0
      3072  msec/iter =  315.78  ROE[avg,max] = [0.248883929, 0.312500000]  radices = 192  8  8  8 16  0  0  0  0  0
      3328  msec/iter =  319.69  ROE[avg,max] = [0.278794643, 0.343750000]  radices = 208  8  8  8 16  0  0  0  0  0
      3584  msec/iter =  343.25  ROE[avg,max] = [0.249330357, 0.281250000]  radices = 224  8  8  8 16  0  0  0  0  0
      3840  msec/iter =  400.92  ROE[avg,max] = [0.241594587, 0.281250000]  radices = 240  8  8  8 16  0  0  0  0  0
      4096  msec/iter =  414.05  ROE[avg,max] = [0.256026786, 0.312500000]  radices = 256  8  8  8 16  0  0  0  0  0
      4608  msec/iter =  493.21  ROE[avg,max] = [0.231989397, 0.281250000]  radices = 288  8  8  8 16  0  0  0  0  0
      5120  msec/iter =  583.92  ROE[avg,max] = [0.262723214, 0.312500000]  radices = 160  8  8 16 16  0  0  0  0  0
      5632  msec/iter =  628.49  ROE[avg,max] = [0.237304688, 0.281250000]  radices = 176  8  8 16 16  0  0  0  0  0
      6144  msec/iter =  731.52  ROE[avg,max] = [0.242550223, 0.281250000]  radices = 192  8  8 16 16  0  0  0  0  0
      6656  msec/iter =  739.47  ROE[avg,max] = [0.270982143, 0.312500000]  radices = 208  8  8 16 16  0  0  0  0  0
      7168  msec/iter =  799.27  ROE[avg,max] = [0.240229143, 0.281250000]  radices = 224  8  8 16 16  0  0  0  0  0
      7680  msec/iter =  894.03  ROE[avg,max] = [0.247879464, 0.312500000]  radices = 240  8  8 16 16  0  0  0  0  0
about 10% faster than 14.1 with GCC 4.8 (http://mersenneforum.org/showpost.ph...2&postcount=44)
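For anyone who wants to compare tables like the one above across boards, here is a minimal Python sketch that parses one benchmark line into a record. The function names `parse_cfg_line` and `radix_product_matches` are made up for this illustration and are not part of Mlucas. The second helper checks an observation that holds for the lines posted here: the product of the nonzero radices equals half the FFT length (FFT length in K times 1024, divided by 2).

```python
import re

# One benchmark line as printed in the tables in this thread, e.g.
# "1024  msec/iter =   90.68  ROE[avg,max] = [0.247321429, 0.312500000]  radices = 128 16 16 16  0 ..."
LINE_RE = re.compile(
    r"^\s*(\d+)\s+msec/iter\s*=\s*([\d.]+)\s+"
    r"ROE\[avg,max\]\s*=\s*\[([\d.]+),\s*([\d.]+)\]\s+"
    r"radices\s*=\s*([\d\s]+)$"
)

def parse_cfg_line(line):
    """Return a dict for one benchmark line, or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    # Trailing zeros are padding for unused radix slots, so drop them.
    radices = [int(r) for r in m.group(5).split() if int(r) != 0]
    return {
        "fft_k": int(m.group(1)),
        "msec_per_iter": float(m.group(2)),
        "roe_avg": float(m.group(3)),
        "roe_max": float(m.group(4)),
        "radices": radices,
    }

def radix_product_matches(rec):
    """True if the product of the radices equals half the FFT length in doubles."""
    prod = 1
    for r in rec["radices"]:
        prod *= r
    return prod == rec["fft_k"] * 1024 // 2
```

Lines that are not benchmark rows, such as the "17.1" version header, simply come back as `None`, so a whole table can be filtered in one pass.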
Old 2018-03-04, 00:06   #156
ewmayer
2ω=0
 
Sep 2002
República de California

2²·2,939 Posts

Quote:
Originally Posted by VictordeHolland
Odroid-U2 (4x Cortex-A9, armv7)

17.1 compiled with GCC 6.3.0
4 thread (scalar)
[snip]
about 10% faster than 14.1 with GCC 4.8 (http://mersenneforum.org/showpost.ph...2&postcount=44)
That performance difference is likely mostly or entirely due to improvements in the code rather than to gcc. Notably, during that interval I implemented faster carry macros that use short-length recurrences to compute the DWT weights, which gave a 5-10% across-the-board speedup.
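To illustrate the general idea of computing DWT weights by a short recurrence, here is a Python sketch. It is not the Mlucas carry macros, whose details are not given in this thread. It assumes the standard Crandall-Fagin IBDWT weights w_j = 2^(ceil(j*p/n) - j*p/n) for exponent p and transform length n; the function names and the `chunk` parameter are hypothetical. Consecutive weights differ by a fixed multiplier (times 2 on wraparound), so only every `chunk`-th weight needs a full power evaluation, and restarting from an exact seed keeps rounding error from accumulating.

```python
def dwt_weights_direct(p, n):
    """IBDWT weights w_j = 2**(ceil(j*p/n) - j*p/n), each computed from scratch."""
    # ((-j*p) % n) / n equals ceil(j*p/n) - j*p/n, with the numerator kept exact.
    return [2.0 ** (((-j * p) % n) / n) for j in range(n)]

def dwt_weights_recurrence(p, n, chunk=8):
    """Same weights, but only every `chunk`-th one is computed directly."""
    r = p % n
    step = 2.0 ** (-r / n)          # multiplier from w_j to w_{j+1}, before wraparound
    out = []
    for start in range(0, n, chunk):
        e = (-start * p) % n        # exact integer numerator of the seed exponent
        w = 2.0 ** (e / n)          # seed weight, computed directly
        for j in range(start, min(start + chunk, n)):
            out.append(w)
            # Advance to index j+1: the exponent drops by r/n and wraps into [0, 1).
            if e >= r:
                e -= r
                w *= step
            else:
                e += n - r
                w *= step * 2.0
    return out
```

With a short chunk, each weight is at most a handful of multiplications away from a directly computed seed, which is why the two functions agree to within rounding.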

Oh, hey, could you also try the scalar-double (non-SIMD) one of the pair of precompiled-on-Odroid-C2 binaries I posted to the README page a couple of days ago, and let me know if you hit any issues (or see notable timing differences vs. your numbers above)? Thanks.
Old 2018-03-13, 13:42   #157
M344587487
 
"Composite as Heck"
Oct 2017

950₁₀ Posts

ROC-RK3328-CC
Image: ROC-RK3328-CC_Ubuntu16.04_Arch64_20180309
GCC: 7.2.0
asimd

They got their act together: it's now comparable to a Pi 3B. It should be noticeably better given the higher clocks, so maybe the image is still not fully tailored to the hardware (lscpu does report 1392 for CPU max MHz, but I don't know if it stays there under load). It doesn't look like the better-on-paper RAM has done anything for Mlucas:
Code:
17.1
      1024  msec/iter =   62.34  ROE[avg,max] = [0.231563895, 0.281250000]  radices =  64 32 16 16  0  0  0  0  0  0
      1152  msec/iter =   67.97  ROE[avg,max] = [0.221044922, 0.250000000]  radices = 288  8 16 16  0  0  0  0  0  0
      1280  msec/iter =   77.26  ROE[avg,max] = [0.264508929, 0.343750000]  radices = 160 16 16 16  0  0  0  0  0  0
      1408  msec/iter =   90.94  ROE[avg,max] = [0.227343750, 0.265625000]  radices = 176 16 16 16  0  0  0  0  0  0
      1536  msec/iter =  102.08  ROE[avg,max] = [0.267187500, 0.343750000]  radices =  48 32 32 16  0  0  0  0  0  0
      1664  msec/iter =  110.82  ROE[avg,max] = [0.270758929, 0.312500000]  radices = 208 16 16 16  0  0  0  0  0  0
      1792  msec/iter =  120.42  ROE[avg,max] = [0.220532663, 0.250000000]  radices = 224 16 16 16  0  0  0  0  0  0
      1920  msec/iter =  131.74  ROE[avg,max] = [0.257756696, 0.312500000]  radices = 240 16 16 16  0  0  0  0  0  0
      2048  msec/iter =  140.40  ROE[avg,max] = [0.223493304, 0.250000000]  radices =  64 32 32 16  0  0  0  0  0  0
      2304  msec/iter =  163.35  ROE[avg,max] = [0.248751395, 0.312500000]  radices = 288 16 16 16  0  0  0  0  0  0
      2560  msec/iter =  179.19  ROE[avg,max] = [0.236908831, 0.312500000]  radices = 160 32 16 16  0  0  0  0  0  0
      2816  msec/iter =  208.29  ROE[avg,max] = [0.263392857, 0.312500000]  radices = 176 32 16 16  0  0  0  0  0  0
      3072  msec/iter =  231.04  ROE[avg,max] = [0.224818638, 0.251953125]  radices =  48 32 32 32  0  0  0  0  0  0
      3328  msec/iter =  251.91  ROE[avg,max] = [0.281250000, 0.375000000]  radices = 208 32 16 16  0  0  0  0  0  0
      3584  msec/iter =  272.49  ROE[avg,max] = [0.252343750, 0.312500000]  radices = 224 32 16 16  0  0  0  0  0  0
      3840  msec/iter =  295.94  ROE[avg,max] = [0.248437500, 0.343750000]  radices = 240 32 16 16  0  0  0  0  0  0
      4096  msec/iter =  307.20  ROE[avg,max] = [0.295089286, 0.343750000]  radices = 128 32 32 16  0  0  0  0  0  0
      4608  msec/iter =  356.76  ROE[avg,max] = [0.258928571, 0.312500000]  radices = 144 32 32 16  0  0  0  0  0  0
      5120  msec/iter =  390.13  ROE[avg,max] = [0.237137277, 0.281250000]  radices = 160 32 32 16  0  0  0  0  0  0
      5632  msec/iter =  455.21  ROE[avg,max] = [0.256919643, 0.312500000]  radices = 176 32 32 16  0  0  0  0  0  0
      6144  msec/iter =  512.26  ROE[avg,max] = [0.246651786, 0.281250000]  radices = 192 32 32 16  0  0  0  0  0  0
      6656  msec/iter =  550.61  ROE[avg,max] = [0.262500000, 0.312500000]  radices = 208 32 32 16  0  0  0  0  0  0
      7168  msec/iter =  606.13  ROE[avg,max] = [0.224874442, 0.281250000]  radices = 224 32 32 16  0  0  0  0  0  0
      7680  msec/iter =  664.46  ROE[avg,max] = [0.237053571, 0.281250000]  radices = 240 32 32 16  0  0  0  0  0  0
Old 2018-03-14, 01:52   #158
ewmayer
2ω=0
 
Sep 2002
República de California

2²×2,939 Posts

Quote:
Originally Posted by M344587487
ROC-RK3328-CC
Image: ROC-RK3328-CC_Ubuntu16.04_Arch64_20180309
GCC: 7.2.0
asimd

They got their act together: it's now comparable to a Pi 3B. It should be noticeably better given the higher clocks, so maybe the image is still not fully tailored to the hardware (lscpu does report 1392 for CPU max MHz, but I don't know if it stays there under load). It doesn't look like the better-on-paper RAM has done anything for Mlucas:
Glad you got at least a nice chunk of the "missing FLOPS" back. I'm looking forward to buying an Odroid N1 once they go on sale, hopefully within a month. One of the beta-testers ran Mlucas benchmarks, and using all 6 cores (one 2-threaded job running on the 'big' 2-core A72 CPU, another 4-threaded one on the 'little' 4-core A53 CPU) he gets 2.2-2.3x the total throughput of an Odroid C2, which means ~3.5x the total throughput of a Pi3. N1 pricing is estimated at ~$110, i.e. about the same $/FLOP as the C2. We can only hope this sparks a full-blown 'multi-socket war' amongst the various ARM-micro-PC manufacturers. :) Even for the N1 one still needs ~10 of them to match the LL-test throughput of a cutting-edge Intel quad, but things are getting close to the "interestingness" level as far as wide-scale adoption goes.
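The ratios quoted above can be chained together. The sketch below only does arithmetic on the numbers stated in this post (2.2-2.3x a C2, taken at the 2.25 midpoint, ~3.5x a Pi3, ~$110, ~10 units per Intel quad); the function name and dictionary keys are made up for illustration, and the outputs are implied estimates, not measurements.

```python
def implied_ratios(n1_vs_c2=2.25, n1_vs_pi3=3.5, n1_price_usd=110.0, n1_per_quad=10):
    """Derive the figures implied by the throughput ratios quoted in the post."""
    return {
        # If N1 = 2.25 x C2 and N1 = 3.5 x Pi3, then C2 = (3.5 / 2.25) x Pi3.
        "c2_vs_pi3": n1_vs_pi3 / n1_vs_c2,
        # Price at which a C2 would have exactly the same $/FLOP as a $110 N1.
        "c2_breakeven_price_usd": n1_price_usd / n1_vs_c2,
        # An Intel quad worth ~10 N1 units, expressed in Pi3 and C2 units.
        "quad_vs_pi3": n1_per_quad * n1_vs_pi3,
        "quad_vs_c2": n1_per_quad * n1_vs_c2,
    }
```

Under these inputs a C2 comes out at roughly 1.56x a Pi3, and one Intel quad at about 35 Pi3 or 22.5 C2 boards.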

Last fiddled with by ewmayer on 2018-03-14 at 03:01
Old 2018-03-19, 21:22   #159
Lorenzo
 
Aug 2010
Republic of Belarus

2×89 Posts

Hello! I would like to share benchmarks for an ARMv8 MONSTER with 96 cores (2x48)!
CPU: Cavium ThunderX SoC (96 Physical Cores @ 2.0 GHz (2 × Cavium ThunderX)).
RAM: 128 GB of DDR4 ECC RAM
OS: CentOS 7
GCC: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
lscpu:
Code:
Architecture:          aarch64
Byte Order:            Little Endian
CPU(s):                96
On-line CPU(s) list:   0-95
Thread(s) per core:    1
Core(s) per socket:    48
Socket(s):             2
NUMA node(s):          2
NUMA node0 CPU(s):     0-47
NUMA node1 CPU(s):     48-95

processor       : 0
BogoMIPS        : 200.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
CPU implementer : 0x43
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0x0a1
CPU revision    : 1
Mlucas compiled with ASIMD did not run successfully:
Code:
    Mlucas 17.1

    http://hogranch.com/mayer/README.html

ERROR: at line 1831 of file ../src/util.c
Assertion failed: #define USE_ARM_V8_SIMD invoked but no advanced-SIMD support detected on this CPU!

INFO: testing qfloat routines...
CPU Family = ARM Embedded ABI, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 4.8.5 20150623 (Red Hat 4.8.5-16).
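Note that the Features line in the CPU listing above does include asimd, so the kernel advertises the capability even though the assertion fires. As a quick way to double-check the flag, here is a small Python sketch that collects the flags from /proc/cpuinfo-style text. The function name `cpu_features` is hypothetical, and this says nothing about how Mlucas itself performs its detection.

```python
def cpu_features(cpuinfo_text):
    """Collect the flags from every 'Features' line of /proc/cpuinfo-style text."""
    feats = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only lines of the form "Features : flag1 flag2 ..." contribute.
        if sep and key.strip() == "Features":
            feats.update(value.split())
    return feats

# The excerpt posted above, reproduced as sample input.
SAMPLE = """processor       : 0
BogoMIPS        : 200.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
CPU implementer : 0x43
"""

has_asimd = "asimd" in cpu_features(SAMPLE)
```

On a live Linux box the same function can be fed the contents of `/proc/cpuinfo` read as text.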
With all 96 cores it has some problem(s):
./Mlucas -s m -iters 100 -nthread 96 >& selftest.log
Code:
    Mlucas 17.1

    http://hogranch.com/mayer/README.html

INFO: using 53-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation. 
INFO: testing FFT radix tables...

           Mlucas selftest running.....

/****************************************************************************/

INFO: Unable to find/open mlucas.cfg file in r+ mode ... creating from scratch.
NTHREADS = 96
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices      1024        16        32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
WARN: At line 425 of file ../src/radix1024_ditN_cy_dif1.c:
n_div_nwt%CY_THREADS != 0

 Return with code ERR_ASSERT
Error detected - this radix set will not be used.
NTHREADS = 96
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices      1024        32        16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
WARN: At line 425 of file ../src/radix1024_ditN_cy_dif1.c:
n_div_nwt%CY_THREADS != 0

 Return with code ERR_ASSERT
Error detected - this radix set will not be used.
NTHREADS = 96
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices       256         8        16        16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
WARN: At line 590 of file ../src/radix256_ditN_cy_dif1.c:
n_div_nwt%CY_THREADS != 0

 Return with code ERR_ASSERT
Error detected - this radix set will not be used.
NTHREADS = 96
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices       128        16        16        16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
WARN: At line 451 of file ../src/radix128_ditN_cy_dif1.c:
n_div_nwt%CY_THREADS != 0

 Return with code ERR_ASSERT
Error detected - this radix set will not be used.
NTHREADS = 96
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices        64        16        16        32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
Using 64 threads in carry step
100 iterations of M20000047 with FFT length 1048576 = 1024 K
Res64: DD61B3E031F1E0BA. AvgMaxErr = 0.211633301. MaxErr = 0.250000000. Program: E17.1
Res mod 2^36     =            837935290
Res mod 2^35 - 1 =           6238131189
Res mod 2^36 - 1 =          41735145962
Clocks = 00:00:01.126

NTHREADS = 96
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices        64        32        16        16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
100 iterations of M20000047 with FFT length 1048576 = 1024 K
Res64: DD61B3E031F1E0BA. AvgMaxErr = 0.209770857. MaxErr = 0.250000000. Program: E17.1
Res mod 2^36     =            837935290
Res mod 2^35 - 1 =           6238131189
Res mod 2^36 - 1 =          41735145962
Clocks = 00:00:01.176

NTHREADS = 96
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices        64         8         8         8        16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
100 iterations of M20000047 with FFT length 1048576 = 1024 K
Res64: DD61B3E031F1E0BA. AvgMaxErr = 0.224783761. MaxErr = 0.281250000. Program: E17.1
Res mod 2^36     =            837935290
Res mod 2^35 - 1 =           6238131189
Res mod 2^36 - 1 =          41735145962
Clocks = 00:00:01.085

NTHREADS = 96
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices        32        16        32        32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
Using 64 threads in carry step
100 iterations of M20000047 with FFT length 1048576 = 1024 K
Res64: DD61B3E031F1E0BA. AvgMaxErr = 0.275223214. MaxErr = 0.375000000. Program: E17.1
Res mod 2^36     =            837935290
Res mod 2^35 - 1 =           6238131189
Res mod 2^36 - 1 =          41735145962
Clocks = 00:00:01.383

NTHREADS = 96
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices        32        32        32        16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
100 iterations of M20000047 with FFT length 1048576 = 1024 K
Res64: DD61B3E031F1E0BA. AvgMaxErr = 0.268080357. MaxErr = 0.312500000. Program: E17.1
Res mod 2^36     =            837935290
Res mod 2^35 - 1 =           6238131189
Res mod 2^36 - 1 =          41735145962
Clocks = 00:00:01.577

NTHREADS = 96
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices        32         8         8        16        16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
100 iterations of M20000047 with FFT length 1048576 = 1024 K
Res64: DD61B3E031F1E0BA. AvgMaxErr = 0.296428571. MaxErr = 0.375000000. Program: E17.1
Res mod 2^36     =            837935290
Res mod 2^35 - 1 =           6238131189
Res mod 2^36 - 1 =          41735145962
Clocks = 00:00:01.341

NTHREADS = 96
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices        16        32        32        32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
Using 64 threads in carry step
100 iterations of M20000047 with FFT length 1048576 = 1024 K
Res64: DD61B3E031F1E0BA. AvgMaxErr = 0.221902902. MaxErr = 0.281250000. Program: E17.1
Res mod 2^36     =            837935290
Res mod 2^35 - 1 =           6238131189
Res mod 2^36 - 1 =          41735145962
Clocks = 00:00:02.362

NTHREADS = 96
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices        16         8        16        16        16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
100 iterations of M20000047 with FFT length 1048576 = 1024 K
Res64: DD61B3E031F1E0BA. AvgMaxErr = 0.230357143. MaxErr = 0.312500000. Program: E17.1
Res mod 2^36     =            837935290
Res mod 2^35 - 1 =           6238131189
Res mod 2^36 - 1 =          41735145962
Clocks = 00:00:02.175

NTHREADS = 96
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices         8        16        16        16        16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
ERROR: at line 120 of file ../src/radix8_ditN_cy_dif1.c
Assertion failed: radix8_ditN_cy_dif1.c: CY_THREADS not a power of 2!
INFO: testing qfloat routines...
CPU Family = ARM Embedded ABI, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 4.8.5 20150623 (Red Hat 4.8.5-16).
INFO: Using inline-macro form of MUL_LOHI64.
INFO: MLUCAS_PATH is set to ""
Setting DAT_BITS = 10, PAD_BITS = 2
INFO: testing IMUL routines...
INFO: System has 96 available processor cores.
Set affinity for the following 96 cores: 0.1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28.29.30.31.32.33.34.35.36.37.38.39.40.41.42.43.44.45.46.47.48.49.50.51.52.53.54.55.56.57.58.59.60.61.62.63.64.65.66.67.68.69.70.71.72.73.74.75.76.77.78.79.80.81.82.83.84.85.86.87.88.89.90.91.92.93.94.95.
mers_mod_square: Init threadpool of 96 threads
mers_mod_square: Init threadpool of 96 threads
mers_mod_square: Init threadpool of 96 threads
mers_mod_square: Init threadpool of 96 threads
mers_mod_square: Init threadpool of 96 threads
mers_mod_square: Init threadpool of 96 threads
mers_mod_square: Init threadpool of 96 threads
mers_mod_square: Init threadpool of 96 threads
mers_mod_square: Init threadpool of 96 threads
mers_mod_square: Init threadpool of 96 threads
mers_mod_square: Init threadpool of 96 threads
mers_mod_square: Init threadpool of 96 threads
mers_mod_square: Init threadpool of 96 threads

[root@lorenzo2 src]#
I also tried with 92 and 90 cores, with the same error (?).
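The log messages name two conditions: "radix0/2 not exactly divisible by NTHREADS" and "CY_THREADS not a power of 2". The Python sketch below just evaluates those two conditions for a given leading radix and thread count. The function names are hypothetical and the logic mirrors the message text only, not the actual Mlucas source. The 48- and 60-thread runs reported in this post produced timings even though those counts are not powers of two either, so these checks alone do not predict which thread counts will work.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... using the usual bit trick."""
    return n > 0 and (n & (n - 1)) == 0

def check_thread_count(radix0, nthreads):
    """Evaluate the two conditions named in the log messages."""
    return {
        "half_radix0_divisible": (radix0 // 2) % nthreads == 0,
        "nthreads_power_of_two": is_power_of_two(nthreads),
    }

# Leading radices that appear in the 96-thread log above.
LOG_RADIX0 = [1024, 256, 128, 64, 32, 16, 8]
report_96 = {r0: check_thread_count(r0, 96) for r0 in LOG_RADIX0}
```

For 96 threads neither condition holds for any leading radix in the log, which is consistent with the divisibility warning being printed for every radix set.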

./Mlucas -s m -iters 100 -nthread 60 >& selftest.log
Code:
17.1
      1024  msec/iter =   10.36  ROE[avg,max] = [0.224783761, 0.281250000]  radices =  64  8  8  8 16  0  0  0  0  0
      1152  msec/iter =   13.41  ROE[avg,max] = [0.209650530, 0.250000000]  radices = 144 16 16 16  0  0  0  0  0  0
      1280  msec/iter =   15.83  ROE[avg,max] = [0.223046875, 0.250000000]  radices = 160 16 16 16  0  0  0  0  0  0
      1408  msec/iter =   16.36  ROE[avg,max] = [0.227852958, 0.250000000]  radices = 176 16 16 16  0  0  0  0  0  0
      1536  msec/iter =   18.24  ROE[avg,max] = [0.234375000, 0.312500000]  radices = 192 16 16 16  0  0  0  0  0  0
      1664  msec/iter =   18.33  ROE[avg,max] = [0.229310826, 0.281250000]  radices = 208 16 16 16  0  0  0  0  0  0
      1792  msec/iter =   19.33  ROE[avg,max] = [0.221177455, 0.281250000]  radices = 224 16 16 16  0  0  0  0  0  0
      1920  msec/iter =   22.15  ROE[avg,max] = [0.226757812, 0.250000000]  radices =  60 16 32 32  0  0  0  0  0  0
      2048  msec/iter =   22.12  ROE[avg,max] = [0.215150670, 0.250000000]  radices = 128 16 16 32  0  0  0  0  0  0
      2304  msec/iter =   24.11  ROE[avg,max] = [0.223395647, 0.250000000]  radices = 144 16 16 32  0  0  0  0  0  0
./Mlucas -s m -iters 100 -nthread 48 >& selftest.log
Code:
17.1
      1024  msec/iter =    8.75  ROE[avg,max] = [0.211633301, 0.250000000]  radices =  64 16 16 32  0  0  0  0  0  0
      1152  msec/iter =   11.57  ROE[avg,max] = [0.209650530, 0.250000000]  radices = 144 16 16 16  0  0  0  0  0  0
      1280  msec/iter =   13.96  ROE[avg,max] = [0.223046875, 0.250000000]  radices = 160 16 16 16  0  0  0  0  0  0
      1408  msec/iter =   14.65  ROE[avg,max] = [0.227852958, 0.250000000]  radices = 176 16 16 16  0  0  0  0  0  0
      1536  msec/iter =   15.97  ROE[avg,max] = [0.234375000, 0.312500000]  radices = 192 16 16 16  0  0  0  0  0  0
      1664  msec/iter =   17.98  ROE[avg,max] = [0.229310826, 0.281250000]  radices = 208 16 16 16  0  0  0  0  0  0
      1792  msec/iter =   19.09  ROE[avg,max] = [0.221177455, 0.281250000]  radices = 224 16 16 16  0  0  0  0  0  0
      1920  msec/iter =   21.41  ROE[avg,max] = [0.226757812, 0.250000000]  radices =  60 16 32 32  0  0  0  0  0  0
      2048  msec/iter =   20.34  ROE[avg,max] = [0.215150670, 0.250000000]  radices = 128 16 16 32  0  0  0  0  0  0
      2304  msec/iter =   21.62  ROE[avg,max] = [0.223395647, 0.250000000]  radices = 144 16 16 32  0  0  0  0  0  0
      2560  msec/iter =   26.38  ROE[avg,max] = [0.302678571, 0.375000000]  radices = 160 16 16 32  0  0  0  0  0  0
      2816  msec/iter =   26.95  ROE[avg,max] = [0.266071429, 0.312500000]  radices = 176 16 16 32  0  0  0  0  0  0
      3072  msec/iter =   30.10  ROE[avg,max] = [0.219042969, 0.281250000]  radices = 192 16 16 32  0  0  0  0  0  0
      3328  msec/iter =   33.33  ROE[avg,max] = [0.290401786, 0.343750000]  radices = 208 16 16 32  0  0  0  0  0  0
      3584  msec/iter =   35.76  ROE[avg,max] = [0.227008929, 0.281250000]  radices = 224  8  8  8 16  0  0  0  0  0
      3840  msec/iter =   39.92  ROE[avg,max] = [0.228404018, 0.257812500]  radices = 240 16 16 32  0  0  0  0  0  0
      4096  msec/iter =   41.12  ROE[avg,max] = [0.228041295, 0.312500000]  radices = 256 16 16 32  0  0  0  0  0  0
      4608  msec/iter =   45.24  ROE[avg,max] = [0.233426339, 0.312500000]  radices = 144  8  8 16 16  0  0  0  0  0
      5120  msec/iter =   55.08  ROE[avg,max] = [0.260044643, 0.312500000]  radices = 160  8  8 16 16  0  0  0  0  0
      5632  msec/iter =   55.22  ROE[avg,max] = [0.218415179, 0.281250000]  radices = 176 16 32 32  0  0  0  0  0  0
      6144  msec/iter =   61.61  ROE[avg,max] = [0.253236607, 0.312500000]  radices = 192  8  8 16 16  0  0  0  0  0
      6656  msec/iter =   68.61  ROE[avg,max] = [0.320982143, 0.375000000]  radices = 208  8  8 16 16  0  0  0  0  0
      7168  msec/iter =   73.92  ROE[avg,max] = [0.327455357, 0.375000000]  radices = 224  8  8 16 16  0  0  0  0  0
      7680  msec/iter =   81.06  ROE[avg,max] = [0.232477679, 0.281250000]  radices = 240  8  8 16 16  0  0  0  0  0
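The 60-thread and 48-thread tables above share the FFT lengths 1024K to 2304K, so the two runs can be compared directly. The Python sketch below uses only the msec/iter values posted here; the function name `slowdown` is made up for illustration.

```python
# msec/iter from the two tables above, keyed by FFT length in K.
T60 = {1024: 10.36, 1152: 13.41, 1280: 15.83, 1408: 16.36, 1536: 18.24,
       1664: 18.33, 1792: 19.33, 1920: 22.15, 2048: 22.12, 2304: 24.11}
T48 = {1024: 8.75, 1152: 11.57, 1280: 13.96, 1408: 14.65, 1536: 15.97,
       1664: 17.98, 1792: 19.09, 1920: 21.41, 2048: 20.34, 2304: 21.62}

def slowdown(t_a, t_b):
    """Ratio t_a / t_b for every FFT length present in both tables."""
    return {k: t_a[k] / t_b[k] for k in sorted(t_a) if k in t_b}

ratios = slowdown(T60, T48)
# True if the 48-thread run is faster at every shared FFT length.
forty_eight_always_faster = all(r > 1.0 for r in ratios.values())
```

In these numbers the 48-thread run is faster at every shared length, by about 18% at 1024K and by only about 1% at 1792K.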
So I continued testing with large FFTs and 48 cores:
./Mlucas -s l -iters 100 -nthread 48 >& selftest.log

Code:
      8192  msec/iter =   82.33  ROE[avg,max] = [0.326562500, 0.375000000]  radices = 256  8  8 16 16  0  0  0  0  0
      9216  msec/iter =   96.31  ROE[avg,max] = [0.248883929, 0.312500000]  radices = 144 32 32 32  0  0  0  0  0  0
     10240  msec/iter =  114.42  ROE[avg,max] = [0.288392857, 0.312500000]  radices = 160 32 32 32  0  0  0  0  0  0
     11264  msec/iter =  113.96  ROE[avg,max] = [0.217041016, 0.265625000]  radices = 176 32 32 32  0  0  0  0  0  0
     12288  msec/iter =  127.86  ROE[avg,max] = [0.232700893, 0.281250000]  radices = 192 32 32 32  0  0  0  0  0  0
     13312  msec/iter =  145.45  ROE[avg,max] = [0.236021205, 0.281250000]  radices = 208 32 32 32  0  0  0  0  0  0
     14336  msec/iter =  149.82  ROE[avg,max] = [0.232903181, 0.281250000]  radices = 224 32 32 32  0  0  0  0  0  0
     15360  msec/iter =  167.64  ROE[avg,max] = [0.266294643, 0.312500000]  radices = 240 32 32 32  0  0  0  0  0  0
     16384  msec/iter =  172.04  ROE[avg,max] = [0.233035714, 0.250000000]  radices = 256 32 32 32  0  0  0  0  0  0
     18432  msec/iter =  201.83  ROE[avg,max] = [0.215608433, 0.250000000]  radices = 144 16 16 16 16  0  0  0  0  0
     20480  msec/iter =  240.79  ROE[avg,max] = [0.285714286, 0.343750000]  radices = 160 16 16 16 16  0  0  0  0  0
     22528  msec/iter =  241.65  ROE[avg,max] = [0.234933036, 0.281250000]  radices = 176 16 16 16 16  0  0  0  0  0
     24576  msec/iter =  267.75  ROE[avg,max] = [0.237276786, 0.281250000]  radices = 192 16 16 16 16  0  0  0  0  0
     26624  msec/iter =  302.91  ROE[avg,max] = [0.256473214, 0.312500000]  radices = 208 16 16 16 16  0  0  0  0  0
     28672  msec/iter =  317.19  ROE[avg,max] = [0.216406250, 0.250000000]  radices = 224 16 16 16 16  0  0  0  0  0
     30720  msec/iter =  353.90  ROE[avg,max] = [0.245089286, 0.312500000]  radices = 240 16 16 16 16  0  0  0  0  0
     32768  msec/iter =  359.45  ROE[avg,max] = [0.326339286, 0.375000000]  radices = 256 16 16 16 16  0  0  0  0  0
And some further details, for anyone who needs them.
dmidecode:
Code:
Getting SMBIOS data from sysfs.
SMBIOS 3.0.0 present.
Table at 0x10FFF1E0000.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
        Vendor: American Megatrends Inc.
        Version: G31FB12A
        Release Date: 10/26/2016
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 16384 kB
        Characteristics:
                PCI is supported
                BIOS is upgradeable
                BIOS shadowing is allowed
                Boot from CD is supported
                Selectable boot is supported
                BIOS ROM is socketed
                EDD is supported
                5.25"/1.2 MB floppy services are supported (int 13h)
                3.5"/720 kB floppy services are supported (int 13h)
                3.5"/2.88 MB floppy services are supported (int 13h)
                Print screen service is supported (int 5h)
                Serial services are supported (int 14h)
                Printer services are supported (int 17h)
                ACPI is supported
                USB legacy is supported
                BIOS boot specification is supported
                Targeted content distribution is supported
                UEFI is supported
        BIOS Revision: 5.11

Handle 0x0001, DMI type 1, 27 bytes
System Information
        Manufacturer: FOXCONN
        Product Name: R2-1221R-A4
        Version: 1A21HH300-600-G
        Serial Number: 7CE642P2KZ
        UUID: 10000000-AE90-0958-D62D-70106FB9EAD0
        Wake-up Type: Power Switch
        SKU Number: C2U4N
        Family: NULL

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
        Manufacturer: FOXCONN
        Product Name: C2U4N_MB
        Version: 1A42D1P00-600-G1
        Serial Number: 1A42D1P00TX14700C
        Asset Tag: NULL
        Features:
                Board is a hosting board
                Board is replaceable
        Location In Chassis: REAR
        Chassis Handle: 0x0003
        Type: Motherboard
        Contained Object Handles: 0

Handle 0x0003, DMI type 3, 22 bytes
Chassis Information
        Manufacturer: FOXCONN
        Type: Other
        Lock: Not Present
        Version: C2U4N
        Serial Number: 7CE642P0SE
        Asset Tag: NULL
        Boot-up State: Safe
        Power Supply State: Safe
        Thermal State: Safe
        Security Status: None
        OEM Information: 0x00000000
        Height: 2 U
        Number Of Power Cords: 2
        Contained Elements: 0
        SKU Number: NULL

Handle 0x0009, DMI type 9, 17 bytes
System Slot Information
        Designation: PCIe Slot
        Type: x8 PCI Express 3 x8
        Current Usage: Available
        Length: Short
        ID: 1
        Characteristics:
                3.3 V is provided
                PME signal is supported
        Bus Address: 0000:00:00.0

Handle 0x0010, DMI type 11, 5 bytes
OEM Strings
        String 1: NULL

Handle 0x0011, DMI type 13, 22 bytes
BIOS Language Information
        Language Description Format: Long
        Installable Languages: 1
                en|US|iso8859-1
        Currently Installed Language: en|US|iso8859-1

Handle 0x0012, DMI type 32, 11 bytes
System Boot Information
        Status: No errors detected

Handle 0x0013, DMI type 41, 11 bytes
Onboard Device
        Reference Designation: VGA
        Type: Video
        Status: Enabled
        Type Instance: 1
        Bus Address: 0004:15:00.0

Handle 0x0014, DMI type 38, 18 bytes
IPMI Device Information
        Interface Type: SSIF (SMBus System Interface)
        Specification Version: 2.0
        I2C Slave Address: 0x10
        NV Storage Device: Not Present
        Base Address: 0x12 (SMBus)

Handle 0x0023, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: Not Specified
        Internal Connector Type: None
        External Reference Designator: USB 3.0 Port 1
        External Connector Type: Access Bus (USB)
        Port Type: USB

Handle 0x0024, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: Not Specified
        Internal Connector Type: None
        External Reference Designator: Rear Video
        External Connector Type: DB-15 female
        Port Type: Video Port

Handle 0x0025, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: Not Specified
        Internal Connector Type: None
        External Reference Designator: 1GbE
        External Connector Type: RJ-45
        Port Type: Network Port

Handle 0x0026, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: Not Specified
        Internal Connector Type: None
        External Reference Designator: 10G SFP+ 1
        External Connector Type: Other
        Port Type: Network Port

Handle 0x0027, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: Not Specified
        Internal Connector Type: None
        External Reference Designator: 10G SFP+ 2
        External Connector Type: Other
        Port Type: Network Port

Handle 0x0028, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: Mini SAS Port 0
        Internal Connector Type: SAS/SATA Plug Receptacle
        External Reference Designator: Not Specified
        External Connector Type: None
        Port Type: SAS

Handle 0x0029, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: Mini SAS Port 1
        Internal Connector Type: SAS/SATA Plug Receptacle
        External Reference Designator: Not Specified
        External Connector Type: None
        Port Type: SAS

Handle 0x002A, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: SATA M.2
        Internal Connector Type: Other
        External Reference Designator: Not Specified
        External Connector Type: None
        Port Type: SATA

Handle 0x002B, DMI type 12, 5 bytes
System Configuration Options
        Option 1: JP19(PASSWORD CLEAR) Jumper 1-2: Normal (Default), Jumper 2-3: Password Clear

Handle 0x002C, DMI type 4, 42 bytes
Processor Information
        Socket Designation: SoC 0
        Type: Central Processor
        Family: ARM
        Manufacturer: Cavium Inc.
        ID: 11 0A 1F 43 00 00 00 00
        Version: 2.1
        Voltage: 1.0 V
        External Clock: 50 MHz
        Max Speed: 2000 MHz
        Current Speed: 2000 MHz
        Status: Populated, Enabled
        Upgrade: None
        L1 Cache Handle: 0x002D
        L2 Cache Handle: 0x002F
        L3 Cache Handle: 0x0000
        Serial Number: 0180-1009-001E-31F1-8100-1000
        Asset Tag: NULL
        Part Number: CN8890H-2000BG2601-AAP-Y-G
        Core Count: 48
        Core Enabled: 48
        Thread Count: 48
        Characteristics:
                64-bit capable
                Multi-Core
                Execute Protection
                Enhanced Virtualization
                Power/Performance Control

Handle 0x002D, DMI type 7, 19 bytes
Cache Information
        Socket Designation: Internal L1D Cache
        Configuration: Enabled, Not Socketed, Level 1
        Operational Mode: Write Back
        Location: Internal
        Installed Size: 1536 kB
        Maximum Size: 1536 kB
        Supported SRAM Types:
                Unknown
        Installed SRAM Type: Unknown
        Speed: Unknown
        Error Correction Type: Single-bit ECC
        System Type: Data
        Associativity: 32-way Set-associative

Handle 0x002E, DMI type 7, 19 bytes
Cache Information
        Socket Designation: Internal L1I Cache
        Configuration: Enabled, Not Socketed, Level 1
        Operational Mode: Write Back
        Location: Internal
        Installed Size: 3744 kB
        Maximum Size: 3744 kB
        Supported SRAM Types:
                Unknown
        Installed SRAM Type: Unknown
        Speed: Unknown
        Error Correction Type: Single-bit ECC
        System Type: Instruction
        Associativity: Other

Handle 0x002F, DMI type 7, 19 bytes
Cache Information
        Socket Designation: Internal L2 Cache
        Configuration: Enabled, Not Socketed, Level 2
        Operational Mode: Write Back
        Location: Internal
        Installed Size: 16384 kB
        Maximum Size: 16384 kB
        Supported SRAM Types:
                Unknown
        Installed SRAM Type: Unknown
        Speed: Unknown
        Error Correction Type: Single-bit ECC
        System Type: Unified
        Associativity: 16-way Set-associative

Handle 0x0030, DMI type 16, 23 bytes
Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Single-bit ECC
        Maximum Capacity: 256 GB
        Error Information Handle: Not Provided
        Number Of Devices: 8

Handle 0x0031, DMI type 19, 31 bytes
Memory Array Mapped Address
        Starting Address: 0x00000000000
        Ending Address: 0x00FFFFFFFFF
        Range Size: 64 GB
        Physical Array Handle: 0x0030
        Partition Width: 4

Handle 0x0032, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0030
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: Unknown
        Locator: DIMM_A0
        Bank Locator: SoC 0
        Type: DDR4
        Type Detail: Registered (Buffered)
        Speed: 2400 MHz
        Manufacturer: Samsung
        Serial Number: #021631330C5A43
        Asset Tag: None
        Part Number: M393A2G40EB1-CRC    
        Rank: 2
        Configured Clock Speed: 2133 MHz
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

Handle 0x0033, DMI type 20, 35 bytes
Memory Device Mapped Address
        Starting Address: 0x00000000000
        Ending Address: 0x003FFFFFFFF
        Range Size: 16 GB
        Physical Device Handle: 0x0032
        Memory Array Mapped Address Handle: 0x0031
        Partition Row Position: 1
        Interleave Position: Unknown
        Interleaved Data Depth: Unknown

Handle 0x0034, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0030
        Error Information Handle: Not Provided
        Total Width: Unknown
        Data Width: Unknown
        Size: No Module Installed
        Form Factor: Unknown
        Set: Unknown
        Locator: DIMM_A1
        Bank Locator: SoC 0
        Type: DDR4
        Type Detail: Unknown
        Speed: Unknown
        Manufacturer: NO DIMM
        Serial Number: NO DIMM
        Asset Tag: None
        Part Number: NO DIMM
        Rank: Unknown
        Configured Clock Speed: Unknown
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Handle 0x0035, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0030
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: Unknown
        Locator: DIMM_B0
        Bank Locator: SoC 0
        Type: DDR4
        Type Detail: Registered (Buffered)
        Speed: 2400 MHz
        Manufacturer: Samsung
        Serial Number: #021631330C5727
        Asset Tag: None
        Part Number: M393A2G40EB1-CRC    
        Rank: 2
        Configured Clock Speed: 2133 MHz
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

Handle 0x0036, DMI type 20, 35 bytes
Memory Device Mapped Address
        Starting Address: 0x00400000000
        Ending Address: 0x007FFFFFFFF
        Range Size: 16 GB
        Physical Device Handle: 0x0035
        Memory Array Mapped Address Handle: 0x0031
        Partition Row Position: 1
        Interleave Position: Unknown
        Interleaved Data Depth: Unknown

Handle 0x0037, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0030
        Error Information Handle: Not Provided
        Total Width: Unknown
        Data Width: Unknown
        Size: No Module Installed
        Form Factor: Unknown
        Set: Unknown
        Locator: DIMM_B1
        Bank Locator: SoC 0
        Type: DDR4
        Type Detail: Unknown
        Speed: Unknown
        Manufacturer: NO DIMM
        Serial Number: NO DIMM
        Asset Tag: None
        Part Number: NO DIMM
        Rank: Unknown
        Configured Clock Speed: Unknown
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Handle 0x0038, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0030
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: Unknown
        Locator: DIMM_C0
        Bank Locator: SoC 0
        Type: DDR4
        Type Detail: Registered (Buffered)
        Speed: 2400 MHz
        Manufacturer: Samsung
        Serial Number: #021631330C5726
        Asset Tag: None
        Part Number: M393A2G40EB1-CRC    
        Rank: 2
        Configured Clock Speed: 2133 MHz
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

Handle 0x0039, DMI type 20, 35 bytes
Memory Device Mapped Address
        Starting Address: 0x00800000000
        Ending Address: 0x00BFFFFFFFF
        Range Size: 16 GB
        Physical Device Handle: 0x0038
        Memory Array Mapped Address Handle: 0x0031
        Partition Row Position: 1
        Interleave Position: Unknown
        Interleaved Data Depth: Unknown

Handle 0x003A, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0030
        Error Information Handle: Not Provided
        Total Width: Unknown
        Data Width: Unknown
        Size: No Module Installed
        Form Factor: Unknown
        Set: Unknown
        Locator: DIMM_C1
        Bank Locator: SoC 0
        Type: DDR4
        Type Detail: Unknown
        Speed: Unknown
        Manufacturer: NO DIMM
        Serial Number: NO DIMM
        Asset Tag: None
        Part Number: NO DIMM
        Rank: Unknown
        Configured Clock Speed: Unknown
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Handle 0x003B, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0030
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: Unknown
        Locator: DIMM_D0
        Bank Locator: SoC 0
        Type: DDR4
        Type Detail: Registered (Buffered)
        Speed: 2400 MHz
        Manufacturer: Samsung
        Serial Number: #021631330C5058
        Asset Tag: None
        Part Number: M393A2G40EB1-CRC    
        Rank: 2
        Configured Clock Speed: 2133 MHz
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

Handle 0x003C, DMI type 20, 35 bytes
Memory Device Mapped Address
        Starting Address: 0x00C00000000
        Ending Address: 0x00FFFFFFFFF
        Range Size: 16 GB
        Physical Device Handle: 0x003B
        Memory Array Mapped Address Handle: 0x0031
        Partition Row Position: 1
        Interleave Position: Unknown
        Interleaved Data Depth: Unknown

Handle 0x003D, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0030
        Error Information Handle: Not Provided
        Total Width: Unknown
        Data Width: Unknown
        Size: No Module Installed
        Form Factor: Unknown
        Set: Unknown
        Locator: DIMM_D1
        Bank Locator: SoC 0
        Type: DDR4
        Type Detail: Unknown
        Speed: Unknown
        Manufacturer: NO DIMM
        Serial Number: NO DIMM
        Asset Tag: None
        Part Number: NO DIMM
        Rank: Unknown
        Configured Clock Speed: Unknown
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Handle 0x003E, DMI type 4, 42 bytes
Processor Information
        Socket Designation: SoC 1
        Type: Central Processor
        Family: ARM
        Manufacturer: Cavium Inc.
        ID: 11 0A 1F 43 00 00 00 00
        Version: 2.1
        Voltage: 1.0 V
        External Clock: 50 MHz
        Max Speed: 2000 MHz
        Current Speed: 2000 MHz
        Status: Populated, Enabled
        Upgrade: None
        L1 Cache Handle: 0x003F
        L2 Cache Handle: 0x0041
        L3 Cache Handle: 0x0000
        Serial Number: 0160-2803-001E-31F1-8020-1000
        Asset Tag: NULL
        Part Number: CN8890H-2000BG2601-AAP-Y-G
        Core Count: 48
        Core Enabled: 48
        Thread Count: 48
        Characteristics:
                64-bit capable
                Multi-Core
                Execute Protection
                Enhanced Virtualization
                Power/Performance Control

Handle 0x003F, DMI type 7, 19 bytes
Cache Information
        Socket Designation: Internal L1D Cache
        Configuration: Enabled, Not Socketed, Level 1
        Operational Mode: Write Back
        Location: Internal
        Installed Size: 1536 kB
        Maximum Size: 1536 kB
        Supported SRAM Types:
                Unknown
        Installed SRAM Type: Unknown
        Speed: Unknown
        Error Correction Type: Single-bit ECC
        System Type: Data
        Associativity: 32-way Set-associative

Handle 0x0040, DMI type 7, 19 bytes
Cache Information
        Socket Designation: Internal L1I Cache
        Configuration: Enabled, Not Socketed, Level 1
        Operational Mode: Write Back
        Location: Internal
        Installed Size: 3744 kB
        Maximum Size: 3744 kB
        Supported SRAM Types:
                Unknown
        Installed SRAM Type: Unknown
        Speed: Unknown
        Error Correction Type: Single-bit ECC
        System Type: Instruction
        Associativity: Other

Handle 0x0041, DMI type 7, 19 bytes
Cache Information
        Socket Designation: Internal L2 Cache
        Configuration: Enabled, Not Socketed, Level 2
        Operational Mode: Write Back
        Location: Internal
        Installed Size: 16384 kB
        Maximum Size: 16384 kB
        Supported SRAM Types:
                Unknown
        Installed SRAM Type: Unknown
        Speed: Unknown
        Error Correction Type: Single-bit ECC
        System Type: Unified
        Associativity: 16-way Set-associative

Handle 0x0042, DMI type 16, 23 bytes
Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Single-bit ECC
        Maximum Capacity: 256 GB
        Error Information Handle: Not Provided
        Number Of Devices: 8

Handle 0x0043, DMI type 19, 31 bytes
Memory Array Mapped Address
        Starting Address: 0x01000000000
        Ending Address: 0x02FFFFFFFFF
        Range Size: 128 GB
        Physical Array Handle: 0x0042
        Partition Width: 4

Handle 0x0044, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0042
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: Unknown
        Locator: DIMM_E0
        Bank Locator: SoC 1
        Type: DDR4
        Type Detail: Registered (Buffered)
        Speed: 2400 MHz
        Manufacturer: Samsung
        Serial Number: #021631330C56CB
        Asset Tag: None
        Part Number: M393A2G40EB1-CRC    
        Rank: 2
        Configured Clock Speed: 2133 MHz
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

Handle 0x0045, DMI type 20, 35 bytes
Memory Device Mapped Address
        Starting Address: 0x01000000000
        Ending Address: 0x013FFFFFFFF
        Range Size: 16 GB
        Physical Device Handle: 0x0044
        Memory Array Mapped Address Handle: 0x0043
        Partition Row Position: 1
        Interleave Position: Unknown
        Interleaved Data Depth: Unknown

Handle 0x0046, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0042
        Error Information Handle: Not Provided
        Total Width: Unknown
        Data Width: Unknown
        Size: No Module Installed
        Form Factor: Unknown
        Set: Unknown
        Locator: DIMM_E1
        Bank Locator: SoC 1
        Type: DDR4
        Type Detail: Unknown
        Speed: Unknown
        Manufacturer: NO DIMM
        Serial Number: NO DIMM
        Asset Tag: None
        Part Number: NO DIMM
        Rank: Unknown
        Configured Clock Speed: Unknown
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Handle 0x0047, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0042
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: Unknown
        Locator: DIMM_F0
        Bank Locator: SoC 1
        Type: DDR4
        Type Detail: Registered (Buffered)
        Speed: 2400 MHz
        Manufacturer: Samsung
        Serial Number: #021631330C5723
        Asset Tag: None
        Part Number: M393A2G40EB1-CRC    
        Rank: 2
        Configured Clock Speed: 2133 MHz
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

Handle 0x0048, DMI type 20, 35 bytes
Memory Device Mapped Address
        Starting Address: 0x01400000000
        Ending Address: 0x017FFFFFFFF
        Range Size: 16 GB
        Physical Device Handle: 0x0047
        Memory Array Mapped Address Handle: 0x0043
        Partition Row Position: 1
        Interleave Position: Unknown
        Interleaved Data Depth: Unknown

Handle 0x0049, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0042
        Error Information Handle: Not Provided
        Total Width: Unknown
        Data Width: Unknown
        Size: No Module Installed
        Form Factor: Unknown
        Set: Unknown
        Locator: DIMM_F1
        Bank Locator: SoC 1
        Type: DDR4
        Type Detail: Unknown
        Speed: Unknown
        Manufacturer: NO DIMM
        Serial Number: NO DIMM
        Asset Tag: None
        Part Number: NO DIMM
        Rank: Unknown
        Configured Clock Speed: Unknown
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Handle 0x004A, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0042
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: Unknown
        Locator: DIMM_G0
        Bank Locator: SoC 1
        Type: DDR4
        Type Detail: Registered (Buffered)
        Speed: 2400 MHz
        Manufacturer: Samsung
        Serial Number: #021631330C5A49
        Asset Tag: None
        Part Number: M393A2G40EB1-CRC    
        Rank: 2
        Configured Clock Speed: 2133 MHz
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

Handle 0x004B, DMI type 20, 35 bytes
Memory Device Mapped Address
        Starting Address: 0x01800000000
        Ending Address: 0x01BFFFFFFFF
        Range Size: 16 GB
        Physical Device Handle: 0x004A
        Memory Array Mapped Address Handle: 0x0043
        Partition Row Position: 1
        Interleave Position: Unknown
        Interleaved Data Depth: Unknown

Handle 0x004C, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0042
        Error Information Handle: Not Provided
        Total Width: Unknown
        Data Width: Unknown
        Size: No Module Installed
        Form Factor: Unknown
        Set: Unknown
        Locator: DIMM_G1
        Bank Locator: SoC 1
        Type: DDR4
        Type Detail: Unknown
        Speed: Unknown
        Manufacturer: NO DIMM
        Serial Number: NO DIMM
        Asset Tag: None
        Part Number: NO DIMM
        Rank: Unknown
        Configured Clock Speed: Unknown
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Handle 0x004D, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0042
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: Unknown
        Locator: DIMM_H0
        Bank Locator: SoC 1
        Type: DDR4
        Type Detail: Registered (Buffered)
        Speed: 2400 MHz
        Manufacturer: Samsung
        Serial Number: #021631330C51F1
        Asset Tag: None
        Part Number: M393A2G40EB1-CRC    
        Rank: 2
        Configured Clock Speed: 2133 MHz
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

Handle 0x004E, DMI type 20, 35 bytes
Memory Device Mapped Address
        Starting Address: 0x01C00000000
        Ending Address: 0x01FFFFFFFFF
        Range Size: 16 GB
        Physical Device Handle: 0x004D
        Memory Array Mapped Address Handle: 0x0043
        Partition Row Position: 1
        Interleave Position: Unknown
        Interleaved Data Depth: Unknown

Handle 0x004F, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0042
        Error Information Handle: Not Provided
        Total Width: Unknown
        Data Width: Unknown
        Size: No Module Installed
        Form Factor: Unknown
        Set: Unknown
        Locator: DIMM_H1
        Bank Locator: SoC 1
        Type: DDR4
        Type Detail: Unknown
        Speed: Unknown
        Manufacturer: NO DIMM
        Serial Number: NO DIMM
        Asset Tag: None
        Part Number: NO DIMM
        Rank: Unknown
        Configured Clock Speed: Unknown
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown

Handle 0x0050, DMI type 127, 4 bytes
End Of Table
Old 2018-03-20, 01:11   #160
ewmayer
2ω=0
 
 
Sep 2002
República de California

2²·2,939 Posts
Default

Hi, Lorenzo, and thanks for the build data on this interesting manycore ARM implementation. Couple of notes:

[1] You are highly unlikely to get decent parallelism beyond 16 cores. Moreover, the self-test error messages you get for 48 and 96 cores reflect limitations in how much ||ism can be obtained for the specific-radix carry routines in question. (The reasons for these limitations are technical ... the self-test will print the error message and skip to the next FFT radix combination in such cases, but when #threads gets large, as in your attempts, there will be few radix combos which can run that many threads.) So I suggest for now limiting yourself to fewer threads; say, try the following core counts in your self-tests:

-cpu 0:3 [when tests done, move mlucas.cfg to mlucas.cfg.4]
-cpu 0:7 [when tests done, move mlucas.cfg to mlucas.cfg.8]
-cpu 0:15 [when tests done, move mlucas.cfg to mlucas.cfg.16]

Those should tell us roughly where the #threads 'sweet spot' is. Once we find it, you would - if you were going to do actual production GIMPS work on this arch - run multiple jobs, each having that many threads, assigned to disjoint core sets, e.g. if 8-core is best, one job running with -cpu 0:7, a second with -cpu 8:15, etc.
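The disjoint core-set layout described above can be sketched as follows. This is only an illustration, not part of Mlucas: the helper name `cpu_ranges` is hypothetical, and the only thing taken from the post is the `-cpu lo:hi` argument form.

```python
def cpu_ranges(total_cores, threads_per_job):
    """Split cores 0..total_cores-1 into disjoint 'lo:hi' ranges.

    Hypothetical helper for illustration; only the '-cpu lo:hi'
    argument form comes from the post. Cores left over after the
    last full job are simply not assigned.
    """
    if threads_per_job < 1 or total_cores < 0:
        raise ValueError("need threads_per_job >= 1 and total_cores >= 0")
    n_jobs = total_cores // threads_per_job
    return [
        f"{j * threads_per_job}:{(j + 1) * threads_per_job - 1}"
        for j in range(n_jobs)
    ]


if __name__ == "__main__":
    # Example: one 48-core SoC split into 8-thread jobs.
    for r in cpu_ranges(48, 8):
        print(f"-cpu {r}")
```

Each printed line would be the core-set argument for one job; whichever thread count turns out to be the sweet spot goes in as `threads_per_job`.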

[2] For the SIMD build, I suggest for now simply commenting out the ASSERT at util.c:1831, recompiling that file and relinking the binary (the SIMD obj-files and binary should be in a separate directory from those for the non-SIMD build), and seeing if the self-test now runs. If so, run the same moderate-threadcount self-tests as in [1]; then we can compare the 2 sets of cfg-files to see if SIMD gives the expected boost.
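Comparing two cfg-files by eye gets tedious, so here is a rough sketch of how it might be scripted. It assumes the cfg lines look like the benchmark listings posted in this thread (FFT size, then `msec/iter = <time>`); the function names `parse_cfg` and `speedups` are made up for this example and are not part of Mlucas.

```python
import re

# Matches timing lines such as:
#   1024  msec/iter =   56.53  ROE[avg,max] = [...]  radices = 256  8 16 16 ...
# The leading version line (e.g. "17.1") does not match and is skipped.
_LINE = re.compile(r"^\s*(\d+)\s+msec/iter\s*=\s*([0-9.]+)")


def parse_cfg(text):
    """Return {fft_size: msec_per_iter} from mlucas.cfg-style text."""
    timings = {}
    for line in text.splitlines():
        m = _LINE.match(line)
        if m:
            timings[int(m.group(1))] = float(m.group(2))
    return timings


def speedups(scalar_text, simd_text):
    """Ratio scalar/SIMD msec/iter for FFT sizes present in both files."""
    scalar, simd = parse_cfg(scalar_text), parse_cfg(simd_text)
    common = sorted(scalar.keys() & simd.keys())
    # Skip zero timings to avoid dividing by zero on malformed input.
    return {n: scalar[n] / simd[n] for n in common if simd[n] > 0}
```

A ratio above 1.0 for a given FFT size would mean the SIMD build ran that size faster than the scalar build at the same thread count.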

I may have to change the SIMD-available checking code in util.c to read the /proc/cpuinfo file directly, since the current way which calls getauxval(AT_HWCAP) does not appear to be as portable as I'd been led to believe.
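To illustrate the /proc/cpuinfo approach: on 64-bit ARM Linux the file normally carries a `Features` line per core, with `asimd` listed when Advanced SIMD is available (32-bit ARM kernels report `neon` instead). The sketch below shows only the parsing logic, in Python for brevity. It is not the util.c code, and the function names are hypothetical.

```python
def cpuinfo_has_feature(cpuinfo_text, feature="asimd"):
    """True if any 'Features' line in /proc/cpuinfo-style text lists `feature`."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "features":
            # Features are whitespace-separated tokens after the colon.
            if feature in value.split():
                return True
    return False


def host_has_asimd(path="/proc/cpuinfo"):
    """Check the running host; returns False if the file cannot be read."""
    try:
        with open(path) as f:
            return cpuinfo_has_feature(f.read())
    except OSError:
        return False
```

Matching whole tokens rather than substrings avoids false positives from longer feature names that happen to contain the one being searched for.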

Last fiddled with by ewmayer on 2018-03-20 at 01:12
Old 2018-03-20, 05:22   #161
kladner
 
 
"Kieren"
Jul 2011
In My Own Galaxy!

2·3·1,693 Posts
Default OMG! Gotta go public...

As I can't attach to a PM
Ernst, in some interaction between your ||(ism) and my screen, I saw these symbols as "jism."
Old 2018-03-23, 14:07   #162
M344587487
 
 
"Composite as Heck"
Oct 2017

2×5²×19 Posts
Default

With the latest image the Renegade now performs roughly as you'd expect. It may get slightly more optimised, but this is my last bench of it, honest. It is about 10% quicker than a stock pi3b once we get up to size 4096 and beyond.

Image: ROC-RK3328-CC_Ubuntu16.04_Arch64_20180315
GCC: 7.2.0

Code:
17.1
      1024  msec/iter =   56.53  ROE[avg,max] = [0.254687500, 0.312500000]  radices = 256  8 16 16  0  0  0  0  0  0
      1152  msec/iter =   61.35  ROE[avg,max] = [0.221044922, 0.250000000]  radices = 288  8 16 16  0  0  0  0  0  0
      1280  msec/iter =   68.82  ROE[avg,max] = [0.264508929, 0.343750000]  radices = 160 16 16 16  0  0  0  0  0  0
      1408  msec/iter =   81.40  ROE[avg,max] = [0.227343750, 0.265625000]  radices = 176 16 16 16  0  0  0  0  0  0
      1536  msec/iter =   91.96  ROE[avg,max] = [0.267187500, 0.343750000]  radices =  48 32 32 16  0  0  0  0  0  0
      1664  msec/iter =   98.80  ROE[avg,max] = [0.270758929, 0.312500000]  radices = 208 16 16 16  0  0  0  0  0  0
      1792  msec/iter =  106.67  ROE[avg,max] = [0.220532663, 0.250000000]  radices = 224 16 16 16  0  0  0  0  0  0
      1920  msec/iter =  115.22  ROE[avg,max] = [0.257756696, 0.312500000]  radices = 240 16 16 16  0  0  0  0  0  0
      2048  msec/iter =  123.77  ROE[avg,max] = [0.236921038, 0.281250000]  radices = 256 16 16 16  0  0  0  0  0  0
      2304  msec/iter =  143.72  ROE[avg,max] = [0.248751395, 0.312500000]  radices = 288 16 16 16  0  0  0  0  0  0
      2560  msec/iter =  159.73  ROE[avg,max] = [0.236908831, 0.312500000]  radices = 160 32 16 16  0  0  0  0  0  0
      2816  msec/iter =  186.83  ROE[avg,max] = [0.263392857, 0.312500000]  radices = 176 32 16 16  0  0  0  0  0  0
      3072  msec/iter =  206.21  ROE[avg,max] = [0.224818638, 0.251953125]  radices =  48 32 32 32  0  0  0  0  0  0
      3328  msec/iter =  227.02  ROE[avg,max] = [0.281250000, 0.375000000]  radices = 208 32 16 16  0  0  0  0  0  0
      3584  msec/iter =  245.98  ROE[avg,max] = [0.252343750, 0.312500000]  radices = 224 32 16 16  0  0  0  0  0  0
      3840  msec/iter =  267.59  ROE[avg,max] = [0.248437500, 0.343750000]  radices = 240 32 16 16  0  0  0  0  0  0
      4096  msec/iter =  279.72  ROE[avg,max] = [0.295089286, 0.343750000]  radices = 128 32 32 16  0  0  0  0  0  0
      4608  msec/iter =  324.20  ROE[avg,max] = [0.258928571, 0.312500000]  radices = 144 32 32 16  0  0  0  0  0  0
      5120  msec/iter =  354.06  ROE[avg,max] = [0.237137277, 0.281250000]  radices = 160 32 32 16  0  0  0  0  0  0
      5632  msec/iter =  407.10  ROE[avg,max] = [0.256919643, 0.312500000]  radices = 176 32 32 16  0  0  0  0  0  0
      6144  msec/iter =  457.47  ROE[avg,max] = [0.246651786, 0.281250000]  radices = 192 32 32 16  0  0  0  0  0  0
      6656  msec/iter =  492.90  ROE[avg,max] = [0.262500000, 0.312500000]  radices = 208 32 32 16  0  0  0  0  0  0
      7168  msec/iter =  542.15  ROE[avg,max] = [0.224874442, 0.281250000]  radices = 224 32 32 16  0  0  0  0  0  0
      7680  msec/iter =  592.53  ROE[avg,max] = [0.237053571, 0.281250000]  radices = 240 32 32 16  0  0  0  0  0  0
Has anyone had any luck with a 64-bit distro on the new pi3b+? I can't get the Gentoo distro that works with the pi3b to boot. It tries, but the low-power lightning symbol appears top right. The power source is definitely good, so I'm guessing the hardware differences between the 3b and 3b+ require an image update.
Old 2018-03-23, 14:23   #163
wombatman
I moo ablest echo power!
 
 
May 2013

3·619 Posts
Default

I haven't got a Pi3b+, but contact the person who made the Gentoo image (sakaki). They were very responsive, polite, and helpful.
Old 2018-06-14, 14:55   #164
VictordeHolland
 
 
"Victor de Hollander"
Aug 2011
the Netherlands

3²×131 Posts
Default

ARM Cortex-A76 announced

Some key features:
Performance orientated design
OoO (Out of Order)
4-wide (=wider than the previous A57, A72, A73, A75)
Dual-128bit ASIMD/FP execution pipelines
Increased memory bandwidth/ lower latency throughout the caches/memory

https://www.anandtech.com/show/12785...7nm-powerhouse

Last fiddled with by VictordeHolland on 2018-06-14 at 14:56 Reason: removed extra newlines
Old 2018-06-14, 20:59   #165
ewmayer
2ω=0
 
 
Sep 2002
República de California

26754₈ Posts
Default

Quote:
Originally Posted by VictordeHolland
ARM Cortex-A76 announced
Thanks for the link, Victor. I've been hearing that Apple is working on a custom high-perf ARM-based chip for its future PCs, and I surmised that 256-bit-wide vectors would be a key part of that. Let's just hope that the various 256-bit CPUs are code-compatible.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
