mersenneforum.org  

2017-11-22, 01:46   #133
ewmayer
2ω=0

Sep 2002
República de California

2²·2,939 Posts

I sent e-mail to Hardkernel last night:
Quote:
I bought a small Odroid C2 earlier this year and over the last 4 months used it to port a recreational-mathematics code I maintain to the ARMv8 vector-arithmetic instruction set. The only problem is that the A53 processor used in the C2 is so relatively slow compared to newer and higher-end ARM processors such as the A57. Is there any chance Hardkernel might be working on a newer version of the Odroid based on a processor like the Cortex-A57? I think there could be a real market for an A57-based linux micro-PC in say, the $100 price range for a quad-core, and perhaps $150 for an octocore. The timings posted by several users of (specialized and expensive) A57-based systems here indicate that it gives about 3x the per-cycle throughput of the A53 used in the Odroid C2. Thus a $100 quad A57 would offer 3x the throughput for 2x the price, and a $150 octocore-A57 would mean 6x the throughput for 3x the price.

Got a reply suggesting I check out the Odroid forums, so I just spent a couple of hours poking around there. The please-build-something-based-on-Cortex-A57 idea is well-covered in the "Wish list for XU5" thread of the Ideas subforum, though I saw no "this is on our roadmap" confirmation from the Odroid folks there.
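
As a quick check of the price/throughput arithmetic in the e-mail quoted above, here is a minimal sketch. The C2's price and throughput are both normalized to 1.0; the 3x-per-core figure and the 2x/3x price multiples are the ones from the e-mail, and the octocore entry assumes throughput scales linearly with core count:

```python
# Throughput and price relative to the Odroid C2 (both normalized to 1.0).
# Multiples are taken from the e-mail; linear scaling with core count is assumed.
boards = {
    "C2 (quad A53)": {"throughput": 1.0, "price": 1.0},
    "quad A57": {"throughput": 3.0, "price": 2.0},
    "octo A57": {"throughput": 6.0, "price": 3.0},
}

def throughput_per_price(board):
    """Relative throughput obtained per unit of relative price."""
    return board["throughput"] / board["price"]

value = {name: throughput_per_price(b) for name, b in boards.items()}
for name, v in value.items():
    print(f"{name}: {v:.2f}x throughput per unit price")
```

In other words, the hypothetical quad would give 1.5x and the octocore 2x the throughput per dollar of the C2.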

Might be useful to post a link to this thread and the performance data folks have posted here (SIMD-versus-not, A53-versus-A57, etc). Any suggestions as to which subforum at the above site would be most appropriate for such a thread are welcome - perhaps the Projects subforum?

My little C2 continues to steadily crunch away, currently ~25% through its first GIMPS double-check assignment.

2017-12-11, 12:38   #134
M344587487

"Composite as Heck"
Oct 2017

2·5²·19 Posts

There's an A53 board currently looking for crowdfunding on indiegogo, which looks interesting compared to similar boards because of the RAM: https://www.indiegogo.com/projects/r...android-linux/

DDR4-2133 in the Renegade vs LPDDR2 900 in a Pi 3, could this potentially yield an appreciable difference in mlucas? Failing that, is there some other prime related use for a low cost device with more performant RAM? There are 3 options for RAM, 1, 2 and 4 GB.

2017-12-11, 20:51   #135
ewmayer
2ω=0

Sep 2002
República de California

26754₈ Posts

Quote:
Originally Posted by M344587487
There's an A53 board currently looking for crowdfunding on indiegogo, which looks interesting compared to similar boards because of the RAM: https://www.indiegogo.com/projects/r...android-linux/

DDR4-2133 in the Renegade vs LPDDR2 900 in a Pi 3, could this potentially yield an appreciable difference in mlucas? Failing that, is there some other prime related use for a low cost device with more performant RAM? There are 3 options for RAM, 1, 2 and 4 GB.

Faster memory is nearly always better, but how much depends on the degree to which throughput is being bottlenecked by the RAM versus the processor itself. Without being able to test those 2 factors independently, only running code on such a system will tell us how much the faster RAM might make up of the 3x speed difference (on a per-core basis) I see on my Odroid C2 compared to the high-end A57 CPUs fivemack tested the code on.

2017-12-11, 23:45   #136
M344587487

"Composite as Heck"
Oct 2017

2·5²·19 Posts

Ok, worth a shot. I pledged for a 1GB Renegade, and am now prepping to bench a pi3 in armv8 mode (64 bit kernel).

Compiled on pi3 with
Code:
gcc -march=armv8-a -mtune=cortex-a53 -mcpu=cortex-a53 -c -O3 -DUSE_ARM_V8_SIMD -DUSE_THREADS ../src/*.c >& build.log
No errors in build.log, but running the self tests seg faulted almost immediately, same place after multiple attempts. Running with no args it detects that the processor has armv8 simd and that we're using it. This is as far as the self test got:
Code:
    Mlucas 17.1

    http://hogranch.com/mayer/README.html

INFO: using 53-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
INFO: testing FFT radix tables...

           Mlucas selftest running.....

/****************************************************************************/

INFO: Unable to find/open mlucas.cfg file in r+ mode ... creating from scratch.
NTHREADS = 4
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices      1024        16        32
radix16_dif_dit_pass pfetch_dist = 32
radix16_wrapper_square: pfetch_dist = 1024
Recompiled without -DUSE_ARM_V8_SIMD, and let the self test run for a minute with no seg fault, so it looks like scalar works (didn't leave it running, waiting on a heatsink before stressing it). Will try to recompile simd tomorrow without -march=armv8-a -mtune=cortex-a53 -mcpu=cortex-a53, but if it still segfaults I'm out of ideas.
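
For anyone reading the self-test output above: the "bits per digit" figure is simply the exponent divided by the FFT length in doubles. A small sketch reproducing two values that appear in logs in this thread (the function name is made up for illustration, not taken from the Mlucas source):

```python
def bits_per_digit(exponent, fft_length_k):
    """Average number of bits of the Mersenne number stored per FFT word."""
    n_words = fft_length_k * 1024  # FFT lengths are quoted in units of K = 1024 doubles
    return exponent / n_words

# M20000047 at 1024K and M39397201 at 2048K, as printed by the self-test.
print(bits_per_digit(20000047, 1024))
print(bits_per_digit(39397201, 2048))
```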

2017-12-12, 01:00   #137
ewmayer
2ω=0

Sep 2002
República de California

2²×2,939 Posts

@M344587487:

I also tried builds on my Odroid with and sans the extra -march flags, using them actually gave a consistently 1-2% slower binary, so I reverted to the basic -O3.

Suggest firing up the SIMD-enabled binary and running the self-test under gdb ('run -s m') to try to localize the crash. If on crashing-under-gdb 'where' gives a specific function, you can rebuild the file containing same with some debugging enabled to try to further localize things. E.g. if 'where' indicates (say) the radix1024_ditN_cy_dif1 function, here is what I would do if the crash were on my system:

ctrl-z (pause gdb)
gcc -c -O0 -g3 -ggdb ../src/radix1024_ditN_cy_dif1.c (rebuild just that file with no-opts and debug symbols enabled)
gcc -o Mlucas *.o -lm -lpthread -lrt && fg (relink and go back into gdb)
run -s m

...and if the crash again occurs in the same function (if not, you'd need to redo the above with -O1 to see if that minimal opt-level allows you to reproduce the crash), it should now allow you to zero in on a specific line number, probably one with a SIMD-asm macro invocation.
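
The run-then-'where' part of the above can also be done non-interactively. A minimal sketch, assuming gdb is on the PATH and the binary is ./Mlucas in the current directory (both are assumptions; adjust to your setup). It relies only on gdb's standard -batch, -ex and --args options, and the helper names are made up for illustration:

```python
import subprocess

def gdb_backtrace_command(binary="./Mlucas", program_args=("-s", "m")):
    """Build a gdb command line that runs the program and prints a backtrace."""
    return [
        "gdb", "-batch",
        "-ex", "run",      # start the program under gdb
        "-ex", "where",    # once it stops (e.g. on SIGSEGV), print the call stack
        "--args", binary, *program_args,
    ]

def run_under_gdb(cmd):
    """Run the gdb command and return everything it printed."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout + result.stderr

cmd = gdb_backtrace_command()
print(" ".join(cmd))
```

Calling run_under_gdb(cmd) then gives the same output as the interactive session, which is handy for saving to a file and posting.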

Two further questions:

1. What version of gcc are you using?

2. What does cat /proc/cpuinfo say re. the CPU(s) on the system in question?
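
For the second question, the relevant fields can be pulled out of /proc/cpuinfo with a short script. A sketch, assuming the ARM-style "key : value" layout; the part-number table is a small hand-picked subset of Arm's published part numbers (0xd03 = Cortex-A53, 0xd07 = Cortex-A57, 0xd08 = Cortex-A72), and the sample text is illustrative rather than taken from any particular board:

```python
# Subset of Arm CPU part numbers (implementer 0x41 = Arm Ltd.).
ARM_PARTS = {0xD03: "Cortex-A53", 0xD07: "Cortex-A57", 0xD08: "Cortex-A72"}

def parse_cpuinfo(text):
    """Split /proc/cpuinfo text into one dict per processor entry."""
    cpus = []
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                fields[key.strip()] = value.strip()
        if "processor" in fields:  # skip trailing non-CPU sections
            cpus.append(fields)
    return cpus

def core_name(fields):
    """Map the 'CPU part' field to a core name, if it is in the table."""
    part = int(fields["CPU part"], 16)
    return ARM_PARTS.get(part, f"unknown part {part:#x}")

# Illustrative sample; on a real system use open("/proc/cpuinfo").read().
SAMPLE = """\
processor       : 0
Features        : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU part        : 0xd03

processor       : 1
Features        : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU part        : 0xd03
"""

cpus = parse_cpuinfo(SAMPLE)
print(len(cpus), "cores:", {core_name(c) for c in cpus})
print("asimd present:", all("asimd" in c["Features"].split() for c in cpus))
```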

2017-12-12, 11:22   #138
M344587487

"Composite as Heck"
Oct 2017

2×5²×19 Posts

Running this distribution: https://github.com/bamarni/pi64

/proc/cpuinfo:
Code:
processor       : 0
BogoMIPS        : 38.40
Features        : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4

processor       : 1
BogoMIPS        : 38.40
Features        : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4

processor       : 2
BogoMIPS        : 38.40
Features        : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4

processor       : 3
BogoMIPS        : 38.40
Features        : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4
gcc --version:
Code:
gcc (Debian 6.3.0-18) 6.3.0 20170516
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Apologies for being verbose, I'm inexperienced with gdb and something funky is happening, so I need you to check my homework :P

where indicated radix32_wrapper_square:
Code:
(gdb) run -s m
Starting program: /home/pi/mlucas/mlucas_v17.1/mlucas -s m
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".

    Mlucas 17.1

    http://hogranch.com/mayer/README.html

INFO: testing qfloat routines...
CPU Family = ARM Embedded ABI, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 6.3.0 20170516.
INFO: Build uses ARMv8 advanced-SIMD instruction set.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: MLUCAS_PATH is set to ""
INFO: using 53-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
Setting DAT_BITS = 10, PAD_BITS = 2
INFO: testing IMUL routines...
INFO: System has 4 available processor cores.
INFO: testing FFT radix tables...

           Mlucas selftest running.....

/****************************************************************************/

INFO: Unable to find/open mlucas.cfg file in r+ mode ... creating from scratch.
No CPU set or threadcount specified ... running single-threaded.
Set affinity for the following 1 cores: 0.
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices      1024        16        32
[New Thread 0x7fb3c97200 (LWP 1224)]
mers_mod_square: Init threadpool of 1 threads
radix16_dif_dit_pass pfetch_dist = 32
radix16_wrapper_square: pfetch_dist = 1024

Thread 2 "mlucas" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fb3c97200 (LWP 1224)]
0x000000010016bf68 in radix32_wrapper_square ()
(gdb) where
#0  0x000000010016bf68 in radix32_wrapper_square ()
#1  0x000000010004e8c0 in mers_process_chunk ()
#2  0x000000010022b4e4 in worker_thr_routine ()
#3  0x0000007fb7f020a0 in start_thread (arg=0x10022b288 <worker_thr_routine>) at pthread_create.c:335
#4  0x0000007fb7e61edc in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:77
(gdb)
After rebuilding and relinking it didn't crash there but it did with radix16_wrapper_square instead, and also the first test had excessive roundoff on iteration 5, which I didn't think was possible so early on:
Code:
M20000047 Roundoff warning on iteration        5, maxerr =   0.406358212233
M20000047 Roundoff warning on iteration        6, maxerr =   0.484222412109
 FATAL ERROR...Halting test of exponent 20000047
***** Excessive level of roundoff error detected - this radix set will not be used. *****
NTHREADS = 1
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices      1024        32        16
[New Thread 0x7fb27e6200 (LWP 1389)]
mers_mod_square: Init threadpool of 1 threads

Thread 4 "mlucas" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fb27e6200 (LWP 1389)]
0x00000001000f2aac in radix16_wrapper_square ()
Retried radix32 with -O1, crashed at radix16 as before. And again with -O2 and -O3, and annoyingly I couldn't even get it to repeat the crash at radix32 with the original build options and without the debugging options:
Code:
-march=armv8-a -mtune=cortex-a53 -mcpu=cortex-a53 -O3
Assuming the first radix32 error was probably erroneous, I tried the same process with radix16. With -O0 it no longer crashed, but every test had excessive roundoff in iteration 5:
Code:
Starting program: /home/pi/mlucas/mlucas_v17.1/obj/mlucas -s m
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".

    Mlucas 17.1

    http://hogranch.com/mayer/README.html

INFO: testing qfloat routines...
CPU Family = ARM Embedded ABI, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 6.3.0 20170516.
INFO: Build uses ARMv8 advanced-SIMD instruction set.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: MLUCAS_PATH is set to ""
INFO: using 53-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
Setting DAT_BITS = 10, PAD_BITS = 2
INFO: testing IMUL routines...
INFO: System has 4 available processor cores.
INFO: testing FFT radix tables...

           Mlucas selftest running.....

/****************************************************************************/

INFO: Unable to find/open mlucas.cfg file in r+ mode ... creating from scratch.
No CPU set or threadcount specified ... running single-threaded.
Set affinity for the following 1 cores: 0.
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices      1024        16        32
[New Thread 0x7fb3c97200 (LWP 1645)]
mers_mod_square: Init threadpool of 1 threads
radix16_dif_dit_pass pfetch_dist = 32
[New Thread 0x7fb3080200 (LWP 1646)]
Using 1 threads in carry step
M20000047 Roundoff warning on iteration        5, maxerr =   0.406358203385
M20000047 Roundoff warning on iteration        6, maxerr =   0.484222412109
 FATAL ERROR...Halting test of exponent 20000047
***** Excessive level of roundoff error detected - this radix set will not be used. *****
NTHREADS = 1
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices      1024        32        16
[New Thread 0x7fb285e200 (LWP 1647)]
mers_mod_square: Init threadpool of 1 threads
M20000047 Roundoff warning on iteration        5, maxerr =   0.406358212698
M20000047 Roundoff warning on iteration        6, maxerr =   0.484222412109
 FATAL ERROR...Halting test of exponent 20000047
***** Excessive level of roundoff error detected - this radix set will not be used. *****
NTHREADS = 1
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices       256         8        16        16
[New Thread 0x7fb205e200 (LWP 1648)]
mers_mod_square: Init threadpool of 1 threads
[New Thread 0x7fb185e200 (LWP 1649)]
Using 1 threads in carry step
M20000047 Roundoff warning on iteration        5, maxerr =   0.406358212698
M20000047 Roundoff warning on iteration        6, maxerr =   0.484222412109
 FATAL ERROR...Halting test of exponent 20000047
***** Excessive level of roundoff error detected - this radix set will not be used. *****
NTHREADS = 1
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices       128        16        16        16
[New Thread 0x7fb105e200 (LWP 1650)]
mers_mod_square: Init threadpool of 1 threads
[New Thread 0x7fb085e200 (LWP 1651)]
Using 1 threads in carry step
M20000047 Roundoff warning on iteration        5, maxerr =   0.406358212698
M20000047 Roundoff warning on iteration        6, maxerr =   0.484222412109
 FATAL ERROR...Halting test of exponent 20000047
***** Excessive level of roundoff error detected - this radix set will not be used. *****
NTHREADS = 1
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices        64        16        16        32
[New Thread 0x7fb005e200 (LWP 1652)]
mers_mod_square: Init threadpool of 1 threads
[New Thread 0x7faf85e200 (LWP 1653)]
Using 1 threads in carry step
M20000047 Roundoff warning on iteration        5, maxerr =   0.406358203385
M20000047 Roundoff warning on iteration        6, maxerr =   0.484222412109
 FATAL ERROR...Halting test of exponent 20000047
***** Excessive level of roundoff error detected - this radix set will not be used. *****
NTHREADS = 1
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices        64        32        16        16
[New Thread 0x7faf05e200 (LWP 1654)]
mers_mod_square: Init threadpool of 1 threads
M20000047 Roundoff warning on iteration        5, maxerr =   0.406358212698
M20000047 Roundoff warning on iteration        6, maxerr =   0.484222412109
 FATAL ERROR...Halting test of exponent 20000047
***** Excessive level of roundoff error detected - this radix set will not be used. *****
NTHREADS = 1
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices        64         8         8         8        16
[New Thread 0x7fae85e200 (LWP 1655)]
mers_mod_square: Init threadpool of 1 threads
M20000047 Roundoff warning on iteration        5, maxerr =   0.406358212698
M20000047 Roundoff warning on iteration        6, maxerr =   0.484222412109
 FATAL ERROR...Halting test of exponent 20000047
***** Excessive level of roundoff error detected - this radix set will not be used. *****
NTHREADS = 1
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices        32        16        32        32
[New Thread 0x7fae05e200 (LWP 1656)]
mers_mod_square: Init threadpool of 1 threads
[New Thread 0x7fad85e200 (LWP 1657)]
Using 1 threads in carry step
M20000047 Roundoff warning on iteration        5, maxerr =   0.406358203385
M20000047 Roundoff warning on iteration        6, maxerr =   0.484222412109
 FATAL ERROR...Halting test of exponent 20000047
***** Excessive level of roundoff error detected - this radix set will not be used. *****
NTHREADS = 1
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices        32        32        32        16
[New Thread 0x7fad05e200 (LWP 1658)]
mers_mod_square: Init threadpool of 1 threads
M20000047 Roundoff warning on iteration        5, maxerr =   0.406358212698
M20000047 Roundoff warning on iteration        6, maxerr =   0.484222412109
 FATAL ERROR...Halting test of exponent 20000047
***** Excessive level of roundoff error detected - this radix set will not be used. *****
NTHREADS = 1
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices        32         8         8        16        16
[New Thread 0x7fac85e200 (LWP 1660)]
mers_mod_square: Init threadpool of 1 threads
M20000047 Roundoff warning on iteration        5, maxerr =   0.406358212698
M20000047 Roundoff warning on iteration        6, maxerr =   0.484222412109
 FATAL ERROR...Halting test of exponent 20000047
***** Excessive level of roundoff error detected - this radix set will not be used. *****
NTHREADS = 1
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices        16        32        32        32
[New Thread 0x7fac05e200 (LWP 1661)]
mers_mod_square: Init threadpool of 1 threads
[New Thread 0x7fab85e200 (LWP 1662)]
Using 1 threads in carry step
M20000047 Roundoff warning on iteration        5, maxerr =   0.406358203385
M20000047 Roundoff warning on iteration        6, maxerr =   0.484222412109
 FATAL ERROR...Halting test of exponent 20000047
***** Excessive level of roundoff error detected - this radix set will not be used. *****
...
Same for "-O1 -g3 -ggdb", "-O2 -g3 -ggdb", "-O3 -g3 -ggdb", "-march=armv8-a -mtune=cortex-a53 -mcpu=cortex-a53 -O3 -g3 -ggdb", and annoyingly again the original build setting of "-march=armv8-a -mtune=cortex-a53 -mcpu=cortex-a53 -O3" now works but with constant roundoff errors.
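
For context on what the maxerr figures mean: in an FFT-based multiply the convolution outputs should be exact integers, and the roundoff error is the largest distance of any output from the nearest integer. Values approaching 0.5 mean the rounding can no longer be trusted. The toy sketch below illustrates the measurement with a plain recursive FFT and a zero-padded squaring; it is not Mlucas's actual algorithm, which is far more elaborate, and all names are made up for illustration:

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def square_with_roundoff(digits):
    """Square a digit vector via FFT; return (integer outputs, max roundoff error)."""
    n = 2 * len(digits)  # zero-pad so the cyclic convolution equals the plain one
    padded = [complex(d) for d in digits] + [0j] * len(digits)
    spectrum = fft(padded)
    conv = [z.real / n for z in fft([z * z for z in spectrum], invert=True)]
    rounded = [round(x) for x in conv]
    maxerr = max(abs(x - r) for x, r in zip(conv, rounded))
    return rounded, maxerr

coeffs, maxerr = square_with_roundoff([3, 1, 4, 1])
print(coeffs, maxerr)
```

With inputs this small the error is tiny; it grows with the transform length and the number of bits packed into each word, which is why a healthy build should not be anywhere near 0.4 on iteration 5.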

I'm now rebuilding from scratch without the -march, -mtune or -mcpu flags just in case that's the problem. Could it be that the distribution I'm using is unsuitable? If this clean build fails I guess the next step is to try a different distro.

Thanks for walking me through the debugging steps, I've barely used gdb before so it helped a lot.

edit: The clean build seg faults at radix32 as before:
Code:
pi@raspberrypi:~/mlucas/mlucas_v17.1/clean$ gdb mlucas
GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "aarch64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from mlucas...(no debugging symbols found)...done.
(gdb) run -s m
Starting program: /home/pi/mlucas/mlucas_v17.1/clean/mlucas -s m
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".

    Mlucas 17.1

    http://hogranch.com/mayer/README.html

INFO: testing qfloat routines...
CPU Family = ARM Embedded ABI, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 6.3.0 20170516.
INFO: Build uses ARMv8 advanced-SIMD instruction set.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: MLUCAS_PATH is set to ""
INFO: using 53-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
Setting DAT_BITS = 10, PAD_BITS = 2
INFO: testing IMUL routines...
INFO: System has 4 available processor cores.
INFO: testing FFT radix tables...

           Mlucas selftest running.....

/****************************************************************************/

INFO: Unable to find/open mlucas.cfg file in r+ mode ... creating from scratch.
No CPU set or threadcount specified ... running single-threaded.
Set affinity for the following 1 cores: 0.
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices      1024        16        32
[New Thread 0x7fb3c97200 (LWP 2237)]
mers_mod_square: Init threadpool of 1 threads
radix16_dif_dit_pass pfetch_dist = 32
radix16_wrapper_square: pfetch_dist = 1024

Thread 2 "mlucas" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fb3c97200 (LWP 2237)]
0x000000010016ca4c in radix32_wrapper_square ()
(gdb) where
#0  0x000000010016ca4c in radix32_wrapper_square ()
#1  0x000000010004f7d8 in mers_process_chunk ()
#2  0x000000010022ac34 in worker_thr_routine ()
#3  0x0000007fb7f020a0 in start_thread (arg=0x10022a9d8 <worker_thr_routine>) at pthread_create.c:335
#4  0x0000007fb7e61edc in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:77
(gdb) quit

Last fiddled with by M344587487 on 2017-12-12 at 11:33

2017-12-12, 17:33   #139
M344587487

"Composite as Heck"
Oct 2017

2×5²×19 Posts

pi3b scalar 4 thread stock (A53 @ 1.2 GHz) Pi64 distro: https://github.com/bamarni/pi64
Code:
17.1
      1024  msec/iter =   74.14  ROE[avg,max] = [0.262276786, 0.312500000]  radices = 256  8 16 16  0  0  0  0  0  0
      1152  msec/iter =   90.16  ROE[avg,max] = [0.206633650, 0.250000000]  radices = 288  8 16 16  0  0  0  0  0  0
      1280  msec/iter =  111.18  ROE[avg,max] = [0.222712054, 0.250000000]  radices = 160 16 16 16  0  0  0  0  0  0
      1408  msec/iter =  118.85  ROE[avg,max] = [0.228299386, 0.250000000]  radices = 176 16 16 16  0  0  0  0  0  0
      1536  msec/iter =  133.57  ROE[avg,max] = [0.234375000, 0.312500000]  radices = 192 16 16 16  0  0  0  0  0  0
      1664  msec/iter =  137.73  ROE[avg,max] = [0.229310826, 0.281250000]  radices = 208 16 16 16  0  0  0  0  0  0
      1792  msec/iter =  146.71  ROE[avg,max] = [0.221177455, 0.281250000]  radices = 224 16 16 16  0  0  0  0  0  0
      1920  msec/iter =  164.82  ROE[avg,max] = [0.258203125, 0.312500000]  radices = 240 16 16 16  0  0  0  0  0  0
      2048  msec/iter =  190.78  ROE[avg,max] = [0.216552734, 0.250000000]  radices = 128 32 16 16  0  0  0  0  0  0
      2304  msec/iter =  199.57  ROE[avg,max] = [0.254799107, 0.312500000]  radices = 288 16 16 16  0  0  0  0  0  0
      2560  msec/iter =  261.56  ROE[avg,max] = [0.302678571, 0.375000000]  radices = 160 16 16 32  0  0  0  0  0  0
      2816  msec/iter =  276.59  ROE[avg,max] = [0.265848214, 0.312500000]  radices = 176 32 16 16  0  0  0  0  0  0
      3072  msec/iter =  423.48  ROE[avg,max] = [0.260714286, 0.312500000]  radices = 768  8 16 16  0  0  0  0  0  0
      3328  msec/iter =  576.40  ROE[avg,max] = [0.316964286, 0.375000000]  radices = 208  8  8  8 16  0  0  0  0  0
      3584  msec/iter =  606.78  ROE[avg,max] = [0.227008929, 0.281250000]  radices = 224  8  8  8 16  0  0  0  0  0
      3840  msec/iter =  522.76  ROE[avg,max] = [0.227008929, 0.281250000]  radices =  60 32 32 32  0  0  0  0  0  0
      4096  msec/iter =  570.49  ROE[avg,max] = [0.260937500, 0.281250000]  radices =  64 32 32 32  0  0  0  0  0  0
      4608  msec/iter =  629.13  ROE[avg,max] = [0.226729911, 0.265625000]  radices = 144 16 32 32  0  0  0  0  0  0
      5120  msec/iter =  718.28  ROE[avg,max] = [0.248325893, 0.312500000]  radices = 160 16 32 32  0  0  0  0  0  0
      5632  msec/iter =  775.04  ROE[avg,max] = [0.300000000, 0.343750000]  radices =  44 16 16 16 16  0  0  0  0  0
      6144  msec/iter =  860.31  ROE[avg,max] = [0.238113839, 0.281250000]  radices = 192 16 32 32  0  0  0  0  0  0
      6656  msec/iter =  867.53  ROE[avg,max] = [0.303348214, 0.375000000]  radices = 208 16 32 32  0  0  0  0  0  0
      7168  msec/iter = 1302.08  ROE[avg,max] = [0.310044643, 0.375000000]  radices = 224 32 32 16  0  0  0  0  0  0
      7680  msec/iter = 1186.41  ROE[avg,max] = [0.232700893, 0.281250000]  radices = 240  8  8 16 16  0  0  0  0  0
edit: As for power consumption, it was bouncing between 4.0W and 6.0W from the wall throughout the test. No tweaks to power consumption, totally stock.

Last fiddled with by M344587487 on 2017-12-12 at 17:38

2017-12-12, 21:30   #140
ewmayer
2ω=0

Sep 2002
República de California

26754₈ Posts

@M344587487: Thanks for the data - from the general symptomology my suspicion is a bad SIMD compile. Can you get hold of a newer GCC version for your distro? Tom Womack hit a bad build using his default-installed 4.6, but was able to build the SIMD code fine using GCC 7.2. Trying GCC v7 should be our first step, because debugging the nonreproducible crashes you describe from your various compile attempts sounds like a nightmare.

2017-12-12, 22:47   #141
M344587487

"Composite as Heck"
Oct 2017

950₁₀ Posts

I changed to the distro ET_ was using for his benchmarks and the simd tests completed. The power consumption bounced between 4.5W and 5.5W, but seemed much more stable around 5.0W than the scalar test was:

pi3b simd 4 thread stock (A53 @ 1.2 GHz) gentoo 64 bit, gcc 6.4.0: https://github.com/sakaki-/gentoo-on-rpi3-64bit
Code:
17.1
      1024  msec/iter =   55.98  ROE[avg,max] = [0.254687500, 0.312500000]  radices = 256  8 16 16  0  0  0  0  0  0
      1152  msec/iter =   63.23  ROE[avg,max] = [0.223256138, 0.281250000]  radices = 144 16 16 16  0  0  0  0  0  0
      1280  msec/iter =   68.89  ROE[avg,max] = [0.264508929, 0.343750000]  radices = 160 16 16 16  0  0  0  0  0  0
      1408  msec/iter =   80.33  ROE[avg,max] = [0.227343750, 0.265625000]  radices = 176 16 16 16  0  0  0  0  0  0
      1536  msec/iter =   88.89  ROE[avg,max] = [0.254241071, 0.312500000]  radices = 192 16 16 16  0  0  0  0  0  0
      1664  msec/iter =   97.32  ROE[avg,max] = [0.270758929, 0.312500000]  radices = 208 16 16 16  0  0  0  0  0  0
      1792  msec/iter =  105.75  ROE[avg,max] = [0.220532663, 0.250000000]  radices = 224 16 16 16  0  0  0  0  0  0
      1920  msec/iter =  116.11  ROE[avg,max] = [0.257756696, 0.312500000]  radices = 240 16 16 16  0  0  0  0  0  0
      2048  msec/iter =  123.62  ROE[avg,max] = [0.236921038, 0.281250000]  radices = 256 16 16 16  0  0  0  0  0  0
      2304  msec/iter =  140.70  ROE[avg,max] = [0.248751395, 0.312500000]  radices = 288 16 16 16  0  0  0  0  0  0
      2560  msec/iter =  162.87  ROE[avg,max] = [0.236908831, 0.312500000]  radices = 160 32 16 16  0  0  0  0  0  0
      2816  msec/iter =  186.90  ROE[avg,max] = [0.262500000, 0.312500000]  radices = 176 32 16 16  0  0  0  0  0  0
      3072  msec/iter =  205.46  ROE[avg,max] = [0.262111119, 0.312500000]  radices = 192 32 16 16  0  0  0  0  0  0
      3328  msec/iter =  224.56  ROE[avg,max] = [0.281250000, 0.375000000]  radices = 208 32 16 16  0  0  0  0  0  0
      3584  msec/iter =  248.33  ROE[avg,max] = [0.252343750, 0.312500000]  radices = 224 32 16 16  0  0  0  0  0  0
      3840  msec/iter =  278.88  ROE[avg,max] = [0.248437500, 0.343750000]  radices = 240 32 16 16  0  0  0  0  0  0
      4096  msec/iter =  305.09  ROE[avg,max] = [0.229129464, 0.281250000]  radices = 256 16 16 32  0  0  0  0  0  0
      4608  msec/iter =  359.20  ROE[avg,max] = [0.258928571, 0.312500000]  radices = 144 32 32 16  0  0  0  0  0  0
      5120  msec/iter =  389.39  ROE[avg,max] = [0.237137277, 0.281250000]  radices = 160 32 32 16  0  0  0  0  0  0
      5632  msec/iter =  459.98  ROE[avg,max] = [0.256919643, 0.312500000]  radices = 176 32 32 16  0  0  0  0  0  0
      6144  msec/iter =  499.97  ROE[avg,max] = [0.246651786, 0.281250000]  radices = 192 32 32 16  0  0  0  0  0  0
      6656  msec/iter =  556.30  ROE[avg,max] = [0.262500000, 0.312500000]  radices = 208 32 32 16  0  0  0  0  0  0
      7168  msec/iter =  594.89  ROE[avg,max] = [0.224874442, 0.281250000]  radices = 224 32 32 16  0  0  0  0  0  0
      7680  msec/iter =  645.15  ROE[avg,max] = [0.237053571, 0.281250000]  radices = 240 32 32 16  0  0  0  0  0  0
Slightly better than ET_'s bench, possibly due to mine having a heatsink? It's close enough that it could just be variance.

It's not all roses though as I tried to do a scalar self test on this distro, which aborted with a stack smash:
Code:
NTHREADS = 4
M39397201: using FFT length 2048K = 2097152 8-byte floats.
 this gives an average   18.786049365997314 bits per digit
Using complex FFT radices      1024        32        32
mers_mod_square: Init threadpool of 4 threads
M39397201 Roundoff warning on iteration       60, maxerr =   0.500000000000
 FATAL ERROR...Halting test of exponent 39397201
***** Excessive level of roundoff error detected - this radix set will not be used. *****
NTHREADS = 4
M39397201: using FFT length 2048K = 2097152 8-byte floats.
 this gives an average   18.786049365997314 bits per digit
Using complex FFT radices       256        16        16        16
mers_mod_square: Init threadpool of 4 threads
*** stack smashing detected ***: ./mlucas terminated
Aborted
Retrying the self test ended in a seg fault instead of a stack smash. Could it be that scalar taxes the cpu more than simd does, to the point where the scalar code is failing erratically due to overheating?

Am now attempting to upgrade gcc on gentoo, never used emerge/gentoo before so google to the rescue, maybe.

Last fiddled with by M344587487 on 2017-12-12 at 22:47

2017-12-13, 00:59   #142
ewmayer
2ω=0

Sep 2002
República de California

2²×2,939 Posts

Quote:
Originally Posted by M344587487
I changed to the distro ET_ was using for his benchmarks and the simd tests completed. The power consumption bounced between 4.5W and 5.5W, but seemed much more stable around 5.0W than the scalar test was
Interesting - according to Luigi's (aka ET) post #129 in this thread, his distro uses gcc 5.4.0, which is older than your v6 with the failed SIMD builds. Your pi3b SIMD-vs-scalar timings are interesting, the SIMD speedup starts out around the same 1.5x others are getting, but then rises to ~1.8x at the larger FFT lengths covered by the medium self-tests.
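
Those ratios can be reproduced directly from the two pi3b tables posted above. A quick sketch using a handful of FFT lengths, with the msec/iter values copied from posts #139 (scalar) and #141 (SIMD):

```python
# msec/iter at selected FFT lengths (in K doubles), copied from the two pi3b tables.
scalar_ms = {1024: 74.14, 2048: 190.78, 4096: 570.49, 7680: 1186.41}
simd_ms = {1024: 55.98, 2048: 123.62, 4096: 305.09, 7680: 645.15}

# Speedup = scalar time / SIMD time at the same FFT length.
speedup = {n: scalar_ms[n] / simd_ms[n] for n in scalar_ms}
for n, s in speedup.items():
    print(f"{n:5d}K  SIMD speedup = {s:.2f}x")
```

This gives roughly 1.3x at 1024K, 1.5x at 2048K and a bit over 1.8x at 4096K and 7680K. Note that the two runs used different distros and compilers, so the comparison is not perfectly controlled.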

No idea what's causing your stack-smash problems with the scalar-double build under that distro, but overall the scalar build should not stress the processor more than the SIMD one, unless it's a local-functional-unit-stress issue *and* the processor in question uses different parts of the silicon for, say, scalar and SIMD arithmetic hardware. While that is the case on x86, with its weird mix of legacy 80-bit FP register arithmetic and IEEE64-compliant SIMD-double math, it would be very surprising to see on an architecture designed to be lean, mean and low-power as is the ARM, and for which both scalar and vector arithmetic are IEEE64-compliant.

Interestingly, on the same-or-different-functional-units-for-scalar-and-simd-math theme, only a few hours before reading your gbd-trial-and-error post above, I had sent the following PM to fivemack, regarding the possible origins of the mere 1.5x SIMD speedup for ARMv8, compared to the 2.5-3x gain I get over scalar-double from using 128-bit SIMD (SSE2) on my Core2:
Quote:
Hi, Tom:

Do the ARMv8 cores share functional units between scalar and vector arithmetic instructions? (That would make sense from a low-power minimal-transistor-count perspective.) If so, are there 2 each of float64 adders and multipliers, and how are those used by the respective instructions? E.g. can 2 scalar 64-bit FP adds start on each cycle, or just one?
The reason for my question is that 1.5x is around what I would expect if both scalar and SIMD shared the same set of functional units, e.g. both could start 2 FP64 adds per cycle. The reason one still expects a gain - just a relatively modest one - from SIMD assembler in that scenario is twofold:

[1] Half as many instructions needed to process the total dataset, i.e. better use of icache;

[2] Hand-rolled ASM better at using registers and FMA.

But since the number of arithmetic functional units is no greater for SIMD than for scalar, that limits the gains to the instruction side only, and the maximum possible arithmetic throughput is the same for both build modes.
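The shared-functional-unit argument in the PM above can be put into numbers with a toy count. This is only a sketch under the hypothetical scenario raised in that PM (two FP64 units shared by scalar and 128-bit SIMD instructions); the operation count and the helper names are invented for illustration and say nothing about what any actual ARMv8 core does.

Code:
```python
from math import ceil


def instruction_count(n_ops: int, lanes: int) -> int:
    """Instructions needed when each one processes `lanes` float64 values."""
    return ceil(n_ops / lanes)


def min_arith_cycles(n_ops: int, fp64_units: int) -> int:
    """Lower bound on cycles if at most `fp64_units` FP64 ops start per cycle."""
    return ceil(n_ops / fp64_units)


n_ops = 1_000_000  # hypothetical number of FP64 adds in the dataset
units = 2          # hypothetical: two shared FP64 adders, as asked in the PM

scalar_instr = instruction_count(n_ops, lanes=1)
simd_instr = instruction_count(n_ops, lanes=2)  # 128 bits = 2 float64 lanes

# The arithmetic bound depends only on the shared units, not on the build mode.
scalar_cycles = min_arith_cycles(n_ops, units)
simd_cycles = min_arith_cycles(n_ops, units)

print(scalar_instr, simd_instr)    # 1000000 500000
print(scalar_cycles, simd_cycles)  # 500000 500000
```

In this model the SIMD build halves the instruction stream but has the same arithmetic floor as the scalar build, so any speedup has to come from the instruction side (icache, register use, FMA) and stays below the ideal 2x for two lanes.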

Last fiddled with by ewmayer on 2017-12-13 at 01:00
Old 2017-12-18, 09:41   #143
ET_
Banned
 
"Luigi"
Aug 2002
Team Italia

5·7·139 Posts
Mlucas for PRP

How hard would it be to give Mlucas the PRP capabilities added to mprime in the last month? I am asking because PRP-C work units are quite small (exponents between 3M and 6M), and they would be a wonderful task for small Berries.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
