Old 2017-11-10, 01:17   #122
ewmayer
2ω=0
 
Sep 2002
República de California

2²×2,939 Posts

Quote:
Originally Posted by GP2
I guess when compiling with gcc and targeting the ODROID-C2 you would use these flags:

Code:
-march=armv8-a -mtune=cortex-a53 -mcpu=cortex-a53
From reading the gcc man page, unlike the x86 case, it would not be enough simply to specify the -march option.

For platforms other than ODROID-C2,
the allowable values for -march are: armv8-a, armv8.1-a
the allowable values for -mtune and -mcpu are: generic, cortex-a35, cortex-a53, cortex-a57, cortex-a72, exynos-m1, qdf24xx, thunderx, xgene1
Thanks for digging those options out, Gord. I did a build on my A53 with the added arch flags you listed, but the resulting mlucas.cfg file shows timings consistently 1-2% slower than without them, under similar conditions with no other user processes running (I suspended the DC I am doing in order to run the self-test timings).

Many thanks for the builds and timings, Tom and Laurent. So A57 is a big step up (roughly 3x faster, comparing the respective 4-threaded timing data) from my little A53, performance-wise, even with no better SIMD-versus-not speedup factor. The || scaling on the A57 is really nice - based on my x86 experience, I did not expect such good numbers in going from 4 to 8 threads.

Tom, without giving away any trade secrets, can you comment on the odds of improved per-cycle SIMD in forthcoming updates of the ARM architecture? As I noted earlier, even a fairly modest upgrade to dual-issue capability, i.e. being able to do one vector add and one vector mul per cycle, would give the FFT code a nice boost. I expect/hope that performing an FADD instruction in hardware draws sufficiently less power than FMUL/FMA that such restricted dual issue would not wreck the low-power aspects of the architecture. (If such an enhancement is not on the current roadmap, maybe if we take key HW planners out for dinner, get them good and drunk, and extract suitable promises from them...)

The A57 8-core/8-thread timings are only ~20% slower than those I get from an AVX2 build of the code on my dual-core 2 GHz Broadwell Intel NUC (running 4-threaded on the 2 hardware cores, which gives me the best overall throughput on that system). It would be illuminating to compare the wattages of those two low-power-oriented solutions. Roughly what is the cost of a bare-bones 8-core A57? My NUC cost somewhere in the $400-500 range.

Last fiddled with by ewmayer on 2017-11-10 at 01:18
Old 2017-11-10, 04:58   #123
ewmayer
2ω=0
 
Sep 2002
República de California

11756₁₀ Posts

Tom's Hardware piece from January 2016 on the A72 - it mentions reduced latency for e.g. FMA, but actual numbers will tell the tale of whether that boosts the FFT code appreciably.
Old 2017-11-14, 10:47   #124
GP2
 
Sep 2003

2·5·7·37 Posts

Quote:
Originally Posted by GP2
For platforms other than ODROID-C2,
the allowable values for -march are: armv8-a, armv8.1-a
the allowable values for -mtune and -mcpu are: generic, cortex-a35, cortex-a53, cortex-a57, cortex-a72, exynos-m1, qdf24xx, thunderx, xgene1
I forgot the following additional allowable values. Quoting "man gcc":

The values cortex-a57.cortex-a53, cortex-a72.cortex-a53, cortex-a73.cortex-a35, cortex-a73.cortex-a53 specify that GCC should tune for a big.LITTLE system.
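
As a rough illustration of how these options combine, here is a minimal Python sketch that assembles the flag list for a given core. The helper name `arch_flags` is hypothetical, and the accepted values are only the ones quoted in this thread from the 2017-era gcc man page; newer GCC releases accept more. The thread does not say whether `-mcpu` accepts the big.LITTLE pairs, so the sketch passes those to `-mtune` only.

```python
# Values quoted in this thread from "man gcc"; not an exhaustive list.
SINGLE_CPUS = {
    "generic", "cortex-a35", "cortex-a53", "cortex-a57", "cortex-a72",
    "exynos-m1", "qdf24xx", "thunderx", "xgene1",
}
BIG_LITTLE = {
    "cortex-a57.cortex-a53", "cortex-a72.cortex-a53",
    "cortex-a73.cortex-a35", "cortex-a73.cortex-a53",
}
ARCHES = {"armv8-a", "armv8.1-a"}


def arch_flags(cpu, march="armv8-a"):
    """Return gcc arch-targeting flags for one of the values listed above."""
    if march not in ARCHES:
        raise ValueError(f"unlisted -march value: {march}")
    if cpu in SINGLE_CPUS:
        # Same three-flag pattern as suggested for the ODROID-C2.
        return [f"-march={march}", f"-mtune={cpu}", f"-mcpu={cpu}"]
    if cpu in BIG_LITTLE:
        # Assumption: only -mtune is given the big.LITTLE pair.
        return [f"-march={march}", f"-mtune={cpu}"]
    raise ValueError(f"unlisted CPU value: {cpu}")


print(" ".join(arch_flags("cortex-a53")))
print(" ".join(arch_flags("cortex-a57.cortex-a53")))
```

Rejecting unlisted values up front is just a convenience for the sketch; gcc itself reports an error for a value it does not know.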
Old 2017-11-17, 21:21   #125
rocky
 
Nov 2017

2 Posts
Test on APM X-Gene Mustang and AMD Opteron 1100

I have access to the GCC Compile Farm (https://cfarm.tetaneutral.net/machines/list/). I compiled Mlucas on two of those machines and ran its benchmark; here are the results.

If you need other tests or more info, just ask.

I used Mlucas v17.1
Compile:
$ gcc -c -O3 -DUSE_ARM_V8_SIMD ../src/*.c |& tee build.log
Great, no error
$ gcc --version
gcc (SUSE Linux) 5.3.1 20160301 [gcc-5-branch revision 233849]
$ gcc -o mlucas *.o -lm -lpthread -lrt
$ ./mlucas -s m |& tee selftest.log

On AMD Opteron 1100 (gcc118), the content of mlucas.cfg is:
Code:
17.1
      1024  msec/iter =   49.45  ROE[avg,max] = [0.217382812, 0.281250000]  radices =  16 32 32 32  0  0  0  0  0  0
      1152  msec/iter =   60.75  ROE[avg,max] = [0.237018694, 0.281250000]  radices =  36 16 32 32  0  0  0  0  0  0
      1280  msec/iter =   66.72  ROE[avg,max] = [0.289955357, 0.375000000]  radices =  20 32 32 32  0  0  0  0  0  0
      1408  msec/iter =   78.04  ROE[avg,max] = [0.244419643, 0.312500000]  radices =  44 16 32 32  0  0  0  0  0  0
      1536  msec/iter =   80.59  ROE[avg,max] = [0.227845982, 0.281250000]  radices =  24 32 32 32  0  0  0  0  0  0
      1664  msec/iter =   94.73  ROE[avg,max] = [0.226674107, 0.281250000]  radices =  52 16 32 32  0  0  0  0  0  0
      1792  msec/iter =   97.68  ROE[avg,max] = [0.230078125, 0.281250000]  radices =  56 16 32 32  0  0  0  0  0  0
      1920  msec/iter =  110.56  ROE[avg,max] = [0.234151786, 0.265625000]  radices =  60 16 32 32  0  0  0  0  0  0
      2048  msec/iter =  109.44  ROE[avg,max] = [0.228236607, 0.281250000]  radices =  32 32 32 32  0  0  0  0  0  0
      2304  msec/iter =  132.20  ROE[avg,max] = [0.250456892, 0.312500000]  radices =  36 32 32 32  0  0  0  0  0  0
      2560  msec/iter =  146.44  ROE[avg,max] = [0.236383929, 0.312500000]  radices = 160 16 16 32  0  0  0  0  0  0
      2816  msec/iter =  167.99  ROE[avg,max] = [0.260044643, 0.312500000]  radices = 176 16 16 32  0  0  0  0  0  0
      3072  msec/iter =  175.99  ROE[avg,max] = [0.224818638, 0.251953125]  radices =  48 32 32 32  0  0  0  0  0  0
      3328  msec/iter =  197.09  ROE[avg,max] = [0.280803571, 0.375000000]  radices = 208 32 16 16  0  0  0  0  0  0
      3584  msec/iter =  206.07  ROE[avg,max] = [0.223172433, 0.250000000]  radices =  56 32 32 32  0  0  0  0  0  0
      3840  msec/iter =  229.05  ROE[avg,max] = [0.248437500, 0.343750000]  radices = 240 32 16 16  0  0  0  0  0  0
      4096  msec/iter =  240.88  ROE[avg,max] = [0.243750000, 0.296875000]  radices =  64 32 32 32  0  0  0  0  0  0
      4608  msec/iter =  276.55  ROE[avg,max] = [0.251339286, 0.281250000]  radices = 288 32 16 16  0  0  0  0  0  0
      5120  msec/iter =  300.46  ROE[avg,max] = [0.237053571, 0.265625000]  radices = 160 16 32 32  0  0  0  0  0  0
      5632  msec/iter =  342.46  ROE[avg,max] = [0.261160714, 0.281250000]  radices = 176 16 32 32  0  0  0  0  0  0
      6144  msec/iter =  380.17  ROE[avg,max] = [0.255022321, 0.343750000]  radices = 192 16 32 32  0  0  0  0  0  0
      6656  msec/iter =  403.75  ROE[avg,max] = [0.266085379, 0.312500000]  radices = 208 16 32 32  0  0  0  0  0  0
      7168  msec/iter =  433.80  ROE[avg,max] = [0.233168248, 0.312500000]  radices = 224 16 32 32  0  0  0  0  0  0
      7680  msec/iter =  468.23  ROE[avg,max] = [0.239662388, 0.281250000]  radices = 240 16 32 32  0  0  0  0  0  0
On APM X-Gene Mustang board (gcc113):
Code:
17.1
      1024  msec/iter =   70.94  ROE[avg,max] = [0.228257533, 0.281250000]  radices =  32  8  8 16 16  0  0  0  0  0
      1152  msec/iter =   81.80  ROE[avg,max] = [0.221268136, 0.250000000]  radices = 288  8 16 16  0  0  0  0  0  0
      1280  msec/iter =   91.57  ROE[avg,max] = [0.264508929, 0.343750000]  radices = 160 16 16 16  0  0  0  0  0  0
      1408  msec/iter =  107.83  ROE[avg,max] = [0.227343750, 0.265625000]  radices = 176 16 16 16  0  0  0  0  0  0
      1536  msec/iter =  114.63  ROE[avg,max] = [0.272656250, 0.312500000]  radices =  48 16 32 32  0  0  0  0  0  0
      1664  msec/iter =  127.96  ROE[avg,max] = [0.270758929, 0.312500000]  radices = 208 16 16 16  0  0  0  0  0  0
      1792  msec/iter =  137.40  ROE[avg,max] = [0.230078125, 0.281250000]  radices =  56 16 32 32  0  0  0  0  0  0
      1920  msec/iter =  146.12  ROE[avg,max] = [0.257756696, 0.312500000]  radices = 240 16 16 16  0  0  0  0  0  0
      2048  msec/iter =  155.02  ROE[avg,max] = [0.236921038, 0.281250000]  radices = 256 16 16 16  0  0  0  0  0  0
      2304  msec/iter =  177.40  ROE[avg,max] = [0.248751395, 0.312500000]  radices = 288 16 16 16  0  0  0  0  0  0
      2560  msec/iter =  194.36  ROE[avg,max] = [0.236383929, 0.312500000]  radices = 160 16 16 32  0  0  0  0  0  0
      2816  msec/iter =  225.86  ROE[avg,max] = [0.260044643, 0.312500000]  radices = 176 16 16 32  0  0  0  0  0  0
      3072  msec/iter =  243.38  ROE[avg,max] = [0.267466518, 0.312500000]  radices = 192 16 16 32  0  0  0  0  0  0
      3328  msec/iter =  266.68  ROE[avg,max] = [0.279910714, 0.343750000]  radices = 208 16 16 32  0  0  0  0  0  0
      3584  msec/iter =  285.43  ROE[avg,max] = [0.252566964, 0.312500000]  radices = 224 16 16 32  0  0  0  0  0  0
      3840  msec/iter =  303.83  ROE[avg,max] = [0.249302455, 0.343750000]  radices = 240 16 16 32  0  0  0  0  0  0
      4096  msec/iter =  322.17  ROE[avg,max] = [0.229129464, 0.281250000]  radices = 256 16 16 32  0  0  0  0  0  0
      4608  msec/iter =  366.85  ROE[avg,max] = [0.249079241, 0.281250000]  radices = 288 16 16 32  0  0  0  0  0  0
      5120  msec/iter =  414.81  ROE[avg,max] = [0.232087054, 0.250000000]  radices = 160  8  8 16 16  0  0  0  0  0
      5632  msec/iter =  480.58  ROE[avg,max] = [0.232352121, 0.281250000]  radices = 176  8  8 16 16  0  0  0  0  0
      6144  msec/iter =  518.04  ROE[avg,max] = [0.297767857, 0.343750000]  radices = 192  8  8 16 16  0  0  0  0  0
      6656  msec/iter =  568.81  ROE[avg,max] = [0.310044643, 0.375000000]  radices = 208  8  8 16 16  0  0  0  0  0
      7168  msec/iter =  610.64  ROE[avg,max] = [0.234877232, 0.281250000]  radices = 224  8  8 16 16  0  0  0  0  0
      7680  msec/iter =  650.18  ROE[avg,max] = [0.245975167, 0.281250000]  radices = 240  8  8 16 16  0  0  0  0  0
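
To compare two such cfg files side by side, a small parser is enough. The following is a minimal Python sketch, not part of Mlucas: the function name `parse_cfg` and the regular expression are assumptions based only on the line layout shown above (first column is the FFT length, in K as far as I understand the Mlucas convention, followed by `msec/iter = <time>`). The sample lines are copied from the two single-threaded listings above.

```python
import re

# Matches e.g. "      1024  msec/iter =   49.45  ROE[avg,max] = ..."
CFG_LINE = re.compile(r"^\s*(\d+)\s+msec/iter\s*=\s*([\d.]+)")


def parse_cfg(text):
    """Return {fft_length: msec_per_iter} for mlucas.cfg-style text."""
    timings = {}
    for line in text.splitlines():
        match = CFG_LINE.match(line)
        if match:  # the leading "17.1" version line does not match
            timings[int(match.group(1))] = float(match.group(2))
    return timings


opteron = parse_cfg("""17.1
      1024  msec/iter =   49.45  ROE[avg,max] = [0.217382812, 0.281250000]  radices =  16 32 32 32  0  0  0  0  0  0
      4096  msec/iter =  240.88  ROE[avg,max] = [0.243750000, 0.296875000]  radices =  64 32 32 32  0  0  0  0  0  0
""")
xgene = parse_cfg("""17.1
      1024  msec/iter =   70.94  ROE[avg,max] = [0.228257533, 0.281250000]  radices =  32  8  8 16 16  0  0  0  0  0
      4096  msec/iter =  322.17  ROE[avg,max] = [0.229129464, 0.281250000]  radices = 256 16 16 32  0  0  0  0  0  0
""")

# Ratio > 1 means the X-Gene takes longer per iteration than the Opteron.
for fft in sorted(opteron):
    print(fft, round(xgene[fft] / opteron[fft], 2))
```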
I can give more information on the machines:
gcc113:
$ uname -a
Linux gcc113 3.13.0-92-generic #139-Ubuntu SMP Tue Jun 28 20:45:34 UTC 2016 aarch64 aarch64 aarch64 GNU/Linux
$ cat /proc/cpuinfo
Processor : AArch64 Processor rev 1 (aarch64)
processor : 0
BogoMIPS : 100.00
processor : 1
BogoMIPS : 100.00
processor : 2
BogoMIPS : 100.00
processor : 3
BogoMIPS : 100.00
processor : 4
BogoMIPS : 100.00
processor : 5
BogoMIPS : 100.00
processor : 6
BogoMIPS : 100.00
processor : 7
BogoMIPS : 100.00
Features : fp asimd evtstrm
CPU implementer : 0x50
CPU architecture: AArch64
CPU variant : 0x0
CPU part : 0x000
CPU revision : 1

Hardware : APM X-Gene Mustang board

$ free
total used free shared buffers cached
Mem: 32969968 15259696 17710272 92 259780 13519520
-/+ buffers/cache: 1480396 31489572
Swap: 20409340 55456 20353884


gcc118:
$ uname -a
Linux gcc118 4.1.12-1-default #1 SMP Thu Oct 29 06:43:42 UTC 2015 (e24bad1) aarch64 aarch64 aarch64 GNU/Linux
$ cat /proc/cpuinfo
processor : 0 [nid: 0]
...
processor : 7 [nid: 0]
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x1
CPU part : 0xd07
CPU revision : 2

$ free
total used free shared buffers cached
Mem: 16642240 15505856 1136384 7680 4032 15058944
-/+ buffers/cache: 442880 16199360
Swap: 2104256 3712 2100544

Is it useful?
Do you need more info?
Best regards
Rocky

Last fiddled with by ewmayer on 2017-11-17 at 22:43 Reason: wrapped cfg-file data in code flags for readability
Old 2017-11-17, 23:01   #126
ewmayer
2ω=0
 
Sep 2002
República de California

2DEC₁₆ Posts

Thanks for the build/system info and timings, Rocky - I wrapped your cfg-file data in code flags for better readability.

Some questions:

1. Are these based on specific Cortex-A[something] versions, or are they AMD-custom implementations of the ARMv8 architecture? Are they available retail, and if so, at what prices?

2. What does e.g. 'gcc118' refer to? Is that some variant of the GCC compiler? (I've only heard of GCC versions up to 7, so the '118' puzzles me, since v11.8 makes no sense.) It seems rather to refer to the 2 boards you ran on, is that right?

3. Your Opteron 1100 cfg-file timings look a lot like Tom Womack's for A57 in post 114 on page 11 of this thread, if we extrapolate backward from his 4-threaded timings to your 1-threaded ones. Since both of your boards are octocores, can you also run the timings 4-threaded (./mlucas -s m -cpu 0:3) and 8-threaded (./mlucas -s m -cpu 0:7) on both systems and post the resulting cfg-file data? (Please wrap them in code blocks like I did for readability.)
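
The two command lines above follow a simple pattern: `-cpu 0:N-1` for an N-threaded run on cores 0 through N-1. A minimal Python sketch of that pattern is below; the helper name `selftest_command` is hypothetical, and it only builds the argument list rather than running anything. For the single-threaded case it omits `-cpu`, matching the `./mlucas -s m` invocation used earlier in the thread.

```python
def selftest_command(n_threads, binary="./mlucas"):
    """Build the medium self-test command line for n_threads threads."""
    if n_threads < 1:
        raise ValueError("need at least one thread")
    cmd = [binary, "-s", "m"]
    if n_threads > 1:
        # Core range lo:hi, e.g. 0:3 for 4 threads, 0:7 for 8 threads.
        cmd += ["-cpu", f"0:{n_threads - 1}"]
    return cmd


for n in (1, 4, 8):
    print(" ".join(selftest_command(n)))
```

If the runs are scripted (for instance by handing each list to `subprocess.run`), it seems prudent to copy mlucas.cfg aside after each one, since this thread does not say whether a later self-test appends to or overwrites the file.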
Old 2017-11-18, 07:44   #127
ldesnogu
 
Jan 2008
France

3×199 Posts

Opteron 1100 is an 8 core Cortex-A57 (and likely what Tom used). X-Gene uses a custom core.

Both would cost you a lot if you can find them I'm afraid.

gcc113 and gcc118 are the names of the machines in the gcc build farm that rocky linked in his message.
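
The /proc/cpuinfo fields rocky posted are consistent with this. Below is a minimal Python sketch that decodes the `CPU implementer` and `CPU part` fields. The two lookup tables are a small hand-copied subset of the ARM ID assignments as I recall them (0x41 for ARM, 0x50 for Applied Micro), so treat them as assumptions rather than an authoritative list; the function name `identify_core` is hypothetical.

```python
# Partial tables; assumptions based on the commonly published ARM IDs.
IMPLEMENTERS = {0x41: "ARM", 0x50: "Applied Micro (APM)"}
PARTS = {
    (0x41, 0xD03): "Cortex-A53",
    (0x41, 0xD07): "Cortex-A57",
    (0x41, 0xD08): "Cortex-A72",
    (0x50, 0x000): "X-Gene",
}


def identify_core(cpuinfo_text):
    """Return (implementer, core) names from /proc/cpuinfo-style text."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    impl = int(fields["CPU implementer"], 16)
    part = int(fields["CPU part"], 16)
    return (IMPLEMENTERS.get(impl, hex(impl)),
            PARTS.get((impl, part), "unknown part"))


gcc118 = """processor : 7 [nid: 0]
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x1
CPU part : 0xd07
CPU revision : 2"""

gcc113 = """CPU implementer : 0x50
CPU architecture: AArch64
CPU variant : 0x0
CPU part : 0x000
CPU revision : 1"""

print("gcc118:", identify_core(gcc118))
print("gcc113:", identify_core(gcc113))
```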
Old 2017-11-18, 22:25   #128
ewmayer
2ω=0
 
Sep 2002
República de California

2²·2,939 Posts

Quote:
Originally Posted by ldesnogu
Opteron 1100 is an 8 core Cortex-A57 (and likely what Tom used). X-Gene uses a custom core.

Both would cost you a lot if you can find them I'm afraid.

gcc113 and gcc118 are the names of the machines in the gcc build farm that rocky linked in his message.
Thanks Laurent - looks like what we need is to convince an outfit like Hardkernel to put out an A57-based version of their Odroids. The A57 has been out for enough years that I'm rather surprised at the lack of cheap micro-PC boards based on it.
Old 2017-11-19, 12:29   #129
ET_
Banned
 
"Luigi"
Aug 2002
Team Italia

4865₁₀ Posts
Mlucas on Raspberry Pi 3 (64-bit)

Here is my testbed:

Code:
Linux pi64 4.10.0-rc5-v8 #1 SMP PREEMPT Wed Jan 25 20:13:50 GMT 2017 aarch64 GNU/Linux

grep asimd /proc/cpuinfo 
Features	: fp asimd evtstrm crc32
Features	: fp asimd evtstrm crc32
Features	: fp asimd evtstrm crc32
Features	: fp asimd evtstrm crc32

gcc version 5.4.0 (Gentoo 5.4.0-r2 p1.2, pie-0.6.5)
and the mlucas.cfg file (4 threads):

Code:
17.1
      1024  msec/iter =   65.15  ROE[avg,max] = [0.254687500, 0.312500000]  radices = 256  8 16 16  0  0  0  0  0  0
      1152  msec/iter =   75.04  ROE[avg,max] = [0.223256138, 0.281250000]  radices = 144 16 16 16  0  0  0  0  0  0
      1280  msec/iter =   81.41  ROE[avg,max] = [0.264508929, 0.343750000]  radices = 160 16 16 16  0  0  0  0  0  0
      1408  msec/iter =   94.69  ROE[avg,max] = [0.227343750, 0.265625000]  radices = 176 16 16 16  0  0  0  0  0  0
      1536  msec/iter =  103.01  ROE[avg,max] = [0.254241071, 0.312500000]  radices = 192 16 16 16  0  0  0  0  0  0
      1664  msec/iter =  114.01  ROE[avg,max] = [0.270758929, 0.312500000]  radices = 208 16 16 16  0  0  0  0  0  0
      1792  msec/iter =  123.92  ROE[avg,max] = [0.220532663, 0.250000000]  radices = 224 16 16 16  0  0  0  0  0  0
      1920  msec/iter =  134.22  ROE[avg,max] = [0.257756696, 0.312500000]  radices = 240 16 16 16  0  0  0  0  0  0
      2048  msec/iter =  143.26  ROE[avg,max] = [0.236921038, 0.281250000]  radices = 256 16 16 16  0  0  0  0  0  0
      2304  msec/iter =  165.91  ROE[avg,max] = [0.248751395, 0.312500000]  radices = 288 16 16 16  0  0  0  0  0  0
      2560  msec/iter =  187.56  ROE[avg,max] = [0.236908831, 0.312500000]  radices = 160 32 16 16  0  0  0  0  0  0
      2816  msec/iter =  215.95  ROE[avg,max] = [0.262500000, 0.312500000]  radices = 176 32 16 16  0  0  0  0  0  0
      3072  msec/iter =  234.68  ROE[avg,max] = [0.262111119, 0.312500000]  radices = 192 32 16 16  0  0  0  0  0  0
      3328  msec/iter =  259.13  ROE[avg,max] = [0.281250000, 0.375000000]  radices = 208 32 16 16  0  0  0  0  0  0
      3584  msec/iter =  282.45  ROE[avg,max] = [0.252343750, 0.312500000]  radices = 224 32 16 16  0  0  0  0  0  0
      3840  msec/iter =  306.40  ROE[avg,max] = [0.248437500, 0.343750000]  radices = 240 32 16 16  0  0  0  0  0  0
      4096  msec/iter =  326.20  ROE[avg,max] = [0.228655134, 0.281250000]  radices = 256 32 16 16  0  0  0  0  0  0
      4608  msec/iter =  378.54  ROE[avg,max] = [0.251339286, 0.281250000]  radices = 288 32 16 16  0  0  0  0  0  0
      5120  msec/iter =  417.17  ROE[avg,max] = [0.237137277, 0.281250000]  radices = 160 32 32 16  0  0  0  0  0  0
      5632  msec/iter =  479.39  ROE[avg,max] = [0.256919643, 0.312500000]  radices = 176 32 32 16  0  0  0  0  0  0
      6144  msec/iter =  527.16  ROE[avg,max] = [0.246651786, 0.281250000]  radices = 192 32 32 16  0  0  0  0  0  0
      6656  msec/iter =  581.98  ROE[avg,max] = [0.262500000, 0.312500000]  radices = 208 32 32 16  0  0  0  0  0  0
      7168  msec/iter =  635.62  ROE[avg,max] = [0.224874442, 0.281250000]  radices = 224 32 32 16  0  0  0  0  0  0
      7680  msec/iter =  691.29  ROE[avg,max] = [0.237053571, 0.281250000]  radices = 240 32 32 16  0  0  0  0  0  0
Tests on the Odroids are on the way.

Luigi

Last fiddled with by ET_ on 2017-11-19 at 12:30
Old 2017-11-19, 16:57   #130
rocky
 
Nov 2017

2 Posts

Quote:
Originally Posted by ewmayer
Thanks for the build/system info and timings, Rocky - I wrapped your cfg-file data in code flags for better readability.

Some questions:

3. Your Opteron 1100 cfg-file timings look a lot like Tom Womack's for A57 in post 114 on page 11 of this thread, if we extrapolate backward from his 4-threaded timings to your 1-threaded ones. Since both of your boards are octocores, can you also run the timings 4-threaded (./mlucas -s m -cpu 0:3) and 8-threaded (./mlucas -s m -cpu 0:7) on both systems and post the resulting cfg-file data? (Please wrap them in code blocks like I did for readability.)
I recompiled with -DUSE_THREADS and ran with 4 and 8 cpus. Here are the results:

APM X-Gene Mustang board ./Mlucas -s m -cpu 0:7 (8 cpus):
Code:
17.1
      1024  msec/iter =   10.23  ROE[avg,max] = [0.231349041, 0.296875000]  radices =  64 16 16 32  0  0  0  0  0  0
      1152  msec/iter =   12.36  ROE[avg,max] = [0.223343084, 0.312500000]  radices = 288  8 16 16  0  0  0  0  0  0
      1280  msec/iter =   14.86  ROE[avg,max] = [0.264203447, 0.375000000]  radices = 160 16 16 16  0  0  0  0  0  0
      1408  msec/iter =   17.70  ROE[avg,max] = [0.228616585, 0.312500000]  radices = 176 16 16 16  0  0  0  0  0  0
      1536  msec/iter =   20.17  ROE[avg,max] = [0.271927352, 0.375000000]  radices =  48 16 32 32  0  0  0  0  0  0
      1664  msec/iter =   22.53  ROE[avg,max] = [0.272265625, 0.406250000]  radices = 208 16 16 16  0  0  0  0  0  0
      1792  msec/iter =   24.26  ROE[avg,max] = [0.222731285, 0.312500000]  radices = 224 16 16 16  0  0  0  0  0  0
      1920  msec/iter =   26.30  ROE[avg,max] = [0.255133245, 0.375000000]  radices = 240 16 16 16  0  0  0  0  0  0
      2048  msec/iter =   27.92  ROE[avg,max] = [0.312242268, 0.406250000]  radices = 128 16 16 32  0  0  0  0  0  0
      2304  msec/iter =   32.79  ROE[avg,max] = [0.249449173, 0.312500000]  radices = 288 16 16 16  0  0  0  0  0  0
      2560  msec/iter =   36.00  ROE[avg,max] = [0.233106476, 0.312500000]  radices = 160 32 16 16  0  0  0  0  0  0
      2816  msec/iter =   40.67  ROE[avg,max] = [0.260065641, 0.375000000]  radices = 176 16 16 32  0  0  0  0  0  0
      3072  msec/iter =   44.44  ROE[avg,max] = [0.266442494, 0.375000000]  radices = 192 16 16 32  0  0  0  0  0  0
      3328  msec/iter =   48.37  ROE[avg,max] = [0.280374114, 0.375000000]  radices = 208 16 16 32  0  0  0  0  0  0
      3584  msec/iter =   52.39  ROE[avg,max] = [0.254961340, 0.312500000]  radices = 224 16 16 32  0  0  0  0  0  0
      3840  msec/iter =   56.34  ROE[avg,max] = [0.247425197, 0.343750000]  radices = 240 16 16 32  0  0  0  0  0  0
      4096  msec/iter =   59.62  ROE[avg,max] = [0.227451789, 0.281250000]  radices = 256 16 16 32  0  0  0  0  0  0
      4608  msec/iter =   68.25  ROE[avg,max] = [0.248973603, 0.312500000]  radices = 288 16 16 32  0  0  0  0  0  0
      5120  msec/iter =   77.27  ROE[avg,max] = [0.234943193, 0.312500000]  radices = 160 16 32 32  0  0  0  0  0  0
      5632  msec/iter =   88.35  ROE[avg,max] = [0.261650290, 0.343750000]  radices = 176 16 32 32  0  0  0  0  0  0
      6144  msec/iter =   98.88  ROE[avg,max] = [0.245978727, 0.343750000]  radices = 192 32 32 16  0  0  0  0  0  0
      6656  msec/iter =  107.94  ROE[avg,max] = [0.268344274, 0.375000000]  radices = 208 16 32 32  0  0  0  0  0  0
      7168  msec/iter =  115.25  ROE[avg,max] = [0.230832822, 0.312500000]  radices = 224 16 32 32  0  0  0  0  0  0
      7680  msec/iter =  124.74  ROE[avg,max] = [0.241652278, 0.343750000]  radices = 240 16 32 32  0  0  0  0  0  0
AMD Opteron 1100 8 cpus:
Code:
      1024  msec/iter =    6.91  ROE[avg,max] = [0.241764659, 0.343750000]  radices =  32 32 32 16  0  0  0  0  0  0
      1152  msec/iter =    8.74  ROE[avg,max] = [0.223343084, 0.312500000]  radices = 288  8 16 16  0  0  0  0  0  0
      1280  msec/iter =    9.55  ROE[avg,max] = [0.264203447, 0.375000000]  radices = 160 16 16 16  0  0  0  0  0  0
      1408  msec/iter =   11.00  ROE[avg,max] = [0.228616585, 0.312500000]  radices = 176 16 16 16  0  0  0  0  0  0
      1536  msec/iter =   11.50  ROE[avg,max] = [0.271927352, 0.375000000]  radices =  48 16 32 32  0  0  0  0  0  0
      1664  msec/iter =   13.25  ROE[avg,max] = [0.272265625, 0.406250000]  radices = 208 16 16 16  0  0  0  0  0  0
      1792  msec/iter =   14.53  ROE[avg,max] = [0.222731285, 0.312500000]  radices = 224 16 16 16  0  0  0  0  0  0
      1920  msec/iter =   15.95  ROE[avg,max] = [0.255133245, 0.375000000]  radices = 240 16 16 16  0  0  0  0  0  0
      2048  msec/iter =   16.46  ROE[avg,max] = [0.312242268, 0.406250000]  radices = 128 16 16 32  0  0  0  0  0  0
      2304  msec/iter =   19.40  ROE[avg,max] = [0.270892397, 0.375000000]  radices = 144 32 16 16  0  0  0  0  0  0
      2560  msec/iter =   21.23  ROE[avg,max] = [0.236825121, 0.312500000]  radices = 160 16 16 32  0  0  0  0  0  0
      2816  msec/iter =   24.05  ROE[avg,max] = [0.225605129, 0.281250000]  radices = 176  8  8  8 16  0  0  0  0  0
      3072  msec/iter =   26.94  ROE[avg,max] = [0.248841087, 0.312500000]  radices = 192  8  8  8 16  0  0  0  0  0
      3328  msec/iter =   28.65  ROE[avg,max] = [0.231577450, 0.312500000]  radices = 208  8  8  8 16  0  0  0  0  0
      3584  msec/iter =   31.00  ROE[avg,max] = [0.254961340, 0.312500000]  radices = 224 16 16 32  0  0  0  0  0  0
      3840  msec/iter =   33.57  ROE[avg,max] = [0.247425197, 0.343750000]  radices = 240 16 16 32  0  0  0  0  0  0
      4096  msec/iter =   35.65  ROE[avg,max] = [0.227451789, 0.281250000]  radices = 256 16 16 32  0  0  0  0  0  0
      4608  msec/iter =   40.67  ROE[avg,max] = [0.228279161, 0.312500000]  radices = 288  8  8  8 16  0  0  0  0  0
      5120  msec/iter =   46.35  ROE[avg,max] = [0.234943193, 0.312500000]  radices = 160 16 32 32  0  0  0  0  0  0
      5632  msec/iter =   51.91  ROE[avg,max] = [0.259536082, 0.343750000]  radices = 176 32 32 16  0  0  0  0  0  0
      6144  msec/iter =   58.14  ROE[avg,max] = [0.245978727, 0.343750000]  radices = 192 32 32 16  0  0  0  0  0  0
      6656  msec/iter =   61.66  ROE[avg,max] = [0.266108247, 0.375000000]  radices = 208 32 32 16  0  0  0  0  0  0
      7168  msec/iter =   66.42  ROE[avg,max] = [0.225737707, 0.312500000]  radices = 224 32 32 16  0  0  0  0  0  0
      7680  msec/iter =   71.11  ROE[avg,max] = [0.236637429, 0.312500000]  radices = 240 32 32 16  0  0  0  0  0  0
AMD Opteron 1100 4 cpus:
Code:
      1024  msec/iter =   13.10  ROE[avg,max] = [0.241406250, 0.312500000]  radices =  32 32 32 16  0  0  0  0  0  0
      1152  msec/iter =   16.24  ROE[avg,max] = [0.221044922, 0.250000000]  radices = 288  8 16 16  0  0  0  0  0  0
      1280  msec/iter =   17.22  ROE[avg,max] = [0.250167411, 0.312500000]  radices =  40 16 32 32  0  0  0  0  0  0
      1408  msec/iter =   21.08  ROE[avg,max] = [0.227343750, 0.265625000]  radices = 176 16 16 16  0  0  0  0  0  0
      1536  msec/iter =   21.65  ROE[avg,max] = [0.272656250, 0.312500000]  radices =  48 16 32 32  0  0  0  0  0  0
      1664  msec/iter =   25.20  ROE[avg,max] = [0.270758929, 0.312500000]  radices = 208 16 16 16  0  0  0  0  0  0
      1792  msec/iter =   26.09  ROE[avg,max] = [0.230078125, 0.281250000]  radices =  56 16 32 32  0  0  0  0  0  0
      1920  msec/iter =   29.81  ROE[avg,max] = [0.257756696, 0.312500000]  radices = 240 16 16 16  0  0  0  0  0  0
      2048  msec/iter =   30.09  ROE[avg,max] = [0.228236607, 0.281250000]  radices =  32 32 32 32  0  0  0  0  0  0
      2304  msec/iter =   35.97  ROE[avg,max] = [0.272405134, 0.343750000]  radices = 144 16 16 32  0  0  0  0  0  0
      2560  msec/iter =   39.08  ROE[avg,max] = [0.245999581, 0.312500000]  radices =  40 32 32 32  0  0  0  0  0  0
      2816  msec/iter =   44.60  ROE[avg,max] = [0.260044643, 0.312500000]  radices = 176 16 16 32  0  0  0  0  0  0
      3072  msec/iter =   48.20  ROE[avg,max] = [0.224818638, 0.251953125]  radices =  48 32 32 32  0  0  0  0  0  0
      3328  msec/iter =   52.66  ROE[avg,max] = [0.279017857, 0.343750000]  radices = 208 16 16 32  0  0  0  0  0  0
      3584  msec/iter =   56.96  ROE[avg,max] = [0.252566964, 0.312500000]  radices = 224 16 16 32  0  0  0  0  0  0
      3840  msec/iter =   61.36  ROE[avg,max] = [0.249302455, 0.343750000]  radices = 240 16 16 32  0  0  0  0  0  0
      4096  msec/iter =   64.82  ROE[avg,max] = [0.229129464, 0.281250000]  radices = 256 16 16 32  0  0  0  0  0  0
      4608  msec/iter =   74.32  ROE[avg,max] = [0.249079241, 0.281250000]  radices = 288 16 16 32  0  0  0  0  0  0
      5120  msec/iter =   81.62  ROE[avg,max] = [0.237137277, 0.281250000]  radices = 160 32 32 16  0  0  0  0  0  0
      5632  msec/iter =   92.76  ROE[avg,max] = [0.256919643, 0.312500000]  radices = 176 32 32 16  0  0  0  0  0  0
      6144  msec/iter =  102.56  ROE[avg,max] = [0.246651786, 0.281250000]  radices = 192 32 32 16  0  0  0  0  0  0
      6656  msec/iter =  109.54  ROE[avg,max] = [0.262500000, 0.312500000]  radices = 208 32 32 16  0  0  0  0  0  0
      7168  msec/iter =  117.78  ROE[avg,max] = [0.224874442, 0.281250000]  radices = 224 32 32 16  0  0  0  0  0  0
      7680  msec/iter =  126.96  ROE[avg,max] = [0.237053571, 0.281250000]  radices = 240 32 32 16  0  0  0  0  0  0
Old 2017-11-20, 00:03   #131
ewmayer
2ω=0
 
Sep 2002
República de California

11756₁₀ Posts

Quote:
Originally Posted by ET_
Here is my testbed:

[code]
Linux pi64 4.10.0-rc5-v8 #1 SMP PREEMPT Wed Jan 25 20:13:50 GMT 2017 aarch64 GNU/Linux
[snip]
Luigi, your Pi3 timings indicate around 2/3 the speed I get on my Odroid C2, 4-threaded. Is that roughly what you would expect based on the respective hardware profiles?

Rocky, thanks for the multithreaded timings, here are the resulting speedup-factor tables:
X-Gene:
Code:
FFT(K)	1-thr:	8-thr:	speedup
1024	 70.94	 10.23	6.93x
1152	 81.80	 12.36	6.62x
1280	 91.57	 14.86	6.16x
1408	107.83	 17.70	6.09x
1536	114.63	 20.17	5.68x
1664	127.96	 22.53	5.68x
1792	137.40	 24.26	5.66x
1920	146.12	 26.30	5.56x
2048	155.02	 27.92	5.55x
2304	177.40	 32.79	5.41x
2560	194.36	 36.00	5.40x
2816	225.86	 40.67	5.55x
3072	243.38	 44.44	5.48x
3328	266.68	 48.37	5.51x
3584	285.43	 52.39	5.45x
3840	303.83	 56.34	5.39x
4096	322.17	 59.62	5.40x
4608	366.85	 68.25	5.38x
5120	414.81	 77.27	5.37x
5632	480.58	 88.35	5.44x
6144	518.04	 98.88	5.24x
6656	568.81	107.94	5.27x
7168	610.64	115.25	5.30x
7680	650.18	124.74	5.21x
Opteron 1100:
Code:
FFT(K)	1-thr:	4-thr:	speedup	8-thr:	speedup
1024	 49.45	 13.10	3.77x	 6.91	7.16x
1152	 60.75	 16.24	3.74x	 8.74	6.95x
1280	 66.72	 17.22	3.87x	 9.55	6.99x
1408	 78.04	 21.08	3.70x	11.00	7.09x
1536	 80.59	 21.65	3.72x	11.50	7.01x
1664	 94.73	 25.20	3.76x	13.25	7.15x
1792	 97.68	 26.09	3.74x	14.53	6.72x
1920	110.56	 29.81	3.71x	15.95	6.93x
2048	109.44	 30.09	3.64x	16.46	6.65x
2304	132.20	 35.97	3.68x	19.40	6.81x
2560	146.44	 39.08	3.75x	21.23	6.90x
2816	167.99	 44.60	3.77x	24.05	6.99x
3072	175.99	 48.20	3.65x	26.94	6.53x
3328	197.09	 52.66	3.74x	28.65	6.88x
3584	206.07	 56.96	3.62x	31.00	6.65x
3840	229.05	 61.36	3.73x	33.57	6.82x
4096	240.88	 64.82	3.72x	35.65	6.76x
4608	276.55	 74.32	3.72x	40.67	6.80x
5120	300.46	 81.62	3.68x	46.35	6.48x
5632	342.46	 92.76	3.69x	51.91	6.60x
6144	380.17	102.56	3.72x	58.14	6.54x
6656	403.75	109.54	3.69x	61.66	6.55x
7168	433.80	117.78	3.68x	66.42	6.53x
7680	468.23	126.96	3.69x	71.11	6.58x
So both platforms show a ~7x gain from 8 threads at the smaller FFT lengths, but X-Gene shows significant degradation of parallel scaling as the FFT lengths increase.
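
The speedup column in these tables is simply the 1-thread time divided by the N-thread time; dividing that by N gives a parallel efficiency that makes the falloff easier to see. A minimal Python sketch follows, using the first and last rows of each table above. The function name `scaling` is hypothetical, and the efficiency figure is an added convenience, not something from the tables.

```python
def scaling(t_single, t_multi, n_threads):
    """Return (speedup, parallel efficiency) from two msec/iter timings."""
    speedup = t_single / t_multi
    return speedup, speedup / n_threads


# (platform, FFT length, 1-thread msec/iter, 8-thread msec/iter)
rows = [
    ("X-Gene", 1024, 70.94, 10.23),
    ("X-Gene", 7680, 650.18, 124.74),
    ("Opteron 1100", 1024, 49.45, 6.91),
    ("Opteron 1100", 7680, 468.23, 71.11),
]

for name, fft, t1, t8 in rows:
    speedup, efficiency = scaling(t1, t8, 8)
    print(f"{name:13s} {fft:5d}  {speedup:.2f}x  efficiency {efficiency:.0%}")
```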
Old 2017-11-20, 15:55   #132
ET_
Banned
 
"Luigi"
Aug 2002
Team Italia

5×7×139 Posts

Quote:
Originally Posted by ewmayer
Luigi, your Pi3 timings indicate around 2/3 the speed I get on my Odroid C2, 4-threaded. Is that roughly what you would expect based on the respective hardware profiles?
Yes. Slower clock, (s)lower memory and no cooling surfaces (as well as tests with other software) put the PI at 60% - 70% of the Odroid.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
