
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Mlucas (https://www.mersenneforum.org/forumdisplay.php?f=118)
-   -   Mlucas v18 available (https://www.mersenneforum.org/showthread.php?t=24100)

Lorenzo 2019-03-29 10:07

Hello! Benchmark for v18 on [B]Ampere eMAG 32-Core @ 3.3GHz[/B] using pre-built Mlucas_v18_c2simd.

[CODE]root@lorenzoArm:~/mersenne/arm8# lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 1
NUMA node(s): 1
CPU max MHz: 3300.0000
CPU min MHz: 363.9700
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
NUMA node0 CPU(s): 0-31[/CODE]

[CODE]root@lorenzoArm:~/mersenne/arm8# cat /proc/cpuinfo
processor : 0
BogoMIPS : 90.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x50
CPU architecture: 8
CPU variant : 0x3
CPU part : 0x000
CPU revision : 2
[/CODE]

[CODE]root@lorenzoArm:~/mersenne/arm8# ./Mlucas_v18_c2simd -s m -cpu 0:31
root@lorenzoArm:~/mersenne/arm8# cat mlucas.cfg
18.0
2048 msec/iter = 19.63 ROE[avg,max] = [0.000307249, 0.375000000] radices = 128 32 16 16 0 0 0 0 0 0
2304 msec/iter = 19.88 ROE[avg,max] = [0.000272423, 0.375000000] radices = 144 32 16 16 0 0 0 0 0 0
2560 msec/iter = 22.07 ROE[avg,max] = [0.000281943, 0.375000000] radices = 160 8 8 8 16 0 0 0 0 0
2816 msec/iter = 22.07 ROE[avg,max] = [0.000260572, 0.312500000] radices = 176 16 16 32 0 0 0 0 0 0
3072 msec/iter = 22.24 ROE[avg,max] = [0.000265834, 0.375000000] radices = 192 16 16 32 0 0 0 0 0 0
3328 msec/iter = 23.63 ROE[avg,max] = [0.000281118, 0.375000000] radices = 208 16 16 32 0 0 0 0 0 0
3584 msec/iter = 25.02 ROE[avg,max] = [0.000250660, 0.343750000] radices = 224 32 16 16 0 0 0 0 0 0
3840 msec/iter = 26.60 ROE[avg,max] = [0.000222911, 0.312500000] radices = 60 32 32 32 0 0 0 0 0 0
4096 msec/iter = 25.42 ROE[avg,max] = [0.000244299, 0.312500000] radices = 64 32 32 32 0 0 0 0 0 0
4608 msec/iter = 30.73 ROE[avg,max] = [0.000298148, 0.375000000] radices = 144 8 8 16 16 0 0 0 0 0
5120 msec/iter = 31.50 ROE[avg,max] = [0.000235369, 0.312500000] radices = 160 32 32 16 0 0 0 0 0 0
5632 msec/iter = 33.74 ROE[avg,max] = [0.000257523, 0.343750000] radices = 176 32 32 16 0 0 0 0 0 0
6144 msec/iter = 36.94 ROE[avg,max] = [0.000247058, 0.312500000] radices = 192 32 32 16 0 0 0 0 0 0
6656 msec/iter = 36.74 ROE[avg,max] = [0.000313628, 0.406250000] radices = 208 8 8 16 16 0 0 0 0 0
7168 msec/iter = 36.94 ROE[avg,max] = [0.000233152, 0.312500000] radices = 224 8 8 16 16 0 0 0 0 0
7680 msec/iter = 36.94 ROE[avg,max] = [0.000246354, 0.312500000] radices = 240 8 8 16 16 0 0 0 0 0
[/CODE]

Lorenzo 2019-03-29 10:10

Just FYI
[CODE]root@lorenzoArm:~/mersenne/arm8# ./Mlucas_v18_c2simd -fftlen 18432 -iters 100 -cpu 0:7

Mlucas 18.0

http://www.mersenneforum.org/mayer/README.html

INFO: testing qfloat routines...
CPU Family = ARM Embedded ABI, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 5.4.0 20160609.
INFO: Build uses ARMv8 advanced-SIMD instruction set.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: MLUCAS_PATH is set to ""
INFO: using 53-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
Setting DAT_BITS = 10, PAD_BITS = 2
INFO: testing IMUL routines...
INFO: System has 32 available processor cores.
INFO: testing FFT radix tables...
Set affinity for the following 8 cores: 0.1.2.3.4.5.6.7.

Mlucas selftest running.....

/****************************************************************************/

NTHREADS = 8
M337615261: using FFT length 18432K = 18874368 8-byte floats, initial residue shift count = 49407158
this gives an average 17.887500180138481 bits per digit
Using complex FFT radices 288 32 32 32
mers_mod_square: Init threadpool of 8 threads
radix16_dif_dit_pass pfetch_dist = 32
radix16_wrapper_square: pfetch_dist = 1024
Using 8 threads in carry step
100 iterations of M337615261 with FFT length 18874368 = 18432 K, final residue shift count = 321038982
Res64: 69FF742497F16902. AvgMaxErr = 0.003191964. MaxErr = 0.375000000. Program: E18.0
Res mod 2^36 = 19729049858
Res mod 2^35 - 1 = 20161851329
Res mod 2^36 - 1 = 1044285462
Clocks = 00:00:21.067

NTHREADS = 8
M337615261: using FFT length 18432K = 18874368 8-byte floats, initial residue shift count = 321038982
this gives an average 17.887500180138481 bits per digit
Using complex FFT radices 144 16 16 16 16
mers_mod_square: Init threadpool of 8 threads
Using 8 threads in carry step
100 iterations of M337615261 with FFT length 18874368 = 18432 K, final residue shift count = 171176556
Res64: 2258A7342961B652. AvgMaxErr = 0.002428013. MaxErr = 0.281250000. Program: E18.0
Res mod 2^36 = 17874138706
Res mod 2^35 - 1 = 28069471175
Res mod 2^36 - 1 = 53816329185
Clocks = 00:00:21.009
NTHREADS = 8
[B][COLOR="Red"]ERROR: at line 1540 of file ../src/Mlucas.c
Assertion failed: Return value of shift_word(): unpadded-array-index out of range![/COLOR][/B][/CODE]

ewmayer 2019-03-29 19:27

Thanks, Lorenzo: Could you also try the '-s m' tests using just -cpu 0:3 on that 32-core system and post the resulting cfg-file here? I'd like to see what kind of degradation of parallelism results from using more than one socket on that system.

I will look into the residue-shift assertion issue you hit in your 18432K FFT length test - first I need to see if I can reproduce it on any of the hardware I have. By way of a workaround, '-shift 0' should bypass all the new residue-shift code and allow you to get a best-radix-set timing at 18432K.
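In concrete terms, something like the following two runs (an untested sketch using the same binary and flags already shown in this thread; adjust the -cpu ranges to taste):

[CODE]# 4-core self-test, for comparison against the 32-core cfg-file above:
./Mlucas_v18_c2simd -s m -iters 100 -cpu 0:3

# 18432K timing with the residue-shift code bypassed:
./Mlucas_v18_c2simd -fftlen 18432 -iters 100 -shift 0 -cpu 0:7[/CODE]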

[b]Edit:[/b] I was able to reproduce the assertion on my 2-core Macbook using an x86 SSE2 build, so the bug appears to be code-logic-related rather than anything platform specific.

Lorenzo 2019-03-30 07:51

[QUOTE=ewmayer;512164]Thanks, Lorenzo: Could you also try the '-s m' tests using just -cpu 0:3 on that 32-core system and post the resulting cfg-file here? I'd like to see what kind of degradation of parallelism results from using more than one socket on that system.

I will look into the residue-shift assertion issue you hit in your 18432K FFT length test - first I need to see if I can reproduce it on any of the hardware I have. By way of a workaround, '-shift 0' should bypass all the new residue-shift code and allow you to get a best-radix-set timing at 18432K.

[b]Edit:[/b] I was able to reproduce the assertion on my 2-core Macbook using an x86 SSE2 build, so the bug appears to be code-logic-related rather than anything platform specific.[/QUOTE]

Hello, ewmayer! Sorry, but unfortunately I no longer have access to this machine.

kriesel 2019-03-30 19:18

[QUOTE=Lorenzo;512109]Hello! Benchmark for v18 on [B]Ampere eMAG 32-Core @ 3.3GHz[/B] using pre-built Mlucas_v18_c2simd.
[CODE]
4608 msec/iter = 30.73 ROE[avg,max] = [0.000298148, 0.375000000] radices =
[/CODE][/QUOTE]
Yikes: 717 hours, so at a nominal $1/hour, that works out to over $700 per 84M-exponent primality test at [URL]https://www.packet.com/cloud/servers/[/URL]. It's triple the speed of Ernst's Samsung S7 phone, but at far higher (~83x) cost there. I've bought whole used workstations capable of 10+ times the 30.73 ms/iter speed for the price of one exponent at packet.com at that rate. (The $0.25/hr spot rate helps, but not nearly enough.)
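Spelled out, the rough arithmetic behind that figure (assuming roughly one squaring iteration per bit of an ~84M exponent):
[CODE]0.03073 s/iter * 84,000,000 iterations ~= 2,581,000 s ~= 717 hours
717 hours * $1/hour ~= $717 per exponent[/CODE]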

ewmayer 2019-03-30 19:29

[QUOTE=Lorenzo;512191]Hello, ewmayer! Sorry, but unfortunately I no longer have access to this machine.[/QUOTE]

OK - for future reference, on a 'typical' system with 1 or more sockets, each socket holding a 4-core CPU, I like to see the following timing tests:

1. All 4 cores on 1 socket: '-s m -iters 100 -cpu 0:3'

2. If there are differences between the CPUs on various sockets (use /proc/cpuinfo as your guide here), run the same self-tests on each distinct-CPU-type socket. If it's e.g. a BIG socket with a high-perf CPU having just 2 cores, fiddle the -cpu args to use just those 2 cores;

3. All cores across all sockets: Like the 32-core test you did above;

4. One program instance per socket: This can get tricky in self-test mode if the runspeed varies appreciably between sockets. Better is to create a rundir for each socket, e.g. run0-run7 on an 8-socket 32-core system, copy the mlucas.cfg file from your -cpu 0:3 self-test to each rundir, create a worktodo.ini file containing one exponent of the size range of interest (you can use a single-shot invocation of the primenet.py script to grab such an assignment), and copy that to each rundir as well. Then cd to run0 and fire up a production run using -cpu 0:3; let that get to the first 10000-iter checkpoint (you will see a pair of p|q-named binary savefiles get created, and the p*.stat file updated with a checkpoint entry), which gives you a production-run timing for 1 socket. At that point cd to each of the other rundirs in turn and start up an instance in each. Let those runs get through a couple of checkpoints, average the last-line-of-statfile timings, and compare that average to the 1-socket timing. (See the setup sketch below.)
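A rough, untested sketch of that setup, assuming the hypothetical 8-socket/32-core layout above with cores 4i..4i+3 on socket i, and with mlucas.cfg plus a one-line worktodo.ini already in the current directory:

[CODE]# create one rundir per socket and seed it with the cfg and worktodo files
for i in 0 1 2 3 4 5 6 7; do
  mkdir run$i
  cp mlucas.cfg worktodo.ini run$i/
done
# start socket 0 first and let it reach its first 10000-iter checkpoint
cd run0;   nohup ../Mlucas_v18_c2simd -cpu 0:3 > run.log 2>&1 &
# ...then bring up the remaining sockets, one instance per socket
cd ../run1; nohup ../Mlucas_v18_c2simd -cpu 4:7 > run.log 2>&1 &
# (and so on, through run7 with -cpu 28:31)[/CODE]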

I have found and fixed the bug your 18432K self-test exposed; I will post an update on that once I finish creating new ARM binaries from the updated source tarball and uploading them to the server.

ewmayer 2019-03-30 19:39

[QUOTE=kriesel;512248]Yikes: 717 hours, so at a nominal $1/hour, that works out to over $700 per 84M-exponent primality test at [URL]https://www.packet.com/cloud/servers/[/URL]. It's triple the speed of Ernst's Samsung S7 phone, but at far higher (~83x) cost there. I've bought whole used workstations capable of 10+ times the 30.73 ms/iter speed for the price of one exponent at packet.com at that rate. (The $0.25/hr spot rate helps, but not nearly enough.)[/QUOTE]

I suspect the total throughput for that system would be several times greater using one instance per 4-core socket; that's why I asked Lorenzo if he could provide that timing. You get a hint of throughput loss due to too-many-threads-for-one-job from the cfg-file timings he posted: The 32-thread timing @7680K is less than 2x greater than that for 2048K. Larger FFT lengths tend to be more parallelizable than smaller ones because at the same threadcount the work units done by each thread are proportionally larger, resulting in that sort of timing pattern. But I'm sure even in optimum-usage mode such a system would be a lot more expensive than a cellphone compute node - reminiscent of the difference between the big-iron AWS-instance runs we use to verify new prime discoveries, as compared to a $/FLOP-optimized low-end retail Intel rig.
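Concretely, pulling the two best-timing lines from the cfg-file above:
[CODE]2048K: 19.63 msec/iter
7680K: 36.94 msec/iter   (3.75x the FFT length, but only ~1.88x the per-iteration time)[/CODE]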

But manycore tests are always interesting because we hope to see signs of one manufacturer or another achieving a breakthrough in parallelism. Though in that regard, nearly all the action the past 5 years has been on the GPU side of the ledger.

kriesel 2019-03-31 00:21

[QUOTE=ewmayer;512250]I suspect the total throughput for that system would be several times greater using one instance per 4-core socket; that's why I asked Lorenzo if he could provide that timing. You get a hint of throughput loss due to too-many-threads-for-one-job from the cfg-file timings he posted: The 32-thread timing @7680K is less than 2x greater than that for 2048K. Larger FFT lengths tend to be more parallelizable than smaller ones because at the same threadcount the work units done by each thread are proportionally larger, resulting in that sort of timing pattern. But I'm sure even in optimum-usage mode such a system would be a lot more expensive than a cellphone compute node - reminiscent of the difference between the big-iron AWS-instance runs we use to verify new prime discoveries, as compared to a $/FLOP-optimized low-end retail Intel rig.

But manycore tests are always interesting because we hope to see signs of one manufacturer or another achieving a breakthrough in parallelism. Though in that regard, nearly all the action the past 5 years has been on the GPU side of the ledger.[/QUOTE]
Thought experiment: suppose one instance per 4-core socket ran at the same speed as his 32-core test; that's 8 instances, so 8-fold more throughput. It still loses to the dual-e5-2670 that I bought for a month's rent of the 32-arm-core system.
Another way to go at it would be to make 1-core, 4-core, and 8-core benchmark runs and compare them to the 32-core result.
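E.g., something like the following scaling series (a sketch using the same self-test flags already used in the thread; core numbering per Lorenzo's lscpu output):
[CODE]./Mlucas_v18_c2simd -s m -iters 100 -cpu 0       # 1 core
./Mlucas_v18_c2simd -s m -iters 100 -cpu 0:3     # 4 cores
./Mlucas_v18_c2simd -s m -iters 100 -cpu 0:7     # 8 cores
./Mlucas_v18_c2simd -s m -iters 100 -cpu 0:31    # all 32 cores[/CODE]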

ldesnogu 2019-03-31 17:39

[QUOTE=kriesel;512264]It still loses to the dual-e5-2670 that I bought for a month's rent of the 32-arm-core system.[/QUOTE]
How much power does your system consume? How much will that cost you?

For the record, the CPU from Ampere is not that great from a performance point of view; in particular, its FP performance is lower than that of Amazon's Cortex-A72 chip, despite running at 3.3 GHz vs 2.3 GHz: [URL]http://browser.geekbench.com/v4/cpu/compare/12589322?baseline=11678329[/URL]

It's not even that much faster than an S7: [URL]http://browser.geekbench.com/v4/cpu/compare/12589322?baseline=12621230[/URL]

BTW Ernst, I'm afraid I don't get why you're talking about multiple sockets. That system has a single socket.

kriesel 2019-03-31 19:21

[QUOTE=ldesnogu;512313]How much power does your system consume? How much will that cost you?[/QUOTE]~US$3 per 85M exponent total cost, including equipment amortization, utilities and taxes. Details at [url]https://www.mersenneforum.org/showpost.php?p=512218&postcount=20[/url].
I have no reason to expect that figure to be optimal among CPU choices; it's just one of the better ones in my little fleet. (Then there's curtisc's and others' $0/exponent, when the participant is using someone else's hardware and electricity.)

ewmayer 2019-03-31 19:34

[QUOTE=ldesnogu;512313]BTW Ernst, I'm afraid I don't get why you're talking about multiple sockets. That system has a single socket.[/QUOTE]

Ah, I didn't look into the details of that kind of system; I assumed it was a single-mobo cluster of 2- or 4-core Cortex CPUs.

