mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Mlucas (https://www.mersenneforum.org/forumdisplay.php?f=118)
-   -   ARM builds and SIMD-assembler prospects (https://www.mersenneforum.org/showthread.php?t=21992)

Lorenzo 2017-03-14 20:02

[QUOTE=VictordeHolland;454872]You got it working, nice!
That is a Pine64 with 4x ARM Cortex A53 cores (@1.4GHz) right?

I'm a little bit surprised it is about as fast as my Odroid-U2 (4x ARM Cortex A9 cores @1.7GHz), which is only 32-bit and a much older architecture.
[URL]http://mersenneforum.org/showpost.php?p=426575&postcount=94[/URL]
[/QUOTE]

Right, this is a PINE64 board.
Anyway, the PINE64 results are a bit better, and your device runs at a higher frequency (+300MHz per core). So the PINE64 should be much faster at the same frequency as your device :)

I also ran a benchmark with 1-3 threads at FFT length 2048K, which is more or less the size currently being tested:
[CODE]./mlucas -fftlen 2048 -nthread N -iters 10[/CODE]
[CODE] 2048 msec/iter = 707.58 ROE[avg,max] = [0.000000000, 0.000091553] radices = 256 16 16 16 0 0 0 0 0 0
2048 msec/iter = 371.82 ROE[avg,max] = [0.000000000, 0.000091553] radices = 256 16 16 16 0 0 0 0 0 0
2048 msec/iter = 241.66 ROE[avg,max] = [0.000000000, 0.000091553] radices = 256 16 16 16 0 0 0 0 0 0
[/CODE]

ewmayer 2017-03-14 20:21

[QUOTE=VictordeHolland;454872]You got it working, nice!
That is a Pine64 with 4x ARM Cortex A53 cores (@1.4GHz) right?

I'm a little bit surprised it is about as fast as my Odroid-U2 (4x ARM Cortex A9 cores @1.7GHz), which is only 32-bit and a much older architecture.
...
BTW: Is it possible to compile and run Mlucas on Windows 7/10? If so, I could try to run benchmarks on my i5 2500k and/or i7 3770k[/QUOTE]

Thanks for the timings! 32 vs 64-bit speed for LL testing is overwhelmingly a matter of the double-precision floating-point capability - how do those 2 versions of the ARM compare in that regard?

I used to have Win-buildability in the 32-bit days for the x86, but MSFT delayed supporting 64-bit inline asm by at least 4-5 years (w.r.to when x86_64 started shipping), so I dropped Win support years ago. To build/run under Win you'll need a Linux emulator.

wombatman 2017-03-14 20:37

Under Windows in the Ubuntu shell with an i7-6700 @ 3.4GHz, using:
[CODE]./mlucas -fftlen 2048 -nthread N -iters 10 [/CODE]

with N=1 to 8 (4 core machine with hyperthreading)

[CODE] 2048 msec/iter = 21.03 ROE[avg,max] = [0.000000000, 0.000091553] radices = 32 32 32 32 0 0 0 0 0 0
2048 msec/iter = 13.90 ROE[avg,max] = [0.000000000, 0.000091553] radices = 32 8 16 16 16 0 0 0 0 0
2048 msec/iter = 11.43 ROE[avg,max] = [0.000000000, 0.000091553] radices = 64 16 32 32 0 0 0 0 0 0
2048 msec/iter = 10.52 ROE[avg,max] = [0.000000000, 0.000091553] radices = 32 8 16 16 16 0 0 0 0 0
2048 msec/iter = 10.79 ROE[avg,max] = [0.000000000, 0.000091553] radices = 32 8 16 16 16 0 0 0 0 0
2048 msec/iter = 11.39 ROE[avg,max] = [0.000000000, 0.000091553] radices = 128 16 16 32 0 0 0 0 0 0
2048 msec/iter = 11.06 ROE[avg,max] = [0.000000000, 0.000091553] radices = 256 16 16 16 0 0 0 0 0 0
2048 msec/iter = 11.43 ROE[avg,max] = [0.000000000, 0.000091553] radices = 256 16 16 16 0 0 0 0 0 0[/CODE]

With 100 iterations:

[CODE]2048 msec/iter = 18.29 ROE[avg,max] = [0.247767857, 0.250000000] radices = 32 32 32 32 0 0 0 0 0 0 100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274, 8060072069, 29249383388
2048 msec/iter = 11.17 ROE[avg,max] = [0.341964286, 0.375000000] radices = 256 16 16 16 0 0 0 0 0 0 100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274, 8060072069, 29249383388
2048 msec/iter = 8.36 ROE[avg,max] = [0.312165179, 0.375000000] radices = 128 16 16 32 0 0 0 0 0 0 100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274, 8060072069, 29249383388
2048 msec/iter = 7.84 ROE[avg,max] = [0.341964286, 0.375000000] radices = 256 16 16 16 0 0 0 0 0 0 100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274, 8060072069, 29249383388
2048 msec/iter = 7.67 ROE[avg,max] = [0.312165179, 0.375000000] radices = 128 16 16 32 0 0 0 0 0 0 100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274, 8060072069, 29249383388
2048 msec/iter = 7.64 ROE[avg,max] = [0.341964286, 0.375000000] radices = 256 16 16 16 0 0 0 0 0 0 100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274, 8060072069, 29249383388
2048 msec/iter = 7.70 ROE[avg,max] = [0.341964286, 0.375000000] radices = 256 16 16 16 0 0 0 0 0 0 100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274, 8060072069, 29249383388
2048 msec/iter = 7.69 ROE[avg,max] = [0.341964286, 0.375000000] radices = 256 16 16 16 0 0 0 0 0 0 100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274, 8060072069, 29249383388[/CODE]

ewmayer 2017-03-14 20:48

@wombatman: Suggest you use 1000-iter for your multithread-scaling tests, to minimize init-overhead effects. (More precisely, one would do 1000*(t_1000-t_100)/900.)
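
A minimal Python sketch of the differencing idea above, assuming t_100 and t_1000 are total wall-clock times in seconds for the 100- and 1000-iteration runs (the post does not state the units; the factor of 1000 would then convert to msec). The function name and the sample timings are hypothetical and are not Mlucas output.

```python
def steady_state_msec_per_iter(t_short_s, t_long_s, n_short=100, n_long=1000):
    """Per-iteration time in msec with the fixed init cost cancelled out.

    Assumes both runs pay the same one-time init cost, so subtracting the
    short run from the long run leaves only (n_long - n_short) iterations.
    """
    if n_long <= n_short:
        raise ValueError("n_long must exceed n_short")
    return 1000.0 * (t_long_s - t_short_s) / (n_long - n_short)


# Hypothetical totals: 0.5 s of init plus 8 msec per iteration.
t_100 = 0.5 + 100 * 0.008
t_1000 = 0.5 + 1000 * 0.008

corrected = steady_state_msec_per_iter(t_100, t_1000)
naive = 1000.0 * t_1000 / 1000  # plain average over the 1000-iteration run

print(f"corrected: {corrected:.3f} msec/iter, naive average: {naive:.3f} msec/iter")
```

With these made-up numbers the plain 1000-iteration average still carries a share of the init cost, while the differenced figure does not.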

wombatman 2017-03-14 23:53

No problem. I can do that tomorrow when I'm back at work :smile:

VictordeHolland 2017-03-15 13:29

2 Attachment(s)
[QUOTE=ewmayer;454878]Thanks for the timings! 32 vs 64-bit speed for LL testing is overwhelmingly a matter of the double-precision floating-point capability - how do those 2 versions of the ARM compare in that regard?
[/QUOTE]
ARM Cortex A9 was announced in October 2007, the Cortex A53 in October 2012. But 5 years newer doesn't tell the whole story. The design choices were different.

A9
The ARM Cortex A9 was designed as a 'performance' core (with a power budget) and is dual-issue, out-of-order. In other words, it can decode and send two instructions per clock to the execution units and reorder them if necessary to extract extra performance. But a Vector Floating Point (VFP) execution unit is [U]not[/U] mandatory in the A9. Most A9s do have the (optional) Vector Floating Point v3 (VFPv3) unit for handling FP, though. It has 32 registers of 64 bits with NEON capability. NEON is ARM's name for its SIMD extension. If I understand it all correctly, the A9 is limited to [U]1 DP float per clock.[/U]
A53
ARM designed the A53 with (high) power efficiency in mind, as it is supposed to fill the role of the 'LITTLE' cores in their big.LITTLE philosophy. So in many high-end devices (mostly phones) they are coupled with more powerful A57 or A72 cores. When maximum responsiveness is needed (loading websites/apps, games, etc.) the A57/A72 cores are used, while the A53s handle background tasks with their greater efficiency in order to extend battery life. Anandtech tested them in the Samsung Exynos 7420 and 5433, taking into account overhead and different frequencies, and concluded that a Cortex A53 core consumes ~200mW/core @1.4GHz (see attached graph). The A53 is a dual-issue, in-order design with VFPv4 + (advanced) NEON [U]mandatory[/U]. The VFPv4 has 32 registers of 128 bits, theoretically allowing it to process [U]2 DP floats per clock.[/U]
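
Taking the per-clock figures in this post at face value, the theoretical peak for each board is just cores x clock x DP results per clock. A rough Python sketch follows; the function name is made up for illustration, and the per-clock values are the tentative ones stated above, not measured or confirmed numbers.

```python
def peak_dp_per_second(cores, clock_ghz, dp_per_clock):
    """Theoretical peak, in billions of DP results per second, over all cores."""
    return cores * clock_ghz * dp_per_clock


# Core counts and clocks as given in this thread; DP-per-clock values are
# the post's "if I understand it correctly" figures, i.e. assumptions.
boards = {
    "ODROID-U2 (4x A9 @ 1.7 GHz, 1 DP/clock)": peak_dp_per_second(4, 1.7, 1),
    "PINE64 (4x A53 @ 1.4 GHz, 2 DP/clock)": peak_dp_per_second(4, 1.4, 2),
}

for name, peak in boards.items():
    print(f"{name}: {peak:.1f} G DP results/s (theoretical peak)")
```

This is an upper bound only; it ignores memory bandwidth, cache size and issue restrictions, all of which matter for large FFTs.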

Other differences which could impact performance of the ODROID-U2 vs. PINE64:
ODROID-U2 has 1MB L2 cache (shared amongst the cores), PINE64 512KB L2 (shared amongst the cores).
Fab: 32nm (U2) vs. 40nm (PINE64) which might explain/allow the U2 to clock slightly higher (1.7GHz vs 1.4GHz).

If anybody has a board with A57s or A72s, please share your benchmarks, we're curious how they perform :).

I've also attached a graph with a comparison of Dhrystone benchmark performance (DMIPS/MHz) of different ARM architectures. Keep in mind Dhrystone is an old integer benchmark, but it gives a rough idea.

PINE64 Allwinner A64:
[URL]http://linux-sunxi.org/A64[/URL]
ODROID-U2:
[URL]http://www.hardkernel.com/main/products/prdt_info.php?g_code=G135341370451&tab_idx=2[/URL]

Useful pages for comparison between cores:
[URL]https://en.wikipedia.org/wiki/Comparison_of_ARMv7-A_cores[/URL]
[URL]https://en.wikipedia.org/wiki/Comparison_of_ARMv8-A_cores[/URL]

wombatman 2017-03-15 14:57

[QUOTE=ewmayer;454884]@wombatman: Suggest you use 1000-iter for your multithread-scaling tests, to minimize init-overhead effects. (More precisely, one would do 1000*(t_1000-t_100)/900.)[/QUOTE]

As requested, the 1000 iteration tests for nthread = 1-4:
[CODE] 2048 msec/iter = 18.94 ROE[avg,max] = [0.370465528, 0.375000000] radices = 128 16 16 32 0 0 0 0 0 0 1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 81AEAC0C7E6089BB, 25132671466, 41950605021
2048 msec/iter = 11.20 ROE[avg,max] = [0.370465528, 0.375000000] radices = 128 16 16 32 0 0 0 0 0 0 1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 81AEAC0C7E6089BB, 25132671466, 41950605021
2048 msec/iter = 8.48 ROE[avg,max] = [0.372615979, 0.375000000] radices = 256 16 16 16 0 0 0 0 0 0 1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 81AEAC0C7E6089BB, 25132671466, 41950605021
2048 msec/iter = 8.01 ROE[avg,max] = [0.372615979, 0.375000000] radices = 256 16 16 16 0 0 0 0 0 0 1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 81AEAC0C7E6089BB, 25132671466, 41950605021[/CODE]
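
For reading the thread scaling out of these numbers, here is a small Python sketch that turns msec/iter figures into speedup and parallel efficiency relative to the 1-thread run. The helper name is invented for illustration; the timings are the four 1000-iteration values listed above.

```python
def scaling_table(msec_per_iter):
    """Return (threads, speedup, efficiency) tuples relative to the 1-thread time.

    Entry i of msec_per_iter is assumed to be the timing for i+1 threads.
    """
    base = msec_per_iter[0]
    rows = []
    for n, t in enumerate(msec_per_iter, start=1):
        speedup = base / t
        rows.append((n, speedup, speedup / n))
    return rows


# 1000-iteration msec/iter for nthread = 1..4, copied from the output above.
timings = [18.94, 11.20, 8.48, 8.01]
rows = scaling_table(timings)

for n, speedup, efficiency in rows:
    print(f"{n} thread(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

On these figures the fourth thread adds comparatively little over the third.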

For anyone else wanting to run mlucas on Windows 10, the Ubuntu shell works well, and mlucas compiles straight away.

ldesnogu 2017-03-15 15:17

IIRC Cortex-A9 can only issue one DP mul every other cycle. But that was so long ago, that I might be wrong...

ewmayer 2017-03-15 22:02

Many thanks for the details, VdH - one key point, though, needing clarification - In accordance with my earlier post re. the number of 128-bit registers, I believe your "VFPv4 has 32 such" is 2x too large. From Wikipedia (underlines mine):
[i]
[b]VFPv4 or VFPv4-D32[/b]
Implemented on the Cortex-A12 and A15 ARMv7 processors, Cortex-A7 optionally has VFPv4-D32 in the case of an FPU with NEON.[81] [u]VFPv4 has 32 64-bit FPU registers as standard[/u], adds both half-precision support as a storage format and fused multiply-accumulate instructions to the features of VFPv3.
[/i]
The same wikipage says AArch64 has 31 64-bit GPRs - just confirming, those are distinct from the FPRs, yes?

Wombatman, thanks for the timings - so no appreciable difference vs your simple 100-iter ones here. (This varies a lot by CPU, thus always better safe than sorry.)

ldesnogu 2017-03-15 22:49

AArch64 has 32 128-bit SIMD/FP regs on top of 31 64-bit int regs.

fivemack 2017-03-15 22:51

AArch64 has 32 integer registers (but X31 reads as zero and throws away anything written to it, so basically that's 31 registers), and also 32 128-bit-wide "SIMD and floating-point" registers.

Code looks like

FADD V3.2D, V5.2D, V7.2D (which adds the doubles in V5[127:64] and V7[127:64] and puts the result in V3[127:64], and also adds the doubles in V5[63:0] and V7[63:0] and puts the result in V3[63:0])

or FADD S3, S7, S2 (which adds the bottom floats of V7 and V2, puts the result in S3, i.e. the bottom float of V3, and sets the other three floats of V3 to zero)

It has fused multiply-add support, but in the form Vd = Vd + Vm*Vn because there isn't space to pass four five-bit register names in a 32-bit opcode (there is also an FMLS instruction that does the Vd = Vd - Vm*Vn form).
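
To make the lane semantics concrete, here is a small Python model of the 2D (two-double) vector forms described above. It is only a sketch: each register is modelled as a two-element list with index 0 standing for bits [63:0] and index 1 for bits [127:64], the function names are invented, and plain Python arithmetic rounds twice where a real fused multiply-add rounds once. FMLA is the AArch64 mnemonic for the accumulate form; the post itself names only FMLS.

```python
def fadd_2d(vn, vm):
    """Model of FADD Vd.2D, Vn.2D, Vm.2D: add the two double lanes independently."""
    return [vn[0] + vm[0], vn[1] + vm[1]]


def fmla_2d(vd, vn, vm):
    """Model of FMLA Vd.2D, Vn.2D, Vm.2D: Vd = Vd + Vn*Vm in each lane.

    The destination doubles as the accumulator input, so only three
    register names are needed. (Not truly fused here: two roundings.)
    """
    return [vd[0] + vn[0] * vm[0], vd[1] + vn[1] * vm[1]]


def fmls_2d(vd, vn, vm):
    """Model of FMLS Vd.2D, Vn.2D, Vm.2D: Vd = Vd - Vn*Vm in each lane."""
    return [vd[0] - vn[0] * vm[0], vd[1] - vn[1] * vm[1]]


# Values chosen to be exactly representable, so the rounding caveat does not bite.
v5 = [1.5, 2.0]
v7 = [0.25, 4.0]
v3 = fadd_2d(v5, v7)

acc = [10.0, 20.0]
acc_plus = fmla_2d(acc, v5, v7)
acc_minus = fmls_2d(acc, v5, v7)

print(v3, acc_plus, acc_minus)
```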


All times are UTC.