#78
∂2ω=0
Sep 2002
República de California
103×113 Posts
Don't really need the log; just post the resulting mlucas.cfg file.

'./Mlucas -s m -nthread 2' is the same as './Mlucas -s m -cpu 0:1', and './Mlucas -s m -nthread 4' is the same as './Mlucas -s m -cpu 0:3'.
#79
Banned
"Luigi"
Aug 2002
Team Italia
3²×5×107 Posts
The mlucas.cfg is correctly written after the test is completed, while I erroneously thought it was written on the fly.

Luigi

P.S. Guess you may need the 2- and 3-thread files as well... 3 looks like a good choice. If so, I will compute them tomorrow.

Last fiddled with by ET_ on 2017-06-23 at 22:26
#80
∂2ω=0
Sep 2002
República de California
103·113 Posts
Code:
         1-thread:        2-thread:        3-thread:        4-thread:
FFTlen  ms/iter ||-eff%  ms/iter ||-eff%  ms/iter ||-eff%  ms/iter ||-eff%
  1024   223.93  100      112.33  99.7      95.18  78.4      61.24  91.4
  1152   272.53  100      137.15  99.4     117.28  77.5      74.66  91.3
  1280   317.98  100      160.73  98.9     138.97  76.3      88.37  90.0
  1408   351.05  100      177.97  98.6     152.51  76.7      95.73  91.7
  1536   387.47  100      197.59  98.0     169.60  76.2     108.45  89.3
  1664   411.91  100      209.30  98.4     179.49  76.5     112.95  91.2
  1792   419.99  100      213.92  98.2     182.06  76.9     116.43  90.2
  1920   485.82  100      246.78  98.4     212.18  76.3     134.59  90.2
  2048   476.06  100      241.00  98.8     205.41  77.3     131.84  90.3
  2304   570.37  100      290.40  98.2     250.33  75.9     158.11  90.2
  2560   654.57  100      345.22  94.8     300.91  72.5     196.02  83.5
  2816   725.62  100      383.84  94.5     334.08  72.4     217.17  83.5
  3072   793.89  100      418.03  95.0     357.01  74.1     237.74  83.5
  3328   849.32  100      448.77  94.6     390.24  72.5     255.72  83.0
  3584   859.99  100      456.01  94.3     393.88  72.8     262.40  81.9
  3840   990.53  100      525.26  94.3     457.94  72.1     298.67  82.9
  4096   974.11  100      512.90  95.0     445.75  72.8     297.59  81.8
  4608  1213.42  100      615.35  98.6     537.29  75.3     353.29  85.9
  5120  1460.96  100      775.31  94.2     669.04  72.8     447.42  81.6
  5632  1617.00  100      857.64  94.3     742.06  72.6     495.67  81.6
  6144  1764.45  100      937.44  94.1     780.16  75.4     546.03  80.8
  6656  1897.60  100     1009.80  94.0     870.94  72.6     586.85  80.8
  7168  1945.88  100     1035.60  93.9     887.20  73.1     609.23  79.8
  7680  2231.54  100     1179.50  94.6    1021.03  72.9     691.53  80.7

- 3-thread scaling is by far the worst, unsurprisingly, because Mlucas is optimized for power-of-2 thread counts.

- The || scaling for 1, 2 and 4 threads is quite impressive, especially given that we typically expect one-single-thread-job-per-core mode to run at no better than 80-90% efficiency due to overall system memory contention among the jobs (i.e. 1-worker/4-thread ~= 4-worker/1-thread in total-throughput terms).

- The only significant timing anomaly is for FFT lengths of form 15*2^n, which are slower than the power-of-2 lengths just above them.
The medium self-test (-s m) currently does not do 8192K, but by extension those timings would likely be slightly faster than the 7680K ones, which form the bottom row of the above table.
#81
Banned
"Luigi"
Aug 2002
Team Italia
3²·5·107 Posts
#82
∂2ω=0
Sep 2002
República de California
103×113 Posts
On to the vector-asm coding effort!
#83 |
Banned
"Luigi"
Aug 2002
Team Italia
11317₈ Posts
In fact, I suppose that 5 boards and a switch enclosed inside a plexiglass cube may become hot and tend to throttle... Oh well, the time to design a cooling system will come.
#84 |
∂2ω=0
Sep 2002
República de California
103·113 Posts
With some helpful advice from fellow forumite and ARM employee Tom Womack (a.k.a. fivemack), I got my first nontrivial asm macros put together and timing-tested in the last few days.
Copied in the code-box below is the ARMv8 version - at least the first go - of a complex 4-DFT with 3 complex twiddles. I tested this side-by-side with the SSE2 version of the same macro, which has no FMAs, obviously. My initial timings were surprising, in that they had the ARM code running faster - not only in cycles but also in terms of wall-clock time - on 1 CPU of my 1.5GHz Odroid C2 than the SSE2 version of the same macro running on 1 CPU of my 2.0GHz Core2Duo MacBook. But based on the relative theoretical instruction throughputs of the 2 respective CPUs, that's simply wildly implausible.

One more odd thing from the same side-by-side testing: on the Core2/SSE2, once the GCC opt-level hit -O1 the macro timings bottomed out - not surprising, since the compiler treats the asm loop body as a black box and can only optimize the loop logic. However, on the ARM, going from -O1 to -O3 gave slightly better than a 2-fold speedup.

Further digging into possible timing-loop overheads quickly revealed the cause of the above oddities: the loop was running the macro in in-place mode, reading 16 doubles (8 complex-double, thus 4 vector-complex-double, hence "4-DFT") from a block of quasirandom (i.e. 'repeatably random') inited local memory, then writing back to the same memory. This requires one to re-init the macro inputs on each loop pass via memcpy. For one reason or another, GCC (v4.2 on my Core2 and 5.something on my Odroid) is doing a really bad job of this init step on the Core2 even at -O3, whereas in going from -O1 to -O3 on the Odroid the compiler is slashing the cost of the init. The obvious answer was to switch to running the macro in out-of-place mode, which allows the inputs to be inited just once, so that the loop body now consists of just the 4-DFT macro.

With that, here are the opcounts and cycle counts for the 2 respective 128-bit SIMD implementations:

x86_64 SSE2: 41 MEM (19 load [1 via mem-op in addpd], 14 store, 8 reg-copy), 22 ADDPD, 16 MULPD: 46 cycles.
(Note I can get this down to 36 cycles, but only by using > 8 vector registers, which is inconsistent with being able to do two such 4-DFTs side-by-side, one in vector registers 0-7, the other in 8-15.)

ARMv8 NEON: 11 MEM (7 load-pair, 4 store-pair, i.e. 22 total vector loads/stores via 11 instructions), 16 FADD, 12 FMUL/FMA: 93 cycles. (Here I use 12 vector registers to save some spill/fills and arithmetic, because I have 32 such registers to work with.)

Thus almost exactly double the cycle count on the ARM vs the Core2.

Here is the ARM inline-asm macro - note that q- and v- are different name prefixes for the same set of vector registers, the former treating a given register as an integer one, the latter as a floating-point register. The LDP and STP (load-pair and store-pair) instructions can be used to operate on either kind of underlying data but formally require the integer form of the register name. Thus we e.g. 'LDP q4,q5' to load 32 bytes of contiguous data from a memory location, then use the same resulting register data under the names v4 and v5 to do vector floating-point arithmetic on them. The '.2d' v-register suffixes mean 'treat register as pair of floating doubles': Code:
__asm__ volatile (\
	"ldr x0,%[__add0]		\n\t"\
	"ldr w1,%[__p1]			\n\t"\
	"ldr w2,%[__p2]			\n\t"\
	"ldr w3,%[__p3]			\n\t"\
	"ldr x4,%[__cc0]		\n\t"\
	"ldr x5,%[__r0]			\n\t"\
	"add x1, x0,x1,lsl #3	\n\t"\
	"add x2, x0,x2,lsl #3	\n\t"\
	"add x3, x0,x3,lsl #3	\n\t"\
	/* SSE2_RADIX_04_DIF_3TWIDDLE(r0,c0): */\
	/* Do the p0,p2 combo: */\
	"ldp q4,q5,[x2]				\n\t"\
	"ldp q8,q9,[x4]				\n\t"/* cc0 */\
	"ldp q0,q1,[x0]				\n\t"\
	"fmul v6.2d,v4.2d,v8.2d		\n\t"/* twiddle-mul: */\
	"fmul v7.2d,v5.2d,v8.2d		\n\t"\
	"fmls v6.2d,v5.2d,v9.2d		\n\t"\
	"fmla v7.2d,v4.2d,v9.2d		\n\t"\
	"fsub v2.2d ,v0.2d,v6.2d	\n\t"/* 2 x 2 complex butterfly: */\
	"fsub v3.2d ,v1.2d,v7.2d	\n\t"\
	"fadd v10.2d,v0.2d,v6.2d	\n\t"\
	"fadd v11.2d,v1.2d,v7.2d	\n\t"\
	/* Do the p1,3 combo: */\
	"ldp q8,q9,[x4,#0x40]		\n\t"/* cc0+4 */\
	"ldp q6,q7,[x3]				\n\t"\
	"fmul v0.2d,v6.2d,v8.2d		\n\t"/* twiddle-mul: */\
	"fmul v1.2d,v7.2d,v8.2d		\n\t"\
	"fmls v0.2d,v7.2d,v9.2d		\n\t"\
	"fmla v1.2d,v6.2d,v9.2d		\n\t"\
	"ldp q8,q9,[x4,#0x20]		\n\t"/* cc0+2 */\
	"ldp q6,q7,[x1]				\n\t"\
	"fmul v4.2d,v6.2d,v8.2d		\n\t"/* twiddle-mul: */\
	"fmul v5.2d,v7.2d,v8.2d		\n\t"\
	"fmls v4.2d,v7.2d,v9.2d		\n\t"\
	"fmla v5.2d,v6.2d,v9.2d		\n\t"\
	"fadd v6.2d,v4.2d,v0.2d		\n\t"/* 2 x 2 complex butterfly: */\
	"fadd v7.2d,v5.2d,v1.2d		\n\t"\
	"fsub v4.2d,v4.2d,v0.2d		\n\t"\
	"fsub v5.2d,v5.2d,v1.2d		\n\t"\
	/* Finish radix-4 butterfly and store results: */\
	"fsub v8.2d,v10.2d,v6.2d	\n\t"\
	"fsub v9.2d,v11.2d,v7.2d	\n\t"\
	"fsub v1.2d,v3.2d,v4.2d		\n\t"\
	"fsub v0.2d,v2.2d,v5.2d		\n\t"\
	"fadd v6.2d,v6.2d,v10.2d	\n\t"\
	"fadd v7.2d,v7.2d,v11.2d	\n\t"\
	"fadd v4.2d,v4.2d,v3.2d		\n\t"\
	"fadd v5.2d,v5.2d,v2.2d		\n\t"\
	"stp q6,q7,[x5      ]		\n\t"/* out 0 */\
	"stp q0,q4,[x5,#0x20]		\n\t"/* out 1 */\
	"stp q8,q9,[x5,#0x40]		\n\t"/* out 2 */\
	"stp q5,q1,[x5,#0x60]		\n\t"/* out 3 */\
	: /* outputs: none */\
	: [__add0] "m" (r0)	/* All inputs from memory addresses here */\
	 ,[__p1] "m" (p1)\
	 ,[__p2] "m" (p2)\
	 ,[__p3] "m" (p3)\
	 ,[__two] "m" (two)\
	 ,[__cc0] "m" (cc0)\
	 ,[__r0] "m" (r0)\
	: "cc","memory","x0","x1","x2","x3","x4","x5","v0","v1","v2","v3","v4",\
	  "v5","v6","v7","v8","v9","v10","v11"	/* Clobbered registers */\
);

x86_64 SSE2: 1-col = 46 cycles, 2-col = 77 cycles.
ARM NEON: 1-col = 93 cycles, 2-col = 165 cycles.

Thus a decent per-cycle throughput gain on both when running two DFT columns side-by-side, but comparatively more for the SSE2 code. To the ARM experts hereabouts, do those timings seem reasonable?
#85 |
∂2ω=0
Sep 2002
República de California
103·113 Posts
I am only a few weeks away from releasing a beta version of Mlucas with ARMv8 SIMD-assembly support - many thanks to fellow forumite and ARM engineer Tom Womack (a.k.a. fivemack) for much useful assistance in my early, steep-part-of-the-learning-curve coding efforts.
But let me damp down expectations right off the bat: the performance of the SIMD code is less than I'd hoped in my wide-eyed initial guesstimates - it looks like all 4 cores of my Odroid C2 are roughly equivalent to 1 core of my vintage-2009 Core2 Duo MacBook running an SSE2 build of Mlucas, and equivalent to perhaps 1/8th of both cores of my ham-sandwich-sized Intel Broadwell NUC running an AVX2 build of the code. I'm not sure how that stacks up on a per-watt basis - probably decently enough - but overall we're talking on the order of half a year or more to do a single exponent at the current GIMPS wavefront, and that number needs to come down in order to spur any appreciable user adoption.

So I'm hoping readers/future-users of the ARMv8 code can tell me some good news about better performance for higher-end ARMv8 systems than my humble A53, and e.g. low-cost multi-socket ARMv8 systems which contain multiple copies of such 4-core CPUs. Heck, for this kind of work we'd really like just a simple board with multiple CPUs, and wouldn't even need any memory-subsystem support for interprocessor communication, but I'm guessing that's a no-go from a marketing perspective for general-purpose compute hardware.

Some interesting performance trends are already visible in the current powers-of-2-only binary. (Adding non-power-of-2 support is pretty quick at this point, since it shares ~90% of the power-of-2 FFT-code infrastructure, merely requiring implementation of odd-radix DFT macros for radices 3,5,7,9,11,13,15, only the two composite of which require separate DIF and DIT versions of said macros.)

A key performance-related parameter relates to the leading radix, used for the initial fFFT (forward-FFT) pass and final iFFT pass - let's call said radix R. Say I'm doing a length-N FFT.
Once I do that initial radix-R pass, which accesses stride-N/R-separated sets of data, the subsequent passes of the fFFT, the dyadic-mul step and the iFFT passes all the way up to the final radix-R iFFT one all operate on R disjoint chunks of N/R data each, which naturally are assigned to separate threads in a multithreaded run. The size of these disjoint chunks is thus key in terms of getting good cache performance - we want each such chunk to fit into L2, typically with some room to spare. Thus at any given FFT length we typically see a "sweet spot" leading radix R: make R smaller and the resulting larger data chunks start to spill out of L2; make R too large and the overhead of handling many small chunks begins to dominate the runtime.

A typical example on the ARM is provided by various radix combos at 4096K FFT, where the sweet spot is at R = 256, which yields an N/R chunksize of ~32MB/256 = 128 kB. Let's compare 100 iterations using leading radices 128 and 256 here, for 1 and 4 threads:

  R    1-thr    4-thr
 128   70 sec   24 sec
 256   64 sec   23 sec

i.e. R = 256 gives a 10% speedup over R = 128 in single-thread mode, but the advantage drops to just 4% when running 4-threaded. (The eagle-eyed follower of this thread may compare these timings to those I gave for my initial scalar-double C-code build in post 80 and note that e.g. the above 64/23 sec for 1/4-thread represent only a ~1.5x speedup over the non-SIMD build - like I said, underwhelming.)

Two possibilities immediately come to mind to explain the 4-thread behavior: memory bandwidth (i.e. the RAM can't keep all 4 cores fed) and thermal throttling (the C2 has no cooling fan, just a small heatsink). Is there a way to see if the latter is occurring, and if so, to what degree? I know there are temp-monitoring packages for various Linux distros, but those require the OS to play nice with the underlying CPU and motherboard hardware.
In post 50 of this thread VictordeHolland mentions a power draw of 200mW per core for the A53 implementation of ARMv8 (the one in the Odroid C2), but the little passive heatsink on my C2 gets sufficiently hot under load that I'm skeptical of the total die power draw being under 1W ... might the L2 cache's power draw be excluded from the 200mW figure? The cheap little plastic housing I bought as an accessory for my C2 is poorly ventilated and surely doesn't help, but is necessary to keep on at the moment, since I'm sneakernetting code updates via thumb drive and need to protect the board during all that plugging-and-playing. The whole thermal-throttling thing may prove to be a bogus notion, but I won't know until I know, right?

Lastly for now, can any of our resident ARM coders tell me whether ARM has some analog of x86's CPUID functionality? It would be nice to be able to support the same kind of functionality I use for x86 SIMD-capable CPUs/builds: on program startup, check the CPU's SIMD support against that targeted by the build; if the build-target instructions are not supported by the CPU, quit with an error; if the CPU supports SIMD but the build does not target same, print an info-message to that effect.
#86
Undefined
"The unspeakable one"
Jun 2006
My evil lair
2²×1,549 Posts
Last fiddled with by retina on 2017-10-28 at 00:12
#87 |
Sep 2003
5×11×47 Posts
There's a persistent rumor that Apple, which already designs its own ARM-based chips for mobile devices, is considering moving the Mac to ARM too.
They have already successfully navigated two architecture switches in past decades, from Motorola 68xxx to PowerPC to Intel, so it's surely feasible. Intel architecture has stagnated for some time now, and seems to be running up against its inherent limitations. And the consumer demand for faster chips on the desktop is modest at best. Meanwhile, all the mobile devices are doing face recognition and augmented reality and whatnot while coping with limited battery life, so there's a neverending, powerful, industrywide incentive to keep making ARM run faster and use less power.

So I think ARM will at some point become very relevant to our interests, and it's good to get ahead of the curve. I personally wouldn't care if each individual exponent took half a year, as long as I could run a whole bunch of them in parallel at the lowest possible buck for the bang.
#88
Jan 2008
France
2·5²·11 Posts
There exist two possibilities:
Last fiddled with by ldesnogu on 2017-10-28 at 08:23