mersenneforum.org  

Old 2017-11-07, 02:39   #100
ewmayer
2ω=0
Sep 2002
República de California
2²·2,939 Posts

Getting back to the issue of heat and throttling, last night I noticed this blurb on the Odroid C2 page over at hardkernel.com: "Remove the Jumper on J1, if you don't use the USB OTG port as a power input. It will reduce the power consumption and heat significantly."

The tiny-fonted glossy paper info insert that shipped with the C2 has what appears to be intended as an analogous notice, but its language, which refers to the same pictured jumper as J8, is the much-less-clear "OTG Port Enable Jumper: If you remove the J8 jumper, the power path from the micro-USB port is disabled for stable access to the USB device mode (Gadget driver or ADB/Fastboot interface)". That leaves it quite unclear whether removing the jumper is desirable, hence I left it in place when I first set up the Odroid.

Anyhoo, having read the clarifying note about reduced power consumption at the manufacturer's website, I hand-drilled a small hole in one side of the upper clamshell of the clear plastic housing I bought as an accessory, allowing me to insert a thermal probe in between the central fins of the fanless heatsink, and did some experiments, all using 10000 LL-test iterations at FFT length 3840K, using the just-completed vector-arithmetic code running in parallel mode on 4 processor cores:

1. Jumper in place, top of housing in place: T started ~40C and stabilized at 75C, runtime = 2049s.

2. Jumper removed, top of housing in place: T stabilized at 68C, runtime = 2088s. So using the temperature as a proxy, the power consumption is indeed significantly reduced, with the precise wattage difference unclear (I don't have a watt-or-ammeter). The runtime, however, is actually 2% *greater*, which is bizarre.

3. Jumper removed, top of housing removed (but no discernible room air movement to further aid cooling): T stabilized at 57C, runtime = 2065s, only 1% lower than [2].

4. As in [3] but with the Odroid set atop the upper case fan of my Haswell quad: T stabilized at 40C, runtime = 2069s, immaterially different from [3]. So much for the "throttling is an issue" hypothesis.
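For a quick cross-check of the percentages quoted in runs 1-4, here is a minimal Python sketch. The names `runs`, `ms_per_iter` and `pct_change` are made up for illustration; the temperatures and runtimes are the ones listed above. It converts each 10000-iteration runtime to msec/iteration and prints the change relative to the previous run.

```python
# Runs 1-4 from the post: (description, stabilized temperature in C, runtime in s)
# for 10000 LL-test iterations at FFT length 3840K on 4 cores.
ITERATIONS = 10_000

runs = [
    ("jumper in, lid on", 75, 2049),
    ("jumper out, lid on", 68, 2088),
    ("jumper out, lid off", 57, 2065),
    ("jumper out, lid off, on case fan", 40, 2069),
]


def ms_per_iter(runtime_s, iterations=ITERATIONS):
    """Convert a total runtime in seconds to milliseconds per iteration."""
    return runtime_s * 1000.0 / iterations


def pct_change(new, old):
    """Signed percentage change of `new` relative to `old`."""
    return 100.0 * (new - old) / old


previous = None
for number, (label, temp_c, runtime_s) in enumerate(runs, start=1):
    line = f"[{number}] {label}: {temp_c}C, {ms_per_iter(runtime_s):.1f} ms/iter"
    if previous is not None:
        line += f", {pct_change(runtime_s, previous):+.1f}% vs previous run"
    print(line)
    previous = runtime_s
```

Run 1 works out to roughly 204.9 msec/iteration, and the spread across all four runs stays within about 2% even though the temperature drops by 35C.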

So here is a table of best-FFT-params timings at the FFT lengths covered by the 'medium' self-tests (via './Mlucas -s m'), with analogous timings from my Core2 MacBook for comparison. 'Scalar' means the default C-code build using doubles, no SIMD, no assembly. 'SIMD' means a build with vector-arithmetic assembly enabled; 'gain' is the speedup at each FFT length obtained from use of SIMD. All timings were generated via 4-threaded runs on ARMv8, and 2-threaded on Core2Duo. On both processors SIMD is 128-bit. All timings in msec/iteration:
Code:
		ARMv8				Core2Duo
FFT(K)	Scalar	SIMD	gain		Scalar	SIMD	gain
-----	------	------	----		------	------	----
1024	 61.37	 44.27	1.39x		 58.10	 26.39	2.20x
1152	 74.67	 47.58	1.57x		 72.72	 31.71	2.29x
1280	 88.43	 52.05	1.70x		 85.30	 34.63	2.46x
1408	 96.07	 60.55	1.59x		 92.33	 40.95	2.25x
1536	110.80	 66.62	1.66x		106.41	 42.40	2.51x
1664	114.43	 74.46	1.54x		111.07	 47.24	2.35x
1792	121.25	 80.90	1.50x		118.43	 50.58	2.34x
1920	134.09	 88.92	1.51x		131.35	 54.25	2.42x
2048	133.86	 94.31	1.42x		119.78	 57.07	2.10x
2304	158.86	110.96	1.43x		140.42	 67.93	2.07x
2560	204.72	130.47	1.57x		168.86	 72.74	2.32x
2816	223.44	147.72	1.51x		183.47	 84.00	2.18x
3072	241.55	158.51	1.52x		212.86	 88.95	2.39x
3328	259.77	174.36	1.49x		230.13	 99.93	2.30x
3584	267.83	189.83	1.41x		239.19	105.10	2.28x
3840	304.38	206.49	1.47x		259.15	114.34	2.27x
4096	306.74	222.01	1.38x		265.52	123.71	2.15x
4608	361.18	260.47	1.39x		303.17	143.00	2.12x
5120	457.37	290.94	1.57x		360.84	156.46	2.31x
5632	504.79	331.82	1.52x		387.65	180.17	2.15x
6144	554.07	364.37	1.52x		463.53	195.02	2.38x
6656	598.31	389.86	1.53x		470.08	209.32	2.25x
7168	625.96	425.13	1.47x		490.23	226.07	2.17x
7680	704.32	469.40	1.50x		533.57	243.69	2.19x
The reason I had early on hoped for something like a 3x gain from use of vector arithmetic on the ARM was the on-average greater-than-2x gain from similar I got on the Core2, coupled with the fact that, unlike the Core2, the ARM has both FMA support and a generous set of 32 vector registers compared to just 16 for the x86 SSE2. But the ARM is limited to at most one vector op per cycle, a constraint which more than negates the aforementioned pluses. Let's hope future iterations of the ARM vector architecture improve on the hardware instruction throughput - even a restricted form of dual-issue capability, allowing, say, up to one of each of vector (add or sub) and (mul or fma) to be issued per cycle should give a nice boost to things.
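To make the comparison of gains concrete, the sketch below recomputes the 'gain' column for a few rows, assuming it is simply scalar time divided by SIMD time (which matches the tabulated values). The row selection and the names `rows` and `gain` are illustrative choices, not part of Mlucas.

```python
# A few rows copied from the table above, all in msec/iteration:
# (FFT length in K, ARMv8 scalar, ARMv8 SIMD, Core2Duo scalar, Core2Duo SIMD)
rows = [
    (1024, 61.37, 44.27, 58.10, 26.39),
    (2048, 133.86, 94.31, 119.78, 57.07),
    (4096, 306.74, 222.01, 265.52, 123.71),
    (7680, 704.32, 469.40, 533.57, 243.69),
]


def gain(scalar_ms, simd_ms):
    """Speedup from SIMD, assumed here to be scalar time / SIMD time."""
    return scalar_ms / simd_ms


arm_gains = [gain(a_scalar, a_simd) for _, a_scalar, a_simd, _, _ in rows]
c2d_gains = [gain(c_scalar, c_simd) for _, _, _, c_scalar, c_simd in rows]

for (fft_k, *_), g_arm, g_c2d in zip(rows, arm_gains, c2d_gains):
    print(f"{fft_k:5d}K  ARMv8 {g_arm:.2f}x  Core2Duo {g_c2d:.2f}x")

# Mean gain over the sampled rows only (not the full table).
print(f"mean ARMv8 gain:    {sum(arm_gains) / len(arm_gains):.2f}x")
print(f"mean Core2Duo gain: {sum(c2d_gains) / len(c2d_gains):.2f}x")
```

On these rows the ARMv8 gain stays well under 2x while the Core2Duo gain stays above 2x, which is the gap the paragraph above attributes to the one-vector-op-per-cycle limit.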

Now back to the final code/build issues needing to be taken care of - I hope to have a release tarball ready by Friday.
Old 2017-11-07, 11:04   #101
ET_
Banned
"Luigi"
Aug 2002
Team Italia
5×7×139 Posts

Quote:
Originally Posted by ewmayer
On both processors SIMD is 128-bit. All timings in msec/iteration:
[...]
How curious... just 3 seconds of difference in the scalar code up to 1920 K FFT, and then a huge and constant increment in timings.
Old 2017-11-07, 22:55   #102
VictordeHolland
"Victor de Hollander"
Aug 2011
the Netherlands
10010011011₂ Posts

Quote:
Originally Posted by ET_
How curious... just 3 seconds of difference in the scalar code up to 1920 K FFT, and then a huge and constant increment in timings.
Timings are in ms/iteration

Still, I'm also surprised these 4 little Cortex-A53 cores @1.5GHz can almost keep up with 2 cores of a C2D (albeit in scalar mode).

When I did some testing of my Odroid-U2 (4x Cortex-A9 @1.7GHz) some time ago it got reckt by a C2D, but that was probably scalar vs. SIMD which would explain the large discrepancy.


BTW, nice work ewmayer!
Old 2017-11-07, 23:23   #103
VictordeHolland
"Victor de Hollander"
Aug 2011
the Netherlands
10010011011₂ Posts

I added my timings from my ODROID-U2 and C2D E7400 that I tested earlier this year
http://mersenneforum.org/showpost.ph...2&postcount=44
to ewmayer's table:

Code:
         ARMv7           ARMv8                 Core2Duo                Core2Duo (Mlucas)    Core2Duo (Mprime)    
     (4x A9 @1.7GHz)  (4x A53 @1.5GHz)        (ewmayer)                E7400 @2.8GHz        E7400 @2.8GHz
FFT(K)  Scalar?        Scalar    SIMD        Scalar    SIMD           SIMD(1C) SIMD(2C)    SIMD(1C) SIMD(2C)
-----    -----        ------    ------        ------    ------        ------    ------      ------    ------
1024    121.70         61.37     44.27         58.10     26.39         33.76    20.01        16.70    15.64
1152    142.69         74.67     47.58         72.72     31.71         40.30    25.43            
1280    161.44         88.43     52.05         85.30     34.63         45.42    28.85        21.58    19.03
1408    185.52         96.07     60.55         92.33     40.95         52.31    35.14            
1536    195.56        110.80     66.62        106.41     42.40         53.31    33.98        27.72    23.00
1664    208.36        114.43     74.46        111.07     47.24         61.81    38.98            
1792    222.32        121.25     80.90        118.43     50.58         65.81    40.84            
1920    243.65        134.09     88.92        131.35     54.25         70.98    45.63            
2048    255.25        133.86     94.31        119.78     57.07         71.98    45.92        32.14    31.35
2304    297.26        158.86    110.96        140.42     67.93         81.60    54.36            
2560    339.70        204.72    130.47        168.86     72.74         90.96    54.64        46.91    39.69
2816    384.56        223.44    147.72        183.47     84.00        102.69    63.06            
3072    413.85        241.55    158.51        212.86     88.95        112.85    67.77            
3328                  259.77    174.36        230.13     99.93        123.71    74.36            
3584    370.28        267.83    189.83        239.19    105.10        135.08    79.71        83.96    85.87
3840                  304.38    206.49        259.15    114.34        135.08    87.04            
4096    455.10        306.74    222.01        265.52    123.71        154.69    92.87        72.50    66.71
4608                  361.18    260.47        303.17    143.00        177.26    106.31            
5120                  457.37    290.94        360.84    156.46        201.17    116.95       88.87    80.26
5632                  504.79    331.82        387.65    180.17        224.76    147.80            
6144                  554.07    364.37        463.53    195.02        244.47    150.32      111.36    94.00
6656                  598.31    389.86        470.08    209.32        271.08    164.51            
7168                  625.96    425.13        490.23    226.07        292.72    172.77      132.74    119.71
7680                  704.32    469.40        533.57    243.69        312.74    191.50      147.63    128.26
Old 2017-11-08, 00:33   #104
ewmayer
2ω=0
Sep 2002
República de California
2²·2,939 Posts

Quote:
Originally Posted by ET_
How curious... just 3 seconds of difference in the scalar code up to 1920 K FFT, and then a huge and constant increment in timings.
To that point, it is also useful to compare the runtimes at the largest and smallest FFT lengths in the table for the various HW/SW columns. Ideal scaling would be the FFT's n log(n) arithmetic-opcount scaling, in terms of which the ratio of n log(n) for 7680K to that at 1024K is 8.6. If the observed timing ratio for a given build mode (t2/t1 in my labeling below) exceeds that it indicates suboptimal cache behavior. Here is the corresponding added row to my table:
Code:
		ARMv8		Core2Duo
FFT(K)	Scalar	SIMD		Scalar	SIMD
-----	------	------		------	------
1024	 61.37	 44.27		 58.10	 26.39
...
7680	704.32	469.40		533.57	243.69
-----	------	------		------	------
t2/t1:	 11.48	 10.60		  9.18	  9.23
Likely the Odroid scaling can be improved by further fiddling with prefetch strategies, but the SIMD build - which has some prefetch in the key largest-data-stride phase of the FFT - erases much of the deterioration vs Core2.
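The 8.6 figure and the t2/t1 row can be reproduced with a few lines of Python. This is only a sketch: it assumes "K" means 1024 FFT elements, and the names `nlogn_ratio` and `timings` are illustrative. The base of the logarithm cancels in the ratio, so natural log is used.

```python
import math


def nlogn_ratio(large_k, small_k):
    """Ratio of n*log(n) at two FFT lengths given in K (assumed 1K = 1024)."""
    n_large = large_k * 1024
    n_small = small_k * 1024
    return (n_large * math.log(n_large)) / (n_small * math.log(n_small))


# (t1 at 1024K, t2 at 7680K) in msec/iteration, copied from the table above.
timings = {
    "ARMv8 scalar": (61.37, 704.32),
    "ARMv8 SIMD": (44.27, 469.40),
    "Core2Duo scalar": (58.10, 533.57),
    "Core2Duo SIMD": (26.39, 243.69),
}

ideal = nlogn_ratio(7680, 1024)
print(f"ideal n log(n) ratio: {ideal:.2f}")

for name, (t1, t2) in timings.items():
    observed = t2 / t1
    # A value above 1.0 in the last column means worse-than-ideal scaling.
    print(f"{name:16s} t2/t1 = {observed:5.2f}  observed/ideal = {observed / ideal:.2f}")
```

Dividing each observed ratio by the ideal one gives a single number per build for how far the timings drift from pure opcount scaling.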

Victor, thanks for the additional timings - so the ARM v8 gives a nice boost vs the v7 even for the scalar-double builds.

On the hardware side, I see two issues as being of special interest:

o Are there higher-end implementations than the A53 in my Odroid which provide appreciably better timings? If so, how does the performance stack up on a per-watt and per-hardware-cost basis?

o How difficult/expensive is it to stack a bunch of Odroid-style minis into an array, perhaps run off a single DC source with multiple power-output taps? Based on the SIMD timings I get, I'd need ~20 Odroid C2s (half that many if we can find an octocore option based on the ARMv8) to match the total throughput of my Haswell quad. Physically one can stack that many into a much smaller volume than that of an ATX-case, but when one compares vs the multi-mobo and 1-mobo-plus-several-GPUs solutions commonly seen around this forum, the comparison becomes more fair. And again, how would such an array-solution stack up (pardon the pun) in per-watt and per-hardware-cost terms? (The hardkernel page has a volume discounts tab, but we'd need a much bigger price cut for, say, a 20-board option to compare close to a commodity Intel quad in terms of cost.)

[Of course there is also the botnet-based-on-ARM-cores-embedded-into-various-e-devices line of thinking, but I'm not sure I want to encourage that. If it's your own e-gizmos being so used, fine, but "My grandma's cable box slowed to uselessness because some hacker was running *your* code on it", that kind of publicity I don't need.]

Last fiddled with by ewmayer on 2017-11-08 at 00:58
Old 2017-11-08, 10:29   #105
ET_
Banned
"Luigi"
Aug 2002
Team Italia
5×7×139 Posts

Quote:
Originally Posted by ewmayer
o How difficult/expensive is it to stack a bunch of Odroid-style minis into an array, perhaps run off a single DC source with multiple power-output taps? Based on the SIMD timings I get, I'd need ~20 Odroid C2s (half that many if we can find an octocore option based on the ARMv8) to match the total throughput of my Haswell quad. Physically one can stack that many into a much smaller volume than that of an ATX-case, but when one compares vs the multi-mobo and 1-mobo-plus-several-GPUs solutions commonly seen around this forum, the comparison becomes more fair. And again, how would such an array-solution stack up (pardon the pun) in per-watt and per-hardware-cost terms? (The hardkernel page has a volume discounts tab, but we'd need a much bigger price cut for, say, a 20-board option to compare close to a commodity Intel quad in terms of cost.)
You could use OpenStack or Mesos/DCOS with a cluster of Odroids, like this:
https://www.picocluster.com/collections/pico-20
Old 2017-11-08, 12:32   #106
henryzz
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
6141₁₀ Posts

The A9 wasn't the last ARMv7 version. The A15 had a good amount of performance improvement.

The A9 is also more comparable to the higher power A57/72/73/75 rather than the low power A53/A55
There is more performance to be gained from ARM chips than those tried. I don't know about performance per watt though. Whatever their power usage they would smoke Core 2s.
Old 2017-11-08, 14:02   #107
VictordeHolland
 
VictordeHolland's Avatar
 
"Victor de Hollander"
Aug 2011
the Netherlands

32·131 Posts
Default

If I were to build a new system now, it would probably look something like this:

- quad-core CPU (i3-8100 or i5-7400 or i5-6500) 3.5GHz-ish
- 8GB DDR4-2800 or higher
- microATX motherboard
- 80+Gold PSU
- USB stick for booting

That would be something like €120-180 for the CPU, €100 for the memory, €50-100 for the motherboard and €80-100 for the PSU. Let's round it off to €500 (in a Dutch webstore, including tax, shipping, etc.). That system would be capable of producing 160 iters/sec using 4096K FFT with mprime (see George's dream build thread) using 70-80W. You could even stack 4-7 mobos in a case, using a single PSU (so that would drop the costs even further).

The Odroid-C2 does only ~5 iters/sec using 4096K FFT, so you would need 32!!! Odroid-C2s to match the performance of a new quadcore.

If I order the Odroid parts from the German store (my German is not so good, but ordering should not be too difficult):

Odroid-C2
https://www.pollin.de/p/odroid-c2-ei...-4x-usb-810491
€58.95 (but not actually in stock)

Power supply 5V 2A
https://www.pollin.de/p/steckernetzt...-0-8-mm-351536
€4.95

Or to power 5-6 boards, something like this (6x USB, 5V, max 10A)
https://www.pollin.de/p/usb-lader-go...ax-10-a-351898
€22.95

32 x €60 = €1920 upfront cost vs. €500

Each Odroid-C2 must also consume NOT more than 2.5W in order to be as energy efficient as the quad-core Intel. I guess they ship them with a 5V 2A power supply for a reason....

If I didn't make a mistake somewhere, the economics seem to be against the ARM board at current prices/tech/code (at least for LL). But that could change in the future of course, just as GPUs have changed the way we do TF.
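The arithmetic behind that comparison, as a short Python sketch. All inputs are the round figures quoted in this post (160 iters/sec and 70-80W for the Intel build, ~5 iters/sec and ~€60 per Odroid-C2); the variable names are made up for illustration, and the last two lines redo the board count using the 222.01 msec/iteration SIMD timing at 4096K from ewmayer's table instead of the rounded ~5 iters/sec.

```python
import math

# Round figures quoted in the post (not measurements made here).
intel_iters_per_s = 160.0        # quad-core Intel, 4096K FFT, mprime
intel_cost_eur = 500.0
intel_watts_range = (70.0, 80.0)

odroid_iters_per_s = 5.0         # Odroid-C2, 4096K FFT, rounded
odroid_cost_eur = 60.0           # board price rounded up from 58.95

# Number of boards needed to match the Intel box, and what they cost.
boards = math.ceil(intel_iters_per_s / odroid_iters_per_s)
cluster_cost_eur = boards * odroid_cost_eur

# Power each board may draw for the cluster to break even on perf/watt.
watts_per_board = tuple(w / boards for w in intel_watts_range)

print(f"boards needed: {boards}")
print(f"cluster cost:  EUR {cluster_cost_eur:.0f} vs EUR {intel_cost_eur:.0f}")
print(f"power budget:  {watts_per_board[0]:.2f} to {watts_per_board[1]:.2f} W per board")

# Same count using the tabulated 222.01 msec/iteration at 4096K (SIMD build).
odroid_from_table = 1000.0 / 222.01
boards_from_table = math.ceil(intel_iters_per_s / odroid_from_table)
print(f"boards needed using the table timing: {boards_from_table}")
```

Power supplies, cabling and networking for the boards are left out, so the cluster figure is a lower bound on the upfront cost.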

By the way, hardkernel now has a pre-built cluster in their store with 4 boards and a fan (Odroid-MC1):
http://www.hardkernel.com/main/produ...=G150152508314
Based on the Samsung Exynos5422 (4x Cortex-A15 @2Ghz + 4x Cortex-A7) CPUs.

The Cortex-A15 and A7 are ARMv7, right?
Anybody got numbers on those?
Old 2017-11-08, 14:32   #108
henryzz
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
3×23×89 Posts

The A7 and A15 are both ARMv7. The A7 was used in the Raspberry Pi 2.
I would imagine that the A15 will be fairly similar to the A53 per MHz, although it will use more power.
Old 2017-11-08, 14:42   #109
ldesnogu
Jan 2008
France
3·199 Posts

Quote:
Originally Posted by henryzz View Post
The A7 and A15 are both ARMv7. The A7 was used in the Raspberry Pi 2.
I would imagine that the A15 will be fairly similar to the A53 per mhz although it will use more power.
Note that the A15, and all ARMv7 CPUs, won't be able to use the SIMD work Ernst has done, since 32-bit mode lacks support for 64-bit FP SIMD instructions.
Old 2017-11-08, 17:17   #110
M344587487
"Composite as Heck"
Oct 2017
950₁₀ Posts

Quote:
Originally Posted by VictordeHolland
...
Each Odroid-C2 must also consume NOT more than 2.5W in order to be as energy efficient as the quad-core Intel. I guess they ship them with a 5V 2A power supply for a reason....
...
5V 2A is pretty standard; many devices like this are powered by USB, which is probably where that came from. It's also way more than the processor needs; the headroom is to power everything, including what you plug into the USB ports. Judging from this link, power usage of a C2 under CPU load is around 650 mA: https://www.jeffgeerling.com/blog/20...orange-pi-plus

But that's without tweaking settings for power efficiency; that link also shows 350 mA at idle. If it's similar to the Raspberry Pi 3, the USB controller uses 240 mA just existing: https://github.com/superjamie/lazywe...berry-Pi-Power

To get maximal power efficiency you'd have to disable everything, including networking (not just to save power at the device, a switch isn't free either). It might be possible to do this in software, enabling when needed, but I don't know where this leaves the network switch on power consumption. You also have to consider the power source, a single transformer powering 32 devices is probably more efficient than 1 per device (?). This link pegs a random sample of USB AC chargers at 63% to 80% efficient, which ranges from not great to OK compared to PC PSUs: http://www.righto.com/2012/10/a-doze...-apple-is.html
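Putting rough numbers on the above, as a sketch only: it assumes the quoted currents are drawn at the nominal 5V input, takes the 2.5W per-board break-even figure from the quoted post, and uses the 63% to 80% charger efficiencies from the last link. The function names are made up for illustration.

```python
SUPPLY_VOLTS = 5.0        # assumption: currents are measured at the 5V input
BUDGET_WATTS = 2.5        # per-board break-even figure from the quoted post


def watts(milliamps, volts=SUPPLY_VOLTS):
    """Power in watts drawn at the board for a given current in mA."""
    return volts * milliamps / 1000.0


def wall_watts(device_watts, efficiency):
    """Power drawn from the wall for a given DC load and supply efficiency."""
    return device_watts / efficiency


load_w = watts(650)       # C2 under CPU load, per the first link
idle_w = watts(350)       # C2 at idle, per the first link
usb_w = watts(240)        # USB controller, Raspberry Pi 3 figure

print(f"load: {load_w:.2f} W, idle: {idle_w:.2f} W, USB controller: {usb_w:.2f} W")
print(f"load exceeds the {BUDGET_WATTS} W budget by {load_w - BUDGET_WATTS:.2f} W")

for efficiency in (0.63, 0.80):
    print(f"at {efficiency:.0%} supply efficiency: "
          f"{wall_watts(load_w, efficiency):.2f} W at the wall under load")
```

At stock settings the load figure lands above the 2.5W budget before supply losses are even counted, which is why the tweaking discussed here matters.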

You can get PSUs specifically for 5V at high amps, but they look specialist and hence expensive, and you'd still need to DIY the connectors. I don't know how to best DIY your own solution on the cheap. You could use an efficient ATX PSU, but you'd have to downstep from 12V as the 5V rail typically doesn't supply enough current (which can be done at around 90% efficiency AFAIK). Sounds like a Frankenstein job if I tried it ;)

tl;dr I think you can make a cluster that is comparable in perf/watt (but you have to work for it), but not perf/cost relative to a typical quad core.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.