|
|
#100 |
|
∂²ω=0
Sep 2002
República de California
2²·2,939 Posts |
Getting back to the issue of heat and throttling, last night I noticed this blurb on the Odroid C2 page over at hardkernel.com: "Remove the Jumper on J1, if you don't use the USB OTG port as a power input. It will reduce the power consumption and heat significantly."
The tiny-fonted glossy-paper info insert that shipped with the C2 has what appears intended to be an analogous notice, but the language, which refers to the same-pictured jumper as J8, is the much-less-clear "OTG Port Enable Jumper: If you remove the J8 jumper, the power path from the micro-USB port is disabled for stable access to the USB device mode (Gadget driver or ADB/Fastboot interface)", which leaves it far from clear whether removing the jumper is desirable, hence I left it in place when I first set up the Odroid.
Anyhoo, having read the clarifying note about reduced power consumption at the manufacturer's website, I hand-drilled a small hole in one side of the upper clamshell of the clear plastic housing I bought as an accessory, allowing me to insert a thermal probe in between the central fins of the fanless heatsink, and did some experiments, all using 10000 LL-test iterations at FFT length 3840K, using the just-completed vector-arithmetic code running in parallel mode on 4 processor cores:
1. Jumper in place, top of housing in place: T started ~40C and stabilized at 75C, runtime = 2049s.
2. Jumper removed, top of housing in place: T stabilized at 68C, runtime = 2088s. So using the temperature as a proxy, the power consumption is indeed significantly reduced, with the precise wattage difference unclear (I don't have a watt-or-ammeter). The runtime, however, is actually 2% *greater*, which is bizarre.
3. Jumper removed, top of housing removed (but no discernible room air movement to further aid cooling): T stabilized at 57C, runtime = 2065s, only 1% lower than [2].
4. As in [3] but with the Odroid set atop the upper case fan of my Haswell quad: T stabilized at 40C, runtime = 2069s, immaterially different from [3].
So much for the "throttling is an issue" hypothesis.
So here is a table of best-FFT-params timings at the FFT lengths covered by the 'medium' self-tests (via './Mlucas -s m'), with analogous timings from my Core2 macbook for comparison.
'Scalar' means the default C-code build using doubles, no SIMD, no assembly. 'SIMD' means a build with vector-arithmetic assembly enabled; 'gain' is the speedup at each FFT length obtained from use of SIMD. All timings were generated via 4-threaded runs on ARMv8, and 2-threaded runs on Core2Duo. On both processors SIMD is 128-bit. All timings in msec/iteration:
Code:
        ARMv8                    Core2Duo
FFT(K)  Scalar    SIMD   gain    Scalar    SIMD   gain
------  ------  ------  -----    ------  ------  -----
  1024   61.37   44.27  1.39x     58.10   26.39  2.20x
  1152   74.67   47.58  1.57x     72.72   31.71  2.29x
  1280   88.43   52.05  1.70x     85.30   34.63  2.46x
  1408   96.07   60.55  1.59x     92.33   40.95  2.25x
  1536  110.80   66.62  1.66x    106.41   42.40  2.51x
  1664  114.43   74.46  1.54x    111.07   47.24  2.35x
  1792  121.25   80.90  1.50x    118.43   50.58  2.34x
  1920  134.09   88.92  1.51x    131.35   54.25  2.42x
  2048  133.86   94.31  1.42x    119.78   57.07  2.10x
  2304  158.86  110.96  1.43x    140.42   67.93  2.07x
  2560  204.72  130.47  1.57x    168.86   72.74  2.32x
  2816  223.44  147.72  1.51x    183.47   84.00  2.18x
  3072  241.55  158.51  1.52x    212.86   88.95  2.39x
  3328  259.77  174.36  1.49x    230.13   99.93  2.30x
  3584  267.83  189.83  1.41x    239.19  105.10  2.28x
  3840  304.38  206.49  1.47x    259.15  114.34  2.27x
  4096  306.74  222.01  1.38x    265.52  123.71  2.15x
  4608  361.18  260.47  1.39x    303.17  143.00  2.12x
  5120  457.37  290.94  1.57x    360.84  156.46  2.31x
  5632  504.79  331.82  1.52x    387.65  180.17  2.15x
  6144  554.07  364.37  1.52x    463.53  195.02  2.38x
  6656  598.31  389.86  1.53x    470.08  209.32  2.25x
  7168  625.96  425.13  1.47x    490.23  226.07  2.17x
  7680  704.32  469.40  1.50x    533.57  243.69  2.19x
Now back to the final code/build issues needing to be taken care of - I hope to have a release tarball ready by Friday. |
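For anyone who wants to recompute the 'gain' column of the table above, here is a minimal Python sketch. It assumes only that gain is the scalar time divided by the SIMD time, as the post describes; the function name `simd_gain` and the choice of four sample rows are illustrative, not part of Mlucas.

```python
# Sample rows copied from the timing table: (FFT length in K, scalar ms/iter, SIMD ms/iter).
ARMV8 = [(1024, 61.37, 44.27), (2048, 133.86, 94.31),
         (4096, 306.74, 222.01), (7680, 704.32, 469.40)]
CORE2 = [(1024, 58.10, 26.39), (2048, 119.78, 57.07),
         (4096, 265.52, 123.71), (7680, 533.57, 243.69)]

def simd_gain(scalar_ms, simd_ms):
    """Speedup from SIMD: scalar per-iteration time divided by SIMD per-iteration time."""
    return scalar_ms / simd_ms

for name, rows in (("ARMv8", ARMV8), ("Core2Duo", CORE2)):
    for fft_k, scalar_ms, simd_ms in rows:
        print(f"{name:9s} {fft_k:5d}K  gain = {simd_gain(scalar_ms, simd_ms):.2f}x")
```

Extending the two lists with the remaining rows gives the full column for either processor.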
|
|
|
|
|
#101 | |
|
Banned
"Luigi"
Aug 2002
Team Italia
5×7×139 Posts |
Quote:
|
|
|
|
|
|
|
#102 | |
|
"Victor de Hollander"
Aug 2011
the Netherlands
10010011011₂ Posts |
Quote:
Still, I'm also surprised these 4 little Cortex-A53 cores @1.5GHz can almost keep up with 2 cores of a C2D (albeit in scalar mode). When I did some testing of my Odroid-U2 (4x Cortex-A9 @1.7GHz) some time ago it got reckt by a C2D, but that was probably scalar vs. SIMD, which would explain the large discrepancy. BTW, nice work ewmayer!
|
|
|
|
|
|
|
#103 |
|
"Victor de Hollander"
Aug 2011
the Netherlands
10010011011₂ Posts |
I added the timings from my ODROID-U2 and C2D E7400, which I tested earlier this year (http://mersenneforum.org/showpost.ph...2&postcount=44), to ewmayer's table: Code:
        ARMv7            ARMv8             Core2Duo          Core2Duo (Mlucas)   Core2Duo (Mprime)
        (4x A9 @1.7GHz)  (4x A53 @1.5GHz)  (ewmayer)         E7400 @2.8GHz       E7400 @2.8GHz
FFT(K)  Scalar?           Scalar     SIMD   Scalar     SIMD  SIMD(1C)  SIMD(2C)  SIMD(1C)  SIMD(2C)
------  -------          -------  -------  -------  -------  --------  --------  --------  --------
  1024   121.70            61.37    44.27    58.10    26.39     33.76     20.01     16.70     15.64
  1152   142.69            74.67    47.58    72.72    31.71     40.30     25.43
  1280   161.44            88.43    52.05    85.30    34.63     45.42     28.85     21.58     19.03
  1408   185.52            96.07    60.55    92.33    40.95     52.31     35.14
  1536   195.56           110.80    66.62   106.41    42.40     53.31     33.98     27.72     23.00
  1664   208.36           114.43    74.46   111.07    47.24     61.81     38.98
  1792   222.32           121.25    80.90   118.43    50.58     65.81     40.84
  1920   243.65           134.09    88.92   131.35    54.25     70.98     45.63
  2048   255.25           133.86    94.31   119.78    57.07     71.98     45.92     32.14     31.35
  2304   297.26           158.86   110.96   140.42    67.93     81.60     54.36
  2560   339.70           204.72   130.47   168.86    72.74     90.96     54.64     46.91     39.69
  2816   384.56           223.44   147.72   183.47    84.00    102.69     63.06
  3072   413.85           241.55   158.51   212.86    88.95    112.85     67.77
  3328                    259.77   174.36   230.13    99.93    123.71     74.36
  3584   370.28           267.83   189.83   239.19   105.10    135.08     79.71     83.96     85.87
  3840                    304.38   206.49   259.15   114.34    135.08     87.04
  4096   455.10           306.74   222.01   265.52   123.71    154.69     92.87     72.50     66.71
  4608                    361.18   260.47   303.17   143.00    177.26    106.31
  5120                    457.37   290.94   360.84   156.46    201.17    116.95     88.87     80.26
  5632                    504.79   331.82   387.65   180.17    224.76    147.80
  6144                    554.07   364.37   463.53   195.02    244.47    150.32    111.36     94.00
  6656                    598.31   389.86   470.08   209.32    271.08    164.51
  7168                    625.96   425.13   490.23   226.07    292.72    172.77    132.74    119.71
  7680                    704.32   469.40   533.57   243.69    312.74    191.50    147.63    128.26
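To put a number on the ARMv7 vs ARMv8 difference in the combined table, here is a small Python sketch that computes the per-row speedup of the ARMv8 scalar build over the ARMv7 timings. It is only a rough comparison: the ARMv7 column is marked "Scalar?" in the table, the two boards run at different clocks (1.7GHz vs 1.5GHz), and the helper name `speedup` and the use of `None` for empty cells are choices made here for illustration.

```python
# (FFT length in K, ARMv7 ms/iter, ARMv8 scalar ms/iter), copied from the table above.
# None marks a row where the table has no ARMv7 timing.
ROWS = [
    (1024, 121.70, 61.37),
    (2048, 255.25, 133.86),
    (3328, None, 259.77),
    (4096, 455.10, 306.74),
]

def speedup(old_ms, new_ms):
    """Return the old/new time ratio, or None if either timing is missing."""
    if old_ms is None or new_ms is None:
        return None
    return old_ms / new_ms

for fft_k, armv7_ms, armv8_ms in ROWS:
    ratio = speedup(armv7_ms, armv8_ms)
    shown = "n/a" if ratio is None else f"{ratio:.2f}x"
    print(f"{fft_k:5d}K  ARMv8 scalar vs ARMv7: {shown}")
```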
|
|
|
|
|
|
#104 | |
|
∂²ω=0
Sep 2002
República de California
2²·2,939 Posts |
Quote:
Code:
        ARMv8             Core2Duo
FFT(K)  Scalar    SIMD    Scalar    SIMD
------  ------  ------    ------  ------
  1024   61.37   44.27     58.10   26.39
   ...
  7680  704.32  469.40    533.57  243.69
------  ------  ------    ------  ------
t2/t1:   11.48   10.60      9.18    9.23
Victor, thanks for the additional timings - so the ARM v8 gives a nice boost vs the v7 even for the scalar-double builds. On the hardware side, I see two issues as being of special interest:
o Are there higher-end implementations than the A53 in my Odroid which provide appreciably better timings? If so, how does the performance stack up on a per-watt and per-hardware-cost basis?
o How difficult/expensive is it to stack a bunch of Odroid-style minis into an array, perhaps run off a single DC source with multiple power-output taps? Based on the SIMD timings I get, I'd need ~20 Odroid C2s (half that many if we can find an octocore option based on the ARMv8) to match the total throughput of my Haswell quad. Physically one can stack that many into a much smaller volume than that of an ATX-case, but when one compares vs the multi-mobo and 1-mobo-plus-several-GPUs solutions commonly seen around this forum, the comparison becomes more fair. And again, how would such an array-solution stack up (pardon the pun) in per-watt and per-hardware-cost terms? (The hardkernel page has a volume discounts tab, but we'd need a much bigger price cut for, say, a 20-board option to compare close to a commodity Intel quad in terms of cost.)
[Of course there is also the botnet-based-on-ARM-cores-embedded-into-various-e-devices line of thinking, but I'm not sure I want to encourage that. If it's your own e-gizmos being so used, fine, but "My grandma's cable box slowed to uselessness because some hacker was running *your* code on it", that kind of publicity I don't need.]
Last fiddled with by ewmayer on 2017-11-08 at 00:58 |
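The t2/t1 row above is the 7680K time divided by the 1024K time for each build. As a rough point of reference, the Python sketch below recomputes those ratios and prints them next to a textbook N·log2(N) scaling estimate for the same two FFT lengths. The N·log N yardstick is an assumption added here for comparison, not something claimed in the post, and `nlogn_ratio` is a hypothetical helper name.

```python
import math

# ms/iter at the smallest (1024K) and largest (7680K) FFT lengths, from the table above.
T_1024K = {"ARMv8 scalar": 61.37, "ARMv8 SIMD": 44.27,
           "Core2 scalar": 58.10, "Core2 SIMD": 26.39}
T_7680K = {"ARMv8 scalar": 704.32, "ARMv8 SIMD": 469.40,
           "Core2 scalar": 533.57, "Core2 SIMD": 243.69}

def nlogn_ratio(n1, n2):
    """Ratio of n2*log2(n2) to n1*log2(n1): idealized FFT cost scaling (an assumption)."""
    return (n2 * math.log2(n2)) / (n1 * math.log2(n1))

# FFT lengths are quoted in K (units of 1024 doubles).
ideal = nlogn_ratio(1024 * 1024, 7680 * 1024)

for build in T_1024K:
    measured = T_7680K[build] / T_1024K[build]
    print(f"{build:13s} t2/t1 = {measured:5.2f}  (idealized N log N: {ideal:.2f})")
```

The FFT length itself grows by a factor of 7.5 between the two rows, so any measured ratio above that reflects more than linear growth in per-iteration cost.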
|
|
|
|
|
|
#105 | |
|
Banned
"Luigi"
Aug 2002
Team Italia
5×7×139 Posts |
Quote:
https://www.picocluster.com/collections/pico-20 |
|
|
|
|
|
|
#106 |
|
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
6141₁₀ Posts |
The A9 wasn't the last ARMv7 version. The A15 had a good amount of performance improvement.
The A9 is also more comparable to the higher-power A57/72/73/75 than to the low-power A53/A55. There is more performance to be had from ARM chips than from those tried so far. I don't know about performance per watt though. Whatever their power usage they would smoke Core 2s. |
|
|
|
|
|
#107 |
|
"Victor de Hollander"
Aug 2011
the Netherlands
3²·131 Posts |
If I were to build a new system now it would probably look something like this:
- quad-core CPU (i3-8100 or i5-7400 or i5-6500) 3.5GHz-ish
- 8GB DDR4-2800 or higher
- microATX motherboard
- 80+Gold PSU
- USB stick for booting
That would be something like €120-180 for the CPU, €100 for the memory, €50-100 for the motherboard and €80-100 for the PSU. Let's round it off to €500 (in a Dutch webstore, including tax, shipping, etc.). That system would be capable of producing 160 iters/sec using 4096K FFT with mprime (see George's dream build thread) using 70-80W. You could even stack 4-7 mobos in a case, using a single PSU (so that would drop the costs even further).
The Odroid-C2 does only ~5 iters/sec using 4096K FFT, so you would need 32!!! Odroid-C2s to match the performance of a new quadcore.
If I order the Odroid parts in the German store (my German is not so good, but ordering should not be too difficult):
Odroid-C2 https://www.pollin.de/p/odroid-c2-ei...-4x-usb-810491 €58.95 (but not actually in stock)
Power supply 5V 2A https://www.pollin.de/p/steckernetzt...-0-8-mm-351536 €4.95
Or to power 5-6 boards, something like this (6x USB, 5V, max 10A) https://www.pollin.de/p/usb-lader-go...ax-10-a-351898 €22.95
32 x €60 = €1920 upfront cost vs. €500.
Each Odroid-C2 must also consume NOT more than 2.5W in order to be as energy efficient as the quad core Intel. I guess they ship them with a 5V 2A power supply for a reason....
If I didn't make a mistake somewhere, the economics seem to be against the ARM board with the current prices/tech/code (at least for LL). But that could change in the future of course, just as GPUs have changed the way we do TF.
By the way, hardkernel now has a pre-built cluster in their store with 4 boards and a fan (Odroid-MC1): http://www.hardkernel.com/main/produ...=G150152508314 It is based on the Samsung Exynos5422 (4x Cortex-A15 @2GHz + 4x Cortex-A7) CPUs. The Cortex-A15 and A7 are ARMv7, right? Anybody got numbers on those? |
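The break-even arithmetic in this post can be written out as a short Python sketch. All inputs are the round figures quoted above (160 iters/sec, 80W and €500 for the quad core; ~5 iters/sec and ~€60 per Odroid-C2 with power supply); the constant and function names are made up for illustration.

```python
import math

QUAD_ITERS_PER_SEC = 160.0  # quoted for a new quad core at 4096K FFT with mprime
QUAD_WATTS = 80.0           # upper end of the quoted 70-80W
QUAD_COST_EUR = 500.0       # rounded system price from the post
C2_ITERS_PER_SEC = 5.0      # "~5 iters/sec" quoted for the Odroid-C2
C2_COST_EUR = 60.0          # board plus power supply, rounded as in the post

def boards_needed(target_rate, board_rate):
    """Smallest whole number of boards whose combined rate reaches target_rate."""
    return math.ceil(target_rate / board_rate)

boards = boards_needed(QUAD_ITERS_PER_SEC, C2_ITERS_PER_SEC)
total_cost = boards * C2_COST_EUR
watt_budget = QUAD_WATTS / boards  # per-board power for equal energy efficiency

print(f"boards needed:     {boards}")
print(f"upfront cost:      EUR {total_cost:.0f} vs EUR {QUAD_COST_EUR:.0f}")
print(f"per-board budget:  {watt_budget:.1f} W")
```

Plugging in the 222.01 msec/iter SIMD timing at 4096K from the earlier table (about 4.5 iters/sec) instead of the rounded 5 iters/sec, `boards_needed(160.0, 1000.0 / 222.01)` gives 36 boards, so the rounded estimate is slightly generous to the Odroid.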
|
|
|
|
|
#108 |
|
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
3×23×89 Posts |
The A7 and A15 are both ARMv7. The A7 was used in the Raspberry Pi 2.
I would imagine that the A15 will be fairly similar to the A53 per MHz, although it will use more power. |
|
|
|
|
|
#109 |
|
Jan 2008
France
3·199 Posts |
Note that the A15, like all ARMv7 CPUs, won't be able to use the SIMD work Ernst has done, since 32-bit mode lacks support for 64-bit FP SIMD instructions.
|
|
|
|
|
|
#110 | |
|
"Composite as Heck"
Oct 2017
950₁₀ Posts |
Quote:
But that's without tweaking settings for power efficiency; that link also shows 350 mA at idle. If it's similar to the Raspberry Pi 3, the USB controller uses 240 mA just existing: https://github.com/superjamie/lazywe...berry-Pi-Power
To get maximal power efficiency you'd have to disable everything, including networking (not just to save power at the device; a switch isn't free either). It might be possible to do this in software, enabling it when needed, but I don't know where this leaves the network switch on power consumption.
You also have to consider the power source: a single transformer powering 32 devices is probably more efficient than 1 per device (?). This link pegs a random sample of USB AC chargers at 63% to 80% efficient, which ranges from not great to OK compared to PC PSUs: http://www.righto.com/2012/10/a-doze...-apple-is.html
You can get PSUs specifically for 5V at high amps, but they look specialist and hence expensive, and you'd still need to DIY the connectors. I don't know how best to DIY your own solution on the cheap. You could use an efficient ATX PSU, but you'd have to step down from 12V as the 5V rail typically doesn't supply enough current (which can be done at around 90% efficiency AFAIK). Sounds like a Frankenstein job if I tried it ;)
tl;dr I think you can make a cluster that is comparable in perf/watt (but you have to work for it), but not in perf/cost relative to a typical quad core. |
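To make the current and efficiency figures in this post easier to compare, here is a minimal Python sketch that converts the quoted currents to watts and estimates wall-socket draw for a given conversion efficiency. It assumes a 5V supply (the boards discussed ship with 5V adapters) and uses an 80W cluster load, i.e. 32 boards at the 2.5W break-even budget from earlier in the thread, purely as an example. Treating the ~90% step-down figure as the only loss in that case is a simplification, and the function names are hypothetical.

```python
SUPPLY_VOLTS = 5.0  # assumed supply voltage for these boards

def milliamps_to_watts(milliamps, volts=SUPPLY_VOLTS):
    """Electrical power in watts for a given current draw in mA."""
    return milliamps / 1000.0 * volts

def wall_watts(load_watts, efficiency):
    """Power drawn from the wall for a given load and conversion efficiency (0-1)."""
    return load_watts / efficiency

print(f"idle draw (350 mA):      {milliamps_to_watts(350):.2f} W")
print(f"USB controller (240 mA): {milliamps_to_watts(240):.2f} W")

CLUSTER_LOAD_WATTS = 80.0  # example: 32 boards at 2.5 W each
for label, eff in (("USB charger, 63%", 0.63),
                   ("USB charger, 80%", 0.80),
                   ("12V->5V step-down, ~90%", 0.90)):
    print(f"{label:24s} -> {wall_watts(CLUSTER_LOAD_WATTS, eff):6.1f} W at the wall")
```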
|
|
|
|