NanoPi NEO4 RK3399
Paper specs:
[quote]Model: Rockchip RK3399; Number of Cores: big.LITTLE, 64-bit Dual Core Cortex-A72 + Quad Core Cortex-A53; Frequency: Cortex-A72(up to 2.0GHz), Cortex-A53(up to 1.5GHz); 1GB DDR3-1866[/quote]
Using rk3399-sd-friendlycore-bionic-4.4-arm64-20181219.img (headless Ubuntu 18.04). Mlucas compiled on it but segfaulted at runtime, so I used the c2 build you gave the N1 tester. Core frequencies under load:
[code]pi@NanoPi-NEO4:/sys/devices/system/cpu$ sudo cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq
1416000
1416000
1416000
1416000
1800000
1800000[/code]
Solo benchmarks were not very accurate or interesting, so I've omitted them. Simultaneous benchmarks at key FFT lengths:

1024K:
[code]
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat big/p20000047.stat
INFO: no restart file found...starting run from scratch.
M20000047: using FFT length 1024K = 1048576 8-byte floats.
this gives an average 19.073531150817871 bits per digit
Using complex FFT radices 16 8 16 16 16
[Jan 14 15:07:24] M20000047 Iter# = 10000 [ 0.05% complete] clocks = 00:09:50.612 [ 0.0591 sec/iter] Res64: 9A2AF744DE060296. AvgMaxErr = 0.220881351. MaxErr = 0.312500000.
[Jan 14 15:17:14] M20000047 Iter# = 20000 [ 0.10% complete] clocks = 00:09:49.950 [ 0.0590 sec/iter] Res64: D99B4D255F5C0C74. AvgMaxErr = 0.221486410. MaxErr = 0.312500000.
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat little/p20000047.stat
INFO: no restart file found...starting run from scratch.
M20000047: using FFT length 1024K = 1048576 8-byte floats.
this gives an average 19.073531150817871 bits per digit
Using complex FFT radices 64 32 16 16
[Jan 14 15:09:47] M20000047 Iter# = 10000 [ 0.05% complete] clocks = 00:12:14.866 [ 0.0735 sec/iter] Res64: 9A2AF744DE060296. AvgMaxErr = 0.230923241. MaxErr = 0.312500000.
[Jan 14 15:22:02] M20000047 Iter# = 20000 [ 0.10% complete] clocks = 00:12:14.665 [ 0.0735 sec/iter] Res64: D99B4D255F5C0C74. AvgMaxErr = 0.231118560. MaxErr = 0.343750000.
[/code]
2560K:
[code]
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat big/p49005071.stat
INFO: no restart file found...starting run from scratch.
M49005071: using FFT length 2560K = 2621440 8-byte floats.
this gives an average 18.693951034545897 bits per digit
Using complex FFT radices 40 32 32 32
[Jan 14 15:52:39] M49005071 Iter# = 10000 [ 0.02% complete] clocks = 00:23:46.485 [ 0.1426 sec/iter] Res64: 8E7E56F23C735CF2. AvgMaxErr = 0.245389789. MaxErr = 0.375000000.
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat little/p49005071.stat
INFO: no restart file found...starting run from scratch.
M49005071: using FFT length 2560K = 2621440 8-byte floats.
this gives an average 18.693951034545897 bits per digit
Using complex FFT radices 160 8 8 8 16
M49005071 Roundoff warning on iteration 181, maxerr = 0.437500000000
M49005071 Roundoff warning on iteration 4240, maxerr = 0.437500000000
M49005071 Roundoff warning on iteration 8110, maxerr = 0.437500000000
[Jan 14 15:58:12] M49005071 Iter# = 10000 [ 0.02% complete] clocks = 00:29:18.200 [ 0.1758 sec/iter] Res64: 8E7E56F23C735CF2. AvgMaxErr = 0.282188452. MaxErr = 0.437500000.
[/code]
4608K:
[code]
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat big/p87068977.stat
INFO: no restart file found...starting run from scratch.
M87068977: using FFT length 4608K = 4718592 8-byte floats.
this gives an average 18.452321582370335 bits per digit
Using complex FFT radices 144 32 32 16
[Jan 14 16:48:14] M87068977 Iter# = 10000 [ 0.01% complete] clocks = 00:36:37.950 [ 0.2198 sec/iter] Res64: 13BB5C9DDF0CD3D6. AvgMaxErr = 0.256021777. MaxErr = 0.375000000.
[Jan 14 17:24:52] M87068977 Iter# = 20000 [ 0.02% complete] clocks = 00:36:37.133 [ 0.2197 sec/iter] Res64: C43069A17478EF46. AvgMaxErr = 0.256765161. MaxErr = 0.343750000.
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat little/p87068977.stat
INFO: no restart file found...starting run from scratch.
M87068977: using FFT length 4608K = 4718592 8-byte floats.
this gives an average 18.452321582370335 bits per digit
Using complex FFT radices 288 32 16 16
[Jan 14 17:04:38] M87068977 Iter# = 10000 [ 0.01% complete] clocks = 00:53:00.131 [ 0.3180 sec/iter] Res64: 13BB5C9DDF0CD3D6. AvgMaxErr = 0.249227525. MaxErr = 0.375000000.
[Jan 14 17:57:39] M87068977 Iter# = 20000 [ 0.02% complete] clocks = 00:52:59.939 [ 0.3180 sec/iter] Res64: C43069A17478EF46. AvgMaxErr = 0.249523122. MaxErr = 0.343750000.
[/code]
7680K:
[code]
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat big/p143472073.stat
INFO: no restart file found...starting run from scratch.
M143472073: using FFT length 7680K = 7864320 8-byte floats.
this gives an average 18.243417485555014 bits per digit
Using complex FFT radices 240 16 32 32
[Jan 14 19:06:56] M143472073 Iter# = 10000 [ 0.01% complete] clocks = 01:02:23.065 [ 0.3743 sec/iter] Res64: C7B182C990710B46. AvgMaxErr = 0.241344566. MaxErr = 0.343750000.
[Jan 14 20:09:15] M143472073 Iter# = 20000 [ 0.01% complete] clocks = 01:02:18.024 [ 0.3738 sec/iter] Res64: 181335759D5BB711. AvgMaxErr = 0.241826627. MaxErr = 0.375000000.
[Jan 14 21:11:38] M143472073 Iter# = 30000 [ 0.02% complete] clocks = 01:02:21.381 [ 0.3741 sec/iter] Res64: 126EDB1E9B6580C4. AvgMaxErr = 0.241919198. MaxErr = 0.343750000.
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat little/p143472073.stat
INFO: no restart file found...starting run from scratch.
M143472073: using FFT length 7680K = 7864320 8-byte floats.
this gives an average 18.243417485555014 bits per digit
Using complex FFT radices 240 32 32 16
[Jan 14 19:44:16] M143472073 Iter# = 10000 [ 0.01% complete] clocks = 01:39:44.262 [ 0.5984 sec/iter] Res64: C7B182C990710B46. AvgMaxErr = 0.235340244. MaxErr = 0.343750000.
[Jan 14 21:23:58] M143472073 Iter# = 20000 [ 0.01% complete] clocks = 01:39:39.004 [ 0.5979 sec/iter] Res64: 181335759D5BB711. AvgMaxErr = 0.236132050. MaxErr = 0.375000000.
[/code]
18432K:
[code]
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat big/p332220523.stat
INFO: no restart file found...starting run from scratch.
M332220523: using FFT length 18432K = 18874368 8-byte floats.
this gives an average 17.601676676008438 bits per digit
Using complex FFT radices 288 32 32 32
[Jan 15 00:29:12] M332220523 Iter# = 10000 [ 0.00% complete] clocks = 02:57:36.013 [ 1.0656 sec/iter] Res64: 1A313D709BFA6663. AvgMaxErr = 0.186972266. MaxErr = 0.250000000.
M332220523 Roundoff warning on iteration 11467, maxerr = 0.500000000000
Retrying iteration interval to see if roundoff error is reproducible.
Restarting M332220523 at iteration = 10000. Res64: 1A313D709BFA6663
M332220523: using FFT length 18432K = 18874368 8-byte floats.
this gives an average 17.601676676008438 bits per digit
Retry of iteration interval with fatal roundoff error was successful.
[Jan 15 03:52:50] M332220523 Iter# = 20000 [ 0.01% complete] clocks = 02:57:28.763 [ 1.0649 sec/iter] Res64: 73DC7A5C8B839081. AvgMaxErr = 0.187356934. MaxErr = 0.250000000.
[Jan 15 06:50:22] M332220523 Iter# = 30000 [ 0.01% complete] clocks = 02:57:28.523 [ 1.0649 sec/iter] Res64: B928CD22434EEC7C. AvgMaxErr = 0.187289062. MaxErr = 0.281250000.
[Jan 15 09:47:49] M332220523 Iter# = 40000 [ 0.01% complete] clocks = 02:57:24.003 [ 1.0644 sec/iter] Res64: 307ECB47139AEB31. AvgMaxErr = 0.187450000. MaxErr = 0.250000000.
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat little/p332220523.stat
INFO: no restart file found...starting run from scratch.
M332220523: using FFT length 18432K = 18874368 8-byte floats.
this gives an average 17.601676676008438 bits per digit
Using complex FFT radices 288 32 32 32
[Jan 15 03:04:59] M332220523 Iter# = 10000 [ 0.00% complete] clocks = 05:33:22.437 [ 2.0002 sec/iter] Res64: 1A313D709BFA6663. AvgMaxErr = 0.186969141. MaxErr = 0.250000000.
[Jan 15 08:38:04] M332220523 Iter# = 20000 [ 0.01% complete] clocks = 05:32:58.179 [ 1.9978 sec/iter] Res64: 73DC7A5C8B839081. AvgMaxErr = 0.187339746. MaxErr = 0.250000000.
[/code]
Giving combined synthetic timings of:
[code]
 1024K   32.73 ms/it
 2560K   78.73 ms/it
 4608K  129.93 ms/it
 7680K  230.12 ms/it
18432K  694.90 ms/it
[/code]
As we increase FFT length the A53 performs relatively worse than the A72. Running the A53 at a lower FFT length and the A72 at a higher one could be closer to optimal, so let's test:
[code]pi@NanoPi-NEO4:~/mlucas_v17.1$ cat big/p87068977.stat
INFO: no restart file found...starting run from scratch.
M87068977: using FFT length 4608K = 4718592 8-byte floats.
this gives an average 18.452321582370335 bits per digit
Using complex FFT radices 144 32 32 16
[Jan 15 11:18:17] M87068977 Iter# = 10000 [ 0.01% complete] clocks = 00:37:42.276 [ 0.2262 sec/iter] Res64: 13BB5C9DDF0CD3D6. AvgMaxErr = 0.256021777. MaxErr = 0.375000000.
[Jan 15 11:55:59] M87068977 Iter# = 20000 [ 0.02% complete] clocks = 00:37:41.853 [ 0.2262 sec/iter] Res64: C43069A17478EF46. AvgMaxErr = 0.256765161. MaxErr = 0.343750000.
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat little/p49005071.stat
INFO: no restart file found...starting run from scratch.
M49005071: using FFT length 2560K = 2621440 8-byte floats.
this gives an average 18.693951034545897 bits per digit
Using complex FFT radices 160 8 8 8 16
M49005071 Roundoff warning on iteration 181, maxerr = 0.437500000000
M49005071 Roundoff warning on iteration 4240, maxerr = 0.437500000000
M49005071 Roundoff warning on iteration 8110, maxerr = 0.437500000000
[Jan 15 11:07:39] M49005071 Iter# = 10000 [ 0.02% complete] clocks = 00:27:03.713 [ 0.1624 sec/iter] Res64: 8E7E56F23C735CF2. AvgMaxErr = 0.282188452. MaxErr = 0.437500000.
[Jan 15 11:34:45] M49005071 Iter# = 20000 [ 0.04% complete] clocks = 00:27:05.668 [ 0.1626 sec/iter] Res64: 6CD0428337CA1430. AvgMaxErr = 0.282933594. MaxErr = 0.406250000.
M49005071 Roundoff warning on iteration 20522, maxerr = 0.437500000000
M49005071 Roundoff warning on iteration 24876, maxerr = 0.437500000000
M49005071 Roundoff warning on iteration 25658, maxerr = 0.437500000000
[Jan 15 12:01:52] M49005071 Iter# = 30000 [ 0.06% complete] clocks = 00:27:05.741 [ 0.1626 sec/iter] Res64: 106C93EFA0800D81. AvgMaxErr = 0.282969043. MaxErr = 0.437500000.[/code]
It's interesting that the A53 sped up and the A72 slowed down relative to their performance when the other cluster was at the same FFT length. I expected the A72 to be slightly faster than its previous result, because the A53 should be working in its own cache more and using system memory less. Maybe, because the A53 is less starved and doing more work, it's using more power or more of some other shared resource. Reversing the allocation to see what would happen, A53 on 4608K and A72 on 2560K:
[code]pi@NanoPi-NEO4:~/mlucas_v17.1$ cat big/p49005071.stat
INFO: no restart file found...starting run from scratch.
M49005071: using FFT length 2560K = 2621440 8-byte floats.
this gives an average 18.693951034545897 bits per digit
Using complex FFT radices 40 32 32 32
[Jan 15 12:54:52] M49005071 Iter# = 10000 [ 0.02% complete] clocks = 00:23:16.719 [ 0.1397 sec/iter] Res64: 8E7E56F23C735CF2. AvgMaxErr = 0.245389789. MaxErr = 0.375000000.
[Jan 15 13:18:09] M49005071 Iter# = 20000 [ 0.04% complete] clocks = 00:23:16.597 [ 0.1397 sec/iter] Res64: 6CD0428337CA1430. AvgMaxErr = 0.246327235. MaxErr = 0.375000000.
[Jan 15 13:41:25] M49005071 Iter# = 30000 [ 0.06% complete] clocks = 00:23:16.016 [ 0.1396 sec/iter] Res64: 106C93EFA0800D81. AvgMaxErr = 0.246085634. MaxErr = 0.375000000.
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat little/p87068977.stat
INFO: no restart file found...starting run from scratch.
M87068977: using FFT length 4608K = 4718592 8-byte floats.
this gives an average 18.452321582370335 bits per digit
Using complex FFT radices 288 32 16 16
[Jan 15 13:28:13] M87068977 Iter# = 10000 [ 0.01% complete] clocks = 00:56:36.492 [ 0.3396 sec/iter] Res64: 13BB5C9DDF0CD3D6. AvgMaxErr = 0.249227525. MaxErr = 0.375000000.[/code]
Works the same when the roles are reversed. I don't know enough to analyse further, but in any case the difference is pretty small. For an easy life it's probably best to stick both clusters on DC work, and maybe switch the A72 to first-time PRP when it's implemented, just for fun.

This board is tiny (60x45mm); the SoC is on the underside and the heatsink covers the entire underside. It might not make sense from a power or hardware cost perspective to use these boards for GIMPS, but creating a DIY radiator to heat your house with a cluster of these is tempting. Using the 2200G and 8100 numbers from above, we need ~15.5 NEO4s to match a 2200G and ~21.5 to match an 8100. I don't have a wattmeter handy, but online benchmarks indicate power usage is probably around 11W per NEO4, which I think makes it a win for x86 by some margin. |
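A note on the "combined synthetic timings" in the post above: the post does not state how they were derived, but the figures are consistent with adding the two clusters' throughputs (iterations per second) and converting the sum back to a time per iteration. The sketch below works under that assumption; `combined_ms_per_iter` is a hypothetical helper name, not anything from Mlucas.

```python
def combined_ms_per_iter(*sec_per_iter):
    """Effective ms/iter when independent jobs run side by side.

    Throughputs (iter/sec) add, so the combined time per iteration is the
    reciprocal of the summed reciprocals. Any slowdown from sharing memory
    is assumed to be already baked into the measured timings.
    """
    total_iters_per_sec = sum(1.0 / t for t in sec_per_iter)
    return 1000.0 / total_iters_per_sec


# sec/iter taken from the simultaneous runs above: (A72 cluster, A53 cluster)
timings = {
    "1024K": (0.0590, 0.0735),
    "4608K": (0.2197, 0.3180),
}

for fft, (big, little) in timings.items():
    print(f"{fft:>6}  {combined_ms_per_iter(big, little):7.2f} ms/it")
```

For these two rows the result matches the posted figures to two decimals. Which of the slightly varying sec/iter readings went into the other rows is not stated, so small differences there are to be expected.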
Thanks for the data, M344587487! That is indeed a dramatic falling-off of the A53 throughput once you get above 4M FFT length - on my A53-quad-based C2 I see falloff from the strictly-arithmetic-opcount-based O(n log n) scaling, but nowhere near what you see in your big+little combined tests.
How much did the full kit cost you? And do you have anything in mind to reduce per-node cost for the possible homebuilt cluster you describe? |
I'd be interested in building a small 4 or 7 node cluster of devices like this, if it's more cost effective for DC than running mprime on x86-64.
|
[QUOTE=ewmayer;506032]Thanks for the data, M344587487! That is indeed a dramatic falling-off of the A53 throughput once you get above 4M FFT length - on my A53-quad-based C2 I see falloff from the strictly-arithmetic-opcount-based O(n log n) scaling, but nowhere near what you see in your big+little combined tests.
How much did the full kit cost you? And do you have anything in mind to reduce per-node cost for the possible homebuilt cluster you describe?[/QUOTE]
You can buy these directly and they ship worldwide: [URL]https://www.friendlyarm.com/index.php?route=product/product&path=69&product_id=241[/URL] $45 for the board, $6 for the heatsink, $5 postage, £3 for a USB-C cable, £3 for an SD card; I already had a USB-C power source. You can probably get a 10-port USB power hub on ebay for £10-£20, but they're likely to have terrible efficiency. I'm not certain, but I have a feeling that the best solution for efficiency would be to mod an ATX PSU or maybe a laptop transformer, which is a massive headache though. You absolutely need the board, heatsink and USB cable per node. You might be able to DIY a shared heatsink on the cheap if you're building a wall of these; you'd just need some thermal pads on the SoCs, and the backside of the heatsink can be flat. No need for a switch as there's wifi. I don't think you can network boot these, but even if you could you'd need a switch and would gain nothing (except the reliability of not using SD cards). It may be possible to eliminate the SD card if the OS can be made small enough to run fully in RAM, but it's a bit of an admin nightmare on initial boot and if there's a crash or, god forbid, a power cut.
[QUOTE=Mark Rose;506034]I'd be interested in building a small 4 or 7 node cluster of devices like this, if it's more cost effective for DC than running mprime on x86-64.[/QUOTE]
Unfortunately it's probably not, but when I find the wattmeter I'll disable all non-critical hardware and see how low it can go. I have hope that newer chips will make ARM more power competitive, but by the time they get cheap enough to make sense x86 will also have made progress. AMD 7nm is out this year with AVX2 parity with Intel, which promises to be a massive leap forwards in power efficiency for x86. I didn't buy a NEO4 heatsink and instead used one salvaged from an old computer.
It's pretty ridiculous as it dwarfs the board in all three dimensions. |
Hmm, so let's think along extreme cost-cutting lines, how low can we go?
o Was hoping that (like Odroid) one might be able to get a bulk discount (say 10%) on these; might be worth e-mailing the mfr to ask. (I just did so.)
o Heat sink: One should be able to get a properly-sized set of these either in a cheap bulk-pack (I see a 10-pack of 25x25x5mm ones on ebay for ~$10) or in the form of some cut-to-desired-length extruded Alu. finned stock.

A total per-unit cost < $50 is getting pretty close to "worth a try as a feasibility study" range. I'll be very interested in seeing accurate TPD numbers for these. When one factors in the wattage of the entire package (CPU + rest-of-mobo + PSU + SSD + case fans), what is the typical TPD at the wall outlet for a typical high-bang-per-watt Intel or AMD multicore system? |
[QUOTE=ewmayer;506051]When one factors in the wattage of the entire package (CPU + rest-of-mobo + PSU + SSD + case fans), what is the typical TPD at the wall outlet for a typical high-bang-per-watt Intel or AMD multicore system?[/QUOTE]
With a single Gold-rated power supply feeding four 4-core boards, underclocked and undervolted to match memory, each board draws about 67.5 watts at the wall and gives 186 iter/sec for a 4k FFT (143 iter/sec for 5k, 295 iter/sec for 2.5k). |
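To put the wall-power figures in the post above on a per-iteration footing, one can divide watts by iterations per second to get joules per iteration. This is only a sketch: it assumes the 67.5 W and the iter/sec rates both refer to a single board, which the post suggests but does not spell out, and `joules_per_iter` is a hypothetical helper name.

```python
def joules_per_iter(watts, iters_per_sec):
    """Energy per iteration: W / (iter/s) = J/iter."""
    return watts / iters_per_sec


WALL_WATTS = 67.5  # per board at the wall, as quoted above (assumed per board)

# iter/sec at each FFT length, as quoted above
rates = {"2.5k": 295.0, "4k": 186.0, "5k": 143.0}

for fft, rate in rates.items():
    print(f"{fft:>5}: {joules_per_iter(WALL_WATTS, rate):.3f} J/iter")
```

The same division works for any other board once a measured wall wattage and a throughput at a comparable FFT length are available.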
I started a thread about the RockPi4 on the [url=https://forum.odroid.com/viewtopic.php?f=149&t=33468]Odroid forums[/url] - Local-expert tkaiser has some useful insights there re. suitable OS images.
|
Raspberry Pi 3A+ power consumption
Just a quick note as a reference.
I've seen many tables and measurements for Raspberry Pi power consumption on the interwebs, but they are somewhat misleading, because apparently their "full load" isn't anywhere near what Mlucas with ASIMD is capable of achieving while running.

So, Mlucas (still v17.1, sorry) on a Raspberry Pi 3A+ at stock 1.4 GHz. 64-bit Gentoo "sakaki" build, the fresher image from December 2018 so that it can run on the 3A+ as well. There is some slight difference in the firmware, and the old image from June 2018 wouldn't start on the 3A+. X disabled, of course. On 1 GB it makes only a small difference in the running time, but 512MB seems to be too small for a graphical environment, even doing nothing.

Idle: 220 mA (from 5V)
Proper full load: 840 mA with 880 mA spikes, perhaps more; my current meter isn't that fast.

What I've seen on the net has generally been in the 400-500 mA range for "full load", so take those figures with a grain of salt...

As a side note, the 3A+ "should" be as fast as the 3B+ but, for whatever reason, is actually a few percent slower. Maybe the smaller memory chip makes that difference? The Elpida chip is marked -1D-F at the end of the device code, which means 533 MHz, and the default speed should be 500 MHz, so no difference there. (The memory on my 3B+ cards is -8D-F, by the way, which is 400 MHz, so the default setting is already overclocking it a bit!) |
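For comparison with wattage figures quoted elsewhere in the thread, the current readings in the post above convert to power as P = V x I. A minimal sketch, assuming the supply stays at its nominal 5 V under load (real supplies and cables sag a little, so actual power may be slightly lower); `watts_from_milliamps` is a hypothetical helper name.

```python
SUPPLY_VOLTS = 5.0  # nominal supply voltage, as stated in the post


def watts_from_milliamps(milliamps, volts=SUPPLY_VOLTS):
    """Convert a current reading in mA to power in W at the given voltage."""
    return volts * milliamps / 1000.0


# Readings from the post: idle, sustained full load, observed spikes
readings_ma = {"idle": 220, "full load": 840, "spikes": 880}

for label, ma in readings_ma.items():
    print(f"{label:>9}: {watts_from_milliamps(ma):.1f} W")
```

On these assumptions the 3A+ draws roughly 1.1 W idle and 4.2 W under Mlucas, against about 2.0 to 2.5 W for the 400-500 mA "full load" figures seen elsewhere.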