mersenneforum.org  

2017-12-18, 21:28   #144
ewmayer
2ω=0
Sep 2002
República de California
11756₁₀ Posts

Quote:
Originally Posted by ET_
How hard would it be to give Mlucas those PRP capabilities added to mprime in the last month? I am asking because PRP-C workunits are quite small (between 3M and 6M exponents) and they would be a wonderful task for small Berries.
Adding PRP/Gerbicz-check support is my #1 to-do item for the next release. Time frame unclear, but def. 1H2018, hopefully 1Q.

2017-12-18, 21:49   #145
ET_
Banned
"Luigi"
Aug 2002
Team Italia
5·7·139 Posts

Quote:
Originally Posted by ewmayer
Adding PRP/Gerbicz-check support is my #1 to-do item for the next release. Time frame unclear, but def. 1H2018, hopefully 1Q.

2018-01-11, 01:00   #146
ewmayer
2ω=0
Sep 2002
República de California
2²×2,939 Posts

Using a v17.1 SIMD build, my little Odroid C2 just successfully completed its first double-check. Woo Hoo! To paraphrase a certain wildly popular (and even more-wildly commercialized) SciFi movie of the 1970s, "Help me, Internet of Things and your billions of connected devices using ARM cores, you're my only hope." In other words, the business model here must needs be an "ARMy ant" one.

2018-01-11, 21:08   #147
VictordeHolland
"Victor de Hollander"
Aug 2011
the Netherlands
1179₁₀ Posts

Quote:
Originally Posted by ewmayer
Using a v17.1 SIMD build, my little Odroid C2 just successfully completed its first double-check. Woo Hoo!
Congrats! How many days/hours did it take?

2018-01-11, 22:20   #148
ewmayer
2ω=0
Sep 2002
República de California
26754₈ Posts

Quote:
Originally Posted by VictordeHolland
Congrats! How many days/hours did it take?
Pretty much 24/7 since 9 Nov, so almost exactly 2 months.

2018-01-14, 23:00   #149
ewmayer
2ω=0
Sep 2002
República de California
11756₁₀ Posts

Couple of issues ARM-builders brought to my attention in the past week which I'd like to run by the readership:

1. An Odroid XU4 owner reported that he needed to specify the architecture (in his case via '-march=armv7ve') in order to get a working build. (Sans the -march specifier his build segfaulted.) I'd like to add some verbiage about that to the readme page, but as the laundry list of possible arch-types here is long, I'd really like a simpler solution if possible. Would suggesting use of '-march=native' be the portable way to go here?

2. A user whose ARM device runs the Open Pandora OS reports that the ARM-specific code around the has_asimd function in the util.c file gives a compile error because the '#include <sys/auxv.h>' fails under his OS. I've asked him for more details re. his platform's header-file tree and am awaiting a reply, but I wonder if the inelegant-but-portable way to go here would be to replace said header-file-include with code which simply parses the /proc/cpuinfo file and searches for the string 'asimd'. Thoughts?
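
For what it's worth, the /proc/cpuinfo fallback floated in item 2 could look roughly like the sketch below. It is written in Python purely for brevity rather than in Mlucas's C, and the helper names (cpuinfo_has_feature, has_asimd_from_cpuinfo) are made up for illustration. It assumes the kernel reports SIMD support as a whitespace-separated 'asimd' token on a 'Features' line, which is what AArch64 Linux kernels typically print; other platforms may lay the file out differently.

```python
def cpuinfo_has_feature(cpuinfo_text, feature="asimd"):
    """Return True if a 'Features' line lists `feature` as a whole token.

    Assumption: the layout is "Features : fp asimd ..." as on typical
    AArch64 Linux kernels. Lines with any other key are ignored.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key : value" line
        if key.strip().lower() == "features" and feature in value.split():
            return True
    return False


def has_asimd_from_cpuinfo(path="/proc/cpuinfo"):
    """Best-effort runtime check; reports False if the file is unreadable."""
    try:
        with open(path, "r") as f:
            return cpuinfo_has_feature(f.read())
    except OSError:
        return False


# Hypothetical sample text, for illustration only.
sample = "processor\t: 0\nFeatures\t: fp asimd evtstrm crc32\n"
print(cpuinfo_has_feature(sample))
```

A C version would presumably do the same thing with fopen/fgets/strstr. Matching whole tokens on the 'Features' line, rather than searching the entire file for a substring, avoids accidental hits on unrelated lines.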

2018-02-12, 14:11   #150
VictordeHolland
"Victor de Hollander"
Aug 2011
the Netherlands
3²×131 Posts

Not sure this is the right thread, but most people interested in ARM stuff will probably read it here:

Qualcomm Snapdragon 845 (extensive) performance preview:
https://www.anandtech.com/show/12420...rmance-preview

Custom implementation of ARM Cortex A75 and A55 cores:
4x Kryo 385 gold (custom A75 with 256KB L2) @2.8GHz
4x Kryo 385 silver (custom A55 with 128KB L2) @1.77GHz

2MB L3 (shared between cores)
3MB system cache (shared between CPU, GPU, accelerators, etc.)
LPDDR4x (29.9GB/s bandwidth)
process: Samsung 10nm LPP (2nd gen 10nm)

- DynamIQ allows for more flexible core combinations (1x A75 + 7x A55 for instance or 2+6 etc.)
- Private L2 per core for lower latency (configurable)
- A75 (3 wide decode) vs. A73 (2 wide decode)

Most of the performance improvements are probably due to the new cache structure and the better memory subsystem. Architecturally there are some small improvements (wider decode and issue queues) to extract more IPS.

NEON/FP pipe stays more or less the same between A73 and A75 (source Anandtech):
Quote:
The 2x 64-bit NEON/Floating Point pipes have their own dedicated Rename stage and 128-bit register file, with each SIMD NEON pipe in the A73/A75 capable of performing 8x 8-bit integer, 4x 16-bit integer, 2x 32-bit integer or single-precision floating-point (FP), or 1x 64-bit integer or double-precision FP operations per cycle
- A75 FP multiply-accumulate (MAC) latency reduced to 5 cycles, compared to 6 cycles on the A73
- dedicated store pipeline for the NEON/FP

A55 and A53 NEON/FP is also almost the same (source Anandtech):
Quote:
The 2x 64-bit NEON/Floating Point pipes are still optional (some markets do not require them) and are served from a dedicated 128-bit register file like the A53. Each SIMD NEON pipe in the A53/A55 can perform 8x 8-bit integer, 4x 16-bit integer, 2x 32-bit integer or single-precision floating-point (FP), or 1x 64-bit integer or double-precision FP operations per cycle
- A55 does a FP multiply–add (FMA) in a single pass instead of two passes for the A53, reducing the latency from 8 cycles to 4.
- A55 gains separate AGUs (Address Generation Units) for loads and stores (instead of 1 AGU that does both), so it can dual-issue a load and store at the same time.

- A75 and A55 both gain native support for FP16



ARM's primary goals for its big, out-of-order (OoO) designs over the last few generations:
A57 --> (high) performance 64bit
A72 --> reducing power
A73 --> improving power efficiency
A75 --> improving performance

LITTLE in-order:
A53 --> high power efficiency, small die area
A55 --> improving performance

More info on the A75/A55
https://www.anandtech.com/show/11441...cortex-a75-a55
Attached thumbnails: arm-a53-cpu_diagram.png, arm-a55-cpu_diagram.png, arm-a73-cpu_diagram.png, arm-a75-cpu_diagram.png, arm-a75_cpu-core.png

2018-02-12, 16:13   #151
VictordeHolland
"Victor de Hollander"
Aug 2011
the Netherlands
3²·131 Posts

By the way, I tested throttling of a smartphone SoC (the Samsung Exynos 7420 inside the Samsung Galaxy S6) on my Dutch blog:
https://victordehollander.tweakblogs...martphone-socs

For putting a FP heavy load on the cores I used BOINC and theSkyNet POGS project (so the cycles were put to some good use :) ).
https://pogs.theskynet.org/pogs/

Samsung Exynos 7420 Octa inside Samsung Galaxy S6
Instruction set: ARMv8-A (‘64bit’)
Process: Samsung 14nm LPE (Low Power Early) FINFETs
Microarchitecture: big.LITTLE 4x Cortex-A57 + 4x Cortex-A53 (GTS)
Max Frequency: 2.1GHz (A57 cluster) 1.5GHz (A53 cluster)
Memory: 3GB 64-bit (2x32bit) dual-channel LPDDR4 @1553MHz (24.88 GB/s bandwidth)

Some interesting conclusions:
- The Geekbench 4 load is very spiky, sometimes with several seconds between tests in which the SoC can cool down.
- The glass front and back of the S6 seem to limit thermal dissipation to about 2W.
- With 2 cores loaded, the 7420 throttles after about 1 minute; the frequency never quite stabilises and keeps hunting for a balance between performance, power and heat.
- With 3 cores loaded, it can only run at max speed (2100MHz) for about 10-15 seconds. After 3 minutes it stabilises @1200MHz.
- With 4 cores loaded, throttling sets in almost immediately, and the frequency keeps switching between 1700 and 1200MHz.
- An A57 core consumes almost 1.5W @2100MHz
- @1500MHz the A57 core uses ~0.7W
- @1200MHz the A57 core uses ~0.5W

- The SoC gets quite warm (70C+) when using more than 1 core. The phone itself only feels lukewarm, never hot to the touch.

In terms of performance for BOINC theSkynetPOGS:
Intel i5-2500K core @4.0GHz with DDR3-2133 (2133*64bits*2channels/8bits=34GB/s ?)
5476.59 BOINC credits in 103,484 CPU seconds
0.0529 credits/CPU second (=100%)
0.0132 credits/CPU second per 1GHz (=100%)

Samsung Galaxy S6 (Exynos 7420) A57 core @1500MHz
99.06 BOINC credits in 11,638 CPU seconds
0.0085 credits/CPU second (=16%)
0.0057 credits/CPU second per 1GHz (=43%)

So in terms of absolute performance a single Sandy Bridge core outperforms the entire SoC (no surprise there) in this scenario. In terms of performance per GHz the A57 scores about half that of Sandy Bridge. If we look at performance per watt it is a whole different story:
Let's take 100W at the wall for the Sandy Bridge system (about 4.7 seconds per point, i.e. all 4 cores running at 0.0529 credits/CPU second each); that gives about 472 joules per BOINC point
vs.
the A57 @1500MHz (0.7W), which takes more than 100 seconds per point (about 117), but that works out to just 82 joules per BOINC point.

Note: there are two generations of manufacturing process difference (32nm planar vs. 14nm FinFET).
Smartphone SoCs have high memory bandwidth.
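
The per-GHz and joules-per-point figures above can be re-derived from the raw credit and CPU-time numbers with a few lines of Python. This is only a sketch of the arithmetic: the function names are invented here, and it assumes that the 100W wall figure covers all 4 cores of the i5-2500K under load (which is what makes the roughly 5 seconds per point come out), while the 0.7W figure is for a single A57 core.

```python
def credits_per_cpu_second(credits, cpu_seconds):
    """Throughput of one core in BOINC credits per CPU second."""
    return credits / cpu_seconds


def joules_per_credit(credits, cpu_seconds, watts, cores=1):
    """Energy per credit, assuming `cores` identical cores share `watts`."""
    seconds_per_credit = cpu_seconds / credits / cores
    return watts * seconds_per_credit


# Raw numbers from the post; clock speeds in GHz.
i5_rate = credits_per_cpu_second(5476.59, 103484)
a57_rate = credits_per_cpu_second(99.06, 11638)

# Per-GHz comparison (A57 relative to Sandy Bridge).
per_ghz_ratio = (a57_rate / 1.5) / (i5_rate / 4.0)

# Assumption: 100 W at the wall with all 4 Sandy Bridge cores busy.
i5_joules = joules_per_credit(5476.59, 103484, watts=100.0, cores=4)
a57_joules = joules_per_credit(99.06, 11638, watts=0.7, cores=1)

print(f"A57 per-GHz throughput vs Sandy Bridge: {per_ghz_ratio:.0%}")
print(f"Sandy Bridge: {i5_joules:.0f} J/credit")
print(f"A57 @1500MHz: {a57_joules:.0f} J/credit")
```

Note that the two power figures are not strictly comparable: one is whole-system power at the wall, the other is a per-core estimate, so the gap in joules per point should be read as a rough indication only.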
Attached thumbnails: 004_1C_freq.png, 007_2C_freq.png, 010_3C_freq.png, 013_4C_freq.png, Screenshot_CPU-Z_temps145714_4cores.png

2018-02-13, 00:09   #152
ewmayer
2ω=0
Sep 2002
República de California
26754₈ Posts

Thanks for the detailed data, Victor - so this again raises the fairly self-evident idea of ganging up a bunch of such phone chassis into a single larger 'compute block'. Perhaps a year or so following the release of a given popular phone model using the desired chipset, start looking for used ones on the cheap, e.g. due to cracked/missing screens and/or no-longer-working smartphone features which leave the Cortex CPUs and basic OS intact. Affix suitable-sized mini-alu-fins to the CPUs and mount all the thus-modified screen-removed phones in a basic chassis which spaces them apart suitably (say 5-10mm gap) to permit external airflow, said chassis also providing a simple shared power rail, cooling fan and whatever kind of basic cabling/switch-setup is needed to allow the phones to be interfaced with in order to load software and monitor processes. I wonder how feasible - mainly in terms of cost-effectiveness - such a setup would be.

2018-02-13, 17:01   #153
VictordeHolland
"Victor de Hollander"
Aug 2011
the Netherlands
3²×131 Posts

Quote:
Originally Posted by ewmayer
so this again raises the fairly self-evident idea of ganging up a bunch of such phone chassis into a single larger 'compute block'. Perhaps a year or so following the release of a given popular phone model using the desired chipset, start looking for used ones on the cheap, e.g. due to cracked/missing screens and/or no-longer-working smartphone features which leave the Cortex CPUs and basic OS intact. Affix suitable-sized mini-alu-fins to the CPUs and mount all the thus-modified screen-removed phones in a basic chassis which spaces them apart suitably (say 5-10mm gap) to permit external airflow, said chassis also providing a simple shared power rail, cooling fan and whatever kind of basic cabling/switch-setup is needed to allow the phones to be interfaced with in order to load software and monitor processes. I wonder how feasible - mainly in terms of cost-effectiveness - such a setup would be.
I like your idea, but I do see some challenges:
- Most phones have locked boot-loaders, so you would have to deal with whatever version of Android they come with. Chances of putting a Linux distro on them are small. Even if the boot-loader could somehow be hacked, there would still be the need for specialized drivers for the SoC?!
- Without displays and ethernet, is it even possible to SSH/remote desktop to an Android device?
- The price of a second-hand (fully working) phone, for instance a Samsung S6 (released April 2015), is still well over 100 euros, compared to 55 euros for an Odroid-C2 and about $110 for the announced Odroid-N

2018-02-23, 10:02   #154
M344587487
"Composite as Heck"
Oct 2017
2·5²·19 Posts

ROC-RK3328-CC (Renegade)
https://www.indiegogo.com/projects/r...ndroid-linux#/
GCC: 7.2.0
Image: ROC-RK3328-CC_Ubuntu16.04_Arch64_20180124

It's an A53 board that's faster than a pi3b on paper, but my Renegade benchmarks are slower. I'm hoping that it's a poorly configured image (released weeks before anyone got their boards, still not many have the board, there are other problems with the image). Will retest if they get their act together.

Scalar 4 thread
Code:
17.1
      1024  msec/iter =   89.21  ROE[avg,max] = [0.261718750, 0.312500000]  radices = 256  8 16 16  0  0  0  0  0  0
      1152  msec/iter =  108.25  ROE[avg,max] = [0.210023717, 0.250000000]  radices = 288  8 16 16  0  0  0  0  0  0
      1280  msec/iter =  133.32  ROE[avg,max] = [0.224469866, 0.250000000]  radices = 160 16 16 16  0  0  0  0  0  0
      1408  msec/iter =  148.25  ROE[avg,max] = [0.225892857, 0.250000000]  radices = 176 16 16 16  0  0  0  0  0  0
      1536  msec/iter =  167.77  ROE[avg,max] = [0.231222098, 0.250000000]  radices = 192 16 16 16  0  0  0  0  0  0
      1664  msec/iter =  179.01  ROE[avg,max] = [0.226935686, 0.281250000]  radices = 208 16 16 16  0  0  0  0  0  0
      1792  msec/iter =  198.90  ROE[avg,max] = [0.217843192, 0.281250000]  radices = 224 16 16 16  0  0  0  0  0  0
      1920  msec/iter =  213.38  ROE[avg,max] = [0.258705357, 0.312500000]  radices = 240 16 16 16  0  0  0  0  0  0
      2048  msec/iter =  217.56  ROE[avg,max] = [0.320089286, 0.375000000]  radices = 256 16 16 16  0  0  0  0  0  0
      2304  msec/iter =  259.63  ROE[avg,max] = [0.252232143, 0.312500000]  radices = 288 16 16 16  0  0  0  0  0  0
      2560  msec/iter =  323.83  ROE[avg,max] = [0.302678571, 0.375000000]  radices = 160 16 16 32  0  0  0  0  0  0
      2816  msec/iter =  340.89  ROE[avg,max] = [0.265848214, 0.312500000]  radices = 176 16 16 32  0  0  0  0  0  0
      3072  msec/iter =  381.85  ROE[avg,max] = [0.219266183, 0.281250000]  radices = 192 16 16 32  0  0  0  0  0  0
      3328  msec/iter =  416.90  ROE[avg,max] = [0.290401786, 0.343750000]  radices = 208 16 16 32  0  0  0  0  0  0
      3584  msec/iter =  438.84  ROE[avg,max] = [0.211718750, 0.250000000]  radices = 224 16 16 32  0  0  0  0  0  0
      3840  msec/iter =  490.59  ROE[avg,max] = [0.228404018, 0.257812500]  radices = 240 16 16 32  0  0  0  0  0  0
      4096  msec/iter =  513.95  ROE[avg,max] = [0.228599330, 0.312500000]  radices = 256 16 16 32  0  0  0  0  0  0
      4608  msec/iter =  612.99  ROE[avg,max] = [0.221770368, 0.250000000]  radices = 288 16 16 32  0  0  0  0  0  0
      5120  msec/iter =  706.41  ROE[avg,max] = [0.248325893, 0.312500000]  radices = 160 16 32 32  0  0  0  0  0  0
      5632  msec/iter =  788.81  ROE[avg,max] = [0.218415179, 0.281250000]  radices = 176 16 32 32  0  0  0  0  0  0
      6144  msec/iter =  891.81  ROE[avg,max] = [0.213281250, 0.250000000]  radices =  24 16 16 16 32  0  0  0  0  0
      6656  msec/iter =  978.03  ROE[avg,max] = [0.303348214, 0.375000000]  radices = 208 16 32 32  0  0  0  0  0  0
      7168  msec/iter = 1091.41  ROE[avg,max] = [0.215611049, 0.250000000]  radices =  28 16 16 16 32  0  0  0  0  0
      7680  msec/iter = 1270.44  ROE[avg,max] = [0.221010045, 0.281250000]  radices = 240 16 32 32  0  0  0  0  0  0
asimd 4 thread
Code:
17.1
      1024  msec/iter =   81.63  ROE[avg,max] = [0.254687500, 0.312500000]  radices = 256  8 16 16  0  0  0  0  0  0
      1152  msec/iter =   91.43  ROE[avg,max] = [0.221044922, 0.250000000]  radices = 288  8 16 16  0  0  0  0  0  0
      1280  msec/iter =  104.53  ROE[avg,max] = [0.264508929, 0.343750000]  radices = 160 16 16 16  0  0  0  0  0  0
      1408  msec/iter =  120.61  ROE[avg,max] = [0.227343750, 0.265625000]  radices = 176 16 16 16  0  0  0  0  0  0
      1536  msec/iter =  153.76  ROE[avg,max] = [0.254241071, 0.312500000]  radices = 192 16 16 16  0  0  0  0  0  0
      1664  msec/iter =  147.07  ROE[avg,max] = [0.270758929, 0.312500000]  radices = 208 16 16 16  0  0  0  0  0  0
      1792  msec/iter =  161.98  ROE[avg,max] = [0.220532663, 0.250000000]  radices = 224 16 16 16  0  0  0  0  0  0
      1920  msec/iter =  172.94  ROE[avg,max] = [0.234137835, 0.265625000]  radices =  60 32 32 16  0  0  0  0  0  0
      2048  msec/iter =  185.06  ROE[avg,max] = [0.223493304, 0.250000000]  radices =  64 32 32 16  0  0  0  0  0  0
      2304  msec/iter =  213.12  ROE[avg,max] = [0.268526786, 0.312500000]  radices = 144 32 16 16  0  0  0  0  0  0
      2560  msec/iter =  237.47  ROE[avg,max] = [0.236908831, 0.312500000]  radices = 160 32 16 16  0  0  0  0  0  0
      2816  msec/iter =  274.68  ROE[avg,max] = [0.224888393, 0.312500000]  radices =  44 32 32 32  0  0  0  0  0  0
      3072  msec/iter =  301.81  ROE[avg,max] = [0.224818638, 0.251953125]  radices =  48 32 32 32  0  0  0  0  0  0
      3328  msec/iter =  327.87  ROE[avg,max] = [0.220971680, 0.250000000]  radices =  52 32 32 32  0  0  0  0  0  0
      3584  msec/iter =  360.36  ROE[avg,max] = [0.223172433, 0.250000000]  radices =  56 32 32 32  0  0  0  0  0  0
      3840  msec/iter =  382.60  ROE[avg,max] = [0.224260603, 0.250000000]  radices =  60 32 32 32  0  0  0  0  0  0
      4096  msec/iter =  403.07  ROE[avg,max] = [0.295089286, 0.343750000]  radices = 128 32 32 16  0  0  0  0  0  0
      4608  msec/iter =  462.49  ROE[avg,max] = [0.258928571, 0.312500000]  radices = 144 32 32 16  0  0  0  0  0  0
      5120  msec/iter =  513.99  ROE[avg,max] = [0.237137277, 0.281250000]  radices = 160 32 32 16  0  0  0  0  0  0
      5632  msec/iter =  584.83  ROE[avg,max] = [0.256919643, 0.312500000]  radices = 176 32 32 16  0  0  0  0  0  0
      6144  msec/iter =  667.16  ROE[avg,max] = [0.246651786, 0.281250000]  radices = 192 32 32 16  0  0  0  0  0  0
      6656  msec/iter =  702.36  ROE[avg,max] = [0.266085379, 0.312500000]  radices = 208 16 32 32  0  0  0  0  0  0
      7168  msec/iter =  798.79  ROE[avg,max] = [0.224874442, 0.281250000]  radices = 224 32 32 16  0  0  0  0  0  0
      7680  msec/iter =  882.90  ROE[avg,max] = [0.237053571, 0.281250000]  radices = 240 32 32 16  0  0  0  0  0  0
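
To put a number on the scalar vs. asimd difference in the two tables above, the timing lines can be parsed and divided row by row. The snippet below is a rough sketch, not part of Mlucas: the regular expression and function name are my own, and it takes the first column to be the FFT length (in K), which is how I understand these timing lines. Only three rows per table are pasted in to keep it short.

```python
import re

# Matches the leading "<FFT length>  msec/iter = <time>" part of a timing line.
TIMING_RE = re.compile(r"^\s*(\d+)\s+msec/iter\s*=\s*(\d+(?:\.\d+)?)")


def parse_timings(text):
    """Map FFT length -> msec/iter for every timing line found in text."""
    timings = {}
    for line in text.splitlines():
        match = TIMING_RE.match(line)
        if match:
            timings[int(match.group(1))] = float(match.group(2))
    return timings


# Three rows copied from each table above (scalar first, then asimd).
SCALAR = """\
      1024  msec/iter =   89.21  ROE[avg,max] = [0.261718750, 0.312500000]  radices = 256  8 16 16  0  0  0  0  0  0
      4096  msec/iter =  513.95  ROE[avg,max] = [0.228599330, 0.312500000]  radices = 256 16 16 32  0  0  0  0  0  0
      7680  msec/iter = 1270.44  ROE[avg,max] = [0.221010045, 0.281250000]  radices = 240 16 32 32  0  0  0  0  0  0
"""
ASIMD = """\
      1024  msec/iter =   81.63  ROE[avg,max] = [0.254687500, 0.312500000]  radices = 256  8 16 16  0  0  0  0  0  0
      4096  msec/iter =  403.07  ROE[avg,max] = [0.295089286, 0.343750000]  radices = 128 32 32 16  0  0  0  0  0  0
      7680  msec/iter =  882.90  ROE[avg,max] = [0.237053571, 0.281250000]  radices = 240 32 32 16  0  0  0  0  0  0
"""

scalar = parse_timings(SCALAR)
asimd = parse_timings(ASIMD)

# Speedup = scalar time / asimd time, for FFT lengths present in both tables.
speedups = {n: scalar[n] / asimd[n] for n in sorted(scalar) if n in asimd}
for n, s in speedups.items():
    print(f"{n:5d}K  scalar/asimd = {s:.2f}x")
```

On these three rows the SIMD build's advantage grows with FFT length, from roughly 9% at 1024K to roughly 44% at 7680K.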

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
