2020-10-03, 03:12  #1
"Mike"
Aug 2002
17412_{8} Posts 
i9 observations
We have a new toy (i9-10900KF) to play with.
It uses a lot of power if you let it. (With our setup it will take >250W without thermal throttling!) We have attached an interesting chart.

The CPU is in a small case with a 2060 Super running at 125W. The CPU is cooled by an AIO liquid cooler. The case has several big fans.

We set the BIOS to obey the Intel specifications for this CPU, which are 125W PL1 and 250W PL2. By using Intel's XTU program, we can modify the power limits in real time.

In the chart you can see that the wattage is capped when it hits 125W. In the lower part of the chart, we introduce lower power caps of 100, 75, 50, 25 and 9 watts. (9W is apparently the lowest you can go with 10 cores.) We color-coded lines that kinda match up when looking at the ms/iteration column.

FWIW, this is all with a ~10M exponent and a 560K FFT.

We might have missed something so if you see something weird or wrong let us know. Your observations are appreciated.

PS: We know Intel < AMD for this work load.
2020-10-03, 04:26  #2
"Curtis"
Feb 2005
Riverside, CA
11021_{8} Posts 
100W and 125W having the same ms/iter suggests the 100W setting is already saturating the memory bandwidth, so for P95 work there's little reason to run at higher power than 100W (or you need to retest after enabling XMP, if you forgot).
If there are more settings available, you might throttle a bit lower than 100 and still get nearly-full or full performance. I wonder how different this effect is with an FFT ten times as big.
2020-10-03, 07:22  #3
Feb 2016
UK
3·7·19 Posts 
How many workers were used? At 560k FFT, one worker should fit in CPU cache and not be memory bound, but that FFT is quite small so a large number of cores may be inefficient. On the other end, 10 workers would almost certainly be memory bound. 2 and 5 workers are the other logical steps in between.
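A back-of-the-envelope way to check the cache argument above is to count one 8-byte double per FFT element and compare the total against the L3 size. This is only a sketch: the 8-bytes-per-element rule ignores twiddle tables and scratch buffers (so real usage is higher), the 20 MB L3 figure is assumed for the i9-10900KF, and the function names are made up for illustration.

```python
# Rough working-set estimate for Prime95-style FFTs.
# Assumption: one 8-byte double per FFT element; twiddle tables and scratch
# buffers are ignored, so the real footprint is somewhat larger.
BYTES_PER_ELEMENT = 8
L3_BYTES = 20 * 1024 * 1024  # assumed 20 MB L3 for the i9-10900KF


def fft_footprint_bytes(fft_len_k: int, workers: int = 1) -> int:
    """Approximate bytes of FFT data for `workers` independent tests."""
    return fft_len_k * 1024 * BYTES_PER_ELEMENT * workers


def fits_in_l3(fft_len_k: int, workers: int = 1, l3_bytes: int = L3_BYTES) -> bool:
    """True if the estimated FFT data for all workers fits in the L3 cache."""
    return fft_footprint_bytes(fft_len_k, workers) <= l3_bytes


if __name__ == "__main__":
    for workers in (1, 2, 5, 10):
        mib = fft_footprint_bytes(560, workers) / 2**20
        verdict = "fits" if fits_in_l3(560, workers) else "does not fit"
        print(f"560K FFT x {workers:2d} workers: {mib:6.2f} MiB -> {verdict}")
```

Under these assumptions a single 560K worker needs about 4.4 MiB and fits comfortably, while 10 workers need about 44 MiB and spill to RAM, which is consistent with the reasoning in this post.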

2020-10-03, 08:09  #4
Sep 2006
Brussels, Belgium
11001101111_{2} Posts 
Those CPUs come with two AVX-512 FMA units: IMHO that is the limiting factor. I have an i9-10920X which I limited to 3 GHz (3.5 GHz is nominal) AND to a power draw of 140 W (165 W being nominal). A 2880K FFT requires 0.93 ms per iteration (one worker, twelve cores) with those settings. The CPU is then at a bit less than 80% utilisation (38% if taking hyperthreading into account).
Jacob

2020-10-03, 11:33  #5
"Mike"
Aug 2002
2×29×137 Posts 

2020-10-03, 14:01  #6
Feb 2016
UK
110001111_{2} Posts 
Probably RAM bandwidth limited. Try some other combinations. I'd guess 3 workers of 3 cores each is likely better, even if that leaves you with a core left over.

2020-10-03, 14:53  #7
Sep 2006
Brussels, Belgium
3^{3}×61 Posts 

2020-10-03, 15:39  #8
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
13×373 Posts 
I suggest turning mprime benchmarking loose to determine the optimal-throughput number of workers at a fixed power setting, at the first-test wavefront PRP FFT length. It's unlikely to be 1 core per worker in my experience on a variety of CPU models old and new. For most reliable results, minimize other system activity throughout the benchmarking run. Enjoy your toy!
Last fiddled with by kriesel on 2020-10-03 at 15:44
2020-10-03, 17:10  #9
"Bill Staffen"
Jan 2013
Pittsburgh, PA, USA
633_{8} Posts 
I agree: 1 core per worker would be optimal in a perfect world with infinite L3 cache, but you only have a 20 MB cache and so there is no way in hell you're fitting 10 PRPs in there.
You might generate an overall increase in throughput with just 2 workers, because even if it isn't as efficient per core you would be entirely on the chip. I know that thing has quad channel memory so it might be faster running 10 workers against system RAM, but it really might not, either. Staying on the chip is a big advantage, and that's why I got the Ryzen 5 6-core instead of the Ryzen 7 8-core: they have the same 32MB cache and the Ryzen 5 cost a lot less.
Last fiddled with by Aramis Wyler on 2020-10-03 at 17:11
2020-10-03, 18:35  #10
"Mike"
Aug 2002
2×29×137 Posts 
We ran two benchmarks.
The first is with a power limit of 25W. The second is at 250W. We turned off the short term "turbo" thingie.

In all cases, using one worker yields the best throughput. We did not test hyperthreading.

With the 250W limiter a different limit kicks in at around 145W. It is called the "current/EDP" limit. We haven't messed around with changing that yet. It sounds kinda scary.

As more cores are added and the TDP increases, the processor automatically drops its core and cache frequency. The memory frequency is fixed at all times. Perhaps the 25W benchmark is able to use more cores because it is jamming less data per (slower) core through a fixed (memory) pipe.

So far here are the best timings:

25W limit: Timings for 6144K FFT length (8 cores, 1 worker): 7.69 ms. Throughput: 129.97 iter/sec.
250W limit: Timings for 6144K FFT length (4 cores, 1 worker): 6.89 ms. Throughput: 145.17 iter/sec.

This is all with a 6144K (6M) FFT which should be at the wavefront for first time (110M) PRP work. We only tested one 6M FFT variant to save time, so these timings are probably not optimal. For future benchmarking, to save time, we will only investigate one worker per instance.
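To put the two 6144K timings in perspective, the arithmetic can be sketched as below. It assumes a PRP test of a 110M exponent takes roughly 110 million iterations, and it treats the power cap as an upper bound on package draw: 25 W for the first run, and about 145 W for the second, since that is where the current/EDP limit was reported to kick in. The real draw of the 4-core run may well be lower, and this is package power only, so the energy figures are ceilings rather than measurements. The helper names are invented for this example.

```python
SECONDS_PER_DAY = 86_400
JOULES_PER_KWH = 3_600_000


def days_per_test(exponent: int, ms_per_iter: float) -> float:
    """Wall-clock days for one test, assuming ~`exponent` iterations."""
    return exponent * ms_per_iter / 1000 / SECONDS_PER_DAY


def kwh_upper_bound(exponent: int, ms_per_iter: float, watts_cap: float) -> float:
    """Energy ceiling in kWh, ASSUMING the package never exceeds `watts_cap`."""
    seconds = exponent * ms_per_iter / 1000
    return watts_cap * seconds / JOULES_PER_KWH


if __name__ == "__main__":
    exponent = 110_000_000  # first-time PRP wavefront quoted in the post
    # (label, ms/iter from the post, assumed ceiling on package watts)
    runs = [("25W cap, 8 cores", 7.69, 25.0), ("250W cap, 4 cores", 6.89, 145.0)]
    for label, ms, watts in runs:
        print(f"{label}: {days_per_test(exponent, ms):.2f} days, "
              f"<= {kwh_upper_bound(exponent, ms, watts):.1f} kWh")
```

With these assumptions the 25W setting comes out at roughly 9.8 days per test versus roughly 8.8 days at the higher limit, so the higher power limit buys about a day per test at a much higher energy ceiling.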
2020-10-03, 19:03  #11
"Mike"
Aug 2002
2·29·137 Posts 
Here is the data for a 560K FFT AKA 10M CPRP.
Code:
125W
Timings for 560K FFT length (1 core, 1 worker): 1.47 ms. Throughput: 679.61 iter/sec.
Timings for 560K FFT length (2 cores, 1 worker): 0.80 ms. Throughput: 1253.39 iter/sec.
Timings for 560K FFT length (3 cores, 1 worker): 0.55 ms. Throughput: 1805.90 iter/sec.
Timings for 560K FFT length (4 cores, 1 worker): 0.45 ms. Throughput: 2235.83 iter/sec.
Timings for 560K FFT length (5 cores, 1 worker): 0.35 ms. Throughput: 2841.27 iter/sec.
Timings for 560K FFT length (6 cores, 1 worker): 0.32 ms. Throughput: 3170.52 iter/sec.
Timings for 560K FFT length (7 cores, 1 worker): 0.29 ms. Throughput: 3488.97 iter/sec.
Timings for 560K FFT length (8 cores, 1 worker): 0.27 ms. Throughput: 3742.76 iter/sec.
Timings for 560K FFT length (9 cores, 1 worker): 0.25 ms. Throughput: 3945.33 iter/sec.
Timings for 560K FFT length (10 cores, 1 worker): 0.24 ms. Throughput: 4107.49 iter/sec.
Code:
25W
Timings for 560K FFT length (1 core, 1 worker): 1.60 ms. Throughput: 625.15 iter/sec.
Timings for 560K FFT length (2 cores, 1 worker): 1.09 ms. Throughput: 915.82 iter/sec.
Timings for 560K FFT length (3 cores, 1 worker): 0.87 ms. Throughput: 1151.95 iter/sec.
Timings for 560K FFT length (4 cores, 1 worker): 0.73 ms. Throughput: 1363.88 iter/sec.
Timings for 560K FFT length (5 cores, 1 worker): 0.67 ms. Throughput: 1500.90 iter/sec.
Timings for 560K FFT length (6 cores, 1 worker): 0.63 ms. Throughput: 1588.71 iter/sec.
Timings for 560K FFT length (7 cores, 1 worker): 0.60 ms. Throughput: 1673.46 iter/sec.
Timings for 560K FFT length (8 cores, 1 worker): 0.58 ms. Throughput: 1732.06 iter/sec.
Timings for 560K FFT length (9 cores, 1 worker): 0.58 ms. Throughput: 1718.79 iter/sec.
Timings for 560K FFT length (10 cores, 1 worker): 0.57 ms. Throughput: 1748.89 iter/sec.
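One way to read the two tables above is as scaling curves: throughput at N cores divided by throughput at 1 core gives the speedup, and speedup divided by N gives the per-core efficiency. The sketch below parses benchmark lines in the format shown and computes both. The regular expression is written against the lines as they appear in this post and is not taken from any Prime95 tooling; the function names are illustrative.

```python
import re

# Matches e.g. "(10 cores, 1 worker): 0.24 ms. Throughput: 4107.49 iter/sec"
LINE_RE = re.compile(
    r"\((\d+) cores?, (\d+) workers?\): ([\d.]+) ms\.\s+Throughput: ([\d.]+) iter/sec"
)

# Two lines copied from each table in the post (1 core and 10 cores).
SAMPLE_125W = """
Timings for 560K FFT length (1 core, 1 worker): 1.47 ms. Throughput: 679.61 iter/sec.
Timings for 560K FFT length (10 cores, 1 worker): 0.24 ms. Throughput: 4107.49 iter/sec.
"""
SAMPLE_25W = """
Timings for 560K FFT length (1 core, 1 worker): 1.60 ms. Throughput: 625.15 iter/sec.
Timings for 560K FFT length (10 cores, 1 worker): 0.57 ms. Throughput: 1748.89 iter/sec.
"""


def parse_timings(text: str) -> dict[int, float]:
    """Return {cores: throughput in iter/sec} for every benchmark line found."""
    return {int(cores): float(tput) for cores, _w, _ms, tput in LINE_RE.findall(text)}


def scaling(throughput_by_cores: dict[int, float]) -> dict[int, tuple[float, float]]:
    """Return {cores: (speedup vs 1 core, speedup / cores)}."""
    base = throughput_by_cores[1]
    return {
        n: (tput / base, tput / base / n)
        for n, tput in sorted(throughput_by_cores.items())
    }


if __name__ == "__main__":
    for label, sample in (("125W", SAMPLE_125W), ("25W", SAMPLE_25W)):
        for n, (speedup, eff) in scaling(parse_timings(sample)).items():
            print(f"{label} {n:2d} cores: speedup {speedup:.2f}x, efficiency {eff:.0%}")
```

On the quoted numbers, going from 1 to 10 cores gives about a 6.0x speedup at 125W but only about 2.8x at 25W, which shows how strongly the power cap flattens the curve.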