mersenneforum.org i9 observations

2020-10-03, 03:12   #1
Xyzzy

"Mike"
Aug 2002

8,167 Posts

i9 observations

We have a new toy (i9-10900KF) to play with. It uses a lot of power if you let it. (With our setup it will take >250W without thermal throttling!) We have attached an interesting chart.

The CPU is in a small case with a 2060 Super running at 125W. The CPU is cooled by an AIO liquid cooler. The case has several big fans.

We set the BIOS to obey the Intel specifications for this CPU, which are 125W PL1 and 250W PL2. Using Intel's XTU program, we can modify the power limits in real time. In the chart you can see that the wattage is capped when it hits 125W. In the lower part of the chart, we introduce lower power caps of 100, 75, 50, 25 and 9 watts. (9W is apparently the lowest you can go with 10 cores.) We color-coded lines that kinda match up when looking at the ms/iteration column.

FWIW, this is all with a ~10M exponent and a 560K FFT.

We might have missed something, so if you see something weird or wrong, let us know. Your observations are appreciated.

PS - We know Intel < AMD for this workload.

Attached Thumbnails
2020-10-03, 04:26   #2
VBCurtis

"Curtis"
Feb 2005
Riverside, CA

2·2,393 Posts

100W and 125W having the same ms/iter suggests the 100W setting is already saturating the memory bandwidth, so for P95 work there's little reason to run at higher power than 100W (or you need to re-test after enabling XMP, if you forgot). If there are more settings available, you might throttle a bit lower than 100 and still get nearly-full or full performance. I wonder how different this effect is with an FFT ten times as big.
2020-10-03, 07:22   #3
mackerel

Feb 2016
UK

419 Posts

How many workers were used? At 560K FFT, one worker should fit in CPU cache and not be memory bound, but that FFT is quite small, so a large number of cores may be inefficient. On the other end, 10 workers would almost certainly be memory bound. 2 and 5 workers are the other logical steps in between.
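A rough back-of-envelope check of that cache argument, assuming ~8 bytes (one double) per FFT element; Prime95's real working set is somewhat larger, which only strengthens the conclusion:

```python
def fft_mb(fft_k: int) -> float:
    """Approximate FFT data size in MB: fft_k * 1024 doubles of 8 bytes each."""
    return fft_k * 1024 * 8 / 2**20

L3_MB = 20  # i9-10900KF L3 cache

one_worker = fft_mb(560)        # ~4.4 MB: fits in L3
ten_workers = 10 * fft_mb(560)  # ~43.8 MB: spills to RAM
print(f"1 worker: {one_worker:.1f} MB, 10 workers: {ten_workers:.1f} MB, L3: {L3_MB} MB")
```

So one 560K worker fits in the 20 MB L3 with room to spare, while ten independent workers need more than twice the cache and go memory bound, as mackerel says.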
2020-10-03, 08:09   #4
S485122

Sep 2006
Brussels, Belgium

11010000110₂ Posts

Quote:
 Originally Posted by VBCurtis 100W and 125W having the same ms/iter suggests the 100W setting is already saturating the memory bandwidth. ...
Those i9-10xxx X-series CPUs support quad-channel memory, and as mackerel remarked, at this FFT size memory will not be solicited much.

Those CPUs come with two AVX-512 FMA units: IMHO that is the limiting factor.

I have an i9-10920X which I limited to 3 GHz (3.5 GHz is nominal) AND to a power draw of 140 W (165 W being nominal). A 2880K FFT requires 0.93 ms per iteration (one worker, twelve cores) at those settings. The CPU is then at a bit less than 80% utilisation (38% if taking hyperthreading into account).

Jacob

2020-10-03, 11:33   #5
Xyzzy

"Mike"
Aug 2002

8,167 Posts

Quote:
 Originally Posted by VBCurtis I wonder how different this effect is with an FFT ten times as big.
We will test that later today.

Quote:
 Originally Posted by mackerel How many workers were used?
One worker per core.

2020-10-03, 14:01   #6
mackerel

Feb 2016
UK

643₈ Posts

Quote:
 Originally Posted by S485122 Those i9-10L CPUs support quad-channel memory, then as mackerel remarked the FFT size means memory will not be solicited much. Those CPUs come with two AVX-512 FMA units : IMHO that is the limiting factor.
The model mentioned is a consumer one, dual channel, no AVX-512.

Quote:
 Originally Posted by Xyzzy One worker per core.
Probably RAM bandwidth limited. Try some other combinations. I'd guess 3 workers of 3 cores each is likely better, even if that leaves you with a core left over.

2020-10-03, 14:53   #7
S485122

Sep 2006
Brussels, Belgium

2×5×167 Posts

Quote:
 Originally Posted by mackerel The model mentioned is a consumer one, dual channel, no AVX-512. ...
Indeed, I thought it was an i9-10900X :-( (A wee bit of dyslexia, too quick to answer... sloppiness.) Sorry.

Jacob

2020-10-03, 15:39   #8
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

3·29·59 Posts

Quote:
 Originally Posted by Xyzzy We will test that later today. One worker per core.
I suggest turning mprime benchmarking loose to determine the optimal-throughput number of workers, at a fixed power setting, at the first-test wavefront PRP FFT length. In my experience, on a variety of CPU models old and new, it's unlikely to be 1 core per worker. For the most reliable results, minimize other system activity throughout the benchmarking run. Enjoy your toy!

Last fiddled with by kriesel on 2020-10-03 at 15:44

2020-10-03, 17:10   #9
Aramis Wyler

"Bill Staffen"
Jan 2013
Pittsburgh, PA, USA

19B₁₆ Posts

I agree - 1 core per worker would be optimal in a perfect world with infinite level 3 cache, but you only have a 20 MB cache, so there is no way in hell you're fitting 10 PRPs in there. You might generate an overall increase in throughput with just 2 workers, because even if it isn't as efficient per core you would be entirely on the chip. I know that thing has quad-channel memory so it might be faster running 10 workers against system RAM, but it really might not, either. Staying on the chip is a big advantage, and that's why I got the Ryzen 5 6-core instead of the Ryzen 7 8-core - they have the same 32MB cache and the Ryzen 5 cost a lot less.

Last fiddled with by Aramis Wyler on 2020-10-03 at 17:11
2020-10-03, 18:35   #10
Xyzzy

"Mike"
Aug 2002

8,167 Posts

We ran two benchmarks.

The first is with a power limit of 25W. The second is at 250W. We turned off the short-term "turbo" (PL2) limit.

In all cases, using one worker yields the best throughput. We did not test hyper-threading.

With the 250W limiter a different limit kicks in at around 145W. It is called the "current/EDP" limit. We haven't messed around with changing that yet. It sounds kinda scary.

As more cores are added and the power draw increases, the processor automatically drops its core and cache frequencies. The memory frequency is fixed at all times.

Perhaps the 25W benchmark is able to use more cores because it is jamming less data per (slower) core through a fixed (memory) pipe.

So far here are the best timings:

25W limit:
Timings for 6144K FFT length (8 cores, 1 worker): 7.69 ms. Throughput: 129.97 iter/sec.

250W limit:
Timings for 6144K FFT length (4 cores, 1 worker): 6.89 ms. Throughput: 145.17 iter/sec.

This is all with a 6144K (6M) FFT, which should be at the wavefront for first-time (110M) PRP work. We only tested one 6M FFT variant to save time, so these timings are probably not optimal.
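For what it's worth, those two results translate into very different work-per-watt figures (keeping in mind that the 250W run actually throttled at the ~145W current/EDP limit, so its true draw is below the nominal cap):

```python
# (power cap in W, best 6144K throughput in iter/sec) from the runs above
runs = {25: 129.97, 250: 145.17}

for watts, ips in sorted(runs.items()):
    print(f"{watts:>3} W cap: {ips / watts:.2f} iter/sec per watt of cap")
```

The 25W run gets roughly 9x more iterations per watt of cap than the 250W run, for only about a 10% throughput loss.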

For future benchmarking, to save time, we will only investigate one worker per instance.

Attached Files
25w.txt (6.1 KB, 65 views)
250w.txt (6.1 KB, 73 views)

2020-10-03, 19:03   #11
Xyzzy

"Mike"
Aug 2002

8,167 Posts

Here is the data for a 560K FFT AKA 10M C-PRP.

Code:
125W
Timings for 560K FFT length (1 core, 1 worker): 1.47 ms. Throughput: 679.61 iter/sec.
Timings for 560K FFT length (2 cores, 1 worker): 0.80 ms. Throughput: 1253.39 iter/sec.
Timings for 560K FFT length (3 cores, 1 worker): 0.55 ms. Throughput: 1805.90 iter/sec.
Timings for 560K FFT length (4 cores, 1 worker): 0.45 ms. Throughput: 2235.83 iter/sec.
Timings for 560K FFT length (5 cores, 1 worker): 0.35 ms. Throughput: 2841.27 iter/sec.
Timings for 560K FFT length (6 cores, 1 worker): 0.32 ms. Throughput: 3170.52 iter/sec.
Timings for 560K FFT length (7 cores, 1 worker): 0.29 ms. Throughput: 3488.97 iter/sec.
Timings for 560K FFT length (8 cores, 1 worker): 0.27 ms. Throughput: 3742.76 iter/sec.
Timings for 560K FFT length (9 cores, 1 worker): 0.25 ms. Throughput: 3945.33 iter/sec.
Timings for 560K FFT length (10 cores, 1 worker): 0.24 ms. Throughput: 4107.49 iter/sec.
Code:
25W
Timings for 560K FFT length (1 core, 1 worker): 1.60 ms. Throughput: 625.15 iter/sec.
Timings for 560K FFT length (2 cores, 1 worker): 1.09 ms. Throughput: 915.82 iter/sec.
Timings for 560K FFT length (3 cores, 1 worker): 0.87 ms. Throughput: 1151.95 iter/sec.
Timings for 560K FFT length (4 cores, 1 worker): 0.73 ms. Throughput: 1363.88 iter/sec.
Timings for 560K FFT length (5 cores, 1 worker): 0.67 ms. Throughput: 1500.90 iter/sec.
Timings for 560K FFT length (6 cores, 1 worker): 0.63 ms. Throughput: 1588.71 iter/sec.
Timings for 560K FFT length (7 cores, 1 worker): 0.60 ms. Throughput: 1673.46 iter/sec.
Timings for 560K FFT length (8 cores, 1 worker): 0.58 ms. Throughput: 1732.06 iter/sec.
Timings for 560K FFT length (9 cores, 1 worker): 0.58 ms. Throughput: 1718.79 iter/sec.
Timings for 560K FFT length (10 cores, 1 worker): 0.57 ms. Throughput: 1748.89 iter/sec.
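Plugging the throughput columns above into a quick script shows how much sooner scaling rolls off under the 25W cap:

```python
# Throughput (iter/sec) for 1..10 cores, 1 worker, 560K FFT, from the post above
ips_125w = [679.61, 1253.39, 1805.90, 2235.83, 2841.27,
            3170.52, 3488.97, 3742.76, 3945.33, 4107.49]
ips_25w = [625.15, 915.82, 1151.95, 1363.88, 1500.90,
           1588.71, 1673.46, 1732.06, 1718.79, 1748.89]

for label, ips in (("125W", ips_125w), ("25W", ips_25w)):
    # speedup relative to a single core
    speedup = [x / ips[0] for x in ips]
    print(label, " ".join(f"{s:.2f}" for s in speedup))
```

At 125W ten cores deliver about 6.0x the single-core throughput; at 25W only about 2.8x, with essentially nothing gained past 8 cores, which fits the "slower cores through a fixed memory pipe" picture.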

