mersenneforum.org > Great Internet Mersenne Prime Search > Hardware

Old 2020-10-03, 03:12   #1
Xyzzy ("Mike", Aug 2002)
i9 observations

We have a new toy (i9-10900KF) to play with.

It uses a lot of power if you let it. (With our setup it will take >250W without thermal throttling!)

We have attached an interesting chart.

The CPU is in a small case with a 2060 Super running at 125W. The CPU is cooled by an AIO liquid cooler. The case has several big fans.

We set the BIOS to obey the Intel specifications for this CPU, which are 125W PL1 and 250W PL2. By using Intel's XTU program, we can modify the power limits in real time. In the chart you can see that the wattage is capped when it hits 125W. In the lower part of the chart, we introduce lower power caps of 100, 75, 50, 25 and 9 watts. (9W is apparently the lowest you can go with 10 cores.)
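
(XTU is Windows-only; on a Linux box the same long-term/short-term package limits are usually exposed through the intel_rapl powercap interface in sysfs, so an equivalent runtime cap looks roughly like the sketch below. This is a minimal sketch only: the paths and constraint ordering are typical but can vary by kernel and platform, it needs root, and some boards lock the limits in firmware so writes may be ignored.)

Code:
#!/usr/bin/env python3
# Sketch: set the package PL1/PL2 power limits via the Linux intel_rapl
# powercap interface (an alternative to Intel XTU, which is Windows-only).
# Paths and constraint order are typical, not guaranteed; run as root.
import glob

def set_package_power_limits(pl1_watts=None, pl2_watts=None):
    for zone in sorted(glob.glob("/sys/class/powercap/intel-rapl:*")):
        with open(f"{zone}/name") as f:
            if not f.read().strip().startswith("package"):
                continue  # skip core/uncore/dram sub-zones
        # constraint_0 is normally the long-term limit (PL1),
        # constraint_1 the short-term limit (PL2); values are in microwatts.
        for idx, watts in ((0, pl1_watts), (1, pl2_watts)):
            if watts is not None:
                with open(f"{zone}/constraint_{idx}_power_limit_uw", "w") as f:
                    f.write(str(int(watts * 1_000_000)))
                print(f"{zone}: constraint_{idx} set to {watts} W")

if __name__ == "__main__":
    set_package_power_limits(pl1_watts=100)  # e.g. mimic the 100 W cap from the chart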

We color-coded lines that kinda match up when looking at the ms/iteration column. FWIW, this is all with a ~10M exponent and a 560K FFT.
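
For scale: a PRP test of M_p takes roughly one squaring per bit of the exponent, i.e. about p iterations, so the ms/iteration column converts directly to wall-clock time. A quick sketch, using round numbers close to the 10-core timings posted later in the thread:

Code:
# A PRP test of M_p needs ~p squaring iterations, so total time ~ p * (ms/iter).
def prp_hours(exponent, ms_per_iter):
    return exponent * ms_per_iter / 1000 / 3600

for ms in (0.24, 0.57):
    print(f"{ms} ms/iter -> {prp_hours(10_000_000, ms):.1f} hours for a ~10M exponent")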

We might have missed something so if you see something weird or wrong let us know.

Your observations are appreciated.

PS - We know Intel < AMD for this workload.
Attached Thumbnails: i9.png (26.5 KB), computer.jpg (171.8 KB)

Old 2020-10-03, 04:26   #2
VBCurtis ("Curtis", Feb 2005, Riverside, CA)

100W and 125W having the same ms/iter suggests the 100W setting is already saturating the memory bandwidth, so for P95 work there's little reason to run at more than 100W (or, if you forgot to enable XMP, re-test after turning it on).
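
A rough sanity check on that: once the aggregate working set spills out of L3, each iteration has to stream the FFT data from RAM a few times. The sketch below assumes 8 bytes per FFT element, about 3 read+write passes per iteration, and dual-channel DDR4-3200; all of those are guesses rather than measured Prime95 behaviour, and the 4100 iter/s figure is the 10-core, 125 W number posted later in the thread.

Code:
# Back-of-envelope: estimated RAM traffic once the FFT data no longer fits in L3.
def est_bandwidth_gb_s(fft_len_k, iters_per_sec, passes=3):
    bytes_per_iter = fft_len_k * 1024 * 8 * passes  # 8 bytes per element, 'passes' sweeps
    return bytes_per_iter * iters_per_sec / 1e9

# 560K FFT at ~4100 iter/s aggregate (10 cores at 125 W) vs ~51 GB/s theoretical
# peak for dual-channel DDR4-3200.
print(f"~{est_bandwidth_gb_s(560, 4100):.0f} GB/s estimated vs ~51 GB/s available")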

If there are more settings available, you might throttle a bit lower than 100 and still get nearly-full or full performance.

I wonder how different this effect is with an FFT ten times as big.

Old 2020-10-03, 07:22   #3
mackerel (Feb 2016, UK)

How many workers were used? At 560k FFT, one worker should fit in CPU cache and not be memory bound, but that FFT is quite small so a large number of cores may be inefficient. On the other end, 10 workers would almost certainly be memory bound. 2 and 5 workers are the other logical steps in between.
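
Rough footprint arithmetic behind that, assuming ~8 bytes per FFT element and ignoring Prime95's scratch and table overhead:

Code:
# Approximate working-set size per worker vs the 10900KF's 20 MB L3.
def fft_mb(fft_len_k):
    return fft_len_k * 1024 * 8 / 2**20  # doubles, 8 bytes each

for workers in (1, 2, 5, 10):
    total = workers * fft_mb(560)            # 560K FFT ~ 4.4 MB per worker
    verdict = "fits in" if total <= 20 else "spills out of"
    print(f"{workers:2d} worker(s) ~ {total:5.1f} MB -> {verdict} 20 MB L3")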

Old 2020-10-03, 08:09   #4
S485122 (Sep 2006, Brussels, Belgium)

Quote:
Originally Posted by VBCurtis
100W and 125W having the same ms/iter suggests the 100W setting is already saturating the memory bandwidth.
...
Those i9-10xxx X-series CPUs support quad-channel memory and, as mackerel remarked, the FFT size means memory will not be solicited much.

Those CPUs come with two AVX-512 FMA units: IMHO that is the limiting factor.

I have an i9-10920X which I limited to 3 GHz (3.5 GHz is nominal) AND to a power draw of 140 W (165 W being nominal). A 2880K FFT requires 0.93 ms per iteration (one worker, twelve cores) on those settings. The CPU is then at a bit less than 80% utilisation (38% if taking hyperthreading into account).

Jacob

Old 2020-10-03, 11:33   #5
Xyzzy ("Mike", Aug 2002)

Quote:
Originally Posted by VBCurtis
I wonder how different this effect is with an FFT ten times as big.
We will test that later today.

Quote:
Originally Posted by mackerel
How many workers were used?
One worker per core.


Old 2020-10-03, 14:01   #6
mackerel (Feb 2016, UK)

Quote:
Originally Posted by S485122
Those i9-10xxx X-series CPUs support quad-channel memory and, as mackerel remarked, the FFT size means memory will not be solicited much.

Those CPUs come with two AVX-512 FMA units: IMHO that is the limiting factor.
The model mentioned is a consumer one, dual channel, no AVX-512.

Quote:
Originally Posted by Xyzzy
One worker per core.
Probably RAM bandwidth limited. Try some other combinations. I'd guess 3 workers of 3 cores each is likely better, even if that leaves a core unused.

Old 2020-10-03, 14:53   #7
S485122 (Sep 2006, Brussels, Belgium)

Quote:
Originally Posted by mackerel
The model mentioned is a consumer one, dual channel, no AVX-512.
...
Indeed, I thought it was an i9-10900X :-( (A wee bit of dyslexia, too quick to answer ... sloppiness.) Sorry.

Jacob

Old 2020-10-03, 15:39   #8
kriesel ("TF79LL86GIMPS96gpu17", Mar 2017, US midwest)

Quote:
Originally Posted by Xyzzy
We will test that later today.

One worker per core.

I suggest turning mprime benchmarking loose to determine the optimal-throughput number of workers, at a fixed power setting, at the first-test wavefront PRP FFT length. In my experience, on a variety of CPU models old and new, it's unlikely to be 1 core per worker. For the most reliable results, minimize other system activity throughout the benchmarking run. Enjoy your toy!


Old 2020-10-03, 17:10   #9
Aramis Wyler ("Bill Staffen", Jan 2013, Pittsburgh, PA, USA)

I agree - 1 core per worker would be optimal in a perfect world with infinite L3 cache, but you only have a 20 MB cache, so there is no way in hell you're fitting 10 PRPs in there.


You might get an overall increase in throughput with just 2 workers, because even if it isn't as efficient per core you would be entirely on the chip. I know that thing has quad-channel memory, so it might be faster running 10 workers against system RAM, but it really might not be, either. Staying on the chip is a big advantage, and that's why I got the 6-core Ryzen 5 instead of the 8-core Ryzen 7 - they have the same 32 MB cache and the Ryzen 5 cost a lot less.


Old 2020-10-03, 18:35   #10
Xyzzy ("Mike", Aug 2002)

We ran two benchmarks.

The first is with a power limit of 25W. The second is at 250W. We turned off the short term "turbo" thingie.

In all cases, using one worker yields the best throughput. We did not test hyper-threading.

With the 250W limiter a different limit kicks in at around 145W. It is called the "current/EDP" limit. We haven't messed around with changing that yet. It sounds kinda scary.

As more cores are added and the power draw increases, the processor automatically drops its core and cache frequencies. The memory frequency stays fixed at all times.

Perhaps the 25W benchmark is able to use more cores because it is jamming less data per (slower) core through a fixed (memory) pipe.

So far here are the best timings:

25W limit:
Timings for 6144K FFT length (8 cores, 1 worker): 7.69 ms. Throughput: 129.97 iter/sec.

250W limit:
Timings for 6144K FFT length (4 cores, 1 worker): 6.89 ms. Throughput: 145.17 iter/sec.


This is all with a 6144K (6M) FFT which should be at the wavefront for first-time (110M) PRP work. We only tested one 6M FFT variant to save time, so these timings are probably not optimal.

For future benchmarking, to save time, we will only investigate one worker per instance.
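
Purely as a per-watt comparison of those two best results (using the nominal caps as the denominator, so the 250W figure is understated given the ~145W EDP limit, and none of this counts the rest of the system):

Code:
# Iterations per second per watt for the two best 6144K results above.
results = {25: 129.97, 250: 145.17}  # power cap (W) -> best throughput (iter/s)
for watts, iters in results.items():
    print(f"{watts:3d} W cap: {iters / watts:.2f} iter/s per watt")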

Attached Files: 25w.txt (6.1 KB), 250w.txt (6.1 KB)

Old 2020-10-03, 19:03   #11
Xyzzy ("Mike", Aug 2002)

Here is the data for a 560K FFT AKA 10M C-PRP.
Code:
125W
Timings for 560K FFT length (1 core, 1 worker):  1.47 ms.  Throughput: 679.61 iter/sec.
Timings for 560K FFT length (2 cores, 1 worker):  0.80 ms.  Throughput: 1253.39 iter/sec.
Timings for 560K FFT length (3 cores, 1 worker):  0.55 ms.  Throughput: 1805.90 iter/sec.
Timings for 560K FFT length (4 cores, 1 worker):  0.45 ms.  Throughput: 2235.83 iter/sec.
Timings for 560K FFT length (5 cores, 1 worker):  0.35 ms.  Throughput: 2841.27 iter/sec.
Timings for 560K FFT length (6 cores, 1 worker):  0.32 ms.  Throughput: 3170.52 iter/sec.
Timings for 560K FFT length (7 cores, 1 worker):  0.29 ms.  Throughput: 3488.97 iter/sec.
Timings for 560K FFT length (8 cores, 1 worker):  0.27 ms.  Throughput: 3742.76 iter/sec.
Timings for 560K FFT length (9 cores, 1 worker):  0.25 ms.  Throughput: 3945.33 iter/sec.
Timings for 560K FFT length (10 cores, 1 worker):  0.24 ms.  Throughput: 4107.49 iter/sec.
Code:
25W
Timings for 560K FFT length (1 core, 1 worker):  1.60 ms.  Throughput: 625.15 iter/sec.
Timings for 560K FFT length (2 cores, 1 worker):  1.09 ms.  Throughput: 915.82 iter/sec.
Timings for 560K FFT length (3 cores, 1 worker):  0.87 ms.  Throughput: 1151.95 iter/sec.
Timings for 560K FFT length (4 cores, 1 worker):  0.73 ms.  Throughput: 1363.88 iter/sec.
Timings for 560K FFT length (5 cores, 1 worker):  0.67 ms.  Throughput: 1500.90 iter/sec.
Timings for 560K FFT length (6 cores, 1 worker):  0.63 ms.  Throughput: 1588.71 iter/sec.
Timings for 560K FFT length (7 cores, 1 worker):  0.60 ms.  Throughput: 1673.46 iter/sec.
Timings for 560K FFT length (8 cores, 1 worker):  0.58 ms.  Throughput: 1732.06 iter/sec.
Timings for 560K FFT length (9 cores, 1 worker):  0.58 ms.  Throughput: 1718.79 iter/sec.
Timings for 560K FFT length (10 cores, 1 worker):  0.57 ms.  Throughput: 1748.89 iter/sec.
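
For anyone repeating this, the per-core scaling can be pulled straight out of lines in that format. A small sketch; the regex assumes exactly the wording shown above, which may differ between Prime95 versions:

Code:
import re

# Parse "Timings for ... (N cores, 1 worker): X ms. Throughput: Y iter/sec." lines
# and show how much throughput each additional core buys.
LINE = re.compile(r"\((\d+) cores?, \d+ workers?\):\s*([\d.]+) ms\.\s*"
                  r"Throughput:\s*([\d.]+) iter/sec")

def scaling(text):
    rows = sorted((int(c), float(t)) for c, _ms, t in LINE.findall(text))
    prev = None
    for cores, thru in rows:
        gain = "" if prev is None else f"  (+{thru - prev:.0f} iter/s for the extra core)"
        print(f"{cores:2d} cores: {thru:8.1f} iter/s{gain}")
        prev = thru

if __name__ == "__main__":
    scaling("""Timings for 560K FFT length (1 core, 1 worker):  1.47 ms.  Throughput: 679.61 iter/sec.
Timings for 560K FFT length (2 cores, 1 worker):  0.80 ms.  Throughput: 1253.39 iter/sec.""")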