mersenneforum.org Expectations of Ryzen Threadripper 6000 Series

2021-07-27, 03:01   #23
tuckerkao

"Tucker Kao"
Jan 2020

2⁵×3×7 Posts

Quote:
 Originally Posted by M344587487 The leaked Cinebench score seems too good to be true for the 12900k, at least my expectation of the LITTLE in big.LITTLE is that it would not be so performant. I guess removing AVX512 and SMT does make quite a large difference to the die size of the LITTLE. In any case, unless DDR5 speeds ramp very quickly the 12900k (dual channel ddr5?) should sit somewhere between the 5950X (dual channel DDR4) and 5970X (quad channel DDR4) for Prime95. Another question mark is if AVX512 and all 16 cores can be used simultaneously, it's my understanding that windows at least in its current state cannot and I don't know about Linux. Maybe pinning to the rescue?
Windows 11 will be available around Thanksgiving. The Intel 12900K will probably be the most suitable CPU for the new operating system.

Until someone tests its performance on Prime95, it's hard to predict its efficiency and benchmark results this early.

Last fiddled with by tuckerkao on 2021-07-27 at 03:21

2021-07-27, 09:50   #24
M344587487

"Composite as Heck"
Oct 2017

1551₈ Posts

I was predicting throughput, which is pretty easy as it's capped by memory bandwidth. Bandwidth per iteration is also affected by L3 but was originally ignored for simplification; if we take the 12900K's much lower L3 cache into consideration, it may lean closer to the 5950X in performance than the 5970X.

Efficiency is much harder to predict. If left to its own devices with standard workloads the 12900K would likely be a power-hungry monster (less so than the previous gen, as there are more cores operating under a similar budget, so on average they should be better placed on the efficiency curve, offering more performance per watt; the node is also better). Being memory-bandwidth limited is better for efficiency, as the cores can be even better placed on the curve. A low-tier Alder Lake part is one to watch.

Intel are normally very good at making sure Linux kernel support is in place well ahead of a product launch, though they were slipping a little when 10nm was first introduced. The margins are smaller with Alder Lake than we're used to, but basic support is already there and has been for months. If anything I expect more teething problems with Windows, which historically has had trouble adapting to non-uniform architectures (they may have overcome this problem by now).
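The bandwidth-cap argument above can be sketched numerically. This is only an illustration: the bandwidth and per-iteration traffic figures below are hypothetical placeholders, not measurements from any of the parts discussed.

```python
# Hypothetical sketch of throughput capped by memory bandwidth: if each
# Prime95 iteration must stream some volume of FFT data through DRAM,
# the iteration rate cannot exceed bandwidth / traffic-per-iteration.
def iters_per_sec(dram_bandwidth_gbs: float, traffic_per_iter_mb: float) -> float:
    return dram_bandwidth_gbs * 1024 / traffic_per_iter_mb

# Made-up figures purely to show the scaling: doubling the memory
# channels doubles the throughput ceiling, which is why a dual-channel
# DDR5 part is expected to land between dual- and quad-channel DDR4.
dual_channel = iters_per_sec(50, 150)
quad_channel = iters_per_sec(100, 150)
print(dual_channel, quad_channel)
```

A larger L3 cache reduces the effective per-iteration DRAM traffic, which is the second knob in the argument above.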
2021-07-27, 13:27   #25
drkirkby

"David Kirkby"
Jan 2021
Althorne, Essex, UK

2⁶×7 Posts

Quote:
 Originally Posted by ewmayer @drkirby: I believe once you get beyond 500-1000 memory buffers in stage 2, you get rapidly diminishing returns, as the stage 2 prime-pairing percentage approaches 100%. Do your mprime logs list #bufs and pairing %? Suggest you compare them for several different "how much mem" settings.
I've run quite a few benchmarks on M105211111 and saved the output of mprime -d to a file. The nearest I get to anything like you describe is lines similar to this.
Code:
[Worker #1 Jul 26 23:23] D: 2730, relative primes: 7107, stage 2 primes: 3087325, pair%=95.65
I do not see any lines with the text "buf" in them. I'm in the process of putting all the data into a spreadsheet. I was not including the pair percentage figure, but I can add it. I've tested:
• Saved primality tests of 1, 1.05 (reported by mprime as 1.1) and 2.
• RAM constrained to 8, 16, 32, 64, 128, 256 and 368 GB, although mprime never used more than 304 GB, which happened when the number of saved primality tests was 2.
For each test, 13 cores were used.

Unfortunately I had not realised it was possible to save the times to the nearest second, so all timings have an uncertainty of 1 minute. But in all but one case, increasing the RAM made 4 changes:
• B1 increased
• B2 increased
• Probability of finding a factor increased
• Runtime increased
There was one exception, which occurred when the saved primality tests were 2 and the available RAM increased from 256 GB to 368 GB. Then:
• B1 decreased
• B2 increased
• Probability of finding a factor remained unchanged at 4.66%
• Runtime increased from 222 to 224 minutes
• RAM used increased to 304 GB - it did not quite use all the 368 GB permitted
I did not expect to see B1 decrease with increasing RAM, but it did in this instance. Given the uncertainty in my timings and the resolution of the reported probability of finding a factor, that could well explain this anomaly; otherwise it would indicate that for that test the extra RAM was detrimental, as it increased the runtime without increasing the probability of finding a factor.

Sort of related - a power outage whilst benchmarking a PRP test.
Perhaps you, or someone else, can answer a query I have about benchmarking a PRP test during which a power outage occurred. I want to estimate the runtime as accurately as possible, but to do so I need to know how much time was lost due to the outage. The last entry in the log before the power outage that was relevant to this exponent was:
Code:
 [Worker #1 Jul 27 11:29] Iteration: 11000000 / 105211111 [10.455169%], ms/iter:  3.078, ETA: 3d 08:33
The modification dates, times and sizes of the files were:

Code:
-rw-rw-r-- 1 drkirkby drkirkby   26303452 Jul 27 11:29 p105211111
-rw-rw-r-- 1 drkirkby drkirkby   39454848 Jul 27 11:15 p105211111.bu
-rw-rw-r-- 1 drkirkby drkirkby   26303452 Jul 27 10:38 p105211111.bu2
-rw-rw-r-- 1 drkirkby drkirkby   26303452 Jul 27 09:46 p105211111.bu3
-rw-rw-r-- 1 drkirkby drkirkby   26303452 Jul 27 08:54 p105211111.bu4
-rw-rw-r-- 1 drkirkby drkirkby 3366780928 Jul 27 11:13 p105211111.residues
mprime was then restarted. The log shows these lines relevant to the exponent:

Code:
[Main thread Jul 27 12:05] Mersenne number primality test program version 30.6
[Worker #1 Jul 27 12:05] Resuming Gerbicz error-checking PRP test of M105211111 using AVX-512 FFT length 5600K, Pass1=896, Pass2=6400, clm=1, 13 threads
[Worker #1 Jul 27 12:05] PRP proof using power=8 and 64-bit hash size.
[Worker #1 Jul 27 12:05] Proof requires 3.4GB of temporary disk space and uploading a 118MB proof file.
[Worker #1 Jul 27 12:05] Iteration: 11000001 / 105211111 [10.455170%].
[Worker #1 Jul 27 12:10] Iteration: 11100000 / 105211111 [10.550216%], ms/iter:  3.059, ETA: 3d 07:58
I'm trying to work out how much time was lost. The last modification time on p105211111 was 11:29, and there's evidence of the test resuming at 12:05, so I'm tempted to believe that 36 minutes were lost. Is that the best estimate?
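The 36-minute figure implied by the two timestamps can be checked mechanically. A minimal sketch, using only the clock times copied from the log lines above (Python's parser defaults the year, which does not affect the difference):

```python
from datetime import datetime

# Timestamps taken from the quoted mprime log: the last checkpoint
# write before the outage, and the first log line after restart.
last_checkpoint = datetime.strptime("Jul 27 11:29", "%b %d %H:%M")
resume = datetime.strptime("Jul 27 12:05", "%b %d %H:%M")

lost_minutes = (resume - last_checkpoint).total_seconds() / 60
print(f"Estimated time lost: {lost_minutes:.0f} minutes")  # 36 minutes
```

This is an upper bound on the lost compute time, since the run only rewinds to the last saved iteration, not to the moment power was lost.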

Last fiddled with by drkirkby on 2021-07-27 at 13:31

2021-07-27, 14:16   #26
axn

Jun 2003

1491₁₆ Posts

Quote:
 Originally Posted by drkirkby The nearest I get to anything like you describe is lines similar to this. Code: [Worker #1 Jul 26 23:23] D: 2730, relative primes: 7107, stage 2 primes: 3087325, pair%=95.65 I do not see any lines with the text "buf" in them.
The "relative primes" is the number of buffers (i.e fft arrays) being used. For 5.5m FFT, 44MB per buffer, the allocation is 312708 MB. On top of it, it uses 50-100MB of memory for prime bitmaps and such.

2021-07-29, 07:42   #27
tuckerkao

"Tucker Kao"
Jan 2020

2⁵·3·7 Posts

Quote:
 Originally Posted by M344587487 If we take the 12900k's much lower L3 cache into consideration it may lean closer to the 5950X in performance than the 5970X. Efficiency is much harder to predict. If left to its own devices with standard workloads the 12900k would likely be a power hungry monster
Since the 12900K has P-cores at 5.3 GHz and E-cores at 3.9 GHz, will Prime95 be able to use both efficiently when running P-1 and PRP tests?

2021-07-29, 08:00   #28
Zhangrc

"University student"
May 2021
Beijing, China

197 Posts

Quote:
 Originally Posted by tuckerkao Will Prime95 be able to efficiently use both when running P-1 and PRPs?
It's too early to discuss anything. I suppose Woltman will modify the code in v30.7 to fit the program to those cores as soon as the 12900K is released.

Last fiddled with by Zhangrc on 2021-07-29 at 08:01

2021-07-29, 10:40   #29
M344587487

"Composite as Heck"
Oct 2017

3²·97 Posts

Quote:
 Originally Posted by tuckerkao Since 12900k has the P-Core at 5.3 GHz and E-Core at 3.9 GHz, will Prime95 be able to efficiently use both when running P-1 and PRPs?
I have a feeling that at the core level (ignoring SMT and AVX512) there's not much difference between P and E cores. With all 16 cores running, the P cores won't be going at 5.3GHz, and for anything that efficiently uses more than 8 cores the best configuration is probably to have all cores running at similar clocks (again, all other things equal). Calling it big.LITTLE is clever marketing, but I think the true goal is to deprecate AVX512, or at least make it less of a die-hog so they can better compete where they're weak. Disabling SMT on the E cores doesn't fit well with that theory (as it makes the cores less uniform).

Prime95 is a rare case in that it can use AVX512. There are many configurations that need extensive testing to find the most efficient, these being the main ones:
• 2 workers, AVX512 on P cores AVX2 on E cores
• 2 workers, AVX2 on P cores AVX2 on E cores
• AVX2 across a single worker on P and E cores
• 1 worker, AVX512 on P cores E cores idle
• 1 worker, AVX512 on P cores with the E cores doing something light on memory
Any other guesses anyone wants to throw into the ring?

2021-07-29, 19:00   #30
mackerel

Feb 2016
UK

2³×5×11 Posts

I don't think Intel want to get rid of AVX-512, as they're still busy stuffing it in where they can. How support for it will be handled in Alder Lake remains to be seen, and I think that is a software (OS) problem, not a hardware one. Keep in mind the consumer implementations (mobile, Rocket Lake) are equivalent to Skylake-X with a single unit, not two: peak FP throughput is in theory half, and no better than AVX2. However, I was pleased to find practical AVX-512 throughput on Rocket Lake was still significantly elevated over Skylake; I forget the exact number but it was over 40% more IPC when not memory-bandwidth limited. On Skylake-X I was recently seeing around 80% from memory, not quite hitting the ideal 100% increase over Skylake.

Anyway, the goal of the small cores is to be small, so dropping HT I think does make sense. They're not meant for high throughput, and they're probably slow enough without making them per-thread slower with HT. I've seen suggestions they're also not as slow as some might think, although where that lies remains to be seen.

Do we know about the construction of Alder Lake yet? Was it going to have the GPU on a separate die, or am I imagining that? I'd guess all cores will be on the same die; we're not going to see AMD chiplet style just yet.

For sufficiently large workloads that don't become memory limited, I'd guess running a single worker on the big cores will probably give optimal performance. I'm guessing the small cores will be slow enough that it isn't worth taking power budget away from the big cores in this type of workload, especially as you will run into scaling problems if you work across them, and may still run into headaches with a second worker. If you want better power efficiency, just power-limit the big cores. Only for light workloads like sieving or Cinebench might using all cores help.
2021-07-29, 19:58   #31
drkirkby

"David Kirkby"
Jan 2021
Althorne, Essex, UK

2⁶·7 Posts

Quote:
 Originally Posted by axn The "relative primes" is the number of buffers (i.e fft arrays) being used. For 5.5m FFT, 44MB per buffer, the allocation is 312708 MB. On top of it, it uses 50-100MB of memory for prime bitmaps and such.
Thank you.

On Sunday I will probably sort out the rest of my spreadsheet with all the timings. I inadvertently overlooked one P-1 test as I was manually changing worktodo.txt; I then realised I needed to automate that task. I'm now running a PRP test on the same exponent I ran the P-1 tests on. Once I know the time to complete the PRP test, I'll determine whether giving mprime lots of RAM on my machine is a good or bad idea. I possibly hit a problem when increasing the available memory above 256 GB with 2 saved tests: the reported chance of finding a factor was the same as at 256 GB, but the runtime increased by a couple of minutes, B1 dropped, and B2 went up. However, I need a couple more digits of resolution on the chance of finding a factor to draw any meaningful conclusions.

Looking at the source code, I believe that the optimal bounds of B1 and B2 for P-1 are based on an assumption of a 1.8% error rate on the tests. PRP should be a lot better than LL.

I would not be too surprised if the optimal RAM allocation depends on things other than just the amount of RAM available. Different people have different numbers of memory channels, different RAM speeds, etc. My RAM is only clocked at 2400 MHz, which is a limit of the CPU. Other people's RAM might be much faster. I do have some concerns about the speed of the RAM in my computer, and someone else expressed similar concerns about a very similar Dell machine. Here are his complaints about the Dell 7910:
https://www.dell.com/community/Preci...w/td-p/7778630
https://www.dell.com/community/Preci...l/td-p/7945377

I don't know how useful Passmark is as a benchmark for RAM. Ultimately, the best benchmark is to run the application one wants to use.

2021-07-29, 20:15   #32
ewmayer
2ω=0

Sep 2002
República de California

2DA7₁₆ Posts

Quote:
 Originally Posted by drkirkby I've run quite a few benchmarks on M105211111 and saved the output of mprime -d to a file. The nearest I get to anything like you describe is lines similar to this. Code: [Worker #1 Jul 26 23:23] D: 2730, relative primes: 7107, stage 2 primes: 3087325, pair%=95.65 I do not see any lines with the text "buf" in them.
Axn beat me to it - does his 300-some-GB estimate jibe with what your system load app shows?

The 'pair%=95.65' means that out of every 10000 stage 2 primes, on average 9565 end up in bitmap pairs, thus you need 9565/2 + 435 ~= 5217 modmuls to process them. In the limit of infinite RAM, 100%-paired would mean 5000 modmuls per 10000 stage 2 primes. Thus there's only a potential 4% speedup remaining. My guess is that if you doubled the memory allocation from your current, you might see a 1% speedup.
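The modmul arithmetic above can be reproduced directly; the 95.65 figure is the pair% from the quoted log line:

```python
# Out of every 10000 stage 2 primes, the paired fraction is processed
# two-per-modmul and the unpaired remainder one-per-modmul.
def modmuls_per_10000(pair_pct: float) -> float:
    paired = 10000 * pair_pct / 100
    unpaired = 10000 - paired
    return paired / 2 + unpaired

current = modmuls_per_10000(95.65)  # 9565/2 + 435 = 5217.5
ideal = modmuls_per_10000(100.0)    # 5000 at perfect pairing
print(f"remaining speedup: {(current - ideal) / current:.1%}")
```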

2021-07-29, 21:54   #33
drkirkby

"David Kirkby"
Jan 2021
Althorne, Essex, UK

1C0₁₆ Posts

Quote:
 Originally Posted by ewmayer Axn beat me to it - does his 300-some-GB estimate jibe with what your system load app shows?
Yes, axn's estimate was pretty good. More of the log shows:
Code:
[Worker #1 Jul 26 23:23] Available memory is 376520MB.
[Worker #1 Jul 26 23:23] D: 2730, relative primes: 7107, stage 2 primes: 3087325, pair%=95.65
[Worker #1 Jul 26 23:23] Using 311330MB of memory.
I would say the 312708 MB estimate was a pretty damn good one. The FFT length was actually 5600K.
Quote:
 Originally Posted by ewmayer The 'pair%=95.65' means that out of every 10000 stage 2 primes, on average 9565 end up in bitmap pairs, thus you need 9565/2 + 435 ~= 5217 modmuls to process them. In the limit of infinite RAM, 100%-paired would mean 5000 modmuls per 10000 stage 2 primes. Thus there's only a potential 4% speedup remaining. My guess is that if you doubled the memory allocation from your current, you might see a 1% speedup.
Note, however, that mprime only used 304 of the 368 GB it was permitted to use, so I don't think even an infinite amount of RAM would have increased the percentage to 100%. That was based on saving 2 primality tests. Normally as the available RAM increased, B1, B2 and the chance of finding a factor all increased, but that was not so when saving 2 tests with 368 GB of RAM: the increased availability of RAM caused a drop in B1.

When I was doing the P-1 based on saving 1 primality test, with 256 GB allowed for P-1
Code:
[Worker #1 Jul 25 08:59] M105211111 stage 1 complete. 1252604 transforms. Time: 1941.373 sec.
[Worker #1 Jul 25 08:59] Starting stage 1 GCD - please be patient.
[Worker #1 Jul 25 08:59] Stage 1 GCD complete. Time: 47.141 sec.
[Worker #1 Jul 25 08:59] Available memory is 262144MB.
[Worker #1 Jul 25 08:59] D: 2310, relative primes: 4743, stage 2 primes: 1313481, pair%=94.34
[Worker #1 Jul 25 08:59] Using 207828MB of memory.
the maximum ever used was 202 GB, so mprime is not using all the RAM.
Quote:
 Originally Posted by ewmayer p.s.: Your 36-minutes-lost estimate for your power-outage looks right.
Thank you. I'm keen to know the time of the PRP test as accurately as possible, although I'm still using the computer for non-CPU-intensive tasks. Once I know:
1. Probability of finding a factor during the P-1 factoring
2. Time for P-1 factoring
3. Time for PRP test
I should be able to determine whether giving mprime more RAM was actually beneficial or not, since while increased RAM usually increased the chance of finding a factor, it was always accompanied by an increase in run time.
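The comparison described above can be sketched as an expected-time calculation. The figures below are hypothetical placeholders (only the 222/224-minute runtimes and 4.66% probability echo numbers quoted earlier in the thread; the PRP time is made up), not measurements:

```python
# A hedged sketch of the trade-off: more RAM raises both the chance of
# finding a factor (p_factor) and the P-1 runtime; the win is whether
# the expected total time drops.
def expected_total_minutes(p1_minutes: float, p_factor: float,
                           prp_minutes: float) -> float:
    # If P-1 finds a factor (probability p_factor), the PRP test is skipped.
    return p1_minutes + (1 - p_factor) * prp_minutes

# Hypothetical placeholder figures for two RAM settings.
low_ram = expected_total_minutes(p1_minutes=222, p_factor=0.0466,
                                 prp_minutes=5000)
high_ram = expected_total_minutes(p1_minutes=224, p_factor=0.0466,
                                  prp_minutes=5000)
print(low_ram, high_ram)  # the smaller expected total wins
```

With the probability unchanged between the two settings, as in the anomalous 368 GB run, the extra P-1 runtime is pure loss, which matches the conclusion drawn above.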

Last fiddled with by drkirkby on 2021-07-29 at 21:57
