2021-07-27, 03:01  #23
"Tucker Kao"
Jan 2020
Head Base M168202123
2^{2}×107 Posts 
Quote:
Until someone tests its performance on Prime95, it's hard to predict its efficiency and benchmark results this early.
Last fiddled with by tuckerkao on 2021-07-27 at 03:21

2021-07-27, 09:50  #24
"Composite as Heck"
Oct 2017
34F_{16} Posts 
I was predicting throughput, which is pretty easy as it's capped by memory bandwidth. Bandwidth per iteration is also affected by L3, which I originally ignored for simplicity; if we take the 12900K's much smaller L3 cache into consideration, it may lean closer to the 5950X in performance than the 5970X.
Efficiency is much harder to predict. If left to its own devices with standard workloads, the 12900K would likely be a power-hungry monster (less so than the previous gen, as there are more cores operating under a similar budget, so they should on average be better placed on the efficiency curve and offer more performance per watt; the node is also better). Being memory-bandwidth limited is better for efficiency, as the cores can be even better placed on the curve. A low-tier Alder Lake part is one to watch.
Intel are normally very good at making sure there's Linux kernel support in place well ahead of a product launch, although they were slipping a little when 10nm was first introduced. The margins are smaller with Alder Lake than we're used to, but basic support is already there and has been for months. If anything, I expect more teething problems with Windows, which historically has had trouble adapting to non-uniform architectures (they may have overcome this problem by now).
2021-07-27, 13:27  #25
"David Kirkby"
Jan 2021
Althorne, Essex, UK
3·149 Posts 
Quote:
Code:
[Worker #1 Jul 26 23:23] D: 2730, relative primes: 7107, stage 2 primes: 3087325, pair%=95.65
Unfortunately I had not realised it was possible to save the times to the nearest second, so all timings have an uncertainty of 1 minute. But in all but one case, increasing the RAM made 4 changes
Sort of related - a power outage whilst benchmarking a PRP test.
Perhaps you, or someone else can answer a query I have about benchmarking a PRP test when a power outage occurred. I want to estimate the runtime as accurately as possible, but to do so I need to know the amount of time lost due to the power outage. The last entry in the log before the power outage, which was relevant to this exponent, was
Code:
[Worker #1 Jul 27 11:29] Iteration: 11000000 / 105211111 [10.455169%], ms/iter: 3.078, ETA: 3d 08:33
Code:
-rw-rw-r-- 1 drkirkby drkirkby 26303452 Jul 27 11:29 p105211111
-rw-rw-r-- 1 drkirkby drkirkby 39454848 Jul 27 11:15 p105211111.bu
-rw-rw-r-- 1 drkirkby drkirkby 26303452 Jul 27 10:38 p105211111.bu2
-rw-rw-r-- 1 drkirkby drkirkby 26303452 Jul 27 09:46 p105211111.bu3
-rw-rw-r-- 1 drkirkby drkirkby 26303452 Jul 27 08:54 p105211111.bu4
-rw-rw-r-- 1 drkirkby drkirkby 3366780928 Jul 27 11:13 p105211111.residues
Code:
[Main thread Jul 27 12:05] Mersenne number primality test program version 30.6
[Worker #1 Jul 27 12:05] Resuming Gerbicz error-checking PRP test of M105211111 using AVX-512 FFT length 5600K, Pass1=896, Pass2=6400, clm=1, 13 threads
[Worker #1 Jul 27 12:05] PRP proof using power=8 and 64-bit hash size.
[Worker #1 Jul 27 12:05] Proof requires 3.4GB of temporary disk space and uploading a 118MB proof file.
[Worker #1 Jul 27 12:05] Iteration: 11000001 / 105211111 [10.455170%].
[Worker #1 Jul 27 12:10] Iteration: 11100000 / 105211111 [10.550216%], ms/iter: 3.059, ETA: 3d 07:58
Last fiddled with by drkirkby on 2021-07-27 at 13:31
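For anyone wanting to redo the outage arithmetic, here is a rough Python sketch using only the figures in the log lines above. The year (2021) is assumed from the post date, and the `eta` helper is a made-up name, not anything from mprime. Since the log only shows minutes, the lost time carries an uncertainty of about a minute.

```python
from datetime import datetime

# Timestamps copied from the log lines above; the year 2021 is an assumption.
last_before_outage = datetime(2021, 7, 27, 11, 29)   # iteration 11000000 logged
resumed = datetime(2021, 7, 27, 12, 5)               # resumed at iteration 11000001

# Wall-clock gap between the last logged iteration and the resume.
lost_minutes = (resumed - last_before_outage).total_seconds() / 60


def eta(done: int, total: int, ms_per_iter: float) -> str:
    """Format the remaining run time as 'Nd HH:MM' (hypothetical helper)."""
    seconds = (total - done) * ms_per_iter / 1000
    days, rem = divmod(int(seconds), 86400)
    hours, rem = divmod(rem, 3600)
    return f"{days}d {hours:02d}:{rem // 60:02d}"


print(lost_minutes)                        # 36.0
print(eta(11000000, 105211111, 3.078))     # 3d 08:33
```

The second figure reproduces the ETA in the last log entry before the outage, which suggests the ETA is simply remaining iterations times ms/iter.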

2021-07-27, 14:16  #26
Jun 2003
7×17×43 Posts 
The "relative primes" figure is the number of buffers (i.e. FFT arrays) being used. For a 5.5M FFT at 44MB per buffer, the allocation is 312708 MB. On top of that, it uses 50-100MB of memory for prime bitmaps and such.
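The allocation figure above is just buffers times buffer size. A quick sketch in Python, using the 7107 relative primes from the quoted log line and the rounded 44MB per buffer; the function name is made up for illustration and is not part of Prime95.

```python
def stage2_buffer_mb(relative_primes: int, mb_per_buffer: int) -> int:
    """Memory for stage 2 buffers: one FFT-sized array per relative prime."""
    return relative_primes * mb_per_buffer


# 7107 relative primes from the log, 44MB per buffer as quoted above (rounded).
buffers_mb = stage2_buffer_mb(7107, 44)
print(buffers_mb)                           # 312708

# Adding the quoted 50-100MB for prime bitmaps and such gives a rough range.
print(buffers_mb + 50, buffers_mb + 100)    # 312758 312808
```

Because the 44MB figure is rounded, treat the result as an estimate rather than an exact prediction of what the program will report.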

2021-07-29, 07:42  #27
"Tucker Kao"
Jan 2020
Head Base M168202123
654_{8} Posts 
Quote:


2021-07-29, 08:00  #28
"University student"
May 2021
Beijing, China
3·5^{2} Posts 
It's too early to discuss anything. I suppose that Woltman will modify the code in v30.7 to fit the program to those cores as soon as the 12900K is released.
Last fiddled with by Zhangrc on 2021-07-29 at 08:01
2021-07-29, 10:40  #29
"Composite as Heck"
Oct 2017
7×11^{2} Posts 
Quote:
Prime95 is a rare case in that it can use AVX-512. There are many configurations that need extensive testing to find the most efficient, these being the main ones:


2021-07-29, 19:00  #30
Feb 2016
UK
2^{3}·5·11 Posts 
I don't think Intel want to get rid of AVX-512, as they're still busy stuffing it in where they can. How support for it will be handled in Alder Lake remains to be seen, and I think that is a software (OS) problem, not a hardware one. Keep in mind the consumer implementations (mobile, Rocket Lake) are equivalent to a single-unit Skylake-X and don't have 2 units. Peak FP throughput is in theory half, and no better than AVX2. However, I was pleased to find that practical throughput on Rocket Lake was still significantly elevated over Skylake; I forget the exact number, but it was over 40% IPC when not memory bandwidth limited. On Skylake-X I was recently seeing around 80% (from memory), not quite hitting the ideal 100% increase vs Skylake.
Anyway, the goal of the small cores is to be small, so dropping HT does make sense, I think. They're not meant for high throughput, and they're probably slow enough without making them slower per thread with HT. I've seen suggestions that they're not as slow as some might think, although where that lies remains to be seen.
Do we know about the construction of Alder Lake yet? Was it going to have the GPU on a separate die, or am I imagining that? I'd guess all cores will be on the same die. We're not going to see AMD chiplet style just yet.
For sufficiently large workloads that don't become memory limited, I'd guess running a single worker on the big cores will probably give optimal performance. My guess is that the small cores will be slow enough that it isn't worth taking power budget away from the big cores in this type of workload, especially as you will run into scaling problems if you work across them, and may still run into headaches with a 2nd worker. If you want better power efficiency, just power limit the big cores. Only for light workloads like sieve or Cinebench might using all cores help.
2021-07-29, 19:58  #31
"David Kirkby"
Jan 2021
Althorne, Essex, UK
3×149 Posts 
Quote:
Sunday I will probably sort out the rest of my spreadsheet with all the timings. I inadvertently overlooked one P-1 test as I was manually changing worktodo.txt. Then I realised I needed to automate that task. I'm now running a PRP test on the same exponent I ran the P-1 tests on. Once I know the time to complete the PRP test, I'll determine whether giving mprime lots of RAM on my machine is a good or bad idea.
I possibly hit a problem when increasing the available memory above 256 GB with 2 saved tests. The reported chance of finding a factor was the same as at 256 GB, but the runtime increased by a couple of minutes. B1 dropped, and B2 went up. However, I need a couple more digits of resolution on the chance of finding a factor to draw any meaningful conclusions.
Looking at the source code, I believe that the optimal bounds of B1 and B2 for P-1 are based on an assumption of a 1.8% error rate on the tests. PRP should be a lot better than LL.
I would not be too surprised if the optimal RAM allocation depends on things other than just the amount of RAM available. Different people have different numbers of memory channels, different RAM speeds etc. My RAM is only being clocked at 2400 MHz, which is a limit imposed by the CPU. Other people's RAM might be much faster. I do have some concerns about the speed of the RAM in my computer, and someone else expressed similar concerns about a very similar Dell machine. Here's his complaint about the Dell 7910
https://www.dell.com/community/Preci...w/tdp/7778630
and mine about the 7920
https://www.dell.com/community/Preci...l/tdp/7945377
I don't know how useful Passmark is as a benchmark for RAM. Ultimately, the best benchmark is to run the application one wants to use.

2021-07-29, 20:15  #32
∂^{2}ω=0
Sep 2002
República de California
11,657 Posts 
Quote:
The 'pair%=95.65' means that out of every 10000 stage 2 primes, on average 9565 end up in bitmap pairs, thus you need 9565/2 + 435 ~= 5217 modmuls to process them. In the limit of infinite RAM, 100% paired would mean 5000 modmuls per 10000 stage 2 primes. Thus there's only a potential 4% speedup remaining. My guess is that if you doubled the memory allocation from your current one, you might see a 1% speedup.
p.s.: Your 36-minutes-lost estimate for your power outage looks right.
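The pairing arithmetic above can be checked with a few lines of Python. This is only a sketch of the counting argument as stated in the post (paired primes cost half a modmul each, unpaired ones cost a full modmul); the function name is made up and nothing here is taken from the Prime95 source.

```python
def modmuls_per_10000(pair_pct: float) -> float:
    """Modmuls needed per 10000 stage 2 primes at a given pairing percentage."""
    paired = 10000 * pair_pct / 100    # primes handled two per modmul
    unpaired = 10000 - paired          # primes that each need their own modmul
    return paired / 2 + unpaired


current = modmuls_per_10000(95.65)     # pair% from the quoted log line
ideal = modmuls_per_10000(100.0)       # infinite-RAM limit
remaining_pct = (current / ideal - 1) * 100

print(round(current, 1), ideal, round(remaining_pct, 2))
```

This gives about 5217.5 modmuls against an ideal of 5000, i.e. roughly 4.35% above the limit, in line with the "potential 4% speedup" estimate.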

2021-07-29, 21:54  #33
"David Kirkby"
Jan 2021
Althorne, Essex, UK
447_{10} Posts 
Quote:
Code:
[Worker #1 Jul 26 23:23] Available memory is 376520MB.
[Worker #1 Jul 26 23:23] D: 2730, relative primes: 7107, stage 2 primes: 3087325, pair%=95.65
[Worker #1 Jul 26 23:23] Using 311330MB of memory.
Quote:
When I was doing the P-1 based on saving 1 primality test, with 256 GB allowed for P-1
Code:
[Worker #1 Jul 25 08:59] M105211111 stage 1 complete. 1252604 transforms. Time: 1941.373 sec.
[Worker #1 Jul 25 08:59] Starting stage 1 GCD - please be patient.
[Worker #1 Jul 25 08:59] Stage 1 GCD complete. Time: 47.141 sec.
[Worker #1 Jul 25 08:59] Available memory is 262144MB.
[Worker #1 Jul 25 08:59] D: 2310, relative primes: 4743, stage 2 primes: 1313481, pair%=94.34
[Worker #1 Jul 25 08:59] Using 207828MB of memory.
Quote:
Last fiddled with by drkirkby on 2021-07-29 at 21:57
