#23
Jan 2020
3²·41 Posts
Quote:
Until someone tests its performance on Prime95, it's hard to predict its efficiency or benchmark it this early.
Last fiddled with by tuckerkao on 2021-07-27 at 03:21
#24 |
"Composite as Heck"
Oct 2017
2²×3²×23 Posts
I was predicting throughput, which is fairly easy to do as it's capped by memory bandwidth. Bandwidth per iteration is also affected by L3 cache, which I originally ignored for simplicity; if we take the 12900K's much smaller L3 cache into consideration, it may lean closer to the 5950X in performance than the 5970X.
Efficiency is much harder to predict. Left to its own devices with standard workloads, the 12900K would likely be a power-hungry monster (less so than the previous generation: there are more cores operating under a similar power budget, so on average they should sit at a better point on the efficiency curve, offering more performance per watt; the node is also better). Being memory-bandwidth limited is actually good for efficiency, as the cores can sit at an even better point on the curve. A low-tier Alder Lake part is one to watch.

Intel are normally very good at making sure Linux kernel support is in place well ahead of a product launch, though they slipped a little when 10nm was first introduced. The margins are smaller with Alder Lake than we're used to, but basic support is already there and has been for months. If anything, I expect more teething problems with Windows, which historically has had trouble adapting to non-uniform architectures (they may have overcome this by now).
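To make the bandwidth-capped claim concrete, here is a back-of-the-envelope sketch. The 60 GB/s bandwidth figure and the assumption that the working set is streamed four times per iteration are illustrative guesses, not measurements:

```python
# Rough estimate of Prime95 iteration throughput when memory bandwidth
# is the bottleneck. All numbers here are illustrative assumptions.

def iters_per_sec(bandwidth_gb_s, fft_len_doubles, passes_per_iter=4):
    """Iterations/sec if each iteration must stream the FFT data
    `passes_per_iter` times through memory (assumed, implementation-dependent)."""
    bytes_per_iter = fft_len_doubles * 8 * passes_per_iter
    return bandwidth_gb_s * 1e9 / bytes_per_iter

# 5600K FFT (the length used in the logs later in this thread),
# with an assumed ~60 GB/s of sustained bandwidth:
print(iters_per_sec(60, 5600 * 1024))
```

The point of the sketch is that the FFT length and the number of memory passes are fixed by the work, so doubling sustained bandwidth roughly doubles throughput regardless of core count.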
#25
"David Kirkby"
Jan 2021
Althorne, Essex, UK
2⁴·3³ Posts
Quote:
Code:
[Worker #1 Jul 26 23:23] D: 2730, relative primes: 7107, stage 2 primes: 3087325, pair%=95.65
Unfortunately I had not realised it was possible to save the times to the nearest second, so all timings have an uncertainty of 1 minute. But in all but one case, increasing the RAM changed all four of those values.
Sort of related: a power outage occurred whilst I was benchmarking a PRP test. Perhaps you, or someone else, can answer a query I have about it. I want to estimate the runtime as accurately as possible, but to do so I need to know how much time was lost due to the outage. The last entry in the log before the power outage that was relevant to this exponent was
Code:
[Worker #1 Jul 27 11:29] Iteration: 11000000 / 105211111 [10.455169%], ms/iter: 3.078, ETA: 3d 08:33
The save files on disk were
Code:
-rw-rw-r-- 1 drkirkby drkirkby   26303452 Jul 27 11:29 p105211111
-rw-rw-r-- 1 drkirkby drkirkby   39454848 Jul 27 11:15 p105211111.bu
-rw-rw-r-- 1 drkirkby drkirkby   26303452 Jul 27 10:38 p105211111.bu2
-rw-rw-r-- 1 drkirkby drkirkby   26303452 Jul 27 09:46 p105211111.bu3
-rw-rw-r-- 1 drkirkby drkirkby   26303452 Jul 27 08:54 p105211111.bu4
-rw-rw-r-- 1 drkirkby drkirkby 3366780928 Jul 27 11:13 p105211111.residues
After power was restored, the log on restart showed
Code:
[Main thread Jul 27 12:05] Mersenne number primality test program version 30.6
[Worker #1 Jul 27 12:05] Resuming Gerbicz error-checking PRP test of M105211111 using AVX-512 FFT length 5600K, Pass1=896, Pass2=6400, clm=1, 13 threads
[Worker #1 Jul 27 12:05] PRP proof using power=8 and 64-bit hash size.
[Worker #1 Jul 27 12:05] Proof requires 3.4GB of temporary disk space and uploading a 118MB proof file.
[Worker #1 Jul 27 12:05] Iteration: 11000001 / 105211111 [10.455170%].
[Worker #1 Jul 27 12:10] Iteration: 11100000 / 105211111 [10.550216%], ms/iter: 3.059, ETA: 3d 07:58
Last fiddled with by drkirkby on 2021-07-27 at 13:31
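The lost time can be read straight off the timestamps in those logs: the last progress line before the outage is at 11:29 and the worker resumed at 12:05. A throwaway sketch of that arithmetic, with the timestamps copied from the log excerpts:

```python
from datetime import datetime

# Timestamps taken from the log lines quoted above (Jul 27, same day).
last_progress = datetime(2021, 7, 27, 11, 29)  # last Iteration line before the outage
resumed       = datetime(2021, 7, 27, 12, 5)   # worker resumed the PRP test

lost_minutes = int((resumed - last_progress).total_seconds() // 60)
print(lost_minutes)  # -> 36
```

Strictly speaking that is an upper bound on the computation lost: the restart resumed from iteration 11000001, i.e. from the save written alongside the 11:29 progress line, so essentially no iterations beyond the logged point were thrown away.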
#26 |
Jun 2003
5,087 Posts
The "relative primes" value is the number of buffers (i.e. FFT arrays) being used. For a 5.5M FFT at 44MB per buffer, the allocation is 312708 MB. On top of that, it uses 50-100MB of memory for prime bitmaps and such.
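As a cross-check of that arithmetic, here is a sketch using the figures quoted in this thread (the actual FFT length from the logs, not Prime95's source):

```python
# Cross-check of stage 2 buffer memory, using figures from this thread.
relative_primes = 7107             # buffers reported in the log line
fft_len = 5600 * 1024              # the run actually used a 5600K FFT
buffer_mib = fft_len * 8 / 2**20   # one double-precision FFT array: 43.75 MiB
total_mib = relative_primes * buffer_mib
print(round(total_mib))            # close to the "Using 311330MB" log line
```

The small remainder between this product and the logged total is consistent with the 50-100MB of bitmap overhead mentioned above.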
#27
Jan 2020
171₁₆ Posts
#28 |
"University student"
May 2021
Beijing, China
101111₂ Posts
It's too early to discuss anything. I suppose Woltman will modify the code in v30.7 to adapt the program to those cores as soon as the 12900K is released.
Last fiddled with by Zhangrc on 2021-07-29 at 08:01
#29
"Composite as Heck"
Oct 2017
2²·3²·23 Posts
Quote:
Prime95 is a rare case in that it can use AVX-512. There are many configurations that need extensive testing to find the most efficient, these being the main ones:
#30 |
Feb 2016
UK
436₁₀ Posts
I don't think Intel want to get rid of AVX-512, as they're still busy stuffing it in wherever they can. How support for it will be handled in Alder Lake remains to be seen, and I think it's a software (OS) problem, not a hardware one. Keep in mind the consumer implementations (mobile, Rocket Lake) are equivalent to Skylake-X with a single AVX-512 unit, not two. Peak FP throughput is in theory half that, and no better than AVX2. However, I was pleased to find that practical throughput on Rocket Lake was still significantly elevated over Skylake; I forget the exact number, but it was over a 40% IPC gain when not memory-bandwidth limited. On Skylake-X I was recently seeing around an 80% gain from memory, not quite hitting the ideal 100% increase over Skylake.
Anyway, the goal of the small cores is to be small, so dropping HT does, I think, make sense. They're not meant for high throughput, and they're probably slow enough without making them per-thread slower with HT. I've seen suggestions they're not as slow as some might think, although where that lies remains to be seen.

Do we know about the construction of Alder Lake yet? Was it going to have the GPU on a separate die, or am I imagining that? I'd guess all cores will be on the same die; we're not going to see AMD-style chiplets just yet.

For sufficiently large workloads that don't become memory limited, I'd guess running a single worker on the big cores will probably give optimal performance. My guess is the small cores will be slow enough that it isn't worth taking power budget away from the big cores in this type of workload, especially as you will run into scaling problems if you work across both core types, and may still run into headaches with a second worker. If you want better power efficiency, just power-limit the big cores. Only for light workloads like sieving or Cinebench might using all cores help.
#31
"David Kirkby"
Jan 2021
Althorne, Essex, UK
2⁴×3³ Posts
Quote:
Sunday I will probably sort out the rest of my spreadsheet with all the timings. I inadvertently overlooked one P-1 test as I was manually changing worktodo.txt; I then realised I needed to automate that task. I'm now running a PRP test on the same exponent I ran the P-1 tests on. Once I know the time to complete the PRP test, I'll determine whether giving mprime lots of RAM on my machine is a good or bad idea.

I possibly hit a problem when increasing the available memory above 256 GB with 2 saved tests. The reported chance of finding a factor was the same as at 256 GB, but the runtime increased by a couple of minutes: B1 dropped and B2 went up. However, I need a couple more digits of resolution on the chance of finding a factor to draw any meaningful conclusions. Looking at the source code, I believe the optimal B1 and B2 bounds for P-1 are based on an assumed 1.8% error rate on the tests; PRP should be a lot better than LL.

I would not be too surprised if the optimal RAM allocation depends on things other than just the amount of RAM available. Different people have different numbers of memory channels, different RAM speeds, etc. My RAM is only clocked at 2400 MHz, which is a limit of the CPU; other people's RAM might be much faster. I do have some concerns about the speed of the RAM in my computer, and someone else expressed similar concerns about a very similar Dell machine. Here's his complaint about the Dell 7910 https://www.dell.com/community/Preci...w/td-p/7778630 and mine about the 7920 https://www.dell.com/community/Preci...l/td-p/7945377 I don't know how useful Passmark is as a RAM benchmark; ultimately, the best benchmark is to run the application one wants to use.
#32
∂²ω=0
Sep 2002
República de California
10110101111111₂ Posts
Quote:
The 'pair%=95.65' means that out of every 10000 stage 2 primes, on average 9565 end up in bitmap pairs, so you need 9565/2 + 435 ~= 5217 modmuls to process them. In the limit of infinite RAM, 100% pairing would mean 5000 modmuls per 10000 stage 2 primes. Thus there's only a potential ~4% speedup remaining. My guess is that if you doubled the memory allocation from your current one, you might see a 1% speedup.

p.s.: Your 36-minutes-lost estimate for your power outage looks right.
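The pairing arithmetic above, spelled out as a small sketch (the reasoning only, not Prime95's actual pairing code):

```python
# Stage 2 modmul count as a function of pairing percentage.
# Paired primes are processed two per modmul; unpaired ones cost one each.

def modmuls_per_10000(pair_pct):
    paired = 10000 * pair_pct / 100   # primes that found a partner
    unpaired = 10000 - paired
    return paired / 2 + unpaired

current = modmuls_per_10000(95.65)  # ~5217.5 modmuls per 10000 primes
ideal   = modmuls_per_10000(100.0)  # 5000, the infinite-RAM limit
print(current, ideal)
```

Relative to the current 5217.5, reaching the ideal 5000 saves a bit over 4% of the stage 2 modmuls, which is where the "potential 4% speedup remaining" figure comes from.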
#33
"David Kirkby"
Jan 2021
Althorne, Essex, UK
2⁴×3³ Posts
Quote:
Code:
[Worker #1 Jul 26 23:23] Available memory is 376520MB.
[Worker #1 Jul 26 23:23] D: 2730, relative primes: 7107, stage 2 primes: 3087325, pair%=95.65
[Worker #1 Jul 26 23:23] Using 311330MB of memory.
The FFT length was actually 5600K.
Quote:
When I was doing the P-1 based on saving 1 primality test, with 256 GB allowed for P-1
Code:
[Worker #1 Jul 25 08:59] M105211111 stage 1 complete. 1252604 transforms. Time: 1941.373 sec.
[Worker #1 Jul 25 08:59] Starting stage 1 GCD - please be patient.
[Worker #1 Jul 25 08:59] Stage 1 GCD complete. Time: 47.141 sec.
[Worker #1 Jul 25 08:59] Available memory is 262144MB.
[Worker #1 Jul 25 08:59] D: 2310, relative primes: 4743, stage 2 primes: 1313481, pair%=94.34
[Worker #1 Jul 25 08:59] Using 207828MB of memory.
Quote:
Last fiddled with by drkirkby on 2021-07-29 at 21:57