#12
Sep 2002
Database er0rr
3531₁₀ Posts
Jean Penné is working on implementing it in LLR.
Last fiddled with by paulunderwood on 2019-07-06 at 19:21
#13
Apr 2019
Seattle / Spokane
416 Posts
Quote:
One of the CPUs I run Prime95 on is an i7-8750H (code name listed as "Products formerly Coffee Lake"), 6 cores/12 threads, usually running at about 2.9 GHz (it can be pushed up to 4 GHz in a cold enough space), with DDR4-2666. It typically runs at ~6.5 ms/iter. Another CPU I run Prime95 on is a Ryzen 7 2700 (not a 2700X), 8 cores/16 threads, which I've overclocked to 4.0 GHz, with DDR4 rated at 3000 but running at 2933 (due to mobo/Ryzen compatibility issues). With this setup I've managed to get Prime95 as low as ~5.8 ms/iter, but there has been some fluctuation, and sometimes it creeps back up to ~6.1 ms/iter. Both are running PRP tests.
The current performance of the R7 is undoubtedly due to tweaking the memory timings with this incredible DRAM calculator for Ryzen, but even before that it was only a little slower than the i7, doing around 6.4-6.8 ms/iter. I was honestly shocked to see such an improvement in Prime95 performance on the R7 after changing the memory timings. Granted, it's not totally stable, but with PRP I have yet to see enough errors to drop the confidence below "excellent". I wonder if such fine-tuning of the i7's memory is possible, but the impression I got was that it isn't. So, from my experience, putting Zen+ at the bottom of the performance ladder seems incorrect.
Last fiddled with by HairyCaul on 2019-07-07 at 16:50
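The ms/iter figures in this post can be turned into relative throughput, taking the i7's ~6.5 ms/iter as the baseline. This is a small sketch that assumes both machines are working on the same FFT size; ms/iter figures for different exponents or FFT lengths are not directly comparable.

```python
# Relative throughput of the Ryzen 7 2700 versus the i7-8750H, using the
# ms/iter figures quoted in the post. Assumes both run the same FFT size.
BASELINE_MS = 6.5  # i7-8750H, ~6.5 ms/iter

def relative_throughput(ms_per_iter: float, baseline_ms: float = BASELINE_MS) -> float:
    """Throughput relative to the baseline (1.0 = same speed, >1.0 = faster)."""
    return baseline_ms / ms_per_iter

r7_figures = {
    "R7 tuned, best": 5.8,
    "R7 tuned, fluctuating": 6.1,
    "R7 before tuning, best": 6.4,
    "R7 before tuning, worst": 6.8,
}

for label, ms in r7_figures.items():
    change = (relative_throughput(ms) - 1.0) * 100.0
    print(f"{label:<24} {ms:.1f} ms/iter -> {change:+.1f}% vs i7")
```

At its best the tuned R7 comes out roughly 12% ahead of the i7 (6.5 / 5.8 ≈ 1.12).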
#14
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
16A2₁₆ Posts
Quote:
Having said that, I would have thought Zen 1 would do better than that. The gap between Skylake-X and Skylake also seems too big. Yes, the vector units are twice as wide, but I don't think anyone has gotten anywhere near 2x the performance. Can anyone who has had more access to these systems than me comment?
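One way to see why doubling the vector width does not double performance is an Amdahl's-law style estimate: only the share of iteration time limited by vector arithmetic speeds up, while time spent waiting on cache and memory does not. The fractions below are purely illustrative assumptions, not measurements of any Skylake or Skylake-X system, and the model ignores effects such as lower clocks under AVX-512 load.

```python
def vector_width_speedup(compute_fraction: float, width_factor: float = 2.0) -> float:
    """Amdahl's-law estimate of overall speedup from wider vector units.

    compute_fraction: share of iteration time bound by vector arithmetic
                      (hypothetical; the rest is assumed not to speed up).
    width_factor:     how much faster the compute-bound part gets
                      (ideally 2.0 for 512-bit vs 256-bit vectors).
    """
    if not 0.0 <= compute_fraction <= 1.0:
        raise ValueError("compute_fraction must be between 0 and 1")
    return 1.0 / ((1.0 - compute_fraction) + compute_fraction / width_factor)

# Illustrative fractions only.
for f in (1.0, 0.8, 0.6, 0.4):
    print(f"compute-bound share {f:.0%}: ~{vector_width_speedup(f):.2f}x")
```

Even with 80% of the time compute-bound, the ideal 2x shrinks to about 1.67x under this model.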
#15
Feb 2016
UK
3×7×19 Posts
I'm not familiar with current Prime95 work, but from what I've heard it does sound like that work is mostly RAM-bandwidth limited. The values I posted were mostly obtained through testing, although I didn't do much testing on Ryzen once I saw how low its performance was. A fast Intel quad-core of the time would easily outperform an 8-core Zen(+).
I previously did some short tests with my 7800X once LLR was updated to use AVX-512. Running small tasks, one per core, I saw close to the expected scaling. Excerpt from a post I made on another forum:
Quote:
For some reason, I never saw great thread scaling with that CPU, even with FMA3. I'll blame the cache like everyone else, but I can't prove it without access to equivalents with higher core counts.
Oh, on RAM scaling on Intel, I have to take back a comment I made in the past. I had previously observed that primary timings made little to no difference. Since then I've started tinkering with more sub-timings, and setting those aggressively can provide more of a benefit. I'm not sure which timing(s) matter, though, and stability-testing RAM isn't my idea of fun.
#16
"W. Byerly"
Aug 2013
1423*2^2179023-1
143₈ Posts
The Threadripper 3960X has 128 MB of L3 cache and 24 cores, which works out to about 5.3 MB of L3 per core. I've heard that a rough estimate of how much memory an LLR test takes is anywhere from 6 to 12 × fftlen bytes. So I should be able to run FFT lengths up to roughly 500K with 1-core, 1-worker workloads before it hits RAM? Is my math correct?
Also, in general, is it better to increase the cores per worker so the entire workload fits in L3 cache, rather than letting it spill to RAM? Threadrippers are quad-channel, but with so many cores it seems quad-channel will still have throughput issues. For my purposes, I'm doing LLR work on smaller exponents (1.5M to 5M) rather than GIMPS LL.
Last fiddled with by Trilo on 2020-11-03 at 14:42
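The cache arithmetic in this question can be checked directly. The sketch below takes the 6 to 12 bytes-per-FFT-element multipliers from the post and assumes the 128 MB of L3 is shared evenly across all 24 cores. On the 3960X the L3 is actually split into 16 MB slices of 3 cores each, which also works out to about 5.3 MB per core.

```python
MIB = 1024 * 1024

def max_fft_length(cache_bytes: float, bytes_per_element: float) -> int:
    """Largest FFT length (in elements) whose working set fits in cache_bytes."""
    return int(cache_bytes // bytes_per_element)

# Threadripper 3960X: 128 MB L3 across 24 cores (assumed evenly shared).
l3_per_core = 128 * MIB / 24
print(f"L3 per core: {l3_per_core / MIB:.2f} MB")

# Multipliers from the post: an LLR test uses roughly 6-12 bytes per FFT element.
for bytes_per_element in (6, 8, 12):
    n = max_fft_length(l3_per_core, bytes_per_element)
    print(f"{bytes_per_element:>2} bytes/element -> FFT length up to ~{n // 1024}K")
```

With the pessimistic 12 bytes per element this gives roughly 455K, and with 8 bytes roughly 682K, so "around 500K" is in the right range before allowing for overhead and for other data competing for the same cache.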
#17
Jun 2003
2·3²·269 Posts
Roughly 8 bytes × fftlen, plus overhead.
Quote:
So select the maximum number of workers such that the total FFT data fits comfortably within L3. Now, for the Threadripper, there is another complication: the 128 MB is split into 8 × 16 MB of L3. So you want each worker (or group of workers) to fit within 16 MB for maximum efficiency. You didn't say what FFT size(s) you're encountering, but in general, for these small FFTs, you can run 8 workers × 3 cores each or 24 workers × 1 core each for maximum efficiency. Try both and see.
EDIT: If 5M bits is the largest size, you should easily be able to hold 3 such FFTs in 16 MB, so I'd bet 24 workers × 1 core will be your best setting.
Last fiddled with by axn on 2020-11-03 at 15:07
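The sizing rule in this reply (about 8 bytes per FFT element plus overhead, with the workers sharing an L3 slice fitting inside its 16 MB) can be sketched as follows. The 288K FFT length and the 25% overhead allowance are hypothetical placeholders, not values taken from LLR.

```python
MIB = 1024 * 1024
BYTES_PER_ELEMENT = 8  # one double-precision value per FFT element

def fft_footprint_bytes(fft_length: int, overhead_fraction: float) -> float:
    """Approximate working set: 8 bytes per element plus a fractional overhead."""
    return fft_length * BYTES_PER_ELEMENT * (1.0 + overhead_fraction)

def ffts_that_fit(cache_bytes: float, fft_length: int, overhead_fraction: float) -> int:
    """How many independent FFTs (one per worker) fit in one L3 slice."""
    return int(cache_bytes // fft_footprint_bytes(fft_length, overhead_fraction))

L3_SLICE = 16 * MIB        # 3960X: 128 MB split into 8 x 16 MB, 3 cores per slice
CORES_PER_SLICE = 3
fft_length = 288 * 1024    # hypothetical FFT length for a ~5M-bit number
overhead = 0.25            # hypothetical allowance for tables and other data

fit = ffts_that_fit(L3_SLICE, fft_length, overhead)
print(f"Each FFT needs ~{fft_footprint_bytes(fft_length, overhead) / MIB:.2f} MB")
print(f"{fit} such FFTs fit in one 16 MB slice")
if fit >= CORES_PER_SLICE:
    print("One single-core worker per core should stay cache-resident.")
else:
    print("Consider fewer workers with more cores each.")
```

Under these assumptions about five such FFTs fit per slice, comfortably more than the three workers sharing it.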
#18
"W. Byerly"
Aug 2013
1423*2^2179023-1
99₁₀ Posts
Thank you, this is what I originally figured. So am I correct in assuming the discussion about RAM is only relevant because, for GIMPS, the FFT is too large to fit into L3 cache?
If so, then for my purposes a lot more processors become options without running into memory constraints, such as AMD's Ryzen series, which is only dual-channel. AMD seems to be the better choice, as its processors have more L3 cache per core than similarly priced Intel processors. I'm curious to see some benchmarks of the new Zen 3 chips coming out in a few days.
Another question: I have an i5-9600K running small 120K FFTs, and I was considering overclocking it. To do that I'll have to buy a better fan, as it is near thermal throttling with my current one. Since the FFTs should fit in L3 cache, would I see a speedup if I overclock? I don't want to spend money on a new fan if I wouldn't see any benefit.
Last fiddled with by Trilo on 2020-11-03 at 17:44
#19
"Curtis"
Feb 2005
Riverside, CA
3×29×53 Posts
Quote:
If slowing your RAM down a bit (say, by 10%) slows your iteration times, a CPU overclock is unlikely to get you much.
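This rule of thumb can be turned into a rough estimate. The sketch below assumes a simple additive model (an assumption for illustration, not something Prime95 reports) in which iteration time splits into a part that scales with CPU clock and a part that scales with RAM speed. Timing a run with RAM slowed to 90% gives the memory-bound share, which then predicts what a CPU overclock could gain. The millisecond figures are hypothetical.

```python
def memory_bound_fraction(t_base_ms: float, t_slow_ram_ms: float,
                          ram_scale: float = 0.9) -> float:
    """Estimate the share of iteration time that scales with RAM speed.

    Assumed model: t = t_base * ((1 - f) + f / ram_scale) when RAM runs at
    ram_scale times its normal speed. Solving for f gives the formula below.
    """
    ratio = t_slow_ram_ms / t_base_ms
    f = (ratio - 1.0) / (1.0 / ram_scale - 1.0)
    return min(max(f, 0.0), 1.0)  # clamp: real measurements are noisy

def predicted_time_after_oc(t_base_ms: float, f_mem: float,
                            cpu_scale: float) -> float:
    """Iteration time if only the CPU-bound share speeds up by cpu_scale."""
    return t_base_ms * ((1.0 - f_mem) / cpu_scale + f_mem)

# Hypothetical measurements: 5.0 ms/iter normally, 5.5 ms/iter with RAM at 90%.
f = memory_bound_fraction(5.0, 5.5)
t_oc = predicted_time_after_oc(5.0, f, cpu_scale=1.10)  # +10% CPU clock
print(f"memory-bound share ~{f:.0%}; +10% CPU clock -> ~{t_oc:.2f} ms/iter")
```

With 90% of the time memory-bound, a 10% overclock buys under 1%; if slowing the RAM changes nothing, the model predicts the full 10%.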
#20
Dec 2020
2·3 Posts
I'm glad I ran across this thread as I was getting confused.
I recently rebuilt with a Ryzen 3900X and the stock cooler, and things were running hotter than I'd like, hot enough that the fan was running noisily. My ms/iter was ~3.2 with all 12 cores on a single worker, M110541719. I have no idea what FFT size it uses other than the default: "PRP test of M110541719 using FMA3 FFT length 6M, Pass1=1536, Pass2=4k, clm=1, 12 threads". I enabled Eco Mode, thinking I'd just run at a lower power draw and let things run slower, but I'm still at ~3.2 ms/iter. CPU package power went down from 142 W to 87 W, CPU package temperature dropped from the mid 80s to the high 60s, and CPU clock speed went from 3.95 GHz to 3.45 GHz, but my total Prime95 throughput is unchanged. If I'm not seeing any benefit at stock, I might as well run in Eco Mode to save electricity and avoid thermal issues.
So, two questions:
1) How do I use the L3 cache size (64 MB), the number of workers (up to 12 cores), and the FFT size (where do I see or set this?) to find out whether everything will fit in cache?
2) If my ms/iter isn't any different when running at reduced power, what exactly is Prime95 doing to burn up the CPU when running at stock? Or is it just "it'll take whatever the system gives it, and if the system is bottlenecked elsewhere, that's not Prime95's fault; there's a reason this is used to stress test"?
#21
Jun 2003
4842₁₀ Posts
Quote:
EDIT: You should be able to change FCLK independently of MCLK; it's just that keeping them at 1:1 is normally best.
Last fiddled with by axn on 2020-12-21 at 05:03
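For reference, DDR4 transfers data twice per memory clock, so the memory clock (MCLK) is half the DDR4 rating, and a 1:1 setting means running FCLK at that same frequency. A minimal sketch, using DDR4-3000 and DDR4-3200 (speeds mentioned elsewhere in the thread) as example inputs:

```python
def mclk_mhz(ddr4_rating_mts: int) -> float:
    """DDR transfers twice per clock, so MCLK is half the MT/s rating."""
    return ddr4_rating_mts / 2

def fclk_for_1to1(ddr4_rating_mts: int) -> float:
    """FCLK that keeps FCLK:MCLK at 1:1."""
    return mclk_mhz(ddr4_rating_mts)

for rating in (3000, 3200):
    print(f"DDR4-{rating}: MCLK {mclk_mhz(rating):.0f} MHz, "
          f"FCLK {fclk_for_1to1(rating):.0f} MHz for 1:1")
```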
#22
Dec 2020
6₁₆ Posts
DDR4 PC4-25600 3200MHz.
Is there a way I can set a smaller FFT on an assignment I get? |
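The PC4-25600 label is itself the per-channel peak bandwidth in MB/s (3200 MT/s × 8 bytes per transfer). This sketch shows what that means per core on a dual-channel 3900X, assuming both channels are populated. It gives a theoretical peak only; achievable bandwidth is noticeably lower.

```python
BYTES_PER_TRANSFER = 8  # 64-bit DDR4 channel

def peak_bandwidth_gbs(mts: int, channels: int) -> float:
    """Theoretical peak bandwidth in GB/s (decimal) for the given DDR4 setup."""
    return mts * BYTES_PER_TRANSFER * channels / 1000.0

total = peak_bandwidth_gbs(3200, channels=2)  # Ryzen 9 3900X is dual-channel
cores = 12
print(f"Peak: {total:.1f} GB/s total, ~{total / cores:.2f} GB/s per core")
```

For comparison, a 6M FFT by itself occupies 48 MB (6 × 1024 × 1024 elements × 8 bytes), far more than one 16 MB L3 slice, so each iteration of that test likely has to stream much of its data through this limited bandwidth.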