mersenneforum.org > Great Internet Mersenne Prime Search > Hardware

Old 2019-07-06, 19:20   #12
paulunderwood
 
Sep 2002
Database er0rr

Quote:
Originally Posted by R. Gerbicz View Post
Or, even better, use my error check with PRP testing.
Jean Penné is working on implementing it in LLR.

Last fiddled with by paulunderwood on 2019-07-06 at 19:21
Old 2019-07-07, 16:39   #13
HairyCaul
 
Apr 2019
Seattle / Spokane

Quote:
Originally Posted by mackerel View Post
AMD Ryzen 1000 and 2000 series CPUs only have half the FP performance (per core, per clock) of most Intel CPUs, so you're working at a big disadvantage there...

Ballpark peak per-core, per-clock performance relative to recent Intel CPUs, where not limited by RAM bandwidth or other factors. Reality will likely differ, but this gives a good indication.
200% Skylake-X, some expensive Xeons (with 2 unit AVX-512)
100% Skylake, Kaby Lake, Coffee Lake
100% (estimate) Zen 2 (Ryzen 3000 series, excluding APUs)
88% Haswell
82% Broadwell
58% Sandy Bridge
50%-ish Zen 1, Zen+ (Ryzen 1000, 2000)
Assume any Intel CPU older than Sandy Bridge is half that. I never tested in detail...
I may be wrong, but I think the Zen+ chips are being underrated here.

One of the CPUs I run Prime95 on is an i7-8750H (which Intel lists under "Products formerly Coffee Lake"), 6 cores/12 threads typically running at about 2.9 GHz (it can be pushed up to 4 GHz in a cold enough space), with DDR4-2666. It typically runs at ~6.5 ms/iter.

Another CPU I run Prime95 on is a Ryzen 7 2700 (not a 2700X), 8 cores/16 threads that I've overclocked to 4.0 GHz, with DDR4 rated at 3000 but running at 2933 (due to mobo/Ryzen compatibility issues). With this setup I've managed to get Prime95 to run as low as ~5.8 ms/iter, but there has been some fluctuation and sometimes it creeps back up to ~6.1 ms/iter.

Both are running PRP tests.

The current performance of the R7 is undoubtedly due to me tweaking the memory timings using the excellent DRAM Calculator for Ryzen, but even before that it was only a little slower than the i7, doing around 6.4-6.8 ms/iter. I was honestly shocked to see such an improvement in Prime95 performance on the R7 after changing the memory timings. Granted, it's not totally stable, but with PRP I have yet to see enough errors to reduce the confidence from "excellent". I wonder whether such fine tuning of the i7's memory is possible, but the impression I got was that it isn't.

So to put Zen+ at the bottom of the ladder in terms of performance seems incorrect, from my experience.

Last fiddled with by HairyCaul on 2019-07-07 at 16:50
Old 2019-07-07, 17:13   #14
henryzz
Just call me Henry
 
"David"
Sep 2007
Cambridge (GMT/BST)

Quote:
Originally Posted by HairyCaul View Post
I may be wrong, but I think the Zen+ chips are being underrated here. [...]
I think that's because you are maxing out memory bandwidth on both systems. The numbers you referenced are meant for the case where memory bandwidth isn't a limit at all.
Having said that, I would have thought Zen 1 would do better than that. The difference between Skylake-X and Skylake also looks too big: yes, the vector units are twice as wide, but I don't think anyone has gotten anywhere near 2x the performance.
Can anyone who has had more access than me to these systems comment?
Old 2019-07-07, 19:02   #15
mackerel
 
Feb 2016
UK

I'm not familiar with current Prime95 work, but from what I've heard it does sound like that work is mostly RAM-bandwidth limited. The values I posted were mostly obtained through testing, although I didn't do much testing on Ryzen once I saw how low its performance was. A fast Intel quad of the time would easily outperform an 8-core Zen(+).

I did some short testing with my 7800X a while back, once LLR was updated to use AVX-512. Running small tasks, one per core, I could see close to the expected scaling.

Excerpt from a post I made on another forum:

Quote:
TRP task using: -d -q"38473*2^9349715-1" -t6

Starting Lucas Lehmer Riesel prime test of 38473*2^9349715-1
Using AVX-512 FFT length 800K, Pass1=128, Pass2=6400, clm=2, 6 threads
V1 = 4 ; Computing U0...done.
38473*2^9349715-1, iteration : 60000 / 9349715 [0.64%]. Time per iteration : 0.349 ms.


Resuming LLR test of 38473*2^9349715-1 at iteration 68794 [0.73%]
38473*2^9349715-1, iteration : 110000 / 9349715 [1.17%]. Time per iteration : 0.577 ms.

This is better, a 65% speedup on a TRP task, but I would hope for more. In theory this should be small enough to fit in cache; maybe it is too small to scale well with threads? Unfortunately the 7800X has a measly 8.25 MB of L3 cache and I'm not convinced the large non-inclusive L2 makes up for that with this code.


PPSE using: -d -q"3829*2^1534830+1"

Starting Proth prime test of 3829*2^1534830+1
Using all-complex FMA3 FFT length 120K, Pass1=384, Pass2=320, a = 3
3829*2^1534830+1, bit: 40000 / 1534841 [2.60%]. Time per bit: 0.377 ms.

Resuming Proth prime test of 3829*2^1534830+1 at bit 44002 [2.86%]
Using all-complex AVX-512 FFT length 120K, Pass1=192, Pass2=640, clm=1, a = 3
3829*2^1534830+1, bit: 90000 / 1534841 [5.86%]. Time per bit: 0.192 ms.

96% speedup on a PPSE. This is more like it!
Why a short test? I observed the hottest core hitting 107°C (with a high-end air cooler) and decided that was excessive! It was on my to-do list to underclock/undervolt it to try to tame it, but then summer hit and all crunching was halted anyway.

I never saw great thread scaling with that CPU, even with FMA3, for some reason. I'll blame the cache like everyone else, but I can't prove it without access to higher-core-count equivalents.

Oh, on RAM scaling on Intel, I do have to take back a comment I made in the past. I had previously observed that primary timings made little to no difference. Since then I've started tinkering with more sub-timings, and setting those aggressively can provide more of a benefit. I'm not sure which timing(s) affect it, though, and stability-testing RAM isn't my idea of fun.
Old 2020-11-03, 14:13   #16
Trilo
 
"W. Byerly"
Aug 2013
1423*2^2179023-1

The Threadripper 3960X has 128 MB of L3 cache and 24 cores, which works out to roughly 5.3 MB of L3 per core. I've heard that a rough estimate of how much memory an LLR test takes is anywhere from 6 to 12 × the FFT length. So I should be able to run FFT lengths up to roughly 500K with 1-core/1-worker workloads before it hits RAM? Is my math correct?
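A quick back-of-the-envelope check of that estimate, as a sketch only: it assumes about 8 bytes per FFT element plus roughly 25% overhead (close to the 8 × fftlen rule of thumb given in the reply below) and an even split of L3 across cores.

Code:
# Rough cache-fit estimate for 1-core/1-worker LLR tests on a 3960X.
# Assumptions (not from this thread): ~8 bytes per FFT element and
# ~25% overhead; every core gets an equal share of L3.
L3_TOTAL_MB = 128
CORES = 24
BYTES_PER_ELEMENT = 8
OVERHEAD = 1.25

l3_per_core_mb = L3_TOTAL_MB / CORES                     # ~5.33 MB per core
max_fft_len = l3_per_core_mb * 2**20 / (BYTES_PER_ELEMENT * OVERHEAD)

print(f"L3 per core: {l3_per_core_mb:.2f} MB")
print(f"Largest FFT length that fits: ~{max_fft_len / 1024:.0f}K")
# Prints ~546K, so staying at or below ~500K per core should keep
# each test in cache under these assumptions.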

Also, in general, is it better to increase cores per worker so the entire workload fits in L3 cache rather than letting it hit RAM? Threadrippers are quad-channel, but it seems that with so many cores even quad-channel will still have throughput issues.

For my purposes I'm doing LLR work on smaller exponents (1.5M-5M) rather than GIMPS LL.

Last fiddled with by Trilo on 2020-11-03 at 14:42
Old 2020-11-03, 14:52   #17
axn
 
Jun 2003

Quote:
Originally Posted by Trilo View Post
6 to 12 × the FFT length.
8 * fftlen + overhead.

Quote:
Originally Posted by Trilo View Post
Also, in general, is it better to increase cores per worker so the entire workload fits in L3 cache? Threadrippers are quad-channel, but it seems that with so many cores even quad-channel will still have throughput issues.
There are two competing forces. More cores per worker means more multithreading overhead and loss of efficiency. OTOH, fewer workers means more of each FFT stays in L3 cache, increasing efficiency.

So select the maximum number of workers such that the total FFT will fit comfortably within L3.

Now, for the Threadripper, we have another complication: the 128 MB is split into 8 x 16 MB L3 slices. So you would want each worker (or group of workers) to fit within 16 MB for maximum efficiency.

You didn't say what FFT size(s) you're encountering, but in general for these small FFTs you can do 8 workers with 3 cores each, or 24 workers with 1 core each, for maximum efficiency. Try both and see.

EDIT: If 5M bits is the largest size, you should easily be able to hold three such FFTs in 16 MB, so I'd bet 24 workers x 1 core will be your best setting.
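A sketch of that per-CCX fit check. The FFT length below is just an illustrative guess for a ~5M-bit candidate, and the 8-bytes-per-element figure is the rough rule quoted above; neither number is exact.

Code:
# How many single-core LLR workers fit in one 16 MB CCX slice of L3?
# Assumptions: FFT length ~320K for a ~5M-bit candidate (illustrative
# only) and ~8 bytes per FFT element plus ~25% overhead.
CCX_L3_MB = 16
CORES_PER_CCX = 3                  # 3960X: 24 cores spread over 8 CCXs
FFT_LEN = 320 * 1024               # assumed, not taken from the thread
BYTES_PER_ELEMENT = 8
OVERHEAD = 1.25

worker_mb = FFT_LEN * BYTES_PER_ELEMENT * OVERHEAD / 2**20
cache_limited = int(CCX_L3_MB // worker_mb)        # FFTs that fit in 16 MB
workers_per_ccx = min(cache_limited, CORES_PER_CCX)

print(f"~{worker_mb:.1f} MB per worker; cache fits {cache_limited}, "
      f"cores allow {CORES_PER_CCX} -> {workers_per_ccx} workers per CCX")
# The cache would hold about 5 such FFTs, but each CCX only has 3 cores,
# so 24 workers x 1 core stays comfortably inside L3.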

Last fiddled with by axn on 2020-11-03 at 15:07
Old 2020-11-03, 17:43   #18
Trilo
 
"W. Byerly"
Aug 2013
1423*2^2179023-1

Thank you, this is what I originally figured. So am I correct in assuming the discussion about RAM is only relevant because the FFT is too large to fit into L3 cache at GIMPS sizes?

If so, then it seems that for my purposes this opens up a lot more processors to buy without running into memory constraints, such as AMD's Ryzen series, which is only dual-channel. AMD seems to be the better choice, as its processors have more L3 cache per core than similarly priced Intel processors.

I'm curious to see some benchmarks of the new Zen 3 chips coming out in a few days.

Another question:

I have an i5-9600K running small 120K FFTs, and I was considering overclocking it. To overclock it I'll have to buy a better cooler, as it is close to thermal throttling with my current fan. Since it seems the FFTs should fit in L3 cache, would I see a speedup if I overclocked? I don't want to spend money on a new fan if I wouldn't see any benefit.

Last fiddled with by Trilo on 2020-11-03 at 17:44
Old 2020-11-03, 18:29   #19
VBCurtis
 
"Curtis"
Feb 2005
Riverside, CA

Quote:
Originally Posted by Trilo View Post
I have an i5-9600K running small 120K FFTs, and I was considering overclocking it. To overclock it I'll have to buy a better cooler, as it is close to thermal throttling with my current fan. Since it seems the FFTs should fit in L3 cache, would I see a speedup if I overclocked? I don't want to spend money on a new fan if I wouldn't see any benefit.
Logic points to "yes". You can get a good test of this by decreasing the speed of your RAM without overclocking: if LLR speeds remain the same, then you have solid evidence that you're not saturating the memory bandwidth, and therefore a CPU overclock would indeed run your tests faster.

If slowing RAM down a bit (say, 10%) slows your iteration times, it's unlikely that a CPU overclock would get you much.
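A tiny sketch of that decision rule, with made-up numbers: the timings, the 10% downclock, and the cutoff below are all illustrative placeholders, not measurements from this thread.

Code:
# Decide whether a CPU overclock is likely to help, based on how much
# iteration times change when RAM is downclocked by ~10%.
def likely_cpu_bound(ms_stock_ram: float, ms_slow_ram: float,
                     ram_slowdown: float = 0.10) -> bool:
    """True if iteration time grew by much less than the RAM slowdown,
    i.e. the workload is not saturating memory bandwidth."""
    change = (ms_slow_ram - ms_stock_ram) / ms_stock_ram
    return change < 0.25 * ram_slowdown   # 0.25 is an arbitrary cutoff

# Hypothetical example: 5.80 ms/iter at stock RAM, 5.85 ms/iter after a
# 10% RAM downclock -> barely changed, so a CPU overclock should pay off.
print(likely_cpu_bound(5.80, 5.85))       # True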
Old 2020-12-21, 04:39   #20
nitf
 
Dec 2020

I'm glad I ran across this thread as I was getting confused.

I recently rebuilt with a Ryzen 3900X and the stock cooler, and things were running hotter than desired; that is, hot enough for the fan to get noticeably loud. My ms/iter is ~3.2 with all 12 cores on a single worker testing M110541719. I have no idea what FFT size it's using other than the default: "PRP test of M110541719 using FMA3 FFT length 6M, Pass1=1536, Pass2=4k, clm=1, 12 threads".

I enabled Eco Mode, thinking I'd just run at a lower power draw and let things go slower, but I'm still at ~3.2 ms/iter there. CPU package power went down from 142 watts to 87, CPU package temps dropped from the mid 80s to the high 60s, and the CPU clock speed went from 3.95 GHz to 3.45 GHz, but my total Prime95 throughput is unchanged. I might as well run in Eco Mode to save electricity and avoid thermal issues if I'm not seeing any benefit at stock.

So, two questions:
1) How do I put together L3 cache (64 MB), the number of workers (up to 12 cores), and FFT size (where do I see/set this?) to find out whether everything will fit in cache?
2) If my ms/iter isn't any different when running at reduced power, what exactly is Prime95 doing to burn up the CPU when running at stock? Or is it just "it'll take whatever the system gives it, and if the system is bottlenecked elsewhere, that's not Prime95's fault; there's a reason this is used to stress test"?
Old 2020-12-21, 04:53   #21
axn
 
Jun 2003

Quote:
Originally Posted by nitf View Post
So, two questions:
1) How do I put together L3 cache (64 MB), the number of workers (up to 12 cores), and FFT size (where do I see/set this?) to find out whether everything will fit in cache?
2) If my ms/iter isn't any different when running at reduced power, what exactly is Prime95 doing to burn up the CPU when running at stock? Or is it just "it'll take whatever the system gives it, and if the system is bottlenecked elsewhere, that's not Prime95's fault; there's a reason this is used to stress test"?
A 6M FFT takes 48 MB. That would comfortably fit within the 64 MB of L3. However, that 64 MB is split into four 16 MB caches, one per CCX (or is it CCD?!). Hence you're bottlenecked by cross-CCX bandwidth (i.e. Infinity Fabric), which stays constant regardless of CPU speed. If you increase the IF speed, you'll get more throughput. Normally the IF speed (FCLK) runs 1:1 with the memory clock. What is your RAM speed? The fastest FCLK supported is about 1900 MHz, corresponding to DDR4-3800. So if you have DDR4-3800 (or can overclock your RAM to 3800) and run FCLK 1:1 with MCLK, you'll get the most performance for this particular workload.

EDIT: You should be able to change FCLK independently of MCLK; it's just that keeping them 1:1 is normally best.
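A worked version of that arithmetic, as a sketch: the 8-bytes-per-element figure is the rule of thumb used earlier in the thread, and FCLK here is simply taken as half the DDR transfer rate (the 1:1 memory-clock ratio); exact behaviour depends on the board and BIOS.

Code:
# Working-set size of a 6M FFT, and the FCLK implied by a DDR4 speed
# when FCLK is run 1:1 with the memory clock (DDR = 2 transfers/clock).
BYTES_PER_ELEMENT = 8              # rough rule of thumb, not exact

def fft_working_set_mb(fft_len_m: float) -> float:
    """Approximate footprint in MB of an FFT of fft_len_m million elements."""
    return fft_len_m * BYTES_PER_ELEMENT

def fclk_mhz(ddr4_mts: int) -> float:
    """FCLK (MHz) at a 1:1 ratio with the memory clock."""
    return ddr4_mts / 2

print(fft_working_set_mb(6))       # 48 MB: fits in 64 MB total L3,
                                   # but not in any single 16 MB CCX slice
print(fclk_mhz(3200))              # 1600 MHz, e.g. for DDR4-3200
print(fclk_mhz(3800))              # 1900 MHz, about the practical ceiling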

Last fiddled with by axn on 2020-12-21 at 05:03
Old 2020-12-21, 14:26   #22
nitf
 
Dec 2020

DDR4 PC4-25600 (3200 MHz).

Is there a way I can set a smaller FFT on an assignment I get?