Old 2019-07-23, 09:47   #100
mackerel
 
Quote:
Originally Posted by M344587487
I don't know if it's intentional, probably not, but it is possible to have a higher-binned chip. It's possible that the two CCXs on the same chip clock differently. These chips already self-boost to get the most out of themselves out of the box (acting much more like GPUs do than what we expect of CPUs), but one of the recommended ways I've seen to try and overclock beyond that is to do it on a per-CCX basis. This video shows a manually overclocked 3900X with a 150 MHz disparity between the highest and lowest CCX (4490, 4441, 4341, 4341 MHz; the chiplets do clock quite differently, but this is a sample size of 1):
The claim of differently binned CCDs on two-CCD CPUs came from a source I don't (personally) trust yet. I don't have, or intend to get, a two-CCD model any time soon, but I am dropping hints on an overclocking forum in the hope that people there will try it out.

I think we can safely say they are binning. Recent info from Silicon Lottery suggests that a 3800X on average gets you about 100 MHz more than a 3700X in similar conditions. On two-CCD models, the highest boosts are only needed at lower active core counts, so they can stretch the best CCDs a bit further. I guess the only significant counter-argument to different CCD binning would be that they may still need "better" CCDs running all-core to help keep within the power/current budget.

I haven't looked at differing performance across CCXs yet. I do note that the Ryzen Master software indicates, for a CCD, which core is the "fastest", the 2nd fastest core on the same CCX, and also the 1st and 2nd fastest cores on the other CCX.

Yesterday I started doing some experiments on a 3700X, running it at 4+0 (one CCX, half the effective L3 cache, better core-to-core latency) and at 2+2 (all L3 available, higher latency crossing CCXs). I haven't run many workloads yet: no significant difference (<1%) in Cinebench R15, R20 or Geekbench 4 multicore, while 3DMark11 Physics showed around an 8% advantage for 4+0. If this sounds like a random bunch of benchmarks, it's for a bit of fun elsewhere.
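
For anyone who wants to try something similar without BIOS changes, here's a rough Python sketch (Linux only) that groups logical CPUs by shared L3, i.e. by CCX, from sysfs. Whether your kernel exposes the L3 entries this way is an assumption, so sanity-check the output before pinning anything to it:

Code:
# Group logical CPUs by the L3 cache they share (one group per CCX on Zen 2).
# Assumes the kernel exposes per-CPU cache info under /sys; verify before use.
import glob
import re

groups = {}
for idx_dir in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cache/index*"):
    try:
        with open(idx_dir + "/level") as f:
            if f.read().strip() != "3":        # only interested in L3 entries
                continue
        with open(idx_dir + "/shared_cpu_list") as f:
            shared = f.read().strip()          # e.g. "0-2,6-8"
    except FileNotFoundError:
        continue
    cpu = int(re.search(r"cpu(\d+)", idx_dir).group(1))
    groups.setdefault(shared, set()).add(cpu)

for i, (shared, members) in enumerate(sorted(groups.items())):
    print(f"L3 domain {i} (CCX): logical CPUs {shared}  [{len(members)} threads]")
# One worker pinned per group (e.g. "taskset -c 0-2,6-8 ./mprime", adjusted to
# the output) should roughly reproduce the per-CCX setups discussed here.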
Old 2019-07-29, 14:27   #101
M344587487
 

In addition to the currently known variants, it's likely that the 3500, 3700 and 3900 are coming: https://www.tomshardware.com/news/ry...ies,40040.html

The 3500 is likely OEM-only. The 3700 is an interesting alternative to the 3600 at the right price (though it's unlikely to arrive at that price unless it comes much later). The 3900 likely uses lower-binned chiplets like the 3600, which means it should be easier to secure enough supply. By the time the 3900 comes to market the 3900X supply problems may have been solved, but who knows.
Old 2019-08-06, 07:44   #102
mackerel
 

I'm doing the challenge at PrimeGrid at the moment with the two Zen 2 systems... or would be if one hadn't produced some bad results. It's the 3600 running stock, with 2666 RAM. I'm running two tasks of 3 cores each, as I found that optimal for throughput.

I have a suspicion that the CPU might not be stable at higher temps, even before it throttles. Currently I have a Wraith Prism on it, which is the cooler bundled with the 3700X. Peak temps hit 90C... when I had a Noctua D9L on it before, it was only hitting 80C. So maybe a bad trade-off: RGB+noise vs cooler temps. I guess I have to put the Noctua back on it some time.
Old 2019-08-20, 12:48   #103
nomead
 

Okay, I just had to do some quick benchmarks of my own on the Ryzen 5 3600. No overclocking options were used in the BIOS, so I assume it's running at base clock (3.6 GHz) while doing these tests. It's a bit hard to tell, because the clock values I get from Linux are all over the place: one second it's 2.1 GHz, the next it's 4.1, and so on, while running benchmarks in mprime. But anyway, it was a quick qualitative test to see how the L3 cache and memory bandwidth cope with different situations, so maybe that doesn't matter so much.

So I ran all mprime FFT sizes from 2048K to 8192K, varied the number of cores used from 2 to 6, and always kept the number of workers at 1. As a baseline comparison, the lowest curve on the graph is the now-retired Ryzen 3 2200G, four cores, one worker. Speeds were normalized by multiplying the FFT length (in K) by the throughput (iters/sec), then dividing that value by the slowest such result, which happened to be the 8064K FFT on the 2200G.
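
In case the normalisation isn't obvious, it's just this (the throughput numbers below are placeholders, not my measured values):

Code:
# Normalised speed = FFT length (K) * throughput (iters/sec), scaled so the
# slowest measured combination equals 1.0.  The numbers here are placeholders.
raw = [
    ("2200G, 4 cores", 8064, 1.10),   # (label, FFT length in K, iters/sec)
    ("3600, 4 cores",  8064, 2.05),
    ("3600, 6 cores",  2688, 8.40),
]

work = [(label, fft_k * ips) for label, fft_k, ips in raw]   # "useful work"/sec
slowest = min(w for _, w in work)
for label, w in work:
    print(f"{label:>16}: normalised speed {w / slowest:.2f}")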

It seems that four cores are enough to saturate the RAM bandwidth once the FFT size gets big enough (around 6M). Around the current first-test wavefront (5120K FFT) five cores seem to be enough; there's not much improvement from having six cores running. But in my opinion, it makes a lot more sense to run tests at the double-checking wavefront. The DC exponents I'm getting use a 2688K FFT and fit well inside the L3 cache. It would be really interesting to see how fast the eight-core 3700X/3800X runs there.

And, of course, it would be even more interesting to see the 3900X performance, if someone ever manages to get one... Double the L3 cache should mean that even first-test FFTs fit in the cache, and the speedup should be very noticeable. Note that even on the 3600 the L3 cache is divided in two, but this doesn't seem to matter that much, only the total amount. Bigger is better, folks.
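
The rough cache arithmetic behind that expectation, in Python (8 bytes per FFT element is the main assumption; mprime's real working set is somewhat larger because of weights and scratch data):

Code:
# Does the raw FFT data fit in L3?  Assumes 8 bytes (one double) per element;
# mprime's true footprint is somewhat larger, so treat this as a lower bound.
l3_mib = {"3600/3700X/3800X": 32, "3900X": 64}
for fft_k in (2688, 5120, 7680, 8192):
    data_mib = fft_k * 1024 * 8 / 2**20
    verdict = ", ".join(f"{cpu}: {'fits' if data_mib < size else 'spills'}"
                        for cpu, size in l3_mib.items())
    print(f"{fft_k:>5}K FFT ~ {data_mib:4.0f} MiB -> {verdict}")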
Attached Thumbnails
ryzen3600-fft-bench.png
Old 2019-08-20, 17:27   #104
aurashift
 

Quote:
Originally Posted by nomead
It seems that four cores are enough to saturate the RAM bandwidth once the FFT size gets big enough (around 6M). Around the current first-test wavefront (5120K FFT) five cores seem to be enough; there's not much improvement from having six cores running. But in my opinion, it makes a lot more sense to run tests at the double-checking wavefront. The DC exponents I'm getting use a 2688K FFT and fit well inside the L3 cache. It would be really interesting to see how fast the eight-core 3700X/3800X runs there.

And, of course, it would be even more interesting to see the 3900X performance, if someone ever manages to get one... Double the L3 cache should mean that even first-test FFTs fit in the cache, and the speedup should be very noticeable. Note that even on the 3600 the L3 cache is divided in two, but this doesn't seem to matter that much, only the total amount. Bigger is better, folks.

Ryzen is still dual-channel RAM, right?



Just guessing, but maybe we'll see linear scaling up to at least 16 workers with Rome's 8 channels, since it sort of has intra-socket UMA, depending on how long it takes to saturate those PCIe 4.0 IF links.
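
Back-of-the-envelope theoretical peaks for comparison (8 bytes per channel per transfer; real sustained bandwidth is lower):

Code:
# Theoretical peak DRAM bandwidth = channels * 8 bytes * transfer rate (MT/s).
def peak_gbs(channels, mts):
    return channels * 8 * mts / 1000        # GB/s

print(f"Matisse, 2ch DDR4-3600: {peak_gbs(2, 3600):6.1f} GB/s")
print(f"Rome,    8ch DDR4-3200: {peak_gbs(8, 3200):6.1f} GB/s")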
Old 2019-08-20, 18:36   #105
nomead
 

Quote:
Originally Posted by aurashift
Ryzen is still dual channel RAM right?
Yeah it's dual channel, but the point was that if most of the stuff fits in L3 cache, maybe the memory subsystem matters less than usual.
Old 2019-08-20, 19:14   #106
Mark Rose
 

What speed are you running the memory at?
Old 2019-08-20, 19:21   #107
nomead
 

2x 8GB, 3600 MHz CL18.
Old 2019-08-21, 08:03   #108
mackerel
 

Quote:
Originally Posted by nomead
Note that even on the 3600, the L3 cache is divided in two, but this doesn't seem to matter that much, only the total amount. Bigger is better, folks.
The L3 cache is split per CCX: if data isn't in the local CCX's slice, you're going back to RAM to get it. Also, the interconnect from each CCD only has half the write bandwidth compared to reads. I don't know if that information might be useful in some way...

As it stands, for tasks smaller than the ones around here, I found one worker per CCX to give optimal throughput, but those do fit in their slice of the cache. For bigger FFTs that exceed a CCX, the performance isn't bad using the whole CCD, but I think it would have been even higher had the L3 cache been unified.

While running 3600 RAM is probably optimal, for those with slower RAM I wonder if there may be some benefit to running a higher IF clock than RAM clock. The trade-off is increased RAM latency from breaking the synchronous relationship, but you recover some of that write bandwidth.
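
To put some hedged numbers on the write-bandwidth point (assuming the commonly reported link widths of 32 bytes/cycle read and 16 bytes/cycle write per CCD at the fabric clock):

Code:
# Per-CCD Infinity Fabric link bandwidth, assuming the widely reported widths
# of 32 bytes/cycle read and 16 bytes/cycle write at the fabric clock (FCLK).
def if_link_gbs(fclk_mhz):
    return 32 * fclk_mhz / 1000, 16 * fclk_mhz / 1000   # (read, write) GB/s

for fclk in (1600, 1800, 1900):
    rd, wr = if_link_gbs(fclk)
    print(f"FCLK {fclk} MHz: ~{rd:.1f} GB/s read, ~{wr:.1f} GB/s write per CCD")
# Dual-channel DDR4-3600 peaks at ~57.6 GB/s, so at FCLK 1800 reads roughly keep
# up with the memory, but writes from a single CCD top out at about half of that.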
Old 2019-08-21, 09:01   #109
M344587487
 

Quote:
Originally Posted by aurashift
Ryzen is still dual-channel RAM, right?



Just guessing, but maybe we'll see linear scaling up to at least 16 workers with Rome's 8 channels, since it sort of has intra-socket UMA, depending on how long it takes to saturate those PCIe 4.0 IF links.
A worker per 64MB of cache (two chiplets) may yield more throughput than a worker per 32MB of cache, if it means RAM bandwidth isn't saturated in the former but is in the latter (which should be the case for wavefront tests). We've confirmed that splitting a wavefront worker across two CCXs is the way to go for single-chiplet SKUs (which have unconnected L3 caches; they're on the same die, but that should be irrelevant as they communicate the same way they would if they were on different dies). If it is confirmed that a worker per 64MB is the way to go (these 3900X shortages are ridiculous), then scaling to 4, 6 and 8 chiplets should be linear, all other things being equal (more parallel workers). In the best-case scenario RAM bandwidth is so severely under-utilised that not fully populating the memory channels and/or using cheaper, slower memory (while decoupling the IF) barely affects throughput but saves a chunk of change on hardware.
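
To put rough numbers on the 32MB vs 64MB split (chiplet counts beyond 2 are hypothetical Rome-like parts, and ~40 MiB for a wavefront FFT is just the raw data size from earlier in the thread):

Code:
# Workers per CPU if each wavefront worker gets one chiplet (32 MiB L3) or two
# (64 MiB L3).  Chiplet counts beyond 2 are hypothetical Rome-like parts.
WAVEFRONT_MIB = 5120 * 1024 * 8 / 2**20          # ~40 MiB of raw FFT data

for chiplets in (2, 4, 6, 8):
    for mib_per_worker in (32, 64):
        workers = chiplets * 32 // mib_per_worker
        resident = WAVEFRONT_MIB <= mib_per_worker
        print(f"{chiplets} chiplets, {mib_per_worker} MiB/worker: "
              f"{workers} workers, wavefront cache-resident: {resident}")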

An interesting thing to note is that IF speed (FCLK) can be decoupled from RAM speed; it's not just tied to a multiple of the RAM clock. There is (at least) a latency penalty to doing so, but if it means the IF can be set to 1900 (instead of the typical 1600 to 1800 range) it may be a worthwhile speedup when not bound by RAM. (This video shows some tuning/metrics that may not be directly relevant but illustrate the general concept: https://youtu.be/10pYf9wqFFY?t=535 ).
Old 2019-09-01, 21:29   #110
nomead
 

All right, one 3900X owner ran the mprime benchmarks for me, with the same plotting methodology as before. The 3900X is over two times faster than the 3600 within a certain range of FFT sizes, from 5120K to 7680K: 2.45 ms/iter at the wavefront, 5120K... Certainly the effect of the larger L3 cache can be seen. He has 3000 MHz memory in the system, no idea about the latency. No system tuning was done, so I assume that FCLK is just 1500 MHz.
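
For a feel of what 2.45 ms/iter means in practice, taking a wavefront exponent of roughly 90M as an assumption (the exact figure depends on the assignment):

Code:
# Rough wall-clock estimate for one first-time test at the 5120K wavefront.
exponent = 90_000_000          # assumed wavefront exponent, order of magnitude
ms_per_iter = 2.45             # reported 3900X timing at 5120K
days = exponent * ms_per_iter / 1000 / 86400
print(f"~{days:.1f} days per first-time test on one worker")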
Attached Thumbnails
ryzen3900x-3600-fft-bench.png