mersenneforum.org

AMD Zen speculations (https://www.mersenneforum.org/showthread.php?t=20992)

Prime95 2017-03-10 06:14

There are some odd timings for 8 CPUs, 5 and 6 workers. It looks like the last worker is only getting half a CPU.

Can you add AffinityVerbosityBench=1 to prime.txt and rerun the 4M throughput bench, all CPUs, with hyperthreading, all worker/core combos? The output is likely to be a huge mess. You can email it to me rather than posting it here.

Mysticial 2017-03-10 06:20

1 Attachment(s)
[QUOTE=Prime95;454586]There are some odd timings for 8 CPUs, 5 and 6 workers. It looks like the last worker is only getting half a CPU.

Can you add AffinityVerbosityBench=1 to prime.txt and rerun the 4M throughput bench, all CPUs, with hyperthreading, all worker/core combos? The output is likely to be a huge mess. You can email it to me rather than posting it here.[/QUOTE]

There were some errors in the output.

Prime95 2017-03-10 09:44

[QUOTE=Mysticial;454587]There were some errors in the output.[/QUOTE]

I have a fix for that. The bug only affected the oddball throughput benchmarks (#cores not a multiple of #workers).
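To make the "oddball" case concrete, here is a minimal sketch that lists the worker counts that do not divide a given core count evenly. It is not Prime95's code; the function name `oddball_worker_counts` is made up for illustration.

```python
def oddball_worker_counts(cores):
    """Return the worker counts in 1..cores that do not divide cores evenly."""
    return [w for w in range(1, cores + 1) if cores % w != 0]


if __name__ == "__main__":
    # For 8 cores the list includes the 5- and 6-worker cases with the odd timings.
    print(oddball_worker_counts(8))
```

For 8 cores, the 5- and 6-worker runs fall into this category, while 1, 2, 4 and 8 workers do not.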

Prime95 2017-03-10 09:47

Are there any BIOS options to alter prefetching?

I'm wondering if Ryzen's hardware prefetcher is being too active, wasting memory bandwidth prefetching data that won't be needed.

mackerel 2017-03-10 12:33

I only had time for a quick run of the latest version before leaving for work this morning. FMA3 was a bit faster than K10 overall, but I want to rerun both a few more times before drawing any conclusions. In particular, I noticed that with K10, performance seemed to fall off at 2MB/core. With FMA3 it holds out a little longer, to 2.5MB/core, suggesting it is making better use of the combined L2+L3. Non-RAM-limited IPC for FMA3 was in the ballpark of half that of Skylake.
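A rough check of the cache arithmetic behind the 2MB and 2.5MB figures, as a sketch only. It assumes an 8-core Ryzen 7 with 512 KiB of L2 per core and 16 MiB of L3 in total; those sizes are an assumption and are not stated in the thread. The helper name `cache_per_core` is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB


def cache_per_core(cores, l2_per_core, l3_total):
    """Return (L3 share per core, L2 + L3 share per core) in bytes."""
    l3_share = l3_total / cores
    return l3_share, l2_per_core + l3_share


if __name__ == "__main__":
    # Assumed cache sizes: 512 KiB L2 per core, 16 MiB L3 shared by 8 cores.
    l3_share, combined = cache_per_core(8, 512 * KIB, 16 * MIB)
    print(l3_share / MIB, combined / MIB)
```

Under those assumptions the L3 share alone is 2 MiB per core and L2 plus the L3 share is 2.5 MiB per core, which would line up with the two fall-off points if the L3 holds mostly data that is not also in L2.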

The earlier memory bandwidth comment made me notice that, in my earlier testing, the limited performance was far below that of an equivalent Intel configuration. Based on IPC and core count alone, the total CPU potential should be comparable after allowing for clocks, so is the RAM the limit?

Side comment: in previous testing on a high-core-count Xeon, I found that with a high number of threads for a single worker, it didn't seem able to saturate each core. I saw this again on Ryzen. Is there some overhead when running like this? I didn't observe this effect at 4 threads, but I haven't investigated how it scales.

I'll have a look for any prefetcher settings, but on the Asus X370-Pro mobo I'm currently running I don't see anything for HPET or SMT either so I wouldn't be optimistic.

I don't think we have a complete understanding of what goes on inside the CPU with regard to how the CCXs and RAM are really connected, and there has been talk elsewhere that it may be better to consider it a NUMA system... They will surely have to clarify this if they hope to sell the server versions.

Mark Rose 2017-03-10 13:23

Here is an article discussing the frequency domains inside the chip: [url]https://thetechaltar.com/amd-ryzen-clock-domains-detailed/[/url]

Mark Rose 2017-03-10 13:35

[QUOTE=Mysticial;454585]Most likely dual-rank. My memory configuration is completely maxed out: 4 x 16GB.[/QUOTE]

You're certainly running in dual channel mode, but the RAM sticks themselves are either single or dual rank. We've found it makes a difference of about 10% in memory bandwidth performance for Prime95. Do you know the model number of your RAM?

mackerel 2017-03-10 13:52

If you're running 4 modules, you still have at least two ranks per channel. In my testing, performance is the same regardless of whether those 2 ranks are in two single rank modules per channel or one dual rank module per channel (all else being equal).

Mysticial 2017-03-10 14:47

[QUOTE=Mark Rose;454607]You're certainly running in dual channel mode, but the RAM sticks themselves are either single or dual rank. We've found it makes a difference of about 10% in memory bandwidth performance for Prime95. Do you know the model number of your RAM?[/QUOTE]

It's two sets of these: [url]https://www.newegg.com/Product/Product.aspx?Item=N82E16820232299[/url]

I don't know if they actually are dual-rank, but I usually assume that max density RAM is dual-rank.

Running these at 2133 MHz doesn't do them much justice. But I'm not gonna bother trying to overclock them until my new motherboard arrives.
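For a sense of scale, here is a back-of-the-envelope sketch of the theoretical peak bandwidth at that speed. It assumes "2133 MHz" means DDR4-2133 (2133 million transfers per second), two channels, and a 64-bit (8-byte) bus per channel; real sustained bandwidth will be lower. The function name `peak_bandwidth_gb_s` is made up for illustration.

```python
def peak_bandwidth_gb_s(transfers_mt_s, channels=2, bus_bytes=8):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return transfers_mt_s * 1e6 * bus_bytes * channels / 1e9


if __name__ == "__main__":
    total = peak_bandwidth_gb_s(2133)  # assumed: dual-channel DDR4-2133
    per_core = total / 8               # shared evenly across 8 cores
    print(round(total, 1), round(per_core, 2))
```

That comes to roughly 34 GB/s in total, or a little over 4 GB/s per core when all 8 cores are busy.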

VictordeHolland 2017-03-10 16:08

[QUOTE=Mark Rose;454605]Here is an article discussing the frequency domains inside the chip: [URL]https://thetechaltar.com/amd-ryzen-clock-domains-detailed/[/URL][/QUOTE]
:goodposting:

So basically Ryzen is 2 NUMA nodes (of 4 cores each) sharing a dual-channel memory controller, and transfers between the NUMA nodes go over the same fabric (in the memory frequency domain). That could lead to extra bandwidth congestion if data has to be fetched from the wrong L3 NUMA node.
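A small sketch of why this matters for worker placement. It assumes cores 0-3 sit on one CCX and cores 4-7 on the other, and that workers get contiguous blocks of cores with any remainder going to the first workers. Both the numbering and the assignment scheme are hypothetical and are not how Prime95 necessarily does it.

```python
def workers_spanning_ccx(cores=8, workers=3, cores_per_ccx=4):
    """Return the indices of workers whose cores sit on more than one CCX."""
    base, extra = divmod(cores, workers)
    spanning = []
    start = 0
    for w in range(workers):
        # The first `extra` workers get one additional core each.
        count = base + (1 if w < extra else 0)
        ccxs = {core // cores_per_ccx for core in range(start, start + count)}
        if len(ccxs) > 1:
            spanning.append(w)
        start += count
    return spanning


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, workers_spanning_ccx(8, n))
```

Under these assumptions, 2 or 4 workers keep every worker inside one CCX, while 1 worker (8 threads) or 3 workers leave one worker straddling both, which is where cross-CCX traffic would show up.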

Mark Rose 2017-03-10 16:19

[QUOTE=mackerel;454610]If you're running 4 modules, you still have at least two ranks per channel. In my testing, performance is the same regardless of whether those 2 ranks are in two single rank modules per channel or one dual rank module per channel (all else being equal).[/QUOTE]

Ahh, thanks for that insight :)

