mersenneforum.org  

Old 2019-07-08, 13:50   #56
mackerel
 
Feb 2016
UK

2⁶×7 Posts

Now I'm scared:

Quote:
Yes. All CCX<->CCX communication traverses the IOD, meaning all CCXes communicate at a common latency. Same for cache. Same for DRAM. From the perspective of the system, this is monolithic die behavior.

A few ns of wire latency notwithstanding. :)
A later reply clarified that CCX-to-CCX traffic on the same die still goes externally via the IOD. This could have a significant impact on using the L3 cache.
From an AMD technical marketing manager: https://twitter.com/Thracks/status/1147925341316505602
Old 2019-07-08, 20:29   #57
maxzor
 
Apr 2017

2²×5 Posts

Quote:
Originally Posted by mackerel View Post
Now I'm scared:/
https://www.reddit.com/r/Amd/comment..._data_latency/

Last fiddled with by maxzor on 2019-07-08 at 20:31
Old 2019-07-08, 22:38   #58
mackerel
 
Feb 2016
UK

111000000₂ Posts

Latency isn't the concern, bandwidth is. Gonna try and get an early night, so assuming CPU arrives as scheduled tomorrow I can get out of work earlier and start testing for real.
Old 2019-07-09, 07:58   #59
M344587487
 
"Composite as Heck"
Oct 2017

2×5²×19 Posts

I've had a bit of a play too, based on the above YouTube video's AIDA64 benchmark of 46.4GB/s read and 25.5GB/s write per chiplet using 3200 DDR4, and the 69MB read, 64MB write requirements per iteration for 4M. Scaling linearly to 3600, that puts a lower bound on the 4M iteration time of ~2.23 ms/it limited by the write speed (~1.32 ms/it by the read speed). This ( https://www.mersenneforum.org/showpo...&postcount=782 ) 9900K benchmark using 3600 DDR4 yields ~3.17 ms/it for 4M. Assuming perfect conditions/optimisation, this means the nerfed write speed isn't a definite limiter, though in practice it may be (especially when you consider that we'll be running one job per CCX, so even if optimal writes were possible with only a single job using the pipe, syncing the writes from two jobs may be a step too far).

From a position of ignorance it's hard to guess how a doubling of cache from 8MB to 16MB will help. Judging by the description of 2 passes per iteration, it'll at least help a little with cache misses between passes, but is that all? This is probably the dumbest thing I've said all week, but if L3 size is incidental and the write speed is a limiter, could keeping 8MB of data permanently in L3 cache help? If there's no proper way for the data to persist, you'd do this by accessing the data just before it would get evicted. You'd then only need 53MB of reads and 48MB of writes from memory per iteration, lowering the lower bound imposed by the write speed from 2.23 ms/it to 1.67 ms/it per chiplet. This would be at the expense of having to refresh the 8MB of data repeatedly, a minimum of 7 times per iteration under perfect conditions.
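
For anyone wanting to check the arithmetic, here is a minimal Python sketch of the bandwidth-only bounds quoted above. It assumes decimal units (1 GB = 1000 MB) and the same linear scaling from 3200 to 3600 used in the post; the function names are made up for illustration, and the 48MB case simply reuses the post's reduced write figure.

```python
# Bandwidth-only lower bound on iteration time: data moved / bandwidth.
# All figures are the ones quoted in the post; linear scaling with DDR4
# speed is the post's own simplifying assumption, not a measurement.

def scaled_bandwidth_gbs(measured_gbs, measured_speed, target_speed):
    """Scale a measured bandwidth linearly with the memory transfer rate."""
    return measured_gbs * target_speed / measured_speed

def min_ms_per_iter(traffic_mb, bandwidth_gbs):
    """Lower bound in ms/it, using decimal units (1 GB = 1000 MB)."""
    seconds = traffic_mb / (bandwidth_gbs * 1000.0)
    return seconds * 1000.0

read_gbs = scaled_bandwidth_gbs(46.4, 3200, 3600)   # per chiplet
write_gbs = scaled_bandwidth_gbs(25.5, 3200, 3600)  # per chiplet

write_bound = min_ms_per_iter(64, write_gbs)         # 64MB written per iteration
read_bound = min_ms_per_iter(69, read_gbs)           # 69MB read per iteration
pinned_write_bound = min_ms_per_iter(48, write_gbs)  # 48MB written with data held in L3

print(f"write-limited: {write_bound:.2f} ms/it")
print(f"read-limited:  {read_bound:.2f} ms/it")
print(f"write-limited with 8MB held in L3: {pinned_write_bound:.2f} ms/it")
```

Plugging in a measured bandwidth for a different memory speed only requires changing the two `scaled_bandwidth_gbs` calls.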
Old 2019-07-09, 11:56   #60
mackerel
 
Feb 2016
UK

111000000₂ Posts

The 3600 CPU is in my possession now. Be another 6 hours or so before I get home and install it, and start benching.
Old 2019-07-09, 23:23   #61
mackerel
 
Feb 2016
UK

2⁶·7 Posts

So much testing... so much more to do.

Summary test results at https://linustechtips.com/main/topic...umber-finding/

Attempted interpretation on how it relates to PrimeGrid work: http://www.primegrid.com/forum_threa...ap=true#130991

4096k FFT, iter/s for 1, 2 and 6 workers, tested with 29.8b5:
CPU    1 worker  2 workers  6 workers
8086k  187.59    180.16     170.94
3600   309.71    214.21     193.38

5120k FFT:
CPU    1 worker  2 workers  6 workers
8086k  145.74    141.85     135.79
3600   193.86    155.41     149.88

The 8086k seems to run at around 4.3 GHz, not that it matters as it is RAM bandwidth starved. I had dual channel 3000C15 in the tested system.

The 3600 system wouldn't boot with my 4000 rated RAM, even after dropping it down to 3400. I gave up at that point and swapped in 3000C14 modules, which were used for the test. The 3600's clock fluctuated with temperature but was in the region of 3.9 GHz.
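
To make the comparison easier to read, here is a small Python sketch that turns the figures above into 3600-versus-8086k ratios per worker count. The numbers are copied from this post; treating them as directly comparable throughput figures is an assumption, and the names `results` and `speedups` are only for illustration.

```python
# iter/s for 1, 2 and 6 workers, copied from the post above.
results = {
    "4096k": {"8086k": [187.59, 180.16, 170.94], "3600": [309.71, 214.21, 193.38]},
    "5120k": {"8086k": [145.74, 141.85, 135.79], "3600": [193.86, 155.41, 149.88]},
}
workers = [1, 2, 6]

def speedups(fft):
    """Ratio of 3600 to 8086k iter/s for each worker count at one FFT size."""
    ryzen = results[fft]["3600"]
    intel = results[fft]["8086k"]
    return [r / i for r, i in zip(ryzen, intel)]

for fft in results:
    for count, ratio in zip(workers, speedups(fft)):
        print(f"{fft} FFT, {count} worker(s): 3600 is {ratio:.2f}x the 8086k")
```

The single-worker case is where the gap is largest at both FFT sizes, which fits the idea that one worker benefits most from the larger L3.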
Old 2019-07-10, 00:22   #62
M344587487
 
"Composite as Heck"
Oct 2017

2·5²·19 Posts

Exciting. If I'm interpreting the data right, this is nuts. Look at that 4M result: despite the L3 being 2x16MB, a single worker that has to traverse IF is still way quicker than accessing memory. Is the 8086K definitely bandwidth starved? Naively scaling the 9900K result above from 3600 to 3000 gives an upper bound of ~263 it/s. The 9900K has 2 extra cores, meaning 4MB more L3 cache, but is that enough to explain the 70 it/s difference?
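
The naive scaling can be written out as a short Python sketch. It takes the ~3.17 ms/it 9900K figure quoted earlier in the thread and the 187.59 it/s single-worker 8086k result, and assumes throughput scales linearly with memory speed, which is the simplification being made here rather than a measured fact. The function name is hypothetical.

```python
def scale_iter_rate(ms_per_iter, from_speed, to_speed):
    """Convert ms/it to it/s, then scale linearly with memory speed."""
    iters_per_second = 1000.0 / ms_per_iter
    return iters_per_second * to_speed / from_speed

# 9900K at DDR4-3600 (~3.17 ms/it), scaled down to DDR4-3000.
bound = scale_iter_rate(3.17, 3600, 3000)

# Gap to the 8086k single-worker 4096k result from the benchmark post.
gap = bound - 187.59

print(f"scaled 9900K bound: {bound:.0f} it/s, gap to 8086k: {gap:.0f} it/s")
```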
Old 2019-07-10, 03:44   #63
nomead
 
"Sam Laur"
Dec 2018
Turku, Finland

317 Posts

Oh crap. This could mean that the 3900X could be more awesome for LL work than expected because of the larger L3 cache, even when it is divided between two chiplets / four CCXes. The "tail" that now extends to about 3-4M FFT size could then go up to 6-8M, instead of being purely DRAM bandwidth limited as I was expecting, as the inter-core data latency doesn't really change after leaving that one CCX.
Old 2019-07-10, 06:21   #64
maxzor
 
Apr 2017

14₁₆ Posts

Is there a tool for you to compare the average number of RAM calls vs cache calls?

Last fiddled with by maxzor on 2019-07-10 at 06:22
Old 2019-07-10, 06:51   #65
hansl
 
Apr 2019

11001101₂ Posts

Quote:
Originally Posted by maxzor View Post
Is there a tool for you to compare the average number of RAM calls vs cache calls?
On Linux, you can use "perf" to check all kinds of cache information.
There are some broad "cache-references" and "cache-misses" events which can be reported, but also hundreds of other more fine-grained event types; run "perf list" to browse all the possibilities.
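
As a rough illustration, the Python sketch below wraps "perf stat" for those two broad events and computes a miss ratio. It relies on perf's documented `-x,` CSV mode, where the first three fields of each line are the counter value, its unit and the event name; the helper names are invented for this example, and the workload command is whatever benchmark you want to measure.

```python
import subprocess

EVENTS = ["cache-references", "cache-misses"]

def perf_command(workload):
    """Build a `perf stat` command line that emits comma-separated counters."""
    return ["perf", "stat", "-x", ",", "-e", ",".join(EVENTS), "--"] + list(workload)

def parse_perf_csv(text):
    """Map event name -> count from `perf stat -x,` output."""
    counts = {}
    for line in text.splitlines():
        fields = line.split(",")
        # Skip blank lines, comments and "<not counted>"/"<not supported>" rows.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        # perf may append a modifier such as ":u" to the event name; drop it.
        name = fields[2].split(":")[0]
        counts[name] = int(fields[0])
    return counts

def miss_ratio(counts):
    """Fraction of cache references that missed, or None if nothing was counted."""
    refs = counts.get("cache-references", 0)
    if not refs:
        return None
    return counts.get("cache-misses", 0) / refs

def measure(workload):
    """Run the workload under perf; perf stat writes its report to stderr."""
    done = subprocess.run(perf_command(workload), capture_output=True, text=True)
    return parse_perf_csv(done.stderr)
```

Something like `miss_ratio(measure(["./your_benchmark"]))` would then give a single number to compare between runs, where `./your_benchmark` is a placeholder. What these generic events actually count differs between CPUs, so treat the ratio as a relative indicator only.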
Old 2019-07-10, 08:08   #66
mackerel
 
Feb 2016
UK

448₁₀ Posts

Quote:
Originally Posted by M344587487 View Post
Is the 8086K definitely bandwidth starved?
Yes, definitely. Dual channel 3000 RAM isn't enough for even a fast quad core. It would take about double that bandwidth to feed the 8086k well enough to be close to unlimited.

Quote:
Originally Posted by maxzor View Post
Is there a tool for you to compare the average number of RAM calls vs cache calls?
I think I saw a commercial Intel tool that did that, but it is likely to be a case of if you have to ask, you can't afford it. Something for big businesses to pay big money for.


Edit: if anyone has specific test requests, let me know and I will see what I can do. I'm hoping for an updated BIOS to allow the faster RAM to be used, to see how that helps (or not). Beyond that I need to move to more controlled tests.

Last fiddled with by mackerel on 2019-07-10 at 08:13

All times are UTC.



Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
