#56 - mackerel (joined Feb 2016, UK)
Now I'm scared:
Quote: From AMD technical marketing manager https://twitter.com/Thracks/status/1147925341316505602

#57 - maxzor (joined Apr 2017)
Last fiddled with by maxzor on 2019-07-08 at 20:31

#58 - mackerel (joined Feb 2016, UK)
Latency isn't the concern, bandwidth is. I'm going to try to get an early night, so assuming the CPU arrives as scheduled tomorrow I can get out of work earlier and start testing for real.

#59 - "Composite as Heck" (joined Oct 2017)
I've had a bit of a play too, based on the above YouTube video's AIDA64 benchmark of 46.4 GB/s read, 25.5 GB/s write per chiplet using 3200 DDR4, and the 69 MB read / 64 MB write requirements per iteration at 4M. Scaling linearly to 3600, that puts a lower bound on the 4M iteration time of ~2.23 ms/it, limited by the write speed (~1.32 ms/it by the read speed). This 9900K benchmark ( https://www.mersenneforum.org/showpo...&postcount=782 ) using 3600 DDR4 yields ~3.17 ms/it for 4M. Assuming perfect conditions/optimisation, the nerfed write speed isn't a definite limiter; in practice it may be, especially when you consider that we'll be running one job per CCX, so even if optimal writes were possible with only a single job using the pipe, syncing the writes from two jobs may be a step too far.

From a position of ignorance it's hard to guess how much the doubling of cache from 8 MB to 16 MB will help. Judging by the description of 2 passes per iteration it'll at least help a little with cache misses between passes, but is that all? This is probably the dumbest thing I've said all week, but if L3 size is incidental and the write speed is a limiter, could keeping 8 MB of data permanently in L3 cache help? If there's no proper way for the data to persist, you'd do this by accessing the data just before it would get evicted. You'd then only need 53 MB of reads and 48 MB of writes from memory per iteration, lowering the lower bound imposed by the write speed from 2.23 ms/it to 1.67 ms/it per chiplet. This would be at the expense of having to refresh the 8 MB of data repeatedly, a minimum of 7 times per iteration under perfect conditions.
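For anyone who wants to sanity-check that arithmetic, here it is as a rough Python sketch (same assumed figures as above, nothing new measured):

Code:
# Lower-bound iteration time if per-chiplet memory bandwidth is the only limit.
# AIDA64 read/write figures at DDR4-3200, scaled linearly to 3600.
read_bw  = 46.4 * 3600 / 3200   # ~52.2 GB/s
write_bw = 25.5 * 3600 / 3200   # ~28.7 GB/s

def bound_ms(read_mb, write_mb):
    # Whichever of read or write traffic takes longer sets the floor;
    # MB divided by GB/s comes out directly in milliseconds.
    return max(read_mb / read_bw, write_mb / write_bw)

print(bound_ms(69, 64))  # ~2.23 ms/it for 4M, write-limited
print(bound_ms(53, 48))  # ~1.67 ms/it with 8 MB kept resident in L3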

#60 - mackerel (joined Feb 2016, UK)
The 3600 CPU is in my possession now. It'll be another 6 hours or so before I get home to install it and start benching.

#61 - mackerel (joined Feb 2016, UK)
So much testing... so much more to do.
Summary test results at https://linustechtips.com/main/topic...umber-finding/
Attempted interpretation of how it relates to PrimeGrid work: http://www.primegrid.com/forum_threa...ap=true#130991

iter/s for 1, 2 and 6 workers, tested with 29.8b5:

Code:
4096k FFT:
8086K  187.59  180.16  170.94
3600   309.71  214.21  193.38

5120k FFT:
8086K  145.74  141.85  135.79
3600   193.86  155.41  149.88

The 8086K seems to run around 4.3 GHz, not that it matters as it is RAM bandwidth starved; I had dual-channel 3000C15 in the tested system. The 3600 system wouldn't boot with my 4000-rated RAM, even after dropping it down to 3400. I gave up at that point and swapped in 3000C14 modules, which were used for the test. The 3600's clock fluctuated with temperature but was in the region of 3.9 GHz.
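For easier comparison with the per-iteration estimates earlier in the thread, a trivial conversion of the single-worker numbers to ms/it (note these runs used 3000 memory, so they're not directly comparable to the 3600-scaled bound above):

Code:
# Single-worker iter/s from the table above, converted to ms per iteration.
results = {"8086K 4096k": 187.59, "3600 4096k": 309.71,
           "8086K 5120k": 145.74, "3600 5120k": 193.86}
for label, ips in results.items():
    print(f"{label}: {1000 / ips:.2f} ms/it")
# e.g. 3600 4096k -> ~3.23 ms/it, 8086K 4096k -> ~5.33 ms/it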

#62 - "Composite as Heck" (joined Oct 2017)
Exciting. If I'm interpreting the data right this is nuts: look at that 4M result, despite the L3 being split into 2x16MB, a single worker that has to traverse IF is still way quicker than going out to memory. Is the 8086K definitely bandwidth starved? Naively scaling the 9900K result above from 3600 to 3000 gives an upper bound of ~263 it/s; it has 2 extra cores, meaning 4MB more L3 cache, but is that enough to explain the ~70 it/s difference?
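The naive scaling is just this (a quick sketch; the 3.17 ms/it figure is the 9900K benchmark linked earlier):

Code:
# Naive memory-clock scaling of the linked 9900K 4M result from DDR4-3600 to 3000.
it_s_3600 = 1000 / 3.17              # ~315 it/s at 3600
it_s_3000 = it_s_3600 * 3000 / 3600  # ~263 it/s upper bound at 3000
print(it_s_3000)
print(it_s_3000 - 187.59)            # gap to the 8086K's measured 187.59 it/s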

#63 - "Sam Laur" (joined Dec 2018, Turku, Finland)
Oh crap. This could mean that the 3900X will be more awesome for LL work than expected because of the larger L3 cache, even when it's divided between two chiplets / four CCXs. The "tail" that now extends to about 3-4M FFT sizes could then go up to 6-8M, instead of the chip being purely DRAM bandwidth limited as I was expecting, since the inter-core data latency doesn't really change after leaving that one CCX.

#64 - maxzor (joined Apr 2017)
Is there a tool that would let you compare the average number of RAM accesses vs. cache accesses?
Last fiddled with by maxzor on 2019-07-10 at 06:22

#65 (joined Apr 2019)
With Linux perf there are some broad "cache-references" and "cache-misses" events which can be reported, but also hundreds of other, more fine-grained event types; run "perf list" to browse all the possibilities.
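As a concrete example, a rough sketch assuming Linux with perf installed, permission to profile other processes, and an already-running worker whose PID you know (the 10-second window is arbitrary):

Code:
# Attach perf to a running process for 10 s and report the broad cache-miss
# ratio from the two generic events mentioned above.
import re, subprocess

def cache_miss_ratio(pid, seconds=10):
    cmd = ["perf", "stat", "-e", "cache-references,cache-misses",
           "-p", str(pid), "--", "sleep", str(seconds)]
    out = subprocess.run(cmd, capture_output=True, text=True).stderr  # perf stat prints its counts to stderr
    counts = {}
    for line in out.splitlines():
        m = re.match(r"\s*([\d,]+)\s+(cache-references|cache-misses)", line)
        if m:
            counts[m.group(2)] = int(m.group(1).replace(",", ""))
    return counts["cache-misses"] / counts["cache-references"]

print(cache_miss_ratio(12345))  # replace 12345 with the mprime/Prime95 worker PID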

#66 - mackerel (joined Feb 2016, UK)
Yes, definitely. Dual-channel 3000 RAM isn't enough for even a fast quad core; it would take about double that to feed the 8086K enough to be close to unlimited.
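Back of the envelope (theoretical peak only, assuming the standard 64-bit DDR4 channel; real sustained bandwidth is a fair bit lower):

Code:
# Theoretical peak bandwidth for dual-channel DDR4 (64-bit = 8 bytes per channel).
def ddr4_peak_gbs(mt_s, channels=2, bytes_per_transfer=8):
    return channels * bytes_per_transfer * mt_s / 1000  # GB/s

print(ddr4_peak_gbs(3000))  # ~48 GB/s: dual-channel 3000 as tested
print(ddr4_peak_gbs(6000))  # ~96 GB/s: "about double that"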
Edit: if anyone has specific test requests, let me know and I will see what I can do. I'm hoping for an updated BIOS to allow the faster RAM to be used, to see how that helps (or not). Beyond that I need to move to more controlled tests.

Last fiddled with by mackerel on 2019-07-10 at 08:13