Quote:
 Originally Posted by mackerel I had observed that for Prime95, LLR and similar FMA small FFT testing (so ram bandwidth was not a limit), Skylake was consistently 14% faster than Haswell after normalising clock. I did see that some cycles were reduced and wondered if that was a contributor. I never figured out why Broadwell was 6% slower than Haswell though. The L4 of that came in handy for large FFTs.
My guess is that this is all about mismatching latencies and pipeline bubbles.

When the same execution port has to handle instructions of different latencies, there can be problems on the write back - IOW, the pipeline sorta gets messed up.

On Skylake, everything is 4 cycles. No issues here.

On Haswell, adds are 3 and mul/FMA are 5 cycles. They share one of the ports. But IIRC, George mentioned pipeline bubbles from this. So he changed the adds to FMAs with a 1 multiplier. This normalizes everything to 5 cycles - no more issues.

On Broadwell, adds are 3, but multiply has also been reduced to 3. FMA remains at 5. So unless George updated his code to account for that, you now have a mix of "5 cycle" adds, 5 cycle FMAs, and 3 cycle multiplies - thus mismatching latencies and pipeline conflicts.

For most software that isn't that tightly optimized, reducing the latency of something will be a benefit. But perhaps not in this case.

In other words, with Skylake Intel knowingly made the trade-off of increasing the FP-add latency from 3 to 4 cycles to normalize it with everything else.
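To make the write-back argument concrete, here is a toy sketch in Python. It is not a model of Haswell, Broadwell or Skylake; it assumes a single port that can issue one independent op per cycle and accept one result per cycle, and it stalls the issue whenever the result would land on a write-back cycle that is already taken. The function name `issue_cycles` and the op mixes are made up for illustration.

```python
def issue_cycles(latencies):
    """Cycles needed to issue a stream of independent ops on one toy port.

    Assumption: an op issued at cycle t with latency L writes back at t + L,
    and only one result may write back per cycle. A clash delays the issue.
    """
    busy_writeback = set()  # cycles whose write-back slot is already taken
    cycle = 0
    for lat in latencies:
        while cycle + lat in busy_writeback:
            cycle += 1  # pipeline bubble: wait for a free write-back slot
        busy_writeback.add(cycle + lat)
        cycle += 1  # the port can issue the next op on the following cycle
    return cycle


uniform = [5] * 8          # everything normalized to one latency
mixed = [5, 5, 3, 3] * 2   # short ops issued right behind long ones

print(issue_cycles(uniform))  # 8: no bubbles
print(issue_cycles(mixed))    # 12: four bubbles from write-back clashes
```

In this toy model the uniform stream never stalls, while the mixed stream loses cycles every time a 3-cycle op follows a 5-cycle op closely enough to want the same write-back slot. Real schedulers are far more sophisticated, so treat this only as an illustration of why uniform latencies can help.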

 2019-05-30, 22:26 #24 retina Undefined "The unspeakable one" Jun 2006 My evil lair 2^3×701 Posts
So while the uses mentioned here for QP are legitimate, they aren't exactly killer apps, or in widespread use. You'll need to find a better argument to convince Intel/AMD. Show them the positive ROI.

I watched an animated movie recently, Ralph Breaks the Internet. Which is quite good actually. The level of detail and accuracy of rendering are particularly stunning. Really good texturing without any tearing or discontinuities. Long distance zooms and antialiasing and whatnot are flawless. And it is all done using nothing more than standard IEEE64 floats. QP not required.

On the other side of the coin, I've seen Mandelbrot deep zooms that would quickly become nonsense after a few frames if all that was available was DP floats. But QP wouldn't help much either. Nothing but full-on arbitrary precision floats of at least 3k+ bits is needed there.
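A rough way to see the deep-zoom point: once the spacing between neighbouring pixels drops below the spacing between representable floats near the zoom centre, adjacent pixels collapse onto the same coordinate. The sketch below assumes a zoom centre of -0.75 and pixel spacings of 2^-60 and 2^-120, all picked purely for illustration; `ulp` and `resolvable` are hypothetical helper names, and 53 and 113 are the significand widths of IEEE double and quad precision.

```python
import math
from fractions import Fraction


def ulp(x, significand_bits):
    """Spacing between adjacent floats near x for a given significand width."""
    _, e = math.frexp(x)  # x = m * 2**e with 0.5 <= |m| < 1
    return Fraction(2) ** (e - significand_bits)


def resolvable(center, pixel_spacing, significand_bits):
    """True if neighbouring pixels can map to distinct coordinates."""
    return pixel_spacing >= ulp(center, significand_bits)


center = -0.75                      # illustrative zoom centre
moderate = Fraction(1, 2 ** 60)     # illustrative pixel spacing
deep = Fraction(1, 2 ** 120)        # a much deeper zoom

print(center + 2.0 ** -60 == center)    # True: DP cannot see the next pixel
print(resolvable(center, moderate, 53))   # False: double precision
print(resolvable(center, moderate, 113))  # True: quad precision
print(resolvable(center, deep, 113))      # False: quad runs out too
print(resolvable(center, deep, 3072))     # True: ~3k-bit significand
```

This only checks coordinate resolution; the iteration itself accumulates rounding error as well, so real deep-zoom renderers need headroom beyond what this estimate suggests.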
 2019-05-31, 11:48 #25 joblack Oct 2008 n00bville 5^2×29 Posts
In the AMD keynote I have heard they doubled the floating point performance. Won't that impact the Prime95 performance? Or am I completely off the track?

Last fiddled with by joblack on 2019-05-31 at 11:49
Quote:
 Originally Posted by joblack In the AMD keynote I have heard they doubled the floating point performance. Won't that impact the Prime95 performance?
It should, indeed, impact Prime95 performance, as rumors have long said that the doubling will be in AVX. Combine that with the better memory performance, and we might have a winner. But, as always, better wait for real benchmarks instead of just marketing promises...

 2019-05-31, 13:37 #27 mackerel Feb 2016 UK 2^2·5·19 Posts
I intend to buy one as soon as practical to test. My uses are more with LLR and relatively smaller FFT sizes, so that should be run out of the 32MB L3 cache and not be ram limited. It will be interesting... if 32MB is not sufficient then ram will probably remain a limiting factor. Dual channel even at somewhat higher speeds is unlikely to be enough. The 64MB of the 12 core model might get around that, but we have two chiplets inside that and we don't know how well they will communicate with each other.
Quote:
 Originally Posted by mackerel I guess I'm looking at it more from a user perspective. How fast is it running something I want to do? I certainly do intend to buy a Zen 2, 8 core CPU as soon as practical and start benching it. It will be now even more interesting if there is a relative speedup in x87 tasks compared to Zen(+).
I am contemplating replacing my quadcore Haswell with an octocore *Zen ... was thinking maybe a budget Ryzen system ... where is that CPU family amongst the recent and upcoming AMD offerings, when are the latest chips hitting market, and should I wait until then to decide? (The decision tree being essentially: If new CPUs appreciably less than 2x LL-crunching throughput, go with the older; otherwise do a detailed FLOPS/$ comparison based on hardware/electricity costs over the expected typical-system lifetime). Should we expect cost of the older CPU models to drop appreciably once the latest-greatest hit the stores?

Last fiddled with by ewmayer on 2019-05-31 at 20:01

 2019-05-31, 22:42 #29 mackerel Feb 2016 UK 2^2·5·19 Posts
The next gen Ryzens are expected next month, although your guess is as good as any on what real availability will be like. I do intend to get one quickly. I'd say it is worth the wait to find out, unless you absolutely have to get something today.

I doubt the existing models will see much if any further price drop. They have already fallen significantly over their marketing life. I would hope they're not over-producing them, so no fire sales to clear inventory.

 2019-05-31, 23:43 #30 Mysticial Sep 2016 7×47 Posts
I was planning an mATX Zen 2 build and went so far as to start picking a case, a PSU, price watching ram, and matching RGB components. But I was targeting the 16-core Zen 2... Seeing as how AMD is holding back the 16-core, I guess I'm scrapping those plans for now.

It would be a good compilation box and non-AVX512 workhorse. And I'm really curious to see exactly how bad the memory bottleneck will be. I imagine they will wreak as much havoc on my superoptimizer tuning tables as Skylake X did 2 years ago. Only this time, there's a lot less room to optimize since I "used it all up" in response to Skylake X already.
Last fiddled with by Mysticial on 2019-06-01 at 00:01

 2019-06-01, 17:30 #31 M344587487 "Composite as Heck" Oct 2017 3·199 Posts
Quote:
 Originally Posted by mackerel I intend to buy one as soon as practical to test. My uses are more with LLR and relatively smaller FFT sizes, so that should be run out of the 32MB L3 cache and not be ram limited. It will be interesting... if 32MB is not sufficient then ram will probably remain a limiting factor. Dual channel even at somewhat higher speeds is unlikely to be enough. The 64MB of the 12 core model might get around that, but we have two chiplets inside that and we don't know how well they will communicate with each other.
I'm not fully up to date but don't believe cache hierarchy has been detailed in depth. If zen/zen+ is anything to go by it's likely that the quad core CCX remains and cache access is split by CCX into 16MB chunks. At best in a situation where you rely on inter-chiplet or inter-CCX communication you may be limited by the higher bandwidth of the infinity fabric instead of RAM bandwidth. At worst communication may be poor enough that straddling CCX's on the same die is not viable let alone across chiplets (that's where my money is unfortunately). If these topology assumptions are wrong all bets are off.

Quote:
 Originally Posted by ewmayer I am contemplating replacing my quadcore Haswell with an octocore *Zen ... was thinking maybe a budget Ryzen system ... where is that CPU family amongst the recent and upcoming AMD offerings ...
The 1700 zen won't be an upgrade for LL but is good for less optimised tasks that can take advantage of 8 cores of general compute. The 2700 zen+ is a good iteration on the 1700 (tl;dr ~300MHz clock bump, better RAM speed support, OTOH 12 cycle L2 latency vs 17 for the 1700) but depending on use case it may not be a huge bump.

Quote:
 Originally Posted by ewmayer , when are the latest chips hitting market, and should I wait until then to decide?
All things point to a release close to 2019-07-07. E3 is in a week or so in which Navi will be the main focus, but the next consoles will be using zen2 too and it's likely they'll reveal a little more general information around then.

Quote:
 Originally Posted by ewmayer (The decision tree being essentially: If new CPUs appreciably less than 2x LL-crunching throughput, go with the older;
Unless the doubling of L3 cache per core, along with much better RAM speed support, works a miracle, the new CPUs won't be able to achieve doubled LL throughput. AFAIK dual channel RAM is still a limiter. The doubled AVX2 is great, but instead of more throughput it means you can go for the cheaper end of the stack to optimally saturate what is likely a RAM bottleneck.

Quote:
 Originally Posted by ewmayer otherwise do a detailed FLOPS/$ comparison based on hardware/electricity costs over the expected typical-system lifetime).
Efficiency will IMO be the killer feature of zen2 for LL; unless the doubled FP units take a massive toll, I'd go so far as to say game changer.
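For anyone wanting to run the FLOPS/$ comparison ewmayer describes, a minimal sketch might look like the following. Every number in it is a placeholder, not a measurement or a real price, and the function names `lifetime_cost` and `throughput_per_dollar` are made up here. It assumes the machine runs flat out around the clock for its whole lifetime.

```python
HOURS_PER_YEAR = 8760  # 24 * 365, assuming the box crunches non-stop


def lifetime_cost(hardware_cost, watts, years, price_per_kwh):
    """Hardware price plus electricity over the system lifetime."""
    kwh = watts / 1000.0 * HOURS_PER_YEAR * years
    return hardware_cost + kwh * price_per_kwh


def throughput_per_dollar(throughput, hardware_cost, watts, years, price_per_kwh):
    """Throughput (any consistent unit, e.g. iterations/sec) per dollar spent."""
    return throughput / lifetime_cost(hardware_cost, watts, years, price_per_kwh)


# Placeholder figures only: an "older" and a "newer" system over 3 years.
older = throughput_per_dollar(100.0, 500.0, 120.0, 3.0, 0.10)
newer = throughput_per_dollar(150.0, 800.0, 90.0, 3.0, 0.10)
print("older:", older)
print("newer:", newer)
```

The longer the expected lifetime and the higher the electricity price, the more the wattage term dominates over the purchase price, which is why efficiency matters so much for always-on crunching.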

Quote:
 Originally Posted by ewmayer Should we expect cost of the older CPU models to drop appreciably once the latest-greatest hit the stores?
Ryzen are already competitively priced but modest cuts seem likely. The lead-up to a new release and the period soon after tend to yield some good deals for prior AMD hardware; I'm convinced that's part of a marketing tactic to get eyeballs on Ryzen. It's interesting trying to predict what will happen to the lineup as there are some factors with plenty of unknown variables:
• 1700 are still being produced in numbers because Epyc is still on zen, skipping zen+ to avoid the need for validation and reflecting the slower rate of change for the server market. Once Epyc 2 is widely available Epyc 1 demand should drop like a stone with supply not long after. It'll still be around short term but AMD desperately needs market share and will likely price Epyc 2 aggressively enough that Epyc 1 is more pushed off a cliff than gracefully retired
• There is still a restrictive quota of chips that AMD must purchase from Global Foundries of unknown severity. That's why in the past (and to a lesser extent currently), bottom of the barrel products are filled with craptastic old APUs no one with any sense would knowingly purchase if performance or efficiency is a factor.
• zen2 levels the playing field somewhat for gamers and single thread performance, but for most general use zen+ is hard to argue with. Pricing for Ryzen 3000 may not be as aggressive as it could be as the "but muh FPS" crowd has had 10 years getting gouged by intel, all AMD need to do is price "well enough" to be attractive and let the chips fall where they may. zen+ may end up being the value king for some time after launch for general use
• Supply: if it's in any way inadequate, Ryzen pricing is likely to suffer, as Epyc likely has first dibs on the chips
• How the binned chips break down and are divvied out and how the yields shake out. Epyc 2 likely gets the cream of all the good 8 core dies; last I heard, consoles get the dregs, with potentially two 4 core dies per device. It may be that the intent is for Ryzen to mainly be supplied by 6 core dies at least in the short term, as that is the most readily-available bin once the other markets take their fill. That could in part explain why the 16 core Ryzen has been held back, and may make the 12 core a more interesting proposition down the line if the 8 core part remains supply-constrained
• intel haven't had a viable competitor on all fronts in a long time. How they'll price their 14nm+++++++ once zen2 has gained traction is anyone's guess. They are in such a dominant position that they could do nothing and just milk what they can while they can; equally, they could drop prices until competitive and still make a healthy profit. Supply is a big question mark, as for a year now there's been talk of intel not being able to produce 14nm quickly enough (some fabs geared for a failing 10nm sitting essentially idle, increased demand, outsourcing some of their chipsets and using older 28nm nodes to alleviate 14nm supply issues). What benefit is there to dropping prices significantly if they're selling out as is? It may take a while of zen2 being competitive before it makes enough of a dent in demand for intel chips for them to need to respond

Quote:
 Originally Posted by retina So while the uses mentioned here for QP are legitimate, they aren't exactly killer apps, or in widespread use. You'll need to find a better argument to convince Intel/AMD. Show them the positive ROI.
Supposing we lived in a world without double precision implemented in main-stream CPUs. What (convincing) argument could be made to implement a higher precision format?

Of course doubles have a higher precision, but the counterpoint is that single precision floats give you 7 decimal digits, which on the face of it does seem rather a lot; very few constants are measured to that level of precision. Two that spring to mind are the speed of light (defined) and Planck's constant, but neither is exactly in common use.

tl;dr What is the "killer app" for doubles?
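To put a number on the "7 decimal digits" figure, here is a small sketch that rounds a value to single precision with Python's `struct` module (the helper name `to_f32` is just for illustration). The speed of light in m/s has 9 significant digits, so it does not survive the round trip, while a double holds it exactly.

```python
import struct


def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


c = 299792458.0  # speed of light in m/s, exact by definition

print(to_f32(c))                  # 299792448.0: off by 10 m/s in single
print(c == 299792458)             # True: a double stores it exactly
print(to_f32(16777217.0))         # 16777216.0: 2**24 + 1 is not representable
```

Floats near 3e8 are spaced 32 apart in single precision, so anything with more than about 7 significant digits gets rounded.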

Quote:
 Originally Posted by lavalamp tl;dr What is the "killer app" for doubles?
I suspect it was finance. Before 64-bit integer CPUs were introduced "everyone" used the FPU to compute their millions. Excel is a good example of this.

And yes, rounding problems were rife. But how else to handle fractional percentages with only integer arithmetic?
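A short sketch of both sides of that, with made-up amounts: binary floats cannot represent most decimal fractions exactly, whereas integer arithmetic can handle a fractional percentage if amounts are kept in cents and rates in basis points (hundredths of a percent). The helper `interest_cents` and its round-half-up rule are assumptions for illustration, not how any particular product does it.

```python
from decimal import Decimal

# Binary floating point: the classic rounding surprise.
print(0.1 + 0.2 == 0.3)  # False
print(Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))  # True


def interest_cents(principal_cents, rate_basis_points):
    """Interest in whole cents using only integers, rounding half up.

    One basis point is 0.01%, so the rate is divided by 10_000.
    """
    return (principal_cents * rate_basis_points + 5_000) // 10_000


# 1.25% of $1,000,000.00, computed without any floating point.
print(interest_cents(100_000_000, 125))  # 1250000 cents, i.e. $12,500.00
```

The integer version needs enough bits for the intermediate product, which is where the lack of 64-bit integers would have hurt on older hardware.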

Last fiddled with by retina on 2019-06-01 at 20:15

