mersenneforum.org
Old 2019-06-03, 02:53   #45
jasonp

Quote:
Originally Posted by lavalamp
There was an additional question that someone posed earlier that I don't think was addressed. In GPUs the architecture seems to be such that some number of single-precision floating-point units (4?) are combined, perhaps power-ranger style, to compute with double precision. Could it not also be the case that mighty morphing x86 architecture could do something similar and combine areas of the FPU for doubles to also compute quads?

This seems like a fairly sensible way to include quad-precision support without adding a ton of extra silicon devoted specifically to it. And hey, if they wanted to include all the same FMA and vectorisation support for quads too, that'd be excellent. Even if quad FLOPS were 1/4 (or less) of double FLOPS, that would still be perfectly acceptable.

The registers could still be kept 64 bits wide if the 128-bit floats were simply given and returned as an upper and lower half. It would just mean that the CPU instruction would need to be given 6 arguments instead of 3 for a mul.
Computer architecture is hard because every feature implemented in hardware becomes hugely faster, but in the process slows everything else down a little bit, so you don't get many choices for "hugely faster". Native support for a 128-bit mantissa would probably be 10-20x faster than synthesizing a double-double operation, especially since it could be pipelined in hardware, but would you sacrifice 5% of your clock speed or 10% of your single-precision throughput for it?
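
For concreteness, "synthesizing a double-double operation" in software looks roughly like the following: the standard error-free-product trick built on FMA. This is a minimal C sketch under the usual double-double conventions (function names are illustrative, and normalization corner cases are glossed over); note how one quad-like multiply costs a chain of dependent double operations that native hardware could collapse into a single pipelined instruction.

Code:
#include <math.h>

/* Error-free product: hi + lo == a * b exactly, using one FMA
   to recover the rounding error of the double multiply. */
static void two_prod(double a, double b, double *hi, double *lo) {
    *hi = a * b;
    *lo = fma(a, b, -*hi);
}

/* Product of two double-double numbers (ahi + alo) * (bhi + blo),
   carrying roughly a 106-bit mantissa. */
static void dd_mul(double ahi, double alo, double bhi, double blo,
                   double *chi, double *clo) {
    double p, e;
    two_prod(ahi, bhi, &p, &e);
    e += ahi * blo + alo * bhi;   /* low-order cross terms */
    *chi = p + e;                 /* renormalize into hi/lo parts */
    *clo = e - (*chi - p);
}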
Old 2019-06-03, 08:28   #46
mackerel

Why would everything else get slower? To me, the biggest reason for not adding every possible feature is die area, and thus manufacturing cost.
Old 2019-06-03, 12:58   #47
retina

Quote:
Originally Posted by mackerel
Why would everything else get slower?
Because of heat and distance. There are trade-offs.
Old 2019-06-04, 08:52   #48
mackerel

If the trade-off is that clocks might have to be reduced to control heat while that instruction is in use, I think that's an acceptable trade-off, similar to what we have with AVX-512 now. Other things would probably be unaffected.

I'm not sure latency is really significant in this context.

I still think die area, yields, and manufacturing cost are a bigger factor than either of the above.
Old 2019-06-04, 13:51   #49
retina

Quote:
Originally Posted by mackerel
Latency, in this context, I'm not sure is really significant.
If you want to add more transistors then the die gets larger, so signals take longer to travel from one end of the chip to the other. That means adding an extra delay cycle and extra buffers, or slowing down the maximum clock speed, or both. This would affect all operations of the chip, regardless of which instructions are being executed.

This is why there are various levels of caching: L1 is the closest and the smallest, so it is the fastest. If you push L1 further away because you have put more computation transistors in there, then you have to slow things down. The same goes for access to the register file: it takes longer to send and receive data when it is further from the action.
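
The effect is visible from software. As a rough illustration (a minimal C sketch, not a rigorous benchmark), chase pointers through a working set of increasing size: the nanoseconds per load step up each time the set spills out of a closer, smaller cache into a farther, larger one.

Code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Walk a random single-cycle permutation so every load depends on the
   previous one and the prefetcher cannot hide the latency. */
static double ns_per_load(size_t n) {
    size_t *next = malloc(n * sizeof *next);
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {   /* Sattolo shuffle: one big cycle */
        size_t j = rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    const long iters = 20000000;
    size_t idx = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) idx = next[idx];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(next);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / iters + (idx & 1) * 1e-12;  /* keep idx live */
}

int main(void) {
    for (size_t kib = 16; kib <= 32768; kib *= 4)
        printf("%6zu KiB working set: %.2f ns/load\n",
               kib, ns_per_load(kib * 1024 / sizeof(size_t)));
    return 0;
}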
Old 2019-06-04, 22:37   #50
mackerel

Just how much more die area are you thinking this could add? My question is specifically about the significance. I'd still rate it far lower among the factors than simple area translating into cost.
Old 2019-06-04, 22:45   #51
retina

Quote:
Originally Posted by mackerel
Just how much more die area are you thinking this could add? My question is specifically about the significance. I'd still rate it far lower among the factors than simple area translating into cost.
I'm not privy to the internals of an x86 core and its relative execution unit sizes so I couldn't say. But everything has a trade-off. It wouldn't come for free, either in money terms or in speed/wattage terms.

As I mentioned above, it isn't impossible to do. Just show them a positive ROI and it can happen.
Old 2019-06-04, 23:00   #52
Mysticial

Quote:
Originally Posted by mackerel
Just how much more die area are you thinking this could add? My question is specifically about the significance. I'd still rate it far lower among the factors than simple area translating into cost.
Naively extrapolating, let's say we add a "vfmaddpq zmm, zmm, zmm" instruction that does 4 quad-precision FMAs on 128-bit lanes.
  • A fully pipelined execution unit for that would probably be around 2x the size of the current one. Maybe only 1.5x, with added latency, if they do some sort of Karatsuba split (see the sketch after this list).
  • Even with 2x the area for quad-precision, it would only add maybe 10% to the total area of the Skylake core, since the rest of the core is still huge.
  • If fully pipelined, the power consumption of the unit itself will scale accordingly, since every part of it would be active at all times. So there's a real possibility that you would need to throttle clock speeds to keep the same TDP.
  • A 2x increase in area would mean a ~sqrt(2) increase in the distances data travels on and around the FMA units. That could mean longer latencies and bypass delays for everything, including the legacy stuff.
  • 128-bit execution lanes will probably complicate much of the wiring logic, since nothing currently crosses a 64-bit boundary except for the shuffle and load/store units.
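
The "Karatsuba split" mentioned above builds one wide multiply out of three half-width multiplies instead of the schoolbook four, trading multiplier area for extra adders and a longer critical path. A minimal C sketch of the decomposition at integer scale (the function name is illustrative; in hardware the same trick would be applied to the mantissa datapath):

Code:
#include <stdint.h>
#include <stdio.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension */

/* Karatsuba: a 64x64 -> 128-bit product from three 32x32 multiplies.
   mid = (a0+a1)*(b0+b1) - a0*b0 - a1*b1 == a0*b1 + a1*b0 */
static u128 mul64_karatsuba(uint64_t a, uint64_t b) {
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;
    u128 p00 = (u128)a0 * b0;
    u128 p11 = (u128)a1 * b1;
    u128 mid = (u128)(a0 + a1) * (b0 + b1) - p00 - p11;
    return (p11 << 64) + (mid << 32) + p00;
}

int main(void) {
    uint64_t a = 0x123456789abcdef0ULL, b = 0xfedcba9876543210ULL;
    printf("match: %d\n", mul64_karatsuba(a, b) == (u128)a * b);
    return 0;
}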
Old 2019-06-04, 23:45   #53
ewmayer

@Mysticial -- Also, if you're a hardware vendor looking to add silicon, I expect you'd have a lot more customers interested in adding more 64-bit FMA units than in adding ones with support for 128-bit floats. You could likely double the number of FMA units for a smaller area/speed hit than 128-bit-ifying the current units.
Old 2019-06-05, 00:00   #54
Mysticial

Quote:
Originally Posted by ewmayer
@Mysticial -- Also, if you're a hardware vendor looking to add silicon, I expect you'd have a lot more customers interested in adding more 64-bit FMA units than in adding ones with support for 128-bit floats. You could likely double the number of FMA units for a smaller area/speed hit than 128-bit-ifying the current units.
It's kinda already happening, but with the 8-bit and 16-bit stuff for deep learning.

As a lesser version of the quad-precision requests, other bignum people have been asking for the SIMD unit to be widened to do full 64-bit multiplies. Not surprisingly, these fell on deaf ears, because widening the multiplier from 52x52 to 64x64 is a ~50% increase in area (64²/52² ≈ 1.5).

But the request that did make sense was to expose the existing 52-bit multiplier: no real new silicon, and it resulted in AVX512-IFMA.
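
For reference, AVX512-IFMA exposes that 52-bit multiplier through the _mm512_madd52lo_epu64 / _mm512_madd52hi_epu64 intrinsics. A minimal sketch of one limb-product step (needs an IFMA-capable CPU such as Cannon Lake or Ice Lake; compile with -mavx512ifma):

Code:
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Eight independent 52x52-bit limb products per instruction. */
    __m512i b = _mm512_set1_epi64((1ULL << 51) + 12345);
    __m512i c = _mm512_set1_epi64((1ULL << 50) + 67890);
    __m512i z = _mm512_setzero_si512();

    /* Accumulator + low/high 52 bits of the 104-bit product b*c (acc = 0 here). */
    __m512i lo = _mm512_madd52lo_epu64(z, b, c);
    __m512i hi = _mm512_madd52hi_epu64(z, b, c);

    uint64_t lo_s[8], hi_s[8];
    _mm512_storeu_si512((void *)lo_s, lo);
    _mm512_storeu_si512((void *)hi_s, hi);
    printf("lo52 = 0x%013llx, hi52 = 0x%013llx\n",
           (unsigned long long)lo_s[0], (unsigned long long)hi_s[0]);
    return 0;
}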
Old 2019-06-05, 07:43   #55
mackerel

Thanks Mysticial, 10% sounds like a fair bit.

I was wondering whether we have an existing example in the addition of AVX-512 to Skylake-X/SP, especially the two-FMA-unit models. But a comparison is complicated, since those parts also differ in cache and IMC. Still, with my single sample of Skylake-X I can't say the clocks or general performance are impacted compared to Skylake, even if Skylake-X is more comparable to Kaby Lake in process.

I am curious what Intel did with Ice Lake, and it would be interesting to put that against Zen 2, although I really don't want another laptop just to do that testing.