mersenneforum.org > Great Internet Mersenne Prime Search > Hardware
Old 2019-05-30, 16:14   #12
kriesel

Quote:
Originally Posted by retina
You'd need to convince Intel/AMD of the need for 128 bit values.
Maybe convince someone influential at a three-letter agency that it's vital for breaking encryption. Then the agency persuades the chip makers.
Old 2019-05-30, 16:36   #13
retina

Quote:
Originally Posted by kriesel
Maybe convince someone influential at a three-letter agency that it's vital for breaking encryption.
Yeah, but it isn't, is it.

Nobody breaks encryption nowadays anyway. It's all about the metadata. And five dollar wrenches.

But actually what are some real uses for quad precision? I mean really, where are they required?
Old 2019-05-30, 17:18   #14
Mysticial

Quote:
Originally Posted by retina
Yeah, but it isn't, is it.

Nobody breaks encryption nowadays anyway. It's all about the metadata. And five dollar wrenches.

But actually what are some real uses for quad precision? I mean really, where are they required?
It would make GIMPS more efficient by reducing the memory (and bandwidth) requirements for LL testing.
Old 2019-05-30, 17:28   #15
kriesel

Quote:
Originally Posted by retina
Yeah, but it isn't, is it.
Efficacy toward the stated purpose is not a requirement of a government program or position. "However, as those facts did not support Bomber Commands’ philosophy, they were suppressed at the time." http://www.reformationsa.org/index.php/history/174-the-bombing-of-cities-in-world-war-ii
Old 2019-05-30, 17:48   #16
mackerel

Quote:
Originally Posted by retina
But actually what are some real uses for quad precision? I mean really, where are they required?
I can't say "why", only give one use case mentioned earlier: Genefer. I don't understand how it works, any more than I do the coding talk here. The software has implemented different paths using different instructions sets. Most are similar in their test limit, but x87 goes much further, presumably from the increased numerical accuracy. It is also the slowest so a last resort when none of the faster ones still work.

If FP128 were a thing, presumably it would have still higher limits. And if implemented in a SIMD format, all the better. I found it interesting that, when comparing x87 tasks between Ryzen and Intel, Ryzen's throughput was about half Intel's. I do wonder if x87 is somehow tied into their AVX2/FMA FPU units, whose throughput is similarly about half of Intel's, and if Zen 2's implementation will catch up with consumer Intel CPUs, as would be expected from that.

Edit: random thought: would there be an advantage for our use cases if more bits were put towards the number part (the significand) as opposed to the exponent? So still 64-bit overall, but the question is where to put the bits.

Last fiddled with by mackerel on 2019-05-30 at 17:50
Old 2019-05-30, 18:07   #17
Nick

Quote:
Originally Posted by retina
Nobody breaks encryption nowadays anyway. It's all about the metadata. And five dollar wrenches.
Yes and no - things like side channels, fault injection etc. are still important.
Old 2019-05-30, 18:23   #18
Mysticial

Quote:
Originally Posted by mackerel
I found it interesting when comparing x87 tasks between Ryzen and Intel, the Ryzen throughput was about half that of Intel.
Ryzen's x87 latencies are overall higher than Intel's. AMD has always lagged in x87. But I doubt they really tried to make it super-efficient. All new programs would be using at minimum scalar SSE, and all legacy programs written with x87 were meant for very old processors and would have no problem on a modern processor anyway.

The only thing a super-efficient x87 FPU would get you is an amazing SuperPi or PiFast benchmark - and those aren't really used that much for benchmarking anymore.

Quote:
I do wonder if x87 is somehow tied into their AVX2/FMA FPU units which is similarly about half the potential, and if Zen 2's implementation will catch up with consumer Intel CPUs as expected from that.
I mentioned this in an earlier post, but I suspect the exact opposite for Intel.
  • On Skylake: All SIMD instructions that use the FMA unit (including all floating-point) have 4 cycle latency. (excluding bypass delays)
  • On most Intel going back many years: The x87 "fadd" and "fmul" had 3 and 5 cycle latencies respectively.
  • On Haswell and earlier: SIMD FP-add/mul was also 3 and 5 cycles respectively.
  • Prior to Skylake, Intel had separate execution units for SIMD FP-add and SIMD FP-mul.
  • On Skylake: SIMD FP-add/mul are both normalized to 4 cycles using the same unified execution unit that does everything.
Things to note on Skylake:
  • A double precision (53-bit) add is 4 cycles in the SIMD units. An extended precision (64-bit) add is only 3 cycles on the x87 FPU.
  • The x87 fmul (64-bit) is 5 cycles on the FPU.

What we see here is strong evidence of a complete revamp of the SIMD unit in Skylake while the x87 FPU remains untouched. The most logical explanation is that the x87 FPU is completely separate from the SIMD units as of Skylake (and possibly going back a bit further).

Likewise, if you look at the die shot for Skylake, you can see 8 identical squares in a 2x4 pattern (16 in a 4x4 on Skylake X). Each square is a 64-bit FP-FMA lane. They are all identical and the same size. There's no "special" one which looks a bit bigger, as would be needed for a 64-bit multiplier instead of a 53-bit multiplier.

-----

There isn't enough evidence to say much about Ryzen. The x87 FPU latencies are 5 cycles for both fadd and fmul. That's higher than all the SIMD latencies except for the FMA. So it's hard to say whether Ryzen has a dedicated x87 FPU, or whether one of the SIMD lanes is "bigger" to accommodate the 64-bit multiplier that's needed.

And as I mentioned before, there's the possibility of the x87 FPU sharing the 64-bit multiplier with the scalar integer unit. But someone needs to test this. Even though a 64-bit multiplier would be large in silicon, it's not obvious from the die shots, since there's only one of it, as opposed to the SIMD lanes, which are the same thing copy-pasted multiple times.
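For anyone who wants to test it, a crude dependency-chain probe would be enough to compare the units (just a sketch, assuming GCC or Clang on x86-64, where long double is the 80-bit x87 format):

Code:
/* Rough latency probe: a serial chain of dependent multiplies.
   With `long double` on x86-64 this exercises the x87 fmul; change the
   type (and the printf format) to `double` for the SSE mulsd path, or
   rewrite the chain with uint64_t to time the 64-bit imul instead.
   TSC ticks are not exactly core clocks, so compare ratios, not
   absolute numbers. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>          /* __rdtsc() */

#define N 100000000ULL

int main(void)
{
    volatile long double seed = 1.0000000123L;  /* defeat constant folding */
    long double x = seed, m = seed;

    uint64_t t0 = __rdtsc();
    for (uint64_t i = 0; i < N; i++)
        x *= m;                 /* each multiply depends on the previous */
    uint64_t t1 = __rdtsc();

    printf("result %Lg, ~%.2f TSC ticks per multiply\n",
           x, (double)(t1 - t0) / (double)N);
    return 0;
}

Running it once as-is, once with the double version, and then with an independent imul chain interleaved into the same loop would show whether the x87 and integer multiplies compete for the same piece of hardware.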

Quote:
Edit: random thought: would there be an advantage for our use cases if more bits were put towards the number part (the significand) as opposed to the exponent? So still 64-bit overall, but the question is where to put the bits.
Yes, it would. At no point in a normal FFT would any of the coefficients get anywhere near the limits of 10^+/-308.
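To put rough numbers on that (N and b below are illustrative values, not any particular FFT size): with N words of b-bit balanced digits, the convolution outputs are bounded by roughly N*2^(2b-2), which is nowhere near DBL_MAX. Compile with -lm:

Code:
/* Back-of-envelope for the exponent-vs-significand question above.
   N and b are illustrative values only. */
#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void)
{
    double N = pow(2, 23);            /* hypothetical FFT length    */
    int    b = 18;                    /* hypothetical bits per word */

    /* balanced digits lie in [-2^(b-1), 2^(b-1)], so each of the N
       products is at most 2^(2b-2) and their sum at most about:    */
    double bound = N * pow(2, 2 * b - 2);

    printf("worst-case coefficient ~ 2^%.0f  (%.3g)\n", log2(bound), bound);
    printf("DBL_MAX                = 2^%.0f  (%.3g)\n", log2(DBL_MAX), DBL_MAX);
    return 0;
}

So the binding constraint is the 53-bit significand, not the 11-bit exponent, which is why shifting bits from the exponent to the significand would help here.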
Attached: Skylake.jpg (die shot referenced above, 276.0 KB)

Last fiddled with by Mysticial on 2019-05-30 at 18:38
Old 2019-05-30, 19:44   #19
lavalamp

Quote:
Originally Posted by retina
But actually what are some real uses for quad precision? I mean really, where are they required?
The one use case relevant to me would be simulations to produce planetary ephemeris data and to model spacecraft motion around the solar system.

Edit: I should add that JPL explicitly state they use quad precision during simulation when generating their planetary ephemeris data sets, so I presume they do not use x86.

Last fiddled with by lavalamp on 2019-05-30 at 19:48
Old 2019-05-30, 20:09   #20
ewmayer

A few thoughts re. x87 80-bit FP and general 128-bit FP usage:

o Like Alex (Mysticial), I've long been an advocate of instruction sharing of expensive hardware, mainly multipliers. For the x87, having both the floating-point and integer MUL share a single 64x64 hardware multiplier makes perfect sense. But I strongly believe that Intel and AMD have simply 'frozen' their current legacy-core-on-die tech - it'll keep benefiting from process-size reductions, but it's a fixed block of IP; they are not dedicating any engineering resources to changing anything in there.

o Another operation which GIMPS makes huge use of, and for which hardware support would be useful (but will likely not happen), is paired add/subtract, a +- b. This is a huge component of transform arithmetic. The idea is that in an FP context, the add/sub of the 2 significands is only part of a longer execution pipeline, which proceeds (in simplified form, ignoring under- and overflow handling) something like this for FADD (a toy software version follows the list):

1. Unpack FP inputs to extract sign, exponent and significand (restoring hidden bit in latter if normalized, i.e. not underflowed);
2. Compute absolute difference of a_exp and b_exp, right-shift significand of the smaller-exponent datum |a_exp - b_exp| bits to align the data;
3. Add the aligned significands;
4. Round the low bits of the sum (hardware will have several extra bits at the low end to support IEEE rounding rules). We round before checking for an add carry-out because of the possibility of a carry rippling all the way from the least-significant to the most-significant bit on rounding;
5. If a carry bit results, shift the sum rightward one place and add one to the larger of the 2 exponent fields, yielding the exponent field of the output;
6. Repack the output into IEEE64 form.
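A toy C version of those steps (same-sign, normalized operands only, truncating instead of rounding, and ignoring over/underflow, subnormals, Inf and NaN - purely to make the flow concrete, not how a real FPU or softfloat library does it):

Code:
/* Toy software FADD for IEEE-754 binary64: unpack, align, add,
   renormalize, repack. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

static double toy_fadd(double a, double b)
{
    uint64_t ua, ub;
    memcpy(&ua, &a, 8);
    memcpy(&ub, &b, 8);

    /* 1. unpack: sign, biased exponent, significand with hidden bit */
    uint64_t sign = ua >> 63;                     /* signs assumed equal */
    int      ea = (int)((ua >> 52) & 0x7FF);
    int      eb = (int)((ub >> 52) & 0x7FF);
    uint64_t ma = (ua & 0xFFFFFFFFFFFFFull) | (1ull << 52);
    uint64_t mb = (ub & 0xFFFFFFFFFFFFFull) | (1ull << 52);

    if (ea < eb) {                 /* make (ea, ma) the larger exponent */
        int te = ea;      ea = eb; eb = te;
        uint64_t tm = ma; ma = mb; mb = tm;
    }

    /* 2. align: right-shift the smaller-exponent significand */
    int shift = ea - eb;
    mb = (shift < 64) ? (mb >> shift) : 0;

    /* 3. add the aligned significands */
    uint64_t m = ma + mb;
    int e = ea;

    /* 4. rounding would go here (we simply truncate) */

    /* 5. carry out of the top bit?  shift right, bump the exponent */
    if (m >> 53) { m >>= 1; e++; }

    /* 6. repack into IEEE64 form */
    uint64_t ur = (sign << 63) | ((uint64_t)e << 52) | (m & 0xFFFFFFFFFFFFFull);
    double r;
    memcpy(&r, &ur, 8);
    return r;
}

int main(void)
{
    printf("toy: %.17g   hardware: %.17g\n", toy_fadd(1.5, 2.25), 1.5 + 2.25);
    return 0;
}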

For paired add/sub, steps 1 and 2 can be done just once, at which point copies of the resulting unpacked/aligned data would get sent to the dedicated add and subtract logic needed to do steps 3-5. The shared-computation savings would be even greater for an FMA-based add/sub butterfly, a*b +- c, because the MUL need only be done once, as in the sketch below.
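Here is what that butterfly has to look like in software today (a minimal sketch; assumes GCC or Clang with -O2 -mfma, link with -lm): the product a*b goes through the multiplier twice, once per FMA, which is exactly the redundancy a paired a*b +- c operation would remove.

Code:
/* The a*b +- c butterfly written with the instructions we actually
   have: two FMAs, i.e. two trips through the multiplier for the same
   product a*b. */
#include <stdio.h>
#include <math.h>
#include <immintrin.h>

/* scalar version */
static inline void butterfly1(double a, double b, double c,
                              double *out0, double *out1)
{
    *out0 = fma(a, b, c);            /* a*b + c */
    *out1 = fma(a, b, -c);           /* a*b - c : same multiply again */
}

/* AVX2/FMA3 version, 4 doubles at a time */
static inline void butterfly4(__m256d a, __m256d b, __m256d c,
                              __m256d *out0, __m256d *out1)
{
    *out0 = _mm256_fmadd_pd(a, b, c);    /* a*b + c */
    *out1 = _mm256_fmsub_pd(a, b, c);    /* a*b - c */
}

int main(void)
{
    double p, m;
    butterfly1(0.5, 6.0, 2.0, &p, &m);   /* expect 5 and 1 */
    printf("%g %g\n", p, m);
    return 0;
}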

o The main use cases I know of for 128-bit FP are in finance and scientific computation. I come from the latter milieu, so I understand that side better. The old Cray supercomputers, before Cray moved to building around commodity microprocessors, had hardware support for 128-bit quad-precision FP. But even there, the economics are now such that it makes more sense to use commodity microprocessors and support 128-bit FP via emulation in software.

o The old DEC VAX-11 architecture had two distinct 64-bit FP types: G_floating, which is more or less the same as today's IEEE64 with 11 bits for the scaled exponent, and D_floating, which gained 3 significand bits at the expense of the exponent; the latter's 8 bits only support an operand range of approximately ±2.9E-39 to ±1.7E+38. But it's expensive to support even just a single float type in hardware, so it makes sense to pick one which gives a balance of precision and range optimized for the broadest general usage. Having broadly standardized industry-wide rules for these, as well as for rounding, is vitally important in the era of ubiquitous computation. That's why IEEE stepped in decades ago, and why we have just a single industry-wide IEEE64 floating-point standard. There is also a codified IEEE FP128 standard known as "binary128", with 15 exponent bits and 112 explicit significand bits (i.e. 113 bits, including the hidden bit). The Wikipedia page on quadruple precision also has a section on hardware support -- it looks like specialty IBM hardware is the only kind which currently offers it.
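As a concrete taste of the software-emulation route on x86: GCC exposes binary128 as __float128 through libquadmath (a minimal sketch; build with gcc file.c -lquadmath), and its parameters match the 113-bit-significand / 15-bit-exponent layout above:

Code:
/* binary128 in software on x86, via GCC's __float128 and libquadmath. */
#include <stdio.h>
#include <quadmath.h>

int main(void)
{
    __float128 third = 1.0Q / 3.0Q;
    char buf[64];

    quadmath_snprintf(buf, sizeof buf, "%.36Qg", third);
    printf("1/3 to quad precision : %s\n", buf);

    printf("FLT128_MANT_DIG       : %d bits\n", FLT128_MANT_DIG);  /* 113 */
    printf("FLT128_MAX_EXP        : %d\n", FLT128_MAX_EXP);        /* 16384 */

    quadmath_snprintf(buf, sizeof buf, "%.6Qg", FLT128_MAX);
    printf("FLT128_MAX            : %s\n", buf);             /* ~1.19e4932 */
    return 0;
}

The arithmetic behind this runs in software routines in libgcc/libquadmath, which is why it goes at a small fraction of hardware double speed.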

Last fiddled with by ewmayer on 2019-05-30 at 20:15
Old 2019-05-30, 20:56   #21
kriesel

Quote:
Originally Posted by ewmayer
The old DEC VAX-11 architecture had two distinct 64-bit FP types: G_floating, which is more or less the same as today's IEEE64 with 11 bits for the scaled exponent, and D_floating, which gained 3 significand bits at the expense of the exponent [...]
And also H_floating point. https://nssdc.gsfc.nasa.gov/nssdc/fo...atingPoint.htm

Last fiddled with by kriesel on 2019-05-30 at 20:57
Old 2019-05-30, 21:43   #22
mackerel

Quote:
Originally Posted by Mysticial
What we see here is strong evidence of a complete revamp of the SIMD unit in Skylake while the x87 FPU remains untouched. The most logical explanation for this is that the x87 FPU is completely separate from the SIMD as of Skylake (and possibly going back a bit further)
I had observed that for Prime95, LLR and similar FMA small-FFT testing (so RAM bandwidth was not a limit), Skylake was consistently 14% faster than Haswell after normalising for clock speed. I did see that some cycle counts were reduced and wondered if that was a contributor. I never figured out why Broadwell was 6% slower than Haswell, though. Its L4 cache did come in handy for large FFTs.

Quote:
There isn't enough evidence to say much about Ryzen.
I guess I'm looking at it more from a user perspective: how fast does it run something I want to do? I certainly do intend to buy an 8-core Zen 2 CPU as soon as practical and start benching it. It will now be even more interesting to see if there is a relative speedup in x87 tasks compared to Zen(+).