#1
"Composite as Heck"
Oct 2017
13718 Posts
https://fuse.wikichip.org/news/3600/...pphire-rapids/
https://en.wikichip.org/wiki/x86/amx#Instructions
Page 89: https://software.intel.com/content/w...reference.html

What do people make of this new x86 extension for prime hunting? It's separate from AVX-512: "AI-specific" matrix operations. Accelerating int8 and bf16 matrix operations doesn't look too promising; if low-precision matrix hardware were useful for this, someone probably would've already written a program to take advantage of Nvidia's tensor-core hardware. But I know jack.
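From what I can tell from the manual, usage boils down to: ask the OS for permission, configure the tile registers, load tiles, issue a dot-product op, store the accumulator. A minimal sketch, assuming Linux and a gcc/clang new enough to expose the AMX intrinsics (-mamx-tile -mamx-bf16); the config layout follows the SDM's palette 1, while the buffer shapes and names here are just illustrative:

[CODE]
#include <immintrin.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_REQ_XCOMP_PERM 0x1023 /* Linux: request use of AMX tile state */
#define XFEATURE_XTILEDATA  18

/* 64-byte tile-config layout per the Intel SDM (palette 1). */
struct tile_config {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];   /* bytes per row in each tile register */
    uint8_t  rows[16];    /* rows in each tile register */
};

static float    c[16][16];            /* fp32 accumulator       */
static uint16_t a[16][32], b[16][32]; /* bf16 inputs (raw bits) */

int main(void) {
    /* The kernel keeps AMX state off by default; ask for it first. */
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA))
        return 1;

    struct tile_config cfg = { .palette_id = 1 };
    cfg.rows[0] = 16; cfg.colsb[0] = 64;   /* tmm0: 16x16 fp32 */
    cfg.rows[1] = 16; cfg.colsb[1] = 64;   /* tmm1: 16x32 bf16 */
    cfg.rows[2] = 16; cfg.colsb[2] = 64;   /* tmm2: 16x32 bf16 */
    _tile_loadconfig(&cfg);

    _tile_loadd(1, a, 64);       /* load the A and B tiles           */
    _tile_loadd(2, b, 64);
    _tile_zero(0);               /* clear the accumulator tile       */
    _tile_dpbf16ps(0, 1, 2);     /* tmm0 += A*B, bf16 pairs -> fp32  */
    _tile_stored(0, c, 64);
    _tile_release();
    return 0;
}
[/CODE]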
#2
"Marv"
May 2009
near the Tannhäuser Gate
1001100001₂ Posts
Quote:
Ampere microarchitecture introduces their 4th generation of tensor cores

They mostly appeal to folks doing AI, since they greatly cut training time and the latency of getting answers out of a trained network. Intel, once again, is left scrambling to play catch-up.
#3
Feb 2016
UK
13×31 Posts
I had to ask on another forum: given that GPUs have massive throughput already, why do we need to run these on the CPU? The response I got was along the lines of: some data sets are simply too large to be processed effectively on a GPU. This is not a particular interest area of mine, so I don't know how much that factors in, but take the recent Intel Cooper Lake Xeon launch as an example. Those parts are targeted very specifically at customers who want performance in particular areas. It was questioned why Intel even bothered announcing them to the public, since they were never going to be a mass-market solution, and the customers buying them don't need a press release to learn about them at this stage.

Anyway, every time I see AI-optimised instructions, what springs to mind is insanely high op rates at really small data types. I'll let the mathematicians and programmers give the long answer, but I suspect it'll just be too inefficient to do what we need in prime number finding.
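To put a number on "really small data types" (my own toy illustration): bf16 keeps only 8 significand bits, so it stops representing consecutive integers exactly past 256, never mind the enormous operands prime work juggles. Truncating an fp32 to bf16 by hand shows it (real conversions round rather than truncate, but the precision limit is the same):

[CODE]
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Truncate an fp32 value to bf16 precision by keeping the top 16 bits
   (bf16 is literally the high half of an IEEE-754 single). */
static float to_bf16(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFF0000u;            /* drop the low 16 mantissa bits */
    memcpy(&x, &bits, sizeof x);
    return x;
}

int main(void) {
    printf("257 as bf16 = %.1f\n", to_bf16(257.0f)); /* prints 256.0 */
    printf("511 as bf16 = %.1f\n", to_bf16(511.0f)); /* prints 510.0 */
    return 0;
}
[/CODE]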
#4
"Composite as Heck"
Oct 2017
761 Posts
A possible explanation I read for wanting these instructions on the CPU is workloads that mix matrix ops with things GPUs are bad at, like heavy branching. I don't know how common such workloads are.

Alternatively, Intel may be preparing a unifying framework with their GPUs: code that runs somewhat accelerated on the CPU alone but is really meant to scale on GPUs. It would be nice if AMD and Intel teamed up on an open standard to try and kick Nvidia where it hurts, but depressingly it's more likely Intel will introduce a third standard and fight for second place.
#5
Bemusing Prompter
"Danny"
Dec 2002
California
93C₁₆ Posts
Possibly related: https://tomshardware.com/news/linus-...-painful-death
#6
∂²ω=0
Sep 2002
República de California
2·7·829 Posts
@above:
Intel could easily reduce their transistor budget for SIMD support and provide the much-improved integer-math functionality Linus Torvalds yearns for if they weren't so crazily biased towards FP support, and thought more about multiple kinds of instructions sharing the same transistors insofar as possible.

Consider the notoriously transistor-hungry case of multiply. Instead of first offering only avx-512 FP multiply and low-width vector integer mul, then later adding another half-measure that uses those FP-mul units to generate the high 52 bits of a 52x52-bit integer product, plunk down a bunch of full 64x64 -> 128-bit integer multipliers, supporting (at long last) a vector analog of the longstanding integer MUL instructions. Then design things so those units can be used for both integer and FP operands. Need the bottom 64 bits of a 64x64-bit integer mul? Just discard the high product halves, and maybe shave a few cycles. Signed vs. unsigned high half of a 64x64-bit product? Easily handled via a tiny bit of extra logic. Vector-DP product, either high-53-bits or full-width FMA style? No problem: use the usual FP-operand preprocessing logic, feed the resulting integer mantissas to the multi-purpose vector-MUL unit, then use the usual postprocessing pipeline stages to deal properly with the resulting 106-bit product.

The HPC part comes in this way: very few programs are gonna need *both* high-performance integer and FP mul, and the ones that do are *truly* outliers, unlike Torvalds' inane labeling of all HPC as some kind of fringe community. Using the same big-block transistor budget to support multiple data types is a big-picture win, even if it leads to longer pipelines: the 32 avx-512 vector registers are more than enough to let coders do a good job of latency hiding even with fairly long instruction pipelines.

Last fiddled with by ewmayer on 2020-07-14 at 21:08