mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Hardware (https://www.mersenneforum.org/forumdisplay.php?f=9)
-   -   AMX instructions (https://www.mersenneforum.org/showthread.php?t=25698)

M344587487 2020-07-03 11:57

AMX instructions
 
[url]https://fuse.wikichip.org/news/3600/the-x86-advanced-matrix-extension-amx-brings-matrix-operations-to-debut-with-sapphire-rapids/[/url]


[url]https://en.wikichip.org/wiki/x86/amx#Instructions[/url]


Page 89:


[url]https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html[/url]


What do people make of this new x86 extension for prime hunting? It's separate from AVX-512: "AI-specific" matrix operations. Accelerating int8 and bf16 matrix maths doesn't look too promising, because if low-precision matrix hardware were useful here, someone would probably already have written a program that takes advantage of nvidia's tensor-core hardware, but I know jack.
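
To make the programming model concrete, here is a rough, untested sketch of an AMX int8 tile multiply-accumulate using the GCC/Clang intrinsics. The function and struct names are my own illustration; only the _tile_* intrinsics come from the documentation linked above, and the whole thing assumes Sapphire Rapids hardware plus a toolchain invoked with -mamx-tile -mamx-int8:

[CODE]
// Rough, untested sketch of the AMX tile programming model.
// Assumes Sapphire Rapids or later, GCC/Clang with -mamx-tile -mamx-int8,
// and (on Linux) that the OS has granted tile state via
// arch_prctl(ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA).
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// 64-byte tile-configuration blob (palette 1).
struct __attribute__((aligned(64))) tileconfig {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];   // bytes per row for each tile register
    uint8_t  rows[16];    // rows for each tile register
};

// C += A * B for one 16x16 int32 tile; A and B hold int8 data laid out for
// TDPBSSD (groups of four int8 values dot-multiplied and summed into int32).
void amx_int8_tile_madd(int32_t *C, const int8_t *A, const int8_t *B)
{
    struct tileconfig cfg;
    memset(&cfg, 0, sizeof(cfg));
    cfg.palette_id = 1;
    for (int t = 0; t < 3; ++t) { cfg.colsb[t] = 64; cfg.rows[t] = 16; }
    _tile_loadconfig(&cfg);

    _tile_loadd(0, C, 64);   // tmm0: 16x16 int32 accumulator
    _tile_loadd(1, A, 64);   // tmm1: 16x64 int8
    _tile_loadd(2, B, 64);   // tmm2: 16x64 int8 (VNNI-interleaved layout)
    _tile_dpbssd(0, 1, 2);   // tmm0 += tmm1 * tmm2
    _tile_stored(0, C, 64);
    _tile_release();
}
[/CODE]

The bf16 analogue is _tile_dpbf16ps, which accumulates into fp32 - still nowhere near the double-precision accuracy an FFT-based big-number multiply wants.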

tServo 2020-07-03 13:13

[QUOTE=M344587487;549676]What do people make of this new x86 extension for prime hunting? It's separate from AVX-512: "AI-specific" matrix operations. Accelerating int8 and bf16 matrix maths doesn't look too promising, because if low-precision matrix hardware were useful here, someone would probably already have written a program that takes advantage of nvidia's tensor-core hardware, but I know jack.[/QUOTE]

Nvidia has had matrix units ("tensor cores") in its CUDA GPUs since the Volta microarchitecture (2017), and the Ampere microarchitecture introduces their third generation of them.
They mostly appeal to folks doing AI, since they greatly reduce training time and cut latency when getting answers out of a trained network.

Intel, once again, is left scrambling to play catch-up.

mackerel 2020-07-03 20:13

I had to ask on another forum, given that GPUs have massive throughput already, why do we need to run these on CPU? The response I got was along the lines of some data sets are simply too large to be effectively processed by GPU. This is not a particular interest area of mine, so I don't know how much this factors in, but take the recent Intel Cooper Lake Xeon launch for example. They are very specifically targeted at those who want performance in specific areas. It was questioned why Intel even bothered announcing it to the public, since it was never going to be a mass market solution, and the customers buying it don't need to be told about it at this stage.

Anyway, every time I see AI-optimised instructions, what springs to mind is insanely high op counts at really low precisions. I'll let the mathematicians and programmers give the long answer, but I suspect it'll just be too inefficient to do what we need for prime number finding.

M344587487 2020-07-03 21:18

One possible explanation I've read for why these instructions are needed on CPUs is workloads that mix matrix operations with things GPUs are bad at, like branch-heavy code. I don't know how common those kinds of workloads are.


Alternatively, intel may be preparing a unifying framework with their GPUs: code that can run somewhat accelerated on the CPU alone but is really meant to scale onto GPUs. It would be nice if AMD and intel teamed up on an open standard to try and kick nvidia where it hurts, but depressingly it's more likely intel will introduce a third standard and fight for second place.

ixfd64 2020-07-14 04:50

[QUOTE=tServo;549680]Intel, once again, is left scrambling to play catch-up.[/QUOTE]

Possibly related: [url]https://tomshardware.com/news/linus-torvalds-wishes-intel-avx-512-a-painful-death[/url]

ewmayer 2020-07-14 21:07

@above:
Intel could easily reduce their transistor budget for SIMD support and provide the much-improved integer-math functionality Linus Torvalds yearns for, if they weren't so crazy-biased towards FP support and thought more about having multiple kinds of instructions share the same transistors insofar as possible.

Consider the notoriously transistor-hungry case of multiply. Instead of first offering only avx-512 FP multiply and low-width vector integer mul, then later adding another half-measure - reusing those FP-mul units to produce 52-bit chunks of an integer product (the avx-512 IFMA instructions) - plunk down a bunch of full 64x64 -> 128-bit integer multipliers, finally giving us a vector analog of the longstanding scalar MUL instruction. Then design things so those units can be used for both integer and FP operands. Need only the bottom 64 bits of a 64x64-bit integer mul? Just discard the high product halves, and maybe shave a few cycles. Signed vs. unsigned high half of a 64x64-bit product? Easily handled via a tiny bit of extra logic. Vector-DP product, either high-53-bits or full-width FMA style? No problem: use the usual FP-operand preprocessing logic, feed the resulting integer mantissas to the multi-purpose vector-MUL unit, then run the usual postprocessing pipeline stages to properly handle the resulting 106-bit product.
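
For concreteness, a small untested C sketch (my own illustration, not from the post above) contrasting the existing 52-bit IFMA half-measure with the full 64x64 -> 128-bit product argued for here. The wrapper names are made up, but the intrinsics (_mm512_madd52lo_epu64 / _mm512_madd52hi_epu64 from avx-512 IFMA, and _mulx_u64 from BMI2) are real:

[CODE]
// Untested sketch: the 52-bit half-measure vs. a full 64x64 -> 128-bit product.
// Compile with e.g.  gcc -O2 -mavx512f -mavx512ifma -mbmi2 mul_sketch.c
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

// What avx-512 offers today (VPMADD52LUQ/HUQ): multiply the low 52 bits of each
// 64-bit lane of x and y, and accumulate the low / high 52 bits of the 104-bit
// product -- the FP64 mantissa multipliers pressed into integer service.
static inline void madd52(__m512i *lo_acc, __m512i *hi_acc, __m512i x, __m512i y)
{
    *lo_acc = _mm512_madd52lo_epu64(*lo_acc, x, y);
    *hi_acc = _mm512_madd52hi_epu64(*hi_acc, x, y);
}

// What the post asks for, shown for a single lane: a true 64x64 -> 128-bit
// unsigned product.  Scalar MULX (BMI2) stands in for one lane of the
// hypothetical vector instruction.
static inline uint64_t mul64_full(uint64_t x, uint64_t y, uint64_t *hi)
{
    unsigned long long h;
    uint64_t lo = _mulx_u64(x, y, &h);
    *hi = (uint64_t)h;
    return lo;
}

int main(void)
{
    uint64_t hi, lo = mul64_full(0xFEDCBA9876543210ULL, 0x0123456789ABCDEFULL, &hi);
    printf("full product = 0x%016llx%016llx\n",
           (unsigned long long)hi, (unsigned long long)lo);
    return 0;
}
[/CODE]

MULX here is scalar; the point is that no current vector instruction does the same full-width product across all eight 64-bit lanes of a zmm register.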

The HPC angle comes into the above this way: very few programs are gonna need *both* high-performance integer and FP mul - the ones that do are *truly* outliers, unlike Torvalds' inane labeling of all HPC as some kind of fringe community. Using the same big-block transistor budget to support multiple data types is a big-picture win even if it leads to longer pipelines: the 32 avx-512 vector registers are more than enough to allow coders to do a good job of latency hiding even with fairly long instruction pipelines.

