mersenneforum.org  

mersenneforum.org > Great Internet Mersenne Prime Search > Hardware

Old 2020-07-03, 11:57   #1
M344587487
 
 
"Composite as Heck"
Oct 2017

2³·3·29 Posts
Default AMX instructions

https://fuse.wikichip.org/news/3600/...pphire-rapids/


https://en.wikichip.org/wiki/x86/amx#Instructions


Page 89:


https://software.intel.com/content/w...reference.html


What do people make of this new x86 extension for prime hunting? It's separate from AVX-512: "AI-specific" matrix operations. Accelerating int8 and bf16 matrix operations doesn't look too promising; if it were useful for this, someone probably would've written a program to take advantage of Nvidia's tensor-core hardware by now, but I know jack.
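For concreteness, here's a rough Python model of what an AMX int8 tile multiply (TDPBSSD-style: dot products of 4-byte groups accumulated into int32) computes, as I read the linked reference. The tile shapes and the wrap-on-overflow behaviour are my assumptions, not something tested against hardware:

```python
def tdpbssd(C, A, B):
    """Model of an AMX int8 tile product with int32 accumulation:
    C[m][n] += sum over k, i in 0..3 of A[m][4k+i] * B[k][4n+i].
    Shapes here are illustrative; real tiles go up to 16 rows x 64 bytes."""
    M, N, K = len(C), len(C[0]), len(B)
    for m in range(M):
        for n in range(N):
            acc = C[m][n]
            for k in range(K):
                for i in range(4):
                    acc += A[m][4 * k + i] * B[k][4 * n + i]
            # assume the int32 accumulator wraps on overflow
            C[m][n] = (acc + 2**31) % 2**32 - 2**31
    return C

# smallest possible tiles: one 4-byte group per element
print(tdpbssd([[0]], [[1, 2, 3, 4]], [[5, 6, 7, 8]]))  # -> [[70]]
```

The relevant point for prime hunting is visible right in the model: the inputs are 8-bit and the accumulator is only 32-bit, which is a poor fit for the multi-word products large-integer arithmetic needs.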
Old 2020-07-03, 13:13   #2
tServo
 
 
"Marv"
May 2009
near the Tannhäuser Gate

3·181 Posts
Default

Quote:
Originally Posted by M344587487 View Post
What do people make of this new x86 extension for prime hunting? [...]
Nvidia has had these in CUDA since the Volta microarchitecture (2017), and the Ampere microarchitecture introduces their third generation of tensor cores.
They mostly appeal to folks doing AI, since the cores greatly reduce training time and reduce latency when getting answers from a trained network.

Intel, once again, is left scrambling to play catch-up.
Old 2020-07-03, 20:13   #3
mackerel
 
 
Feb 2016
UK

3·131 Posts
Default

I had to ask on another forum: given that GPUs already have massive throughput, why do we need to run these on the CPU? The response I got was along the lines of: some data sets are simply too large to be processed effectively on a GPU. This is not a particular interest area of mine, so I don't know how much that factors in, but take the recent Intel Cooper Lake Xeon launch as an example. Those parts are very specifically targeted at buyers who want performance in specific areas; it was questioned why Intel even bothered announcing them to the public, since they were never going to be a mass-market solution and the customers buying them don't need to be told about them at this stage.

Anyway, every time I see AI-optimised instructions, what springs to mind is insanely high op counts at really low data sizes. I'll let the mathematicians and programmers give the long answer, but I suspect it'll just be too inefficient to do what we need for prime number finding.
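To illustrate the precision point: bf16 keeps 8 exponent bits but only 7 explicit mantissa bits, far short of the 53-bit doubles that the usual FFT-based large-integer multiplications rely on. A quick stdlib-only sketch, modelling bf16 by truncating a float32 to its top 16 bits (cruder than real round-to-nearest, but it shows the scale of the loss):

```python
import struct

def to_bf16(x):
    """Crude bfloat16 model: pack to float32, zero the low 16 bits.
    Real bf16 conversion rounds to nearest; truncation is close enough
    to show how coarse the format is."""
    bits32 = struct.unpack('>I', struct.pack('>f', x))[0]
    return struct.unpack('>f', struct.pack('>I', bits32 & 0xFFFF0000))[0]

# mantissa bits below 2**-7 (relative to the leading bit) are simply gone
print(to_bf16(1.0 + 2**-12))  # -> 1.0
```

A relative error on the order of 2⁻⁸ per operation is hopeless for the carry propagation in a multi-million-point FFT, which is the gist of why the raw bf16 OP counts don't translate into anything useful here.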
Old 2020-07-03, 21:18   #4
M344587487
 
 
"Composite as Heck"
Oct 2017

2³·3·29 Posts
Default

A possible explanation I read for putting these instructions in CPUs is workloads that mix matrix operations with things GPUs are bad at, like branching. I don't know how common those kinds of workloads might be.

Alternatively, Intel may be preparing a unifying framework with their GPUs: code that can run somewhat accelerated on the CPU alone but is really meant to scale on GPUs. It would be nice if AMD and Intel teamed up on an open standard to try to kick Nvidia where it hurts, but depressingly it's more likely Intel will introduce a third standard and fight for second place.
Old 2020-07-14, 04:50   #5
ixfd64
Bemusing Prompter
 
 
"Danny"
Dec 2002
California

2×19×61 Posts
Default

Quote:
Originally Posted by tServo View Post
Intel, once again, is left scrambling to play catch-up.
Possibly related: https://tomshardware.com/news/linus-...-painful-death
Old 2020-07-14, 21:07   #6
ewmayer
2ω=0
 
 
Sep 2002
República de California

2660₁₆ Posts
Default

@above:
Intel could easily reduce their transistor budget for SIMD support and provide the much-improved integer-math functionality Linus Torvalds yearns for if they weren't so crazy-biased towards FP support and thought more about multiple kinds of instructions sharing the same transistors insofar as possible. Let's consider the notoriously-transistor-hungry case of multiply: instead of first offering only avx-512 FP multiply and low-width vector integer mul, then later adding another half-measure, using those FP-mul units to generate the high 52 bits of a 64x64-bit integer product, plunk down a bunch of full 64x64->128-bit integer multipliers, supporting a vector analog (at long last) of the longstanding integer MUL instructions. Then design things so those units can be used for both integer and FP operands. Need bottom 64-bits of 64x64-bit integer mul? Just discard the high product halves, and maybe shave a few cycles. Signed vs unsigned high half of 64x64-bit product? Easily handled via a tiny bit of extra logic. Vector-DP product, either high-53-bits or full-width FMA style? No problem, just use the usual FP-operand preprocessing logic, then feed the resulting integer mantissas to the multi-purpose vector-MUL unit, then the usual postprocessing pipeline stages to properly deal with the resulting 106-bit product.

The HPC part comes in in the above context this way: very few programs are gonna need *both* high-perf integer and FP mul - the ones that do are *truly* outliers, unlike Torvalds' inane labeling of all HPC as some kind of fringe community. Using the same big-block transistor budget to support multiple data types is a big-picture win, even it leads to longer pipelines: the 32 avx-512 vector registers are more thn enough to allow coders to do a good job at latency hiding even with fairly long instruction pipelines.
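The full 64x64 -> 128-bit product above decomposes into four 32x32 -> 64-bit partial products plus carry propagation, which is roughly what such a hardware unit evaluates in parallel. A scalar Python model of that decomposition (the function name is made up, purely illustrative):

```python
MASK32 = (1 << 32) - 1

def mul_64x64_128(a, b):
    """Full 64x64 -> 128-bit unsigned product built from four
    32x32 -> 64-bit schoolbook partial products. Returns (hi, lo)."""
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32
    ll = a_lo * b_lo
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi
    # fold the cross terms and the high half of ll into a middle word
    mid = (ll >> 32) + (lh & MASK32) + (hl & MASK32)
    lo = ((mid & MASK32) << 32) | (ll & MASK32)
    hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32)
    return hi, lo
```

Discarding `hi` gives the cheap low-half variant mentioned above, and the signed high half differs from this unsigned one only by small corrections involving the operands' sign bits, which is the "tiny bit of extra logic" in question.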

Last fiddled with by ewmayer on 2020-07-14 at 21:08