#1 |
∂²ω=0
Sep 2002
República de California
26753₈ Posts |
With AVX-512-capable CPUs expected to hit the market this year, it's time to open a thread dedicated to the various implementation issues. Note that the latest AVX-512 Architecture Instruction Set Extensions Programming Reference is available from Intel here (topmost pdf on that page).
Besides the obvious doubling of register width from AVX's 32 bytes to 64 and the doubling of the vector register count from 16 to 32, I see the following types of new instructions as being especially useful for GIMPS:
[1] Load-with-broadcast (the various VBROADCAST[...] instructions): This will be very handy for loads of various FFT-related consts (roots of unity) and DWT-weights-related data which consist of the same double datum repeated across the vector register. AVX already has some similar functionality, but AVX-512 significantly extends it, including versions which load-with-broadcast integer data from memory or GPRs.
[2] Gather-load: fill a vector register with smaller 32/64-bit pieces loaded from non-contiguous memory locations, e.g. VGATHERQPD, whose summary is (this is from the PDF, but with a few edits for succinctness and corrections of my own, e.g. 'faulting-point' data in the original is clearly a typo): Code:
Using signed qword indices, gather float64 vector into float64 vector zmm1 using OPMASK register k1 as completion mask:
VGATHERQPD zmm1 {k1}, vm64z
[{} arg can be any of k1-7 OPMASK regs, or k0 (or omit {k*} arg) for 'load all']
Description
A set of 8 double-precision floating-point memory locations pointed by base address BASE_ADDR and index vector V_INDEX with scale SCALE are gathered. The result is written into a vector register. The elements are specified via the VSIB (i.e., the index register vm64z is a vector register, holding packed indices). Elements will only be loaded if their corresponding mask bit in the OPMASK register is one. If an element’s mask bit is not set, the corresponding element of the destination register is left unchanged. The entire mask register will be set to zero by this instruction unless it triggers an exception.
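To make the two instruction types concrete, here is a minimal scalar sketch in plain C of what the text above describes: a load-with-broadcast and a masked qword-index gather. It is only a model of the documented semantics, not the hardware instruction; the names `model_broadcast`, `model_gather_qpd` and `VLEN` are made up for illustration, and exception behavior is not modeled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VLEN 8  /* number of doubles in one 512-bit register */

/* Model of load-with-broadcast: one double copied into every lane. */
static void model_broadcast(double dst[VLEN], const double *src)
{
    for (int i = 0; i < VLEN; i++)
        dst[i] = *src;
}

/* Model of a masked gather with signed 64-bit indices.
 * Effective address of lane i: base + index[i]*scale + disp (in bytes).
 * Lanes whose mask bit is 0 keep their old contents; the mask is
 * cleared once the gather completes, as in the description above. */
static void model_gather_qpd(double dst[VLEN], uint8_t *mask,
                             const void *base, const int64_t index[VLEN],
                             int scale, int64_t disp)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    const char *p = (const char *)base;
    for (int i = 0; i < VLEN; i++) {
        if (*mask & (1u << i))
            memcpy(&dst[i], p + index[i] * scale + disp, sizeof(double));
    }
    *mask = 0;  /* completion mask ends up all-zero */
}
```

For example, with `table[i] = 10.0*i`, indices `{15, 0, 7, 2, 9, 4, 11, 6}`, scale 8, displacement 0 and a mask of `0x0F`, lanes 0-3 receive `table[15]`, `table[0]`, `table[7]`, `table[2]`, lanes 4-7 keep whatever they held before, and the mask is zero afterwards.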
#2 | |
Jan 2008
France
2²×149 Posts |
Quote:
According to test cases found in nasm, the syntax looks like this: Code:
vgatherqpd zmm30{k1}, [r14+zmm31*8+0x7b]
Last fiddled with by ldesnogu on 2016-02-18 at 08:51
#3 |
Aug 2002
2²×3×5×11×13 Posts |
We have attached the PDF for your convenience.
#4 | |
∂²ω=0
Sep 2002
República de California
5·2,351 Posts |
Quote:
Code:
"vaddpd %%zmm3,%%zmm2,%%zmm1 \n\t"\
"vgatherqpd 0x7b(%%r14,%%zmm31,8),%%zmm30 \n\t"\
Assembler messages:
test_file.c:14: Error: default mask isn't allowed for `vgatherqpd'
[i.e. ADDPD accepts a default-mask form, but VGATHER* does not].
Then I tried adding a mask-arg to the destination (rightmost in GCC syntax) operand of both instructions: Code:
"vaddpd %%zmm3,%%zmm2,{k1}%%zmm1 \n\t"\
"vgatherqpd 0x7b(%%r14,%%zmm31,8),{k1}%%zmm30 \n\t"\
Assembler messages:
test_file.c:13: Error: operand size mismatch for `vaddpd'
test_file.c:14: Error: too many memory references for `vgatherqpd'
#5 |
Undefined
"The unspeakable one"
Jun 2006
My evil lair
15072₈ Posts |
Can you set GCC to use the more sane Intel syntax, instead of that AT&T ugliness?
#6 | |
∂²ω=0
Sep 2002
República de California
2DEB₁₆ Posts |
Quote:
Having used both syntaxes in the past, I find it's more a function of whichever one you use most often. I agree that the %% of GCC extended inline asm are hard on the eyes, but I only add those after sketching out the basic assembly code, when I'm ready to test it. And having learned to read and write in Western-style left-to-right form, I find the AT&T [src,src,dest] operand order much more natural to work with: "combine two source operands and output the result in the destination operand". I mean, consider the kind of descriptive awkwardness that litters Intel's instruction references due to their [dest,src,src] syntax:
Performs a SIMD [blah] of the double-precision floating-point values in the first source operand (the second operand) by the floating-point values in the second source operand (the third operand). Results are written to the destination operand (the first operand).
Back to the case at hand: in the inline-asm case one saves little 'ugliness' by using Intel syntax. Here is the body of a basic AVX-syntax test macro which inlines the same instruction using each of the 2 syntaxes in turn: Code:
".intel_syntax \n\t"\
"vaddpd %%ymm1,%%ymm2,[%%rax+%%rbx*8] \n\t"\
".att_syntax \n\t"\
"vaddpd (%%rax,%%rbx,8),%%ymm2,%%ymm1 \n\t"\
#7 |
Undefined
"The unspeakable one"
Jun 2006
My evil lair
2·3²·373 Posts |
Can you set GCC to not require the %% ugliness? Can you set GCC to not require the \n\t silliness? Can you set GCC to not require the surrounding "'s nonsensiness?
Last fiddled with by retina on 2016-02-19 at 08:42 |
#8 |
Tribal Bullet
Oct 2004
3,559 Posts |
No, all that stuff is a consequence of the way inline asm works in gcc: you are specifying a string that is blindly inserted into the assembler code the compiler produces, so it needs C string formatting. The two %'s are for when you need to specify a register explicitly, since a single % is assumed to mean the argument number out of the argument list below the asm string.
You don't have to like it, but if the choice is between this and Microsoft's compiler, which doesn't allow inline asm at all for 64-bit code, then the choice is easy.
Last fiddled with by jasonp on 2016-02-19 at 14:15
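A small sketch of the % versus %% distinction described in this post, in C with GCC-style extended asm. The function name `add_u64` is made up for illustration, and the asm path is only compiled on x86-64 with a GCC-compatible compiler; elsewhere a plain C fallback is used so the snippet still builds.

```c
#include <stdint.h>

/* Returns a + b (wrapping). On x86-64 the sum is formed in rax via
 * inline asm; on other targets plain C is used instead. */
static uint64_t add_u64(uint64_t a, uint64_t b)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    uint64_t sum;
    __asm__ (
        "movq %1,%%rax \n\t"  /* %1: operand 1 (a); %%rax: literal register */
        "addq %2,%%rax \n\t"  /* %2: operand 2 (b) */
        "movq %%rax,%0 \n\t"  /* %0: operand 0 (sum) */
        : "=r"(sum)           /* output operands, numbered from 0 */
        : "r"(a), "r"(b)      /* input operands, numbering continues */
        : "rax", "cc");       /* clobbers: rax is named explicitly above */
    return sum;
#else
    return a + b;
#endif
}
```

Each quoted line ends in `\n\t` because adjacent C string literals are concatenated into one string, and the assembler needs a newline between instructions.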
#9 | |
Undefined
"The unspeakable one"
Jun 2006
My evil lair
2·3²·373 Posts |
Quote:
#10 | |
Jan 2008
France
2²×149 Posts |
Quote:
gas test files look like this: Code:
gas/testsuite/gas/i386/avx512f.s: vgatherqpd 123(%ebp,%zmm7,8), %zmm6{%k1} # AVX512F
gas/testsuite/gas/i386/avx512f.s: vgatherqpd 123(%ebp,%zmm7,8), %zmm6{%k1} # AVX512F
gas/testsuite/gas/i386/avx512f.s: vgatherqpd 256(%eax,%zmm7), %zmm6{%k1} # AVX512F
gas/testsuite/gas/i386/avx512f.s: vgatherqpd 1024(%ecx,%zmm7,4), %zmm6{%k1} # AVX512F
gas/testsuite/gas/i386/avx512f.s: vgatherqpd zmm6{k1}, ZMMWORD PTR [ebp+zmm7*8-123] # AVX512F
gas/testsuite/gas/i386/avx512f.s: vgatherqpd zmm6{k1}, ZMMWORD PTR [ebp+zmm7*8-123] # AVX512F
gas/testsuite/gas/i386/avx512f.s: vgatherqpd zmm6{k1}, ZMMWORD PTR [eax+zmm7+256] # AVX512F
gas/testsuite/gas/i386/avx512f.s: vgatherqpd zmm6{k1}, ZMMWORD PTR [ecx+zmm7*4+1024] # AVX512F
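The AT&T forms in the gas test file attach the mask to the destination as `%zmm6{%k1}`. Below is a hedged sketch of how that form might be written in GCC extended inline asm. It assumes the braces are escaped as `%{` and `%}`, since GCC treats bare `{ | }` as assembler-dialect alternatives in extended asm on x86. The function name `gather8` is hypothetical, and the asm path is only compiled when the compiler targets AVX-512F (e.g. with -mavx512f); otherwise a scalar loop with the same result is used.

```c
#include <stdint.h>

/* out[i] = base[idx[i]] for i = 0..7 */
static void gather8(double out[8], const double *base, const int64_t idx[8])
{
#if defined(__x86_64__) && defined(__AVX512F__)
    __asm__ volatile (
        "vmovdqu64  (%[idx]),%%zmm31 \n\t"  /* load the 8 qword indices */
        "kxnorw     %%k1,%%k1,%%k1   \n\t"  /* k1 = all ones: gather every lane */
        "vgatherqpd (%[base],%%zmm31,8),%%zmm30%{%%k1%} \n\t"
        "vmovupd    %%zmm30,(%[out]) \n\t"  /* store the gathered doubles */
        :
        : [idx] "r"(idx), [base] "r"(base), [out] "r"(out)
        : "zmm30", "zmm31", "k1", "memory");
#else
    /* Scalar fallback for builds without AVX-512F. */
    for (int i = 0; i < 8; i++)
        out[i] = base[idx[i]];
#endif
}
```

An explicit mask register is loaded first because, per the assembler error quoted earlier in the thread, the gather does not accept the default-mask form; the instruction clears k1 on completion, so the mask has to be reset before each gather.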
#11 | |
∂²ω=0
Sep 2002
República de California
11755₁₀ Posts |
Quote:
Re. retina's separate-.s-file idea: sure, that will work, but it has issues of its own, like having to respect calling conventions and a 2-step compile. I like being able to focus on the code that actually 'does stuff' and being able to one-step-compile just as with pure C code. To each his own. Further, since I am for the most part translating many 1000s of lines of existing AVX inline-asm, most of which is rote with the exception of key new instructions like the above gather-load, it would be far more work to now switch to a separate-.s-file paradigm.