mersenneforum.org  

2016-02-18, 07:46   #1
ewmayer
2ω=0
Sep 2002
República de California
10011001101000₂ Posts

Official AVX-512 programming thread

With AVX-512-capable CPUs expected to hit the market this year, it is time to open a thread dedicated to the various implementation issues. Note that the latest AVX-512 Architecture Instruction Set Extensions Programming Reference is available from Intel here (topmost pdf on that page).

Besides the obvious doubling of register width from AVX's 32 bytes to 64 and the doubling of the vector register count from 16 to 32, I see the following types of new instructions as being especially useful for GIMPS:

[1] Load-with broadcast (the various VBROADCAST[...] instructions): This will be very handy for loads of various FFT-related consts (roots of unity) and DWT-weights-related data which consist of the same double-datum repeated across the vector register. AVX already has some similar functionality but AVX-512 significantly extends it, including versions which load-with-broadcast integer data from memory or GPRs.

[2] Gather-load: fill a vector register with smaller 32/64-bit pieces loaded from non-contiguous memory locations, e.g. VGATHERQPD, whose summary is (this is from the PDF but with a few edits-for-succinctness and correction of my own - e.g. 'faulting-point' data in the original is clearly a typo):
Code:
Using signed qword indices, gather float64 vector into float64 vector zmm1 using OPMASK
register k1 as completion mask:
	VGATHERQPD zmm1 {k1}, vm64z	[{} arg can be any of k1-7 OPMASK regs,
					or k0 (or omit {k*} arg) for 'load all']
Description
	A set of 8 double-precision floating-point memory locations pointed by base address
	BASE_ADDR and index vector V_INDEX with scale SCALE are gathered. The result
	is written into a vector register. The elements are specified via the VSIB (i.e.,
	the index register vm64z is a vector register, holding packed indices). Elements
	will only be loaded if their corresponding mask bit in the OPMASK register is one.
	If an element’s mask bit is not set, the corresponding element of the destination
	register is left unchanged. The entire mask register will be set to zero by this
	instruction unless it triggers an exception.
With regard to the latter class of instructions, however, I can't find a description of the syntax for the hybrid 'vm64[xyz]' datum in the PDF. It looks like the index vector is a vector register ([xyz]mm depending on the precise instruction that is used) but whether BASE_ADDR is stored in the low 32/64-bits of said vector register is not made clear (so far as I can tell). It seems SCALE can take values {1,2,4,8} depending on the data type & precise instruction, but again, I see no actual clarifying examples in the PDF reference.
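As a rough illustration of item [1], the sketch below scales eight doubles by a single weight that is broadcast across a 512-bit register. It goes through the `_mm512_set1_pd` intrinsic rather than inline asm, which is an assumption of this sketch and not something the post prescribes; the helper name `scale8` is made up for illustration, and a scalar loop stands in when the compiler is not targeting AVX-512F.

```c
#include <stddef.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Hypothetical helper: multiply 8 doubles in place by one weight w. */
static void scale8(double x[8], double w)
{
#if defined(__AVX512F__)
    __m512d vw = _mm512_set1_pd(w);      /* broadcast w into all 8 lanes */
    __m512d vx = _mm512_loadu_pd(x);     /* unaligned load of 8 doubles  */
    _mm512_storeu_pd(x, _mm512_mul_pd(vx, vw));
#else
    for (size_t i = 0; i < 8; i++)       /* scalar stand-in, same result */
        x[i] *= w;
#endif
}
```

With a real DWT weight or root-of-unity component in place of `w`, the broadcast replaces storing the same double eight times in memory.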

2016-02-18, 08:51   #2
ldesnogu
Jan 2008
France
1000011000₂ Posts

Quote:
Originally Posted by ewmayer View Post
[2] Gather-load: fill a vector register with smaller 32/64-bit pieces loaded from non-contiguous memory locations, e.g. VGATHERQPD, whose summary is (this is from the PDF but with a few edits-for-succinctness and correction of my own - e.g. 'faulting-point' data in the original is clearly a typo):
Code:
Using signed qword indices, gather float64 vector into float64 vector zmm1 using OPMASK
register k1 as completion mask:
    VGATHERQPD zmm1 {k1}, vm64z    [{} arg can be any of k1-7 OPMASK regs,
                    or k0 (or omit {k*} arg) for 'load all']
Description
    A set of 8 double-precision floating-point memory locations pointed by base address
    BASE_ADDR and index vector V_INDEX with scale SCALE are gathered. The result
    is written into a vector register. The elements are specified via the VSIB (i.e.,
    the index register vm64z is a vector register, holding packed indices). Elements
    will only be loaded if their corresponding mask bit in the OPMASK register is one.
    If an element’s mask bit is not set, the corresponding element of the destination
    register is left unchanged. The entire mask register will be set to zero by this
    instruction unless it triggers an exception.
With regard to the latter class of instructions, however, I can't find a description of the syntax for the hybrid 'vm64[xyz]' datum in the PDF. It looks like the index vector is a vector register ([xyz]mm depending on the precise instruction that is used) but whether BASE_ADDR is stored in the low 32/64-bits of said vector register is not made clear (so far as I can tell). It seems SCALE can take values {1,2,4,8} depending on the data type & precise instruction, but again, I see no actual clarifying examples in the PDF reference.
(Sorry in advance if I misunderstood your question.)

According to test cases found in nasm, the syntax looks like this:
Code:
vgatherqpd zmm30{k1}, [r14+zmm31*8+0x7b]
I'd say that zmm31 holds 8 64-bit indices (confirmed by Intel documentation), that r14 contains the BASE_ADDR, and that this will generate 8 addresses: r14 + 8*zmm31[i] + 0x7b, for i = 0..7.
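To make that address arithmetic concrete, here is a scalar C model of the gather as described in the quoted summary: each lane i reads from base + scale*index[i] + disp, lanes whose mask bit is clear keep their old value, and the mask is zeroed on completion. The function name `gather_qpd_model` and its argument layout are invented for this sketch; it models the documented behaviour only and says nothing about timing or fault handling.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of: vgatherqpd dst{k}, [base + idx*scale + disp]
 * dst and idx have 8 lanes each; bit i of *k guards lane i. */
static void gather_qpd_model(double dst[8], uint8_t *k, const void *base,
                             const int64_t idx[8], int scale, int32_t disp)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    for (int i = 0; i < 8; i++) {
        if (!((*k >> i) & 1))
            continue;                    /* masked-off lane keeps its old value */
        /* Byte address of lane i: base + scale*idx[i] + disp */
        const char *addr = (const char *)base + idx[i] * scale + disp;
        memcpy(&dst[i], addr, sizeof(double));   /* address may be unaligned */
    }
    *k = 0;                              /* mask is cleared when the gather completes */
}
```

With scale = 8 and disp = 0 the indices are plain array subscripts into a table of doubles. A displacement like 0x7b, which is not a multiple of 8, looks like an encoding test rather than a realistic access pattern.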

Last fiddled with by ldesnogu on 2016-02-18 at 08:51

2016-02-18, 15:45   #3
Xyzzy
"Mike"
Aug 2002
3×19×137 Posts

We have attached the PDF for your convenience.

Attached Files
File Type: pdf 319433-024.pdf (4.83 MB, 1431 views)

2016-02-19, 04:21   #4
ewmayer
2ω=0
Sep 2002
República de California
2³·1,229 Posts

Quote:
Originally Posted by ldesnogu View Post
According to test cases found in nasm, the syntax looks like this:
Code:
vgatherqpd zmm30{k1}, [r14+zmm31*8+0x7b]
I'd say that zmm31 is made of 8x64-bit indices (confirmed by Intel documentation), that r14 contains the BASE_ADDR, and that this will generate 8 addresses : r14 + 8*zmm31[i] + 0x7b.
Thanks, Laurent - good find. Since NASM and MSVC inline-asm have the same addressing syntax, I adapted the vector-offset address in your example the same way I do for the scalar analogs of such addresses in my existing GCC inline-asm. Just as when translating pre-AVX512 MSVC inline-asm macros to GCC form, I reverse the operand order, replace the outer [] with () and the inner + and * signs with commas, and move any literal-constant offset to the left of the (. I started with your []-form and tried a GCC compile to verify that it leads to an error, then did the above syntax translation one step at a time, recompiling after each step to see whether the resulting error 'moved one step deeper' into the address construction. Once I did the final step, no more errors.

But GCC (I'm using v4.9.2 here) still appears to have some kind of issue with the k-register mask arguments. I first verified that these need to go to the left of the destination register in GCC syntax, and that the k-register takes no prepended % (neither 1 nor 2 of these) as is needed for GPR and vector-data registers - apparently the special status of the k-regs obviates the need for a %. But I still get error/warning messages about the resulting instruction. Let's compare - without a mask argument, the first of the following pair of test instructions compiles/assembles OK:
Code:
"vaddpd	%%zmm3,%%zmm2,%%zmm1	\n\t"\
"vgatherqpd 0x7b(%%r14,%%zmm31,8),%%zmm30	\n\t"\
but the 2nd gives

Assembler messages:
test_file.c:14: Error: default mask isn't allowed for `vgatherqpd'

[i.e. VADDPD accepts a default-mask form, but VGATHER* does not]. Then I tried adding a mask-arg to the destination (rightmost in GCC syntax) operand of both instructions:
Code:
"vaddpd	%%zmm3,%%zmm2,{k1}%%zmm1	\n\t"\
"vgatherqpd 0x7b(%%r14,%%zmm31,8),{k1}%%zmm30	\n\t"\
That gives

Assembler messages:
test_file.c:13: Error: operand size mismatch for `vaddpd'
test_file.c:14: Error: too many memory references for `vgatherqpd'
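The memory-operand part of the translation described above (outer [] to (), inner + and * to commas, constant offset moved to the left of the parenthesis) can be written down as a small formatting helper. This is only a sketch: `att_mem_operand` is a hypothetical name, it assumes a non-negative displacement, and it emits the plain-assembler form with single % signs, which would each be doubled inside a GCC extended-asm string.

```c
#include <stdio.h>

/* Hypothetical helper: format the Intel-style operand
 *     [base + index*scale + disp]
 * as the AT&T-style operand
 *     disp(%base,%index,scale)
 * Returns whatever snprintf returns. Assumes disp >= 0. */
static int att_mem_operand(char *buf, size_t n, const char *base,
                           const char *index, int scale, int disp)
{
    return snprintf(buf, n, "0x%x(%%%s,%%%s,%d)",
                    (unsigned)disp, base, index, scale);
}
```

Called with ("r14", "zmm31", 8, 0x7b), it produces the same address operand used in the test instructions above.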

2016-02-19, 05:41   #5
retina
Undefined
"The unspeakable one"
Jun 2006
My evil lair
5,879 Posts

Can you set GCC to use the more sane Intel syntax, instead of that AT&T ugliness?

2016-02-19, 08:06   #6
ewmayer
2ω=0
Sep 2002
República de California
2³×1,229 Posts

Quote:
Originally Posted by retina View Post
Can you set GCC to use the more sane Intel syntax, instead of that AT&T ugliness?
Eye of the beholder, my friend - yes, there are several ways (which differ in the key aspect of how they interact with memory-operand syntax) to use Intel syntax with GCC, described here.

Having used both syntaxes in the past, I find that it's more a function of whichever syntax one uses most often. I agree that the %% of GCC extended inline-asm are hard on the eyes, but I only add those after sketching out the basic assembly code, when I'm ready to test it out. And having learned reading and writing in Western-style left-to-right form, I find the AT&T [src,src,dest] operand order much more natural to work with: "combine two source operands and output result in destination operand". I mean, consider the kind of descriptive awkwardness that litters Intel's instruction references due to their [dest,src,src] syntax:

Performs a SIMD [blah] of the double-precision floating-point values in the first source operand (the second operand) by the floating-point values in the second source operand (the third operand). Results are written to the destination operand (the first operand).

Back to the case at hand - in the inline-asm case one saves little 'ugliness' by using Intel syntax. Here is the body of a basic AVX-syntax test macro which inlines the same instruction using each of the 2 syntaxes in turn:
Code:
".intel_syntax \n\t"\
"vaddpd	%%ymm1,%%ymm2,[%%rax+%%rbx*8]	\n\t"\
".att_syntax \n\t"\
"vaddpd	(%%rax,%%rbx,8),%%ymm2,%%ymm1	\n\t"\

2016-02-19, 08:41   #7
retina
Undefined
"The unspeakable one"
Jun 2006
My evil lair
13367₈ Posts

Can you set GCC to not require the %% ugliness? Can you set GCC to not require the \n\t silliness? Can you set GCC to not require the surrounding "'s nonsensiness?

Last fiddled with by retina on 2016-02-19 at 08:42

2016-02-19, 13:52   #8
jasonp
Tribal Bullet
Oct 2004
3·1,163 Posts

No, all that stuff is a consequence of the way inline asm works in gcc: you are specifying a string that is blindly inserted into the assembler code the compiler produces, so it needs C string formatting. The two %'s are for when you need to specify a register explicitly, since a single % is assumed to introduce an argument number from the argument list below the asm string.

You can not like it, but if the choice is between this and Microsoft's compiler, which doesn't allow inline asm at all for 64-bit code, then the choice is easy.
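A minimal sketch of the % versus %% distinction, assuming GCC or Clang on x86-64 (other targets fall back to plain C). The function `add_u64` is invented for illustration: `%0`, `%1`, `%2` refer to entries in the operand lists below the asm string, while `%%rax` names a specific register, which therefore also has to appear in the clobber list.

```c
#include <stdint.h>

/* Hypothetical example: 64-bit add via GCC extended inline asm. */
static uint64_t add_u64(uint64_t a, uint64_t b)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    uint64_t sum;
    __asm__ (
        "movq %1, %%rax  \n\t"   /* %1 = input a; %%rax = explicit register */
        "addq %2, %%rax  \n\t"   /* %2 = input b                            */
        "movq %%rax, %0  \n\t"   /* %0 = output sum                         */
        : "=r" (sum)             /* outputs                                 */
        : "r" (a), "r" (b)       /* inputs                                  */
        : "rax", "cc");          /* clobbers: rax and the flags             */
    return sum;
#else
    return a + b;                /* non-x86-64 stand-in */
#endif
}
```

The "\n\t" at the end of each line is just C string content: the pieces are concatenated into one string, and the newline separates instructions for the assembler.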

Last fiddled with by jasonp on 2016-02-19 at 14:15

2016-02-19, 14:17   #9
retina
Undefined
"The unspeakable one"
Jun 2006
My evil lair
5879₁₀ Posts

Quote:
Originally Posted by jasonp View Post
No, all that stuff is a consequence of the way inline asm works in gcc; you are specifying a string that is blindly inserted into the assembler code the compiler produces so it needs C string formatting.
It's almost as though the writers of GCC don't want anyone to use inline ASM. Are they actively trying to stop people using it out of some idealistic dream of C being the one and only?
Quote:
Originally Posted by jasonp View Post
You can not like it, but if the choice is between this and Microsoft's compiler, which doesn't allow inline asm at all for 64-bit code, then the choice is easy.
Write all your ASM in "proper" assembly code as a separate file and link it at compile time to the other C stuff later. Or write a simple parser that takes the C source code and, upon detection of inline ASM, inserts all the ugliness automatically. BTW: Fossil (the SCM) does this for inline webpage text. Makes it much easier to insert arbitrary inline text (without all the ugliness and red tape).

2016-02-19, 15:47   #10
ldesnogu
Jan 2008
France
2³×67 Posts

Quote:
Originally Posted by ewmayer View Post
I first verified that that these need to go to the left of the destination register in GCC syntax, and that the k-register takes no prepended % (neither 1 nor 2 of these) as are needed for GPR and vector-data registers - apparently the special status of the k-regs obviates the need for a %. But I still get error/warning messages about the resulting instruction. Let's compare - Without a mask argument, the first of the following pair of test-instructions compiles/assembles OK:
Code:
"vaddpd    %%zmm3,%%zmm2,%%zmm1    \n\t"\
"vgatherqpd 0x7b(%%r14,%%zmm31,8),%%zmm30    \n\t"\
but the 2nd gives

Assembler messages:
test_file.c:14: Error: default mask isn't allowed for `vgatherqpd'

[i.e. ADDPD accepts a default-mask form, but VGATHER* does not]. Then I tried adding a mask-arg to the destination (rightmost in GCC syntax) operand of both instructions:
Code:
"vaddpd    %%zmm3,%%zmm2,{k1}%%zmm1    \n\t"\
"vgatherqpd 0x7b(%%r14,%%zmm31,8),{k1}%%zmm30    \n\t"\
That gives

Assembler messages:
test_file.c:13: Error: operand size mismatch for `vaddpd'
test_file.c:14: Error: too many memory references for `vgatherqpd'
Are you sure the {k1} specifier should go before the register?

gas test files look like this:
Code:
gas/testsuite/gas/i386/avx512f.s:       vgatherqpd      123(%ebp,%zmm7,8), %zmm6{%k1}    # AVX512F
gas/testsuite/gas/i386/avx512f.s:       vgatherqpd      123(%ebp,%zmm7,8), %zmm6{%k1}    # AVX512F
gas/testsuite/gas/i386/avx512f.s:       vgatherqpd      256(%eax,%zmm7), %zmm6{%k1}      # AVX512F
gas/testsuite/gas/i386/avx512f.s:       vgatherqpd      1024(%ecx,%zmm7,4), %zmm6{%k1}   # AVX512F
gas/testsuite/gas/i386/avx512f.s:       vgatherqpd      zmm6{k1}, ZMMWORD PTR [ebp+zmm7*8-123]   # AVX512F
gas/testsuite/gas/i386/avx512f.s:       vgatherqpd      zmm6{k1}, ZMMWORD PTR [ebp+zmm7*8-123]   # AVX512F
gas/testsuite/gas/i386/avx512f.s:       vgatherqpd      zmm6{k1}, ZMMWORD PTR [eax+zmm7+256]     # AVX512F
gas/testsuite/gas/i386/avx512f.s:       vgatherqpd      zmm6{k1}, ZMMWORD PTR [ecx+zmm7*4+1024]  # AVX512F
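Carrying the AT&T form from the gas test file over to GCC extended inline asm, a masked gather might look like the sketch below. Several things here are assumptions and not anything confirmed in this thread: the helper name `gather8`, the choice of zmm0/zmm1/k1, the use of `kxnorw` to set the mask to all ones, and the compile-time guard on `__AVX512F__` with a scalar fallback. As far as I know, bare `{` and `}` are reserved in extended asm for multi-dialect alternatives, so the braces around the mask register are written `%{` and `%}`.

```c
#include <stdint.h>

/* Hypothetical helper: out[i] = base[idx[i]] for i = 0..7. */
static void gather8(double out[8], const double *base, const int64_t idx[8])
{
#if defined(__x86_64__) && defined(__AVX512F__)
    __asm__ volatile (
        "vmovdqu64  (%[idx]), %%zmm1                   \n\t" /* 8 qword indices      */
        "kxnorw     %%k1, %%k1, %%k1                   \n\t" /* k1 = all ones        */
        "vgatherqpd (%[base],%%zmm1,8), %%zmm0%{%%k1%} \n\t" /* mask follows dest    */
        "vmovupd    %%zmm0, (%[out])                   \n\t" /* store the 8 doubles  */
        :
        : [out] "r" (out), [base] "r" (base), [idx] "r" (idx)
        : "xmm0", "xmm1", "k1", "memory");
#else
    for (int i = 0; i < 8; i++)          /* scalar stand-in, same result */
        out[i] = base[idx[i]];
#endif
}
```

Note that the destination and index registers must be different, and that the gather leaves k1 zeroed, so the mask has to be set again before every gather.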

2016-02-19, 21:27   #11
ewmayer
2ω=0
Sep 2002
República de California
2³×1,229 Posts

Quote:
Originally Posted by ldesnogu View Post
Are you sure the {k1} specifier should go before the register?

gas test files look like this:
Code:
gas/testsuite/gas/i386/avx512f.s:       vgatherqpd      123(%ebp,%zmm7,8), %zmm6{%k1}    # AVX512F
I based that on the error messages I got from GCC in my various iterative syntax experiments - but it appears those may have led me down a blind alley w.r.t. the k-register masking syntax. Thanks! Will try as soon as I am home and have access to my Broadwell NUC, which has the GCC 4.9 install. Such upfront new-instruction-syntax issues seem to be par for the course: when I first started to upgrade to AVX a couple of years ago, the first issue I had to work through was that the then-current version of GCC refused to accept ymm registers in the clobber list. The workaround (which I still use, since specifying xmm* as clobbered implies the whole corresponding ymm or zmm register) was to use SSE2-style xmm-clobbers but ymm in the actual code.

Re. retina's separate-.s-file idea: sure, that will work, but it has issues of its own, like having to respect calling conventions and a 2-step compile. I like being able to focus on the code that actually 'does stuff' and being able to one-step-compile just as with pure C code. To each his own. Further, since I am for the most part translating many thousands of lines of existing AVX inline-asm - most of which is rote, with the exception of key new instructions like the above gather-load - it would be far more work to now switch to a separate-.s-file paradigm.
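For what it's worth, the clobber workaround looks roughly like this in a small self-contained function. The name `add4` and the compile-time guard on `__AVX__` (with a scalar fallback) are choices made for this sketch only: the asm body uses ymm0, while the clobber list names xmm0, which covers the same physical register.

```c
/* Hypothetical example: out[i] = a[i] + b[i] for i = 0..3. */
static void add4(double out[4], const double a[4], const double b[4])
{
#if defined(__x86_64__) && defined(__AVX__)
    __asm__ volatile (
        "vmovupd (%[a]), %%ymm0          \n\t"   /* load 4 doubles from a    */
        "vaddpd  (%[b]), %%ymm0, %%ymm0  \n\t"   /* ymm0 = ymm0 + b[0..3]    */
        "vmovupd %%ymm0, (%[out])        \n\t"   /* store the 4 sums         */
        :
        : [out] "r" (out), [a] "r" (a), [b] "r" (b)
        : "xmm0", "memory");                     /* xmm clobber, ymm in code */
#else
    for (int i = 0; i < 4; i++)                  /* scalar stand-in */
        out[i] = a[i] + b[i];
#endif
}
```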

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.