Old 2020-07-23, 21:23   #15
ewmayer
2ω=0
 
 
Sep 2002
República de California

9,791 Posts

Quote:
Originally Posted by retina
I am still amazed at how much C code it takes just to coax it into using a single native instruction.

You have to throw out all the "portability" that C is supposed to provide. And you have to provide a lot of ugly boilerplate lines to coerce it into assembling just one line. And you have to write it in the worst possible syntax: AT&T.

That's why the way to go is to use single-instruction (or few-instruction) asm macros only at small scale, say for prototyping and proof-of-concept work. In my case, I'm a Luddite re. all those asm-macro i/o-constraint goodies gcc provides - way too many of them, really nasty syntax, hard to discern the actual register usage, etc. Since those exist mainly to help the compiler inline one's asm more tightly with the surrounding C code, any performance gain from them will show up mainly when the enclosed machine-instruction count is small - for bigger chunks of asm it's not going to matter. If you're a performance fetishist you're going to want larger chunks of asm anyway - so in my case, in order to focus on the actual instruction flow and avoid the distracting boilerplate, I just use a standardized-format asm-macro template, which makes it easy to add as many memory operands as needed, uses actual hardware register names, etc. Here's an example - this is the source for one of the 8x8 matrix-of-doubles transpose macros I tested when AVX-512 became available:
Code:
// [1a] Rowwise-load and in-register data shuffles. On KNL: 45 cycles per loop-exec:
nerr = 0; clock1 = getRealTime();
for(i = 0; i < imax; i++) {
	__asm__ volatile (\
		"movq		%[__data],%%rax		\n\t"\
		/* Read in the 8 rows of our input matrix: */\
		"vmovaps		0x000(%%rax),%%zmm0		\n\t"\
		"vmovaps		0x040(%%rax),%%zmm1		\n\t"\
		"vmovaps		0x080(%%rax),%%zmm2		\n\t"\
		"vmovaps		0x0c0(%%rax),%%zmm3		\n\t"\
		"vmovaps		0x100(%%rax),%%zmm4		\n\t"\
		"vmovaps		0x140(%%rax),%%zmm5		\n\t"\
		"vmovaps		0x180(%%rax),%%zmm6		\n\t"\
		"vmovaps		0x1c0(%%rax),%%zmm7		\n\t"\
		/* Transpose uses regs0-7 for data, reg8 for temp: */\
		/* [1] First step is a quartet of [UNPCKLPD,UNPCKHPD] pairs to effect transposed 2x2 submatrices - */\
		/* indices in comments at right are [row,col] pairs, i.e. octal version of linear array indices: */\
		"vunpcklpd		%%zmm1,%%zmm0,%%zmm8	\n\t"/* zmm8 = 00 10 02 12 04 14 06 16 */\
		"vunpckhpd		%%zmm1,%%zmm0,%%zmm1	\n\t"/* zmm1 = 01 11 03 13 05 15 07 17 */\
		"vunpcklpd		%%zmm3,%%zmm2,%%zmm0	\n\t"/* zmm0 = 20 30 22 32 24 34 26 36 */\
		"vunpckhpd		%%zmm3,%%zmm2,%%zmm3	\n\t"/* zmm3 = 21 31 23 33 25 35 27 37 */\
		"vunpcklpd		%%zmm5,%%zmm4,%%zmm2	\n\t"/* zmm2 = 40 50 42 52 44 54 46 56 */\
		"vunpckhpd		%%zmm5,%%zmm4,%%zmm5	\n\t"/* zmm5 = 41 51 43 53 45 55 47 57 */\
		"vunpcklpd		%%zmm7,%%zmm6,%%zmm4	\n\t"/* zmm4 = 60 70 62 72 64 74 66 76 */\
		"vunpckhpd		%%zmm7,%%zmm6,%%zmm7	\n\t"/* zmm7 = 61 71 63 73 65 75 67 77 */\
	/**** Getting rid of reg-index-nicifying copies here means Outputs not in 0-7 but in 8,1,0,3,2,5,4,7, with 6 now free ****/\
		/* [2] 1st layer of VSHUFF64x2, 2 outputs each with trailing index pairs [0,4],[1,5],[2,6],[3,7]. */\
		/* Note the imm8 values expressed in terms of 2-bit index subfields again read right-to-left */\
		/* (as for the SHUFPS imm8 values in the AVX 8x8 float code) are 221 = (3,1,3,1) and 136 = (2,0,2,0): */\
		"vshuff64x2	$136,%%zmm0,%%zmm8,%%zmm6	\n\t"/* zmm6 = 00 10 04 14 20 30 24 34 */\
		"vshuff64x2	$221,%%zmm0,%%zmm8,%%zmm0	\n\t"/* zmm0 = 02 12 06 16 22 32 26 36 */\
		"vshuff64x2	$136,%%zmm3,%%zmm1,%%zmm8	\n\t"/* zmm8 = 01 11 05 15 21 31 25 35 */\
		"vshuff64x2	$221,%%zmm3,%%zmm1,%%zmm3	\n\t"/* zmm3 = 03 13 07 17 23 33 27 37 */\
		"vshuff64x2	$136,%%zmm4,%%zmm2,%%zmm1	\n\t"/* zmm1 = 40 50 44 54 60 70 64 74 */\
		"vshuff64x2	$221,%%zmm4,%%zmm2,%%zmm4	\n\t"/* zmm4 = 42 52 46 56 62 72 66 76 */\
		"vshuff64x2	$136,%%zmm7,%%zmm5,%%zmm2	\n\t"/* zmm2 = 41 51 45 55 61 71 65 75 */\
		"vshuff64x2	$221,%%zmm7,%%zmm5,%%zmm7	\n\t"/* zmm7 = 43 53 47 57 63 73 67 77 */\
	/**** Getting rid of reg-index-nicifying copies here means Outputs 8,1,2,5 -> 6,8,1,2, with 5 now free ***/\
		/* [3] Last step in 2nd layer of VSHUFF64x2, now combining reg-pairs sharing same trailing index pairs. */\
		/* Output register indices reflect trailing index of data contained therein: */\
		"vshuff64x2	$136,%%zmm1,%%zmm6,%%zmm5	\n\t"/* zmm5 = 00 10 20 30 40 50 60 70 [row 0 of transpose-matrix] */\
		"vshuff64x2	$221,%%zmm1,%%zmm6,%%zmm1	\n\t"/* zmm1 = 04 14 24 34 44 54 64 74 [row 4 of transpose-matrix] */\
		"vshuff64x2	$136,%%zmm2,%%zmm8,%%zmm6	\n\t"/* zmm6 = 01 11 21 31 41 51 61 71 [row 1 of transpose-matrix] */\
		"vshuff64x2	$221,%%zmm2,%%zmm8,%%zmm2	\n\t"/* zmm2 = 05 15 25 35 45 55 65 75 [row 5 of transpose-matrix] */\
		"vshuff64x2	$136,%%zmm4,%%zmm0,%%zmm8	\n\t"/* zmm8 = 02 12 22 32 42 52 62 72 [row 2 of transpose-matrix] */\
		"vshuff64x2	$221,%%zmm4,%%zmm0,%%zmm4	\n\t"/* zmm4 = 06 16 26 36 46 56 66 76 [row 6 of transpose-matrix] */\
		"vshuff64x2	$136,%%zmm7,%%zmm3,%%zmm0	\n\t"/* zmm0 = 03 13 23 33 43 53 63 73 [row 3 of transpose-matrix] */\
		"vshuff64x2	$221,%%zmm7,%%zmm3,%%zmm7	\n\t"/* zmm7 = 07 17 27 37 47 57 67 77 [row 7 of transpose-matrix] */\
	/**** Getting rid of reg-index-nicifying copies here means Outputs 6,8,0,3 -> 5,6,8,0 with 3 now free ***/\
		/* Write original columns back as rows: */\
		"vmovaps		%%zmm5,0x000(%%rax)		\n\t"\
		"vmovaps		%%zmm6,0x040(%%rax)		\n\t"\
		"vmovaps		%%zmm8,0x080(%%rax)		\n\t"\
		"vmovaps		%%zmm0,0x0c0(%%rax)		\n\t"\
		"vmovaps		%%zmm1,0x100(%%rax)		\n\t"\
		"vmovaps		%%zmm2,0x140(%%rax)		\n\t"\
		"vmovaps		%%zmm4,0x180(%%rax)		\n\t"\
		"vmovaps		%%zmm7,0x1c0(%%rax)		\n\t"\
		:						// outputs: none
		: [__data] "m" (data)	// All inputs from memory addresses here
		: "cc","memory","rax","xmm0","xmm1","xmm2","xmm3","xmm4","xmm5","xmm6","xmm7","xmm8"	// Clobbered registers - use xmm form for compatibility with older versions of clang/gcc
	);
}
clock2 = getRealTime();
tdiff = (double)(clock2 - clock1);
printf("Method [1a]: Time for %u 8x8 doubles-transposes using in-register shuffles =%s\n",imax, get_time_str(tdiff));
Just one mem-operand in this case; if I needed a second, it would be a simple matter of adding e.g. ,[__dat2] "m" (dat2) - see the sketch below. The actual register clobber list needs to be done carefully by the user once the code is in place - one of my peeves re. GCC is that it doesn't even do something as simple as parse-asm-and-extract-register-names-to-autogenerate-a-clobber-list ... the old 32-bit MSFT Visual Studio was actually great in that respect: it did a "smart parsing" of the user's inline asm and also allowed step-thru debugging of same. (I abandoned MSFT when they took literally *years* following wide-scale deployment of x86_64 to update their compiler to support 64-bit inline asm.)
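
For concreteness, here's a throwaway self-contained sketch of that two-mem-operand form - the dat2 array and the trivial zmm register copy it does are placeholders of mine, not from the actual transpose code, and it assumes an AVX-512F-capable CPU and gcc/clang-style extended asm:
Code:
// Throwaway sketch of the 2-mem-operand form described above - dat2 and the
// trivial zmm copy are placeholders, not from the real transpose code.
// Needs an AVX-512F CPU and gcc/clang-style extended asm to build and run.
#include <stdio.h>

int main(void)
{
	static double arr1[8] __attribute__((aligned(64))) = {0,1,2,3,4,5,6,7};
	static double arr2[8] __attribute__((aligned(64)));
	double *data = arr1, *dat2 = arr2;	// pointer operands, as in the macro above

	__asm__ volatile (\
		"movq		%[__data],%%rax		\n\t"\
		"movq		%[__dat2],%%rbx		\n\t"\
		"vmovaps	0x000(%%rax),%%zmm0	\n\t"/* read 8 doubles from data */\
		"vmovaps	%%zmm0,0x000(%%rbx)	\n\t"/* write them to dat2 */\
		:						// outputs: none
		: [__data] "m" (data)	// All inputs from memory addresses here
		 ,[__dat2] "m" (dat2)	// the second mem-operand: just one more line
		: "cc","memory","rax","rbx","xmm0"	// clobbers, still filled in by hand
	);
	printf("dat2[0] = %f, dat2[7] = %f\n", dat2[0], dat2[7]);
	return 0;
}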

Anyway, as noted, GCC won't say anything re. the guts of such macros at compile time - so if the asm has a syntax error (very easy for these to creep in), one is often left trying to decipher cryptic error messages from the assembler and resorting to "divide and conquer" syntax-debugging: comment out the bottom half of the instruction block, see if the assembler error persists, etc.
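
To illustrate, here's a contrived sketch - the misspelled mnemonic inside the block comment is made up for illustration. The point is simply that since the asm template is just a pile of adjacent C string literals, an ordinary /* */ block comment chops the bottom half out of the assembler's input without disturbing the rest of the statement:
Code:
// Contrived sketch of the divide-and-conquer debug step described above - the
// misspelled mnemonic inside the block comment is made up for illustration.
// Because the template is just adjacent C string literals, commenting out the
// lower half removes those instructions from what the assembler ever sees.
#include <stdio.h>

int main(void)
{
	static double arr[8] __attribute__((aligned(64))) = {1,2,3,4,5,6,7,8};
	double *data = arr;

	__asm__ volatile (\
		"movq		%[__data],%%rax		\n\t"\
		"vmovaps	0x000(%%rax),%%zmm0	\n\t"\
		"vaddpd		%%zmm0,%%zmm0,%%zmm0	\n\t"\
	/* ---- bottom half disabled while bisecting the assembler error: ----
		"vmovapd	%%zmm0,%%zm1		\n\t"	<-- the typo being hunted (zm1)
		"vmovaps	%%zmm0,0x000(%%rax)	\n\t"
	   ------------------------------------------------------------------ */\
		:						// outputs: none
		: [__data] "m" (data)	// All inputs from memory addresses here
		: "cc","memory","rax","xmm0","xmm1"	// clobbers left as-is while bisecting
	);
	printf("data[0] = %f\n", data[0]);	// still 1.0 - the store is commented out
	return 0;
}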

Lastly, note the liberal use of inline C-syntax comments: most editors capable of C-syntax highlighting will also color-highlight the asm - instructions in the same color as a C string, comments in whatever color comments are set to. *Very* nice in terms of aiding debugging and readability.

Last fiddled with by ewmayer on 2020-07-23 at 21:27