mersenneforum.org  

Old 2020-07-23, 08:39   #12
retina
Undefined
 
"The unspeakable one"
Jun 2006
My evil lair

3²·7²·13 Posts

Quote:
Originally Posted by xilman
AFAICT essentially all other languages (assembly excepted of course) make it difficult to exploit all the instructions provided by any CISC architecture.
RISC also. No access to any flags. No way to get the high portion of a multiply. etc.
Old 2020-07-23, 13:06   #13
jasonp
Tribal Bullet
 
Oct 2004

3,529 Posts

Modern assembler versions also let you switch to Intel syntax with an assembler directive.

The extra boilerplate controls where the input operands come from, where the outputs go, what is expected to be overwritten and clobbered, whether the whole block can be moved around other basic blocks in your code, etc. The actual instructions in the inline asm are don't-cares for the compiler: you can put 1000 instructions in there and it will paste them into the generated assembly, and if you make a mistake it will just as happily paste nonsense that then fails to assemble.
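
To make the operand/clobber boilerplate concrete, here is a minimal sketch (not taken from anyone's code in this thread) that wraps a single x86-64 `mulq` to return the high 64 bits of a 64x64-bit product. The function names `mul_hi_u64` and `mul_hi_u64_selfcheck` are made up for illustration, and the non-x86 fallback assumes a GCC/clang-style `unsigned __int128`:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: high 64 bits of the 128-bit product a*b. */
uint64_t mul_hi_u64(uint64_t a, uint64_t b)
{
#if defined(__x86_64__) && defined(__GNUC__)
    uint64_t lo, hi;
    __asm__ (
        "mulq %[b]"               /* rdx:rax = rax * b                  */
        : "=a" (lo), "=d" (hi)    /* outputs: low half in rax, high rdx */
        : "a" (a), [b] "rm" (b)   /* inputs: a in rax, b in reg or mem  */
        : "cc"                    /* clobbers: mulq rewrites the flags  */
    );
    (void)lo;                     /* low half is not needed here */
    return hi;
#else
    /* Assumed fallback for other targets: compiler-provided 128-bit type. */
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
#endif
}

/* Small sanity check a caller could run once at startup. */
void mul_hi_u64_selfcheck(void)
{
    /* (2^64 - 1)^2 = 2^128 - 2^65 + 1, so the high half is 2^64 - 2. */
    assert(mul_hi_u64(UINT64_MAX, UINT64_MAX) == UINT64_MAX - 1);
}
```

The one instruction is the "don't-care" part; everything after the first colon is the boilerplate that tells the compiler which registers carry data in and out and what gets overwritten.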

If you want a braindead alternative, Sun's compiler used to have an inline asm syntax that only allowed the text of one instruction, with no way to control any of the above. Good luck doing something nontrivial with that facility.

Last fiddled with by jasonp on 2020-07-23 at 13:12
Old 2020-07-23, 20:55   #14
pinhodecarlos
 
"Carlos Pinho"
Oct 2011
Milton Keynes, UK

7×673 Posts

I hope this is a valid question: why not use Fortran?
Old 2020-07-23, 21:23   #15
ewmayer
2ω=0
 
Sep 2002
República de California

2CFE₁₆ Posts

Quote:
Originally Posted by retina
I am still amazed at how much C code it takes just to coax it into using a single native instruction.

You have to throw out all the "portability" that C is supposed to provide. And you have to provide a lot of ugly boilerplate lines to coerce it into assembling just one line. And you have to write it in the worst possible syntax: AT&T.

That's why the way to go is to use the single (or few)-instruction asm macros only at small scale, say for prototyping and proof-of-concept work. In my case, I'm a Luddite re. all those asm-macro i/o flag goodies gcc provides - way too many, syntax really nasty, hard to discern actual register usage, etc. Since those are mainly there to help the compiler better inline one's asm with the surrounding C code, any performance gain therefrom is going to show up mainly when the enclosed machine-instruction count is small - for bigger chunks of asm, it's not gonna matter. If you're a performance-fetishist, you're gonna want larger chunks of asm anyway - so in my case, in order to better focus on the actual instruction flow and get rid of the distracting boilerplate, I just use a standardized-format asm-macro template, which makes it easy to add as many memory operands as needed, uses actual hardware register names, etc. Here is an example - this is the source for one of the 8x8 matrix-of-doubles transpose macros I tested when AVX-512 became available:
Code:
// [1a] Rowwise-load and in-register data shuffles. On KNL: 45 cycles per loop-exec:
nerr = 0; clock1 = getRealTime();
for(i = 0; i < imax; i++) {
	__asm__ volatile (\
		"movq		%[__data],%%rax		\n\t"\
		/* Read in the 8 rows of our input matrix: */\
		"vmovaps		0x000(%%rax),%%zmm0		\n\t"\
		"vmovaps		0x040(%%rax),%%zmm1		\n\t"\
		"vmovaps		0x080(%%rax),%%zmm2		\n\t"\
		"vmovaps		0x0c0(%%rax),%%zmm3		\n\t"\
		"vmovaps		0x100(%%rax),%%zmm4		\n\t"\
		"vmovaps		0x140(%%rax),%%zmm5		\n\t"\
		"vmovaps		0x180(%%rax),%%zmm6		\n\t"\
		"vmovaps		0x1c0(%%rax),%%zmm7		\n\t"\
		/* Transpose uses regs0-7 for data, reg8 for temp: */\
		/* [1] First step is a quartet of [UNPCKLPD,UNPCKHPD] pairs to effect transposed 2x2 submatrices - */\
		/* indices in comments at right are [row,col] pairs, i.e. octal version of linear array indices: */\
		"vunpcklpd		%%zmm1,%%zmm0,%%zmm8	\n\t"/* zmm8 = 00 10 02 12 04 14 06 16 */\
		"vunpckhpd		%%zmm1,%%zmm0,%%zmm1	\n\t"/* zmm1 = 01 11 03 13 05 15 07 17 */\
		"vunpcklpd		%%zmm3,%%zmm2,%%zmm0	\n\t"/* zmm0 = 20 30 22 32 24 34 26 36 */\
		"vunpckhpd		%%zmm3,%%zmm2,%%zmm3	\n\t"/* zmm3 = 21 31 23 33 25 35 27 37 */\
		"vunpcklpd		%%zmm5,%%zmm4,%%zmm2	\n\t"/* zmm2 = 40 50 42 52 44 54 46 56 */\
		"vunpckhpd		%%zmm5,%%zmm4,%%zmm5	\n\t"/* zmm5 = 41 51 43 53 45 55 47 57 */\
		"vunpcklpd		%%zmm7,%%zmm6,%%zmm4	\n\t"/* zmm4 = 60 70 62 72 64 74 66 76 */\
		"vunpckhpd		%%zmm7,%%zmm6,%%zmm7	\n\t"/* zmm7 = 61 71 63 73 65 75 67 77 */\
	/**** Getting rid of reg-index-nicifying copies here means Outputs not in 0-7 but in 8,1,0,3,2,5,4,7, with 6 now free ****/\
		/* [2] 1st layer of VSHUFF64x2, 2 outputs each with trailing index pairs [0,4],[1,5],[2,6],[3,7]. */\
		/* Note the imm8 values expressed in terms of 2-bit index subfields again read right-to-left */\
		/* (as for the SHUFPS imm8 values in the AVX 8x8 float code) are 221 = (3,1,3,1) and 136 = (2,0,2,0): */\
		"vshuff64x2	$136,%%zmm0,%%zmm8,%%zmm6	\n\t"/* zmm6 = 00 10 04 14 20 30 24 34 */\
		"vshuff64x2	$221,%%zmm0,%%zmm8,%%zmm0	\n\t"/* zmm0 = 02 12 06 16 22 32 26 36 */\
		"vshuff64x2	$136,%%zmm3,%%zmm1,%%zmm8	\n\t"/* zmm8 = 01 11 05 15 21 31 25 35 */\
		"vshuff64x2	$221,%%zmm3,%%zmm1,%%zmm3	\n\t"/* zmm3 = 03 13 07 17 23 33 27 37 */\
		"vshuff64x2	$136,%%zmm4,%%zmm2,%%zmm1	\n\t"/* zmm1 = 40 50 44 54 60 70 64 74 */\
		"vshuff64x2	$221,%%zmm4,%%zmm2,%%zmm4	\n\t"/* zmm4 = 42 52 46 56 62 72 66 76 */\
		"vshuff64x2	$136,%%zmm7,%%zmm5,%%zmm2	\n\t"/* zmm2 = 41 51 45 55 61 71 65 75 */\
		"vshuff64x2	$221,%%zmm7,%%zmm5,%%zmm7	\n\t"/* zmm7 = 43 53 47 57 63 73 67 77 */\
	/**** Getting rid of reg-index-nicifying copies here means Outputs 8,1,2,5 -> 6,8,1,2, with 5 now free ***/\
		/* [3] Last step in 2nd layer of VSHUFF64x2, now combining reg-pairs sharing same trailing index pairs. */\
		/* Output register indices reflect trailing index of data contained therein: */\
		"vshuff64x2	$136,%%zmm1,%%zmm6,%%zmm5	\n\t"/* zmm5 = 00 10 20 30 40 50 60 70 [row 0 of transpose-matrix] */\
		"vshuff64x2	$221,%%zmm1,%%zmm6,%%zmm1	\n\t"/* zmm1 = 04 14 24 34 44 54 64 74 [row 4 of transpose-matrix] */\
		"vshuff64x2	$136,%%zmm2,%%zmm8,%%zmm6	\n\t"/* zmm6 = 01 11 21 31 41 51 61 71 [row 1 of transpose-matrix] */\
		"vshuff64x2	$221,%%zmm2,%%zmm8,%%zmm2	\n\t"/* zmm2 = 05 15 25 35 45 55 65 75 [row 5 of transpose-matrix] */\
		"vshuff64x2	$136,%%zmm4,%%zmm0,%%zmm8	\n\t"/* zmm8 = 02 12 22 32 42 52 62 72 [row 2 of transpose-matrix] */\
		"vshuff64x2	$221,%%zmm4,%%zmm0,%%zmm4	\n\t"/* zmm4 = 06 16 26 36 46 56 66 76 [row 6 of transpose-matrix] */\
		"vshuff64x2	$136,%%zmm7,%%zmm3,%%zmm0	\n\t"/* zmm0 = 03 13 23 33 43 53 63 73 [row 3 of transpose-matrix] */\
		"vshuff64x2	$221,%%zmm7,%%zmm3,%%zmm7	\n\t"/* zmm7 = 07 17 27 37 47 57 67 77 [row 7 of transpose-matrix] */\
	/**** Getting rid of reg-index-nicifying copies here means Outputs 6,8,0,3 -> 5,6,8,0 with 3 now free ***/\
		/* Write original columns back as rows: */\
		"vmovaps		%%zmm5,0x000(%%rax)		\n\t"\
		"vmovaps		%%zmm6,0x040(%%rax)		\n\t"\
		"vmovaps		%%zmm8,0x080(%%rax)		\n\t"\
		"vmovaps		%%zmm0,0x0c0(%%rax)		\n\t"\
		"vmovaps		%%zmm1,0x100(%%rax)		\n\t"\
		"vmovaps		%%zmm2,0x140(%%rax)		\n\t"\
		"vmovaps		%%zmm4,0x180(%%rax)		\n\t"\
		"vmovaps		%%zmm7,0x1c0(%%rax)		\n\t"\
		:						// outputs: none
		: [__data] "m" (data)	// All inputs from memory addresses here
		: "cc","memory","rax","xmm0","xmm1","xmm2","xmm3","xmm4","xmm5","xmm6","xmm7","xmm8"	// Clobbered registers - use xmm form for compatibility with older versions of clang/gcc
	);
}
clock2 = getRealTime();
tdiff = (double)(clock2 - clock1);
printf("Method [1a]: Time for %u 8x8 doubles-transposes using in-register shuffles =%s\n",imax, get_time_str(tdiff));
Just one mem-operand in this case; if I needed a second, it would be a simple matter of adding e.g. ,[__dat2] "m" (dat2). The actual register clobber list needs to be carefully done by the user once the code is in place - one of my peeves re. GCC is that it doesn't even do something as simple as parse-asm-and-extract-register-names-to-autogenerate-a-clobber-list ... the old 32-bit MSFT Visual Studio was actually great in that respect: it did a "smart parsing" of the user's inline asm and would also allow step-thru debugging of same. (I abandoned MSFT when they took literally *years* following wide-scale deployment of x86_64 to update their compiler to support 64-bit inline asm.)
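
As a stripped-down illustration of that template shape with two memory operands, here is a sketch in the same style: both pointers are passed as named "m" operands, loaded into hard-coded registers, and every register touched is listed by hand in the clobbers. The names `copy_u64`, `__src` and `__dst` are invented for this example, and the plain-C branch is just an assumed fallback for non-x86_64 builds:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical example: copy one 64-bit word from *src to *dst. */
void copy_u64(const uint64_t *src, uint64_t *dst)
{
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ volatile (
        "movq	%[__src],%%rax	\n\t"	/* rax = src pointer */
        "movq	%[__dst],%%rdx	\n\t"	/* rdx = dst pointer */
        "movq	(%%rax),%%rcx	\n\t"	/* rcx = *src */
        "movq	%%rcx,(%%rdx)	\n\t"	/* *dst = rcx */
        :					/* outputs: none */
        : [__src] "m" (src)	/* All inputs from memory addresses here */
         ,[__dst] "m" (dst)
        : "cc","memory","rax","rcx","rdx"	/* Clobbered registers, listed by hand */
    );
#else
    *dst = *src;	/* assumed fallback when the asm path is unavailable */
#endif
}

/* Small sanity check a caller could run once at startup. */
void copy_u64_selfcheck(void)
{
    uint64_t a = 0x0123456789abcdefULL, b = 0;
    copy_u64(&a, &b);
    assert(b == a);
}
```

The "memory" clobber is what tells the compiler that the block writes through a pointer it cannot see; leave a register out of the list and nothing warns you.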

Anyway, as noted, GCC won't say anything re. guts-of-such-macros at compile time - so if the asm has a syntax error (very easy for these to creep in), one is often left to resort to cryptic error messages from the assembler and to "divide and conquer" syntax-debugging: cut the bottom half of the instruction block, see if the assembler error persists, etc.

Lastly, note the liberal use of inline C-syntax comments: most editors capable of C-syntax highlighting will also color-highlight the asm, with instructions in the same color as a C string and comments in whatever color those are set to. *Very* nice in terms of aiding debug and readability.

Last fiddled with by ewmayer on 2020-07-23 at 21:27

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.