#12 |
Undefined
"The unspeakable one"
Jun 2006
My evil lair
2³×3×269 Posts |
#13 |
Tribal Bullet
Oct 2004
5·709 Posts |
Modern assembler versions also let you switch to Intel syntax with an assembler directive.
The extra boilerplate controls where the input operands come from, where outputs go, what is expected to be overwritten and clobbered, whether the whole block can be moved around other basic blocks in your code, etc. The actual instructions in the inline asm are don't-cares for the compiler; you can put 1000 instructions in there and it will paste them into the generated assembly, or paste nonsense that will fail to assemble if you make a mistake.

If you want a braindead alternative, Sun's compiler used to have an inline asm syntax that only allowed the text of one instruction, with no way to control any of the above. Good luck doing something nontrivial with that facility.

Last fiddled with by jasonp on 2020-07-23 at 13:12
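To make the "boilerplate" concrete, here is a minimal sketch of a one-instruction GCC/clang extended-asm block with each section labeled. The function names `add_u64` and `add_u64_demo` are made up for illustration and do not come from this thread; the asm path assumes an x86-64 target, with a plain C fallback on anything else.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns a + b via extended inline asm on x86-64. */
uint64_t add_u64(uint64_t a, uint64_t b)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __asm__ (
        "addq %[src], %[dst] \n\t"  /* AT&T order: dst += src */
        : [dst] "+r" (a)            /* outputs: a is read and written, any general register */
        : [src] "r"  (b)            /* inputs: b in any general register */
        : "cc"                      /* clobbers: addq modifies the flags */
    );
    /* No "volatile" here, so the compiler may move or drop the block if the
       result is unused; no "memory" clobber, since the block touches no memory. */
    return a;
#else
    return a + b;                   /* fallback for non-x86-64 targets */
#endif
}

/* Usage sketch: the compiler picks the registers, the asm text is pasted as-is. */
void add_u64_demo(void)
{
    assert(add_u64(2, 3) == 5);
}
```

The compiler only looks at the three colon-separated lists; the instruction string itself is checked by the assembler, not by the compiler.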
#14 |
"Carlos Pinho"
Oct 2011
Milton Keynes, UK
13AC₁₆ Posts |
Hope it is a valid question: why not use Fortran?
#15 | |
∂2ω=0
Sep 2002
República de California
3²·1,303 Posts |
Code:

    // [1a] Rowwise-load and in-register data shuffles. On KNL: 45 cycles per loop-exec:
    nerr = 0; clock1 = getRealTime();
    for(i = 0; i < imax; i++)
    {
        __asm__ volatile (\
            "movq   %[__data],%%rax \n\t"\
            /* Read in the 8 rows of our input matrix: */\
            "vmovaps 0x000(%%rax),%%zmm0 \n\t"\
            "vmovaps 0x040(%%rax),%%zmm1 \n\t"\
            "vmovaps 0x080(%%rax),%%zmm2 \n\t"\
            "vmovaps 0x0c0(%%rax),%%zmm3 \n\t"\
            "vmovaps 0x100(%%rax),%%zmm4 \n\t"\
            "vmovaps 0x140(%%rax),%%zmm5 \n\t"\
            "vmovaps 0x180(%%rax),%%zmm6 \n\t"\
            "vmovaps 0x1c0(%%rax),%%zmm7 \n\t"\
            /* Transpose uses regs0-7 for data, reg8 for temp: */\
            /* [1] First step is a quartet of [UNPCKLPD,UNPCKHPD] pairs to effect transposed 2x2 submatrices - */\
            /* indices in comments at right are [row,col] pairs, i.e. octal version of linear array indices: */\
            "vunpcklpd %%zmm1,%%zmm0,%%zmm8 \n\t"/* zmm8 = 00 10 02 12 04 14 06 16 */\
            "vunpckhpd %%zmm1,%%zmm0,%%zmm1 \n\t"/* zmm1 = 01 11 03 13 05 15 07 17 */\
            "vunpcklpd %%zmm3,%%zmm2,%%zmm0 \n\t"/* zmm0 = 20 30 22 32 24 34 26 36 */\
            "vunpckhpd %%zmm3,%%zmm2,%%zmm3 \n\t"/* zmm3 = 21 31 23 33 25 35 27 37 */\
            "vunpcklpd %%zmm5,%%zmm4,%%zmm2 \n\t"/* zmm2 = 40 50 42 52 44 54 46 56 */\
            "vunpckhpd %%zmm5,%%zmm4,%%zmm5 \n\t"/* zmm5 = 41 51 43 53 45 55 47 57 */\
            "vunpcklpd %%zmm7,%%zmm6,%%zmm4 \n\t"/* zmm4 = 60 70 62 72 64 74 66 76 */\
            "vunpckhpd %%zmm7,%%zmm6,%%zmm7 \n\t"/* zmm7 = 61 71 63 73 65 75 67 77 */\
            /**** Getting rid of reg-index-nicifying copies here means Outputs not in 0-7 but in 8,1,0,3,2,5,4,7, with 6 now free ****/\
            /* [2] 1st layer of VSHUFF64x2, 2 outputs each with trailing index pairs [0,4],[1,5],[2,6],[3,7]. */\
            /* Note the imm8 values expressed in terms of 2-bit index subfields again read right-to-left */\
            /* (as for the SHUFPS imm8 values in the AVX 8x8 float code) are 221 = (3,1,3,1) and 136 = (2,0,2,0): */\
            "vshuff64x2 $136,%%zmm0,%%zmm8,%%zmm6 \n\t"/* zmm6 = 00 10 04 14 20 30 24 34 */\
            "vshuff64x2 $221,%%zmm0,%%zmm8,%%zmm0 \n\t"/* zmm0 = 02 12 06 16 22 32 26 36 */\
            "vshuff64x2 $136,%%zmm3,%%zmm1,%%zmm8 \n\t"/* zmm8 = 01 11 05 15 21 31 25 35 */\
            "vshuff64x2 $221,%%zmm3,%%zmm1,%%zmm3 \n\t"/* zmm3 = 03 13 07 17 23 33 27 37 */\
            "vshuff64x2 $136,%%zmm4,%%zmm2,%%zmm1 \n\t"/* zmm1 = 40 50 44 54 60 70 64 74 */\
            "vshuff64x2 $221,%%zmm4,%%zmm2,%%zmm4 \n\t"/* zmm4 = 42 52 46 56 62 72 66 76 */\
            "vshuff64x2 $136,%%zmm7,%%zmm5,%%zmm2 \n\t"/* zmm2 = 41 51 45 55 61 71 65 75 */\
            "vshuff64x2 $221,%%zmm7,%%zmm5,%%zmm7 \n\t"/* zmm7 = 43 53 47 57 63 73 67 77 */\
            /**** Getting rid of reg-index-nicifying copies here means Outputs 8,1,2,5 -> 6,8,1,2, with 5 now free ***/\
            /* [3] Last step in 2nd layer of VSHUFF64x2, now combining reg-pairs sharing same trailing index pairs. */\
            /* Output register indices reflect trailing index of data contained therein: */\
            "vshuff64x2 $136,%%zmm1,%%zmm6,%%zmm5 \n\t"/* zmm5 = 00 10 20 30 40 50 60 70 [row 0 of transpose-matrix] */\
            "vshuff64x2 $221,%%zmm1,%%zmm6,%%zmm1 \n\t"/* zmm1 = 04 14 24 34 44 54 64 74 [row 4 of transpose-matrix] */\
            "vshuff64x2 $136,%%zmm2,%%zmm8,%%zmm6 \n\t"/* zmm6 = 01 11 21 31 41 51 61 71 [row 1 of transpose-matrix] */\
            "vshuff64x2 $221,%%zmm2,%%zmm8,%%zmm2 \n\t"/* zmm2 = 05 15 25 35 45 55 65 75 [row 5 of transpose-matrix] */\
            "vshuff64x2 $136,%%zmm4,%%zmm0,%%zmm8 \n\t"/* zmm8 = 02 12 22 32 42 52 62 72 [row 2 of transpose-matrix] */\
            "vshuff64x2 $221,%%zmm4,%%zmm0,%%zmm4 \n\t"/* zmm4 = 06 16 26 36 46 56 66 76 [row 6 of transpose-matrix] */\
            "vshuff64x2 $136,%%zmm7,%%zmm3,%%zmm0 \n\t"/* zmm0 = 03 13 23 33 43 53 63 73 [row 3 of transpose-matrix] */\
            "vshuff64x2 $221,%%zmm7,%%zmm3,%%zmm7 \n\t"/* zmm7 = 07 17 27 37 47 57 67 77 [row 7 of transpose-matrix] */\
            /**** Getting rid of reg-index-nicifying copies here means Outputs 6,8,0,3 -> 5,6,8,0 with 3 now free ***/\
            /* Write original columns back as rows: */\
            "vmovaps %%zmm5,0x000(%%rax) \n\t"\
            "vmovaps %%zmm6,0x040(%%rax) \n\t"\
            "vmovaps %%zmm8,0x080(%%rax) \n\t"\
            "vmovaps %%zmm0,0x0c0(%%rax) \n\t"\
            "vmovaps %%zmm1,0x100(%%rax) \n\t"\
            "vmovaps %%zmm2,0x140(%%rax) \n\t"\
            "vmovaps %%zmm4,0x180(%%rax) \n\t"\
            "vmovaps %%zmm7,0x1c0(%%rax) \n\t"\
            :                       // outputs: none
            : [__data] "m" (data)   // All inputs from memory addresses here
            : "cc","memory","rax","xmm0","xmm1","xmm2","xmm3","xmm4","xmm5","xmm6","xmm7","xmm8"    // Clobbered registers - use xmm form for compatibility with older versions of clang/gcc
        );
    }
    clock2 = getRealTime();
    tdiff = (double)(clock2 - clock1);
    printf("Method [1a]: Time for %u 8x8 doubles-transposes using in-register shuffles =%s\n",imax, get_time_str(tdiff));

Anyway, as noted, GCC won't say anything re. guts-of-such-macros at compile time - so if the asm has a syntax error (very easy for these to creep in) one is often left trying to sort out cryptic error messages from the assembler, and using "divide and conquer" syntax-debugging: cut the bottom half of the instruction block, see if the assembler error persists, etc.

Lastly, note the liberal use of inline C-syntax comments: most editors capable of C-syntax highlighting will also color-highlight the asm, instructions same color as for a C string, comments whatever color those are set to. *Very* nice in terms of aiding debug and readability.

Last fiddled with by ewmayer on 2020-07-23 at 21:27
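For anyone wanting to check the index bookkeeping in the asm comments without AVX-512 hardware, here is a rough scalar C model of the same three stages. It is only a sketch: `vec8`, `unpck`, `shuff64x2`, `transpose8x8` and `transpose_ok` are names invented for this illustration, the registers are numbered straightforwardly instead of following the in-place register reuse of the asm, and it models only the data movement, not the timing.

```c
#include <assert.h>
#include <string.h>

typedef struct { double d[8]; } vec8;   /* stand-in for one 512-bit zmm register */

/* Model of VUNPCKLPD (hi=0) / VUNPCKHPD (hi=1): within each 128-bit lane,
   pair the low (or high) double of a with the low (or high) double of b. */
static vec8 unpck(vec8 a, vec8 b, int hi)
{
    vec8 r;
    for (int lane = 0; lane < 4; lane++) {
        r.d[2*lane]     = a.d[2*lane + hi];
        r.d[2*lane + 1] = b.d[2*lane + hi];
    }
    return r;
}

/* Model of VSHUFF64x2: imm8 holds four 2-bit lane selectors, read right-to-left.
   Output lanes 0,1 are picked from a, output lanes 2,3 from b. */
static vec8 shuff64x2(vec8 a, vec8 b, unsigned imm8)
{
    vec8 r;
    assert(imm8 < 256);
    for (int f = 0; f < 4; f++) {
        unsigned sel = (imm8 >> (2*f)) & 3u;
        const vec8 *src = (f < 2) ? &a : &b;
        r.d[2*f]     = src->d[2*sel];
        r.d[2*f + 1] = src->d[2*sel + 1];
    }
    return r;
}

/* In-place 8x8 transpose via the same three stages as the asm above. */
void transpose8x8(double m[8][8])
{
    vec8 r[8], s[8], t[8], out[8];
    memcpy(r, m, sizeof r);
    /* [1] UNPCKLPD/UNPCKHPD on row pairs: transposed 2x2 submatrices */
    for (int i = 0; i < 8; i += 2) {
        s[i]     = unpck(r[i], r[i + 1], 0);
        s[i + 1] = unpck(r[i], r[i + 1], 1);
    }
    /* [2] 1st shuffle layer: 136 = (2,0,2,0) keeps lanes 0,2; 221 = (3,1,3,1) keeps lanes 1,3 */
    for (int g = 0; g < 8; g += 4)
        for (int k = 0; k < 2; k++) {
            t[g + k]     = shuff64x2(s[g + k], s[g + k + 2], 136);
            t[g + k + 2] = shuff64x2(s[g + k], s[g + k + 2], 221);
        }
    /* [3] 2nd shuffle layer: combine top and bottom halves into full columns */
    for (int k = 0; k < 4; k++) {
        out[k]     = shuff64x2(t[k], t[k + 4], 136);
        out[k + 4] = shuff64x2(t[k], t[k + 4], 221);
    }
    memcpy(m, out, sizeof out);
}

/* Self-check: entry [i][j] holds 8*i+j, i.e. the octal digit pair "ij". */
int transpose_ok(void)
{
    double m[8][8];
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            m[i][j] = 8*i + j;
    transpose8x8(m);
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            if (m[i][j] != 8*j + i) return 0;
    return 1;
}
```

Calling `transpose_ok()` from a test harness returns 1 if the three stages together produce the transpose. Printing the intermediate `s[]` and `t[]` arrays in octal is a cheap way to compare against the per-register comments when hunting an index typo.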