![]() |
|
|
#1 |
|
"Fredrik Jansson"
Jan 2016
Amsterdam
816 Posts |
Hi All,
I've been poking around in mfaktc, and think I found a potentially unsafe construction. myintrinsics.h defines a bunch of macros like __subc_cc, each containing a single inline asm instruction. These macros are then used in sequence to handle multi-word numbers. Now, is the carry guaranteed to be preserved across multiple asm statements? Obviously, since the program works, it tends to be, but if this is not guaranteed, it could lead to bugs in the future. For an experiment I tried mul_96_192, defined in tf_96bit_base_math.cu but commented out. It turned out that this function sometimes return the wrong result, it seems to miss a carry from d0 to d1. When I rewrote the function as a single inline asm block, with the same sequence of operations, this problem disappeared, which leads me to wonder if the separate asm statements are to blame. Then some background: I'm new to the forum but have been in the GIMPS project for over 10 years. I started looking at mfaktc just out of curiosity. I don't know CUDA from before. I implemented Montgomery's modular multiplication for the modular exponentiation in mfaktc. It is working (passes -st2), but slower than the existing code (barrett76_mul32_gs in particular) by a factor 2...3. I'm willing to share this code if anyone is interested. Cheers, Fredrik |
|
|
|
|
|
#2 |
|
"Oliver"
Mar 2005
Germany
11·101 Posts |
Hi Fredrik,
which CUDA version (and GPU) are you using? Oliver |
|
|
|
|
|
#3 |
|
"Fredrik Jansson"
Jan 2016
Amsterdam
810 Posts |
from mfaktc's diagnostics:
CUDA version info binary compiled for CUDA 6.50 CUDA runtime version 6.50 CUDA driver version 7.50 CUDA device info name GeForce GTX 960 compute capability 5.2 I'm using Arch Linux. I installed cuda 6.5 manually due to the problems with 7.0 and 7.5 with mfaktc. |
|
|
|
|
|
#4 | |
|
"Oliver"
Mar 2005
Germany
11·101 Posts |
OK, that's what I wanted to hear. You're right, CUDA 7.x is completly broken for mfaktc (on Maxwell architecture).
Back to the other question: Quote:
Code:
__umul32 (a.d0, b.d0); And the main question, well, I don't exactly know if this is OK or not but at least many people in different codes are using it this way, arent't they? So a little bit uncomfortable but "seems to be OK". Oliver |
|
|
|
|
|
|
#5 | ||
|
"Fredrik Jansson"
Jan 2016
Amsterdam
10002 Posts |
Quote:
I will try to gather some test cases. I'm not sure at the moment if the errors were deterministic or not. Quote:
|
||
|
|
|
|
|
#6 |
|
"Oliver"
Mar 2005
Germany
100010101112 Posts |
Hi Fredrik,
maybe option d) is the best: remove(replace) the statements in question. In mfaktc 0.22 (not finished/released yet, no time frame) I have removed alot of legacy code (no CC 1.x, CUDA >= 6.5 requiered) which reduces the number of __add and __sub calls. ![]() Oliver Last fiddled with by TheJudger on 2016-01-14 at 20:37 |
|
|
|
|
|
#7 |
|
Tribal Bullet
Oct 2004
3,541 Posts |
In the gcc world, we deal with this by putting the condition codes in the clobber list for the asm statement; this would force gcc to not reorder asm statements that clobbered the condition codes.
The PTX instruction set does have a carry, but you have to find out what nvcc's inline asm syntax calls it. |
|
|
|
|
|
#8 |
|
"Fredrik Jansson"
Jan 2016
Amsterdam
23 Posts |
This document, section 1.2.3 states "The compiler assumes that an asm() statement has no side effects except to change the output operands. To ensure that the asm is not deleted or moved, you should use the volatile keyword"
I didn't see any mention of how to add carry to the output or clobber list, but I found the volatile keyword in use in a situation similar to ours. Option d sounds great, but I'd also like to understand how it all is working :) |
|
|
|
|
|
#9 |
|
Tribal Bullet
Oct 2004
3,541 Posts |
Making the entire asm volatile should work and probably wouldn't have performance implications; remember that the generated PTX then is fed to Nvidia's back-end compiler which optimizes the whole thing all over again and emits GPU-specific instructions. In fact the inline asm I've had to write generated horrible PTX with an enormous number of useless register moves, all of which get stripped away in the generated cubin.
|
|
|
|
|
|
#10 | |
|
∂2ω=0
Sep 2002
República de California
103×113 Posts |
Quote:
Code:
#define MUL_LOHI64(_x,_y,_lo,_hi)\
{\
union {\
uint32 u32[2]; /* NVCC is little-endian, LS 32 bits in [0]. */\
uint64 u64;\
} _ux,_uy;\
\
uint32 _al,_ah,_bl,_bh;\
uint32 _r0,_r1,_r2,_r3;\
\
_ux.u64 = _x;\
_al = _ux.u32[0]; /* x_lo */\
_ah = _ux.u32[1]; /* x_hi */\
_uy.u64 = _y;\
_bl = _uy.u32[0]; /* y_lo */\
_bh = _uy.u32[1]; /* y_hi */\
/* Old 1-line-at-a-time inline-ASM version: */\
/* MULL32(_al,_bl,_r0); // r0 =(_al*_bl).lo , no carry-out */\
/* MULH32(_al,_bl,_r1); // r1 =(_al*_bl).hi , no carry-out */\
/* _r1 = __umad32_cc(_ah,_bl,_r1); // r1+=(_ah*_bl).lo , may carry-out */\
/* _r2 = __umad32hic(_ah,_bl, 0); // r2 =(_ah*_bl).hi+cy, no carry-out */\
/* _r1 = __umad32_cc(_al,_bh,_r1); // r1+=(_al*_bh).lo , may carry-out */\
/* _r2 = __umad32hic_cc(_al,_bh,_r2); // r2+=(_al*_bh).hi+cy, may carry-out */\
/* _r3 = __addc( 0, 0); // r3 = cy, no carry-out */\
/* _r2 = __umad32_cc(_ah,_bh,_r2); // r2+=(_ah*_bh).lo , may carry-out */\
/* _r3 = __umad32hic(_ah,_bh,_r3); // r3+=(_ah*_bh).hi+cy */\
/* Using straight PTX inline-ASM: */\
asm volatile (\
"mul.lo.u32 %0,%4,%6; \n\t"/* r0 =(r4*r6).lo , no carry-out */\
"mul.hi.u32 %1,%4,%6; \n\t"/* r1 =(r4*r6).hi , no carry-out */\
"mad.lo.cc.u32 %1,%5,%6,%1; \n\t"/* r1+=(r5*r6).lo , may carry-out */\
"madc.hi.u32 %2,%5,%6, 0; \n\t"/* r2 =(r5*r6).hi+cy, no carry-out */\
"mad.lo.cc.u32 %1,%4,%7,%1; \n\t"/* r1+=(r4*r7).lo , may carry-out */\
"madc.hi.cc.u32 %2,%4,%7,%2; \n\t"/* r2+=(r4*r7).hi+cy, may carry-out */\
"addc.u32 %3, 0, 0; \n\t"/* r3 = cy, no carry-out */\
"mad.lo.cc.u32 %2,%5,%7,%2; \n\t"/* r2+=(r5*r7).lo , may carry-out */\
"madc.hi.u32 %3,%5,%7,%3; \n\t"/* r3+=(r5*r7).hi+cy */\
: "=r" (_r0) , "=r" (_r1) "=r" (_r2) , "=r" (_r3) : "r" (_al), "r" (_ah) , "r" (_bl), "r" (_bh));\
\
_ux.u32[0] = _r0;\
_ux.u32[1] = _r1;\
_lo= _ux.u64;\
_uy.u32[0] = _r2;\
_uy.u32[1] = _r3;\
_hi= _uy.u64;\
}
|
|
|
|
|
![]() |
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Inline code and YouTube tags | Xyzzy | Forum Feedback | 18 | 2016-01-13 20:53 |
| Unsafe verification? | CuriousKit | PrimeNet | 13 | 2015-06-21 14:49 |
| Why does GCC inline asm need a clobber list? | ewmayer | Programming | 60 | 2011-12-31 21:52 |
| mfaktc | tichy | GPU Computing | 4 | 2010-12-03 21:51 |
| linux 25.9 client, unsafe thread output? | xorbe | Software | 0 | 2009-04-03 04:21 |