mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Reply
 
Thread Tools
Old 2016-01-12, 13:45   #1
Fredrik
 
"Fredrik Jansson"
Jan 2016
Amsterdam

816 Posts
Default mfaktc: potentially unsafe inline asm

Hi All,

I've been poking around in mfaktc, and think I found a potentially unsafe construction. myintrinsics.h defines a bunch of macros like __subc_cc, each containing a single inline asm instruction. These macros are then used in sequence to handle multi-word numbers. Now, is the carry guaranteed to be preserved across multiple asm statements? Obviously, since the program works, it tends to be, but if this is not guaranteed, it could lead to bugs in the future.

For an experiment I tried mul_96_192, defined in tf_96bit_base_math.cu but commented out. It turned out that this function sometimes return the wrong result, it seems to miss a carry from d0 to d1. When I rewrote the function as a single inline asm block, with the same sequence of operations, this problem disappeared, which leads me to wonder if the separate asm statements are to blame.

Then some background: I'm new to the forum but have been in the GIMPS project for over 10 years. I started looking at mfaktc just out of curiosity. I don't know CUDA from before. I implemented Montgomery's modular multiplication for the modular exponentiation in mfaktc. It is working (passes -st2), but slower than the existing code (barrett76_mul32_gs in particular) by a factor 2...3. I'm willing to share this code if anyone is interested.

Cheers,
Fredrik
Fredrik is offline   Reply With Quote
Old 2016-01-13, 19:24   #2
TheJudger
 
TheJudger's Avatar
 
"Oliver"
Mar 2005
Germany

11·101 Posts
Default

Hi Fredrik,

which CUDA version (and GPU) are you using?

Oliver
TheJudger is offline   Reply With Quote
Old 2016-01-13, 19:58   #3
Fredrik
 
"Fredrik Jansson"
Jan 2016
Amsterdam

810 Posts
Default

Quote:
Originally Posted by TheJudger View Post

which CUDA version (and GPU) are you using?

Oliver
from mfaktc's diagnostics:
CUDA version info
binary compiled for CUDA 6.50
CUDA runtime version 6.50
CUDA driver version 7.50

CUDA device info
name GeForce GTX 960
compute capability 5.2

I'm using Arch Linux. I installed cuda 6.5 manually due to the problems with 7.0 and 7.5 with mfaktc.
Fredrik is offline   Reply With Quote
Old 2016-01-13, 21:26   #4
TheJudger
 
TheJudger's Avatar
 
"Oliver"
Mar 2005
Germany

11·101 Posts
Default

OK, that's what I wanted to hear. You're right, CUDA 7.x is completly broken for mfaktc (on Maxwell architecture).

Back to the other question:

Quote:
Originally Posted by Fredrik View Post
For an experiment I tried mul_96_192, defined in tf_96bit_base_math.cu but commented out. It turned out that this function sometimes return the wrong result, it seems to miss a carry from d0 to d1.
Please specify with more details. Actually there is no carry from d0 to d1 in mul_96_192(), because only the result of
Code:
__umul32  (a.d0, b.d0);
goes into d0. So there is no need to carry from d0 to d1.

And the main question, well, I don't exactly know if this is OK or not but at least many people in different codes are using it this way, arent't they?
So a little bit uncomfortable but "seems to be OK".

Oliver
TheJudger is offline   Reply With Quote
Old 2016-01-14, 20:16   #5
Fredrik
 
"Fredrik Jansson"
Jan 2016
Amsterdam

10002 Posts
Default

Quote:
Actually there is no carry from d0 to d1 in mul_96_192()
I agree, didn't realize this. Anyway, as I remember the tests I did, whenever the result of mul_96_192() was wrong, it turned out to be too small by 2**32. I thought this is a missing carry, but as you say that does not make sense.
I will try to gather some test cases. I'm not sure at the moment if the errors were deterministic or not.

Quote:
And the main question, well, I don't exactly know if this is OK or not but at least many people in different codes are using it this way, arent't they?
So a little bit uncomfortable but "seems to be OK".
My worry is that the compiler might think it is free to re-order the statements in separate asm{} blocks, and not take the carry dependencies between them into account. I haven't seen much CUDA code, so I don't know how this is usually done. Some ideas for resolving this one way or the other:

  • find documentation of what is guaranteed over the asm statements
  • ask in nvidia's forums or stack overflow. What little I've seen of PTX asm code on SO, it has been a single block.
  • catch the compiler actually generating wrong code
For the last approach, maybe there is a PTX disassembler I could use on the mul_96_192() routine?
Fredrik is offline   Reply With Quote
Old 2016-01-14, 20:36   #6
TheJudger
 
TheJudger's Avatar
 
"Oliver"
Mar 2005
Germany

100010101112 Posts
Default

Hi Fredrik,

maybe option d) is the best: remove(replace) the statements in question. In mfaktc 0.22 (not finished/released yet, no time frame) I have removed alot of legacy code (no CC 1.x, CUDA >= 6.5 requiered) which reduces the number of __add and __sub calls.

Oliver

Last fiddled with by TheJudger on 2016-01-14 at 20:37
TheJudger is offline   Reply With Quote
Old 2016-01-14, 21:17   #7
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

3,541 Posts
Default

In the gcc world, we deal with this by putting the condition codes in the clobber list for the asm statement; this would force gcc to not reorder asm statements that clobbered the condition codes.

The PTX instruction set does have a carry, but you have to find out what nvcc's inline asm syntax calls it.
jasonp is offline   Reply With Quote
Old 2016-01-14, 23:15   #8
Fredrik
 
"Fredrik Jansson"
Jan 2016
Amsterdam

23 Posts
Default

This document, section 1.2.3 states "The compiler assumes that an asm() statement has no side effects except to change the output operands. To ensure that the asm is not deleted or moved, you should use the volatile keyword"

I didn't see any mention of how to add carry to the output or clobber list, but I found the volatile keyword in use in a situation similar to ours.

Option d sounds great, but I'd also like to understand how it all is working :)
Fredrik is offline   Reply With Quote
Old 2016-01-15, 13:08   #9
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

3,541 Posts
Default

Making the entire asm volatile should work and probably wouldn't have performance implications; remember that the generated PTX then is fed to Nvidia's back-end compiler which optimizes the whole thing all over again and emits GPU-specific instructions. In fact the inline asm I've had to write generated horrible PTX with an enormous number of useless register moves, all of which get stripped away in the generated cubin.
jasonp is offline   Reply With Quote
Old 2016-01-15, 22:29   #10
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

103×113 Posts
Default

Quote:
Originally Posted by jasonp View Post
Making the entire asm volatile should work and probably wouldn't have performance implications; remember that the generated PTX then is fed to Nvidia's back-end compiler which optimizes the whole thing all over again and emits GPU-specific instructions. In fact the inline asm I've had to write generated horrible PTX with an enormous number of useless register moves, all of which get stripped away in the generated cubin.
Indeed ... 'volatile' is part of my standard inline-asm macro template ... foolish to leave it off unless you've got some very good reason not to use it. For example, the PTX-code version of my basic integer-lib 64x64->128-bit unsigned MUL macro:
Code:
#define MUL_LOHI64(_x,_y,_lo,_hi)\
{\
	union {\
		uint32 u32[2];	/* NVCC is little-endian, LS 32 bits in [0]. */\
		uint64 u64;\
	} _ux,_uy;\
\
	uint32 _al,_ah,_bl,_bh;\
	uint32 _r0,_r1,_r2,_r3;\
\
	_ux.u64 = _x;\
	_al = _ux.u32[0];	/* x_lo */\
	_ah = _ux.u32[1];	/* x_hi */\
	_uy.u64 = _y;\
	_bl = _uy.u32[0];	/* y_lo */\
	_bh = _uy.u32[1];	/* y_hi */\
/* Old 1-line-at-a-time inline-ASM version: */\
/*	MULL32(_al,_bl,_r0);				// r0 =(_al*_bl).lo   , no carry-out */\
/*	MULH32(_al,_bl,_r1);				// r1 =(_al*_bl).hi   , no carry-out */\
/*	_r1 = __umad32_cc(_ah,_bl,_r1);		// r1+=(_ah*_bl).lo   , may carry-out */\
/*	_r2 = __umad32hic(_ah,_bl,  0);		// r2 =(_ah*_bl).hi+cy, no carry-out */\
/*	_r1 = __umad32_cc(_al,_bh,_r1);		// r1+=(_al*_bh).lo   , may carry-out */\
/*	_r2 = __umad32hic_cc(_al,_bh,_r2);	// r2+=(_al*_bh).hi+cy, may carry-out */\
/*	_r3 = __addc(  0,  0);				// r3 = cy, no carry-out */\
/*	_r2 = __umad32_cc(_ah,_bh,_r2);		// r2+=(_ah*_bh).lo   , may carry-out */\
/*	_r3 = __umad32hic(_ah,_bh,_r3);		// r3+=(_ah*_bh).hi+cy */\
/* Using straight PTX inline-ASM: */\
asm volatile (\
	"mul.lo.u32		%0,%4,%6;		\n\t"/* r0 =(r4*r6).lo   , no carry-out */\
	"mul.hi.u32		%1,%4,%6;		\n\t"/* r1 =(r4*r6).hi   , no carry-out */\
	"mad.lo.cc.u32	%1,%5,%6,%1;	\n\t"/* r1+=(r5*r6).lo   , may carry-out */\
	"madc.hi.u32	%2,%5,%6, 0;	\n\t"/* r2 =(r5*r6).hi+cy, no carry-out */\
	"mad.lo.cc.u32	%1,%4,%7,%1;	\n\t"/* r1+=(r4*r7).lo   , may carry-out */\
	"madc.hi.cc.u32	%2,%4,%7,%2;	\n\t"/* r2+=(r4*r7).hi+cy, may carry-out */\
	"addc.u32		%3, 0, 0;		\n\t"/* r3 = cy, no carry-out */\
	"mad.lo.cc.u32	%2,%5,%7,%2;	\n\t"/* r2+=(r5*r7).lo   , may carry-out */\
	"madc.hi.u32	%3,%5,%7,%3;	\n\t"/* r3+=(r5*r7).hi+cy */\
  : "=r" (_r0) , "=r" (_r1) "=r" (_r2) , "=r" (_r3) : "r" (_al), "r" (_ah) , "r" (_bl), "r" (_bh));\
\
	_ux.u32[0] = _r0;\
	_ux.u32[1] = _r1;\
	_lo= _ux.u64;\
	_uy.u32[0] = _r2;\
	_uy.u32[1] = _r3;\
	_hi= _uy.u64;\
}
ewmayer is offline   Reply With Quote
Reply



Similar Threads
Thread Thread Starter Forum Replies Last Post
Inline code and YouTube tags Xyzzy Forum Feedback 18 2016-01-13 20:53
Unsafe verification? CuriousKit PrimeNet 13 2015-06-21 14:49
Why does GCC inline asm need a clobber list? ewmayer Programming 60 2011-12-31 21:52
mfaktc tichy GPU Computing 4 2010-12-03 21:51
linux 25.9 client, unsafe thread output? xorbe Software 0 2009-04-03 04:21

All times are UTC. The time now is 13:38.


Sat Jul 17 13:38:52 UTC 2021 up 50 days, 11:26, 1 user, load averages: 1.41, 1.57, 1.57

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.