20110623, 17:34  #45  
Nov 2010
Germany
3×199 Posts 
Quote:
I already have in my mind to use the same code for a vector of data (just replace all uint by uint4, for instance), and then the result of a comparison is 0 or 1 (all bits set). What I was really hoping for is something that propagates the carry with a little less logic, as that will really slow down additions and subtractions ... 

20110623, 22:15  #46 
Jan 2005
Caught in a sieve
2×197 Posts 
Would the "ulong" data type (a 64bit unsigned integer) help?

20110624, 00:18  #47 
"Ben"
Feb 2007
2·3·541 Posts 
If you cast the arguments to uint64 then the borrow can be computed by shifting...
tmp = (uint64)a  (uint64)b; sub = (uint32)tmp; borrow = tmp >> 63; 
20110624, 08:41  #48  
Jan 2008
France
17×31 Posts 
Quote:
 do as many computations as you can without taking care of carries and do a specific pass for handling them; of course that could lead to a big slowdown if you have to reload values and memory is the limiting factor  another way to compute carries is bit arithmetic; let's say you want the carry from a  b Code:
res = a  b; carry = ((b & ~a)  (res & ~a)  (b & res)) >> (bitsize1); 

20110624, 16:27  #49 
Nov 2010
Germany
3×199 Posts 
Thanks for all your suggestions so far. I'll definitely try the (ulong) conversion and compare it to my current bunch of logic. I still welcome suggestions ;)
Here's the task again: Inputs: uint a, uint b, uint carry (which is the borrow of the previous (lower) 32 bits) Output: uint res=ab, carry (which should be the borrow for the next (higher) 32 bits) currently this basically looks like Code:
res = a  b  carry; carry = (res > a)  ((res == a) && carry); Code:
carry = (res >= a)  carry; We have available all logical operators, +, , and the OpenCL Integer Functions. But maybe a total of 6 operations for one 32bit subtraction with borrow is already the minimum for OpenCL? I did not quite understand how evaluating carries afterwards can save something. Access to the operands is no problem, it's all in registers. The bitwise operations lead to a total of 10 instructions (?) for one subtraction ... less likely to be an acceleration ;) 
20110624, 16:43  #50  
Jan 2008
France
17×31 Posts 
Quote:
The post pass carry evaluation for instance is very useful for vector like architectures. And it might be possible on some microarchitectures that the result of a comparison blocks a pipeline while logical operations don't, thus making the logical variant faster even though it uses more operations. But then again I don't know anything about your target and OpenGL, so I'm probably completely off track 

20110703, 15:42  #51 
Jun 2011
131 Posts 

20110703, 18:52  #52 
Jan 2005
Caught in a sieve
110001010_{2} Posts 
Bdot, what did you wind up finding did the fastest math, 32bit numbers or 24bit numbers? And what form of math? Or are you still working on it?

20110704, 11:35  #53  
Nov 2010
Germany
3×199 Posts 
I played around with this a little ...
Quote:
Quote:
Here's the current performance analysis of the 79bit barrett kernel: Code:
Name Throughput Radeon HD 5870 135 M Threads\Sec Radeon HD 6970 118 M Threads\Sec Radeon HD 6870 100 M Threads\Sec Radeon HD 5770 68 M Threads\Sec Radeon HD 4890 66 M Threads\Sec Radeon HD 4870 58 M Threads\Sec FireStream 9270 58 M Threads\Sec FireStream 9250 48 M Threads\Sec Radeon HD 4770 46 M Threads\Sec Radeon HD 6670 38 M Threads\Sec Radeon HD 4670 35 M Threads\Sec Radeon HD 5670 23 M Threads\Sec Radeon HD 6450 10 M Threads\Sec Radeon HD 4550 7 M Threads\Sec Radeon HD 5450 6 M Threads\Sec Unfortunately I right now have a problem that some of the kernel's code is skipped unless I enable kernel tracing. I need that fixed before I can get you another version for testing. 

20110704, 12:54  #54  
Nov 2010
Germany
3·199 Posts 
Quote:
76 M/s mfakto_cl_71_4: 3x24bit, 4vectored kernel 68 M/s mfakto_cl_barrett79: 2.5x32bit unvectored barrett kernel 53 M/s mfakto_cl_barrett92: 3x32bit unvectored barrett kernel 44 M/s mfakto_cl_71: 3x24bit unvectored kernel The barrett kernels currently need to use a nasty workaround for a bug of the compiler, costing ~ 3%. I'm still working on vectorizing the barretts, a similar speedup as for the 24bit kernel can be expected, so that the 32bit based kernels will be a lot faster than 24bit. A 24bit based barrett kernel that was suggested on the forum runs at 75M/s, but as it is using vectors for the representation of the factors, it cannot (easily) be enhanced to run on a vector of candidates. If that were easily possible, then the 24bit kernel might run for the crown again. But it will not be far ahead of the 32bit kernel. And the 32bit one has the advantage of running FCs up to 79 bit instead of 71 bit. 

20110705, 08:12  #55 
Nov 2010
Germany
3×199 Posts 
Oh boy, I finally found why the barrett kernels sometimes behaved odd ...
OpenCL does bitshifts only up to the number of bits in the target, and for that it only evaluates the necessary amount of bits of the operand. So for a bitshift a >> b with 32bitvalues, only the lowest 5 bits of b are used, the rest is ignored ... Therefore the code that used bitshifts of 32 or more positions to implicitely zero the target did not deliver the expected result. The fix for that goes without performancepenalty ... a little more testing and version 0.06 will be ready. 
Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
gpuOwL: an OpenCL program for Mersenne primality testing  preda  GPU Computing  2277  20200606 04:26 
mfaktc: a CUDA program for Mersenne prefactoring  TheJudger  GPU Computing  3271  20200519 22:42 
LL with OpenCL  msft  GPU Computing  433  20190623 21:11 
OpenCL for FPGAs  TObject  GPU Computing  2  20131012 21:09 
Program to TF Mersenne numbers with more than 1 sextillion digits?  Stargate38  Factoring  24  20111103 00:34 