2011-06-09, 13:16  #23  
Nov 2010
Germany
3·199 Posts 
Quote:
Quote:
Code:
void mul_24_48(uint *res_hi, uint *res_lo, uint a, uint b)
/* res_hi*(2^24) + res_lo = a * b */
{
    *res_lo = mul24(a,b);
    *res_hi = (mul_hi(a,b) << 8) | (*res_lo >> 24);
    *res_lo &= 0xFFFFFF;
}
Code:
mul_24_48(&(a.d1),&(a.d0),k_tab[tid],4620); // NUM_CLASSES

becomes

19 z: MUL_UINT24   ____, R1.x,    (0x0000120C, 6.473998905e-42f).x
   t: MULHI_UINT   ____, R1.x,    (0x0000120C, 6.473998905e-42f).x
20 x: LSHL         ____, PS19,    (0x00000008, 1.121038771e-44f).x
   y: LSHR         ____, PV19.z,  (0x00000018, 3.363116314e-44f).y
   z: AND_INT      ____, PV19.z,  (0x00FFFFFF, 2.350988562e-38f).z
   w: ADD_INT      ____, KC0[1].x, PV19.z
21 x: ADD_INT      ____, KC0[1].x, PV20.z
   z: AND_INT      R4.z, PV20.w,  (0x00FFFFFF, 2.350988562e-38f).x
   w: OR_INT       ____, PV20.y,  PV20.x
Quote:
<any 64bit> >> 32    = 0            (shift count of type int)
<any 64bit> >> 32ULL = <upper half> (shift count of type long)

Took me a while to find that ... but you may have meant something different.

And to those still waiting for the real thing: it comes now. I just added Oliver's pre-release signal handler, which is really required for OpenCL. Otherwise the graphics driver crashes Windows 7 when there are still kernels in the queue but the process is already gone. And yes, it may crash your machine too, so do not use it on productive machines yet. Just remember: this is a first test version. Though performance figures may be interesting, I'm more interested to see whether it runs at all, stably, and with or without odd side effects. Do not attempt to upload "no factor" results to primenet yet. 

2011-06-09, 14:10  #24  
Sep 2006
The Netherlands
674_{10} Posts 
Quote:
So this instruction eats 5 cycles if you look at it from that viewpoint (or, seen from the 5 execution units, it eats 1 cycle each). The fast instruction is MULHI_UINT24, which runs at the full speed of 1351 Gflop (on the HD 6970 series, and I suppose on the 5000 series as well). As you see, it doesn't generate that one. So this is a piece of turtle-slow assembler code for several reasons: not just because it's the T unit, there are other issues as well. Quote:


2011-06-09, 14:20  #25 
Sep 2006
The Netherlands
2·337 Posts 
Code:
*res_lo = mul24(a,b);
*res_hi = (mul_hi(a,b) << 8) | (*res_lo >> 24);
*res_lo &= 0xFFFFFF;

Let me explain. You generate res_lo using a mul24. This is a fast instruction. The problem comes after this. Directly after this. The GPUs are not out-of-order processors; they focus upon throughput. So I hope I don't formulate it too badly if I say that after the execution unit issues the mul24, it takes a cycle or 8 for the result to become available. However, already the very next statement needs that result shifted right by 24. That's too quick. The GPU hides a little bit of the latency by running multiple threads, but that doesn't cover everything.

The mul_hi result gets shifted left by 8. Now a good compiler would hide this latency, yet that's not what you can expect here. Directly afterwards it needs the result for an OR with the shifted part of res_lo. The mul_hi itself eats 5 cycles on the 5000 series, or rather, it occupies all units at the same time; then another 8 cycles later its result is available to be ORed with the other half, which we can expect to be available at the same time, those 8 cycles later. So in reality we have a few operations 'in flight' at the same time here: the shift-left after the mul_hi and the shift-right are 'in flight' together, yet not yet available, as that eats another 8 cycles. So directly issuing an OR then is not so clever. The AND with res_lo we can safely assume to be available by then, of course, as we already needed it the line above.

So from a programming viewpoint this is utmost beginners' code, for 2 reasons.

Last fiddled with by diep on 2011-06-09 at 14:39 
2011-06-09, 14:45  #26  
Sep 2006
The Netherlands
2×337 Posts 
Quote:
My email: diep@xs4all.nl 

2011-06-10, 00:32  #27  
Sep 2006
The Netherlands
2·337 Posts 
Quote:
The square_72_144 multiplication basically seems like an optimized version of what Chrisjp posted over here. So with near 100% certainty your code is faster, especially if we see how much you tested it to be. So I wonder why you posted this comment? Can you explain what you wrote over here? Thanks, Vincent Quote:


2011-06-10, 11:19  #28  
Nov 2010
Germany
3×199 Posts 
Quote:
Mainly I'm looking for a faster modulus, because that is where currently ~3/4 of the TF effort is spent (squaring is less than 20%). There must be some reason why Chrisjp had better performance figures. I want to see why (probably due to Barrett), and take over what makes sense. And having a fast 84-bit kernel is of course another advantage over 72 bits. Last fiddled with by Bdot on 2011-06-10 at 11:40 Reason: typo 

2011-06-13, 07:33  #29 
Jun 2010
21_{8} Posts 
My primary machine here has an unlocked (6970-equivalent) HD 6950 2GB. I would be willing to test this stuff.

2011-06-13, 19:32  #30  
Nov 2010
Germany
3×199 Posts 
Quote:
I'm not sure I will send out version 0.04 again, as I just built a vectorized version of the 71-bit kernel. It may take another few days to be finalized. My first tests showed a speedup of 30-40%. You'll probably get this one when it's ready. 

2011-06-18, 23:12  #31 
Nov 2010
Germany
597_{10} Posts 
I just sent out mfakto version 0.05 to a few interested people.
Main highlight is the use of vector data types, which on my GPU raises throughput from 60M/s to 100M/s when using multiple instances, and from 36M/s to 88M/s for a single instance (on a HD 5750). 
2011-06-19, 03:46  #32 
Jun 2010
17 Posts 
I tested this with my 6970. Single instance.
I adjusted SievePrimes to max performance, 55k. This resulted in a 90M/s rate on an M57, 68 to 69 bits. Then I played with the other settings.

Vector, best to worst, was 4, 8, 2, 1, 16. 16 had a HUGE performance drop, down to 14M/s. I could only just tell 4 was better than 8; any closer and it would probably be within the normal range it runs. 2 and 1 both ran in the 7xM/s range. Best gridsize was 4, then 3, 2, 1, 0.

Curiously, the GPU never kicked into high-performance mode; it stayed at 500MHz. ??

Last fiddled with by Colt45ws on 2011-06-19 at 04:10 
2011-06-19, 10:35  #33 
Nov 2010
Germany
597_{10} Posts 
Thanks for testing so quickly!
I guess SievePrimes is at 5k, not 55k? Are the gridsize differences big enough to try even bigger ones? For my GPU the UI was not usable anymore with the next bigger one ... but for fast GPUs I could check whether bigger grids would always fit ... Did you monitor the CPU load? On my box I see it never goes really high (max 50%). I'm afraid that, for now, the only way to fully utilize such a capable GPU is many instances of mfakto (working on different exponents), which on my machine has the nasty issue of sometimes freezing the machine ... working on it. Maybe it's time for a multi-threaded siever ... until the GPU siever comes. 