mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Old 2011-06-09, 13:16   #23
Bdot
 
Nov 2010
Germany

3·199 Posts

Quote:
Originally Posted by Ken_g6 View Post
Code:
context = clCreateContextFromType(cps, CL_DEVICE_TYPE_GPU, NULL, NULL, &status);
True, I do that as well, but I fall back to the CPU on purpose if no suitable GPU is found. Plus, I added (ported) support for the -d option to specifically select any device (including the CPU).

Quote:
Originally Posted by Ken_g6 View Post
Meanwhile, having seen Oliver's source code, I'd like to know how you handle 24*24=48-bit integer multiplies quickly in OpenCL.
Used within my kernel:

Code:
void mul_24_48(uint *res_hi, uint *res_lo, uint a, uint b)
/* res_hi*(2^24) + res_lo = a * b */
{ 
  *res_lo  = mul24(a,b);
  *res_hi  = (mul_hi(a,b) << 8) | (*res_lo >> 24); 
  *res_lo &= 0xFFFFFF;
}
The assembly for this already looks quite nicely packed, as the compiler optimizes (and inlines!) the whole function down to 3 cycles (the w: step of cycle 20 already belongs to the next instruction, while 21 w: still belongs to this function):

Code:
  mul_24_48(&(a.d1),&(a.d0),k_tab[tid],4620); // NUM_CLASSES

becomes

     19  z: MUL_UINT24  ____,  R1.x,  (0x0000120C, 6.473998905e-42f).x      
         t: MULHI_UINT  ____,  R1.x,  (0x0000120C, 6.473998905e-42f).x      
     20  x: LSHL        ____,  PS19,  (0x00000008, 1.121038771e-44f).x      
         y: LSHR        ____,  PV19.z,  (0x00000018, 3.363116314e-44f).y      
         z: AND_INT     ____,  PV19.z,  (0x00FFFFFF, 2.350988562e-38f).z      
         w: ADD_INT     ____,  KC0[1].x,  PV19.z  
     21  x: ADD_INT     ____,  KC0[1].x,  PV20.z      
         z: AND_INT     R4.z,  PV20.w,  (0x00FFFFFF, 2.350988562e-38f).x      
         w: OR_INT      ____,  PV20.y,  PV20.x
Quote:
Originally Posted by Ken_g6 View Post
He used assembly, and I found assembly to be useful in my CUDA app as well; but I just couldn't find an equivalent in OpenCL. That, and I couldn't make shifting the 64-bit integers right 32-bits be parsed as a 0-cycles-required no-op (which it ought to be.)
Yes, assembly is missing for the 32-bit operations (it does not hurt that much for the 24-bit ops). Shifting 64-bit values by 32 bits works if you make sure that all values are 64-bit, and that includes the "32":

<any 64-bit> >> 32 = 0 (of type int)
<any 64-bit> >> 32ULL = <upper half> (of type long)

It took me a while to find that ... but you may have meant something different.


And to those still waiting for the real thing: it's coming now. I just added Oliver's pre-release signal handler, which is really required for OpenCL; otherwise the graphics driver crashes Windows 7 when there are still kernels in the queue but the process is already gone.

And yes, it may crash your machine too; do not use it on production machines yet.

Just remember: this is a first test version. Though performance figures may be interesting, I'm more interested to see whether it runs at all, whether it is stable, and whether there are odd side effects. Do not attempt to upload "no factor" results to PrimeNet yet.

Bdot is offline   Reply With Quote
Old 2011-06-09, 14:10   #24
diep
 
Sep 2006
The Netherlands

674₁₀ Posts

Quote:
Originally Posted by Bdot View Post
Used within my kernel:

Code:
void mul_24_48(uint *res_hi, uint *res_lo, uint a, uint b)
/* res_hi*(2^24) + res_lo = a * b */
{ 
  *res_lo  = mul24(a,b);
  *res_hi  = (mul_hi(a,b) << 8) | (*res_lo >> 24); 
  *res_lo &= 0xFFFFFF;
}
The assembly for this already looks quite nicely packed, as the compiler optimizes (and inlines!) the whole function down to 3 cycles (the w: step of cycle 20 already belongs to the next instruction, while 21 w: still belongs to this function):

Code:
  mul_24_48(&(a.d1),&(a.d0),k_tab[tid],4620); // NUM_CLASSES

becomes

     19  z: MUL_UINT24  ____,  R1.x,  (0x0000120C, 6.473998905e-42f).x      
         t: MULHI_UINT  ____,  R1.x,  (0x0000120C, 6.473998905e-42f).x
MULHI_UINT, as you can see, goes to the T unit, and that means it ties up work from all the other units. In your case that means 5 units (on the AMD HD5000 series); on the HD6000 series it requires 4 units.

So this instruction eats 5 cycles if you look at it from that viewpoint (or, in fact, it eats 1 cycle on each of 5 execution units).

The fast instruction is MULHI_UINT24, which runs at the full rate of 1351 Gflops (on the HD6970 series, and I suppose on the 5000 series as well).

As you can see, it doesn't generate that one.

So this is a piece of turtle-slow assembler code for several reasons. It's not just the T unit; there are other issues as well.


diep is offline   Reply With Quote
Old 2011-06-09, 14:20   #25
diep
 
Sep 2006
The Netherlands

2·337 Posts

Code:
  *res_lo  = mul24(a,b);
  *res_hi  = (mul_hi(a,b) << 8) | (*res_lo >> 24); 
  *res_lo &= 0xFFFFFF;
Even if AMD adds a function mul_hi16 (or something) that generates the top 16 bits quickly using MULHI_UINT24, this piece of programming would still be turtle-slow for GPUs.

Let me explain.

You generate the result of res_lo using a mul24. This is a fast instruction.

The problem comes after this, directly after this. GPUs are not out-of-order processors; they focus on throughput. So I hope I'm not putting it too badly if I say that, from the execution unit doing the mul24, it takes a cycle or 8 before the result is available.

However, you need that result shifted right by 24 directly afterwards. That's too quick.

The GPU already hides a little of the latency by running multiple threads, but that doesn't cover everything.

The mul_hi result gets shifted left by 8. Now, a good compiler would hide this latency, yet that's not what you can expect here. Directly after that, the result is needed for an OR with the shifted part of res_lo.

So the mul_hi itself eats 5 cycles on the 5000 series (or rather, it requires all units at the same time), and then another 8 cycles later its result is available to be OR-ed with the other half, which we can expect to become available at the same time, those 8 cycles later.

So in reality we have a few operations 'in flight' at the same time here: the shift-left after the mul_hi and the shift-right are 'in flight' simultaneously.

Yet they are not yet available, as that eats another 8 cycles. So directly issuing an OR then is not so clever.

The AND with res_lo we can safely assume to be available by then, of course, as we already needed it on the line above.

So from a programming viewpoint this is absolute beginner's code, for two reasons.

Last fiddled with by diep on 2011-06-09 at 14:39
diep is offline   Reply With Quote
Old 2011-06-09, 14:45   #26
diep
 
Sep 2006
The Netherlands

2×337 Posts

Quote:
Originally Posted by Bdot View Post

Just remember: this is a first test version. Though performance figures may be interesting, I'm more interested to see if it runs at all, stable and with or without odd side-effects. Do not attempt to upload "no factor" results to primenet yet.

Where is that code?

My email : diep@xs4all.nl
diep is offline   Reply With Quote
Old 2011-06-10, 00:32   #27
diep
 
Sep 2006
The Netherlands

2·337 Posts

Quote:
Originally Posted by Bdot View Post
This is an early announcement that I have ported parts of Oliver's (aka TheJudger) mfaktc to OpenCL.

Currently, I have only the Win64 binary, running an adapted version of Oliver's 71-bit mul24 kernel. It is not yet optimized and does not yet make use of the vectors available in OpenCL. A very simple (and slow) 95-bit kernel is there as well, so that the complete selftest finished successfully on my box.

On my HD5750 it runs at about 60M/s in the 50M exponent range - certainly a lot of headroom.

As I have only this one ATI GPU, I wanted to see if anyone would be willing to help test on different hardware.

Current requirements: OpenCL 1.1 (i.e. only ATI GPUs), Windows 64-bit.

There's still a lot of work before I can eventually release this to the public, but I'm optimistic about the summer.

Next steps (unordered):
Bdot, I'm looking at the code you shipped me and at your kernel.
The square_72_144 multiplication basically seems like an optimized
version of what Chrisjp posted over here.

So with near-100% certainty your code is faster, especially given
how much you have tested it.

So I wonder why you posted this comment?

Can you explain what you wrote over here?

Thanks,
Vincent

Quote:
  • fast 92/95-bit kernels (barrett)
  • use of vector data types
  • various other performance/optimization tests&enhancements
  • of course, bug fixes
  • docs and licensing stuff
  • clarify if/how this new kid may contribute to primenet
Bdot
diep is offline   Reply With Quote
Old 2011-06-10, 11:19   #28
Bdot
 
Nov 2010
Germany

3×199 Posts

Quote:
Originally Posted by diep View Post
Bdot, I'm looking at the code you shipped me and at your kernel.
The square_72_144 multiplication basically seems like an optimized
version of what Chrisjp posted over here.

So with near-100% certainty your code is faster, especially given
how much you have tested it.

So I wonder why you posted this comment?

Can you explain what you wrote over here?

Thanks,
Vincent
First of all, I did not even check out Chrisjp's code in detail, so I did not know whether it was faster or not.
Mainly I'm looking for a faster modulus, because that is where ~3/4 of the TF effort is currently spent (squaring is less than 20%). There must be some reason why Chrisjp got better performance figures. I want to see why (probably the Barrett reduction), and take over what makes sense. And having a fast 84-bit kernel is of course another advantage over 72 bits.

Last fiddled with by Bdot on 2011-06-10 at 11:40 Reason: typo
Bdot is offline   Reply With Quote
Old 2011-06-13, 07:33   #29
Colt45ws
 
Jun 2010

21₈ Posts

My primary machine here has an unlocked (6970-equivalent) HD6950 2GB. I would be willing to test this stuff.
Colt45ws is offline   Reply With Quote
Old 2011-06-13, 19:32   #30
Bdot
 
Nov 2010
Germany

3×199 Posts

Quote:
Originally Posted by Colt45ws View Post
My primary machine here has an unlocked (6970-equivalent) HD6950 2GB. I would be willing to test this stuff.
Send me a PM with your email, and you should have it. I have received test results from only one tester so far ...

I'm not sure I will send out version 0.04 again, as I just built a vectorized version of the 71-bit kernel. It may take another few days to finalize. My first tests showed a speedup of 30-40%. You'll probably get that one when it's ready.
Bdot is offline   Reply With Quote
Old 2011-06-18, 23:12   #31
Bdot
 
Nov 2010
Germany

597₁₀ Posts

I just sent out mfakto version 0.05 to a few interested people.

The main highlight is the use of vector data types, which on my GPU (an HD 5750) raises throughput from 60M/s to 100M/s when running multiple instances, and from 36M/s to 88M/s for a single instance.
Bdot is offline   Reply With Quote
Old 2011-06-19, 03:46   #32
Colt45ws
 
Jun 2010

17 Posts

I tested this with my 6970, single instance.
I adjusted SievePrimes for max performance, 55k.
This resulted in a 90M/s rate on an M57, from 68 to 69 bits.
Then I played with the other settings.
Vector size, best to worst, was 4, 8, 2, 1, 16.
16 had a HUGE performance drop, down to 14M/s.
I could only just tell 4 was better than 8; any closer and it would probably be within its normal run-to-run range. 2 and 1 both ran in the 7xM/s range.
Best GridSize was 4, then 3, 2, 1, 0.
Curiously, the GPU never kicked into high-performance mode; it stayed at 500MHz.
??

Last fiddled with by Colt45ws on 2011-06-19 at 04:10
Colt45ws is offline   Reply With Quote
Old 2011-06-19, 10:35   #33
Bdot
 
Nov 2010
Germany

597₁₀ Posts

Thanks for testing so quickly!

I guess SievePrimes is at 5k, not 55k?

Are the GridSize differences big enough to try even bigger ones? For my GPU the UI was no longer usable with the next bigger one ... but for fast GPUs I could check whether bigger grids would always fit ...

Did you monitor the CPU load? On my box I see it never goes really high (max 50%).

I'm afraid that, for now, the only way to fully utilize such a capable GPU is to run many instances of mfakto (working on different exponents), which on my machine has the nasty issue of sometimes freezing the machine ... working on it.

Maybe it's time for a multi-threaded siever ... until the GPU siever comes.
Bdot is offline   Reply With Quote