mersenneforum.org mfakto: an OpenCL program for Mersenne prefactoring

2011-06-09, 13:16   #23
Bdot

Nov 2010
Germany

3×199 Posts

Quote:
 Originally Posted by Ken_g6 Code: context = clCreateContextFromType(cps, CL_DEVICE_TYPE_GPU, NULL, NULL, &status);
True, I do that as well, but fall back to the CPU on purpose if no suitable GPU is found. Plus, I added (ported) support for the -d option to specifically select any device (including CPU).

Quote:
 Originally Posted by Ken_g6 Meanwhile, having seen Oliver's source code, I'd like to know how you handle 24*24=48-bit integer multiplies quickly in OpenCL.
Used within my kernel:

Code:
void mul_24_48(uint *res_hi, uint *res_lo, uint a, uint b)
/* res_hi*(2^24) + res_lo = a * b */
{
*res_lo  = mul24(a,b);
*res_hi  = (mul_hi(a,b) << 8) | (*res_lo >> 24);
*res_lo &= 0xFFFFFF;
}
The assembly code of this already looks quite nicely packed, as the compiler optimizes (and inlines!) this whole function into 3 cycles (the w: step of cycle 20 already belongs to the next instruction, while 21 w: still belongs to this function):

Code:
  mul_24_48(&(a.d1),&(a.d0),k_tab[tid],4620); // NUM_CLASSES

becomes

19  z: MUL_UINT24  ____,  R1.x,  (0x0000120C, 6.473998905e-42f).x
    t: MULHI_UINT  ____,  R1.x,  (0x0000120C, 6.473998905e-42f).x
20  x: LSHL        ____,  PS19,  (0x00000008, 1.121038771e-44f).x
    y: LSHR        ____,  PV19.z,  (0x00000018, 3.363116314e-44f).y
    z: AND_INT     ____,  PV19.z,  (0x00FFFFFF, 2.350988562e-38f).z
21  x: ADD_INT     ____,  KC0[1].x,  PV20.z
    z: AND_INT     R4.z,  PV20.w,  (0x00FFFFFF, 2.350988562e-38f).x
    w: OR_INT      ____,  PV20.y,  PV20.x
Quote:
 Originally Posted by Ken_g6 He used assembly, and I found assembly to be useful in my CUDA app as well; but I just couldn't find an equivalent in OpenCL. That, and I couldn't make shifting the 64-bit integers right 32-bits be parsed as a 0-cycles-required no-op (which it ought to be.)
Yes, assembly is missing for the 32-bit operations (it does not hurt that much for 24-bit ops). Shifting 64-bit values right by 32 bits works if you make sure that all values involved are 64-bit - and that includes the "32":

Code:
<any 64-bit> >> 32    = 0             (shift count of type int)
<any 64-bit> >> 32ULL = <upper half>  (shift count of type long)

Took me a while to find that ... but you may have meant something different.

And to those still waiting for the real thing - here it comes. I just added Oliver's pre-release signal handler, which is really required for OpenCL: otherwise the graphics driver crashes Windows 7 when there are still kernels in the queue but the process is already gone.

And yes, it may crash your machine too - do not use it on production machines yet.

Just remember: this is a first test version. Performance figures may be interesting, but I'm more interested to see whether it runs at all, whether it is stable, and whether there are odd side-effects. Do not attempt to upload "no factor" results to primenet yet.

2011-06-09, 14:10   #24
diep

Sep 2006
The Netherlands

2×17×23 Posts

Quote:
 Originally Posted by Bdot [...] The assembly code of this looks already quite nicely packed, as the compiler optimizes (and inlines!) this whole function into 3 cycles [...]
Code:
19  z: MUL_UINT24  ____,  R1.x,  (0x0000120C, 6.473998905e-42f).x
    t: MULHI_UINT  ____,  R1.x,  (0x0000120C, 6.473998905e-42f).x
MULHI_UINT, as you see, is on the T unit, which means it requires work from all the other units as well. In your case that means 5 units (for the AMD HD5000 series); for the HD6000 series it is 4 units.

So this instruction eats 5 cycles if you look at it from that viewpoint (or, seen from the 5 execution units, it eats 1 cycle on each).

The fast instruction is MULHI_UINT24, which runs at the full rate of 1351 Gflops (on the HD6970 series, and I suppose on the 5000 series as well).

As you see, the compiler doesn't generate that one.

So this is a piece of turtle-slow assembler code for several reasons. Not just because of the T unit - there are other issues as well.

Quote:
 Code:
20  x: LSHL        ____,  PS19,  (0x00000008, 1.121038771e-44f).x
    y: LSHR        ____,  PV19.z,  (0x00000018, 3.363116314e-44f).y
    z: AND_INT     ____,  PV19.z,  (0x00FFFFFF, 2.350988562e-38f).z
    w: ADD_INT     ____,  KC0[1].x,  PV19.z
21  x: ADD_INT     ____,  KC0[1].x,  PV20.z
    z: AND_INT     R4.z,  PV20.w,  (0x00FFFFFF, 2.350988562e-38f).x
    w: OR_INT      ____,  PV20.y,  PV20.x
[...]

2011-06-09, 14:20   #25
diep

Sep 2006
The Netherlands

2×17×23 Posts

Code:
*res_lo  = mul24(a,b);
*res_hi  = (mul_hi(a,b) << 8) | (*res_lo >> 24);
*res_lo &= 0xFFFFFF;
Even if AMD adds a function mul_hi16 (or something similar) that generates the top 16 bits quickly using MULHI_UINT24, this piece of programming is still turtle-slow for GPUs. Let me explain.

You generate res_lo using a mul24. This is a fast instruction. The problem comes directly after this. GPUs are not out-of-order processors; they focus on throughput. So I hope I am not putting it too badly if I say that after the execution unit issues the mul24, it takes a cycle or 8 before the result is available. Yet you immediately need that result shifted right by 24. That's too quick. The GPU already hides a little of the latency by running multiple threads, but that doesn't cover everything.

The mul_hi gets shifted left by 8. A good compiler would hide this latency, yet that's not what you can expect here. Directly afterwards its result is needed for an OR with the shifted part of res_lo. So the mul_hi itself eats 5 cycles on the 5000 series (or rather, it occupies all units at the same time); then another 8 cycles later its result is available to be OR-ed with the other half, which we can expect to become available at the same time, those 8 cycles later.

So in reality we have a few operations "in flight" at the same time here: the shift-left after the mul_hi and the shift-right are in flight simultaneously, yet not yet available, as that eats another 8 cycles. So immediately issuing an OR is not so clever. The AND with res_lo we can safely assume to be available by then, of course, as we already needed it in the line above.

So from a programming viewpoint this is utmost beginner's code, for two reasons.

Last fiddled with by diep on 2011-06-09 at 14:39
2011-06-09, 14:45   #26
diep

Sep 2006
The Netherlands

2×17×23 Posts

Quote:
 Originally Posted by Bdot Just remember: this is a first test version. Though performance figures may be interesting, I'm more interested to see if it runs at all, stable and with or without odd side-effects. Do not attempt to upload "no factor" results to primenet yet.
Where is that code?

My email : diep@xs4all.nl

2011-06-10, 00:32   #27
diep

Sep 2006
The Netherlands

2×17×23 Posts

Quote:
 Originally Posted by Bdot This is an early announcement that I have ported parts of Oliver's (aka TheJudger) mfaktc to OpenCL. Currently, I have only the Win64 binary, running an adapted version of Oliver's 71-bit-mul24 kernel. Not yet optimized, not yet making use of the vectors available in OpenCL. A very simple (and slow) 95-bit kernel is there as well, so that the complete selftest finished successfully on my box. On my HD5750 it runs at about 60M/s in the 50M exponent range - certainly a lot of headroom. As I have only this one ATI GPU, I wanted to see if anyone would be willing to help testing on different hardware. Current requirements: OpenCL 1.1 (i.e. only ATI GPUs), Windows 64-bit. There's still a lot of work until I may eventually release this to the public, but I'm optimistic for the summer. Next steps (unordered): Linux port (is Windows 32-bit needed too?); check if http://mersenneforum.org/showpost.ph...40&postcount=7 can be used (looks like it's way faster)
Bdot, I'm looking at the code you shipped me and into your kernel. The square_72_144 multiplication basically seems like an optimized version of what Chrisjp posted over here.

So with near 100% certainty your code is faster - especially given how much more you tested it.

So I wonder why you posted this comment?

Can you explain what you wrote over here?

Thanks,
Vincent

Quote:
 fast 92/95-bit kernels (barrett)
 use of vector data types
 various other performance/optimization tests & enhancements
 of course, bug fixes
 docs and licensing stuff
 clarify if/how this new kid may contribute to primenet
 Bdot

2011-06-10, 11:19   #28
Bdot

Nov 2010
Germany

597 Posts

Quote:
 Originally Posted by diep Bdot i'm looking at the code you shipped me and into your kernel. the square_72_144 multiplicatoin basically seems like an optimized version of what Chrisjp posted over here. So with near 100% certainty your code is faster. Especially if we see how much you tested it to be. So i wonder why you posted this comment? Can you explain what you wrote over here? Thanks, Vincent
First of all, I did not even check out Chrisjp's code in detail, so I did not know whether it was faster or not.
Mainly I'm looking for a faster modulus, because that is currently where ~3/4 of the TF effort is spent (squaring is less than 20%). There must be some reason why Chrisjp had better performance figures. I want to see why (probably due to barrett), and take over what makes sense. And having a fast 84-bit kernel is of course another advantage over 72 bits.

Last fiddled with by Bdot on 2011-06-10 at 11:40 Reason: typo

2011-06-13, 07:33   #29
Colt45ws

Jun 2010

17 Posts

My primary machine here has an unlocked (6970-equivalent) HD6950 2GB. I would be willing to test this stuff.
2011-06-13, 19:32   #30
Bdot

Nov 2010
Germany

3×199 Posts

Quote:
 Originally Posted by Colt45ws My primary machine here has a unlocked (6970 equiv.) HD6950 2GB. I would be willing to test this stuff.
Send me a PM with your email, and you should have it. I received test results from only one tester so far ...

I'm not sure I will send out version 0.04 again as I just built a vectorized version of the 71-bit-kernel. It may take another few days to be finalized. My first tests showed a speedup of 30-40%. You'll probably get this one when it's ready.

2011-06-18, 23:12   #31
Bdot

Nov 2010
Germany

3×199 Posts

I just sent out mfakto version 0.05 to a few interested people. The main highlight is the use of vector data types, which on my GPU (an HD 5750) raises throughput from 60M/s to 100M/s when using multiple instances, and from 36M/s to 88M/s for a single instance.
2011-06-19, 03:46   #32
Colt45ws

Jun 2010

17 Posts

I tested this with my 6970, single instance. I adjusted SievePrimes for max performance, 55k. This resulted in a 90M/s rate on an M57 exponent, from 68 to 69 bits.

Then I played with the other settings. Vector size, best to worst, was 4, 8, 2, 1, 16. 16 had a HUGE performance drop, down to 14M/s. I could only just tell 4 was better than 8 - any closer and it would probably be within the normal range it runs. 2 and 1 both ran in the 7xM/s range. Best GridSize was 4, then 3, 2, 1, 0.

Curiously, the GPU never kicked into HighPerf mode; it stayed at 500MHz. ??

Last fiddled with by Colt45ws on 2011-06-19 at 04:10
2011-06-19, 10:35   #33
Bdot

Nov 2010
Germany

3·199 Posts

Thanks for testing so quickly! I guess SievePrimes is at 5k, not 55k?

Are the GridSize differences big enough to try even bigger ones? For my GPU the UI was not usable anymore with the next bigger one ... but for fast GPUs I could check if bigger grids would always fit ...

Did you monitor the CPU load? On my box I see it never go really high (max 50%). I'm afraid, for now, the only way to fully utilize such a capable GPU is many instances of mfakto (working on different exponents) - which on my machine has the nasty issue of sometimes freezing the machine ... working on it. Maybe it's time for a multi-threaded siever ... until the GPU siever comes.
