mersenneforum.org  

Old 2011-06-23, 17:34   #45
Bdot
 
Nov 2010
Germany

3×199 Posts

Quote:
Originally Posted by ldesnogu
Warning: I don't know anything about OpenCL...

Why do you use ||, && and ?: at all? Doesn't OpenCL say a comparison result is either 0 or 1?
For scalar data types that is true. I could have saved a conditional load, but I guess the compiler will optimize that out anyway.

I already plan to use the same code for a vector of data (just replace every uint with uint4, for instance), and then the result of a comparison is 0 or -1 (all bits set).

What I was really hoping for is something that propagates the carry with a little less logic, as that will really slow down additions and subtractions ...
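For reference, here is a minimal OpenCL C sketch of that difference (kernel and variable names are made up for illustration):
Code:
__kernel void cmp_demo(__global int *out)
{
    uint  a = 1u, b = 2u;
    uint4 va = (uint4)(1u), vb = (uint4)(2u);

    int  s = (a < b);    // scalar comparison yields 0 or 1
    int4 v = (va < vb);  // vector comparison yields 0 or -1 (all bits set) per component

    out[0] = s;          // 1
    out[1] = v.x;        // -1
}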
Old 2011-06-23, 22:15   #46
Ken_g6
 
Jan 2005
Caught in a sieve

2×197 Posts

Would the "ulong" data type (a 64-bit unsigned integer) help?
Old 2011-06-24, 00:18   #47
bsquared
 
"Ben"
Feb 2007

2·3·541 Posts

If you cast the arguments to uint64 then the borrow can be computed by shifting...

tmp = (uint64)a - (uint64)b;
sub = (uint32)tmp;
borrow = tmp >> 63;
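Adapted to OpenCL's built-in types (ulong/uint) and extended with the incoming borrow from Bdot's task, a sketch of this idea might look like the following (helper name made up, not the actual mfakto code):
Code:
// One 32-bit limb of a multi-word subtraction, using a 64-bit intermediate
// (ulong is OpenCL's 64-bit unsigned type, uint its 32-bit one).
uint sub32_64bit(uint a, uint b, uint carry_in, uint *carry_out)
{
    ulong tmp  = (ulong)a - (ulong)b - (ulong)carry_in;
    *carry_out = (uint)(tmp >> 63);   // bit 63 is set exactly when b + carry_in > a
    return (uint)tmp;                 // low 32 bits are the difference
}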
Old 2011-06-24, 08:41   #48
ldesnogu
 
Jan 2008
France

17×31 Posts

Quote:
Originally Posted by Bdot
What I was really hoping for is something that propagates the carry with a little less logic, as that will really slow down additions and subtractions ...
Two other random ideas (again, sorry if they're not applicable...):

- do as many computations as you can without handling carries, then do a separate pass to handle them; of course that could lead to a big slowdown if you have to reload values and memory is the limiting factor

- another way to compute carries is bit arithmetic; let's say you want the carry from a - b
Code:
res = a - b;
carry = ((b & ~a) | (res & ~a) | (b & res)) >> (bitsize-1);
where bitsize is the number of bits of a, b and res. Again that could be slower than your original code.
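Spelled out for 32-bit values (so bitsize-1 = 31), a sketch of this idea could be the following (the helper name is made up):
Code:
// Borrow of res = a - b, computed with bit operations only (bitsize-1 = 31 for uint).
uint sub32_borrow_bits(uint a, uint b, uint *res)
{
    *res = a - b;
    return ((b & ~a) | (*res & ~a) | (b & *res)) >> 31;
}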
Old 2011-06-24, 16:27   #49
Bdot
 
Nov 2010
Germany

3×199 Posts

Thanks for all your suggestions so far. I'll definitely try the (ulong) conversion and compare it to my current bunch of logic. I still welcome suggestions ;-)

Here's the task again:

Inputs: uint a, uint b, uint carry (the borrow from the previous (lower) 32 bits)
Output: uint res = a - b - carry, and the new carry (the borrow for the next (higher) 32 bits)

Currently this basically looks like
Code:
 res   = a - b - carry;
 carry = (res > a) || ((res == a) && carry);
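For reference, the same step packaged as a self-contained helper, with the reasoning behind the carry expression spelled out in comments (a sketch, not the actual mfakto source; the name is made up):
Code:
// The current 32-bit subtract-with-borrow: 6 operations per limb.
uint sub32_with_borrow(uint a, uint b, uint carry, uint *res)
{
    *res = a - b - carry;
    // A borrow is needed exactly when b + carry > a: then res wraps around and ends up
    // above a, except in the extreme case b = 0xFFFFFFFF, carry = 1, where res == a again.
    return (*res > a) || ((*res == a) && carry);
}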
I'm looking for something simpler for the evaluation of the new carry. Something like
Code:
  carry = (res >= a) || carry;
(Yes, I know this one is not correct)

We have available all logical operators, +, -, and the OpenCL Integer Functions. But maybe a total of 6 operations for one 32-bit subtraction with borrow is already the minimum for OpenCL?

I did not quite understand how evaluating carries afterwards can save anything. Access to the operands is no problem, it's all in registers. The bit-wise operations lead to a total of 10 instructions (?) for one subtraction ... not very likely to be faster ;-)
Old 2011-06-24, 16:43   #50
ldesnogu
 
Jan 2008
France

17×31 Posts

Quote:
Originally Posted by Bdot
I did not quite understand how evaluating carries afterwards can save anything. Access to the operands is no problem, it's all in registers. The bit-wise operations lead to a total of 10 instructions (?) for one subtraction ... not very likely to be faster ;-)
Well, that all depends on two things: what your compiler is able to do with a given form of the program, and what your micro-architecture can do.

The post-pass carry evaluation, for instance, is very useful for vector-like architectures. And on some micro-architectures the result of a comparison might stall the pipeline while logical operations don't, making the logical variant faster even though it uses more operations.

But then again I don't know anything about your target and OpenCL, so I'm probably completely off track.
Old 2011-07-03, 15:42   #51
apsen
 
Jun 2011

131 Posts

Quote:
Originally Posted by Bdot

As I have only this one ATI GPU I wanted to see if anyone would be willing to help testing on different hardware.

Current requirements: OpenCL 1.1 (i.e. only ATI GPUs), Windows 64-bit.
I have an HD 4550 and Windows 2008 R2 x64. Would that work?

Andriy
Old 2011-07-03, 18:52   #52
Ken_g6
 
Jan 2005
Caught in a sieve

110001010₂ Posts

Bdot, what did you end up finding gives the fastest math, 32-bit numbers or 24-bit numbers? And what form of math? Or are you still working on it?
Old 2011-07-04, 11:35   #53
Bdot
 
Nov 2010
Germany

3×199 Posts

I played around with this a little ...

Quote:
Originally Posted by ldesnogu
Two other random ideas (again, sorry if they're not applicable...):

- do as many computations as you can without handling carries, then do a separate pass to handle them; of course that could lead to a big slowdown if you have to reload values and memory is the limiting factor
I have at most 5-6 operations that I can do before checking the carries. The runtime stays exactly the same, and the reason is that the compiler reorders the instructions as it sees fit anyway. Even a few repeated steps that were necessary for the carry ops did not influence the runtime, as they were optimized out.

Quote:
Originally Posted by ldesnogu
- another way to compute carries is bit arithmetic; let's say you want the carry from a - b
Code:
res = a - b;
carry = ((b & ~a) | (res & ~a) | (b & res)) >> (bitsize-1);
where bitsize is the number of bits of a, b and res. Again that could be slower than your original code.
I was really surprised by that one. Even though it is way more operations than my original code, it runs at the same speed! Not a bit faster, not a bit slower with the single-vector kernel. Comparing the assembly, it turns out that many of the additional operations are "hidden" in otherwise unused slots. Following up with the vector version of the kernel, which has fewer unused slots, I saw that the kernel takes 3 cycles more - with a total of ~700 cycles that's less than 0.5% ...

Here's the current performance analysis of the 79-bit barrett kernel:

Code:
Name                Throughput
Radeon HD 5870      135 M Threads\Sec
Radeon HD 6970      118 M Threads\Sec
Radeon HD 6870      100 M Threads\Sec
Radeon HD 5770       68 M Threads\Sec
Radeon HD 4890       66 M Threads\Sec
Radeon HD 4870       58 M Threads\Sec
FireStream 9270      58 M Threads\Sec
FireStream 9250      48 M Threads\Sec
Radeon HD 4770       46 M Threads\Sec
Radeon HD 6670       38 M Threads\Sec
Radeon HD 4670       35 M Threads\Sec
Radeon HD 5670       23 M Threads\Sec
Radeon HD 6450       10 M Threads\Sec
Radeon HD 4550        7 M Threads\Sec
Radeon HD 5450        6 M Threads\Sec
This is the peak performance of the kernel, given enough CPU power to feed it factor candidates fast enough. Empirically, 1M threads/sec translates to roughly 1.2 - 1.5 GHz-days/day.

Unfortunately, I currently have a problem where some of the kernel's code is skipped unless I enable kernel tracing. I need to fix that before I can get you another version for testing.
Old 2011-07-04, 12:54   #54
Bdot
 
Nov 2010
Germany

3·199 Posts

Quote:
Originally Posted by Ken_g6
Bdot, what did you end up finding gives the fastest math, 32-bit numbers or 24-bit numbers? And what form of math? Or are you still working on it?
Currently the fastest kernel is a 24-bit based kernel working on a vector of 4 factor candidates at once. Here's the list of kernels I currently have, along with their performance on an HD 5770:

76 M/s mfakto_cl_71_4: 3x24-bit, 4-vectored kernel
68 M/s mfakto_cl_barrett79: 2.5x32-bit unvectored barrett kernel
53 M/s mfakto_cl_barrett92: 3x32-bit unvectored barrett kernel
44 M/s mfakto_cl_71: 3x24-bit unvectored kernel

The barrett kernels currently need a nasty workaround for a compiler bug, costing ~3%. I'm still working on vectorizing the barretts; a speedup similar to the 24-bit kernel's can be expected, so the 32-bit based kernels should end up a lot faster than the 24-bit ones.

A 24-bit based barrett kernel that was suggested on the forum runs at 75 M/s, but as it uses vectors to represent the factors, it cannot (easily) be extended to run on a vector of candidates. If that were easily possible, the 24-bit kernel might compete for the crown again, but it would not be far ahead of the 32-bit kernel. And the 32-bit one has the advantage of handling FCs up to 79 bits instead of 71 bits.
Old 2011-07-05, 08:12   #55
Bdot
 
Nov 2010
Germany

3×199 Posts

Oh boy, I finally found why the barrett kernels sometimes behaved oddly ...

OpenCL only ever shifts by less than the width of the type, and it evaluates only the necessary low bits of the shift count: for a bit-shift a >> b on 32-bit values, only the lowest 5 bits of b are used and the rest are ignored ... Therefore the code that used bit-shifts of 32 or more positions to implicitly zero the target did not deliver the expected result.
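For illustration, the pitfall and one possible explicit guard (a sketch, not necessarily the fix used in mfakto):
Code:
__kernel void shift_demo(__global uint *out)
{
    uint a = 0xDEADBEEFu;
    uint n = 32;

    out[0] = a >> n;                   // OpenCL uses only (n & 31): this is a >> 0 == a, not 0
    out[1] = (n > 31) ? 0u : (a >> n); // explicit check gives the intended 0
}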

The fix for that comes without a performance penalty ... a little more testing and version 0.06 will be ready.