[QUOTE=ldesnogu;264485]Warning: I don't know anything about OpenCL...
Why do you use ||, && and ?: at all? Doesn't OpenCL say a comparison result is either 0 or 1? [/QUOTE] For scalar data types that is true. I could have saved a conditional load, but I guess the compiler will optimize that out anyway. I already have in mind to use the same code for a vector of data (just replace all uint by uint4, for instance), and then the result of a comparison is 0 or -1 (all bits set). What I was really hoping for is something that propagates the carry with a little less logic, as that will really slow down additions and subtractions ...
Would the "ulong" data type (a 64-bit unsigned integer) help?

If you cast the arguments to uint64 then the borrow can be computed by shifting...
tmp = (uint64)a - (uint64)b; sub = (uint32)tmp; borrow = tmp >> 63;
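A minimal sketch of the 64-bit idea above, written in plain C rather than OpenCL C so it can be tried on a host compiler. The function name `sub_borrow_u64` and the extra `borrow_in` parameter are assumptions for illustration, not code from the thread; `borrow_in` is assumed to be 0 or 1.

```c
#include <stdint.h>

/* Hypothetical helper: 32-bit subtraction with borrow, done as one
 * 64-bit subtraction. borrow_in must be 0 or 1. */
static uint32_t sub_borrow_u64(uint32_t a, uint32_t b, uint32_t borrow_in,
                               uint32_t *borrow_out)
{
    /* Unsigned 64-bit arithmetic wraps modulo 2^64, so a negative
     * difference shows up with the top bit set. */
    uint64_t tmp = (uint64_t)a - (uint64_t)b - (uint64_t)borrow_in;

    *borrow_out = (uint32_t)(tmp >> 63); /* 1 if the difference went below zero */
    return (uint32_t)tmp;                /* low 32 bits are the result */
}
```

Whether this beats the comparison-based version depends on how the target implements 64-bit integer operations, so it needs to be measured.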
[QUOTE=Bdot;264505]What I was really hoping for is something that propagates the carry with a little less logic, as that will really slow down additions and subtractions ...[/QUOTE]
Two other random ideas (again sorry if it's not applicable...):
- do as many computations as you can without taking care of carries and do a specific pass for handling them; of course that could lead to a big slowdown if you have to reload values and memory is the limiting factor
- another way to compute carries is bit arithmetic; let's say you want the carry from a - b
[code]res = a - b; carry = ((b & ~a) | (res & ~a) | (b & res)) >> (bitsize-1); [/code]where bitsize is the number of bits of a, b and res. Again that could be slower than your original code.
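The bit-arithmetic variant can be sketched for 32-bit operands as follows. This is plain C for illustration; the name `sub_borrow_bitwise` is hypothetical, and bitsize is fixed at 32, so the shift is by 31.

```c
#include <stdint.h>

/* Hypothetical helper: borrow of a - b from bitwise logic only,
 * with no comparison. */
static uint32_t sub_borrow_bitwise(uint32_t a, uint32_t b,
                                   uint32_t *borrow_out)
{
    uint32_t res = a - b; /* wraps modulo 2^32 */

    /* The top bit of this expression is the borrow out of bit 31. */
    *borrow_out = ((b & ~a) | (res & ~a) | (b & res)) >> 31;
    return res;
}
```

Because it uses only AND, OR, NOT and a shift, the same expression should carry over to vector types element by element.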
Thanks for all your suggestions so far. I'll definitely try the (ulong) conversion and compare it to my current bunch of logic. I still welcome suggestions ;)
Here's the task again:
Inputs: uint a, uint b, uint carry (which is the borrow of the previous (lower) 32 bits)
Output: uint res = a - b, carry (which should be the borrow for the next (higher) 32 bits)
Currently this basically looks like [code] res = a - b - carry; carry = (res > a) || ((res == a) && carry); [/code]I'm looking for something simpler for the evaluation of the new carry. Something like [code] carry = (res >= a)  carry; [/code](Yes, I know this one is not correct.)
We have available all logical operators, +, -, and the [URL="http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/integerFunctions.html"]OpenCL Integer Functions[/URL]. But maybe a total of 6 operations for one 32-bit subtraction with borrow is already the minimum for OpenCL?
I did not quite understand how evaluating carries afterwards can save something. Access to the operands is no problem, it's all in registers. The bitwise operations lead to a total of 10 instructions (?) for one subtraction ... less likely to be an acceleration ;)
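For reference, the current comparison-based version can be written out as a self-contained function. This is a plain C sketch of the scalar case only; the name `sub_borrow_cmp` is hypothetical, and `borrow_in` is assumed to be 0 or 1.

```c
#include <stdint.h>

/* Hypothetical helper: 32-bit subtraction with borrow in and borrow out,
 * using comparisons on the wrapped result. borrow_in must be 0 or 1. */
static uint32_t sub_borrow_cmp(uint32_t a, uint32_t b, uint32_t borrow_in,
                               uint32_t *borrow_out)
{
    uint32_t res = a - b - borrow_in; /* wraps modulo 2^32 */

    /* res > a means the subtraction wrapped. res == a with a borrow
     * coming in means b was 0xFFFFFFFF, which also wraps. */
    *borrow_out = (res > a) || ((res == a) && borrow_in);
    return res;
}
```

In scalar code the comparison yields 0 or 1, so the borrow can be fed straight into the next limb. The vector case needs extra care because a true comparison is -1 there.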
[QUOTE=Bdot;264555]I did not quite understand how evaluating carries afterwards can save something. Access to the operands is no problem, it's all in registers. The bitwise operations lead to a total of 10 instructions (?) for one subtraction ... less likely to be an acceleration ;)[/QUOTE]
Well, that all depends on two things: what your compiler is able to find depending on the form of your program, and what your microarchitecture is able to do. The post-pass carry evaluation, for instance, is very useful for vector-like architectures. And it might be possible on some microarchitectures that the result of a comparison blocks a pipeline while logical operations don't, thus making the logical variant faster even though it uses more operations. But then again I don't know anything about your target and OpenCL, so I'm probably completely off track :smile:
[QUOTE=Bdot;263174]
As I have only this one ATI GPU I wanted to see if anyone would be willing to help testing on different hardware. Current requirements: OpenCL 1.1 (i.e. only ATI GPUs), Windows 64-bit. [/QUOTE] I have an HD 4550, Windows 2008 R2 x64. Would that work? Andriy
Bdot, what did you wind up finding did the fastest math, 32-bit numbers or 24-bit numbers? And what form of math? Or are you still working on it?

I played around with this a little ...
[QUOTE=ldesnogu;264544]Two other random ideas (again sorry if it's not applicable...): do as many computations as you can without taking care of carries and do a specific pass for handling them; of course that could lead to a big slowdown if you have to reload values and memory is the limiting factor [/QUOTE]
I have at most 56 operations that I can do before checking the carries. The runtime stays exactly the same, and the reason is that the compiler reorders the instructions anyway as it sees fit. Even a few repeated steps that were necessary for the carry ops did not influence runtime as they were optimized out :smile:
[QUOTE=ldesnogu;264544]another way to compute carries is bit arithmetic; let's say you want the carry from a - b [code]res = a - b; carry = ((b & ~a) | (res & ~a) | (b & res)) >> (bitsize-1); [/code]where bitsize is the number of bits of a, b and res. Again that could be slower than your original code.[/QUOTE]
I was really surprised by that one. Even though this is way more operations than my original code, it runs at the same speed! Not a bit faster, not a bit slower with the single-vector kernel. Comparing the assembly, it turns out that many of the additional operations are "hidden" in otherwise unused slots. Following up with the vector version of the kernel, which has fewer unused slots, I really saw the kernel take 3 cycles more - with a total number of ~700 cycles that's less than 0.5% ...
Here's the current performance analysis of the 79-bit barrett kernel: [code]
Name               Throughput
Radeon HD 5870     135 M Threads\Sec
Radeon HD 6970     118 M Threads\Sec
Radeon HD 6870     100 M Threads\Sec
Radeon HD 5770      68 M Threads\Sec
Radeon HD 4890      66 M Threads\Sec
Radeon HD 4870      58 M Threads\Sec
FireStream 9270     58 M Threads\Sec
FireStream 9250     48 M Threads\Sec
Radeon HD 4770      46 M Threads\Sec
Radeon HD 6670      38 M Threads\Sec
Radeon HD 4670      35 M Threads\Sec
Radeon HD 5670      23 M Threads\Sec
Radeon HD 6450      10 M Threads\Sec
Radeon HD 4550       7 M Threads\Sec
Radeon HD 5450       6 M Threads\Sec
[/code]This is the peak performance of the kernel given enough CPU power to feed the factor candidates fast enough. Empirical translation tells that 1 M Threads/sec is good for 1.2 - 1.5 GHz-days/day.
Unfortunately, right now I have a problem that some of the kernel's code is skipped unless I enable kernel tracing. I need that fixed before I can get you another version for testing.
[QUOTE=Ken_g6;265320]Bdot, what did you wind up finding did the fastest math, 32-bit numbers or 24-bit numbers? And what form of math? Or are you still working on it?[/QUOTE]
Currently the fastest kernel is a 24-bit based kernel working on a vector of 4 factor candidates at once. Here's the list of kernels I currently have, along with the performance on a HD5770:
76 M/s mfakto_cl_71_4: 3x24-bit, 4-vectored kernel
68 M/s mfakto_cl_barrett79: 2.5x32-bit unvectored barrett kernel
53 M/s mfakto_cl_barrett92: 3x32-bit unvectored barrett kernel
44 M/s mfakto_cl_71: 3x24-bit unvectored kernel
The barrett kernels currently need to use a nasty workaround for a bug of the compiler, costing ~3%. I'm still working on vectorizing the barretts; a similar speedup as for the 24-bit kernel can be expected, so that the 32-bit based kernels will be a lot faster than 24-bit.
A 24-bit based barrett kernel that was suggested on the forum runs at 75 M/s, but as it is using vectors for the representation of the factors, it cannot (easily) be enhanced to run on a vector of candidates. If that were easily possible, then the 24-bit kernel might run for the crown again. But it will not be far ahead of the 32-bit kernel. And the 32-bit one has the advantage of running FCs up to 79 bits instead of 71 bits.
Oh boy, I finally found why the barrett kernels sometimes behaved oddly ...
OpenCL does bitshifts only up to the number of bits in the target, and for that it only evaluates the necessary number of bits of the shift count. So for a bitshift a >> b with 32-bit values, only the lowest 5 bits of b are used, the rest is ignored ... Therefore the code that used bitshifts of 32 or more positions to implicitly zero the target did not deliver the expected result. The fix for that comes without a performance penalty ... a little more testing and version 0.06 will be ready.
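To illustrate the pitfall, here is a small plain C sketch. Both function names are hypothetical, and the guarded shift is only one possible way to get the zeroing behaviour; it is not necessarily the fix that went into the kernel. Note that in standard C a shift by 32 or more on a 32-bit value is undefined, so the OpenCL masking is modelled explicitly here.

```c
#include <stdint.h>

/* Models what OpenCL does for 32-bit operands: only the lowest
 * 5 bits of the shift count are used. */
static uint32_t shr32_opencl_like(uint32_t a, uint32_t n)
{
    return a >> (n & 31u);
}

/* Hypothetical guarded shift: counts of 32 or more produce zero,
 * which is what the original code expected. */
static uint32_t shr32_or_zero(uint32_t a, uint32_t n)
{
    return (n < 32u) ? (a >> n) : 0u;
}
```

With a count of 32, the masked version shifts by 0 and returns the value unchanged, while the guarded version returns 0.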