mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfakto: an OpenCL program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=15646)

Bdot 2014-06-30 14:21

If nothing helps, how about doing it all in 32-bit math?

facdist = (2 * NUM_CLASSES * (exponent%prime))%prime;

It could even be faster ... I'll test that on AMD cards. I've seen that many emulated 64-bit operations take more than twice as long as their 32-bit counterparts ...

Edit: OK, I should have thought about this before writing ... this would work only for primes smaller than 464823 ...

Bdot 2014-06-30 16:19

[QUOTE=Prime95;377027]
Next mini-bug: In extract_bits exponent, k_base, shiftcount, bit_max64, bb are undefined in this code
[/QUOTE]
Fixed. I did not pay attention when I moved the common code to a shared method.

Prime95 2014-07-01 02:30

Please replace

facdist = (ulong) (2 * NUM_CLASSES) * (ulong) exponent;

with:

facdist = mul_16_32 (2 * NUM_CLASSES, exponent);

and then:

#define mul_16_32(a,b) ((ulong)(a) * (ulong)(b))

then for Intel we need (I hope I did this right):

#define mul_16_32(a,b) ((((ulong) ((uint)(a) * ((uint)(b) >> 16))) << 16) + (ulong) ((uint)(a) * ((uint)(b) & 0xFFFF)))


You can use the Intel definition for AMD if you think it will generate better code. This fixes one bug, but there is at least one more. I'm betting the calculation of bit_to_clr also will need a similar macro.

Prime95 2014-07-01 04:59

[QUOTE=Prime95;377093]I'm betting the calculation of bit_to_clr also will need a similar macro.[/QUOTE]

Nope, the bug is elsewhere. Time for some more nasty debugging.

tului 2014-07-01 21:06

Has anyone looked at what the upcoming/available HSA stuff might add to the mix? Obviously an APU isn't a 290X but having the same memory address space and whatever other stuff HSA allows seems really cool to me.

Bdot 2014-07-01 21:16

It would allow faster "transfer" of data when sieving on the CPU, which can yield better output (read: more GHz-days) than GPU sieving anyway. However, the memory transfer is not a bottleneck until you have multiple high-end cards in the same system, and for those cases GPU sieving is the better choice these days. HSA will not help there.

Therefore it may be exciting, but it will not help TF throughput much.

Bdot 2014-07-03 15:38

[QUOTE=Prime95;377093]Please replace

facdist = (ulong) (2 * NUM_CLASSES) * (ulong) exponent;

with:

facdist = mul_16_32 (2 * NUM_CLASSES, exponent);

and then:

#define mul_16_32(a,b) ((ulong)(a) * (ulong)(b))

then for Intel we need (I hope I did this right):

#define mul_16_32(a,b) ((((ulong) ((uint)(a) * ((uint)(b) >> 16))) << 16) + (ulong) ((uint)(a) * ((uint)(b) & 0xFFFF)))


You can use the Intel definition for AMD if you think it will generate better code. This fixes one bug, but there is at least one more. I'm betting the calculation of bit_to_clr also will need a similar macro.[/QUOTE]

The bit-shifting code is quite a slowdown on AMD (older cards: 6 cycles instead of 3; newer cards: 17 cycles instead of 8). Not a big deal, as it is not the sieving itself, but I'd like to avoid it anyway.

Could you please check whether the following code works on Intel? The OpenCL standard discourages type casts in favor of conversion functions. These two versions generate exactly the same assembly on AMD (which is the same as the original):
[code]
facdist = convert_ulong(2 * NUM_CLASSES) * convert_ulong(exponent);
[/code][code]
uint2 temp;

temp.x = exponent * (2 * NUM_CLASSES);
temp.y = mul_hi(exponent, 2 * NUM_CLASSES);

facdist = as_ulong (temp);
[/code]

Bdot 2014-07-03 16:09

[QUOTE=Prime95;377098]Nope, the bug is elsewhere. Time for some more nasty debugging.[/QUOTE]
This would have been my bet as well. Did you verify that using the GWDEBUG printfs? That verification may fall victim to the same 32-bit calculation; maybe two wrongs make a right, and therefore no FAILs are reported?

Other occurrences of ulong conversions are:

memory access (maybe only 32 bits are saved?):
#define locsieve64 ((__local ulong *) locsieve)

sieve mask calculation (if that is done in 32 bits, we'll have way more zeros):
mask1 = (i131 > 63 ? 0 : ((ulong) 1 << i131)) | (i137 > 63 ? 0 : ((ulong) 1 << i137));

Prime95 2014-07-03 16:17

Neither of those versions works. Can we #define a work-around flag for Intel devices? If so, we'd need to release two different executables, right?

Do you want to wait until I find the next bug before making any decisions?

Bdot 2014-07-03 16:32

[QUOTE=Prime95;377287]Neither of those versions work. Can we #define a work-around flag for Intel devices? If so, we'd need to release two different executables, right?

Do you want to wait until I find the next bug before making any decisions?[/QUOTE]
That's really sad, and Intel should fix it.

For now, I will modify the host code to pass the detected device type as a define to the kernel. So we can do

[code]
#ifdef INTEL
#define mul_16_32(a,b) ((((ulong) ((uint)(a) * ((uint)(b) >> 16))) << 16) + (ulong) ((uint)(a) * ((uint)(b) & 0xFFFF)))
#else
#define mul_16_32(a,b) ((ulong)(a) * (ulong)(b))
#endif
[/code]This requires separation of device detection from code compilation, which I have not yet done. I'll try to get to that tonight.

Prime95 2014-07-03 16:37

Not all ulongs are a problem. I was surprised that bit_to_clr ulong operations were OK. I did turn on the GWDEBUG checks with an ugly and slow mul_32_32 macro and all was OK.

Presently, self tests pass with DEBUG_FACTOR_FIRST set, but only about half pass if it is not set.

