If nothing helps, how about doing it all in 32-bit math?
[code] facdist = (2 * NUM_CLASSES * (exponent % prime)) % prime; [/code]Could even be faster ... I'll test that for AMD cards. I saw that many emulated 64-bit operations take more than double the time of the 32-bit ones ...

Edit: OK, I should have thought about this before writing ... this would work only for primes smaller than 464823 ...
[QUOTE=Prime95;377027]
Next mini-bug: In extract_bits, exponent, k_base, shiftcount, bit_max64, and bb are undefined in this code.[/QUOTE]
Fixed. I did not pay attention when I moved the common code into a shared method.
Please replace
[code] facdist = (ulong) (2 * NUM_CLASSES) * (ulong) exponent; [/code]with:
[code] facdist = mul_16_32 (2 * NUM_CLASSES, exponent); [/code]and then:
[code] #define mul_16_32(a,b) ((ulong)(a) * (ulong)(b)) [/code]Then for Intel we need (I hope I did this right):
[code] #define mul_16_32(a,b) ((((ulong) ((uint)(a) * ((uint)(b) >> 16))) << 16) + (ulong) ((uint)(a) * ((uint)(b) & 0xFFFF))) [/code]You can use the Intel definition for AMD if you think it will generate better code.

This fixes one bug, but there is at least one more. I'm betting the calculation of bit_to_clr will also need a similar macro.
[QUOTE=Prime95;377093]I'm betting the calculation of bit_to_clr also will need a similar macro.[/QUOTE]
Nope, the bug is elsewhere. Time for some more nasty debugging.
Has anyone looked at what the upcoming/available HSA stuff might add to the mix? Obviously an APU isn't a 290X but having the same memory address space and whatever other stuff HSA allows seems really cool to me.
It would allow faster "transfer" of data when sieving on the CPU, which can yield better output (read: more GHz-days) than GPU sieving anyway. However, the memory transfer is not a bottleneck until you have multiple high-end cards in the same system, and for those cases GPU sieving is the better choice these days. HSA will not help there.
Therefore it may be exciting, but it will not help TF throughput a lot.
[QUOTE=Prime95;377093]Please replace
facdist = (ulong) (2 * NUM_CLASSES) * (ulong) exponent; with: facdist = mul_16_32 (2 * NUM_CLASSES, exponent); [...] You can use the Intel definition for AMD if you think it will generate better code.[/QUOTE]
The bit-shifting code is quite a slowdown on AMD (older cards: 6 cycles instead of 3, newer cards: 17 cycles instead of 8). Not a big deal, as it is not the sieving itself, but I'd like to avoid it anyway. Could you please check whether the following code works on Intel? The OpenCL standard discourages type casts in favor of the conversion functions. These two versions create exactly the same assembly on AMD (which is the same as the original):
[code] facdist = convert_ulong(2 * NUM_CLASSES) * convert_ulong(exponent); [/code][code] uint2 temp;
temp.x = exponent * (2 * NUM_CLASSES);
temp.y = mul_hi(exponent, 2 * NUM_CLASSES);
facdist = as_ulong (temp); [/code]
[QUOTE=Prime95;377098]Nope, the bug is elsewhere. Time for some more nasty debugging.[/QUOTE]
This would have been my bet as well. Did you verify that using the GWDEBUG printfs? That verification may also fall victim to the same 32-bit calculation; maybe two wrongs make a right, and therefore no FAILs are reported?

Other occurrences of ulong conversions are the memory access (maybe only 32 bits are saved?):
[code] #define locsieve64 ((__local ulong *) locsieve) [/code]and the sieve-mask calculation (if that is done in 32 bits, we'll have way more zeros):
[code] mask1 = (i131 > 63 ? 0 : ((ulong) 1 << i131)) | (i137 > 63 ? 0 : ((ulong) 1 << i137)); [/code]
Neither of those versions works. Can we #define a work-around flag for Intel devices? If so, we'd need to release two different executables, right?
Do you want to wait until I find the next bug before making any decisions?
[QUOTE=Prime95;377287]Neither of those versions work. Can we #define a work-around flag for Intel devices? If so, we'd need to release two different executables, right?
Do you want to wait until I find the next bug before making any decisions?[/QUOTE]
That's really sad; Intel should fix it. For now, I will modify the host code to pass the detected device type as a define to the kernel, so that we can do
[code] #ifdef INTEL
#define mul_16_32(a,b) ((((ulong) ((uint)(a) * ((uint)(b) >> 16))) << 16) + (ulong) ((uint)(a) * ((uint)(b) & 0xFFFF)))
#else
#define mul_16_32(a,b) ((ulong)(a) * (ulong)(b))
#endif [/code]This requires separating the device detection from the kernel compilation, which I have not done yet. I'll try to get to that tonight.
Not all ulongs are a problem. I was surprised that the bit_to_clr ulong operations were OK. I did turn on the GWDEBUG checks with an ugly and slow mul_32_32 macro, and all was OK.
Presently, the self-tests pass with DEBUG_FACTOR_FIRST set, but only about half pass if it is not set.