[QUOTE=Bdot;331795]Well then, why's nobody porting cudalucas to OpenCL :smile:?[/QUOTE]
I would, if I "could", heh. :no: |
How about this ?
[CODE] mask = (1 << (i37 & 31)) | (1 << (i41 & 31)) | (1 << (i43 & 31)) | (1 << (i47 & 31)) | (1 << (i53 & 31)) | (1 << (i59 & 31)) | (1 << (i61 & 31)); [/CODE] |
[QUOTE=Larifari;331812]How about this ?
[CODE] mask = (1 << (i37 & 31)) | (1 << (i41 & 31)) | (1 << (i43 & 31)) | (1 << (i47 & 31)) | (1 << (i53 & 31)) | (1 << (i59 & 31)) | (1 << (i61 & 31)); [/CODE][/QUOTE] This is exactly what OpenCL does "for free" and isn't useful in our (Bdot's) case. |
Does OpenCL generate better code than CUDA does for 64-bit variables? The masking code can be rewritten to use 64-bit masks -- although it is not a trivial matter.
If memory lookups are cheap then 1 << i37 can be replaced by a lookup into a 37-element array. If generating 0 or 1 from a conditional is "cheap" then 1 << i37 can be replaced with (i37 < 32) << i37, where i37 < 32 evaluates to 0 or 1. |
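A minimal C sketch of the two suggestions above, under assumptions that are not stated in the thread: `i37` is taken to be an index in the range 0..36, the helper names (`shl_masked`, `bit_or_zero_cond`, `bit_or_zero_lut`) are hypothetical, and the explicit `& 31` stands in for an OpenCL-style shift that only uses the low five bits of the count. Both variants return 0 whenever the index is 32 or larger.

```c
#include <stdint.h>

/* Stand-in for an OpenCL-style 32-bit shift: only the low 5 bits of the
 * count are used. The explicit mask keeps this well defined in C. */
static uint32_t shl_masked(uint32_t x, uint32_t n)
{
    return x << (n & 31u);
}

/* Suggestion 1: (n < 32) evaluates to 0 or 1, so a zero is shifted
 * (and stays zero) whenever n >= 32. */
static uint32_t bit_or_zero_cond(uint32_t n)
{
    return shl_masked((uint32_t)(n < 32u), n);
}

/* Suggestion 2: a 37-entry table for the prime 37. Entries 0..31 hold a
 * single set bit, entries 32..36 hold 0. */
static uint32_t bit_table37[37];

static void init_bit_table37(void)
{
    for (uint32_t i = 0; i < 37u; i++)
        bit_table37[i] = (i < 32u) ? (UINT32_C(1) << i) : 0u;
}

/* Caller must guarantee n < 37 (the assumed range of i37). */
static uint32_t bit_or_zero_lut(uint32_t n)
{
    return bit_table37[n];
}
```

On a GPU the table would presumably live in constant memory; which variant is faster depends on the hardware and would need measuring.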
[QUOTE=Prime95;331854]Does OpenCL generate better code than CUDA does for 64-bit variables? The masking code can be rewritten to use 64-bit masks -- although it is not a trivial matter.
If memory lookups are cheap then 1 << i37 can be replaced by a lookup into a 37-element array. If generating 0 or 1 from a conditional is "cheap" then 1 << i37 can be replaced with (i37 < 32) << i37, where i37 < 32 evaluates to 0 or 1.[/QUOTE] Thanks for the suggestions, I'll test them. Actually, the GCN cards have a few 64-bit instructions that also run at full speed, and shifts belong to them. Not sure yet if extracting the low word costs anything ... The array-lookup also looks promising, as I have lots of constant-memory available. |
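The 64-bit idea can be sketched in plain C as follows. The function name `low_word_bit` is hypothetical and the index is assumed to stay below 64. A 64-bit shift places bits 32..63 in the high word, so truncating to 32 bits yields 0 for those counts without depending on how a 32-bit shift treats large counts. Whether the truncation is free on a given GPU is the open question raised above.

```c
#include <stdint.h>

/* Shift in 64 bits, then keep only the low 32 bits.
 * For n in 0..31 the result is a single set bit; for n in 32..63 the bit
 * lands in the high word and the result is 0. Requires n < 64. */
static uint32_t low_word_bit(uint32_t n)
{
    return (uint32_t)(UINT64_C(1) << n);
}
```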
[QUOTE=Bdot;331795]On CUDA, shifts of more than 32 will result in 0, but OpenCL first takes the shift-value mod 32, and shifts only by the remainder. One of the less prominent differences between CUDA and OpenCL.[/QUOTE]
Are both build platforms mapping the HLL shifts to the same hardware shift instruction, just OpenCL is masking the shift count? And does the hardware shift specify "result = 0 if shift count > 31"? If it's a matter of forcing OpenCL to respect your (unmasked) shift count, is writing a tiny inline-ASM macro for such shifts an option? |
[QUOTE=ewmayer;332294]Are both build platforms mapping the HLL shifts to the same hardware shift instruction, just OpenCL is masking the shift count? And does the hardware shift specify "result = 0 if shift count > 31"?
If it's a matter of forcing OpenCL to respect your (unmasked) shift count, is writing a tiny inline-ASM macro for such shifts an option?[/QUOTE] The OpenCL shift results in a single assembly instruction, without additional bit masking required. This means that the AMD hardware does that automatically. AMD's instruction set documentation also details this behavior. I assume that the NV OpenCL compiler needs to issue additional bit mask operations to get to the same semantics (or issue different instructions, if available). |
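To make the difference concrete, here is a small C model of the two behaviours described in this thread. The function names are made up for illustration, and both are written to be well defined in standard C rather than relying on any compiler's handling of oversized shift counts.

```c
#include <stdint.h>

/* OpenCL-style: the count is reduced modulo 32 before shifting. */
static uint32_t shl_modulo32(uint32_t x, uint32_t n)
{
    return x << (n & 31u);
}

/* CUDA-style as described in the thread: counts of 32 or more give 0. */
static uint32_t shl_clamped32(uint32_t x, uint32_t n)
{
    return (n < 32u) ? (x << n) : 0u;
}
```

For a count of 37, `shl_modulo32(1, 37)` gives the same value as `1 << 5`, while `shl_clamped32(1, 37)` gives 0. For counts below 32 the two agree.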
[QUOTE=Bdot;332299]AMD's instruction set documentation also details this behavior.[/QUOTE]
Bizarre. Is there any explanation as to [B][I][U]why[/U][/I][/B] this is "the way it is done" under OpenCL? Talk about non-portability of assumptions.... |
[QUOTE=chalsall;332306]Talk about non-portability of assumptions....[/QUOTE]
Doesn't the C standard state that x << y is implementation dependent if y is greater than the word size? If so, this is simply a case of non-portable C code written to extract maximum efficiency of a particular architecture. Neither AMD nor Nvidia did anything wrong. |
[QUOTE=Bdot;330988]Regarding GPU sieving: It is not so much that the GPU part of OpenCL is so much different from the GPU part of CUDA - most of that can even be solved by one or two dozen #defines. The CPU part of both is so much different. And a few concepts/abilities are hard to "translate" (Assembler inlines, for instance).
Anyway, I have George's GPU sieve running on OpenCL so that it provides some output. I need some verification of it being correct, and I need to do the adaptations of the kernels to read the raw sieve bitfield. As mfakto has many more kernels than mfaktc, I'm still thinking of a smart way to do this ...[/QUOTE] That's great news! :showoff: |
[QUOTE=Prime95;332336]Doesn't the C standard state that x << y is implementation dependent if y is greater than the word size? If so, this is simply a case of non-portable C code written to extract maximum efficiency of a particular architecture. Neither AMD nor Nvidia did anything wrong.[/QUOTE]
With the greatest of respect, I think you might be confusing the definition of the language with regards to the case where the operands are different sizes (where a warning should be given), and/or what happens when both the operands are the same size and the values are within the defined types but an overflow occurs. [QUOTE]The type of the result is that of the promoted left operand. If the right operand is negative, greater than, or equal to the length in bits of the promoted left operand, the result is undefined. The value of E1 << E2 is E1 (interpreted as a bit pattern) left-shifted E2 bits. Vacated bits are filled with zeros. The value of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 is unsigned, or if it is signed and its value is nonnegative, vacated bits are filled with zeros. If E1 is signed and its value is negative, vacated bits are filled with ones.[/QUOTE] But, as always, I'm happy to be proven wrong. |