mersenneforum.org  

2013-03-03, 18:03   #694
kracker
"Mr. Meeseeks"
Jan 2012
California, USA
878₁₆ Posts

Quote:
Originally Posted by Bdot
Well then, why's nobody porting CUDALucas to OpenCL?
I would, if I "could", heh.
2013-03-03, 20:00   #695
Larifari
Mar 2013
116 Posts

How about this?

Code:
mask = (1 << (i37 & 31))
     | (1 << (i41 & 31)) | (1 << (i43 & 31)) | (1 << (i47 & 31))
     | (1 << (i53 & 31)) | (1 << (i59 & 31)) | (1 << (i61 & 31));
2013-03-03, 22:03   #696
TheJudger
"Oliver"
Mar 2005
Germany
111110 Posts

Quote:
Originally Posted by Larifari
How about this?

Code:
mask = (1 << (i37 & 31))
     | (1 << (i41 & 31)) | (1 << (i43 & 31)) | (1 << (i47 & 31))
     | (1 << (i53 & 31)) | (1 << (i59 & 31)) | (1 << (i61 & 31));
This is exactly what OpenCL does "for free", and it isn't useful in our (Bdot's) case.
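To make the point concrete, here is a minimal host-side C sketch (not taken from mfakto or mfaktc) that contrasts the explicitly masked expression with one that drops any bit whose index is 32 or more. The function names are hypothetical, and the indices i37 ... i61 are assumed to arrive as a plain array.

```c
#include <stdint.h>
#include <stddef.h>

/* The masked expression from the post above: every shift count is
   reduced mod 32 first, so an index of 32 or more wraps around and
   still sets a bit in the word. */
uint32_t mask_wrapped(const uint32_t *idx, size_t n)
{
    uint32_t mask = 0;
    for (size_t k = 0; k < n; k++)
        mask |= UINT32_C(1) << (idx[k] & 31);
    return mask;
}

/* Assumed goal of the sieve code: an index that lies outside the
   current 32-bit word contributes nothing to the mask. */
uint32_t mask_dropped(const uint32_t *idx, size_t n)
{
    uint32_t mask = 0;
    for (size_t k = 0; k < n; k++)
        if (idx[k] < 32)
            mask |= UINT32_C(1) << idx[k];
    return mask;
}
```

For the sample indices {3, 36, 40}, mask_wrapped returns 0x118 (bits 3, 4 and 8), while mask_dropped returns 0x8 (bit 3 only). The two extra bits are the ones a sieve relying on "out-of-range shifts give 0" would not expect.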
2013-03-04, 02:00   #697
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
7,537 Posts

Does OpenCL generate better code than CUDA does for 64-bit variables? The masking code can be rewritten to use 64-bit masks -- although it is not a trivial matter.

If memory lookups are cheap then 1 << i37 can be replaced by a lookup into a 37-element array.

If generating 0 or 1 from a conditional is "cheap" then 1 << i37 can be replaced with (i37 < 32) << i37, where i37 < 32 evaluates to 0 or 1.
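The lookup and conditional variants could look roughly like this in plain C. This is only a sketch under assumptions: the names are invented for illustration, the table is filled on the host rather than placed in GPU constant memory, and i37 is assumed to lie in 0..36. The "& 31" in the second variant is an addition that keeps the shift defined in standard C; on hardware that already wraps the count it should cost nothing.

```c
#include <stdint.h>

#define LUT_SIZE 37   /* assumed range of i37: 0..36 */

/* Fill a table so that lut[i] == 1 << i for i < 32 and 0 otherwise. */
void init_bit_lut(uint32_t lut[LUT_SIZE])
{
    for (unsigned i = 0; i < LUT_SIZE; i++)
        lut[i] = (i < 32) ? (UINT32_C(1) << i) : 0;
}

/* Variant 1: table lookup. The caller guarantees i < LUT_SIZE. */
uint32_t bit_from_lut(const uint32_t lut[LUT_SIZE], unsigned i)
{
    return lut[i];
}

/* Variant 2: the comparison yields 0 or 1, which is then shifted.
   For i >= 32 the value being shifted is 0, so the result is 0
   wherever the wrapped count points. */
uint32_t bit_from_compare(unsigned i)
{
    return (uint32_t)(i < 32) << (i & 31);
}
```

Both variants agree for every i in 0..36; which one is faster on a given GPU is exactly the open question in the post and would have to be measured.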
2013-03-04, 20:05   #698
Bdot
Nov 2010
Germany
1001010101₂ Posts

Quote:
Originally Posted by Prime95
Does OpenCL generate better code than CUDA does for 64-bit variables? The masking code can be rewritten to use 64-bit masks -- although it is not a trivial matter.

If memory lookups are cheap then 1 << i37 can be replaced by a lookup into a 37-element array.

If generating 0 or 1 from a conditional is "cheap" then 1 << i37 can be replaced with (i37 < 32) << i37, where i37 < 32 evaluates to 0 or 1.
Thanks for the suggestions, I'll test them. Actually, the GCN cards have a few 64-bit instructions that also run at full speed, and shifts belong to them. Not sure yet if extracting the low word costs anything ...
The array lookup also looks promising, as I have plenty of constant memory available.
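The 64-bit idea can be sketched in a few lines of C: shift in 64 bits, then keep only the low word. The function name is hypothetical, and whether the truncation is free on GCN hardware is the open question above, not something this host-side sketch can answer.

```c
#include <stdint.h>

/* Shift a 64-bit one, then truncate to the low 32 bits.
   Valid for i < 64; for 32 <= i < 64 the set bit lies in the
   high word, so the returned low word is 0. */
uint32_t bit_via_64(unsigned i)
{
    return (uint32_t)(UINT64_C(1) << i);
}
```

With i37 in the range 0..36 the shift count never reaches 64, so the expression stays well defined in standard C and yields 0 for the counts 32..36.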
2013-03-07, 19:22   #699
ewmayer
2ω=0
Sep 2002
República de California
26577₈ Posts

Quote:
Originally Posted by Bdot
On CUDA, shifts of 32 or more will result in 0, but OpenCL first takes the shift value mod 32, and shifts only by the remainder. One of the less prominent differences between CUDA and OpenCL.
Are both build platforms mapping the HLL shifts to the same hardware shift instruction, just OpenCL is masking the shift count? And does the hardware shift specify "result = 0 if shift count > 31"?

If it's a matter of forcing OpenCL to respect your (unmasked) shift count, is writing a tiny inline-ASM macro for such shifts an option?
2013-03-07, 20:18   #700
Bdot
Nov 2010
Germany
3×199 Posts

Quote:
Originally Posted by ewmayer
Are both build platforms mapping the HLL shifts to the same hardware shift instruction, just OpenCL is masking the shift count? And does the hardware shift specify "result = 0 if shift count > 31"?

If it's a matter of forcing OpenCL to respect your (unmasked) shift count, is writing a tiny inline-ASM macro for such shifts an option?
The OpenCL shift results in a single assembly instruction, with no additional bit masking required. This means that the AMD hardware does the masking automatically. AMD's instruction set documentation also details this behavior. I assume that the NV OpenCL compiler needs to issue additional bit-mask operations to get the same semantics (or issue different instructions, if available).
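The two behaviours discussed here can be modelled side by side in portable C. This is an illustrative sketch only: the function names are made up, the wrapping variant follows the OpenCL C rule that only the low 5 bits of the count are used for 32-bit operands, and the clamping variant models what this thread reports for CUDA rather than quoting any vendor documentation.

```c
#include <stdint.h>

/* Wrapping semantics: the count is reduced mod 32 before shifting. */
uint32_t shl_wrap(uint32_t x, uint32_t n)
{
    return x << (n & 31);
}

/* Clamping semantics: a count of 32 or more yields 0. */
uint32_t shl_clamp(uint32_t x, uint32_t n)
{
    return (n < 32) ? (x << n) : 0;
}
```

The two functions agree for every count below 32 and diverge from 32 upward: shl_wrap(1, 37) gives 32, while shl_clamp(1, 37) gives 0. Code ported between the two platforms can call whichever helper matches the behaviour it was written against instead of relying on the raw << operator.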
2013-03-07, 20:43   #701
chalsall
If I May
"Chris Halsall"
Sep 2002
Barbados
9,767 Posts

Quote:
Originally Posted by Bdot
AMD's instruction set documentation also details this behavior.
Bizarre.

Is there any explanation as to why this is "the way it is done" under OpenCL?

Talk about non-portability of assumptions....
2013-03-07, 23:32   #702
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
7537₁₀ Posts

Quote:
Originally Posted by chalsall
Talk about non-portability of assumptions....
Doesn't the C standard state that x << y is implementation dependent if y is greater than the word size? If so, this is simply a case of non-portable C code written to extract maximum efficiency of a particular architecture. Neither AMD nor Nvidia did anything wrong.
2013-03-08, 00:06   #703
ixfd64
Bemusing Prompter
"Danny"
Dec 2002
California
4533₈ Posts

Quote:
Originally Posted by Bdot
Regarding GPU sieving: It is not so much that the GPU part of OpenCL is so much different from the GPU part of CUDA - most of that can even be solved by one or two dozen #defines. The CPU part of both is so much different. And a few concepts/abilities are hard to "translate" (Assembler inlines, for instance).

Anyway, I have George's GPU sieve running on OpenCL so that it provides some output. I need some verification of it being correct, and I need to do the adaptations of the kernels to read the raw sieve bitfield. As mfakto has many more kernels than mfaktc, I'm still thinking of a smart way to do this ...
That's great news!
2013-03-08, 00:52   #704
chalsall
If I May
"Chris Halsall"
Sep 2002
Barbados
9,767 Posts

Quote:
Originally Posted by Prime95
Doesn't the C standard state that x << y is implementation dependent if y is greater than the word size? If so, this is simply a case of non-portable C code written to extract maximum efficiency of a particular architecture. Neither AMD nor Nvidia did anything wrong.
With the greatest of respect, I think you might be confusing the definition of the language with regards to the case where the operands are different sizes (where a warning should be given), and/or what happens when both the operands are the same size and the values are within the defined types but an overflow occurs.

Quote:
The type of the result is that of the promoted left operand. If the right operand is negative, greater than, or equal to the length in bits of the promoted left operand, the result is undefined.

The value of E1 << E2 is E1 (interpreted as a bit pattern) left-shifted E2 bits. Vacated bits are filled with zeros.

The value of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 is unsigned, or if it is signed and its value is nonnegative, vacated bits are filled with zeros. If E1 is signed and its value is negative, vacated bits are filled with ones.
But, as always, I'm happy to be proven wrong.

Last fiddled with by chalsall on 2013-03-08 at 00:54

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.