mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Reply
 
Thread Tools
Old 2011-04-12, 12:33   #23
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

3×269 Posts
Default

So that makes the new adding of carry code, without opencl support for it:

Code:
carry = 0;
x       = A+B;
if( x < A ) carry = 1;
diep is offline   Reply With Quote
Old 2011-04-12, 14:38   #24
xilman
Bamboozled!
 
xilman's Avatar
 
"๐’‰บ๐’ŒŒ๐’‡ท๐’†ท๐’€ญ"
May 2003
Down not across

101110000101102 Posts
Default

Quote:
Originally Posted by diep View Post
Very good comments that it can be done without the OR as you can already prove 'sum' has to be smaller than a.

Yet putting it in a seperated function i guess is a lot slower in opencl. i didn't really test it to the limit, but i'd expect some crap compiler/interpreter running it, so all sorts of modern features todays compilers have i wouldn't expect it to do very well :)
Putting it in a function was purely a way of giving you a self-contained example which will compile without any additional material. In production C it would be typical to put the code in-line and avoid a function call overhead (though a modern compiler would probably in-line it anyway).

Paul
xilman is offline   Reply With Quote
Old 2011-04-12, 14:41   #25
xilman
Bamboozled!
 
xilman's Avatar
 
"๐’‰บ๐’ŒŒ๐’‡ท๐’†ท๐’€ญ"
May 2003
Down not across

2×17×347 Posts
Default

Quote:
Originally Posted by diep View Post
So that makes the new adding of carry code, without opencl support for it:

Code:
carry = 0;
x       = A+B;
if( x < A ) carry = 1;
I don't write in opencl, but does the language not have if-then-else?

If so
Code:
x = A+B;
if (x<A) carry=1; else carry=0;
only assigns to carry once. Your code assigns it 1.5 times on average.


Paul
xilman is offline   Reply With Quote
Old 2011-04-12, 15:08   #26
Robert Holmes
 
Robert Holmes's Avatar
 
Oct 2007

2×53 Posts
Default

One can try to use 64-bit integers and hope the compiler is smart enough to use internal ADC:

Code:
x = (u64)A+B;
carry = x>>32;
Robert Holmes is offline   Reply With Quote
Old 2011-04-12, 16:32   #27
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

3×269 Posts
Default

Quote:
Originally Posted by Robert Holmes View Post
One can try to use 64-bit integers and hope the compiler is smart enough to use internal ADC:

Code:
x = (u64)A+B;
carry = x>>32;
This might go wrong though, the hardware can at most shiftright 31 bits, so compiler might produce random results.
diep is offline   Reply With Quote
Old 2011-04-12, 16:38   #28
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

3×269 Posts
Default

Quote:
Originally Posted by xilman View Post
I don't write in opencl, but does the language not have if-then-else?

If so
Code:
x = A+B;
if (x<A) carry=1; else carry=0;
only assigns to carry once. Your code assigns it 1.5 times on average.


Paul
If then else is the slowest possible thing at GPU's, except if it produces a CMOV type instruction here. Same is true what i wrote; i assume both codes will produce the CMOV and therefore the exact same code.


Relevant is of course to test it all at the opencl compiler and see how fast it is compared to shifting 31 bits and compared to not doing it at all (thugh producing wrong result, it's about the speedwise compare).

Note that another big problem of GPU's is the hidden latencies.

If we have the next sequential thing in the code:

Code:
x = a+b;
instruction using 'x' here
So the code uses 'x' directly after it wrote to 'x'. This will give a huge penalty. Say 4 cycles or so.
A way to avoid that is to already have another wavefront started at the gpu that runs while it waits for
the result of 'x' here. Yet in prime number code it might be important to schedule the code such that we can avoid another thread from needing to run there; as then we lose registers which we all need to store the primebase for the sieve generating factor candidates - bigger primebase means we waste less system time on testing composites in the form 2 np + 1 with trial factoring.

p.s. not so real important in this context:

As for too many functions in languages, we see in C++ the problem of that. Also the modern compilers have big problems with deep layers of classes, which seemingly would be easy to 'inline'. Resulting C++ codes nearly always are factors slower than C code, which is weird if you realize semantically usually there is no difference between the constructs.

Very few coders on the planet manage to produce in C++ code that's equally fast to C code, meanwhile really looking like C++ code :)

pps this is also 1 explanation why some CUDA codes seem so fast compared to the C++ alternatives; imperative languages simply are faster practical; that is if the code can get managed by 1 person. Obviously for companies there is huge advantages in using C++ for their projects - that's not the discussion here

Last fiddled with by diep on 2011-04-12 at 16:47
diep is offline   Reply With Quote
Old 2011-04-12, 17:03   #29
Robert Holmes
 
Robert Holmes's Avatar
 
Oct 2007

6A16 Posts
Default

Quote:
Originally Posted by diep View Post
This might go wrong though, the hardware can at most shiftright 31 bits, so compiler might produce random results.
Yes, that's why the compiler must be smart about it and recognize what we're trying to do.

Another option is:

Code:
x = A+B;
carry = A>=(-B);
This one's more attractive, as carry and sum calculation become independent. It works because -B is equivalent to 2^32-B; A > 2^32-B => A+B > 2^32.
Robert Holmes is offline   Reply With Quote
Old 2011-04-12, 17:17   #30
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

3×269 Posts
Default

Quote:
Originally Posted by Robert Holmes View Post
Yes, that's why the compiler must be smart about it and recognize what we're trying to do.

Another option is:

Code:
x = A+B;
carry = A>=(-B);
This one's more attractive, as carry and sum calculation become independent. It works because -B is equivalent to 2^32-B; A > 2^32-B => A+B > 2^32.
This is a rather good idea to be tried!

Vincent

p.s. checking whether it has an efficient negate instruction ;)

Note i also have to write out all cases to check whether it works correct, but that's because i didn't see this one before :)

Last fiddled with by diep on 2011-04-12 at 17:20
diep is offline   Reply With Quote
Old 2011-04-12, 19:07   #31
xilman
Bamboozled!
 
xilman's Avatar
 
"๐’‰บ๐’ŒŒ๐’‡ท๐’†ท๐’€ญ"
May 2003
Down not across

1179810 Posts
Default

Quote:
Originally Posted by diep View Post
If then else is the slowest possible thing at GPU's, except if it produces a CMOV type instruction here.
Fair enough. I did say that I don't speak opencl.

Paul

Last fiddled with by xilman on 2011-04-12 at 19:08 Reason: Fix typI
xilman is offline   Reply With Quote
Old 2011-04-12, 19:19   #32
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

3·269 Posts
Default

Quote:
Originally Posted by xilman View Post
Fair enough. I did say that I don't speak opencl.

Paul
Well you adress a fundamental problem which is language independant: GPU's are very ugly with branches, to say polite. In the luckiest case you can have, they execute *all code* in both chains of the branches. So if a branch doesn't get taken:

Code:
if( randomnumberPEdependant&1 ) {
  bla bla;
}
else {
  yep;
}
It will here also suffer from bla bla and yep; Of course not execute the code, but you can be sure your delay is at least the same as the time it takes to execute blabla and yep together;

Of course this only if a branch gets taken by some PE's; if all of them do not take it, maybe it's ok (have to check - let's not count at it for now).

Last fiddled with by diep on 2011-04-12 at 19:26
diep is offline   Reply With Quote
Old 2011-04-13, 04:43   #33
chrisjp
 
Jul 2010

2×5 Posts
Default CMOV/Carries

You are correct about if/else, usually for simple arithmetic like in my code it gets easily translated to a conditional move. This saves tons of gpu cycles, a simple if/else if around 80 cycles on the newest hardware just for flow control if it isn't compiled as CMOV.

The issue with implementing any of the carries this way for me, is we are currently using a 6-wide 24-bit integer multiplication (3X3) for the barrett and similar for the squaring. If you use 32-bits with the 1-bit carry you end up needing to ripple through the carry on each multiply, which is quite slow.

If you are only doing 64-bits this is less of a problem of course.
chrisjp is offline   Reply With Quote
Reply



Similar Threads
Thread Thread Starter Forum Replies Last Post
Extracting the Modulus from publickeyblob RSA 512 26B Homework Help 2 2014-11-30 07:31
Free Trials of GPU Cloud Computing Resources NBtarheel_33 GPU to 72 9 2013-07-31 15:32
Guantanamo trials to be restarted garo Soap Box 39 2011-03-22 23:07
It's possible to calculate an unknown RSA modulus? D2MAC Math 8 2010-12-26 16:32
Factorization of a 768-bit RSA modulus akruppa Factoring 53 2010-03-12 13:50

All times are UTC. The time now is 15:14.


Fri Jul 7 15:14:46 UTC 2023 up 323 days, 12:43, 0 users, load averages: 1.82, 1.33, 1.19

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

โ‰  ยฑ โˆ“ รท ร— ยท โˆ’ โˆš โ€ฐ โŠ— โŠ• โŠ– โŠ˜ โŠ™ โ‰ค โ‰ฅ โ‰ฆ โ‰ง โ‰จ โ‰ฉ โ‰บ โ‰ป โ‰ผ โ‰ฝ โŠ โА โŠ‘ โŠ’ ยฒ ยณ ยฐ
โˆ  โˆŸ ยฐ โ‰… ~ โ€– โŸ‚ โซ›
โ‰ก โ‰œ โ‰ˆ โˆ โˆž โ‰ช โ‰ซ โŒŠโŒ‹ โŒˆโŒ‰ โˆ˜ โˆ โˆ โˆ‘ โˆง โˆจ โˆฉ โˆช โจ€ โŠ• โŠ— ๐–• ๐–– ๐–— โŠฒ โŠณ
โˆ… โˆ– โˆ โ†ฆ โ†ฃ โˆฉ โˆช โІ โŠ‚ โŠ„ โŠŠ โЇ โŠƒ โŠ… โŠ‹ โŠ– โˆˆ โˆ‰ โˆ‹ โˆŒ โ„• โ„ค โ„š โ„ โ„‚ โ„ต โ„ถ โ„ท โ„ธ ๐“Ÿ
ยฌ โˆจ โˆง โŠ• โ†’ โ† โ‡’ โ‡ โ‡” โˆ€ โˆƒ โˆ„ โˆด โˆต โŠค โŠฅ โŠข โŠจ โซค โŠฃ โ€ฆ โ‹ฏ โ‹ฎ โ‹ฐ โ‹ฑ
โˆซ โˆฌ โˆญ โˆฎ โˆฏ โˆฐ โˆ‡ โˆ† ฮด โˆ‚ โ„ฑ โ„’ โ„“
๐›ข๐›ผ ๐›ฃ๐›ฝ ๐›ค๐›พ ๐›ฅ๐›ฟ ๐›ฆ๐œ€๐œ– ๐›ง๐œ ๐›จ๐œ‚ ๐›ฉ๐œƒ๐œ— ๐›ช๐œ„ ๐›ซ๐œ… ๐›ฌ๐œ† ๐›ญ๐œ‡ ๐›ฎ๐œˆ ๐›ฏ๐œ‰ ๐›ฐ๐œŠ ๐›ฑ๐œ‹ ๐›ฒ๐œŒ ๐›ด๐œŽ๐œ ๐›ต๐œ ๐›ถ๐œ ๐›ท๐œ™๐œ‘ ๐›ธ๐œ’ ๐›น๐œ“ ๐›บ๐œ”