mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Reply
 
Thread Tools
Old 2011-04-11, 09:57   #12
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

3·269 Posts
Default

Quote:
Originally Posted by chrisjp View Post
Hey all, Sorry for the long long delay. Lots of changes in life, but here is the barebones opencl code. Currently not usable unless you're quite dedicated :)

It has lots of debug stuff and old code, but you can get the jist of the algorithm.

It currently runs around 93M primes / second with everything precalculated and ready to go on the CPU side on an HD 6870 @ 1Ghz.

I'll try to incorporate it into a .exe sometime this or next week.

Chris
I assume you mean a 6870 @ 900Mhz. The RAM is clocked 1.05Ghz yet the PE's are clocked 900Mhz. Except when you overclock it but i guess that's not recommended for number crunching :)

http://www.amd.com/us/products/deskt...verview.aspx#2

Maybe i should ask you at which bitlevel you are trying FC's, as i was counting at the 880Mhz @ 1536 streamcores i have here (radeon hd 6970), to reach speeds of nearly a billion per second or more this at up to 63 bits, maybe 64 bits.

93M/s is very disappointing i'd argue.

Vincent

Last fiddled with by diep on 2011-04-11 at 10:00
diep is offline   Reply With Quote
Old 2011-04-11, 21:32   #13
chrisjp
 
Jul 2010

A16 Posts
Default

No I mean the 6870 at 1000 Mhz, overclocking hasn't produced any errors but you may do what you wish :).

The code is constant up to 71-bits... I and many others would be interested to see code that can work at 1B/second with these factors :)

Chris

Quote:
Originally Posted by diep View Post
I assume you mean a 6870 @ 900Mhz. The RAM is clocked 1.05Ghz yet the PE's are clocked 900Mhz. Except when you overclock it but i guess that's not recommended for number crunching :)

http://www.amd.com/us/products/deskt...verview.aspx#2

Maybe i should ask you at which bitlevel you are trying FC's, as i was counting at the 880Mhz @ 1536 streamcores i have here (radeon hd 6970), to reach speeds of nearly a billion per second or more this at up to 63 bits, maybe 64 bits.

93M/s is very disappointing i'd argue.

Vincent
chrisjp is offline   Reply With Quote
Old 2011-04-11, 21:35   #14
chrisjp
 
Jul 2010

A16 Posts
Default

Right, but 24-bit instructions get issued in 4-lanes, wheres 32-bits go in one lane. 4 times the throughput.. basic stuff here. It's easy to see the difference if you profile code with 32-bit versus 24-bit multiplies. Fermi can do 32-bit at full-speed, whereas ATI cannot. The code already uses the high and low instructions in other places.

Quote:
Originally Posted by diep View Post
24 bits????

older nvidia's could only do 24 bits (fermi can do 32 now as well). AMD always could do 32 x 32 bits multiplications since the 4000 series AFAIK. One instruction gives top 32 bits another instruction lower 32 bits. I already wanted to post how to get those in openCL, but maybe you want to post this :)
chrisjp is offline   Reply With Quote
Old 2011-04-11, 22:20   #15
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

3·269 Posts
Default

Quote:
Originally Posted by chrisjp View Post
Right, but 24-bit instructions get issued in 4-lanes, wheres 32-bits go in one lane. 4 times the throughput.. basic stuff here. It's easy to see the difference if you profile code with 32-bit versus 24-bit multiplies. Fermi can do 32-bit at full-speed, whereas ATI cannot. The code already uses the high and low instructions in other places.
Right, except that the 6000 series is supposed to have more than doubled that capacity; from 1 in 5 PE's now 2 out of 4 can execute it and you're not issuing multiplication most of the time, there is other instructions as well.

Besides the crucial first 64 bits you cover with 2 units using 2 x 32 bits,
versus 3 x 32 bits for the 24 bits equivalent.

But of course we agree that one should test it all very well, as that'll decide what and how :)

Realize with wagstaff we're still trial factoring around the 10M bits sizes, so getting to 64 bits is already quite something, as default it's 61 bits now.

thanks to the hard work here of the gpu's doing TF for mersenne, did i read it correct that for GIMPS things get TF'ed now at the current range to 69 bits isn't it?

Thanks,
Vincent
diep is offline   Reply With Quote
Old 2011-04-11, 22:25   #16
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

14478 Posts
Default

Quote:
Originally Posted by chrisjp View Post
No I mean the 6870 at 1000 Mhz, overclocking hasn't produced any errors but you may do what you wish :).

The code is constant up to 71-bits... I and many others would be interested to see code that can work at 1B/second with these factors :)

Chris
I calculated with 64 bits of course, as my goal is to Trial Factor to 64 bits of course, where 61 bits got accomplished by my 16 core AMD @ 16 cores, as that's about what you need to run TF to keep Jeff Gilchrist' fed enough exponents to test; maybe it's cold during winters in Canada!
diep is offline   Reply With Quote
Old 2011-04-11, 22:50   #17
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

3×269 Posts
Default

Quote:
Originally Posted by chrisjp View Post
No I mean the 6870 at 1000 Mhz, overclocking hasn't produced any errors but you may do what you wish :).

The code is constant up to 71-bits... I and many others would be interested to see code that can work at 1B/second with these factors :)

Chris
Didn't know you were in the league of overclockers :)

As serious note on the 1B/s. Let's consider i've got 1536 cores @ 880Mhz.
Now let's skip silly definitions of AMD/Nvidia (both multiply their gpu's with a phantom factor to determine number of gflops, in case of AMD that's in the manual explained as well) and look realistic.

The ideal throughput is 1.0 per PE. To reach 1B/s we can throw effectively a tad less, in my case as i'm intending doing the sieving in another wavefront, than 1536 * 0.88 cycles per FC into battle.

That's 1351 cycles for a 64 bits FC.

On a cpu, 1351 cycles is hell of a lot. Within 1351 cycles you can calculate for example what distance it takes to travel around the planet. Even a small bathroom stop is allowed in fact.

I'm testing numbers of around 10-12M now, which all already have been processed to 61 bits.

Let's take a 10M number.

That's 24 bits deep. So to see whether a 64 bits number divides this, let's say roughly we need roughly 37 squares and 37 modulo's. Using barrett (note i remember one day i had reinvented barrett myself - see GMP mailing list - it beated for big numbers the default modulo at the time) that means extension of this 128 bits number to 192 bits. So another multiplication.

Let's assume now stupid schoolboy algorithm, which for GPU's usually is not such a bad idea (huh?).

Let's calculate everything in units for 1 step:

First we have 2 units that we get to 4. That's say 4 multiplications.
Then we have 4 units times 2, that's 8 multiplications.
So 12 multiplications in total.

37 * 12 = 444 multiplications, and of course a lot of extra work, but all the extra work should go at full PE speed.

I wrote things down in a naive manner here, without further indepth research yet, but i hope you see what i was guessing that would be theoretical possible.

Now i also know the theoretist problem - paper supports everything

But what's your estimate on what is possible?

Realize so far i didn't count the sieve yet and that thing isn't exactly 'free' either.

Regards,
Vincent
diep is offline   Reply With Quote
Old 2011-04-11, 23:52   #18
chrisjp
 
Jul 2010

2·5 Posts
Default

My code currently runs (on the 6870 not 6970) in 2334 clock cycles. Roughly 1/10th of that is in the squaring and the remainder in the modulus calculation.

In terms of what could be done at only 64-bits, I believe you could get 400M if you refined my code. The main limitations are going to be the carry steps (lots of serial dependencies that limit your ALU packing) and subtraction steps (no integrated carry registers in the alu's that are exposed to opencl.)

You would also run into problems solely relying on 64-bit(2X32) multiplies as you need guard bits for the carries (no carry register in opencl). So for 61-or-62 bit factors 400M might be possible with a 6970 fairly easily.

But again, then you have to worry about precalculating the mu (like mkafct) on the GPU side to really get good CPU thoroughput.

In short 400M I think I could do at 64 exclusive, but it's hard to know!
Quote:
Originally Posted by diep View Post
Didn't know you were in the league of overclockers :)

As serious note on the 1B/s. Let's consider i've got 1536 cores @ 880Mhz.
Now let's skip silly definitions of AMD/Nvidia (both multiply their gpu's with a phantom factor to determine number of gflops, in case of AMD that's in the manual explained as well) and look realistic.

The ideal throughput is 1.0 per PE. To reach 1B/s we can throw effectively a tad less, in my case as i'm intending doing the sieving in another wavefront, than 1536 * 0.88 cycles per FC into battle.

That's 1351 cycles for a 64 bits FC.

On a cpu, 1351 cycles is hell of a lot. Within 1351 cycles you can calculate for example what distance it takes to travel around the planet. Even a small bathroom stop is allowed in fact.

I'm testing numbers of around 10-12M now, which all already have been processed to 61 bits.

Let's take a 10M number.

That's 24 bits deep. So to see whether a 64 bits number divides this, let's say roughly we need roughly 37 squares and 37 modulo's. Using barrett (note i remember one day i had reinvented barrett myself - see GMP mailing list - it beated for big numbers the default modulo at the time) that means extension of this 128 bits number to 192 bits. So another multiplication.

Let's assume now stupid schoolboy algorithm, which for GPU's usually is not such a bad idea (huh?).

Let's calculate everything in units for 1 step:

First we have 2 units that we get to 4. That's say 4 multiplications.
Then we have 4 units times 2, that's 8 multiplications.
So 12 multiplications in total.

37 * 12 = 444 multiplications, and of course a lot of extra work, but all the extra work should go at full PE speed.

I wrote things down in a naive manner here, without further indepth research yet, but i hope you see what i was guessing that would be theoretical possible.

Now i also know the theoretist problem - paper supports everything

But what's your estimate on what is possible?

Realize so far i didn't count the sieve yet and that thing isn't exactly 'free' either.

Regards,
Vincent
chrisjp is offline   Reply With Quote
Old 2011-04-12, 08:30   #19
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

11001001112 Posts
Default

Quote:
Originally Posted by chrisjp View Post
My code currently runs (on the 6870 not 6970) in 2334 clock cycles. Roughly 1/10th of that is in the squaring and the remainder in the modulus calculation.

...snip
Some sooner here you posted: "It currently runs around 93M primes / second with everything precalculated and ready to go on the CPU side on an HD 6870 @ 1Ghz."

2334 clockcycles @ 1120 PE's ==> 480M /s

Yet you report 93M/s which actually is in number of cycles per FC:

1120G / 93M = 1120k / 93 = 12k cycles per FC.

Quote:
In terms of what could be done at only 64-bits, I believe you could get 400M if you refined my code. The main limitations are going to be the carry steps (lots of serial dependencies that limit your ALU packing) and subtraction steps (no integrated carry registers in the alu's that are exposed to opencl.)

You would also run into problems solely relying on 64-bit(2X32) multiplies as you need guard bits for the carries (no carry register in opencl). So for 61-or-62 bit factors 400M might be possible with a 6970 fairly easily.

But again, then you have to worry about precalculating the mu (like mkafct) on the GPU side to really get good CPU thoroughput.

In short 400M I think I could do at 64 exclusive, but it's hard to know!
No carry possibilities in opencl, if AMD gpu's do allow using it, would be severe problem for gpgpu in opencl at AMD gpu's as it throws away a lot of performance.

Last fiddled with by diep on 2011-04-12 at 08:31
diep is offline   Reply With Quote
Old 2011-04-12, 11:41   #20
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

3·269 Posts
Default

Quote:
Originally Posted by chrisjp View Post
[snip]
In terms of what could be done at only 64-bits, I believe you could get 400M if you refined my code. The main limitations are going to be the carry steps (lots of serial dependencies that limit your ALU packing) and subtraction steps (no integrated carry registers in the alu's that are exposed to opencl.)

You would also run into problems solely relying on 64-bit(2X32) multiplies as you need guard bits for the carries (no carry register in opencl). So for 61-or-62 bit factors 400M might be possible with a 6970 fairly easily.
Let's skip the discussion on what refining is, as probably it means starting from scratch :)

Reality is that being the first writing something always is toughest.

REFINING

As i'm after 64 bits initially, i'll have to start from scratch anyway.

It's not needed to have guard bits at all of course, just it's a lot more expensive than just adding the carry.

Much depends upon the throughput independant instructions at the gpu's.

When writing a special implementation for 61-63 bits or so, so something less than 64, you will need to shiftright top bits in order to add them.

That might not be so fast.

It might be a problem for 2 reasons, but of course it would require alternatives to be there. These gpu's have fast CMOV capabilities i hope.

The instructionset of amd already for years shows at gpu level that they have really a lot of different instructions for that.

Now what i don't know is how many PE's actually can execute it.

Manual carry:

adding A and B:

Code:
carry = 0;
y       = A|B;
x       = A+B;
if( x < y ) carry = 1;


Till so far for now.


Last fiddled with by diep on 2011-04-12 at 11:42
diep is offline   Reply With Quote
Old 2011-04-12, 12:13   #21
xilman
Bamboozled!
 
xilman's Avatar
 
"๐’‰บ๐’ŒŒ๐’‡ท๐’†ท๐’€ญ"
May 2003
Down not across

2·17·347 Posts
Default

Quote:
Originally Posted by diep View Post
Manual carry:

adding A and B:

Code:
carry = 0;
y       = A|B;
x       = A+B;
if( x < y ) carry = 1;
You don't give declarations for the types of A and B but I'm going to assume that they are word-sized unsigned integers. If so, a standard and usually efficient way of computing the carry uses fewer lines of code and fewer operations on the variables than your sample above. A C function runs something like:
Code:
/* add with carry-out function.
  Result of addition is stored where the first argument points.
  Carry_out is returned as the value of the function.
*/
unsigned adc (unsigned *sum, unsigned a, unsigned b)
{
   *sum = a+b;
   return *sum < a;   /* sum is unsigned less than a if the addition overflowed */
}
Paul
xilman is offline   Reply With Quote
Old 2011-04-12, 12:27   #22
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

3·269 Posts
Default

Quote:
Originally Posted by xilman View Post
You don't give declarations for the types of A and B but I'm going to assume that they are word-sized unsigned integers. If so, a standard and usually efficient way of computing the carry uses fewer lines of code and fewer operations on the variables than your sample above. A C function runs something like:
Code:
/* add with carry-out function.
  Result of addition is stored where the first argument points.
  Carry_out is returned as the value of the function.
*/
unsigned adc (unsigned *sum, unsigned a, unsigned b)
{
   *sum = a+b;
   return *sum < a;   /* sum is unsigned less than a if the addition overflowed */
}
Paul
Very good comments that it can be done without the OR as you can already prove 'sum' has to be smaller than a.

Yet putting it in a seperated function i guess is a lot slower in opencl. i didn't really test it to the limit, but i'd expect some crap compiler/interpreter running it, so all sorts of modern features todays compilers have i wouldn't expect it to do very well :)
diep is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Extracting the Modulus from publickeyblob RSA 512 26B Homework Help 2 2014-11-30 07:31
Free Trials of GPU Cloud Computing Resources NBtarheel_33 GPU to 72 9 2013-07-31 15:32
Guantanamo trials to be restarted garo Soap Box 39 2011-03-22 23:07
It's possible to calculate an unknown RSA modulus? D2MAC Math 8 2010-12-26 16:32
Factorization of a 768-bit RSA modulus akruppa Factoring 53 2010-03-12 13:50

All times are UTC. The time now is 15:12.


Fri Jul 7 15:12:10 UTC 2023 up 323 days, 12:40, 0 users, load averages: 1.20, 1.08, 1.11

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

โ‰  ยฑ โˆ“ รท ร— ยท โˆ’ โˆš โ€ฐ โŠ— โŠ• โŠ– โŠ˜ โŠ™ โ‰ค โ‰ฅ โ‰ฆ โ‰ง โ‰จ โ‰ฉ โ‰บ โ‰ป โ‰ผ โ‰ฝ โŠ โА โŠ‘ โŠ’ ยฒ ยณ ยฐ
โˆ  โˆŸ ยฐ โ‰… ~ โ€– โŸ‚ โซ›
โ‰ก โ‰œ โ‰ˆ โˆ โˆž โ‰ช โ‰ซ โŒŠโŒ‹ โŒˆโŒ‰ โˆ˜ โˆ โˆ โˆ‘ โˆง โˆจ โˆฉ โˆช โจ€ โŠ• โŠ— ๐–• ๐–– ๐–— โŠฒ โŠณ
โˆ… โˆ– โˆ โ†ฆ โ†ฃ โˆฉ โˆช โІ โŠ‚ โŠ„ โŠŠ โЇ โŠƒ โŠ… โŠ‹ โŠ– โˆˆ โˆ‰ โˆ‹ โˆŒ โ„• โ„ค โ„š โ„ โ„‚ โ„ต โ„ถ โ„ท โ„ธ ๐“Ÿ
ยฌ โˆจ โˆง โŠ• โ†’ โ† โ‡’ โ‡ โ‡” โˆ€ โˆƒ โˆ„ โˆด โˆต โŠค โŠฅ โŠข โŠจ โซค โŠฃ โ€ฆ โ‹ฏ โ‹ฎ โ‹ฐ โ‹ฑ
โˆซ โˆฌ โˆญ โˆฎ โˆฏ โˆฐ โˆ‡ โˆ† ฮด โˆ‚ โ„ฑ โ„’ โ„“
๐›ข๐›ผ ๐›ฃ๐›ฝ ๐›ค๐›พ ๐›ฅ๐›ฟ ๐›ฆ๐œ€๐œ– ๐›ง๐œ ๐›จ๐œ‚ ๐›ฉ๐œƒ๐œ— ๐›ช๐œ„ ๐›ซ๐œ… ๐›ฌ๐œ† ๐›ญ๐œ‡ ๐›ฎ๐œˆ ๐›ฏ๐œ‰ ๐›ฐ๐œŠ ๐›ฑ๐œ‹ ๐›ฒ๐œŒ ๐›ด๐œŽ๐œ ๐›ต๐œ ๐›ถ๐œ ๐›ท๐œ™๐œ‘ ๐›ธ๐œ’ ๐›น๐œ“ ๐›บ๐œ”