mersenneforum.org  

Old 2011-06-08, 09:44   #12
diep

Quote:
Originally Posted by KingKurly View Post
Would the Radeon HD 5450 work for this, eventually? I am using Linux, so I would have to wait for that port to be ready.

http://www.amd.com/us/products/deskt...verview.aspx#2

Linux's 'lspci' says:
02:00.0 VGA compatible controller: ATI Technologies Inc Cedar PRO [Radeon HD 5450]
Of course this one will work fine under Linux. Linux has a great driver.
Old 2011-06-08, 09:59   #13
diep

Quote:
Originally Posted by KingKurly View Post
Would the Radeon HD 5450 work for this, eventually? I am using Linux, so I would have to wait for that port to be ready.
Note that besides the driver you also need to install the APP SDK 2.4; both are available on the AMD site.
Old 2011-06-08, 11:49   #14
Bdot

Thanks for the replies and help offers, I'll contact you directly with more info.

Quote:
Originally Posted by diep View Post
If you have written it in OpenCL, what do you need a Linux port for? OpenCL works on Linux, doesn't it?
The sieve, the handling of command-line parameters, reading config files, and writing result files are all taken over from Oliver's code and should be no trouble on Linux. What I need to port is all the OpenCL initialization and driving of the GPU. I don't expect a lot of issues, yet it has to be done. Plus, I don't even have the AMD SDK installed on Linux yet.
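To give an idea, the part that has to be ported is roughly the standard OpenCL host boilerplate below (just a generic sketch with error handling trimmed, not the actual mfakto code):

Code:
/* Generic OpenCL host setup: platform -> GPU device -> context -> queue. */
#include <CL/cl.h>

cl_int init_gpu(cl_context *ctx, cl_command_queue *queue, cl_device_id *dev)
{
    cl_platform_id platform;
    cl_int status;

    status = clGetPlatformIDs(1, &platform, NULL);            /* first platform  */
    if (status != CL_SUCCESS) return status;

    status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU,     /* GPU devices only */
                            1, dev, NULL);
    if (status != CL_SUCCESS) return status;

    *ctx = clCreateContext(NULL, 1, dev, NULL, NULL, &status);
    if (status != CL_SUCCESS) return status;

    *queue = clCreateCommandQueue(*ctx, *dev, 0, &status);
    return status;
}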

Quote:
Originally Posted by diep View Post
P.S. Do you also test in the same manner as TheJudger, just multiplying zeros?
Hehe, yes. Just multiplying zeros. Very fast, I tell you!

Quote:
Originally Posted by diep View Post
Heh, Bdot, what speed does your card run at?
To OpenCL it claims to have 10 compute cores (which would be 800 PEs), but AMD docs for HD 5750 say 9 cores (720 PEs) ... not sure about this. Claimed clock speed is 850 MHz.
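For reference, those "claimed" numbers are simply what clGetDeviceInfo reports; a small sketch of such a query (not mfakto code, and the 80-PEs-per-unit factor assumes the VLIW5 layout):

Code:
/* Sketch: read what OpenCL reports for a device. */
#include <stdio.h>
#include <CL/cl.h>

void print_device_caps(cl_device_id device)
{
    cl_uint cu = 0, mhz = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,   sizeof(cu),  &cu,  NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(mhz), &mhz, NULL);
    /* On the VLIW5 chips each compute unit is 16 x 5 = 80 processing elements. */
    printf("%u compute units (= %u PEs on VLIW5) @ %u MHz\n", cu, cu * 80, mhz);
}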

Quote:
Originally Posted by vsuite View Post
Hi, would it work with an integrated ATI GPU?
Actually, I don't know. If you manage to install the AMD APP SDK (http://developer.amd.com/sdks/AMDAPP...s/default.aspx) then it will run, but it may ignore the GPU and run on the CPU instead. You may end up with only 1.5M/s.

Quote:
Originally Posted by vsuite View Post
How soon the windows 32bit version?
Hmm, I did not think that was still necessary. But I just compiled the 32-bit version and it is running the self-test just fine. So no big deal if it's needed.

Quote:
Originally Posted by KingKurly View Post
Would the Radeon HD 5450 work for this, eventually? I am using Linux, so I would have to wait for that port to be ready.
It would certainly work, but comparing the specs, I'd expect a speed of ~4M/s with the current kernel. mprime on your CPU is probably way faster (but those 4M/s would come on top of it at almost no CPU cost). Worth testing anyway.
Old 2011-06-08, 16:30   #15
diep

Email: diep@xs4all.nl
I'm "diepchess" on a range of messengers, from Skype to AIM to Yahoo, etc., and sometimes (seldom) on some IRC servers. I had also sent you that in a private message, but maybe you didn't get a notification there.

Let me ask TheJudger whether he has a GPL header on top of his code as well; I haven't checked out his latest code.
Old 2011-06-08, 16:42   #16
diep

Quote:
Originally Posted by Bdot View Post
It would certainly work, but comparing the specs I'd expect a speed of ~4M/s with the current kernel. mprime on your CPU is probably way faster (but those 4M/s would be on top of it at almost no CPU cost). Worth testing anyway.
The 5450 has 1 SIMD array, or better: 1 compute unit.

http://www.amd.com/us/products/deskt...verview.aspx#2

Most sites quote it at a 650 MHz core frequency.

In the long run I'd expect its speed for the 72-bit kernel to be around 8M/s - 12M/s once APP SDK 2.5 releases, not counting the sieving that generates the FCs.

That is the same speed as two AMD64 cores achieve.

The theoretical calculation is not so complicated, I'd say. The card delivers 650 MHz * 80 PEs == 52,000 million instructions per second. A naive guess at the cost of a single FC operation is that one needs at most around 4,500 instructions for it. That's a pretty safe guess, if I may say so. 52,000M / 4,500 ≈ 11.5M/s, which is where the range above comes from.

Not achieving that would be very bad in OpenCL. If the compiler somehow screws up there, it might be necessary to wait for the next SDK for AMD to fix it.

Vincent

Old 2011-06-08, 19:22   #17
stefano.c

Quote:
Originally Posted by Bdot View Post
Thanks for the replies and help offers, I'll contact you directly with more info.
For email: sigma_gimps_team@yahoo.it
Old 2011-06-09, 00:52   #18
Ken_g6

Quote:
Originally Posted by Bdot View Post
If you manage to install the AMD APP SDK (http://developer.amd.com/sdks/AMDAPP...s/default.aspx) then it will run, but it may ignore the GPU and run on the CPU instead. You may end up with only 1.5M/s.
I dealt with that! I eventually found this selects only GPUs:
Code:
  cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM,
    (cl_context_properties)platform,
    0
  };

  context = clCreateContextFromType(cps, CL_DEVICE_TYPE_GPU, NULL, NULL, &status);
Feel free to browse my ppsieve-cl source; just remember that it's GPLed if you want to use any of the rest of it.
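(The `platform` handle in the snippet is assumed to have been obtained earlier, roughly like this; error handling omitted:)

Code:
/* Sketch: obtaining the `platform` used above. */
cl_platform_id platform = NULL;
cl_uint num_platforms = 0;

clGetPlatformIDs(0, NULL, &num_platforms);   /* how many platforms are installed? */
if (num_platforms > 0)
    clGetPlatformIDs(1, &platform, NULL);    /* just take the first one */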

Meanwhile, having seen Oliver's source code, I'd like to know how you handle 24*24=48-bit integer multiplies quickly in OpenCL. He used assembly, and I found assembly to be useful in my CUDA app as well, but I just couldn't find an equivalent in OpenCL. That, and I couldn't make shifting the 64-bit integers right by 32 bits be parsed as a 0-cycles-required no-op (which it ought to be).

Old 2011-06-09, 09:20   #19
diep

Quote:
Originally Posted by Ken_g6 View Post
Meanwhile, having seen Oliver's source code, I'd like to know how you handle 24*24=48-bit integer multiplies quickly in OpenCL. He used assembly, and I found assembly to be useful in my CUDA app as well, but I just couldn't find an equivalent in OpenCL. That, and I couldn't make shifting the 64-bit integers right by 32 bits be parsed as a 0-cycles-required no-op (which it ought to be).
See my other postings elsewhere. I figured out that AMD had forgotten to expose in OpenCL the top-16-bits instruction (the high part of the 24x24-bit multiply).

Initially, as usual, someone posted misinformation claiming it would be a slow instruction. It was posted on the AMD forum; it seemed to be an AMD guy, but I'm not sure. You never know in that forum.

I then filed an official question with AMD asking how fast it would be. How many people would have done that after someone had already answered it in the forum?

The answer, after some weeks, was: it runs at full speed.

The logical follow-up request was to include it in OpenCL. It's possible they do that for APP SDK 2.5.

So until then it takes 5 cycles to get the full 48 bits: 1 instruction for the low 32 bits and 4 cycles for the top bits using the 32x32 high-bits instruction (and we assume, of course, that you have no more than 24 bits of information per operand; otherwise you'll need another two ANDs).
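In OpenCL C terms the current workaround looks roughly like this (a sketch of the idea, not actual kernel code; mul24() gives the cheap low 32 bits, mul_hi() is the expensive 32x32 high-word path):

Code:
/* OpenCL C sketch: a and b each hold at most 24 bits of information. */
ulong mul_24x24_48(uint a, uint b)
{
    uint lo = mul24(a, b);     /* low 32 bits of the 48-bit product: the cheap part */
    uint hi = mul_hi(a, b);    /* bits 32..47 via the 32x32 high-word multiply:     */
                               /* the part that currently costs the extra cycles    */
    return ((ulong)hi << 32) | lo;
}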

This means that with APP SDK 2.5 the 24x24=48-bit operands would run at unrivalled speed on AMD GPUs in OpenCL.

How many of the guys who 'toyed' with those GPUs in OpenCL have been asleep?

I don't know, sir, but if I look at the AMD helpdesk, their official ticket system only progresses a few questions a week. My initial questions, though asked over a period of a few days, had sequential numbers.

Nearly no one asks them official questions. I find it weird that it never improved; they hardly get official feedback!

In those forums, just like here, people just shout something.

You can throw away all those benchmarks of how fast those GPUs are for integers in OpenCL.

Obviously one needs patience if you figure out how to do something faster.

Yet this is an immense speedup. The quick port by chrisjp of TheJudger's code to OpenCL uses this slow path, you know.

It is not so hard to write down what is fastest until the release of the 2.5 APP SDK.

That is a 70-bit implementation that stores 14 bits per limb, using 5 limbs.

With multiply-add it then eats 25 cycles for the multiplication; add a bit of shifting and so on, and you soon end up at nearly 50 cycles. That is what my guess was based on that it would take around 4,500 cycles to do the complete FC test.
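A rough illustration of that limb layout (just a sketch of the idea, not working kernel code): 5 limbs of 14 bits hold 70 bits, every 14x14-bit partial product fits comfortably in 32 bits, and all 25 partial products can go through mad24().

Code:
/* OpenCL C sketch: 5x5 schoolbook multiply with 14-bit limbs (70-bit numbers).
   a[0..4] and b[0..4] each hold 14 bits; r[0..9] receives the 140-bit product.
   Each column sums at most 5 products of < 2^28, so it stays below 2^32. */
#define LIMB_BITS 14
#define LIMB_MASK ((1u << LIMB_BITS) - 1u)

void mul70(const uint a[5], const uint b[5], uint r[10])
{
    uint c[10] = {0};

    for (int i = 0; i < 5; i++)
        for (int j = 0; j < 5; j++)
            c[i + j] = mad24(a[i], b[j], c[i + j]);   /* 25 multiply-adds */

    uint carry = 0;                      /* normalize back to 14-bit limbs */
    for (int k = 0; k < 10; k++) {
        uint t = c[k] + carry;
        r[k]  = t & LIMB_MASK;
        carry = t >> LIMB_BITS;
    }
}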

These AMD GPUs deliver many teraflops, however; completely unrivalled.

I would guess that modifying Bdot's code to use 5 limbs at 70 bits would be peanuts to do, and that will get a high speed.

So the peak of the 5870, and especially the 6970, theoretically lies somewhere around 1.351 T instructions/s / 4,500 = 1,351,000M / 4,500 ≈ 300M/s.

Not based upon multiplying zeros, yet it's a theoretical number. Don't forget the problem with theoretical numbers :)

Then, when APP SDK 2.5 releases, we can suddenly build a fast 72-bit kernel with quite a bit higher peak performance :)
Old 2011-06-09, 09:40   #20
diep

Quote:
Originally Posted by Ken_g6 View Post
That, and I couldn't make shifting the 64-bit integers right by 32 bits be parsed as a 0-cycles-required no-op (which it ought to be).
The 0-cycles-required no-op is nonsense. This hardware delivers teraflops; there is nothing else that delivers that much which you can buy for $90 on eBay.

It is already great if everything eats 1 cycle. There is no competition possible against those AMD GPUs at the 24x24=48-bit level.

It's just that most 'testers' are not clever enough to actually check the architecture manual to see whether there is a possibility to do it faster.

Forget 64-bit thinking; these GPUs are inherently 32-bit entities. If you want to force them to throw in extra transistors to do 64-bit arithmetic, that's going to penalize our TF speed, as what we really want is a petaflop in 32 bits.

If you take a good look at TheJudger's code, the fastest kernel is based upon 24-bit multiplications.

That is for a good reason. It's naive to believe that the GPUs will go 64-bit.

Graphics in itself can do with 16 bits, so having 24 or even 32 bits is already a luxury.

64-bit integers will always be too slow on those GPUs, of course.

Instead of extra transistors thrown away on that, I prefer another few thousand PEs.

In the first place, those GPUs deliver all this calculation power because all the kids who game on them pay for it. Only when sold in huge quantities, billions of those GPUs, can the price stay at its current level.

The "fast high-end HPC" chips, take POWER6 or POWER7, what are they, like $200k per unit or so?

A single GPU has the same performance, even at double precision, as such an entire POWER7.

It's just that programming for it is a lot more complicated, and that's because AMD and Nvidia simply have no good explanation of the internal workings of their GPUs. Very silly, yet it's the truth.

The factories that produce this hardware cost many billions to build. Intel was expecting that by 2020 a new factory would have a price of $20 billion. With each new process technology the price of the factories goes up as well. Intel calls that Moore's second law; google for it.

This means that only processors that get produced in HUGE QUANTITY have a chance of getting sold at an economic price.

Right now all those gamers pay for that development, as I said earlier, and what's fastest for them doesn't involve any 64-bit logic.

Vincent
Old 2011-06-09, 11:03   #21
TheJudger

Hi Vincent,

Quote:
Originally Posted by diep View Post
If you take a good look at TheJudgers code the fastest kernel is based upon 24 bits multiplications.
This is not 100% true. For compute capability 1.x GPUs it is, but the "barrett79_mul32" kernel comes very close to the speed of the "71bit_mul24" kernel. For compute capability 2.x GPUs the "71bit_mul24" kernel is the slowest of the five kernels. Only the "71bit_mul24" kernel uses 24-bit multiplications.

Oliver
Old 2011-06-09, 13:05   #22
diep

Quote:
Originally Posted by TheJudger View Post
This is not 100% true. For compute capability 1.x GPUs it is, but the "barrett79_mul32" kernel comes very close to the speed of the "71bit_mul24" kernel. For compute capability 2.x GPUs the "71bit_mul24" kernel is the slowest of the five kernels. Only the "71bit_mul24" kernel uses 24-bit multiplications.
And that isn't 64-bit code either; that's my most important point. It's all 32-bit code.

Now, where AMD has a similar number of units providing 32x32-bit multiplications as Nvidia, for 24x24=48 bits it has 4 times that amount (in fact, in the 6000 series it uses precisely 4 simple units to emulate 32x32-bit, 64-bit, or double-precision operations).

Now, it might be easier to code things in 64 bits (we would still be missing a 64x64=128-bit multiplication then), yet we must face that the GPUs stay 32-bit.

As for AMD GPUs: in the time it takes to multiply 32x32=64 bits, which is 2 instructions, you can push through 8 simple instructions.

8 simple instructions will do 4 multiplies of 24x24=48 bits, so that outputs 192 bits in total, whereas the 2-instruction 32x32 multiply outputs 64 bits in the same timespan.

For Nvidia that break-even point is a tad different, of course, as you correctly indicate.

So per cycle a single AMD GPU can output 1536 * 48 * 0.5 = 36864 bits.

The older AMD GPUs on paper even a tad more than that.

Good codes really achieve a very high IPC on those modern GPUs (Fermi, 5000 or 6000 series).

How many integer bits can a GTX 580 output per cycle?
What was it, using 32x32-bit multiplications, something like:

512 * 64 * 0.5 = 16384 bits

Is that correct?

Or was it 2 cycles per instruction, making it 8192 bits per cycle?
