mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfakto: an OpenCL program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=15646)

diep 2011-06-08 09:44

[QUOTE=KingKurly;263233]Would the Radeon HD 5450 work for this, eventually? I am using Linux, so I would have to wait for that port to be ready.

[URL]http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-5450-overview/pages/hd-5450-overview.aspx#2[/URL]

Linux's 'lspci' says:
02:00.0 VGA compatible controller: ATI Technologies Inc Cedar PRO [Radeon HD 5450][/QUOTE]

Of course this one will work fine under Linux. Linux has a great driver.

diep 2011-06-08 09:59

[QUOTE=KingKurly;263233]Would the Radeon HD 5450 work for this, eventually? I am using Linux, so I would have to wait for that port to be ready.[/QUOTE]

Note that besides the driver you also need to install the APP SDK 2.4; both are available on the AMD site.

Bdot 2011-06-08 11:49

Thanks for the replies and help offers, I'll contact you directly with more info.

[QUOTE=diep;263199]If you have written OpenCL, where do you need a linux port for, as OpenCL is working for linux isn't it?
[/QUOTE]

The sieve, handling of command line parameters, reading config files, and writing result files are all taken over from Oliver and should be no trouble on Linux. What I need to port is all the OpenCL initialization and driving the GPU. I don't expect a lot of issues, yet it has to be done. Plus, I don't even have the AMD SDK installed on Linux yet :smile:.

[QUOTE=diep;263199]
p.s. you also test in same manner like TheJudger, just multiplying zero's?
[/QUOTE]

Hehe, yes. Just multiplying zeros. Very fast, I tell you!

[QUOTE=diep;263200]heh Bdot what speed does your card run at?
[/QUOTE]

To OpenCL it claims to have 10 compute cores (which would be 800 PEs), but AMD docs for HD 5750 say 9 cores (720 PEs) ... not sure about this. Claimed clock speed is 850 MHz.
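(For reference, a minimal sketch of where such numbers come from; these are the standard OpenCL host queries, not mfakto code, and they assume `device` already holds a valid cl_device_id:)

[code]
/* query what the OpenCL runtime reports for a device (illustration only) */
cl_uint units = 0, mhz = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,   sizeof(units), &units, NULL);
clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(mhz),   &mhz,   NULL);
printf("%u compute units @ %u MHz\n", units, mhz);  /* e.g. "10 compute units @ 850 MHz" */
[/code]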

[QUOTE=vsuite;263212]Hi, would it work with an integrated ATI GPU?
[/QUOTE]

Actually, I don't know. If you manage to install the AMD APP SDK ([URL]http://developer.amd.com/sdks/AMDAPPSDK/downloads/Pages/default.aspx[/URL]) then it will run, but it may ignore the GPU and run on the CPU instead. You may end up with 1.5M/s :smile:

[QUOTE=vsuite;263212]How soon the windows 32bit version?
[/QUOTE]

Hmm, I did not think that was still necessary. But I just compiled the 32-bit version and it is running the self-test just fine. So it's no big deal if it's needed.

[QUOTE=KingKurly;263233]Would the Radeon HD 5450 work for this, eventually? I am using Linux, so I would have to wait for that port to be ready.
[/QUOTE]

It would certainly work, but comparing the specs I'd expect a speed of ~4M/s with the current kernel. mprime on your CPU is probably way faster (but those 4M/s would be on top of it at almost no CPU cost). Worth testing anyway.

diep 2011-06-08 16:30

[email]diep@xs4all.nl[/email] for email
diepchess on a range of messengers, from Skype to AIM to Yahoo etc.
I'm also on some IRC servers sometimes (seldom). I had also sent you that in a private message, but maybe you didn't get a notification there.

Let me ask TheJudger whether he has a GPL header on top of his code as well; I haven't checked out his latest code there.

diep 2011-06-08 16:42

[QUOTE=Bdot;263268]It would certainly work, but comparing the specs I'd expect a speed of ~4M/s with the current kernel. mprime on your CPU is probably way faster (but those 4M/s would be on top of it at almost no CPU cost). Worth testing anyway.[/QUOTE]

The 5450 has 1 SIMD array. Or better: 1 compute unit.

[url]http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-5450-overview/pages/hd-5450-overview.aspx#2[/url]

Quoted at most sites as having a 650 MHz core frequency.

In the long run I'd expect its speed for the 72-bit kernel to be around 8M/s - 12M/s once APP SDK 2.5 releases, not counting the sieving to generate factor candidates (FCs).

That is about the same speed as 2 AMD64 cores achieve.

The theoretical calculation is not so complicated, I'd say. It delivers 650 MHz * 80 cores == 52,000M instructions per second. A naive guess for the cost of a single FC operation is that one needs at most around 4,500 instructions, which gives roughly 11.5M FCs per second and is where the 8M/s - 12M/s estimate comes from. That's a pretty safe guess, if I may say so.

Not achieving that would be very bad in OpenCL. If the compiler somehow messes up there, it might be necessary to wait until the next SDK for AMD to fix that.

Vincent

stefano.c 2011-06-08 19:22

[QUOTE=Bdot;263268]Thanks for the replies and help offers, I'll contact you directly with more info.[/QUOTE]
For email: [email]sigma_gimps_team@yahoo.it[/email]

Ken_g6 2011-06-09 00:52

[QUOTE=Bdot;263268]If you manage to install the AMD APP ([URL]http://developer.amd.com/sdks/AMDAPPSDK/downloads/Pages/default.aspx[/URL]) then it will run, but it may ignore the GPU and run on the CPU instead. You may end up with 1.5M/s :smile:
[/QUOTE]

I dealt with that! I eventually found this selects only GPUs:
[code]
cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM,
(cl_context_properties)platform,
0
};

context = clCreateContextFromType(cps, CL_DEVICE_TYPE_GPU, NULL, NULL, &status);[/code]
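(For completeness, the `platform` handle above has to be obtained first; a minimal sketch using the standard clGetPlatformIDs call, not taken from ppsieve-cl, that just grabs the first available platform and skips error checking:)

[code]
cl_platform_id platform = NULL;
cl_uint num_platforms = 0;
/* take the first platform; real code should check the return value and
   may want to pick the AMD platform by name via clGetPlatformInfo */
clGetPlatformIDs(1, &platform, &num_platforms);
[/code]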

Feel free to browse [url=https://github.com/Ken-g6/PSieve-CUDA/tree/redcl]my ppsieve-cl source[/url]; just remember that it's GPLed if you want to use any of the rest of it.

Meanwhile, having seen Oliver's source code, I'd like to know how you handle 24*24=48-bit integer multiplies quickly in OpenCL. He used assembly, and I found assembly to be useful in my CUDA app as well, but I just couldn't find an equivalent in OpenCL. That, and I couldn't make shifting the 64-bit integers right by 32 bits be parsed as a 0-cycles-required no-op (which it ought to be).

diep 2011-06-09 09:20

[QUOTE=Ken_g6;263327]Meanwhile, having seen Oliver's source code, I'd like to know how you handle 24*24=48-bit integer multiplies quickly in OpenCL. He used assembly, and I found assembly to be useful in my CUDA app as well, but I just couldn't find an equivalent in OpenCL. That, and I couldn't make shifting the 64-bit integers right by 32 bits be parsed as a 0-cycles-required no-op (which it ought to be).[/QUOTE]

See my other postings elsewhere. I figured out that AMD had forgotten to expose in OpenCL the instruction that returns the top 16 bits of a 24x24 multiply.

Initially, as usual, someone posted disinformation claiming this would be a slow instruction. It was posted on the AMD forum. It seemed to be an AMD guy, but I'm not sure; you never know on that forum.

I then filed an official question with AMD asking how fast it would be. How many people would have done that if someone had already answered it on the forum?

The answer, after some weeks, was: it runs at full speed.

The logical request then was to include it in OpenCL. It's possible they will do that for APP SDK 2.5.

So until then it takes 5 cycles to get the full 48 bits: 1 instruction for the low 32 bits and 4 cycles for the top bits using the 32x32 top-bits instruction (and we assume, of course, that you have no more than 24 bits of information per operand, otherwise you'll need another 2 ANDs).
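In OpenCL C that 5-cycle path looks roughly like this (a minimal sketch of my own, not code from any of the kernels discussed here; it assumes a and b each carry at most 24 bits of information):

[code]
/* full 48-bit product of two operands < 2^24 with today's built-ins:
   mul24() gives the low 32 bits cheaply, the top bits still have to come
   from the generic 32x32 mul_hi(), which is the 4-cycle part */
inline ulong mul_24x24_48(uint a, uint b)
{
    uint lo = mul24(a, b);    /* low 32 bits of the product */
    uint hi = mul_hi(a, b);   /* bits 32..47 of the product */
    return ((ulong)hi << 32) | lo;
}
[/code]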

This means that with APP SDK 2.5, 24x24==48-bit operations will run at unrivalled speed on AMD GPUs in OpenCL.

How many of the guys who 'toyed' with those GPUs in OpenCL have been asleep?

I don't know, but looking at the AMD helpdesk, their official ticket system only progresses a few questions a week. My initial questions, though asked over a period of a few days, had sequential numbers.

Nearly no one asks them official questions. No wonder it never improved; they hardly get official feedback!

In those forums, just like here, people just shout something.

You can throw away all those benchmarks of how fast those GPUs are for integers in OpenCL.

Obviously you need patience once you figure out how to do something faster.

Yet this is an immense speedup. The quick port by chrisjp of TheJudger's code to OpenCL uses this slow path, you know.

It is not so hard to write down what is fastest until the release of APP SDK 2.5.

That's using a 70-bit implementation that stores 14 bits per limb, using 5 limbs.

With multiply-add it then eats 25 cycles for the multiplication, and with a bit of shifting added you soon end up at nearly 50 cycles; that's what my guess was based on that it would take around 4,500 cycles to do the complete FC test.
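A minimal sketch of what that looks like (my own illustration in OpenCL C, not code from mfakto; limb 0 is the least significant 14 bits, and the limb struct name is made up):

[code]
/* schoolbook product of two 70-bit numbers stored as 5 x 14-bit limbs:
   25 mad24() steps; each partial product is at most 28 bits and at most
   5 of them land in one 32-bit column, so nothing overflows before the
   carry pass */
typedef struct { uint d[5]; } int70;

void mul_70x70(uint res[10], const int70 *a, const int70 *b)
{
    uint col[10] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

    for (int i = 0; i < 5; i++)
        for (int j = 0; j < 5; j++)          /* the 25 multiply-adds */
            col[i + j] = mad24(a->d[i], b->d[j], col[i + j]);

    uint carry = 0;                          /* renormalize to 14-bit limbs */
    for (int k = 0; k < 10; k++) {
        uint t = col[k] + carry;
        res[k] = t & 0x3FFF;
        carry = t >> 14;
    }
}
[/code]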

These AMD GPUs deliver many teraflops, however, completely unrivalled.

I would guess that modifying Bdot's code to use 5 limbs @ 70 bits would be peanuts to do. That will get a high speed.

So the peak of the 5870, and especially the 6970, theoretically lies somewhere around 1.351 Tflop / 4.5k = 1,351M / 4.5 ≈ 300M/s.

Not based upon multiplying zeros, yet it's a theoretical number. Don't forget the problem with theoretical numbers :)

Then when APP SDK 2.5 releases, we can suddenly build a fast 72-bit kernel with quite a bit higher peak performance :)

diep 2011-06-09 09:40

[QUOTE=Ken_g6;263327]That, and I couldn't make shifting the 64-bit integers right by 32 bits be parsed as a 0-cycles-required no-op (which it ought to be).[/QUOTE]

The 0-cycles-required no-op is nonsense. This hardware delivers teraflops; there is nothing else you can buy for $90 on eBay that delivers that much.

It is already great if everything eats 1 cycle. No competition against those AMD GPUs is possible at the 24x24=48-bit level.

It's just that most 'testers' are not clever enough to actually check the architecture manual to see whether there is a possibility to do it faster.

Forget 64-bit thinking; these GPUs are inherently 32-bit entities. If you want to force them to throw in extra transistors to do 64-bit arithmetic, that's going to penalize our TF speed, as we of course preferably want a petaflop in 32 bits.

If you take a good look at TheJudger's code, the fastest kernel is based upon 24-bit multiplications.

That is for a good reason. It's naive to believe that the GPUs will go 64-bit.

Graphics itself can make do with 16 bits, so having 24 bits or even 32 bits is already a luxury.

64-bit integers will always be too slow on those GPUs, of course.

Instead of extra transistors thrown away on that, I'd prefer another few thousand PEs.

In the first place, those GPUs deliver all this calculation power because all those kids who game on them pay for us. Only when sold in huge quantities, so billions of those GPUs, can the price stay at its current level.

The "fast high-end HPC" chips, take POWER6 or POWER7, what is that, like $200k per unit or so?

A single GPU has the same performance, yes, even at double precision, as such an entire POWER7.

It's just that programming for it is a lot more complicated, and that's because AMD and Nvidia simply have no good explanation of the internal workings of the GPUs. Very silly, yet it's the truth.

The factories that produce this hardware cost many billions to build. Intel was expecting that by 2020 a new factory would have a price of $20 billion. With each new process technology, the price of the factories goes up as well. Intel calls that Moore's second law; google it.

This means that only processors that get produced in HUGE QUANTITY have a chance of getting sold at an economical price.

Right now all those gamers pay for that development, as I said earlier on, and what's fastest for them doesn't involve any 64-bit logic.

Vincent

TheJudger 2011-06-09 11:03

Hi Vincent,

[QUOTE=diep;263348]If you take a good look at TheJudgers code the fastest kernel is based upon 24 bits multiplications.[/QUOTE]

this is not 100% true. For compute capability 1.x GPUs this is true, but the "barrett79_mul32" kernel comes very close to the speed of the "71bit_mul24" kernel. For compute capability 2.x GPUs the "71bit_mul24" kernel is the slowest of the five kernels. Only the "71bit_mul24" kernel uses 24-bit multiplications.

Oliver

diep 2011-06-09 13:05

[QUOTE=TheJudger;263353]Hi Vincent,

this is not 100% true. For compute capability 1.x GPUs this is true, but the "barrett79_mul32" kernel comes very close to the speed of the "71bit_mul24" kernel. For compute capability 2.x GPUs the "71bit_mul24" kernel is the slowest of the five kernels. Only the "71bit_mul24" kernel uses 24-bit multiplications.

Oliver[/QUOTE]

And that isn't 64-bit code either; that's my most important point. It's all 32-bit code.

Now, where AMD has a similar number of units providing 32x32-bit multiplications to Nvidia, for 24x24==48 bits it has 4 times that amount (in fact, in the 6000 series it uses precisely 4 simple units to emulate 32x32-bit, 64-bit or double-precision operations).

Now, it might be easier to code things in 64 bits (we would be missing a 64x64 == 128-bit multiplication then), yet we must face the fact that the GPUs will stay 32-bit.

As for AMD GPUs: in the time it takes to multiply 32x32 == 64 bits, which is 2 instructions, you can push through 8 simple instructions.

8 simple instructions will do 4 x 24x24==48-bit multiplies, so that outputs 192 bits in total, whereas the 2-instruction 32-bit multiply outputs 64 bits in the same timespan.

For Nvidia that break-even point is a tad different of course, as you correctly indicate.

So per cycle a single AMD GPU can output 1536 * 48 * 0.5 = 36,864 bits.

The older AMD GPUs can on paper do even a tad more than that.

Good codes really achieve a very high IPC on those modern GPUs (Fermi, 5000 or 6000 series).

How many integer bits can a GTX 580 output per cycle?
What was it, using 32x32-bit multiplications, something like:

512 * 64 * 0.5 = 16,384 bits

Is that correct?

Or was it 2 cycles per instruction, making it 8,192 bits per cycle?

