mersenneforum.org  

2012-11-26, 21:01   #562
flashjh
"Jerry"
Nov 2011
Vancouver, WA
1,123 Posts

Quote:
Originally Posted by diep
I'm no expert on GPU cooling, yet from my viewpoint the default cooler on those cards is crap.

If you look on eBay you see some really good cooling designs that keep the temperature of the video cards down.

http://www.youtube.com/watch?v=DGeERB7BDHM

Of course there is no warranty left on the GPU then, but you have to sacrifice something for cool factoring speeds.

I have swapped the stock coolers on the 580s for aftermarket fan systems and they don't help much. I've never tried watercooling, though from what I've read it's really the only way to keep them cool enough in the summertime. Right now it's winter here, so I can use the window to keep the systems nice and chilly.

2012-11-26, 21:17   #563
kracker
"Mr. Meeseeks"
Jan 2012
California, USA
2³·271 Posts

Quote:
Originally Posted by flashjh
I have swapped the stock coolers on the 580s for aftermarket fan systems and they don't help much. I've never tried watercooling, though from what I've read it's really the only way to keep them cool enough in the summertime. Right now it's winter here, so I can use the window to keep the systems nice and chilly.

+1
Watercooling is the only way to cool when the ambient temperature is hot; imagine air "cooling" a GPU with "warm" air.

2012-11-27, 09:38   #564
Bdot
Nov 2010
Germany
255₁₆ Posts

Quote:
Originally Posted by JacksonML
Please add 32-bit. At the moment I only have a 32-bit computer. Also, would I be able to take the current jobs on Prime95 and move them over to this GPU version? Please answer as that would be helpful

As you are talking about Prime95, I assume you're running Windows. For that, a 32-bit version is built as well: have a look at http://www.mersenneforum.org/mfakto/...o-0.12-win.zip, where you'll find mfakto-x32.exe (or so).

In case you're running Linux ... I could add a 32-bit version too, but there it should be easier (and better) for you to go to 64-bit.

Trial-factoring jobs can be taken over from Prime95 to mfakto, but mfakto will not read Prime95's save files. Depending on your GPU/CPU speed ratio, it may or may not be better to finish already started trial factoring on the CPU (but generally it is no longer recommended to do trial factoring on the CPU; use it for job types that cannot (yet) run on the GPU, especially P-1).

2012-11-27, 10:36   #565
Bdot
Nov 2010
Germany
1125₈ Posts

Hi diep, welcome back to this forum.

While I understand parts of your frustration with AMD and their responsiveness, there's no reason to fill this thread with all this gibberish and these wrong statements ("you lose 4x2=16 cycles", "cannot use multiply-add in trial factoring", "7970 is a shrink of the 6790" and all that crap).

  • The 7970, or GCN in general, runs your programs totally differently; you should notice immediately when you try it. Almost no vectorization is necessary anymore, statement dependencies do not hurt anymore, jumps/loops/conditionals are way more efficient, etc.
  • 32x32-bit mul/mul_hi runs at the DP rate, which is different for 79xx and 77xx/78xx. On 79xx it is usable and faster than methods that reduce (24x24) or avoid (16x16) them. On the others, you're better off not using 32x32, but not by much.
  • Check out http://devgurus.amd.com/thread/159954 if you want to use mul24_hi. GCN assembly programming is fun!
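
For readers unfamiliar with the built-ins named in the list above, here is a minimal plain-C sketch of what `mul24`, `mul_hi` and a `mul24_hi`-style operation compute for unsigned inputs. It only emulates the arithmetic on a CPU and says nothing about GPU instruction timing. The `_sketch` names are made up for this illustration, and `mul24_hi_sketch` is an assumption: it models the upper 16 bits of the 48-bit product, which is not a standard OpenCL built-in.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates OpenCL mul24() for unsigned inputs: the low 32 bits of the
 * product of two values that each fit in 24 bits. */
static uint32_t mul24_sketch(uint32_t x, uint32_t y)
{
    /* Outside this range the real built-in is implementation-defined. */
    assert(x < (1u << 24) && y < (1u << 24));
    return (uint32_t)((uint64_t)x * y);
}

/* Emulates OpenCL mul_hi() for 32-bit unsigned inputs: the upper 32 bits
 * of the full 64-bit product. */
static uint32_t mul_hi_sketch(uint32_t x, uint32_t y)
{
    return (uint32_t)(((uint64_t)x * y) >> 32);
}

/* Assumed semantics of a mul24_hi-style operation: bits 32..47 of the
 * 48-bit product of two 24-bit values (16 significant bits). */
static uint32_t mul24_hi_sketch(uint32_t x, uint32_t y)
{
    assert(x < (1u << 24) && y < (1u << 24));
    return (uint32_t)(((uint64_t)x * y) >> 32);
}
```

Under these assumptions, the full 48-bit product of two 24-bit limbs is `((uint64_t)mul24_hi_sketch(x, y) << 32) | mul24_sketch(x, y)`, which is why having the high part available matters for 24-bit limb arithmetic.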

2012-11-27, 12:26   #566
diep
Sep 2006
The Netherlands
1011011001₂ Posts

Quote:
Originally Posted by Bdot
Hi diep, welcome back to this forum.

While I understand parts of your frustration with AMD and their responsiveness, there's no reason to fill this thread with all this gibberish and these wrong statements ("you lose 4x2=16 cycles", "cannot use multiply-add in trial factoring", "7970 is a shrink of the 6790" and all that crap).

  • The 7970, or GCN in general, runs your programs totally differently; you should notice immediately when you try it. Almost no vectorization is necessary anymore, statement dependencies do not hurt anymore, jumps/loops/conditionals are way more efficient, etc.
  • 32x32-bit mul/mul_hi runs at the DP rate, which is different for 79xx and 77xx/78xx. On 79xx it is usable and faster than methods that reduce (24x24) or avoid (16x16) them. On the others, you're better off not using 32x32, but not by much.
  • Check out http://devgurus.amd.com/thread/159954 if you want to use mul24_hi. GCN assembly programming is fun!

AMD has released documents on which we must base ourselves. They did not give any indication that anything is different in the 79xx architecture compared to the 69xx.
Of course that's because they had their most junior dude rewrite the old document; basically a few diagrams were stripped (the ones that explained how a compute core worked).

So if you claim it is different, PROVE it with facts.

2012-11-27, 12:58   #567
diep
Sep 2006
The Netherlands
3⁶ Posts

The hard facts are that we need multiplication for prime numbers and AMD didn't change anything there, which is what you admit, so from my viewpoint nothing changed from the 6970 to the 7900 in the IPC of the GPU.

The fact that you admit they didn't change anything there makes me wonder why you wrote that posting.

What is important for Wagstaff now is getting to 70 bits quickly. 72 bits would be even more wonderful.

If we use 24x24 bits we can get to 72 bits relatively fast. On paper it's 2 clock cycles, if AMD were not giving such bad support, assuming they have nothing to hide.

It's 3 units (limbs) there. 2 multiplications are needed (hi and lo), so with schoolbook multiplication that's 18 clocks and some overhead for distributing. Very attractive.

Doing that on the 79xx and 69xx with 32x32-bit multiplications we also need 3 units. However, it eats 8 clock units (units times number of cycles).

So the price then is 8 * 9 = 72 clocks for schoolbook.

As I described before, in the current situation, with just the low bits of 24x24 available, we can in theory use up to 16 bits out of it. Now with FMA, if we use 14 bits and 5 units we can get to 70 bits with 5 x 5 = 25 multiplications = 25 clocks in this case, and relatively little overhead (under a factor of 2).

Yet its limit is 70 bits.

To get to 72 bits we need 6 units (6 * 14 = 84 bits in total).

That's 36 clock cycles and some overhead for the rest.
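
The clock counts above follow from simple limb arithmetic, which can be sketched in plain C. The helper names are hypothetical, and the per-product costs passed in (2, 8 and 1) are the figures claimed in this post, not measurements.

```c
#include <assert.h>

/* Number of limbs of limb_bits bits needed to hold total_bits bits. */
static unsigned limbs_needed(unsigned total_bits, unsigned limb_bits)
{
    assert(limb_bits > 0);
    return (total_bits + limb_bits - 1) / limb_bits;
}

/* Schoolbook multiplication of two n-limb numbers forms n * n partial
 * products; the estimate charges cost_per_product clocks for each. */
static unsigned schoolbook_cost(unsigned limbs, unsigned cost_per_product)
{
    return limbs * limbs * cost_per_product;
}
```

With these helpers, `schoolbook_cost(limbs_needed(72, 24), 2)` gives the 18 clocks quoted for 24x24 with hi and lo parts, `schoolbook_cost(3, 8)` the 72 clocks for 32x32, and `schoolbook_cost(limbs_needed(70, 14), 1)` and `schoolbook_cost(limbs_needed(72, 14), 1)` the 25 and 36 multiply-adds for 14-bit limbs.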

What AMD modified or didn't modify is therefore totally irrelevant as long as their 32x32 multiplication is this slow and as long as they keep the top bits of the 24x24 result (16 bits) unavailable to the user, meanwhile claiming it's fast (which we cannot verify).

Nvidia is 4x faster there. Nvidia doesn't need to combine 4 of its cores for a single multiplication. That's why.

That's why Nvidia rules in trial factoring: the by now old-generation Fermi is 4x faster in multiplication than AMD and totally kicks the latest-generation AMD GPU.

Note that I also showed you why a newer process technology that allows a 2x faster GPU didn't result in a GPU that's faster than the Nvidia ones for TF: they increased computational power only by a factor of 1.4.

The reason is that AMD clocked the GPU higher and especially added bandwidth to the RAM. All this for games.

Nothing wrong with optimizing for that, but let's stick to the facts. Nvidia is *way* faster for trial factoring and the latest AMD generation didn't change anything there. To get to 70-72 bits for TF I'd better use one of the Teslas here. That is a waste, however, as I want to use them for a modified FFT, modified for Wagstaff.

So the GPU I have here is gonna get that job up to a bit or 70 :)

Last fiddled with by diep on 2012-11-27 at 13:06

2012-11-27, 14:32   #568
Bdot
Nov 2010
Germany
597₁₀ Posts

OpenCL 32x32-bit multiplication is still slower than most other operations on GCN. That is still true. But there are so many architectural changes from the 6970 to the 7970 with a huge effect on mfakto that a 7970 achieves about twice the throughput (not just 1.4 times, as the raw FLOPS would suggest).

Have a look at https://github.com/Bdot42/mfakto/blo...c/barrett15.cl for an implementation of 5x15-bit and 6x15-bit trial factoring.

Why did you want to use just 14 bits per int?

2012-11-27, 16:21   #569
diep
Sep 2006
The Netherlands
2D9₁₆ Posts

Quote:
Originally Posted by Bdot
OpenCL 32x32-bit multiplication is still slower than most other operations on GCN. That is still true. But there are so many architectural changes from the 6970 to the 7970 with a huge effect on mfakto that a 7970 achieves about twice the throughput (not just 1.4 times, as the raw FLOPS would suggest).

Have a look at https://github.com/Bdot42/mfakto/blo...c/barrett15.cl for an implementation of 5x15-bit and 6x15-bit trial factoring.

Why did you want to use just 14 bits per int?

If you scroll back, I already explained in depth how to do the 14 bits per int about 1.5 years ago on this forum.
I was the first one here to propose using 16x16-bit multiplications on AMD GPUs :)

You can then just do a bunch of FMAs and add up the remnants later, which speeds things up considerably over toying with 15 bits.
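
One reading of "do a bunch of FMAs and add up the remnants later" is a schoolbook multiply with deferred carries. The plain-C sketch below assumes 5 limbs of 14 bits and a 32-bit accumulator per column; `mad24_sketch` is a CPU stand-in for OpenCL's `mad24`, and `mul_70_140_sketch` is a hypothetical name borrowed from the kernel naming used in this thread, not code from mfakto.

```c
#include <assert.h>
#include <stdint.h>

#define LIMBS     5
#define LIMB_BITS 14u
#define LIMB_MASK ((1u << LIMB_BITS) - 1u)

/* CPU stand-in for OpenCL mad24(): x * y + z, truncated to 32 bits. */
static uint32_t mad24_sketch(uint32_t x, uint32_t y, uint32_t z)
{
    assert(x < (1u << 24) && y < (1u << 24));
    return (uint32_t)((uint64_t)x * y) + z;
}

/* r[0..2*LIMBS-1] = a * b, with a and b given as LIMBS limbs of
 * LIMB_BITS bits each, least significant limb first. */
static void mul_70_140_sketch(const uint32_t a[LIMBS], const uint32_t b[LIMBS],
                              uint32_t r[2 * LIMBS])
{
    uint32_t col[2 * LIMBS] = {0};
    uint32_t carry = 0;

    /* Pass 1: 25 multiply-adds with no shifts or masks in between.
     * Each column sums at most 5 products below 2^28, so it stays
     * below 2^31 and cannot overflow the 32-bit accumulator. */
    for (int i = 0; i < LIMBS; i++) {
        for (int j = 0; j < LIMBS; j++) {
            assert(a[i] <= LIMB_MASK && b[j] <= LIMB_MASK);
            col[i + j] = mad24_sketch(a[i], b[j], col[i + j]);
        }
    }

    /* Pass 2: a single carry sweep normalizes every column to 14 bits. */
    for (int k = 0; k < 2 * LIMBS; k++) {
        uint32_t t = col[k] + carry;
        r[k] = t & LIMB_MASK;
        carry = t >> LIMB_BITS;
    }
    assert(carry == 0); /* a 70-bit by 70-bit product fits in 140 bits */
}
```

The headroom is what separates the two limb sizes: five products of 15-bit limbs can exceed 2^32, so a 15-bit kernel has to shift carries out between multiply-adds, whereas 14-bit limbs leave room to postpone all carries to the end.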

The disadvantage is that you then have only 70 bits. Yet 70 bits for Wagstaff TF is already big progress.
The 79xx will achieve the same speed there as the 69xx series, of course.

It's always possible to design slower code than what is objectively possible to achieve, which makes newer hardware look better.

We've seen that trick all too much in compilers over the past decades...

Of course for Mersenne you need a 96-bit kernel now or something, so AMD is not interesting at all for Mersenne.

Last fiddled with by diep on 2012-11-27 at 16:29

2012-11-27, 16:25   #570
kracker
"Mr. Meeseeks"
Jan 2012
California, USA
2³·271 Posts

But that is not what the mfakto benchmarks say. Got a loose screw somewhere?

EDIT: I know, I know, they shrunk the die, but GCN has vast improvements over VLIW4-5. Believe what you wish.

Last fiddled with by kracker on 2012-11-27 at 16:31

2012-11-27, 16:48   #571
diep
Sep 2006
The Netherlands
1331₈ Posts

Quote:
Originally Posted by Bdot
OpenCL 32x32-bit multiplication is still slower than most other operations on GCN. That is still true. But there are so many architectural changes from the 6970 to the 7970 with a huge effect on mfakto that a 7970 achieves about twice the throughput (not just 1.4 times, as the raw FLOPS would suggest).

Have a look at https://github.com/Bdot42/mfakto/blo...c/barrett15.cl for an implementation of 5x15-bit and 6x15-bit trial factoring.

Why did you want to use just 14 bits per int?

I see you do something very similar to what I proposed back then (is it already nearly 2 years ago? oh boy).

Yet you manage to get the 25 mad24's I had proposed back then for the mul_75_150 kernel (which was mul_70_140 in my proposal) down to 19 multiplications and 7 shifts, besides a few adds and ands.

How fast is shifting on the GPU?

I never could figure that out.

What I do see is that you already use the result of res->d3 directly, like this:

Code:
res->d3 = mad24(a.d0, b.d3, res->d3);

res->d4 = mad24(a.d4, b.d0, res->d3 >> 15);
Is it a good idea to use the result of res->d3 directly?

If you run 2 threads at the same time, it still gives a penalty of 4 cycles, doesn't it?

After all, it takes a cycle or 8 to retire the result and free it up for the next multiplication.

Isn't it possible to optimize that better?
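
To make the question concrete, here is a plain-C sketch of the two statement orders under discussion. It is an illustration only: `mad24_sketch` is a CPU stand-in for OpenCL's `mad24`, both function names are hypothetical, and neither is taken from mfakto. Whether the reordering gains anything depends on the hardware and on how the compiler schedules the code.

```c
#include <assert.h>
#include <stdint.h>

/* CPU stand-in for OpenCL mad24(): x * y + z, truncated to 32 bits. */
static uint32_t mad24_sketch(uint32_t x, uint32_t y, uint32_t z)
{
    assert(x < (1u << 24) && y < (1u << 24));
    return (uint32_t)((uint64_t)x * y) + z;
}

/* Order quoted above: the second multiply-add consumes d3 immediately. */
static void columns_chained(uint32_t a0, uint32_t b3, uint32_t a4, uint32_t b0,
                            uint32_t *d3, uint32_t *d4)
{
    *d3 = mad24_sketch(a0, b3, *d3);
    *d4 = mad24_sketch(a4, b0, *d3 >> 15); /* waits for d3 */
}

/* Reordered: a4 * b0 does not depend on d3, so it is issued first and the
 * carry from d3 is folded in afterwards with a plain add. */
static void columns_reordered(uint32_t a0, uint32_t b3, uint32_t a4, uint32_t b0,
                              uint32_t *d3, uint32_t *d4)
{
    uint32_t p4 = mad24_sketch(a4, b0, 0u); /* independent of d3 */
    *d3 = mad24_sketch(a0, b3, *d3);
    *d4 = p4 + (*d3 >> 15);
}
```

Both orders produce the same values. The reordered one trades the fused add for one extra add instruction in exchange for an independent multiply that can be issued while d3 is still in flight.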

Kind Regards,
Vincent

2012-11-27, 19:59   #572
diep
Sep 2006
The Netherlands
1011011001₂ Posts

Quote:
Originally Posted by kracker
But that is not what the mfakto benchmarks say. Got a loose screw somewhere?

EDIT: I know, I know, they shrunk the die, but GCN has vast improvements over VLIW4-5. Believe what you wish.

The improvements are for non-optimal code.

People keep programming those things as if they're x86 CPUs.

Code:
a = b;
c = a;
On a CPU this is not a problem. On a GPU you used to lose 8 cycles or so.
By nowadays running 2 threads in hardware they reduced the problem to 4 cycles.

The next one runs at full speed:

Code:
a = b;
x = x1;
y = y1;
z = z1;
c = a;
If you want to run this at full speed on the old Nvidia 200 series as well, you need 8 statements.

In the tests performed you can easily get to 70%.
Most code, however, got 25%, as people kept programming the GPUs like x86.
So nearly all improvements are because of this and basically do not influence optimal code.

It's interesting to see how government coders also learn pretty slowly there.

The solution Nvidia and AMD came up with some generations ago was to host 2 threads and run them alternately in hardware. Both do kind of the same thing, yet each explains it in a different manner (avoiding each other's patents, I assume).

So that reduced the problems big time.

That's why most code sped up: not because the GPU could objectively push through a higher IPC, but because the code was crap.

Those are the sort of improvements you'll see. What would be interesting on AMD, however, is a faster 32x32 multiplication :)

And I bet they won't do that soon...

Last fiddled with by diep on 2012-11-27 at 20:05

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.