#562
"Jerry"
Nov 2011
Vancouver, WA
1,123 Posts
#563
"Mr. Meeseeks"
Jan 2012
California, USA
1000011110002 Posts
Quote:
Watercooling is the only way to cool when the ambient temperature is hot; imagine air "cooling" a GPU with "warm" air.
#564
Nov 2010
Germany
3×199 Posts
Quote:
In case you're running Linux ... I could add a 32-bit version too, but there it should be easier (and better) for you to go to 64-bit. Trial-factoring jobs can be taken over from prime95 to mfakto, but mfakto will not read prime95's save files. Depending on your GPU/CPU speed ratio it may or may not be better to finish already-started trial factoring on the CPU (but generally it is not recommended anymore to do trial factoring on the CPU; use it for job types that cannot (yet) run on the GPU, especially P-1).
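As an illustration of moving such a job over (assuming the standard GIMPS worktodo syntax `Factor=<assignment ID>,<exponent>,<from bit>,<to bit>`; the exponent and bit levels below are made up): both prime95 and mfakto take trial-factoring assignments from a worktodo file, so transferring a job is a matter of copying the line, with only the in-progress bit level's save-file state being lost.

```text
Factor=N/A,79001231,65,71
```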
#565
Nov 2010
Germany
3×199 Posts |
Hi diep, welcome back to this forum.
While I understand part of your frustration with AMD and their responsiveness, there's no reason to fill this thread with all this gibberish and these wrong statements ("you lose 4x2=16 cycles", "cannot use multiply-add in trial factoring", "7970 is a shrink of the 6790" and all that crap).
#566
Sep 2006
The Netherlands
36 Posts
Quote:
Of course that's because they had their most junior dude rewrite the old document; basically a few diagrams were stripped (which explained how a compute core worked). So if you claim it is different, PROVE it with facts.
#567
Sep 2006
The Netherlands
36 Posts |
The hard facts are that we need multiplication for prime numbers, and AMD didn't change anything there, which you admit. So from my viewpoint nothing changed from the 6970 to the 7900 series in the GPU's IPC. The fact that you admit they didn't change anything there makes me wonder why you wrote that posting.

What's important for Wagstaff now is getting to 70 bits quickly; 72 bits would be even more wonderful. If we use 24x24-bit multiplications we can get to 72 bits relatively fast: on paper a multiplication is 2 clock cycles, if AMD weren't giving such bad support (assuming they have nothing to hide). That's 3 units there; 2 multiplications are needed (hi and lo), so with schoolboy that's 18 clocks and some overhead for distributing. Very attractive. Doing that on the 79xx and 69xx with 32x32-bit multiplications we also need 3 units, but each multiplication eats 8 clock units (units times number of cycles), so the price is 8 * 9 = 72 clocks for schoolboy.

As I described before, in the current situation, with only the low bits of 24x24 available, we can in theory use up to 16 bits of it. With FMA, if we use 14 bits and 5 units, we can get to 70 bits with 5 x 5 = 25 multiplications = 25 clocks, and relatively little overhead (under a factor 2). Yet its limit is 70 bits. To get to 72 bits we need 6 units (6 * 14 = 84 bits in total); that's 36 clock cycles and some overhead for the rest.

What AMD modified or didn't modify is therefore totally irrelevant as long as their 32x32 multiplication is this slow and as long as they keep the top bits of the 24x24 result (16 bits) unavailable to the user, while claiming it's fast (which we cannot verify). Nvidia is 4x faster there: Nvidia doesn't need to combine 4 of its cores for a single multiplication. That's why Nvidia rules in trial factoring; by now old-generation Fermi is 4x faster in multiplication than AMD and totally beats the latest-generation AMD GPU.

Note that I also showed why the newer process technology, which allows a 2x faster GPU, didn't result in a GPU that's faster than Nvidia's for TF: they increased computational power only by a factor 1.4. The reason is that AMD clocked the GPU higher and especially added bandwidth to the RAM, all for games.

Nothing wrong with optimizing for that, but let's stick to the facts: Nvidia is *way* faster for trial factoring, and the latest AMD generation didn't change anything there. To get to 70-72 bits for TF I'd better use one of the Teslas here. That would be a waste, however, as I want to use them for a modified FFT (modified for Wagstaff). So the GPU I have here is going to get that job up to 70 bits or so :)

Last fiddled with by diep on 2012-11-27 at 13:06
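The limb arithmetic above can be sketched in plain Python (an illustrative model, not mfakto or GPU code; all names are made up): with 14-bit limbs, 5 limbs cover 5*14 = 70 bits, and a schoolboy product of two 5-limb numbers costs exactly 5 x 5 = 25 limb multiplications, matching the 25 clocks claimed in the post.

```python
# Illustrative model of schoolboy multiplication with 14-bit limbs.
LIMB_BITS = 14
MASK = (1 << LIMB_BITS) - 1

def to_limbs(x, n):
    """Split x into n limbs of LIMB_BITS bits, least significant first."""
    return [(x >> (i * LIMB_BITS)) & MASK for i in range(n)]

def schoolboy_mul(a_limbs, b_limbs):
    """Full schoolboy product of two limb vectors; also counts multiplies."""
    res = [0] * (len(a_limbs) + len(b_limbs))
    muls = 0
    for i, ai in enumerate(a_limbs):
        for j, bj in enumerate(b_limbs):
            res[i + j] += ai * bj      # one mad24-style multiply-add
            muls += 1
    for k in range(len(res) - 1):      # propagate carries
        res[k + 1] += res[k] >> LIMB_BITS
        res[k] &= MASK
    return res, muls

a, b = (1 << 69) + 12345, (1 << 69) + 67890   # two ~70-bit operands
res, muls = schoolboy_mul(to_limbs(a, 5), to_limbs(b, 5))
product = sum(limb << (i * LIMB_BITS) for i, limb in enumerate(res))
assert muls == 25          # 5 x 5 = 25 multiplications, as in the post
assert product == a * b    # the schoolboy result is exact
```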
#568
Nov 2010
Germany
3×199 Posts
OpenCL 32x32-bit multiplication is still slower than most other operations on GCN; that is still true. But there are so many architectural changes from the 6970 to the 7970 with a huge effect on mfakto that a 7970 achieves about twice the throughput (not just 1.4 times, like the raw FLOPS).

Have a look at https://github.com/Bdot42/mfakto/blo...c/barrett15.cl for an implementation of 5x15-bit and 6x15-bit trial factoring. Why did you want to use just 14 bits per int?
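One way to frame the 14-vs-15-bit trade-off (my illustration, not from either poster; it assumes mad24-style accumulation into an unsigned 32-bit register): the limb width determines how many worst-case products can be summed before a carry step is forced. A quick check:

```python
# How many worst-case b-bit x b-bit products fit in a 32-bit accumulator
# before it could overflow? (Assumes mad24-style 32-bit accumulation.)
def max_accumulations(limb_bits):
    max_product = ((1 << limb_bits) - 1) ** 2
    acc, count = 0, 0
    while acc + max_product < (1 << 32):
        acc += max_product
        count += 1
    return count

print(max_accumulations(15))  # prints 4:  15-bit limbs leave room for 4 products
print(max_accumulations(14))  # prints 16: 14-bit limbs leave room for 16
```

So narrower limbs buy more accumulation headroom per carry step, at the cost of needing more limbs to reach a given bit level.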
#569
Sep 2006
The Netherlands
36 Posts
Quote:
I was the first one here to propose using 16x16-bit multiplications on AMD GPUs :) You can just do a bunch of FMAs and add up the remnants later, which speeds things up considerably over toying with 15 bits. The disadvantage is that you then have 70 bits. Yet 70 bits for Wagstaff TF is already big progress. The 79xx will achieve the same speed there as the 69xx series, of course. It's always possible to design slower code than what is objectively achievable, which makes newer hardware look better. We've seen that trick all too often in compilers over the past decades... Of course for Mersenne you need a 96-bit kernel now or something, so AMD is not interesting at all for Mersenne.

Last fiddled with by diep on 2012-11-27 at 16:29
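The "bunch of FMAs, add up the remnants later" idea can be sketched in Python (a hypothetical model of the scheme, not actual kernel code; names are mine): with 16-bit limbs a 16x16 product fills 32 bits exactly, so there is no in-register headroom to accumulate; each product's high half is set aside and summed in a second pass.

```python
# Hypothetical model of 16x16-limb multiplication with deferred remnants.
LIMB = 16
MASK = (1 << LIMB) - 1

def to_limbs(x, n):
    return [(x >> (LIMB * i)) & MASK for i in range(n)]

def mul_16x16(a_limbs, b_limbs):
    n = len(a_limbs) + len(b_limbs)
    lo = [0] * n
    hi = [0] * n
    for i, ai in enumerate(a_limbs):
        for j, bj in enumerate(b_limbs):
            p = ai * bj                      # fits in exactly 32 bits
            lo[i + j] += p & MASK            # low 16 bits, used immediately
            hi[i + j + 1] += p >> LIMB       # remnant, added up later
    res = [l + h for l, h in zip(lo, hi)]    # second pass: add the remnants
    for k in range(n - 1):                   # final carry propagation
        res[k + 1] += res[k] >> LIMB
        res[k] &= MASK
    return res

a, b = 0xDEADBEEFCAFE, 0x123456789ABC        # made-up 48-bit operands
res = mul_16x16(to_limbs(a, 3), to_limbs(b, 3))
assert sum(l << (LIMB * i) for i, l in enumerate(res)) == a * b
```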
#570
"Mr. Meeseeks"
Jan 2012
California, USA
41708 Posts
But that is not what the mfakto benchmarks say. Got a loose screw somewhere?

EDIT: I know, I know, they shrunk the die, but GCN has vast improvements over VLIW4/5. But believe what you wish.

Last fiddled with by kracker on 2012-11-27 at 16:31
#571
Sep 2006
The Netherlands
72910 Posts
Quote:
Yet you managed to get the 25 mad24s I had proposed back then for the mul_75_150 kernel (which was mul_70_140 in my proposal) down to 19 multiplications and 7 shifts, besides a few adds and ands. How fast is shifting on the GPU? I never could figure that out. What I do see is that you already use the result of res->d3 directly, like this:

Code:
res->d3 = mad24(a.d0, b.d3, res->d3);
res->d4 = mad24(a.d4, b.d0, res->d3 >> 15);

If you run 2 threads at the same time, doesn't that still give a penalty of 4 cycles? As it takes about 8 cycles to retire the result and free it up for the next multiplication. Isn't it possible to optimize that better?

Kind Regards,
Vincent
#572
Sep 2006
The Netherlands
10110110012 Posts
Quote:
They keep programming those things as if they were x86 CPUs:

Code:
a = b;
c = a;

By running 2 hardware threads nowadays, they reduced that problem to 4 cycles. The next one runs at full speed:

Code:
a = b;
x = x1;
y = y1;
z = z1;
c = a;

In tests performed you can easily get to 70%. Most codes got only 25%, however, as they kept programming the GPUs as if they were x86. So nearly all the improvements come from this and basically do not affect optimal code. It's interesting to see how slowly government coders learn there, too.

The solution Nvidia and AMD came up with some generations ago was to host 2 threads and run them alternating in hardware. Both do roughly the same thing, yet each explains it in a different manner (avoiding each other's patents, I assume). That reduced the problems bigtime. That's why most codes sped up: not because the GPU could objectively push through a higher IPC, but because the codes were crap. That's the sort of improvement you'll see. What would be interesting on AMD is a faster 32x32 multiplication, however :) And I bet they won't do that soon...

Last fiddled with by diep on 2012-11-27 at 20:05