mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing
Old 2012-11-28, 09:51   #573
Bdot (Nov 2010, Germany)

Quote:
Originally Posted by diep View Post
I see you do something very similar to what I had proposed (is it already nearly 2 years ago - oh boy) back then.

Yet you manage to get the 25 mad24's I had proposed back then for the mul_75_150 kernel (which was mul_70_140 in my proposal) down to 19 multiplications and 7 shifts, besides a few adds and ands.

How fast is shifting on the GPU?

I never could figure that out.

What I do see is that you already use the result of res->d3 directly, like this:

Code:
res->d3 = mad24(a.d0, b.d3, res->d3);

res->d4 = mad24(a.d4, b.d0, res->d3 >> 15);
Is it a good idea to use the result of res->d3 directly?

If you run 2 threads at the same time, then it still gives a penalty of 4 cycles, doesn't it?

As it takes a cycle or 8 to retire the result and free it up for the next multiplication.

Isn't it possible to optimize that better?

Kind Regards,
Vincent
Hi Vincent,

also 2 years ago, I already explained that the VLIW4/5 architecture has special P-registers that always hold a copy of the previous cycle's operation result. This P register (actually 5x32bit: 1 scalar and 1 vector) is immediately available for the next instruction in the pipeline, without additional delay/penalty.

And for the code you cite, keep in mind that each of the values used is a 4-way vector for VLIW, and a 2-way vector for GCN. Each vector component processes a different factor candidate of the same exponent. Therefore, the code expands (on VLIW) to
Code:
res->d3.x = mad24(a.d0.x, b.d3.x, res->d3.x);
res->d3.y = mad24(a.d0.y, b.d3.y, res->d3.y);
res->d3.z = mad24(a.d0.z, b.d3.z, res->d3.z);
res->d3.w = mad24(a.d0.w, b.d3.w, res->d3.w);
 
res->d4.x = mad24(a.d4.x, b.d0.x, res->d3.x >> 15.x);
res->d4.y = mad24(a.d4.y, b.d0.y, res->d3.y >> 15.y);
res->d4.z = mad24(a.d4.z, b.d0.z, res->d3.z >> 15.z);
res->d4.w = mad24(a.d4.w, b.d0.w, res->d3.w >> 15.w);
Still, the vectors are not there to hide latency, but to provide enough independent instructions to fill all compute units.

Regarding the timing: shift, add, mul24/mad24, mul, mul_hi etc. all take one cycle in the pipeline (the difference between 32x32 and 24x24 is not the timing, but that 32x32 can be done only in the t-unit on VLIW5, while still allowing other instructions to be co-issued to the xyzw-units). The pipeline length differs a bit between the architectures, but is around 10. For GCN they reduced it quite a bit (to 7, I think, but I'm not sure). So if you write a kernel that does a single add, it will take 7 to ~10 cycles for that single calculation.

BTW, on GCN, you now have a separate scalar unit that helps with instruction decoding, branch/jump calculation etc. When using assembly, you can run 32x32 muls on this unit at full speed! OpenCL will only issue mul instructions to the vector units, but on the AMD forum, you can find that for each mul24 in the vector units you can issue an additional mul for the scalar unit, increasing total throughput quite a bit.

Last fiddled with by Bdot on 2012-11-28 at 10:12 Reason: Actually, the constant 15 is also widened to a vector ...
Old 2012-11-28, 10:40   #574
diep (Sep 2006, The Netherlands)

Quote:
Originally Posted by Bdot View Post
also 2 years ago, I already explained that the VLIW4/5 architecture has special P-registers that always hold a copy of the previous cycle's operation result.
Thanks for your extensive answer!

If I read that correctly, the 7970 is 2.5% faster at multiplication than the 6900 series. This according to the guy testing 'at home'.

The response of the AMD helpdesk is especially typical:

"AMD does not provide any tools or support to do what you want."

Last fiddled with by diep on 2012-11-28 at 10:41
Old 2012-11-28, 12:51   #575
diep (Sep 2006, The Netherlands)

The instruction decoding you mention is interesting; nowadays it is the biggest concern for CPU designers, according to one posting I read.

The L1i of the AMD GPUs is described as an ultra-tiny 8KB. What I do not know is the instruction length of each instruction inside that 8KB. How many instructions can it store?

As each compute unit doesn't have a local L2 (there are only slow L2s near the memory channels), I assume it is too expensive to have instructions sitting outside the L1i.

I had been calculating how to get a chess program to work within that 8KB, but haven't found a solution so far.

How many instructions can I fit inside that L1i?

What I assume is that several alternating threads can reuse the same L1i (which makes sense as an assumption).

The only solution I saw is to build different kernels doing different things, and sometimes launch a kernel that stores the transposition-table (=hashtable) entries to RAM; we can keep a bunch of numbers in RAM that we haven't processed yet, which lets us remove dubious entries that we haven't hit yet. This second kernel B would launch very quickly and do very little.

Yet the 2 kernels together add up to a pretty big instruction stream, so I don't yet see a way to do that on AMD.
Old 2012-11-28, 15:06   #576
Bdot (Nov 2010, Germany)

Quote:
Originally Posted by diep View Post
If I read that correctly, the 7970 is 2.5% faster at multiplication than the 6900 series. This according to the guy testing 'at home'.
I don't think you read that right. There was no comparison to the 6900. The mentioned 2.33% compares the 88 GOps (int 32x32) of the S-unit to the 3788.8 GFlops that the 7970 is supposed to deliver (so it compares the 7970 using the S-unit to the 7970 without it). If you just count 32x32 muls, there's an 18.6% improvement.

But you can do the math yourself:
5870: 2720GFlops / 544 DP-GFlops / 272 (32x32int-)Gops (320 SIMDs x 850MHz) + 1088 Gops (no mul32: e.g. mul24/add/shift/load/logic)
6970: 2703GFlops / 675 DP-GFlops / 338 (32x32int-)Gops (384 SIMDs x 880MHz)
7970: 3789GFlops / 947 DP-GFlops / 473(+88 if using S-unit) (32x32int-)Gops

The 5870 is better than the 6970 at 32x32 code because it can run all the shifts/masking/adds/carry calculations in parallel. My 32-bit kernels achieve ~65% occupancy on VLIW5. That means that on average, the t-unit plus 2.2 additional units are occupied, which in turn means that less than a third of the instructions are really bound to the t-unit (note that all type conversions float-to-int and back are also t-unit only).

So far, so good. But the lower-end GCN cards now have a problem:
7850: 1761GFlops / 110 DP-GFlops / 55(+41 if using S-unit) (32x32int-)Gops
compare to
5770: 1360GFlops / 0 DP-GFlops / 136 (32x32int-)Gops (160 SIMDs x 850MHz) + 544 Gops (no mul32: e.g. mul24/add/shift/load/logic)

They suck so terribly at 32x32 performance now (it is tied to the crippled DPFP speed) that the boost from the S-unit is more than welcome. It will be hard to use, and it still needs to be verified that it runs at full SP speed on the lower-end cards too. But it may be worth the effort.

Still, it needs to be noted that despite the bad numbers above, the 7850 is 30% faster than the 5770 when running the same 32-bit-mul kernel (the 15-bit kernels are ~55% faster). A big part of that comes from the architecture improvements I mentioned. And that's why it is plain wrong to call GCN "just a shrink" of VLIW.

Quote:
Originally Posted by diep View Post
Especially the response of the AMD helpdesk is typical:

"AMD does not provide any tools or support to do what you want."
sad but true
Old 2012-12-06, 11:53   #577
dbaugh (Aug 2005)

I just got a machine with a GTX590. It shows up in GPU-Z as two cards. With 6 instances of mfaktc it saturates one card and does not use the other. I was hoping it would just look like one fast card. How do I get the other half of the card in the game? Actually, four instances are enough to take half the card to 99% if SievePrimes drops low enough.

Last fiddled with by dbaugh on 2012-12-06 at 12:14 Reason: more info
Old 2012-12-06, 14:09   #578
diep (Sep 2006, The Netherlands)

Quote:
Originally Posted by dbaugh View Post
I just got a machine with a GTX590. It shows up in GPU-Z as two cards.
How many FC's a second are you seeing and for which exponent?
Old 2012-12-06, 15:29   #579
kjaget (Jun 2005)

Quote:
Originally Posted by dbaugh View Post
I just got a machine with a GTX590. It shows up in GPU-Z as two cards.
Add -d 1 (or maybe 2?) to the command line to access the second GPU.
Old 2012-12-06, 15:46   #580
Bdot (Nov 2010, Germany)

Quote:
Originally Posted by kjaget View Post
Add -d 1 (or maybe 2?) to the command line to access the second GPU.
Yes, the -d switch should work very similar for mfakto and mfaktc: -d <GPU-num> with GPU-num starting at 0, and 0 being the default.

Edit: Oops, I just checked the code and GPU-num actually starts at 1, with 1 being the default for mfakto. Now I'm not sure what it is for mfaktc ...
Edit2: Now I checked mfaktc, and there the device number starts at 0 ... Hmmm

Last fiddled with by Bdot on 2012-12-06 at 16:28 Reason: Oops, oops again
Old 2012-12-06, 16:23   #581
kracker ("Mr. Meeseeks", Jan 2012, California, USA)

I thought the 590 or 690 didn't need a -d switch; I thought they acted as one card via internal SLI.. hmm..
Old 2012-12-06, 17:10   #582
LaurV, Romulan Interpreter (Jun 2011, Thailand)

SLI or not SLI, you need the -d. You can enable/disable SLI from the nvidia drivers (right-click on the desktop in Windows, choose nvidia control panel, if you have it). Interestingly, I have a computer with 2x GTX580 in the box; if I enable/disable SLI, the -d switches are reversed (say the upper card, the one connected to the monitors, is 0 and the other is 1; if I play with SLI they are reversed, and now I don't know exactly which is which). The connector is two-way. There is no speed difference (no SLI seems faster, but that may be subjective, I did not really measure wall-clock time). For the 590 the differences would be a lower clock and a PCB (hardware) wired SLI instead of the two-way cable. The rest would be the same.

Edit: I saw a Mars II, unfortunately not mine, it was the same.

Last fiddled with by LaurV on 2012-12-06 at 17:13
Old 2012-12-06, 18:06   #583
kracker ("Mr. Meeseeks", Jan 2012, California, USA)

Quote:
Originally Posted by LaurV View Post
SLI or not SLI, you need the -d.
Ah, I see. Thanks for clarifying!

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.