mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfakto: an OpenCL program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=15646)

Bdot 2012-11-28 09:51

[QUOTE=diep;319749]I see you do something very similar to what I had proposed (is it already nearly 2 years ago? oh boy) back then.

Yet you manage to get the 25 mad24's I had proposed back then for the
mul_75_150 kernel (which was mul_70_140 in my proposal) down to 19
multiplications and 7 shifts, besides a few adds and ands.

How fast is shifting on the GPU?

I never could figure that out.

What I do see is that you already use the result of res->d3 directly, like this:

[code]
res->d3 = mad24(a.d0, b.d3, res->d3);

res->d4 = mad24(a.d4, b.d0, res->d3 >> 15);
[/code]Is it a good idea to use the result of res->d3 directly?

If you run 2 threads at the same time, it still gives a penalty of 4 cycles, doesn't it?

As it takes a cycle or 8 to retire the result and free it up for the next multiplication.

Isn't it possible to optimize that better?

Kind Regards,
Vincent[/QUOTE]

Hi Vincent,

also 2 years ago, I already explained that the VLIW4/5 architecture has special P-registers that always hold a copy of the previous cycle's operation result. This P register (actually 5x32bit: 1 scalar and 1 vector) is immediately available for the next instruction in the pipeline, without additional delay/penalty.

And for the code you cite, keep in mind that each of the values used is a 4-way vector for VLIW, and a 2-way vector for GCN. Each vector component processes a different factor candidate of the same exponent. Therefore, the code expands (on VLIW) to
[code]
res->d3.x = mad24(a.d0.x, b.d3.x, res->d3.x);
res->d3.y = mad24(a.d0.y, b.d3.y, res->d3.y);
res->d3.z = mad24(a.d0.z, b.d3.z, res->d3.z);
res->d3.w = mad24(a.d0.w, b.d3.w, res->d3.w);

res->d4.x = mad24(a.d4.x, b.d0.x, res->d3.x >> 15);
res->d4.y = mad24(a.d4.y, b.d0.y, res->d3.y >> 15);
res->d4.z = mad24(a.d4.z, b.d0.z, res->d3.z >> 15);
res->d4.w = mad24(a.d4.w, b.d0.w, res->d3.w >> 15);
[/code]Still, using vectors was necessary not to hide latency, but to provide enough independent instructions to fill all compute units.

Regarding the timing: shift, add, mul24/mad24, mul, mul_hi etc. all take one cycle in the pipeline (the difference between 32x32 and 24x24 is not the timing, but that 32x32 can only be done in the t-unit on VLIW5, while other instructions can still be co-issued to the xyzw-units). The pipeline length differs a bit between the architectures, but is somewhere around 10 stages. For GCN they reduced it quite a bit (to 7, I think, but I'm not sure). So if you write a kernel that does a single add, that single calculation will take 7 to ~10 cycles.

BTW, on GCN, you now have a separate scalar unit that helps with instruction decoding, branch/jump calculation etc. When using assembly, you can run 32x32 muls on this unit at full speed! OpenCL will only issue mul instructions to the vector units, but on the [URL="http://devgurus.amd.com/message/1186750#1279865"]AMD forum[/URL], you can find that for each mul24 in the vector units you can issue an additional mul for the scalar unit, increasing total throughput quite a bit.

diep 2012-11-28 10:40

[QUOTE=Bdot;319820]Hi Vincent,

also 2 years ago, I already explained that the VLIW4/5 architecture has special P-registers that always hold a copy of the previous cycle's operation result. This P register (actually 5x32bit: 1 scalar and 1 vector) is immediately available for the next instruction in the pipeline, without additional delay/penalty.

And for the code you cite, keep in mind that each of the values used is a 4-way vector for VLIW, and a 2-way vector for GCN. Each vector component processes a different factor candidate of the same exponent. Therefore, the code expands (on VLIW) to
[code]
res->d3.x = mad24(a.d0.x, b.d3.x, res->d3.x);
res->d3.y = mad24(a.d0.y, b.d3.y, res->d3.y);
res->d3.z = mad24(a.d0.z, b.d3.z, res->d3.z);
res->d3.w = mad24(a.d0.w, b.d3.w, res->d3.w);

res->d4.x = mad24(a.d4.x, b.d0.x, res->d3.x >> 15);
res->d4.y = mad24(a.d4.y, b.d0.y, res->d3.y >> 15);
res->d4.z = mad24(a.d4.z, b.d0.z, res->d3.z >> 15);
res->d4.w = mad24(a.d4.w, b.d0.w, res->d3.w >> 15);
[/code]Still, using vectors was necessary not to hide latency, but to provide enough independent instructions to fill all compute units.

Regarding the timing: shift, add, mul24/mad24, mul, mul_hi etc. all take one cycle in the pipeline (the difference between 32x32 and 24x24 is not the timing, but that 32x32 can only be done in the t-unit on VLIW5, while other instructions can still be co-issued to the xyzw-units). The pipeline length differs a bit between the architectures, but is somewhere around 10 stages. For GCN they reduced it quite a bit (to 7, I think, but I'm not sure). So if you write a kernel that does a single add, that single calculation will take 7 to ~10 cycles.

BTW, on GCN, you now have a separate scalar unit that helps with instruction decoding, branch/jump calculation etc. When using assembly, you can run 32x32 muls on this unit at full speed! OpenCL will only issue mul instructions to the vector units, but on the [URL="http://devgurus.amd.com/message/1186750#1279865"]AMD forum[/URL], you can find that for each mul24 in the vector units you can issue an additional mul for the scalar unit, increasing total throughput quite a bit.[/QUOTE]

Thanks for your extensive answer!

If I read that correctly, the 7970 is 2.5% faster at multiplying than the 6900 series, according to the guy testing 'at home'.

Especially the response of the AMD helpdesk is typical:

"AMD does not provide any tools or support to do what you want."

diep 2012-11-28 12:51

The decoding you mention is interesting; according to one posting I read, it is nowadays the biggest concern for CPU designers.

The L1i of the AMD GPUs is described as an ultra-tiny 8KB. What I do not know is the length of each instruction inside that 8KB. How many instructions can it store?

As each compute unit doesn't have a local L2 (there are only slow L2s near the memory channels), I assume it's too expensive to have instructions sitting outside the L1i.

I have been trying to work out how to fit a chess program within that 8 KB, but haven't found a solution so far.

How many instructions can I get inside that L1i?

I assume that several alternating threads can reuse the same L1i (which seems a sensible assumption).

The only solution I saw is to build different kernels doing different things, and sometimes launch a kernel that stores the transposition-table (hashtable) entries to RAM. There we can keep a bunch of numbers in RAM that we haven't processed yet, and remove dubious entries that we haven't hit yet. This second kernel B would launch very quickly and do very little.

Yet all that adds up to a pretty big instruction stream across the two kernels, so I haven't seen a way to do that on AMD yet.

Bdot 2012-11-28 15:06

[QUOTE=diep;319824]
If I read that correctly, the 7970 is 2.5% faster at multiplying than the 6900 series, according to the guy testing 'at home'.
[/QUOTE]
I don't think you read that right. There was no comparison to the 6900. The mentioned 2.33% compares the 88 GOps (int 32x32) of the S-unit to the 3788.8 GFlops that the 7970 is supposed to deliver (so it compares the 7970 using the S-unit to the 7970 without using the S-unit). If you just count 32x32 muls, then there's an 18.6% improvement.

But you can do the math yourself:
5870: 2720GFlops / 544 DP-GFlops / 272 (32x32int-)Gops (320 SIMDs x 850MHz) + 1088 Gops (no mul32: e.g. mul24/add/shift/load/logic)
6970: 2703GFlops / 675 DP-GFlops / 338 (32x32int-)Gops (384 SIMDs x 880MHz)
7970: 3789GFlops / 947 DP-GFlops / 473(+88 if using S-unit) (32x32int-)Gops

The 5870 is better than the 6970 at 32x32 code because it can run all the shifts/masking/adds/carry calculations in parallel. My 32-bit kernels achieve ~65% occupancy on VLIW5. That means that on average, the t-unit plus 2.2 additional units are occupied. This in turn means that less than a third of the instructions are really bound to the t-unit (note that all type conversions float-to-int and back are also t-unit only).

So far, so good. But the lower-end GCN cards now have a problem:
7850: 1761GFlops / 110 DP-GFlops / 55(+41 if using S-unit) (32x32int-)Gops
compare to
5770: 1360GFlops / 0 DP-GFlops / 136 (32x32int-)Gops (160 SIMDs x 850MHz) + 544 Gops (no mul32: e.g. mul24/add/shift/load/logic)

They suck so terribly at 32x32 performance now, because it is tied to the crippled DPFP speed, that the boost from the S-unit is more than welcome. It will be hard to use, and it still needs to be verified that it runs at full SP speed on the lower-end cards too. But it may be worth the effort.

Still it needs to be noted that despite the above bad numbers, the 7850 is 30% faster than the 5770 when running the same 32-bit-mul kernel (the 15-bit kernels are ~55% faster). A big part of that comes from the architecture improvements that I mentioned. And that's why it is plain wrong to call GCN "just a shrink" of VLIW.

[QUOTE=diep;319824]Especially the response of the AMD helpdesk is typical:

"AMD does not provide any tools or support to do what you want."[/QUOTE]

sad but true

dbaugh 2012-12-06 11:53

I just got a machine with a GTX590. It shows up in GPU-Z as two cards. With 6 instances of mfaktc it saturates one card and does not use the other. I was hoping it would just look like one fast card. How do I get the other half of the card in the game? Actually, four instances are enough to take half the card to 99% if SievePrimes drops low enough.

diep 2012-12-06 14:09

[QUOTE=dbaugh;320720]I just got a machine with a GTX590. It shows up in GPU-Z as two cards. With 6 instances of mfaktc it saturates one card and does not use the other. I was hoping it would just look like one fast card. How do I get the other half of the card in the game? Actually, four instances are enough to take half the card to 99% if SievePrimes drops low enough.[/QUOTE]

How many FC's per second are you seeing, and for which exponent?

kjaget 2012-12-06 15:29

[QUOTE=dbaugh;320720]I just got a machine with a GTX590. It shows up in GPU-Z as two cards. With 6 instances of mfaktc it saturates one card and does not use the other. I was hoping it would just look like one fast card. How do I get the other half of the card in the game? Actually, four instances are enough to take half the card to 99% if SievePrimes drops low enough.[/QUOTE]

Add -d 1 (or maybe 2?) to the command line to access the second GPU.

Bdot 2012-12-06 15:46

[QUOTE=kjaget;320738]Add -d 1 (or maybe 2?) to the command line to access the second GPU.[/QUOTE]

Yes, the -d switch should work very similarly for mfakto and mfaktc: -d <GPU-num>, with GPU-num starting at 0, and 0 being the default.

Edit: Oops, I just checked the code, and GPU-num actually starts at 1, with 1 being the default for mfakto. And now I'm not sure what it is for mfaktc ...
Edit2: Now I checked mfaktc, and there the device number starts at 0 ... Hmmm

kracker 2012-12-06 16:23

I thought the 590 or 690 didn't need a -d switch; I thought the two halves were tied together with internal SLI.. hmm..

LaurV 2012-12-06 17:10

SLI or not SLI, you need the -d. You can enable/disable SLI from the nvidia drivers (right-click on the desktop in Windows, choose nvidia ctrl panel, if you have it). Interestingly, I have a computer with 2x gtx580 in the box; if I enable/disable SLI, the -d switches are swapped (say, the upper card, the one connected to the monitors, is 0 and the other is 1; if I play with SLI, they get swapped, and now I don't know exactly which is which). The connector is two-way. There is no speed difference (no-SLI seems faster, but this may be subjective, I did not really measure wall-clock). For the 590 the differences would be a lower clock and a PCB (hardware) wired SLI instead of the two-way cable. The rest would be the same.

Edit: I saw a Mars II, unfortunately not mine, it was the same.

kracker 2012-12-06 18:06

[QUOTE=LaurV;320751]SLI or not SLI, you need the -d. You can enable/disable SLI from the nvidia drivers (right-click on the desktop in Windows, choose nvidia ctrl panel, if you have it). Interestingly, I have a computer with 2x gtx580 in the box; if I enable/disable SLI, the -d switches are swapped (say, the upper card, the one connected to the monitors, is 0 and the other is 1; if I play with SLI, they get swapped, and now I don't know exactly which is which). The connector is two-way. There is no speed difference (no-SLI seems faster, but this may be subjective, I did not really measure wall-clock). For the 590 the differences would be a lower clock and a PCB (hardware) wired SLI instead of the two-way cable. The rest would be the same.

Edit: I saw a Mars II, unfortunately not mine, it was the same.[/QUOTE]
Ah, I see. :smile: Thanks for clarifying!

