[QUOTE=diep;319749]I see you do something very similar to what I had proposed (is it already nearly 2 years ago - oh boy) back then.
Yet you manage to get the 25 mad24's I had proposed back then for the mul_75_150 kernel (which was mul_70_140 in my proposal) down to 19 multiplications and 7 shifts, besides a few adds and ands. How fast is shifting on the GPU? I never could figure that out. What I do see is that you already use the result of res->d3 directly, like this:
[code]
res->d3 = mad24(a.d0, b.d3, res->d3);
res->d4 = mad24(a.d4, b.d0, res->d3 >> 15);
[/code]
Is it a good idea to use the result of res->d3 directly? If you run 2 threads at the same time, doesn't it still give a penalty of 4 cycles? As it takes a cycle or 8 to retire the result and free it up for the next multiplication. Isn't it possible to optimize that better?

Kind Regards,
Vincent[/QUOTE]

Hi Vincent,

also 2 years ago, I already explained that the VLIW4/5 architecture has special P-registers that always hold a copy of the previous cycle's operation result. This P register (actually 5x32bit: 1 scalar and 1 vector) is immediately available for the next instruction in the pipeline, without additional delay/penalty.

And for the code you cite, keep in mind that each of the values used is a 4-way vector for VLIW, and a 2-way vector for GCN. Each vector component processes a different factor candidate of the same exponent. Therefore, the code expands (on VLIW) to
[code]
res->d3.x = mad24(a.d0.x, b.d3.x, res->d3.x);
res->d3.y = mad24(a.d0.y, b.d3.y, res->d3.y);
res->d3.z = mad24(a.d0.z, b.d3.z, res->d3.z);
res->d3.w = mad24(a.d0.w, b.d3.w, res->d3.w);
res->d4.x = mad24(a.d4.x, b.d0.x, res->d3.x >> 15);
res->d4.y = mad24(a.d4.y, b.d0.y, res->d3.y >> 15);
res->d4.z = mad24(a.d4.z, b.d0.z, res->d3.z >> 15);
res->d4.w = mad24(a.d4.w, b.d0.w, res->d3.w >> 15);
[/code]
Still, using vectors was not necessary to hide latency, but to provide enough independent instructions for filling all compute units.

Regarding the timing: shift, add, mul24/mad24, mul, mul_hi etc.
all take one cycle in the pipeline (the difference between 32x32 and 24x24 is not the timing, but that 32x32 can be done only in the t-unit on VLIW5 - still allowing other instructions to be co-issued for the xyzw-units). The pipeline length differs a bit between the architectures, but is somewhere around 10. For GCN they reduced it quite a bit (to 7, I think, but I'm not sure). So if you write a kernel that does a single add, it will take 7 or ~10 cycles for that single calculation.

BTW, on GCN, you now have a separate scalar unit that helps with instruction decoding, branch/jump calculation etc. When using assembly, you can run 32x32 muls on this unit at full speed! OpenCL will only issue mul instructions to the vector units, but on the [URL="http://devgurus.amd.com/message/1186750#1279865"]AMD forum[/URL], you can find that for each mul24 in the vector units you can issue an additional mul for the scalar unit, increasing total throughput quite a bit.
[QUOTE=Bdot;319820]Hi Vincent,
also 2 years ago, I already explained that the VLIW4/5 architecture has special P-registers that always hold a copy of the previous cycle's operation result. This P register (actually 5x32bit: 1 scalar and 1 vector) is immediately available for the next instruction in the pipeline, without additional delay/penalty. And for the code you cite, keep in mind that each of the values used is a 4-way vector for VLIW, and a 2-way vector for GCN. Each vector component processes a different factor candidate of the same exponent. Therefore, the code expands (on VLIW) to
[code]
res->d3.x = mad24(a.d0.x, b.d3.x, res->d3.x);
res->d3.y = mad24(a.d0.y, b.d3.y, res->d3.y);
res->d3.z = mad24(a.d0.z, b.d3.z, res->d3.z);
res->d3.w = mad24(a.d0.w, b.d3.w, res->d3.w);
res->d4.x = mad24(a.d4.x, b.d0.x, res->d3.x >> 15);
res->d4.y = mad24(a.d4.y, b.d0.y, res->d3.y >> 15);
res->d4.z = mad24(a.d4.z, b.d0.z, res->d3.z >> 15);
res->d4.w = mad24(a.d4.w, b.d0.w, res->d3.w >> 15);
[/code]
Still, using vectors was not necessary to hide latency, but to provide enough independent instructions for filling all compute units. Regarding the timing: shift, add, mul24/mad24, mul, mul_hi etc. all take one cycle in the pipeline (the difference between 32x32 and 24x24 is not the timing, but that 32x32 can be done only in the t-unit on VLIW5 - still allowing other instructions to be co-issued for the xyzw-units). The pipeline length differs a bit between the architectures, but is somewhere around 10. For GCN they reduced it quite a bit (to 7, I think, but I'm not sure). So if you write a kernel that does a single add, it will take 7 or ~10 cycles for that single calculation. BTW, on GCN, you now have a separate scalar unit that helps with instruction decoding, branch/jump calculation etc. When using assembly, you can run 32x32 muls on this unit at full speed!
OpenCL will only issue mul instructions to the vector units, but on the [URL="http://devgurus.amd.com/message/1186750#1279865"]AMD forum[/URL], you can find that for each mul24 in the vector units you can issue an additional mul for the scalar unit, increasing total throughput quite a bit.[/QUOTE]

Thanks for your extensive answer!

If I read it right, the 7970 is 2.5% faster at multiplying than the 6900 series. This according to the guy testing 'at home'. Especially the response of the AMD helpdesk is typical: "AMD does not provide any tools or support to do what you want."
Interesting is the decoding you mention - nowadays the biggest concern for CPU designers, so I read in one posting.

The L1i of the AMD GPUs used gets described as an ultra-tiny 8KB. What I do not know is the length of each instruction inside that 8KB. How many instructions can it store? As each compute unit doesn't have a local L2 - there are only slow L2s near the memory channels - I assume that it's too expensive to have instructions sitting outside the L1i. I had been calculating how to get a chess program to work within that 8 KB but didn't find a solution so far. How many instructions can I get inside that L1i? What I assume is that several alternating threads can reuse the same L1i (which makes sense as an assumption). The only solution I saw is to build different kernels doing different things, and sometimes launch a kernel that stores the transposition-table (=hashtable) entries to RAM, where we can keep a bunch of numbers we didn't process yet, and which can remove dubious entries we didn't hit yet. This second kernel B would launch very quickly and do very little. Yet all that adds up to a pretty big instruction stream across the 2 kernels, so I didn't see a way to do that on AMD yet.
[QUOTE=diep;319824]
If I read it right, the 7970 is 2.5% faster at multiplying than the 6900 series. This according to the guy testing 'at home'. [/QUOTE]

I don't think you read that right. There was no comparison to the 6900. The mentioned 2.33% compares the 88 GOps (int 32x32) of the S-unit to the 3788.8 GFlops that the 7970 is supposed to deliver (so it compares the 7970 using the S-unit to the 7970 without using the S-unit). If you just count 32x32 muls, then there's an 18.6% improvement. But you can do the math yourself:

5870: 2720 GFlops / 544 DP-GFlops / 272 (32x32int-)Gops (320 SIMDs x 850MHz) + 1088 Gops (no mul32: e.g. mul24/add/shift/load/logic)
6970: 2703 GFlops / 675 DP-GFlops / 338 (32x32int-)Gops (384 SIMDs x 880MHz)
7970: 3789 GFlops / 947 DP-GFlops / 473 (+88 if using S-unit) (32x32int-)Gops

The 5870 is better than the 6970 at 32x32 code because it can run all the shifts/masking/adding/carry-calculation in parallel. My 32-bit kernels achieve ~65% occupation on VLIW5. That means that on average, the t-unit and 2.2 additional units are occupied. This in turn means that less than a third of the instructions are really bound to the t-unit (note that all type conversions float-to-int and back are also t-unit only). So far, so good. But the lower-end GCN cards now have a problem:

7850: 1761 GFlops / 110 DP-GFlops / 55 (+41 if using S-unit) (32x32int-)Gops

compare to

5770: 1360 GFlops / 0 DP-GFlops / 136 (32x32int-)Gops (160 SIMDs x 850MHz) + 544 Gops (no mul32: e.g. mul24/add/shift/load/logic)

Their 32x32 performance is now so terribly bad, tied as it is to the crippled DPFP speed, that the boost from the S-unit is more than welcome. It will be hard to use, and it still needs to be tested whether it runs at full SP speed on the lower-end cards too. But it may be worth the effort. Still, it needs to be noted that despite the above bad numbers, the 7850 is 30% faster than the 5770 when running the same 32-bit-mul kernel (the 15-bit kernels are ~55% faster).
A big part of that comes from the architecture improvements that I mentioned. And that's why it is plain wrong to call GCN "just a shrink" of VLIW.

[QUOTE=diep;319824]Especially the response of the AMD helpdesk is typical: "AMD does not provide any tools or support to do what you want."[/QUOTE]

Sad but true.
I just got a machine with a GTX590. It shows up in GPU-Z as two cards. With 6 instances of mfaktc it saturates one card and does not use the other. I was hoping it would just look like one fast card. How do I get the other half of the card in the game? Actually, four is enough to take half the card to 99% if the SievePrimes drops low enough.
[QUOTE=dbaugh;320720]I just got a machine with a GTX590. It shows up in GPU-Z as two cards. With 6 instances of mfaktc it saturates one card and does not use the other. I was hoping it would just look like one fast card. How do I get the other half of the card in the game? Actually, four is enough to take half the card to 99% if the SievePrimes drops low enough.[/QUOTE]
How many FCs per second are you seeing, and for which exponent?
[QUOTE=dbaugh;320720]I just got a machine with a GTX590. It shows up in GPU-Z as two cards. With 6 instances of mfaktc it saturates one card and does not use the other. I was hoping it would just look like one fast card. How do I get the other half of the card in the game? Actually, four is enough to take half the card to 99% if the SievePrimes drops low enough.[/QUOTE]
Add -d 1 (or maybe 2?) to the command line to access the second GPU.
[QUOTE=kjaget;320738]Add -d 1 (or maybe 2?) to the command line to access the second GPU.[/QUOTE]
Yes, the -d switch should work very similarly for mfakto and mfaktc: -d <GPU-num>, with GPU-num starting at 0, and 0 being the default.

Edit: Oops, I just checked the code, and GPU-num actually starts at 1, with 1 being the default for mfakto. And now I'm not sure what this is for mfaktc ...

Edit2: Now I checked mfaktc, and there the device number starts at 0 ... Hmmm
I thought the 590 or 690 didn't need a -d switch - I thought the two halves were tied together with internal SLI ... hmm.
SLI or not, you need the -d. You can enable/disable SLI from the nvidia drivers (right-click on the desktop in Windows and choose the nvidia control panel, if you have it). Interestingly, I have a computer with 2x GTX580 in the box; if I enable/disable SLI, the -d switches are reversed (say the upper card, the one connected to the monitors, is 0 and the other is 1; if I play with SLI, they are reversed, and now I don't know exactly which is which). The connector is two-way. There is no speed difference (no SLI seems faster, but this may be subjective; I did not really measure wall-clock time). For the 590 the differences would be a lower clock and a PCB (hardware) wired SLI, instead of the two-way cable. The rest would be the same.

Edit: I saw a Mars II, unfortunately not mine, it was the same.
[QUOTE=LaurV;320751]SLI or not, you need the -d. You can enable/disable SLI from the nvidia drivers (right-click on the desktop in Windows and choose the nvidia control panel, if you have it). Interestingly, I have a computer with 2x GTX580 in the box; if I enable/disable SLI, the -d switches are reversed (say the upper card, the one connected to the monitors, is 0 and the other is 1; if I play with SLI, they are reversed, and now I don't know exactly which is which). The connector is two-way. There is no speed difference (no SLI seems faster, but this may be subjective; I did not really measure wall-clock time). For the 590 the differences would be a lower clock and a PCB (hardware) wired SLI, instead of the two-way cable. The rest would be the same.

Edit: I saw a Mars II, unfortunately not mine, it was the same.[/QUOTE]

Ah, I see. :smile: Thanks for clarifying!