#573 | |
|
Nov 2010
Germany
1125₈ Posts |
Quote:
Also, 2 years ago I already explained that the VLIW4/5 architecture has special P-registers that always hold a copy of the previous cycle's operation result. This P register (actually 5x32bit: 1 scalar and 1 vector) is immediately available for the next instruction in the pipeline, without additional delay/penalty.

And for the code you cite, keep in mind that each of the values used is a 4-way vector for VLIW, and a 2-way vector for GCN. Each vector component processes a different factor candidate of the same exponent. Therefore, the code expands (on VLIW) to

Code:
res->d3.x = mad24(a.d0.x, b.d3.x, res->d3.x);
res->d3.y = mad24(a.d0.y, b.d3.y, res->d3.y);
res->d3.z = mad24(a.d0.z, b.d3.z, res->d3.z);
res->d3.w = mad24(a.d0.w, b.d3.w, res->d3.w);
res->d4.x = mad24(a.d4.x, b.d0.x, res->d3.x >> 15.x);
res->d4.y = mad24(a.d4.y, b.d0.y, res->d3.y >> 15.y);
res->d4.z = mad24(a.d4.z, b.d0.z, res->d3.z >> 15.z);
res->d4.w = mad24(a.d4.w, b.d0.w, res->d3.w >> 15.w);

Regarding the timing: shift, add, mul24/mad24, mul, mul_hi etc. all take one cycle in the pipeline (the difference between 32x32 and 24x24 is not the timing, but that 32x32 can be done only in the t-unit on VLIW5 - still allowing other instructions to be co-issued on the xyzw-units). The pipeline length differs a bit between the architectures, but is somewhere around 10. For GCN they reduced it quite a bit (to 7, I think, but I'm not sure). So if you write a kernel that does a single add, it will take 7 or ~10 cycles for that single calculation.

BTW, on GCN you now have a separate scalar unit that helps with instruction decoding, branch/jump calculation etc. When using assembly, you can run 32x32 muls on this unit at full speed! OpenCL will only issue mul instructions to the vector units, but on the AMD forum you can find that for each mul24 in the vector units you can issue an additional mul on the scalar unit, increasing total throughput quite a bit.

Last fiddled with by Bdot on 2012-11-28 at 10:12 Reason: Actually, the constant 15 is also widened to a vector ...
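As a rough illustration of the per-component expansion described above, here is a small Python emulation of a 4-lane `mad24` followed by the 15-bit carry shift. It is only a sketch: `mad24` here models the unsigned case with operands that already fit in 24 bits, and the helper name `mad24_vec` and all limb values are made up for the example.

```python
MASK24 = (1 << 24) - 1
MASK32 = (1 << 32) - 1


def mad24(a, b, c):
    """Emulate unsigned mad24: multiply the low 24 bits of a and b, add c,
    and keep the low 32 bits (assumes operands fit in 24 bits)."""
    return ((a & MASK24) * (b & MASK24) + c) & MASK32


def mad24_vec(a, b, c):
    """Apply mad24 independently to each lane (x, y, z, w)."""
    return [mad24(ai, bi, ci) for ai, bi, ci in zip(a, b, c)]


# Four factor candidates side by side, one per lane (hypothetical 15-bit limbs).
a_d0 = [0x7FFF, 0x1234, 0x0001, 0x4000]
b_d3 = [0x7FFF, 0x0002, 0x7FFF, 0x4000]
res_d3 = [0, 10, 20, 30]

# res->d3 = mad24(a.d0, b.d3, res->d3), one lane at a time
res_d3 = mad24_vec(a_d0, b_d3, res_d3)

# res->d3 >> 15: the scalar constant 15 is applied to every lane
carry = [v >> 15 for v in res_d3]

print(res_d3)
print(carry)
```

Each lane is computed without reference to the others, which is why the four `.x/.y/.z/.w` statements can be packed into one VLIW bundle.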
#574 | |
|
Sep 2006
The Netherlands
729₁₀ Posts |
Quote:
If I read it correctly, the 7970 is 2.5% faster at multiplying than the 6900 series, according to the guy testing 'at home'. The response of the AMD helpdesk is especially typical: "AMD does not provide any tools or support to do what you want." Last fiddled with by diep on 2012-11-28 at 10:41
#575 |
|
Sep 2006
The Netherlands
729₁₀ Posts |
The decoding you mention is interesting - according to one posting I read, it is nowadays the biggest concern for CPU designers.

The L1i of the AMD GPUs used is described as an ultratiny 8 KB. What I do not know is the length of each instruction inside that 8 KB. How many instructions can it store? As each compute unit doesn't have a local L2 (there are only slow L2s near the memory channels), I assume that it's too expensive to have instructions sitting outside the L1i.

I have been calculating how to get a chess program to work within that 8 KB but haven't found a solution so far. How many instructions can I get inside that L1i? I assume that several alternating threads can reuse the same L1i (which makes sense as an assumption).

The only solution I saw is to build different kernels doing different things, and sometimes launch one of the kernels that can store the transposition table (= hash table) entries to RAM, where we can keep a bunch of numbers that we haven't processed yet, and which can remove dubious entries that we haven't hit yet. This second kernel B would just launch very quickly and do very little. Yet all that is a pretty big instruction stream if we add up the 2 kernels, so I haven't seen a way to do that on AMD yet.
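The "how many instructions fit" question comes down to a single division once an instruction width is known. The sketch below only shows that arithmetic: the 8 KB figure is the one quoted in the post, while the 4-byte and 8-byte widths are purely hypothetical placeholders, not confirmed encodings for any particular AMD GPU.

```python
L1I_BYTES = 8 * 1024  # cache size as quoted in the post


def instructions_that_fit(cache_bytes, instr_bytes):
    """Upper bound on instructions held if every instruction has the same width."""
    return cache_bytes // instr_bytes


# Hypothetical instruction widths in bytes (assumptions, not vendor data).
capacity = {width: instructions_that_fit(L1I_BYTES, width) for width in (4, 8)}

for width, count in capacity.items():
    print(f"{width}-byte instructions: at most {count}")
```

With variable-length encodings or embedded literals the real number would be lower, so these values are best read as ceilings.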
#576 | |
|
Nov 2010
Germany
3·199 Posts |
Quote:
But you can do the math yourself:

5870: 2720 GFlops / 544 DP-GFlops / 272 (32x32int-)Gops (320 SIMDs x 850MHz) + 1088 Gops (no mul32: e.g. mul24/add/shift/load/logic)
6970: 2703 GFlops / 675 DP-GFlops / 338 (32x32int-)Gops (384 SIMDs x 880MHz)
7970: 3789 GFlops / 947 DP-GFlops / 473 (+88 if using S-unit) (32x32int-)Gops

The 5870 is better than the 6970 at 32x32 code because it can run all the shifts/masking/adding/carry-calculation in parallel. My 32-bit kernels achieve ~65% occupation on VLIW5. That means that, on average, the t-unit and 2.2 additional units are occupied. This in turn means that less than a third of the instructions are really bound to the t-unit (note that all type conversions float-to-int and back are also t-unit only).

So far, so good. But the lower-end GCN cards now have a problem:

7850: 1761 GFlops / 110 DP-GFlops / 55 (+41 if using S-unit) (32x32int-)Gops

compare to

5770: 1360 GFlops / 0 DP-GFlops / 136 (32x32int-)Gops (160 SIMDs x 850MHz) + 544 Gops (no mul32: e.g. mul24/add/shift/load/logic)

They suck so terribly in 32x32 performance now, because it is tied to the crippled DPFP speed, that the boost from the S-unit is more than welcome. It will be hard to use, and it still needs to be tested whether it runs at full SP speed on the lower-end cards too. But it may be worth the effort.

Still, it needs to be noted that despite the bad numbers above, the 7850 is 30% faster than the 5770 when running the same 32-bit-mul kernel (the 15-bit kernels are ~55% faster). A big part of that comes from the architecture improvements that I mentioned. And that's why it is plain wrong to call GCN "just a shrink" of VLIW.

sad but true
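The figures above can be re-derived in a few lines. This sketch uses only the unit counts and clocks given in the post. Two readings are assumptions: that the "+1088" and "+544" figures correspond to four xyzw units per t-unit, and that the t-unit is busy in every cycle when turning the ~65% occupation into a t-unit share.

```python
def gops(units, clock_mhz):
    """Peak operations per second in Gops for `units` issuing one op per cycle."""
    return units * clock_mhz / 1000


# (units capable of 32x32 mul, clock in MHz), as listed in the post
cards = {"5870": (320, 850), "6970": (384, 880), "5770": (160, 850)}

mul32 = {name: round(gops(units, mhz)) for name, (units, mhz) in cards.items()}

# Assumption: on VLIW5 the four xyzw units add 4x the t-unit rate for non-mul32 ops.
other_vliw5 = {name: 4 * mul32[name] for name in ("5870", "5770")}

# ~65% occupation of 5 slots: how many units are busy, and the t-unit's share.
busy_units = 0.65 * 5               # t-unit plus roughly 2.2 others
t_unit_share = 1 / busy_units       # assumes the t-unit is busy every cycle

print(mul32, other_vliw5)
print(round(busy_units, 2), round(t_unit_share, 2))
```

The t-unit share comes out just under one third, which is where the "less than a third of the instructions" remark comes from.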
#577 |
|
Aug 2005
2×59 Posts |
I just got a machine with a GTX590. It shows up in GPU-Z as two cards. With 6 instances of mfaktc it saturates one card and does not use the other. I was hoping it would just look like one fast card. How do I get the other half of the card in the game? Actually, four is enough to take half the card to 99% if the SievePrimes drops low enough.
Last fiddled with by dbaugh on 2012-12-06 at 12:14 Reason: more info |
#578 | |
|
Sep 2006
The Netherlands
3⁶ Posts |
Quote:
#579 | |
|
Jun 2005
81₁₆ Posts |
Quote:
#580 | |
|
Nov 2010
Germany
3×199 Posts |
Quote:
Edit: Oops, I just checked the code, and GPU-num actually starts at 1, with 1 being the default for mfakto. And now I'm not sure how this works for mfaktc ... Edit2: Now I checked mfaktc, and there the device number starts at 0 ... Hmmm Last fiddled with by Bdot on 2012-12-06 at 16:28 Reason: Oops, oops again
#581 |
|
"Mr. Meeseeks"
Jan 2012
California, USA
100001111000₂ Posts |
I thought the 590 or 690 didn't need a -d switch, thought they were together with internal SLI.. hmm..
#582 |
|
Romulan Interpreter
Jun 2011
Thailand
7²·197 Posts |
SLI or not SLI, you need the -d. You can enable/disable SLI from the nvidia drivers (right click on the desktop in Windows, choose the nvidia control panel, if you have it). Interestingly, I have a computer with 2x GTX580 in the box, and if I enable/disable SLI, the -d numbers are swapped (say the upper card, the one connected to the monitors, is 0 and the other is 1; if I play with SLI they are swapped, and now I don't know exactly which is which). The connector is two-way. There is no speed difference (no SLI seems faster, but this may be subjective; I did not really measure wall-clock time). For the 590 the differences would be a lower clock and an SLI link wired on the PCB (hardware) instead of the two-way cable. The rest would be the same.
Edit: I saw a Mars II, unfortunately not mine, it was the same. Last fiddled with by LaurV on 2012-12-06 at 17:13 |
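To make the `-d` advice concrete, here is a small sketch that builds one command line per mfaktc instance and spreads the instances over both halves of a dual-GPU card. It only constructs the argument lists and does not launch anything. The binary path `./mfaktc.exe` and the helper name `instance_commands` are hypothetical, the device numbers 0 and 1 follow the remark above that mfaktc numbering starts at 0, and three instances per device is just an example value.

```python
def instance_commands(binary, devices, per_device):
    """Return one argument list per instance, each pinned to a device via -d."""
    return [
        [binary, "-d", str(device)]
        for device in devices
        for _ in range(per_device)
    ]


# Hypothetical setup: two GPUs on one card, three instances each.
cmds = instance_commands("./mfaktc.exe", devices=[0, 1], per_device=3)

for cmd in cmds:
    print(" ".join(cmd))
```

Because the device order can change when SLI is toggled, it is worth checking which physical GPU each number maps to before relying on a fixed assignment.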
#583 | |
|
"Mr. Meeseeks"
Jan 2012
California, USA
2³×271 Posts |
Quote:
Thanks for clarifying!