[QUOTE=James Heinrich;373890]That would provide consistent data to map the 3D performance variance for the various GPUs. More data than I currently want to analyze, but could be interesting.[/QUOTE]
I would be very interested in having access to that kind of data -- to analyze in a 3- or 4-dimensional space. As you know, I don't have privileged access to Primenet. But I understand that Primenet records (or, at least, is told) what client did what work. If this knowledge could be exposed to those interested, it could be quite valuable.
[QUOTE=kracker;373891]60M on a HD 7770:
70-71: 153 GHz-d/d
71-72: 154 GHz-d/d
72-73: 154 GHz-d/d
73-74: 132 GHz-d/d
35M on the same card:
68-69: 188
69-70: 178
70-71: 160
71-72: 160
I'm curious if mfaktc is more "smooth".[/QUOTE]
Well, with mfakto, it switches from barrett15_73_gs to barrett15_82_gs there, where mfaktc is using barrett76_mul32.
[QUOTE=NickOfTime;373893]Well, with mfakto, it switches from barrett15_73_gs to barrett15_82_gs where mfaktc is using barrett76_mul32[/QUOTE]
[URL="https://github.com/Bdot42/mfakto/blob/master/src/mfakto.cpp"]If interested...[/URL]
[QUOTE=kracker;373894][URL="https://github.com/Bdot42/mfakto/blob/master/src/mfakto.cpp"]If interested...[/URL][/QUOTE]
Hmm, there is a BARRETT76_MUL32_GS. The only obvious difference is that it has the stages=1 flag. I checked my ini and Stages=1 is set; maybe something about GCN is disabling it, or some other bug... Nope: 76 is mul32 where 82 is mul15 in find_fastest_kernel:
[CODE]/* GPU_GCN (7850@1050MHz, v=2) / (7770@1100MHz) */
BARRETT69_MUL15, // "cl_barrett15_69" (393.88 M/s) / (259.96 M/s)
BARRETT70_MUL15, // "cl_barrett15_70" (393.47 M/s) / (259.69 M/s)
BARRETT71_MUL15, // "cl_barrett15_71" (365.89 M/s) / (241.50 M/s)
BARRETT73_MUL15, // "cl_barrett15_73" (322.45 M/s) / (212.96 M/s)
BARRETT82_MUL15, // "cl_barrett15_82" (285.47 M/s) / (188.74 M/s)
BARRETT76_MUL32, // "cl_barrett32_76" (282.95 M/s) / (186.72 M/s)
BARRETT77_MUL32, // "cl_barrett32_77" (274.09 M/s) / (180.93 M/s)[/CODE]
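The kernel switch follows from how a fastest-first list like the one above gets searched. Here is a minimal sketch of that selection logic (hypothetical names and simplified structure, not mfakto's actual find_fastest_kernel code): kernels are ordered by measured throughput, and the first entry whose bit capacity covers the target bit level wins.

```c
#include <stddef.h>

/* Hypothetical sketch of fastest-first kernel selection (not mfakto's
 * actual code): entries are ordered by measured GCN throughput, and we
 * take the first one whose bit capacity covers the target bit level. */
struct kernel { const char *name; int max_bits; };

static const struct kernel kernels[] = {
    { "cl_barrett15_69", 69 },  /* fastest on GCN */
    { "cl_barrett15_70", 70 },
    { "cl_barrett15_71", 71 },
    { "cl_barrett15_73", 73 },
    { "cl_barrett15_82", 82 },
    { "cl_barrett32_76", 76 },  /* slower than 15_82 on GCN */
    { "cl_barrett32_77", 77 },
};

static const char *pick_kernel(int target_bits)
{
    for (size_t i = 0; i < sizeof kernels / sizeof kernels[0]; i++)
        if (kernels[i].max_bits >= target_bits)
            return kernels[i].name;
    return NULL; /* no kernel can reach this bit level */
}
```

With this ordering, picking for a target of 73 bits yields cl_barrett15_73, while a target of 74 bits jumps straight to cl_barrett15_82 (cl_barrett32_76 would also fit, but it sits later in the fastest-first list), which matches the slowdown seen between the 72-73 and 73-74 runs.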
My HD7950 @900MHz is also more 'efficient' in the DC TF range.
mfakto v0.14:
35M:
69-70 [cl_barrett15_71_gs_2] [B]420 GHz-d/d[/B]
70-71 [cl_barrett15_73_gs_2] 380 GHz-d/d
69M:
71-72 [cl_barrett15_73_gs_2] 366 GHz-d/d
72-73 [cl_barrett15_73_gs_2] 366 GHz-d/d
73-74 [cl_barrett15_82_gs_2] 327 GHz-d/d
[QUOTE=VictordeHolland;373902]My HD7950 @900MHz is also more 'efficient' in the DC TF range.[/QUOTE]
Then do everything you can in the DC range to 70. Others will finish the exponents and release them for DCing.
[QUOTE=James Heinrich;373882]That seems unexpectedly lower than the 212 GHz-d/d [URL="http://www.mersenne.ca/mfaktc.php"]my chart[/URL] predicts.[/QUOTE]
Not really; as we discussed before, mfakto (AMD/OpenCL (?!?)) is known for getting lazy at higher bit levels. See my earlier posts on the subject. Now I can prove that it comes from the (Barrett? Montgomery?) kernels, which take better advantage of the architecture at lower bit levels. For example, my 7970 crunches 630 GHz-d/d at ~40M to 69 bits, but drops as low as 400 GHz-d/d at ~65M to 74 bits. The optimum for this card is either TF to ~70/71 bits, or DC of a ~37M exponent (where a power-of-2 FFT length is used optimally).
There are 3 factors that influence mfakto (and mfaktc) performance:
[LIST][*]Most important: the kernel being used (selected only by target bit level). Different algorithms / data chunk sizes have different effects. For mfaktc you can see the effect when going beyond 76 bits; there it will also switch kernels.
[*]Measurable: the size of the exponent in bits. For each bit, the exponentiation/modulo loop needs to run once. The first ~6 bits are for free, and there is some one-time overhead, so the effect is not proportional, but that is why the same bit level in the DC range is faster than in the LL range.
[*]Negligible: the number of '1's vs. '0's in the exponent (in binary). For every '1', a small additional step needs to be done. I think this is only measurable if you have no other noise impacting the speed.[/LIST]On AMD hardware, the 32-bit kernels have quite a penalty because 32-bit muls are executed by the DP unit, so they have the same SP/DP performance ratio (1:16 on low- and mid-level hardware, 1:4 on high end). In addition, the carry flag is not usable in OpenCL and needs extra code to emulate it. Therefore, 15-bit kernels were my fastest implementation, utilizing fast 24-bit multiplications and having room for the carry flag.
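The second and third points can be made concrete with the core loop of trial factoring: a candidate f divides the Mersenne number 2^p - 1 exactly when 2^p == 1 (mod f). A minimal sketch of square-and-multiply in plain 64-bit C (not mfakto's vectorized 15/32-bit kernel code) shows one squaring per bit of p, plus one cheap extra doubling only for the '1' bits:

```c
#include <stdint.h>

/* Sketch of the TF core loop, not mfakto's actual kernel code.
 * Left-to-right binary exponentiation computes 2^p mod f: every bit
 * of p costs one modular squaring; each '1' bit adds one doubling.
 * Valid for f < 2^32, so r*r cannot overflow a uint64_t. */
static int is_factor(uint64_t p, uint32_t f)
{
    uint64_t r = 1;
    for (int i = 63; i >= 0; i--) {
        r = (r * r) % f;          /* every bit: one squaring */
        if ((p >> i) & 1)
            r = (r << 1) % f;     /* '1' bits only: one doubling */
    }
    return r == 1;                /* 2^p == 1 (mod f) => f | 2^p - 1 */
}
```

For example, 2^11 - 1 = 2047 = 23 * 89, so is_factor(11, 23) and is_factor(11, 89) both return 1, while is_factor(11, 7) returns 0. The loop length tracks the bit count of p, which is why a 35M exponent is cheaper per candidate than a 66M one, and the popcount of p only adds the cheap doublings.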
[QUOTE=LaurV;373916]Not really; as we discussed before, mfakto (AMD/OpenCL (?!?)) is known for getting lazy at higher bit levels. See my earlier posts on the subject. Now I can prove that it comes from the (Barrett? Montgomery?) kernels, which take better advantage of the architecture at lower bit levels.
For example, my 7970 crunches 630 GHz-d/d at ~40M to 69 bits, but drops as low as 400 GHz-d/d at ~65M to 74 bits. The optimum for this card is either TF to ~70/71 bits, or DC of a ~37M exponent (where a power-of-2 FFT length is used optimally).[/QUOTE]
I think anything below 73 bits is fine.
[QUOTE=kracker;373938]I think anything below 73 bits is fine.[/QUOTE]
OK. Can we think about and discuss this? The whole point of GPU72 is to optimize the available GPU firepower. I have been using James' analysis as to where the cross-over points should be (read: where TF'ing Makes More Sense than LL'ing or DC'ing). I'm more than happy to add additional "WMS" options for different card types.
If you are collecting these tests now, I can participate. I just checked the stats for my cards on mfaktc 0.20:
GTX670@1176MHz:
[CODE]Exp  toBit  GHz-d/d
66M  74     275.8
66M  73     275.9
66M  72     276.2
66M  71     276.2
35M  71     284.8
35M  70     284.7
35M  69     284.6[/CODE]
GTX460M@675MHz:
[CODE]Exp  toBit  GHz-d/d
66M  74     96.5
66M  73     96.4
66M  72     96.4
66M  71     96.4
35M  71     100.2
35M  70     100.2
35M  69     100.1[/CODE]
mfaktc seems to be a lot smoother across exponents and bit levels.