[QUOTE=Mark Rose;389073]How much of it is the GHz-d calculation and how much of it is extra math? I haven't looked very much at mfakto's code. I'm curious. mfaktc's barrett76 kernel needs only 5 32-bit ints and 9 multiplies for a 76 bit x 76 bit product, but the barrett77 kernel requires 6 ints and 12 multiplies for 77 bit x 77 bit product. There's about a 20% drop in performance going from 76 to 77 bits, before taking into account the GHz-d formula penalty for higher bit levels.[/QUOTE]
[I]Barrett[/I] is more than just one square operation (for which you counted the ops).
[QUOTE=Mark Rose;389079]There's a big ~20% performance hit beyond 76 bits for all Nvidia cards.[/QUOTE]
[B]Not true[/B], but to be fair you've noticed yourself. See below.
[QUOTE=Mark Rose;389148]Trial factoring anything up to 76 bits is fast with mfaktc. Trial factoring 77 bits is slower. Here are some GHz-d/day numbers for M39467291 on a GTX 580 (at 1544MHz):
69,70: 426.02
70,71: 426.02
71,72: 424.85
72,73: 424.52
73,74: 424.32
74,75: 424.56
75,76: 424.39
76,77: 423.28
77,78: 414.38 // okay, not as bad as I remembered! The 20% I remembered was from [url=http://mersenneforum.org/showpost.php?p=306572&postcount=1824]this post[/url]. The new barrett76 kernel is only usable for 76 bits (77 overflows), and so a less efficient kernel must be used.
78,79: 414.24
79,80: 414.32
I don't have time to do more benchmarking at the moment.[/QUOTE]
You'll see the same performance up to 2[SUP]87[/SUP] and a very minor performance drop to 2[SUP]88[/SUP]. Above 2[SUP]88[/SUP] there will be a bigger drop.
[B][U]RAW[/U][/B] kernel benchmarks (million FCs per second without sieve):
[CODE]GeForce GT 440 (CC 2.1)
mfaktc 0.21-pre4 // 319.60 + CUDA 5.5
./mfaktc.exe -tf 66362159 66 67
71bit_mul24       29.23M/s
75bit_mul32       42.22M/s
95bit_mul32       33.16M/s
barrett76_mul32   79.23M/s
barrett77_mul32   74.94M/s
barrett79_mul32   64.18M/s
barrett87_mul32   75.51M/s
barrett88_mul32   75.46M/s
barrett92_mul32   61.93M/s
-------------------------------------
Tesla K20m (CC 3.5)
mfaktc 0.21-pre4 // 331.20 + CUDA 5.5
./mfaktc.exe -tf 66362159 68 69
71bit_mul24      160.51M/s
75bit_mul32      200.32M/s
95bit_mul32      155.13M/s
barrett76_mul32  392.31M/s
barrett77_mul32  367.17M/s
barrett79_mul32  314.82M/s
barrett87_mul32  368.01M/s (without funnel-shift 357.09M/s)
barrett88_mul32  367.45M/s (without funnel-shift 347.80M/s)
barrett92_mul32  306.60M/s (without funnel-shift 293.69M/s)
-------------------------------------
GeForce GTX 275 (CC 1.3)
mfaktc 0.21-pre5 // 319.37 + CUDA 5.5
./mfaktc.exe -tf 66362159 66 67
71bit_mul24       77.64M/s
75bit_mul32       62.59M/s
95bit_mul32       50.34M/s
barrett76_mul32   85.83M/s
barrett77_mul32   82.48M/s
barrett79_mul32   73.56M/s
barrett87_mul32   75.93M/s
barrett88_mul32   75.41M/s
barrett92_mul32   65.80M/s[/CODE]
With sieving there will be a constant penalty added to each kernel, so the relative performance difference between those kernels will be a little bit smaller than the [B][U]RAW[/U][/B] speeds suggest.
barrett76, 77 and 79 can do 2[SUP]64[/SUP] to 2[SUP]<upper limit for the kernel>[/SUP] in ONE step, while barrett87, 88 and 92 can only do one bitlevel at once. But above 2[SUP]76[/SUP] I think this is not a real concern.
Oliver |
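Editor's illustration of the sieving remark, not mfaktc code: a rough Python sketch of why a constant per-candidate cost shrinks the relative gap between two kernels. The raw speeds are the GT 440 figures quoted in the benchmark; the penalty of 0.005 microseconds per candidate and the helper names are made up purely for illustration.

```python
# Raw kernel speeds (million factor candidates per second) for the GT 440,
# copied from the benchmark in the post.
RAW_MFC_PER_S = {
    "barrett76_mul32": 79.23,
    "barrett77_mul32": 74.94,
}


def sieved_speed(raw_mfc_per_s, sieve_us_per_fc):
    """Speed after adding a constant per-candidate cost.

    1 / (M FC/s) is microseconds per candidate, so the penalty is added
    in the time domain and the result is converted back to M FC/s.
    """
    return 1.0 / (1.0 / raw_mfc_per_s + sieve_us_per_fc)


def ratio_77_to_76(sieve_us_per_fc=0.0):
    """barrett77 speed as a fraction of barrett76 speed."""
    s76 = sieved_speed(RAW_MFC_PER_S["barrett76_mul32"], sieve_us_per_fc)
    s77 = sieved_speed(RAW_MFC_PER_S["barrett77_mul32"], sieve_us_per_fc)
    return s77 / s76


raw = ratio_77_to_76()           # raw kernels only
sieved = ratio_77_to_76(0.005)   # hypothetical penalty, not a measured value
print(f"raw: {raw:.3f}  with sieve penalty: {sieved:.3f}")
```

With the raw numbers the 77-bit kernel runs roughly 5% slower than the 76-bit one; any positive constant penalty moves the ratio closer to 1, which is the effect described in the post.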
So if I read the GPU72 report correctly...and if hypothetically we maintain the current DC-TF rate...then DC-TF would be a thing of the past before the end of 2015...
|
[QUOTE=Mark Rose;389148]Here are some GHz-d/day numbers for M39467291 on a GTX 580 (at 1544MHz):[/QUOTE]
Same idea I had, I just did the same kind of benchmark on M99999989 on a GTX 570:
[code]66,67 = 274.5
67,68 = 274.5
68,69 = 278.8
69,70 = 280.2
70,71 = 279.9
71,72 = 280.0
72,73 = 280.0
73,74 = 276.9
74,75 = 276.9
75,76 = 276.9
76,77 = 268.0
77,78 = 270.7
78,79 = 270.6
79,80 = 270.6[/code] |
[QUOTE=TheJudger;389158][I]Barrett[/I] is more than just one square operation (for which you counted the ops).[/quote]
When I was looking at the code, which was a while ago, it seemed to me that the square operation was where the most operations were saved with the barrett76 kernel versus the others. Is there a significant difference in the number of operations elsewhere in the barrett76 kernel? I ask only to understand better!
[quote] [B]Not true[/B], but to be fair you've noticed yourself. See below.
You'll see the same performance up to 2[SUP]87[/SUP] and a very minor performance drop to 2[SUP]88[/SUP]. Above 2[SUP]88[/SUP] there will be a bigger drop.
With sieving there will be a constant penalty added to each kernel, so the relative performance difference between those kernels will be a little bit smaller than the [B][U]RAW[/U][/B] speeds suggest.
barrett76, 77 and 79 can do 2[SUP]64[/SUP] to 2[SUP]<upper limit for the kernel>[/SUP] in ONE step, while barrett87, 88 and 92 can only do one bitlevel at once. But above 2[SUP]76[/SUP] I think this is not a real concern. [/QUOTE]
Thanks for the corrections! |
I'm anxious to see similar numbers for AMD VLIW and GCN cards using the soon-to-be-released mfakto.
Looking at Mark's numbers I'm leaning toward removing GPU info from the web form. The 3% speed difference between lowest and highest bit levels isn't worth worrying about. As for LL/TF crossovers that only come into play if one chooses the lowest exponent preference, I'll assume a GTX 770 which does the least TF before LL becomes a more profitable use of the card. For those that are looking to maximize their GHz-days/day, the optional bit level to factor to and exponent range can always be used to get suitable work. |
Hi,
[QUOTE=Mark Rose;389163]When I was looking at the code, which was a while ago, it seemed to me that the square operation was where the most operations were saved with the barrett76 kernel versus the others. Is there a significant difference in the number of operations elsewhere in the barrett76 kernel? I ask only to understand better![/QUOTE]
mfaktc evolution // basic ideas:
[LIST=1]
[*]barrett_92 is "basic barrett" with full 96/192 bit integers
[*]barrett_79 is like barrett_92 with a [B]fixed[/B] (2[SUP]160[/SUP]) value for the scaled integer inverse; this
[LIST]
[*]saves some multiword shifts (compare both kernels) because 2[SUP]160[/SUP] = 2[SUP]5*32[/SUP] is easy to shift on 32bit words
[*]allows multiple bitlevels at once
[*]reduces the upper limit for fixed size integers (compared to barrett_92)
[/LIST]
[*]reduced accuracy in interim steps[SUP]*1[/SUP]
[LIST]
[*]barrett_87/88 are barrett_92 with less accuracy in interim steps
[*]barrett_76/77 are barrett_79 with less accuracy in interim steps
[/LIST]
[/LIST]
[SUP]*1[/SUP]Less accuracy as in: "n mod f == x + <small integer> * f", e.g. 1234 mod 10 = 24 (instead of 4)
Oliver |
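Editor's illustration, not taken from mfaktc: a minimal Python sketch of Barrett reduction with a scaled integer inverse, plus a "sloppy" variant that drops low bits when estimating the quotient and skips the final correction. The function names, the choice of `drop_bits`, and the sample factor are all hypothetical; the real kernels work on fixed-width 32-bit words, which this sketch ignores. The shift `k = 160` is chosen only to echo the 2[SUP]160[/SUP] mentioned in the post.

```python
def barrett_setup(f, k):
    """Scaled integer inverse u = floor(2**k / f)."""
    return (1 << k) // f


def barrett_reduce(n, f, u, k):
    """Exact n mod f, assuming 0 <= n < 2**k."""
    q = (n * u) >> k   # quotient estimate, never larger than n // f
    r = n - q * f      # congruent to n modulo f, and non-negative
    while r >= f:      # a small number of correction steps
        r -= f
    return r


def barrett_reduce_sloppy(n, f, u, k, drop_bits):
    """Reduced-accuracy variant: ignore the low drop_bits of n when
    estimating the quotient and skip the correction loop.

    The result is still congruent to n modulo f, but it may be
    x + <small integer> * f instead of the exact remainder x.
    """
    q = ((n >> drop_bits) * u) >> (k - drop_bits)
    return n - q * f


# Hypothetical 76-bit modulus; not a real factor candidate.
f = (1 << 75) + 12345
k = 160
u = barrett_setup(f, k)
a = f - 7
n = a * a                                   # a squaring step, n < 2**k

exact = barrett_reduce(n, f, u, k)
sloppy = barrett_reduce_sloppy(n, f, u, k, drop_bits=64)
print(exact == n % f, (sloppy - exact) // f)  # second value: the "small integer"
```

The sloppy result can be fed into the next squaring as long as the intermediate values still fit, and only the very last step needs an exact comparison; that is one plausible reading of "less accuracy in interim steps", offered here as an assumption.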
Thank you!
That gives me the insight needed to study the code further :) |
[QUOTE=Prime95;389091]That might be a good idea -- or I could delete the gpu info completely and assume the lowest crossovers since 99% of the time the information will not be used. Or see below for where I might use this info more often...
[/QUOTE] I think using the lowest crossovers makes sense at least until we are reasonably confident that we can factor everything to that level. We are not at that point yet. |
[QUOTE=James Heinrich;389160]Same idea I had, I just did the same kind of benchmark on M99999989 on a GTX 570:
[code]66,67 = 274.5
67,68 = 274.5
68,69 = 278.8
69,70 = 280.2
70,71 = 279.9
71,72 = 280.0
72,73 = 280.0
73,74 = 276.9
74,75 = 276.9
75,76 = 276.9
76,77 = 268.0
77,78 = 270.7
78,79 = 270.6
79,80 = 270.6[/code][/QUOTE]
Same exponent on my R9 285 (AMD GCN)...
[code]
66,67 = 433.9
67,68 = 433.9
68,69 = 433.9
69,70 = 407.6
70,71 = 369.1
71,72 = 369.1
72,73 = 369.1
73,74 = 357.3
74,75 = 329.1
75,76 = 327.7
76,77 = 327.7
~Not much more variation until past 82 bits~
[/code] |
[QUOTE=Prime95;389174]I'm anxious to see similar numbers for AMD VLIW and GCN cards using the soon-to-be-released mfakto.
[/QUOTE]
Here are a few results that I received for the test version. They show the best kernel per bitlevel for a real GPU-sieve run of about 3 seconds per test.
This is VLIW5 (HD6550D from an APU):
[code]
Resulting speed for M66362159:
bit_min - bit_max  GHz-days/day  kernelname
     60 - 64             39.993  cl_barrett15_69_gs
     64 - 76             41.567  cl_barrett32_76_gs
     76 - 77             39.926  cl_barrett32_77_gs
     77 - 87             39.404  cl_barrett32_87_gs
     87 - 88             36.753  cl_barrett32_88_gs
     88 - 92             35.173  cl_barrett32_92_gs
[/code]
This is a first-generation GCN with 1:4 DP (HD7950):
[code]
Resulting speed for M66362159:
bit_min - bit_max  GHz-days/day  kernelname
     60 - 69            499.674  cl_barrett15_69_gs
     69 - 70            476.535  cl_barrett15_71_gs
     70 - 73            427.422  cl_barrett15_73_gs
     73 - 74            412.430  cl_barrett15_74_gs
     74 - 82            378.749  cl_barrett15_82_gs
     82 - 83            354.878  cl_barrett15_83_gs
     83 - 87            330.658  cl_barrett32_87_gs
     87 - 88            327.284  cl_barrett15_88_gs
     88 - 92            289.456  cl_barrett32_92_gs
[/code]
This is a newer GCN with 1:16 DP (R285):
[code]
Resulting speed for M66362159:
bit_min - bit_max  GHz-days/day  kernelname
     60 - 69            475.043  cl_barrett15_69_gs
     69 - 70            443.636  cl_barrett15_71_gs
     70 - 73            402.419  cl_barrett15_73_gs
     73 - 74            389.251  cl_barrett15_74_gs
     74 - 82            334.707  cl_barrett15_82_gs
     82 - 83            313.099  cl_barrett15_83_gs
     83 - 87            294.592  cl_barrett32_87_gs
     87 - 88            291.010  cl_barrett15_88_gs
     88 - 92            258.739  cl_barrett32_92_gs
[/code]
This is the new top-level GCN with improved int32 math (R290x):
[code]
Resulting speed for M66362159:
bit_min - bit_max  GHz-days/day  kernelname
     60 - 69            757.628  cl_barrett15_69_gs
     69 - 76            749.778  cl_barrett32_76_gs
     76 - 77            720.362  cl_barrett32_77_gs
     77 - 79            645.730  cl_barrett32_79_gs
     79 - 87            642.553  cl_barrett32_87_gs
     87 - 88            614.766  cl_barrett32_88_gs
     88 - 92            565.309  cl_barrett32_92_gs
[/code]
And finally, this is Intel HD4600:
[code]
Resulting speed for M66362159:
bit_min - bit_max  GHz-days/day  kernelname
     60 - 64             15.081  cl_barrett15_69_gs
     64 - 76             19.707  cl_barrett32_76_gs
     76 - 77             19.345  cl_barrett32_77_gs
     77 - 87             17.208  cl_barrett32_87_gs
     87 - 88             16.816  cl_barrett32_88_gs
     88 - 92             14.507  cl_barrett32_92_gs
[/code]
The exponent size also has some influence on the performance, for example R290x:
[code]
M2000093:     69 - 76  942.631  cl_barrett32_76_gs
M39000037:    69 - 76  749.779  cl_barrett32_76_gs
M66362159:    69 - 76  749.778  cl_barrett32_76_gs
M74000077:    69 - 76  721.255  cl_barrett32_76_gs
M78000071:    69 - 76  720.259  cl_barrett32_76_gs
M332900047:   69 - 76  667.219  cl_barrett32_76_gs
M999900079:   69 - 76  645.028  cl_barrett32_76_gs
M2001862367:  64 - 76  621.290  cl_barrett32_76_gs
M4201971233:  69 - 76  602.682  cl_barrett32_76_gs
[/code] |
1 Attachment(s)
[QUOTE=Bdot;389225]The exponent size also has some influence on the performance, for example R290x[/QUOTE]
I do take this into account by assuming the first 7 bits of the exponent are "free", i.e. multiplying the TF cost by (ceil(log2(exponent)) - 7). The full SQL code as it currently stands is attached. |
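Editor's illustration: the attached SQL is not reproduced here, so this is only a small Python sketch of the stated factor (ceil(log2(exponent)) - 7). The function name `tf_cost_factor` and the `free_bits` parameter are hypothetical; the sketch uses integer bit lengths rather than floating-point `log2` to avoid rounding surprises.

```python
def tf_cost_factor(exponent, free_bits=7):
    """Return ceil(log2(exponent)) - free_bits using integer arithmetic.

    For exponent >= 2, (exponent - 1).bit_length() equals ceil(log2(exponent)).
    """
    return (exponent - 1).bit_length() - free_bits


# Exponents taken from the R290x list quoted in this thread.
for p in (39000037, 66362159, 74000077, 2001862367):
    print(p, tf_cost_factor(p))
```

Under this model M39000037 and M66362159 get the same factor, since both exponents are 26 bits long, while M74000077 moves up one step. That at least matches where the plateaus fall in the R290x figures, though the factor itself is a cost model rather than a measured speed.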