mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU to 72 (https://www.mersenneforum.org/forumdisplay.php?f=95)
-   -   GPU to 72 status... (https://www.mersenneforum.org/showthread.php?t=16263)

TheJudger 2014-12-04 17:55

[QUOTE=Mark Rose;389073]How much of it is the GHz-d calculation and how much of it is extra math? I haven't looked very much at mfakto's code. I'm curious. mfaktc's barrett76 kernel needs only 5 32-bit ints and 9 multiplies for a 76 bit x 76 bit product, but the barrett77 kernel requires 6 ints and 12 multiplies for 77 bit x 77 bit product. There's about a 20% drop in performance going from 76 to 77 bits, before taking into account the GHz-d formula penalty for higher bit levels.[/QUOTE]

[I]Barrett[/I] is more than just one square operation (for which you counted the ops).
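For context, the per-square multiply counts quoted above come from multi-word arithmetic. A minimal sketch of schoolbook squaring (plain Python, not mfaktc's actual CUDA code; the kernel counts differ because the kernels also handle carries and drop some partial products) shows why fewer 32-bit words mean fewer multiplies:

```python
# Illustrative only: schoolbook squaring of a multi-word integer,
# counting the 32x32-bit multiplies used.

WORD = 32
MASK = (1 << WORD) - 1

def to_words(n, count):
    """Split n into `count` little-endian 32-bit words."""
    return [(n >> (WORD * i)) & MASK for i in range(count)]

def square_words(words):
    """Return (n*n, number of 32x32-bit multiplies used).
    Symmetric partial products a_i*a_j (i != j) are computed once
    and doubled -- the classic squaring optimization."""
    res, muls = 0, 0
    for i in range(len(words)):
        for j in range(i, len(words)):
            p = words[i] * words[j]
            muls += 1
            if i != j:
                p *= 2
            res += p << (WORD * (i + j))
    return res, muls

x = (1 << 75) + 12345              # a 76-bit candidate fits in 3 words
res, muls = square_words(to_words(x, 3))
assert res == x * x and muls == 6  # 4 words would need 10 multiplies
```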

[QUOTE=Mark Rose;389079]There's a big ~20% performance hit beyond 76 bits for all Nvidia cards.[/QUOTE]

[B]Not true[/B], but to be fair, you've noticed it yourself. See below.

[QUOTE=Mark Rose;389148]Trial factoring anything up to 76 bits is fast with mfaktc. Trial factoring 77 bits is slower. Here are some GHz-d/day numbers for M39467291 on a GTX 580 (at 1544MHz):

69,70: 426.02
70,71: 426.02
71,72: 424.85
72,73: 424.52
73,74: 424.32
74,75: 424.56
75,76: 424.39
76,77: 423.28
77,78: 414.38 // okay, not as bad as I remembered! The 20% I remembered was from [url=http://mersenneforum.org/showpost.php?p=306572&postcount=1824]this post[/url]. The new barrett76 kernel is only usable for 76 bits (77 overflows), and so a less efficient kernel must be used.
78,79: 414.24
79,80: 414.32

I don't have time to do more benchmarking at the moment.[/QUOTE]
You'll see the same performance up to 2[SUP]87[/SUP] and a very minor performance drop at 2[SUP]88[/SUP]. Above 2[SUP]88[/SUP] there will be a bigger drop.

[B][U]RAW[/U][/B] kernel benchmarks (million FCs per second without sieve):
[CODE]GeForce GT 440 (CC 2.1)
mfaktc 0.21-pre4 // 319.60 + CUDA 5.5
./mfaktc.exe -tf 66362159 66 67

71bit_mul24 29.23M/s
75bit_mul32 42.22M/s
95bit_mul32 33.16M/s
barrett76_mul32 79.23M/s
barrett77_mul32 74.94M/s
barrett79_mul32 64.18M/s
barrett87_mul32 75.51M/s
barrett88_mul32 75.46M/s
barrett92_mul32 61.93M/s

-------------------------------------

Tesla K20m (CC 3.5)
mfaktc 0.21-pre4 // 331.20 + CUDA 5.5
./mfaktc.exe -tf 66362159 68 69

71bit_mul24 160.51M/s
75bit_mul32 200.32M/s
95bit_mul32 155.13M/s
barrett76_mul32 392.31M/s
barrett77_mul32 367.17M/s
barrett79_mul32 314.82M/s
barrett87_mul32 368.01M/s (without funnel-shift 357.09M/s)
barrett88_mul32 367.45M/s (without funnel-shift 347.80M/s)
barrett92_mul32 306.60M/s (without funnel-shift 293.69M/s)

-------------------------------------

GeForce GTX 275 (CC 1.3)
mfaktc 0.21-pre5 // 319.37 + CUDA 5.5
./mfaktc.exe -tf 66362159 66 67

71bit_mul24 77.64M/s
75bit_mul32 62.59M/s
95bit_mul32 50.34M/s
barrett76_mul32 85.83M/s
barrett77_mul32 82.48M/s
barrett79_mul32 73.56M/s
barrett87_mul32 75.93M/s
barrett88_mul32 75.41M/s
barrett92_mul32 65.80M/s[/CODE]

With sieving, a constant penalty is added to each kernel, so the relative performance difference between those kernels will be a little smaller than the [B][U]RAW[/U][/B] speeds suggest.

barrett76, 77 and 79 can do 2[SUP]64[/SUP] to 2[SUP]<upper limit for the kernel>[/SUP] in ONE step, while barrett87, 88 and 92 can only do one bitlevel at a time. But above 2[SUP]76[/SUP] I think this is not a real concern.
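For readers new to trial factoring: what each of these kernels ultimately computes is a modular exponentiation test. A minimal sketch (plain Python; Python's built-in pow() stands in for the multi-word square-and-Barrett-reduce loop the CUDA kernels implement):

```python
# A candidate f divides the Mersenne number M_p = 2^p - 1 exactly
# when 2^p ≡ 1 (mod f). The kernels batch millions of such tests.

def is_factor(p, f):
    """True if f divides 2^p - 1."""
    return pow(2, p, f) == 1

# Known small example: 2^11 - 1 = 2047 = 23 * 89
assert is_factor(11, 23)
assert is_factor(11, 89)
assert not is_factor(11, 29)
```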

Oliver

petrw1 2014-12-04 17:56

So if I read the GPU72 report correctly...and if hypothetically we maintain the current DC-TF rate...then DC-TF would be a thing of the past before the end of 2015...

James Heinrich 2014-12-04 18:00

[QUOTE=Mark Rose;389148]Here are some GHz-d/day numbers for M39467291 on a GTX 580 (at 1544MHz):[/QUOTE]Same idea I had, I just did the same kind of benchmark on M99999989 on a GTX 570:[code]66,67 = 274.5
67,68 = 274.5
68,69 = 278.8
69,70 = 280.2
70,71 = 279.9
71,72 = 280.0
72,73 = 280.0
73,74 = 276.9
74,75 = 276.9
75,76 = 276.9
76,77 = 268.0
77,78 = 270.7
78,79 = 270.6
79,80 = 270.6[/code]

Mark Rose 2014-12-04 18:10

[QUOTE=TheJudger;389158][I]Barrett[/I] is more than just one square operation (for which you counted the ops).[/quote]

When I was looking at the code, which was a while ago, it seemed to me that the square operation was where the most operations were saved with the barrett76 kernel versus the others. Is there a significant difference in the number of operations elsewhere in the barrett76 kernel? I ask only to understand better!

[quote]
[B]Not true[/B], but to be fair, you've noticed it yourself. See below.

You'll see the same performance up to 2[SUP]87[/SUP] and a very minor performance drop at 2[SUP]88[/SUP]. Above 2[SUP]88[/SUP] there will be a bigger drop.
With sieving, a constant penalty is added to each kernel, so the relative performance difference between those kernels will be a little smaller than the [B][U]RAW[/U][/B] speeds suggest.

barrett76, 77 and 79 can do 2[SUP]64[/SUP] to 2[SUP]<upper limit for the kernel>[/SUP] in ONE step, while barrett87, 88 and 92 can only do one bitlevel at a time. But above 2[SUP]76[/SUP] I think this is not a real concern.
[/QUOTE]

Thanks for the corrections!

Prime95 2014-12-04 19:19

I'm anxious to see similar numbers for AMD VLIW and GCN cards using the soon-to-be-released mfakto.

Looking at Mark's numbers, I'm leaning toward removing GPU info from the web form. The 3% speed difference between the lowest and highest bit levels isn't worth worrying about. As for LL/TF crossovers, which only come into play if one chooses the lowest-exponent preference, I'll assume a GTX 770, which does the least TF before LL becomes a more profitable use of the card.

For those that are looking to maximize their GHz-days/day, the optional bit level to factor to and exponent range can always be used to get suitable work.

TheJudger 2014-12-04 20:15

Hi,

[QUOTE=Mark Rose;389163]When I was looking at the code, which was a while ago, it seemed to me that the square operation was where the most operations were saved with the barrett76 kernel versus the others. Is there a significant difference in the number of operations elsewhere in the barrett76 kernel? I ask only to understand better![/QUOTE]

mfaktc evolution // basic ideas:[LIST=1][*]barrett_92 is "basic barrett" with full 96/192-bit integers[*]barrett_79 is like barrett_92 with a [B]fixed[/B] (2[SUP]160[/SUP]) value for the scaled integer inverse; this[LIST][*]saves some multiword shifts (compare both kernels) because 2[SUP]160[/SUP] = 2[SUP]5*32[/SUP] is easy to shift on 32-bit words[*]allows multiple bitlevels at once[*]reduces the upper limit for fixed-size integers (compared to barrett_92)[/LIST][*]reduced accuracy in interim steps[SUP]*1[/SUP][LIST][*]barrett_87/88 are barrett_92 with less accuracy in interim steps[*]barrett_76/77 are barrett_79 with less accuracy in interim steps[/LIST][/LIST]
[SUP]*1[/SUP]Less accuracy as in: "n mod f == x + <small integer> * f", e.g. a reduced-accuracy 1234 mod 10 may yield 24 (instead of the exact 4)
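The evolution above can be illustrated with a bare-bones Barrett reduction in Python (an assumption-laden sketch, not the kernel code: a single-precision model instead of 96/192-bit multi-word arithmetic; barrett_79's extra trick is fixing k = 160 so the shift lands on 32-bit word boundaries):

```python
# Classic Barrett reduction: replace division by f with one
# precomputed "scaled inverse" m = floor(2^k / f), a multiply and a
# shift. The result is congruent to n mod f but may be off by a few
# multiples of f -- exactly the "reduced accuracy" trade-off above,
# harmless until the final comparison.

def barrett_reduce(n, f, k):
    m = (1 << k) // f      # scaled integer inverse, precomputed once
    q = (n * m) >> k       # cheap estimate of n // f (may be low)
    return n - q * f       # == n mod f, plus possibly a few extra f

f = 1000003
n = 123456789012345
r = barrett_reduce(n, f, 64)
assert r % f == n % f      # congruent mod f
assert 0 <= r < 4 * f      # off by at most a few multiples of f
```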

Oliver

Mark Rose 2014-12-04 21:44

Thank you!

That gives me the insight needed to study the code further :)

garo 2014-12-04 22:12

[QUOTE=Prime95;389091]That might be a good idea -- or I could delete the gpu info completely and assume the lowest crossovers since 99% of the time the information will not be used. Or see below for where I might use this info more often...
[/QUOTE]

I think using the lowest crossovers makes sense at least until we are reasonably confident that we can factor everything to that level. We are not at that point yet.

kracker 2014-12-04 23:40

[QUOTE=James Heinrich;389160]Same idea I had, I just did the same kind of benchmark on M99999989 on a GTX 570:[code]66,67 = 274.5
67,68 = 274.5
68,69 = 278.8
69,70 = 280.2
70,71 = 279.9
71,72 = 280.0
72,73 = 280.0
73,74 = 276.9
74,75 = 276.9
75,76 = 276.9
76,77 = 268.0
77,78 = 270.7
78,79 = 270.6
79,80 = 270.6[/code][/QUOTE]

Same exponent on my R9 285 (AMD GCN)...
[code]
66,67 = 433.9
67,68 = 433.9
68,69 = 433.9
69,70 = 407.6
70,71 = 369.1
71,72 = 369.1
72,73 = 369.1
73,74 = 357.3
74,75 = 329.1
75,76 = 327.7
76,77 = 327.7
~Not much more variation until past 82 bits~
[/code]

Bdot 2014-12-04 23:51

[QUOTE=Prime95;389174]I'm anxious to see similar numbers for AMD VLIW and GCN cards using the soon-to-be-released mfakto.
[/QUOTE]
Here are a few results that I received for the test version. They show the best kernel per bitlevel for a real GPU-sieve run of about 3 seconds per test.
This is VLIW5 (HD6550D from an APU):
[code]
Resulting speed for M66362159:
bit_min - bit_max GHz-days/day kernelname
60 - 64 39.993 cl_barrett15_69_gs
64 - 76 41.567 cl_barrett32_76_gs
76 - 77 39.926 cl_barrett32_77_gs
77 - 87 39.404 cl_barrett32_87_gs
87 - 88 36.753 cl_barrett32_88_gs
88 - 92 35.173 cl_barrett32_92_gs
[/code]This is a first-generation GCN with 1:4 DP (HD7950):
[code]
Resulting speed for M66362159:
bit_min - bit_max GHz-days/day kernelname
60 - 69 499.674 cl_barrett15_69_gs
69 - 70 476.535 cl_barrett15_71_gs
70 - 73 427.422 cl_barrett15_73_gs
73 - 74 412.430 cl_barrett15_74_gs
74 - 82 378.749 cl_barrett15_82_gs
82 - 83 354.878 cl_barrett15_83_gs
83 - 87 330.658 cl_barrett32_87_gs
87 - 88 327.284 cl_barrett15_88_gs
88 - 92 289.456 cl_barrett32_92_gs
[/code]This is a newer GCN with 1:16 DP (R285):
[code]
Resulting speed for M66362159:
bit_min - bit_max GHz-days/day kernelname
60 - 69 475.043 cl_barrett15_69_gs
69 - 70 443.636 cl_barrett15_71_gs
70 - 73 402.419 cl_barrett15_73_gs
73 - 74 389.251 cl_barrett15_74_gs
74 - 82 334.707 cl_barrett15_82_gs
82 - 83 313.099 cl_barrett15_83_gs
83 - 87 294.592 cl_barrett32_87_gs
87 - 88 291.010 cl_barrett15_88_gs
88 - 92 258.739 cl_barrett32_92_gs
[/code]This is the new top-level GCN with improved int32 math (R290x):
[code]
Resulting speed for M66362159:
bit_min - bit_max GHz-days/day kernelname
60 - 69 757.628 cl_barrett15_69_gs
69 - 76 749.778 cl_barrett32_76_gs
76 - 77 720.362 cl_barrett32_77_gs
77 - 79 645.730 cl_barrett32_79_gs
79 - 87 642.553 cl_barrett32_87_gs
87 - 88 614.766 cl_barrett32_88_gs
88 - 92 565.309 cl_barrett32_92_gs
[/code]And finally, this is Intel HD4600:
[code]
Resulting speed for M66362159:
bit_min - bit_max GHz-days/day kernelname
60 - 64 15.081 cl_barrett15_69_gs
64 - 76 19.707 cl_barrett32_76_gs
76 - 77 19.345 cl_barrett32_77_gs
77 - 87 17.208 cl_barrett32_87_gs
87 - 88 16.816 cl_barrett32_88_gs
88 - 92 14.507 cl_barrett32_92_gs
[/code]The exponent size also has some influence on performance, for example on the R290x:
[code]
M2000093: 69 - 76 942.631 cl_barrett32_76_gs
M39000037: 69 - 76 749.779 cl_barrett32_76_gs
M66362159: 69 - 76 749.778 cl_barrett32_76_gs
M74000077: 69 - 76 721.255 cl_barrett32_76_gs
M78000071: 69 - 76 720.259 cl_barrett32_76_gs
M332900047: 69 - 76 667.219 cl_barrett32_76_gs
M999900079: 69 - 76 645.028 cl_barrett32_76_gs
M2001862367: 64 - 76 621.290 cl_barrett32_76_gs
M4201971233: 69 - 76 602.682 cl_barrett32_76_gs
[/code]

Prime95 2014-12-05 00:24

1 Attachment(s)
[QUOTE=Bdot;389225]The exponent size also has some influence on the performance, for example R290x[/QUOTE]

I do take this into account by assuming the first 7 bits of the exponent are "free" -- multiplying the TF cost by (ceil(log2(exponent)) - 7).
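A hypothetical one-liner version of that adjustment (illustrative Python only, not the attached SQL):

```python
# Sketch of the stated cost model: the first 7 bits of the exponent
# are treated as "free", so the relative TF cost scales with
# ceil(log2(exponent)) - 7.
import math

def tf_cost_multiplier(exponent):
    return math.ceil(math.log2(exponent)) - 7

# M66362159 sits between 2^25 and 2^26, so its multiplier is 26 - 7
assert tf_cost_multiplier(66362159) == 19
```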

The full SQL code currently in use is attached.

