mersenneforum.org  

mersenneforum.org > Great Internet Mersenne Prime Search > PrimeNet > GPU to 72

Old 2014-12-04, 17:55   #3279
TheJudger
 
"Oliver"
Mar 2005
Germany

11×101 Posts

Quote:
Originally Posted by Mark Rose View Post
How much of it is the GHz-d calculation and how much of it is extra math? I haven't looked very much at mfakto's code. I'm curious. mfaktc's barrett76 kernel needs only 5 32-bit ints and 9 multiplies for a 76 bit x 76 bit product, but the barrett77 kernel requires 6 ints and 12 multiplies for 77 bit x 77 bit product. There's about a 20% drop in performance going from 76 to 77 bits, before taking into account the GHz-d formula penalty for higher bit levels.
Barrett is more than just one square operation (for which you counted the ops).

Quote:
Originally Posted by Mark Rose View Post
There's a big ~20% performance hit beyond 76 bits for all Nvidia cards.
Not true, but to be fair you've noticed yourself. See below.

Quote:
Originally Posted by Mark Rose View Post
Trial factoring anything up to 76 bits is fast with mfaktc. Trial factoring 77 bits is slower. Here are some GHz-d/day numbers for M39467291 on a GTX 580 (at 1544MHz):

69,70: 426.02
70,71: 426.02
71,72: 424.85
72,73: 424.52
73,74: 424.32
74,75: 424.56
75,76: 424.39
76,77: 423.28
77,78: 414.38 // okay, not as bad as I remembered! The 20% I remembered was from this post. The new barrett76 kernel is only usable for 76 bits (77 overflows), and so a less efficient kernel must be used.
78,79: 414.24
79,80: 414.32

I don't have time to do more benchmarking at the moment.
You'll see the same performance up to 2^87 and a very minor performance drop to 2^88. Above 2^88 there will be a bigger drop.

RAW kernel benchmarks (million FCs per second without sieve):
Code:
GeForce GT 440 (CC 2.1)
mfaktc 0.21-pre4 // 319.60 + CUDA 5.5
./mfaktc.exe -tf 66362159 66 67

71bit_mul24 29.23M/s
75bit_mul32 42.22M/s
95bit_mul32 33.16M/s
barrett76_mul32 79.23M/s
barrett77_mul32 74.94M/s
barrett79_mul32 64.18M/s
barrett87_mul32 75.51M/s
barrett88_mul32 75.46M/s
barrett92_mul32 61.93M/s

-------------------------------------

Tesla K20m (CC 3.5)
mfaktc 0.21-pre4 // 331.20 + CUDA 5.5
./mfaktc.exe -tf 66362159 68 69

71bit_mul24 160.51M/s
75bit_mul32 200.32M/s
95bit_mul32 155.13M/s
barrett76_mul32 392.31M/s
barrett77_mul32 367.17M/s
barrett79_mul32 314.82M/s
barrett87_mul32 368.01M/s (without funnel-shift 357.09M/s)
barrett88_mul32 367.45M/s (without funnel-shift 347.80M/s)
barrett92_mul32 306.60M/s (without funnel-shift 293.69M/s)

-------------------------------------

GeForce GTX 275 (CC 1.3)
mfaktc 0.21-pre5 // 319.37 + CUDA 5.5
./mfaktc.exe -tf 66362159 66 67

71bit_mul24 77.64M/s
75bit_mul32 62.59M/s
95bit_mul32 50.34M/s
barrett76_mul32 85.83M/s
barrett77_mul32 82.48M/s
barrett79_mul32 73.56M/s
barrett87_mul32 75.93M/s
barrett88_mul32 75.41M/s
barrett92_mul32 65.80M/s
With sieving there will be a constant penalty added to each kernel, so the relative performance difference between those kernels will be a little bit smaller than the RAW speeds suggest.

barrett76, 77 and 79 can do 2^64 to 2^<upper limit for the kernel> in ONE step, while barrett87, 88 and 92 can only do one bitlevel at once. But above 2^76 I think this is not a real concern.
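
As a rough check on the RAW numbers above, the slowdown of each kernel relative to barrett76 can be computed directly from the listed M/s figures. This is only illustrative arithmetic in Python; the dictionary layout and the helper name drop_vs_barrett76 are made up for this example and are not part of mfaktc.

```python
# RAW kernel speeds in M/s, copied from the benchmark listing above.
RAW = {
    "GeForce GT 440": {"barrett76": 79.23, "barrett77": 74.94, "barrett79": 64.18,
                       "barrett87": 75.51, "barrett88": 75.46, "barrett92": 61.93},
    "Tesla K20m": {"barrett76": 392.31, "barrett77": 367.17, "barrett79": 314.82,
                   "barrett87": 368.01, "barrett88": 367.45, "barrett92": 306.60},
    "GeForce GTX 275": {"barrett76": 85.83, "barrett77": 82.48, "barrett79": 73.56,
                        "barrett87": 75.93, "barrett88": 75.41, "barrett92": 65.80},
}


def drop_vs_barrett76(speeds):
    """Percentage slowdown of every kernel relative to barrett76 (hypothetical helper)."""
    base = speeds["barrett76"]
    return {name: 100.0 * (1.0 - rate / base) for name, rate in speeds.items()}


for gpu, speeds in RAW.items():
    drops = drop_vs_barrett76(speeds)
    # One line per GPU, slowdown in percent rounded to one decimal place.
    print(gpu, {name: round(pct, 1) for name, pct in drops.items()})
```

On these three cards the step from barrett76 to barrett77 costs a single-digit percentage of RAW speed, well short of 20%.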

Oliver
Old 2014-12-04, 17:56   #3280
petrw1
1976 Toyota Corona years forever!
 
"Wayne"
Nov 2006
Saskatchewan, Canada

11124₈ Posts

So if I read the GPU72 report correctly...and if hypothetically we maintain the current DC-TF rate...then DC-TF would be a thing of the past before the end of 2015...
Old 2014-12-04, 18:00   #3281
James Heinrich
 
"James Heinrich"
May 2004
ex-Northern Ontario

11·311 Posts

Quote:
Originally Posted by Mark Rose View Post
Here are some GHz-d/day numbers for M39467291 on a GTX 580 (at 1544MHz):
I had the same idea; I just ran the same kind of benchmark on M99999989 on a GTX 570:
Code:
66,67 = 274.5
67,68 = 274.5
68,69 = 278.8
69,70 = 280.2
70,71 = 279.9
71,72 = 280.0
72,73 = 280.0
73,74 = 276.9
74,75 = 276.9
75,76 = 276.9
76,77 = 268.0
77,78 = 270.7
78,79 = 270.6
79,80 = 270.6
Old 2014-12-04, 18:10   #3282
Mark Rose
 
"/X\(‘-‘)/X\"
Jan 2013

2×5×293 Posts

Quote:
Originally Posted by TheJudger View Post
Barrett is more than just one square operation (for which you counted the ops).
When I was looking at the code, which was a while ago, it seemed to me that the square operation was where the most operations were saved with the barrett76 kernel versus the others. Is there a significant difference in the number of operations elsewhere in the barrett76 kernel? I ask only to understand better!

Quote:
Not true, but to be fair you've noticed yourself. See below.

You'll see the same performance up to 2^87 and a very minor performance drop to 2^88. Above 2^88 there will be a bigger drop.
With sieving there will be a constant penalty added to each kernel, so the relative performance difference between those kernels will be a little bit smaller than the RAW speeds suggest.

barrett76, 77 and 79 can do 2^64 to 2^<upper limit for the kernel> in ONE step, while barrett87, 88 and 92 can only do one bitlevel at once. But above 2^76 I think this is not a real concern.
Thanks for the corrections!
Old 2014-12-04, 19:19   #3283
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

16561₈ Posts

I'm anxious to see similar numbers for AMD VLIW and GCN cards using the soon-to-be-released mfakto.

Looking at Mark's numbers, I'm leaning toward removing GPU info from the web form. The 3% speed difference between lowest and highest bit levels isn't worth worrying about. As for LL/TF crossovers, which only come into play if one chooses the lowest exponent preference, I'll assume a GTX 770, which does the least TF before LL becomes a more profitable use of the card.

For those that are looking to maximize their GHz-days/day, the optional bit level to factor to and exponent range can always be used to get suitable work.
Old 2014-12-04, 20:15   #3284
TheJudger
 
"Oliver"
Mar 2005
Germany

1111₁₀ Posts

Hi,

Quote:
Originally Posted by Mark Rose View Post
When I was looking at the code, which was a while ago, it seemed to me that the square operation was where the most operations were saved with the barrett76 kernel versus the others. Is there a significant difference in the number of operations elsewhere in the barrett76 kernel? I ask only to understand better!
mfaktc evolution // basic ideas:
  1. barrett_92 is "basic barrett" with full 96/192 bit integers
  2. barrett_79 is like barrett_92 with a fixed (2^160) value for the scaled integer inverse, this
    • saves some multiword shifts (compare both kernels) because 2^160 = 2^(5*32) is easy to shift on 32-bit words
    • allows multiple bitlevels at once
    • reduces the upper limit for fixed size integers (compared to barrett_92)
  3. reduced accuracy in interim steps (*1)
    • barrett_87/88 are barrett_92 with less accuracy in interim steps
    • barrett_76/77 are barrett_79 with less accuracy in interim steps

(*1) Less accuracy as in: "n mod f == x + <small integer> * f", e.g. 1234 mod 10 = 24 (instead of 4)
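
To make the footnote concrete, here is a minimal Python sketch of a Barrett-style reduction whose interim results are only guaranteed to be the true remainder plus a small multiple of f. It is not mfaktc's code: the function names, the drop parameter, and the choice of truncating the low bits of x are assumptions made for illustration. Only the fixed 2^160 scaling is taken from the list above.

```python
def barrett_loose(x, f, mu, k, drop):
    """Return r with r == (x mod f) + <small integer> * f.

    mu is floor(2**k / f). Truncating the low `drop` bits of x can only make
    the estimated quotient q smaller, never larger than floor(x / f), so r
    stays non-negative. Assumes 0 <= drop <= k.
    """
    q = ((x >> drop) * mu) >> (k - drop)
    return x - q * f


def divides_mersenne(p, f, k=160):
    """True if f divides 2**p - 1 (hypothetical helper, assumes 1 < f < 2**76)."""
    mu = (1 << k) // f            # scaled integer inverse with fixed 2**160
    drop = f.bit_length() - 1     # assumed truncation; keeps 2**drop <= f
    r = 1
    for bit in bin(p)[2:]:        # left-to-right square and multiply
        r = r * r
        if bit == "1":
            r = r * 2
        r = barrett_loose(r, f, mu, k, drop)   # interim result, maybe >= f
    return r % f == 1             # one exact reduction at the very end


# 2**11 - 1 = 2047 = 23 * 89, so both 23 and 89 divide M11.
print(divides_mersenne(11, 23), divides_mersenne(11, 89))
```

The interim values are allowed to stay above f because every step preserves the residue class; only the final comparison needs the exact remainder.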

Oliver
Old 2014-12-04, 21:44   #3285
Mark Rose
 
"/X\(‘-‘)/X\"
Jan 2013

2×5×293 Posts

Thank you!

That gives me the insight needed to study the code further :)
Old 2014-12-04, 22:12   #3286
garo
 
Aug 2002
Termonfeckin, IE

101011001100₂ Posts

Quote:
Originally Posted by Prime95 View Post
That might be a good idea -- or I could delete the gpu info completely and assume the lowest crossovers since 99% of the time the information will not be used. Or see below for where I might use this info more often...
I think using the lowest crossovers makes sense at least until we are reasonably confident that we can factor everything to that level. We are not at that point yet.
Old 2014-12-04, 23:40   #3287
kracker
 
"Mr. Meeseeks"
Jan 2012
California, USA

23·271 Posts

Quote:
Originally Posted by James Heinrich View Post
I had the same idea; I just ran the same kind of benchmark on M99999989 on a GTX 570:
Code:
66,67 = 274.5
67,68 = 274.5
68,69 = 278.8
69,70 = 280.2
70,71 = 279.9
71,72 = 280.0
72,73 = 280.0
73,74 = 276.9
74,75 = 276.9
75,76 = 276.9
76,77 = 268.0
77,78 = 270.7
78,79 = 270.6
79,80 = 270.6
Same exponent on my R9 285 (AMD GCN)...
Code:
66,67 = 433.9
67,68 = 433.9
68,69 = 433.9
69,70 = 407.6
70,71 = 369.1
71,72 = 369.1
72,73 = 369.1
73,74 = 357.3
74,75 = 329.1
75,76 = 327.7
76,77 = 327.7
~Not much more variation until past 82 bits~
Old 2014-12-04, 23:51   #3288
Bdot
 
Nov 2010
Germany

3×199 Posts

Quote:
Originally Posted by Prime95 View Post
I'm anxious to see similar numbers for AMD VLIW and GCN cards using the soon-to-be-released mfakto.
Here are a few results that I received for the test version. They show the best kernel per bitlevel for a real GPU-sieve run of about 3 seconds per test.
This is VLIW5 (HD6550D from an APU):
Code:
Resulting speed for M66362159:
bit_min - bit_max  GHz-days/day  kernelname
     60 -      64        39.993  cl_barrett15_69_gs  
     64 -      76        41.567  cl_barrett32_76_gs  
     76 -      77        39.926  cl_barrett32_77_gs  
     77 -      87        39.404  cl_barrett32_87_gs  
     87 -      88        36.753  cl_barrett32_88_gs  
     88 -      92        35.173  cl_barrett32_92_gs
This is a first-generation GCN with 1:4 DP (HD7950):
Code:
Resulting speed for M66362159:
bit_min - bit_max  GHz-days/day  kernelname
     60 -      69       499.674  cl_barrett15_69_gs  
     69 -      70       476.535  cl_barrett15_71_gs  
     70 -      73       427.422  cl_barrett15_73_gs  
     73 -      74       412.430  cl_barrett15_74_gs  
     74 -      82       378.749  cl_barrett15_82_gs  
     82 -      83       354.878  cl_barrett15_83_gs  
     83 -      87       330.658  cl_barrett32_87_gs  
     87 -      88       327.284  cl_barrett15_88_gs  
     88 -      92       289.456  cl_barrett32_92_gs
This is a newer GCN with 1:16 DP (R285):
Code:
Resulting speed for M66362159:
bit_min - bit_max  GHz-days/day  kernelname
     60 -      69       475.043  cl_barrett15_69_gs  
     69 -      70       443.636  cl_barrett15_71_gs  
     70 -      73       402.419  cl_barrett15_73_gs  
     73 -      74       389.251  cl_barrett15_74_gs  
     74 -      82       334.707  cl_barrett15_82_gs  
     82 -      83       313.099  cl_barrett15_83_gs  
     83 -      87       294.592  cl_barrett32_87_gs  
     87 -      88       291.010  cl_barrett15_88_gs  
     88 -      92       258.739  cl_barrett32_92_gs
This is the new top-level GCN with improved int32 math (R290x):
Code:
Resulting speed for M66362159:
bit_min - bit_max  GHz-days/day  kernelname
     60 -      69       757.628  cl_barrett15_69_gs  
     69 -      76       749.778  cl_barrett32_76_gs  
     76 -      77       720.362  cl_barrett32_77_gs  
     77 -      79       645.730  cl_barrett32_79_gs  
     79 -      87       642.553  cl_barrett32_87_gs  
     87 -      88       614.766  cl_barrett32_88_gs  
     88 -      92       565.309  cl_barrett32_92_gs
And finally, this is Intel HD4600:
Code:
Resulting speed for M66362159:
bit_min - bit_max  GHz-days/day  kernelname
     60 -      64        15.081  cl_barrett15_69_gs  
     64 -      76        19.707  cl_barrett32_76_gs  
     76 -      77        19.345  cl_barrett32_77_gs  
     77 -      87        17.208  cl_barrett32_87_gs  
     87 -      88        16.816  cl_barrett32_88_gs  
     88 -      92        14.507  cl_barrett32_92_gs
The exponent size also has some influence on the performance, for example R290x:
Code:
M2000093:    69 -      76       942.631  cl_barrett32_76_gs  
M39000037:   69 -      76       749.779  cl_barrett32_76_gs
M66362159:   69 -      76       749.778  cl_barrett32_76_gs 
M74000077:   69 -      76       721.255  cl_barrett32_76_gs  
M78000071:   69 -      76       720.259  cl_barrett32_76_gs  
M332900047:  69 -      76       667.219  cl_barrett32_76_gs  
M999900079:  69 -      76       645.028  cl_barrett32_76_gs
M2001862367: 64 -      76       621.290  cl_barrett32_76_gs
M4201971233: 69 -      76       602.682  cl_barrett32_76_gs
Old 2014-12-05, 00:24   #3289
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

7,537 Posts

Quote:
Originally Posted by Bdot View Post
The exponent size also has some influence on the performance, for example R290x
I do take this into account by assuming the first 7 bits of the exponent are "free" -- multiplying the TF cost by (ceil (log2 (exponent)) - 7)
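
The exponent-size adjustment described here can be written out as a small Python sketch. It only reproduces the multiplier (ceil(log2(exponent)) - 7); the function name is hypothetical, and the attached SQL is not reproduced or assumed to look like this. The exponents are the ones from Bdot's R290x table.

```python
def exponent_cost_factor(exponent):
    """Multiplier applied to the TF cost: ceil(log2(exponent)) - 7."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1 and avoids
    # floating-point rounding near powers of two.
    return (exponent - 1).bit_length() - 7


for p in (2000093, 39000037, 66362159, 74000077, 78000071,
          332900047, 999900079, 2001862367, 4201971233):
    print(f"M{p}: factor {exponent_cost_factor(p)}")
```

Exponents with the same bit length get the same factor, for example M39000037 and M66362159.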

The full SQL code is currently attached.
Attached Files
File Type: txt gpu_tf.txt (6.0 KB, 100 views)

Last fiddled with by Prime95 on 2014-12-05 at 00:25

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.