mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Old 2019-03-03, 17:50   #2729
tServo
 
tServo's Avatar
 
"Marv"
May 2009
near the Tannhäuser Gate

2×7×47 Posts
Default

Quote:
Originally Posted by kriesel View Post
tServo, ATH, could you run the problem executables from the beginning of an exponent with
Code:
ReportIterations=1
set in the ini file?
Does every iteration from the very start consist of zero res64, or does it take a while to start?
My results were the same as ATH's.
Also, as I mentioned, I looked at the code that generated all the warnings in the first attached compile of ATH's just above, and it all looked OK.
My warnings were the same, of course.
tServo is offline   Reply With Quote
Old 2019-03-03, 17:58   #2730
mognuts
 
mognuts's Avatar
 
Sep 2008
Bromley, England

43 Posts
Default

Quote:
Originally Posted by ATH View Post
Instead of testing M3021377 you should probably test some fixed number of iterations of an exponent the same size as the ones you are going to use CUDALucas for: either an 83M-85M exponent for LL or 46M-48M for LL-D, to make sure that version is also fastest at that particular FFT size.
I've run some benchmarks on M57885161 (FFT 3136K) with various CUDALucas v2.06beta 64-bit CUDA binaries on a GTX 1060 6GB. These are the times for 190000 iterations, with Nvidia driver version 419.17 on Windows 7. It might be of some use to somebody, but as always with this type of thing, your mileage will vary.

Code:
CUDA
Ver       Time
====    =======
 4.2    16m 19s
 5.0    15m 46s
 5.5    16m 06s
 6.0    15m 46s
 6.5    15m 48s
 8.0    15m 51s
10.0    20m 15s
10.1    18m 28s
mognuts is offline   Reply With Quote
Old 2019-03-03, 19:14   #2731
nomead
 
nomead's Avatar
 
"Sam Laur"
Dec 2018
Turku, Finland

317 Posts
Default

I can't confirm any big difference between 10.0 and 10.1 performance on an RTX 2060. Of course I can't run any earlier versions, since they have no support for CC 7.5. But maybe they did some Pascal-only optimizations in CUDA 10.1. Or, judging by the performance drop from 8.0 to 10.0, maybe they just undid some of the optimizations made back then.

Driver 417.71, CUDA 10.0 :
Code:
Device              GeForce RTX 2060
Compatibility       7.5
clockRate (MHz)     1830
memClockRate (MHz)  7001

  fft    max exp  ms/iter
 1024   19535569   1.0761
 1152   21921901   1.2840
 1296   24599717   1.5576
 1323   25101101   1.7955
 1350   25602229   1.8542
 1372   26010389   1.8911
 1400   26529691   1.9210
 1458   27604673   1.9611
 1512   28604657   1.9716
 1600   30232693   2.0590
 1728   32597297   2.0927
 1792   33778141   2.1136
 2048   38492887   2.1167
 2304   43194913   2.6200
 2560   47885689   3.1887
 2592   48471289   3.2526
 2744   51250889   3.6847
 2880   53735041   3.8500
 2916   54392209   3.8560
 3200   59570449   3.9097
 3456   64229677   3.9697
 4096   75846319   4.2404
 4608   85111207   5.4321
 5184   95507747   6.3966
 5400   99399967   7.4287
 5760  105879517   7.4607
 5832  107174381   7.7607
 6400  117377567   8.0064
 6912  126558077   8.2118
 7168  131142761   9.5563
 8192  149447533  10.1720
Driver 419.17, CUDA 10.1 :
Code:
Device              GeForce RTX 2060
Compatibility       7.5
clockRate (MHz)     1830
memClockRate (MHz)  7001

  fft    max exp  ms/iter
 1024   19535569   1.2139
 1152   21921901   1.2783
 1296   24599717   1.5568
 1323   25101101   1.7832
 1372   26010389   1.8362
 1458   27604673   1.9038
 1512   28604657   1.9646
 1600   30232693   2.0505
 1728   32597297   2.0808
 1792   33778141   2.1063
 2048   38492887   2.1119
 2304   43194913   2.6102
 2560   47885689   3.1769
 2592   48471289   3.2404
 2744   51250889   3.6480
 2916   54392209   3.7917
 3200   59570449   3.8857
 3456   64229677   3.9463
 4096   75846319   4.2507
 4608   85111207   5.4388
 5184   95507747   6.3674
 5400   99399967   7.4289
 5760  105879517   7.4390
 5832  107174381   7.5620
 6400  117377567   8.0081
 6912  126558077   8.1465
 7168  131142761   9.4851
 8192  149447533  10.0911
nomead is offline   Reply With Quote
Old 2019-03-04, 19:49   #2732
ATH
Einyen
 
ATH's Avatar
 
Dec 2003
Denmark

2×1,579 Posts
Default

Someone who has access should remove CUDALucas 2.05 from the site completely, to avoid people wasting time on the 2.05 bug.

It is even listed as the "Latest version" here:
https://sourceforge.net/projects/cudalucas/files/

Last fiddled with by ATH on 2019-03-04 at 19:49
ATH is online now   Reply With Quote
Old 2019-03-05, 19:12   #2733
ATH
Einyen
 
ATH's Avatar
 
Dec 2003
Denmark

2·1,579 Posts
Default

Quote:
Originally Posted by mognuts View Post
CUDA 10.1 is faster than CUDA 10.0, but 5.0 is the fastest binary. I tested using the known prime, M3021377.
CUDA 5.0 and 5.5 are faster than all the others on my old Titan Black, but 5.5 is slightly faster than 5.0 at 85M, and 5.0 is slightly faster than 5.5 at 48M (LL-D).


Did you run -cufftbench and -threadbench for each version? 5.0 and 5.5 use slightly different thread counts, and some of the other versions were fastest at a completely different FFT size, like 4608K for 85M.



Code:
CUDA 5.0
Using threads: square 64, splice 128.
Starting M85000007 fft length = 5184K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Mar 05  03:11:44  |  M85000007     50000  0xf8f23b3c461eda23  |  5184K  0.01904   3.2767  163.83s  |   3:05:19:19   0.05%  |
|  Mar 05  03:14:29  |  M85000007    100000  0x51c6acdda4a0afd2  |  5184K  0.01904   3.2949  164.74s  |   3:05:29:27   0.11%  |
|  Mar 05  03:17:14  |  M85000007    150000  0xac94578b70aaa8c4  |  5184K  0.01904   3.2941  164.70s  |   3:05:30:38   0.17%  |
|  Mar 05  03:19:59  |  M85000007    200000  0xee2d0bcbd9606021  |  5184K  0.01904   3.2945  164.72s  |   3:05:29:59   0.23%  |

CUDA 5.5
Using threads: square 64, splice 512.
Starting M85000007 fft length = 5184K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Mar 05  03:24:06  |  M85000007     50000  0xf8f23b3c461eda23  |  5184K  0.01855   3.2741  163.70s  |   3:05:15:41   0.05%  |
|  Mar 05  03:26:51  |  M85000007    100000  0x51c6acdda4a0afd2  |  5184K  0.01880   3.2832  164.16s  |   3:05:19:22   0.11%  |
|  Mar 05  03:29:35  |  M85000007    150000  0xac94578b70aaa8c4  |  5184K  0.01880   3.2870  164.35s  |   3:05:20:34   0.17%  |
|  Mar 05  03:32:19  |  M85000007    200000  0xee2d0bcbd9606021  |  5184K  0.01953   3.2882  164.41s  |   3:05:20:14   0.23%  |



CUDA5.0
Using threads: square 64, splice 256.
Starting M48000013 fft length = 2592K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Mar 05  03:39:55  |  M48000013     50000  0x75639b37776dd731  |  2592K  0.23047   1.5739   78.69s  |     20:57:48   0.10%  |
|  Mar 05  03:41:14  |  M48000013    100000  0x43895b9120fd80b4  |  2592K  0.23438   1.5790   78.95s  |     20:58:33   0.20%  |
|  Mar 05  03:42:33  |  M48000013    150000  0xa14c61edcdb4c50a  |  2592K  0.23438   1.5788   78.94s  |     20:57:52   0.31%  |
|  Mar 05  03:43:52  |  M48000013    200000  0xd4deb59cd8ee0cfe  |  2592K  0.24219   1.5788   78.93s  |     20:56:52   0.41%  |

CUDA5.5
Using threads: square 64, splice 128.
Starting M48000013 fft length = 2592K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Mar 05  03:34:01  |  M48000013     50000  0x75639b37776dd731  |  2592K  0.22266   1.5916   79.58s  |     21:11:59   0.10%  |
|  Mar 05  03:35:20  |  M48000013    100000  0x43895b9120fd80b4  |  2592K  0.24658   1.5958   79.79s  |     21:12:21   0.20%  |
|  Mar 05  03:36:40  |  M48000013    150000  0xa14c61edcdb4c50a  |  2592K  0.24658   1.5957   79.78s  |     21:11:32   0.31%  |
|  Mar 05  03:38:00  |  M48000013    200000  0xd4deb59cd8ee0cfe  |  2592K  0.24805   1.5957   79.78s  |     21:10:28   0.41%  |

Last fiddled with by ATH on 2019-03-05 at 19:14
ATH is online now   Reply With Quote
Old 2019-03-08, 18:02   #2734
tServo
 
tServo's Avatar
 
"Marv"
May 2009
near the Tannhäuser Gate

2·7·47 Posts
Default

Quote:
Originally Posted by Prime95 View Post
If I understand the 2080 architecture correctly, LL test speed could be improved (perhaps greatly) by going to 128-bit fixed point reals represented as four 32-bit integers. I investigated this somewhat 4 years ago when 32-bit adds had huge throughput advantage but 32-bit multiplies had no advantage compared to DP throughput. IIUC, in the 2080 both 32-bit adds and 32-bit multiplies have a huge throughput advantage compared to DP throughput.

The basic idea is that adding two 128-bit fixed point reals requires four 32-bit adds (with carries) plus some overhead for handling signs. Multiplying two 128-bit fixed point reals requires sixteen 32-bit multiplies, plus some adds, and some overhead for handling signs.

Each FFT butterfly adds and subtracts FFT data values which increases the maximum FFT data value by one bit. Thus, the fixed point reals must be shifted one bit prior to a butterfly (i.e. move the implied decimal point). This adds some additional overhead in implementing a fixed-point real FFT.

My research indicated we could store as many as 51 bits of input data in each 128-bit fixed point real. This (51/128) is much more memory efficient than current DP FFTs which store about 17-bits of data in each 64-bit double.

Is there any flaw in my understanding of the 2080 architecture? Does anyone have time to explore the feasibility of this approach?
George,
This sounds quite interesting and, as a matter of fact, I was thinking along these same lines ever since I heard that Volta had separated integers from floats (it may have started in Pascal). I was examining the math routines in mmff for inspiration.

A few questions:

When you speak of fixed-point format, are you referring to the IEEE 754 standard used in CUDA to represent reals?

Instead of the 16 multiplies when multiplying two numbers of four 32-bit words each, wouldn't Karatsuba be much better, reducing multiplies to adds? In fact, two levels of Karatsuba may be used, for three separate Karatsubas. It should be unrolled to be as linear as possible instead of using recursion or subroutine calls, of course.
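For illustration only, here is a minimal sketch (plain Python, not CUDA; the function name is mine, not from any CUDALucas code) of one Karatsuba level applied to a 128x128-bit product, showing the three-multiplies-instead-of-four trade being suggested:

```python
# One level of Karatsuba: a 128x128-bit multiply done with three 64x64
# multiplies instead of the four that the schoolbook method needs.
def karatsuba128(a, b):
    MASK64 = (1 << 64) - 1
    a_hi, a_lo = a >> 64, a & MASK64
    b_hi, b_lo = b >> 64, b & MASK64
    z0 = a_lo * b_lo                                  # multiply 1: low half
    z2 = a_hi * b_hi                                  # multiply 2: high half
    z1 = (a_lo + a_hi) * (b_lo + b_hi) - z0 - z2      # multiply 3: middle term
    return z0 + (z1 << 64) + (z2 << 128)              # full 256-bit product
```

Applying a second level to each 64x64 multiply would give the "two levels" mentioned above, at the cost of extra adds and carry handling, which is exactly why the GPU's add/multiply throughput ratio decides whether it wins.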
tServo is offline   Reply With Quote
Old 2019-03-08, 20:22   #2735
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

152B₁₆ Posts
Default

Quote:
Originally Posted by tServo View Post
George,
This sounds quite interesting and, as a matter of fact, I was thinking along these same lines ever since I heard that Volta had separated integers from floats (it may have started in Pascal). I was examining the math routines in mmff for inspiration.

A few questions:

When you speak of fixed-point format, are you referring to the IEEE 754 standard used in CUDA to represent reals?

Instead of the 16 multiplies when multiplying two numbers of four 32-bit words each, wouldn't Karatsuba be much better, reducing multiplies to adds? In fact, two levels of Karatsuba may be used, for three separate Karatsubas. It should be unrolled to be as linear as possible instead of using recursion or subroutine calls, of course.
I wondered about that too. "As a rule of thumb, Karatsuba's method is usually faster when the multiplicands are longer than 320–640 bits." https://en.wikipedia.org/wiki/Karatsuba_algorithm That's for the multiplication itself. It does not account for the benefit to the higher-level operation of a shorter FFT length on top of it, for memory bandwidth frequently being the limiting factor, or for fitting a primality test or P-1 factoring computation of ~50% greater equivalent size into the same fixed-size GPU memory, which would allow P-1 stage 2 to run to higher bounds.
51/128 is 1.5 times the 17 bits per 64-bit word of current DP FFTs. What if we go further, to 256 bits: how many bits might then be usable per word?
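The 1.5x figure is simple arithmetic, the ratio of data bits stored per bit of FFT storage; a quick check:

```python
# Data bits stored per bit of FFT storage, for the two formats discussed.
dp_density = 17 / 64    # ~17 data bits in each 64-bit double (current DP FFTs)
fx_density = 51 / 128   # ~51 data bits in each 128-bit fixed-point real
print(fx_density / dp_density)  # -> 1.5
```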

Last fiddled with by kriesel on 2019-03-08 at 20:26
kriesel is offline   Reply With Quote
Old 2019-03-09, 00:58   #2736
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

7,537 Posts
Default

Quote:
Originally Posted by tServo View Post
When you speak of fixed point format, are you referring to the IEEE 754 standard used in Cuda to represent reals?
While this may not be the best approach, what I played with was a mantissa-only float stored as 4 integers. Each FFT element has the same (not stored) exponent. This representation requires 16 32x32 multiplies to do a 128x128 multiply yielding a 128 bit result.

Here is a link to what I was working on:
https://www.dropbox.com/s/g46bkk3yvh...lucas.zip?dl=0

This is NOT working FFT code; it was an effort to mimic the work a fixed-point FFT would require. There are many backups showing how the code evolved. This was done 4 years ago, so I probably don't remember much of it.
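A plain-Python model of the 16-multiply limb scheme described above (my own sketch for illustration, not the code from the linked archive):

```python
# Schoolbook product of two 128-bit values held as four 32-bit limbs
# (least-significant limb first): 16 32x32->64-bit multiplies in all.
def mul128_limbs(a, b):
    MASK32 = 0xFFFFFFFF
    prod = [0] * 8                       # room for the 256-bit full product
    for i in range(4):
        carry = 0
        for j in range(4):
            t = prod[i + j] + a[i] * b[j] + carry   # one 32x32 multiply
            prod[i + j] = t & MASK32
            carry = t >> 32
        prod[i + 4] += carry             # propagate the final carry
    return prod   # a real implementation would keep only 128 bits of this
```

On the GPU the inner products would presumably map onto 32-bit multiply-add instructions with carry, which is where the integer-throughput advantage would come in.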
Prime95 is offline   Reply With Quote
Old 2019-03-09, 01:04   #2737
tServo
 
tServo's Avatar
 
"Marv"
May 2009
near the Tannhäuser Gate

2·7·47 Posts
Default

Quote:
Originally Posted by kriesel View Post
I wondered about that too. "As a rule of thumb, Karatsuba's method is usually faster when the multiplicands are longer than 320–640 bits." https://en.wikipedia.org/wiki/Karatsuba_algorithm That's for the multiplication itself. It does not account for the benefit to the higher-level operation of a shorter FFT length on top of it, for memory bandwidth frequently being the limiting factor, or for fitting a primality test or P-1 factoring computation of ~50% greater equivalent size into the same fixed-size GPU memory, which would allow P-1 stage 2 to run to higher bounds.
51/128 is 1.5 times the 17 bits per 64-bit word of current DP FFTs. What if we go further, to 256 bits: how many bits might then be usable per word?
I seem to recall doing some benchmarks in the 1990s on when to switch from grade-school multiplication to Karatsuba, and from Karatsuba to FFTs, in a bignum math package I had found on Richard Crandall's website (Perfectly Scientific, or something like that).
I remember discovering that the cutover point from grade school to Karatsuba was set way too high. I can't remember about the other cutoff point, except that I was impressed with the speed of the FFT multiplication.

My point here is that the Karatsuba method may be better than that Wikipedia article concluded. I suspect it depends on the speed ratio between add and multiply.
Also, your point is a good one about possibly carrying more significant bits if a 256-bit format is used, thereby making the FFT more efficient.

Sounds like a fun project.
tServo is offline   Reply With Quote
Old 2019-03-09, 16:50   #2738
ldesnogu
 
ldesnogu's Avatar
 
Jan 2008
France

1000100110₂ Posts
Default

You can find the various thresholds for multiplication algorithms in the GMP library implementation, on many CPUs, here: https://gmplib.org/devel/thres


Toom22 is Karatsuba.
ldesnogu is offline   Reply With Quote
Old 2019-03-09, 19:54   #2739
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5,419 Posts
Default

Quote:
Originally Posted by ldesnogu View Post
You can find the various thresholds for multiplication algorithms in the GMP library implementation, on many CPUs, here: https://gmplib.org/devel/thres

Toom22 is Karatsuba.
Thank you for the link. Those pages could use a little more explanation, in my opinion.
"ABI" is, I suppose, application binary interface, specifically the bit size.
Are the "meas thres" and "conf thres" values in units larger than bits, such as words equal in size to the ABI value, or 32-bit words, or what?
It looks like meas and conf usually track pretty well, though there are cases of differences of 6, 8, or 18. Meas in sqr-toom2 ranges from 14 to 100.
I wonder what the GPU-equivalent crossovers look like.

Quote:
Originally Posted by tServo
Sounds like a fun project.
Go for it!

Quote:
Originally Posted by prime95
Here is a link to what I was working on:
https://www.dropbox.com/s/g46bkk3yvh...lucas.zip?dl=0
Is that based on any recent math reference?

Last fiddled with by kriesel on 2019-03-09 at 20:41
kriesel is offline   Reply With Quote
