mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

tServo 2019-03-03 17:50

[QUOTE=kriesel;510005]Tservo, ATH, could you run the problem executables from the beginning of an exponent with
[CODE]ReportIterations=1[/CODE] set in the ini file?
Does every iteration from the very start consist of zero res64, or does it take a while to start?[/QUOTE]

My results were the same as ATH's.
Also, as I mentioned, I looked at the code that generated all the warnings in the first compile ATH attached just above, and it all looked OK.
My warnings were the same, of course.

mognuts 2019-03-03 17:58

[QUOTE=ATH;509875]Instead of testing M3021377 you should probably test some fixed number of iterations of an exponent of the same size you are going to use CUDALucas for: either an 83M-85M exponent for LL or a 46M-48M one for LL-D, to make sure that version is also fastest at that particular FFT size.[/QUOTE]
I've run some benchmarks on M57885161 (FFT 3136K) with various CUDALucas v2.06beta 64-bit CUDA binaries on a GTX 1060 6GB. These are the times for 190000 iterations, with Nvidia driver version 419.17 on Windows 7. It might be of some use to somebody, but as always with this type of thing, your mileage will vary.

[CODE]
CUDA
Ver    Time
=====  =======
 4.2   16m 19s
 5.0   15m 46s
 5.5   16m 06s
 6.0   15m 46s
 6.5   15m 48s
 8.0   15m 51s
10.0   20m 15s
10.1   18m 28s[/CODE]

nomead 2019-03-03 19:14

I can't confirm that there is a big difference between 10.0 and 10.1 performance on an RTX 2060. I can't run any earlier versions, of course, since they have no support for CC 7.5. But maybe they did some Pascal-only optimizations in CUDA 10.1. Or, judging by the performance drop from 8.0 to 10.0, maybe they just undid some of the optimizations made back then. :smile:

Driver 417.71, CUDA 10.0 :
[CODE]Device GeForce RTX 2060
Compatibility 7.5
clockRate (MHz) 1830
memClockRate (MHz) 7001

fft max exp ms/iter
1024 19535569 1.0761
1152 21921901 1.2840
1296 24599717 1.5576
1323 25101101 1.7955
1350 25602229 1.8542
1372 26010389 1.8911
1400 26529691 1.9210
1458 27604673 1.9611
1512 28604657 1.9716
1600 30232693 2.0590
1728 32597297 2.0927
1792 33778141 2.1136
2048 38492887 2.1167
2304 43194913 2.6200
2560 47885689 3.1887
2592 48471289 3.2526
2744 51250889 3.6847
2880 53735041 3.8500
2916 54392209 3.8560
3200 59570449 3.9097
3456 64229677 3.9697
4096 75846319 4.2404
4608 85111207 5.4321
5184 95507747 6.3966
5400 99399967 7.4287
5760 105879517 7.4607
5832 107174381 7.7607
6400 117377567 8.0064
6912 126558077 8.2118
7168 131142761 9.5563
8192 149447533 10.1720[/CODE]

Driver 419.17, CUDA 10.1 :
[CODE]Device GeForce RTX 2060
Compatibility 7.5
clockRate (MHz) 1830
memClockRate (MHz) 7001

fft max exp ms/iter
1024 19535569 1.2139
1152 21921901 1.2783
1296 24599717 1.5568
1323 25101101 1.7832
1372 26010389 1.8362
1458 27604673 1.9038
1512 28604657 1.9646
1600 30232693 2.0505
1728 32597297 2.0808
1792 33778141 2.1063
2048 38492887 2.1119
2304 43194913 2.6102
2560 47885689 3.1769
2592 48471289 3.2404
2744 51250889 3.6480
2916 54392209 3.7917
3200 59570449 3.8857
3456 64229677 3.9463
4096 75846319 4.2507
4608 85111207 5.4388
5184 95507747 6.3674
5400 99399967 7.4289
5760 105879517 7.4390
5832 107174381 7.5620
6400 117377567 8.0081
6912 126558077 8.1465
7168 131142761 9.4851
8192 149447533 10.0911[/CODE]

ATH 2019-03-04 19:49

Someone who has access should remove CUDALucas 2.05 from the site completely, to avoid people wasting time on the 2.05 bug.

It is even listed as the "Latest version" here:
[url]https://sourceforge.net/projects/cudalucas/files/[/url]

ATH 2019-03-05 19:12

[QUOTE=mognuts;509874]Cuda 10.1 is faster than Cuda 10.0, but 5.0 is the fastest binary. I test using the known prime, M3021377.[/QUOTE]

CUDA 5.0 and 5.5 are faster than all the others on my old Titan Black. But 5.5 is slightly faster than 5.0 at 85M and 5.0 is slightly faster than 5.5 at 48M (LL-D).


Did you do -cufftbench and -threadbench for each version? 5.0 and 5.5 use slightly different thread counts, and some of the other versions were fastest at a completely different FFT size, like 4608K for 85M.



[CODE]CUDA 5.0
Using threads: square 64, splice 128.
Starting M85000007 fft length = 5184K
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Mar 05 03:11:44 | M85000007 50000 0xf8f23b3c461eda23 | 5184K 0.01904 3.2767 163.83s | 3:05:19:19 0.05% |
| Mar 05 03:14:29 | M85000007 100000 0x51c6acdda4a0afd2 | 5184K 0.01904 3.2949 164.74s | 3:05:29:27 0.11% |
| Mar 05 03:17:14 | M85000007 150000 0xac94578b70aaa8c4 | 5184K 0.01904 3.2941 164.70s | 3:05:30:38 0.17% |
| Mar 05 03:19:59 | M85000007 200000 0xee2d0bcbd9606021 | 5184K 0.01904 3.2945 164.72s | 3:05:29:59 0.23% |

CUDA 5.5
Using threads: square 64, splice 512.
Starting M85000007 fft length = 5184K
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Mar 05 03:24:06 | M85000007 50000 0xf8f23b3c461eda23 | 5184K 0.01855 3.2741 163.70s | 3:05:15:41 0.05% |
| Mar 05 03:26:51 | M85000007 100000 0x51c6acdda4a0afd2 | 5184K 0.01880 3.2832 164.16s | 3:05:19:22 0.11% |
| Mar 05 03:29:35 | M85000007 150000 0xac94578b70aaa8c4 | 5184K 0.01880 3.2870 164.35s | 3:05:20:34 0.17% |
| Mar 05 03:32:19 | M85000007 200000 0xee2d0bcbd9606021 | 5184K 0.01953 3.2882 164.41s | 3:05:20:14 0.23% |



CUDA5.0
Using threads: square 64, splice 256.
Starting M48000013 fft length = 2592K
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Mar 05 03:39:55 | M48000013 50000 0x75639b37776dd731 | 2592K 0.23047 1.5739 78.69s | 20:57:48 0.10% |
| Mar 05 03:41:14 | M48000013 100000 0x43895b9120fd80b4 | 2592K 0.23438 1.5790 78.95s | 20:58:33 0.20% |
| Mar 05 03:42:33 | M48000013 150000 0xa14c61edcdb4c50a | 2592K 0.23438 1.5788 78.94s | 20:57:52 0.31% |
| Mar 05 03:43:52 | M48000013 200000 0xd4deb59cd8ee0cfe | 2592K 0.24219 1.5788 78.93s | 20:56:52 0.41% |

CUDA5.5
Using threads: square 64, splice 128.
Starting M48000013 fft length = 2592K
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Mar 05 03:34:01 | M48000013 50000 0x75639b37776dd731 | 2592K 0.22266 1.5916 79.58s | 21:11:59 0.10% |
| Mar 05 03:35:20 | M48000013 100000 0x43895b9120fd80b4 | 2592K 0.24658 1.5958 79.79s | 21:12:21 0.20% |
| Mar 05 03:36:40 | M48000013 150000 0xa14c61edcdb4c50a | 2592K 0.24658 1.5957 79.78s | 21:11:32 0.31% |
| Mar 05 03:38:00 | M48000013 200000 0xd4deb59cd8ee0cfe | 2592K 0.24805 1.5957 79.78s | 21:10:28 0.41% |
[/CODE]

tServo 2019-03-08 18:02

[QUOTE=Prime95;497517]If I understand the 2080 architecture correctly, LL test speed could be improved (perhaps greatly) by going to 128-bit fixed point reals represented as four 32-bit integers. I investigated this somewhat 4 years ago when 32-bit adds had huge throughput advantage but 32-bit multiplies had no advantage compared to DP throughput. IIUC, in the 2080 both 32-bit adds and 32-bit multiplies have a huge throughput advantage compared to DP throughput.

The basic idea is that adding two 128-bit fixed point reals requires four 32-bit adds (with carries) plus some overhead for handling signs. Multiplying two 128-bit fixed point reals requires sixteen 32-bit multiplies, plus some adds, and some overhead for handling signs.

Each FFT butterfly adds and subtracts FFT data values which increases the maximum FFT data value by one bit. Thus, the fixed point reals must be shifted one bit prior to a butterfly (i.e. move the implied decimal point). This adds some additional overhead in implementing a fixed-point real FFT.

My research indicated we could store as many as 51 bits of input data in each 128-bit fixed point real. This (51/128) is much more memory efficient than current DP FFTs which store about 17-bits of data in each 64-bit double.

Is there any flaw in my understanding of the 2080 architecture? Does anyone have time to explore the feasibility of this approach?[/QUOTE]

George,
This sounds quite interesting and, as a matter of fact, I was thinking along these same lines ever since I heard that Volta had separated the integer units from the floating-point units (it may have started in Pascal). I was examining the math routines in mmff for inspiration.

A few questions :

When you speak of fixed-point format, are you referring to the IEEE 754 standard used in CUDA to represent reals?

Instead of the 16 multiplies needed when multiplying two 4x32-bit numbers, wouldn't Karatsuba be much better, reducing multiplies to adds? In fact, 2 levels may be used: the top-level Karatsuba yields 3 half-size multiplies, each of which can be done with Karatsuba again. It should be unrolled to be as linear as possible instead of using recursion or subroutine calls, of course.
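For reference, one level of the trade-off tServo is asking about can be sketched in plain C: a 32x32 -> 64-bit product built from 16-bit halves, using 3 multiplies where schoolbook needs 4. This is only an illustration of the recurrence; whether it wins for 32-bit limbs on a GPU depends on the add/multiply throughput ratio discussed above.

```c
#include <stdint.h>

/* One level of Karatsuba: a 32x32 -> 64-bit product from 16-bit
 * halves.  Schoolbook needs 4 small multiplies (a0*b0, a0*b1,
 * a1*b0, a1*b1); Karatsuba gets both cross terms from a single
 * multiply, at the cost of extra adds and subtracts.  The sums
 * a0+a1 and b0+b1 are at most 17 bits, so their product fits
 * comfortably in a uint64_t. */
uint64_t kara_mul32(uint32_t a, uint32_t b)
{
    uint64_t a0 = a & 0xFFFF, a1 = a >> 16;
    uint64_t b0 = b & 0xFFFF, b1 = b >> 16;

    uint64_t z0 = a0 * b0;                          /* multiply 1: low  x low  */
    uint64_t z2 = a1 * b1;                          /* multiply 2: high x high */
    uint64_t z1 = (a0 + a1) * (b0 + b1) - z0 - z2;  /* multiply 3: a0*b1 + a1*b0 */

    return z0 + (z1 << 16) + (z2 << 32);
}
```

Recursing once more into each of the 3 half-size products gives tServo's "2 levels". At these sizes a hardware 32x32 multiplier makes the question moot on a CPU; the interesting case is the GPU, where the add/multiply throughput ratio decides it.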

kriesel 2019-03-08 20:22

[QUOTE=tServo;510436]George,
This sounds quite interesting and, as a matter of fact, I was thinking along these same lines ever since I heard that Volta had separated the integer units from the floating-point units (it may have started in Pascal). I was examining the math routines in mmff for inspiration.

A few questions :

When you speak of fixed-point format, are you referring to the IEEE 754 standard used in CUDA to represent reals?

Instead of the 16 multiplies needed when multiplying two 4x32-bit numbers, wouldn't Karatsuba be much better, reducing multiplies to adds? In fact, 2 levels may be used: the top-level Karatsuba yields 3 half-size multiplies, each of which can be done with Karatsuba again. It should be unrolled to be as linear as possible instead of using recursion or subroutine calls, of course.[/QUOTE]
I wondered about that too. "As a rule of thumb, Karatsuba's method is usually faster when the multiplicands are longer than 320–640 bits." [URL]https://en.wikipedia.org/wiki/Karatsuba_algorithm[/URL] That's for the multiplication itself. It doesn't account for the higher-level benefit of a shorter fft length on top of it, or for memory bandwidth frequently being the limiting factor, or for being able to fit a primality test or P-1 factoring computation of ~50% greater equivalent size into the same fixed-size gpu memory, which would allow P-1 stage 2 to higher bounds.
51 bits per 128-bit word is 1.5 x the ~17 bits per 64-bit word of a DP FFT. What if we go further, to 256-bit; then how many bits might be usable per word?
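Just to pin down that 1.5x figure (the 51-bit and ~17-bit densities are from George's post above):

```c
/* Data density, in bits of FFT input per bit of storage, for the two
 * representations under discussion: 51 bits carried in a 128-bit
 * fixed-point real vs. ~17 bits carried in a 64-bit double. */
double density_ratio(void)
{
    double fixed_density = 51.0 / 128.0;   /* ~0.398 */
    double dp_density    = 17.0 / 64.0;    /* ~0.266 */
    return fixed_density / dp_density;     /* 51*64 / (128*17) = 1.5 exactly */
}
```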

Prime95 2019-03-09 00:58

[QUOTE=tServo;510436]
When you speak of fixed point format, are you referring to the IEEE 754 standard used in Cuda to represent reals?[/QUOTE]

While this may not be the best approach, what I played with was a mantissa-only float stored as 4 integers. Each FFT element has the same (not stored) exponent. This representation requires 16 32x32 multiplies to do a 128x128 multiply yielding a 128-bit result.

Here is a link to what I was working on:
[url]https://www.dropbox.com/s/g46bkk3yvh0sqo1/mycudalucas.zip?dl=0[/url]

This is NOT working FFT code; it was an effort to mimic the work a fixed-point FFT would require. There are many backups showing how the code evolved. This was done 4 years ago, so I probably don't remember much of it.
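For anyone who wants to see the shape of the inner loop before digging into the zip, here is a small C sketch (my own illustration, not taken from George's code) of the sixteen-multiply scheme: a 128x128-bit mantissa product, each operand held as four 32-bit limbs, keeping the high 128 bits of the 256-bit result. Sign handling and the shared implied exponent are left out.

```c
#include <stdint.h>

/* 128x128 -> high 128 bits; operands are four 32-bit limbs in
 * little-endian limb order.  The double loop performs exactly the
 * sixteen 32x32 -> 64-bit products described above, accumulating
 * into 64-bit columns so no intermediate sum can overflow (at most
 * eight sub-2^32 terms per column, well under 2^64). */
void mul128_hi(const uint32_t a[4], const uint32_t b[4], uint32_t hi[4])
{
    uint64_t col[8] = {0};                        /* 256-bit result, 32 bits per column */

    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            uint64_t p = (uint64_t)a[i] * b[j];   /* one of the 16 products */
            col[i + j]     += (uint32_t)p;        /* low 32 bits of the product */
            col[i + j + 1] += p >> 32;            /* high 32 bits of the product */
        }

    uint64_t carry = 0;
    for (int k = 0; k < 8; k++) {                 /* propagate column carries */
        uint64_t s = col[k] + carry;
        if (k >= 4)
            hi[k - 4] = (uint32_t)s;              /* keep only the high half */
        carry = s >> 32;
    }
}
```

A real kernel would fuse this with the one-bit shift per butterfly that George mentions, but the 16-product accumulation above is the core of the cost model being debated.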

tServo 2019-03-09 01:04

[QUOTE=kriesel;510449]I wondered about that too. "As a rule of thumb, Karatsuba's method is usually faster when the multiplicands are longer than 320–640 bits." [URL]https://en.wikipedia.org/wiki/Karatsuba_algorithm[/URL] That's for the multiplication itself. It doesn't account for the higher-level benefit of a shorter fft length on top of it, or for memory bandwidth frequently being the limiting factor, or for being able to fit a primality test or P-1 factoring computation of ~50% greater equivalent size into the same fixed-size gpu memory, which would allow P-1 stage 2 to higher bounds.
51 bits per 128-bit word is 1.5 x the ~17 bits per 64-bit word of a DP FFT. What if we go further, to 256-bit; then how many bits might be usable per word?[/QUOTE]

I seem to recall doing some benchmarks in the 1990s on when to switch from grade-school multiplication to Karatsuba, and from Karatsuba to FFTs, in a bignum math package I had found on Richard Crandall's website (Perfectly Scientific, or something like that).
I remember discovering that the cutover point from grade school to Karatsuba was set waaaay too high. I can't remember about the other cutoff point, except that I was impressed with the speed of the FFT multiplication.

My point here is that the Karatsuba method may be better than that Wikipedia article concluded. I suspect it depends on the speed ratio between add and multiply.
Also, your point is a good one: a 256-bit format might carry more significant bits per word, thereby making the FFT more efficient.

Sounds like a fun project.

ldesnogu 2019-03-09 16:50

You can find the thresholds between multiplication algorithms in the GMP library implementation, for many CPUs, here: [url]https://gmplib.org/devel/thres[/url]


Toom22 is Karatsuba.

kriesel 2019-03-09 19:54

[QUOTE=ldesnogu;510493]You can find the thresholds between multiplication algorithms in the GMP library implementation, for many CPUs, here: [URL]https://gmplib.org/devel/thres[/URL]

Toom22 is Karatsuba.[/QUOTE]
Thank you for the link. Those pages could use a little more explanation, in my opinion.
"ABI" is, I suppose, application binary interface, specifically the bit size.
Are the "meas thres" and "conf thres" in units larger than bits, such as words equal in size to the ABI value, or 32-bit words, or what?
It looks like meas and conf usually track pretty well, though there are cases of 6, 8, or 18 difference. Meas in sqr-toom2 ranges from 14 to 100.
I wonder what the gpu equivalent crossovers look like.

[QUOTE=tServo]Sounds like a fun project.[/QUOTE]Go for it!

[QUOTE=prime95]Here is a link to what I was working on:
[URL="https://www.dropbox.com/s/g46bkk3yvh0sqo1/mycudalucas.zip?dl=0"]https://www.dropbox.com/s/g46bkk3yvh...lucas.zip?dl=0[/URL][/QUOTE]Is that based on any recent math reference?

