#2729
"Marv"
May 2009
near the Tannhäuser Gate
2×7×47 Posts
Quote:
Also, as I mentioned, I looked at the code that generated all the warnings in the first attached compile of ATH's just above here, and it all looked OK. My warnings were the same, of course.
#2730
Sep 2008
Bromley, England
43 Posts
Quote:
Code:
CUDA Ver   Time
========   =======
4.2        16m 19s
5.0        15m 46s
5.5        16m 06s
6.0        15m 46s
6.5        15m 48s
8.0        15m 51s
10.0       20m 15s
10.1       18m 28s
#2731
"Sam Laur"
Dec 2018
Turku, Finland
317 Posts
I can't confirm that there is a big difference between 10.0 and 10.1 performance on an RTX 2060. Of course I can't run any earlier versions, since they have no support for CC 7.5. But maybe they did some Pascal-only optimizations in CUDA 10.1. Or, judging by the performance drop from 8.0 to 10.0, maybe they just unrolled some of the work done back then.
Driver 417.71, CUDA 10.0:
Code:
Device GeForce RTX 2060
Compatibility 7.5
clockRate (MHz) 1830
memClockRate (MHz) 7001

 fft     max exp    ms/iter
1024    19535569     1.0761
1152    21921901     1.2840
1296    24599717     1.5576
1323    25101101     1.7955
1350    25602229     1.8542
1372    26010389     1.8911
1400    26529691     1.9210
1458    27604673     1.9611
1512    28604657     1.9716
1600    30232693     2.0590
1728    32597297     2.0927
1792    33778141     2.1136
2048    38492887     2.1167
2304    43194913     2.6200
2560    47885689     3.1887
2592    48471289     3.2526
2744    51250889     3.6847
2880    53735041     3.8500
2916    54392209     3.8560
3200    59570449     3.9097
3456    64229677     3.9697
4096    75846319     4.2404
4608    85111207     5.4321
5184    95507747     6.3966
5400    99399967     7.4287
5760   105879517     7.4607
5832   107174381     7.7607
6400   117377567     8.0064
6912   126558077     8.2118
7168   131142761     9.5563
8192   149447533    10.1720
Code:
Device GeForce RTX 2060
Compatibility 7.5
clockRate (MHz) 1830
memClockRate (MHz) 7001

 fft     max exp    ms/iter
1024    19535569     1.2139
1152    21921901     1.2783
1296    24599717     1.5568
1323    25101101     1.7832
1372    26010389     1.8362
1458    27604673     1.9038
1512    28604657     1.9646
1600    30232693     2.0505
1728    32597297     2.0808
1792    33778141     2.1063
2048    38492887     2.1119
2304    43194913     2.6102
2560    47885689     3.1769
2592    48471289     3.2404
2744    51250889     3.6480
2916    54392209     3.7917
3200    59570449     3.8857
3456    64229677     3.9463
4096    75846319     4.2507
4608    85111207     5.4388
5184    95507747     6.3674
5400    99399967     7.4289
5760   105879517     7.4390
5832   107174381     7.5620
6400   117377567     8.0081
6912   126558077     8.1465
7168   131142761     9.4851
8192   149447533    10.0911
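A quick way to read these tables: dividing max exp by the FFT length in words gives the average bits carried per double-precision word. The check below is my own arithmetic on the table data, not CUDALucas output:

```python
# Bits per FFT word implied by the "fft / max exp" columns above
# (my arithmetic, not CUDALucas output). FFT sizes are in K = 1024
# double words.
rows = [(1024, 19535569), (2048, 38492887), (4096, 75846319), (8192, 149447533)]
density = {fft_k: max_exp / (fft_k * 1024) for fft_k, max_exp in rows}
for fft_k, bits in density.items():
    print(f"{fft_k}K: {bits:.2f} bits/word")
```

The density falls slowly as the FFT grows (from about 18.6 to about 17.8 bits per word here), reflecting the round-off error that accumulates with transform size.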
#2732
Einyen
Dec 2003
Denmark
3158₁₀ Posts
Someone who has access should remove CUDALucas 2.05 from the site completely, to avoid people wasting time on the 2.05 bug.
It is even listed as the "Latest version" here: https://sourceforge.net/projects/cudalucas/files/
Last fiddled with by ATH on 2019-03-04 at 19:49
#2733
Einyen
Dec 2003
Denmark
110001010110₂ Posts
Quote:
Did you do -cufftbench and -threadbench for each version? 5.0 and 5.5 use slightly different thread counts, and some of the other versions were fastest at a completely different FFT size, such as 4608K for 85M.
Code:
CUDA 5.0
Using threads: square 64, splice 128.
Starting M85000007 fft length = 5184K
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Mar 05 03:11:44 | M85000007 50000 0xf8f23b3c461eda23 | 5184K 0.01904 3.2767 163.83s | 3:05:19:19 0.05% |
| Mar 05 03:14:29 | M85000007 100000 0x51c6acdda4a0afd2 | 5184K 0.01904 3.2949 164.74s | 3:05:29:27 0.11% |
| Mar 05 03:17:14 | M85000007 150000 0xac94578b70aaa8c4 | 5184K 0.01904 3.2941 164.70s | 3:05:30:38 0.17% |
| Mar 05 03:19:59 | M85000007 200000 0xee2d0bcbd9606021 | 5184K 0.01904 3.2945 164.72s | 3:05:29:59 0.23% |

CUDA 5.5
Using threads: square 64, splice 512.
Starting M85000007 fft length = 5184K
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Mar 05 03:24:06 | M85000007 50000 0xf8f23b3c461eda23 | 5184K 0.01855 3.2741 163.70s | 3:05:15:41 0.05% |
| Mar 05 03:26:51 | M85000007 100000 0x51c6acdda4a0afd2 | 5184K 0.01880 3.2832 164.16s | 3:05:19:22 0.11% |
| Mar 05 03:29:35 | M85000007 150000 0xac94578b70aaa8c4 | 5184K 0.01880 3.2870 164.35s | 3:05:20:34 0.17% |
| Mar 05 03:32:19 | M85000007 200000 0xee2d0bcbd9606021 | 5184K 0.01953 3.2882 164.41s | 3:05:20:14 0.23% |

CUDA 5.0
Using threads: square 64, splice 256.
Starting M48000013 fft length = 2592K
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Mar 05 03:39:55 | M48000013 50000 0x75639b37776dd731 | 2592K 0.23047 1.5739 78.69s | 20:57:48 0.10% |
| Mar 05 03:41:14 | M48000013 100000 0x43895b9120fd80b4 | 2592K 0.23438 1.5790 78.95s | 20:58:33 0.20% |
| Mar 05 03:42:33 | M48000013 150000 0xa14c61edcdb4c50a | 2592K 0.23438 1.5788 78.94s | 20:57:52 0.31% |
| Mar 05 03:43:52 | M48000013 200000 0xd4deb59cd8ee0cfe | 2592K 0.24219 1.5788 78.93s | 20:56:52 0.41% |

CUDA 5.5
Using threads: square 64, splice 128.
Starting M48000013 fft length = 2592K
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Mar 05 03:34:01 | M48000013 50000 0x75639b37776dd731 | 2592K 0.22266 1.5916 79.58s | 21:11:59 0.10% |
| Mar 05 03:35:20 | M48000013 100000 0x43895b9120fd80b4 | 2592K 0.24658 1.5958 79.79s | 21:12:21 0.20% |
| Mar 05 03:36:40 | M48000013 150000 0xa14c61edcdb4c50a | 2592K 0.24658 1.5957 79.78s | 21:11:32 0.31% |
| Mar 05 03:38:00 | M48000013 200000 0xd4deb59cd8ee0cfe | 2592K 0.24805 1.5957 79.78s | 21:10:28 0.41% |
Last fiddled with by ATH on 2019-03-05 at 19:14
#2734
"Marv"
May 2009
near the Tannhäuser Gate
2·7·47 Posts
Quote:
This sounds quite interesting and, as a matter of fact, I have been thinking along these same lines ever since I heard that Volta had separated integer units from floating-point units (it may have started in Pascal). I was examining the math routines in mmff for inspiration. A few questions: When you speak of fixed-point format, are you referring to the IEEE 754 standard used in CUDA to represent reals? Instead of the 16 multiplies needed to multiply two 4×32-bit numbers, wouldn't Karatsuba be much better, trading multiplies for adds? In fact, two levels of Karatsuba may be used, giving 3 separate Karatsubas. It should be unrolled to be as linear as possible instead of using recursion or subroutine calls, of course.
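For reference, the multiply-count saving on a 4-limb product can be sketched in a few lines of Python. This is only an illustration of the Karatsuba identity with assumed 32-bit limbs; it is not CUDALucas or mmff code:

```python
LIMB_BITS = 32  # assumed limb width

def karatsuba_4x4(a, b):
    """One Karatsuba level on a 4-limb x 4-limb (128x128-bit) product.

    Schoolbook needs 16 limb multiplies; the three 2x2-limb half-products
    below need 3 x 4 = 12 (a second Karatsuba level would bring that to 9),
    at the cost of a few extra additions and subtractions.
    """
    half = 2 * LIMB_BITS
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    z0 = a_lo * b_lo                              # low half-product
    z2 = a_hi * b_hi                              # high half-product
    z1 = (a_lo + a_hi) * (b_lo + b_hi) - z0 - z2  # middle term, one extra mul
    return z0 + (z1 << half) + (z2 << (2 * half))
```

On a GPU the recursion would indeed be unrolled flat, as the post suggests, since the split points are fixed at compile time.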
#2735
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
152B₁₆ Posts
Quote:
51 bits per 128-bit word is 1.5 × the 17 bits per 64-bit word. What if we go further, to 256-bit; how many bits might then be usable per word?
Last fiddled with by kriesel on 2019-03-08 at 20:26
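Taking the figures in the post at face value, the pattern can be spelled out; note the 256-bit value below is only an extrapolation of that pattern, not a measured or proven figure:

```python
# Pure arithmetic on the figures quoted above; the 256-bit value is an
# extrapolation of the 1.5x-per-doubling pattern, not an established result.
bits_64 = 17
bits_128 = int(1.5 * (2 * bits_64))    # 51, the figure in the post
bits_256 = int(1.5 * (2 * bits_128))   # 153, if the pattern repeated
print(bits_128, bits_256)
```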
#2736
P90 years forever!
Aug 2002
Yeehaw, FL
7,537 Posts
Quote:
Here is a link to what I was working on: https://www.dropbox.com/s/g46bkk3yvh...lucas.zip?dl=0
This is NOT working FFT code; it was an effort to mimic the work a fixed-point FFT would require. There are many backups showing how the code evolved. This was done 4 years ago, so I probably don't remember much of it.
#2737
"Marv"
May 2009
near the Tannhäuser Gate
2·7·47 Posts
Quote:
I remember discovering that the cutover points were set waaaay too high for the grade-school-to-Karatsuba switch. I can't remember about the other cutoff point, except that I was impressed with the speed of the FFT multiplication. My point here is that the Karatsuba method may be better than that Wikipedia article concluded; I suspect it depends on the speed ratio between add and multiply. Also, your point is a good one on possibly being able to carry more significant bits if a 256-bit format is used, thereby making the FFT more efficient. Sounds like a fun project.
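The dependence on the add/multiply speed ratio can be made concrete with a toy operation-count model. This is entirely my own rough model: the ~8n extra adds per Karatsuba level and the costs below are assumptions for illustration, not measurements of any real implementation:

```python
def schoolbook_cost(n, add_cost):
    """n-limb schoolbook multiply: n^2 limb multiplies plus roughly
    n^2 accumulation adds, each add costing add_cost multiplies."""
    return n * n * (1 + add_cost)

def karatsuba_cost(n, add_cost, cutoff=1):
    """Karatsuba recurrence: three half-size products plus roughly
    8n extra adds per level (an assumed constant). n is a power of two."""
    if n <= cutoff:
        return schoolbook_cost(n, add_cost)
    return 3 * karatsuba_cost(n // 2, add_cost, cutoff) + 8 * n * add_cost

def crossover(add_cost):
    """Smallest power-of-two limb count where Karatsuba beats schoolbook."""
    n = 2
    while karatsuba_cost(n, add_cost) >= schoolbook_cost(n, add_cost):
        n *= 2
    return n

for r in (0.1, 0.5, 1.0):
    print(f"add/mul cost ratio {r}: Karatsuba wins from ~{crossover(r)} limbs")
```

In this model the crossover climbs steeply as additions get relatively more expensive, which is exactly the intuition about the add/multiply speed ratio above.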
#2738
Jan 2008
France
2·5²·11 Posts
You can find the various thresholds for the multiplication algorithms in the GMP library implementation, measured on many CPUs, here: https://gmplib.org/devel/thres
Toom22 is Karatsuba.
#2739
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1010100101011₂ Posts
Quote:
"ABI" is, I suppose, application binary interface, specifically the bit size. Are the "meas thres" and "conf thres" values in units larger than bits, such as words equal in size to the abi value, or 32-bit, or what? It looks like meas and conf usually track pretty well, though there are cases of differences of 6, 8, or 18. Meas in sqr-toom2 ranges from 14 to 100. I wonder what the GPU-equivalent crossovers look like.
Last fiddled with by kriesel on 2019-03-09 at 20:41
Similar Threads
| Thread | Thread Starter | Forum | Replies | Last Post |
| Don't DC/LL them with CudaLucas | LaurV | Data | 131 | 2017-05-02 18:41 |
| CUDALucas / cuFFT Performance on CUDA 7 / 7.5 / 8 | Brain | GPU Computing | 13 | 2016-02-19 15:53 |
| CUDALucas: which binary to use? | Karl M Johnson | GPU Computing | 15 | 2015-10-13 04:44 |
| settings for cudaLucas | fairsky | GPU Computing | 11 | 2013-11-03 02:08 |
| Trying to run CUDALucas on Windows 8 CP | Rodrigo | GPU Computing | 12 | 2012-03-07 23:20 |