[QUOTE=kriesel;510005]Tservo, ATH, could you run the problem executables from the beginning of an exponent with
[CODE]ReportIterations=1[/CODE] set in the ini file? Does every iteration from the very start consist of zero res64, or does it take a while to start?[/QUOTE] My results were the same as ATH's. Also, as I mentioned, I looked at the code that generated all the warnings in the 1st attached compile of ATH's just above here, and they all looked OK. My warnings were the same, of course.
[QUOTE=ATH;509875]Instead of testing M3021377 you should probably test some fixed number of iterations of an exponent with the same size, which you are doing to use CUDALucas for. Either 83M-85M exponent for LL or 46M-48M for LL-D to make sure that version is also fastest at that particular FFT size.[/QUOTE]
I've run some benchmarks on M57885161 (FFT 3136K) with various CUDALucas v2.06beta 64-bit CUDA binaries on a GTX 1060 6GB. These are the times for 190000 iterations. Nvidia driver version 419.17, Windows 7. It might be of some use to somebody, but as always with this type of thing, your mileage will vary.
[CODE]
CUDA Ver   Time
========   =======
 4.2       16m 19s
 5.0       15m 46s
 5.5       16m 06s
 6.0       15m 46s
 6.5       15m 48s
 8.0       15m 51s
10.0       20m 15s
10.1       18m 28s
[/CODE]
I can't confirm that there is a big difference between 10.0 and 10.1 performance on an RTX 2060. Of course I can't run any of the earlier versions, since they have no support for CC 7.5. But maybe they did some Pascal-only optimizations in CUDA 10.1. Or, judging by the performance drop from 8.0 to 10.0, maybe they just rolled back some of the optimizations made back then. :smile:
Driver 417.71, CUDA 10.0:
[CODE]
Device GeForce RTX 2060
Compatibility 7.5
clockRate (MHz) 1830
memClockRate (MHz) 7001

 fft    max exp    ms/iter
1024   19535569     1.0761
1152   21921901     1.2840
1296   24599717     1.5576
1323   25101101     1.7955
1350   25602229     1.8542
1372   26010389     1.8911
1400   26529691     1.9210
1458   27604673     1.9611
1512   28604657     1.9716
1600   30232693     2.0590
1728   32597297     2.0927
1792   33778141     2.1136
2048   38492887     2.1167
2304   43194913     2.6200
2560   47885689     3.1887
2592   48471289     3.2526
2744   51250889     3.6847
2880   53735041     3.8500
2916   54392209     3.8560
3200   59570449     3.9097
3456   64229677     3.9697
4096   75846319     4.2404
4608   85111207     5.4321
5184   95507747     6.3966
5400   99399967     7.4287
5760  105879517     7.4607
5832  107174381     7.7607
6400  117377567     8.0064
6912  126558077     8.2118
7168  131142761     9.5563
8192  149447533    10.1720
[/CODE]
Driver 419.17, CUDA 10.1:
[CODE]
Device GeForce RTX 2060
Compatibility 7.5
clockRate (MHz) 1830
memClockRate (MHz) 7001

 fft    max exp    ms/iter
1024   19535569     1.2139
1152   21921901     1.2783
1296   24599717     1.5568
1323   25101101     1.7832
1372   26010389     1.8362
1458   27604673     1.9038
1512   28604657     1.9646
1600   30232693     2.0505
1728   32597297     2.0808
1792   33778141     2.1063
2048   38492887     2.1119
2304   43194913     2.6102
2560   47885689     3.1769
2592   48471289     3.2404
2744   51250889     3.6480
2916   54392209     3.7917
3200   59570449     3.8857
3456   64229677     3.9463
4096   75846319     4.2507
4608   85111207     5.4388
5184   95507747     6.3674
5400   99399967     7.4289
5760  105879517     7.4390
5832  107174381     7.5620
6400  117377567     8.0081
6912  126558077     8.1465
7168  131142761     9.4851
8192  149447533    10.0911
[/CODE]
Someone who has access should remove CUDALucas 2.05 from the site completely, to avoid people wasting time on the 2.05 bug.
It is even listed as the "Latest version" here: [url]https://sourceforge.net/projects/cudalucas/files/[/url] |
[QUOTE=mognuts;509874]Cuda 10.1 is faster than Cuda 10.0, but 5.0 is the fastest binary. I test using the known prime, M3021377.[/QUOTE]
CUDA 5.0 and 5.5 are faster than all the others on my old Titan Black. But 5.5 is slightly faster than 5.0 at 85M, and 5.0 is slightly faster than 5.5 at 48M (LL-D). Did you do -cufftbench and -threadbench for each version? CUDA 5.0 and 5.5 use slightly different thread counts, and some of the other versions were fastest at a completely different FFT size, like 4608K for 85M.
[CODE]
CUDA 5.0
Using threads: square 64, splice 128.
Starting M85000007 fft length = 5184K

| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Mar 05 03:11:44 | M85000007 50000 0xf8f23b3c461eda23 | 5184K 0.01904 3.2767 163.83s | 3:05:19:19 0.05% |
| Mar 05 03:14:29 | M85000007 100000 0x51c6acdda4a0afd2 | 5184K 0.01904 3.2949 164.74s | 3:05:29:27 0.11% |
| Mar 05 03:17:14 | M85000007 150000 0xac94578b70aaa8c4 | 5184K 0.01904 3.2941 164.70s | 3:05:30:38 0.17% |
| Mar 05 03:19:59 | M85000007 200000 0xee2d0bcbd9606021 | 5184K 0.01904 3.2945 164.72s | 3:05:29:59 0.23% |

CUDA 5.5
Using threads: square 64, splice 512.
Starting M85000007 fft length = 5184K

| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Mar 05 03:24:06 | M85000007 50000 0xf8f23b3c461eda23 | 5184K 0.01855 3.2741 163.70s | 3:05:15:41 0.05% |
| Mar 05 03:26:51 | M85000007 100000 0x51c6acdda4a0afd2 | 5184K 0.01880 3.2832 164.16s | 3:05:19:22 0.11% |
| Mar 05 03:29:35 | M85000007 150000 0xac94578b70aaa8c4 | 5184K 0.01880 3.2870 164.35s | 3:05:20:34 0.17% |
| Mar 05 03:32:19 | M85000007 200000 0xee2d0bcbd9606021 | 5184K 0.01953 3.2882 164.41s | 3:05:20:14 0.23% |

CUDA 5.0
Using threads: square 64, splice 256.
Starting M48000013 fft length = 2592K

| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Mar 05 03:39:55 | M48000013 50000 0x75639b37776dd731 | 2592K 0.23047 1.5739 78.69s | 20:57:48 0.10% |
| Mar 05 03:41:14 | M48000013 100000 0x43895b9120fd80b4 | 2592K 0.23438 1.5790 78.95s | 20:58:33 0.20% |
| Mar 05 03:42:33 | M48000013 150000 0xa14c61edcdb4c50a | 2592K 0.23438 1.5788 78.94s | 20:57:52 0.31% |
| Mar 05 03:43:52 | M48000013 200000 0xd4deb59cd8ee0cfe | 2592K 0.24219 1.5788 78.93s | 20:56:52 0.41% |

CUDA 5.5
Using threads: square 64, splice 128.
Starting M48000013 fft length = 2592K

| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Mar 05 03:34:01 | M48000013 50000 0x75639b37776dd731 | 2592K 0.22266 1.5916 79.58s | 21:11:59 0.10% |
| Mar 05 03:35:20 | M48000013 100000 0x43895b9120fd80b4 | 2592K 0.24658 1.5958 79.79s | 21:12:21 0.20% |
| Mar 05 03:36:40 | M48000013 150000 0xa14c61edcdb4c50a | 2592K 0.24658 1.5957 79.78s | 21:11:32 0.31% |
| Mar 05 03:38:00 | M48000013 200000 0xd4deb59cd8ee0cfe | 2592K 0.24805 1.5957 79.78s | 21:10:28 0.41% |
[/CODE]
[QUOTE=Prime95;497517]If I understand the 2080 architecture correctly, LL test speed could be improved (perhaps greatly) by going to 128-bit fixed point reals represented as four 32-bit integers. I investigated this somewhat 4 years ago when 32-bit adds had huge throughput advantage but 32-bit multiplies had no advantage compared to DP throughput. IIUC, in the 2080 both 32-bit adds and 32-bit multiplies have a huge throughput advantage compared to DP throughput.
The basic idea is that adding two 128-bit fixed point reals requires four 32-bit adds (with carries) plus some overhead for handling signs. Multiplying two 128-bit fixed point reals requires sixteen 32-bit multiplies, plus some adds, and some overhead for handling signs. Each FFT butterfly adds and subtracts FFT data values which increases the maximum FFT data value by one bit. Thus, the fixed point reals must be shifted one bit prior to a butterfly (i.e. move the implied decimal point). This adds some additional overhead in implementing a fixed-point real FFT. My research indicated we could store as many as 51 bits of input data in each 128-bit fixed point real. This (51/128) is much more memory efficient than current DP FFTs which store about 17 bits of data in each 64-bit double. Is there any flaw in my understanding of the 2080 architecture? Does anyone have time to explore the feasibility of this approach?[/QUOTE] George,

This sounds quite interesting and, as a matter of fact, I was thinking along these same lines ever since I heard that Volta had separated integers from floats (it may have started in Pascal). I was examining the math routines in mmff for inspiration. A few questions: When you speak of fixed-point format, are you referring to the IEEE 754 standard used in CUDA to represent reals? Instead of the 16 multiplies when doing a 4x4 32-bit multiply, wouldn't Karatsuba be much better, reducing the multiplies to adds? In fact, 2 levels of Karatsuba may be used, for 3 separate Karatsubas. It should be unrolled to be as linear as possible instead of using recursion or subroutine calls, of course.
[QUOTE=tServo;510436]George,
This sounds quite interesting and, as a matter of fact, I was thinking along these same lines ever since I heard that Volta had separated integers from floats (it may have started in Pascal). I was examining the math routines in mmff for inspiration. A few questions: When you speak of fixed-point format, are you referring to the IEEE 754 standard used in CUDA to represent reals? Instead of the 16 multiplies when doing a 4x4 32-bit multiply, wouldn't Karatsuba be much better, reducing the multiplies to adds? In fact, 2 levels of Karatsuba may be used, for 3 separate Karatsubas. It should be unrolled to be as linear as possible instead of using recursion or subroutine calls, of course.[/QUOTE] I wondered about that too. "As a rule of thumb, Karatsuba's method is usually faster when the multiplicands are longer than 320–640 bits." [URL]https://en.wikipedia.org/wiki/Karatsuba_algorithm[/URL] That's for the multiplication itself. It doesn't take into account the benefit to the higher-level operation of a shorter FFT length atop it, the fact that memory bandwidth is frequently the limiting factor, or the possibility of fitting a given primality test or P-1 factoring computation of ~50% greater equivalent size into the same fixed-size GPU memory, which would allow P-1 stage 2 to run to higher bounds. 51/128 = 1.5 x 17 bits/64-bit word. What if we go further, to 256-bit; then how many bits might be usable per word?
[QUOTE=tServo;510436]
When you speak of fixed-point format, are you referring to the IEEE 754 standard used in CUDA to represent reals?[/QUOTE] While this may not be the best approach, what I played with was a mantissa-only float stored as 4 integers. Each FFT element has the same (not stored) exponent. This representation requires 16 32x32 multiplies to do a 128x128 multiply yielding a 128-bit result. Here is a link to what I was working on: [url]https://www.dropbox.com/s/g46bkk3yvh0sqo1/mycudalucas.zip?dl=0[/url] This is NOT working FFT code; it was an effort to mimic the work a fixed-point FFT would require. There are many backups showing how the code evolved. This was done 4 years ago, so I probably don't remember much of it.
[QUOTE=kriesel;510449]I wondered about that too. "As a rule of thumb, Karatsuba's method is usually faster when the multiplicands are longer than 320–640 bits." [URL]https://en.wikipedia.org/wiki/Karatsuba_algorithm[/URL] That's for the multiplication itself. It doesn't take into account the benefit to the higher-level operation of a shorter FFT length atop it, the fact that memory bandwidth is frequently the limiting factor, or the possibility of fitting a given primality test or P-1 factoring computation of ~50% greater equivalent size into the same fixed-size GPU memory, which would allow P-1 stage 2 to run to higher bounds.
51/128 = 1.5 x 17 bits/64-bit word. What if we go further, to 256-bit; then how many bits might be usable per word?[/QUOTE] I seem to recall doing some benchmarks in the 1990s on when to switch from grade-school multiplication to Karatsuba, and from Karatsuba to FFTs, in a bignum math package I had found on Richard Crandall's website (Perfectly Scientific, or something like that). I remember discovering that the grade-school-to-Karatsuba cutover point was set waaaay too high. I can't remember about the other cutoff point, except that I was impressed with the speed of the FFT multiplication. My point here is that the Karatsuba method may be better than that Wikipedia article concluded. I suspect it depends on the speed ratio between add and multiply. Also, your point is a good one about possibly being able to carry more significant bits if a 256-bit format is used, thereby making the FFT more efficient. Sounds like a fun project.
You can find various thresholds for the multiplication algorithms in the GMP library implementation on many CPUs here: [url]https://gmplib.org/devel/thres[/url]
Toom22 is Karatsuba.
[QUOTE=ldesnogu;510493]You can find various thresholds for the multiplication algorithms in the GMP library implementation on many CPUs here: [URL]https://gmplib.org/devel/thres[/URL]
Toom22 is Karatsuba.[/QUOTE] Thank you for the link. Those pages could use a little more explanation, in my opinion. "ABI" is, I suppose, application binary interface, specifically the bit size. Are the "meas thres" and "conf thres" in units larger than bits, such as words equal in size to the ABI value, or 32-bit, or what? It looks like meas and conf usually track pretty well, though there are cases of 6, 8, or 18 difference. Meas in sqr-toom2 ranges from 14 to 100. I wonder what the GPU equivalent crossovers look like. [QUOTE=tServo]Sounds like a fun project.[/QUOTE]Go for it! [QUOTE=prime95]Here is a link to what I was working on: [URL="https://www.dropbox.com/s/g46bkk3yvh0sqo1/mycudalucas.zip?dl=0"]https://www.dropbox.com/s/g46bkk3yvh...lucas.zip?dl=0[/URL][/QUOTE]Is that based on any recent math reference?