[QUOTE=aaronhaviland;500309]I seem to recall making some modifications to the memory allocations prior to my first git commit, but I cannot recall what they are.
We have to remember that it checks the available RAM before stage 1, as part of the bounds calculations: ... But this memory is not actually allocated until much later, and the amount could have changed in that time. [/QUOTE]Much later, indeed. Even on fast GPUs, a stage may take days for high exponents. Recalculating right before stage 2 setup seems like it could help.[QUOTE]We have to be very careful not to exceed it (available memory) because therein lies fatal errors, and we do not have control over other applications that may also be using the same memory. One reason I find the code uses less memory than what is available is that it (based on my understanding, at least): [LIST=1][*]Determines the value of nrp based on the available memory and fft size (and for some reason restricts it to 4GiB on Windows. Possibly a 32-bit issue, or something from older CUDA versions?)[/LIST][/QUOTE]It could be left over from old compiler version limitations. I think it is more likely a consequence of using the same code base for 64-bit and 32-bit application builds. Up to the CUDA 7.5 builds, 32-bit builds were possible. I don't think 32-bit builds are needed any more. I'd be interested in other people's thoughts on that. There were some speed advantages in 32-bit builds with older CUDA versions for CUDALucas, but they were not dramatic and perhaps not highly reproducible in benchmarking. |
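To make the "recalculate right before stage 2" idea concrete, here is a minimal sketch of clamping a planned nrp to whatever memory is free at allocation time. It is not CUDAPm1's actual logic: the 8 bytes per FFT element, the 10% reserve, the count of fixed working buffers, and the function names are all assumptions for illustration. In the real program the free-memory figure would come from a fresh call such as `cudaMemGetInfo` at stage 2 setup.

```python
def buffers_that_fit(free_bytes, fft_len, reserve_percent=10, bytes_per_element=8):
    """Count FFT-sized buffers that fit in the memory free right now.

    reserve_percent and bytes_per_element are illustrative assumptions,
    not values taken from CUDAPm1.
    """
    if fft_len <= 0:
        raise ValueError("fft_len must be positive")
    # Integer arithmetic keeps the result exact for large byte counts.
    usable = free_bytes * (100 - reserve_percent) // 100
    return usable // (fft_len * bytes_per_element)


def clamp_nrp(planned_nrp, free_bytes_now, fft_len, fixed_buffers=8):
    """Reduce a planned nrp if memory shrank since the bounds were chosen.

    fixed_buffers (hypothetical) stands in for working buffers that are
    needed regardless of nrp.
    """
    fit = buffers_that_fit(free_bytes_now, fft_len) - fixed_buffers
    return max(0, min(planned_nrp, fit))
```

Calling `clamp_nrp` with the nrp chosen before stage 1 and the free memory measured just before stage 2 either returns the original plan or a smaller value that still fits, which avoids the fatal over-allocation case at the cost of more passes.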
[QUOTE=aaronhaviland;500255]Success compiling with MPIR.
64-bit binary attached
Requires CUDA 10, and a GPU with Compute Capability >= 3.5. Unsure of other requirements, I'm not too familiar with Windows dependencies.
[CODE]Microsoft Windows [Version 10.0.17134.407]
C:\Users\Aaron\Documents\Visual Studio 2017\Projects\CUDAPm1\x64\Release>CUDAPm1.exe 7990427 -b1 986 -b2 124000
CUDAPm1 v0.21
Assuming exponent is trial factored to 63 bits
------- DEVICE 0 -------
name                GeForce RTX 2070
Compatibility       7.5
clockRate (MHz)     1710
memClockRate (MHz)  7001
totalGlobalMem      8589934592
totalConstMem       65536
l2CacheSize         4194304
sharedMemPerBlock   49152
regsPerBlock        65536
warpSize            32
memPitch            2147483647
maxThreadsPerBlock  1024
maxThreadsPerMP     1024
multiProcessorCount 36
maxThreadsDim[3]    1024,1024,64
maxGridSize[3]      2147483647,65535,65535
textureAlignment    512
deviceOverlap       1
No GeForceRTX2070_fft.txt file found. Using default fft lengths.
For optimal fft selection, please run
./CUDAPm1 -cufftbench 1 8192 r
for some small r, 0 < r < 6 e.g.
CUDA reports 6723M of 8192M GPU memory free.
No GeForceRTX2070_threads.txt file found. Running benchmark.
CUDA bench, testing various thread sizes for fft 448K, doing 15 passes.
fft size = 448K, square time = 0.0436 msec, threads 32
fft size = 448K, square time = 0.0449 msec, threads 64
fft size = 448K, square time = 0.0336 msec, threads 128
fft size = 448K, square time = 0.0335 msec, threads 256
fft size = 448K, square time = 0.0356 msec, threads 512
fft size = 448K, square time = 0.0438 msec, threads 1024
Best square time for fft = 448K, time: 0.0335, t = 256
fft size = 448K, ave time = 0.0408 msec, Norm1 threads 32, Norm2 threads 32
fft size = 448K, ave time = 0.0407 msec, Norm1 threads 32, Norm2 threads 64
fft size = 448K, ave time = 0.0408 msec, Norm1 threads 32, Norm2 threads 128
fft size = 448K, ave time = 0.0412 msec, Norm1 threads 32, Norm2 threads 256
fft size = 448K, ave time = 0.0419 msec, Norm1 threads 32, Norm2 threads 512
fft size = 448K, ave time = 0.0433 msec, Norm1 threads 32, Norm2 threads 1024
fft size = 448K, ave time = 0.0402 msec, Norm1 threads 64, Norm2 threads 32
fft size = 448K, ave time = 0.0402 msec, Norm1 threads 64, Norm2 threads 64
fft size = 448K, ave time = 0.0405 msec, Norm1 threads 64, Norm2 threads 128
fft size = 448K, ave time = 0.0406 msec, Norm1 threads 64, Norm2 threads 256
fft size = 448K, ave time = 0.0408 msec, Norm1 threads 64, Norm2 threads 512
fft size = 448K, ave time = 0.0428 msec, Norm1 threads 64, Norm2 threads 1024
fft size = 448K, ave time = 0.0394 msec, Norm1 threads 128, Norm2 threads 32
fft size = 448K, ave time = 0.0394 msec, Norm1 threads 128, Norm2 threads 64
fft size = 448K, ave time = 0.0397 msec, Norm1 threads 128, Norm2 threads 128
fft size = 448K, ave time = 0.0400 msec, Norm1 threads 128, Norm2 threads 256
fft size = 448K, ave time = 0.0411 msec, Norm1 threads 128, Norm2 threads 512
fft size = 448K, ave time = 0.0423 msec, Norm1 threads 128, Norm2 threads 1024
fft size = 448K, ave time = 0.0401 msec, Norm1 threads 256, Norm2 threads 32
fft size = 448K, ave time = 0.0394 msec, Norm1 threads 256, Norm2 threads 64
fft size = 448K, ave time = 0.0395 msec, Norm1 threads 256, Norm2 threads 128
fft size = 448K, ave time = 0.0403 msec, Norm1 threads 256, Norm2 threads 256
fft size = 448K, ave time = 0.0408 msec, Norm1 threads 256, Norm2 threads 512
fft size = 448K, ave time = 0.0423 msec, Norm1 threads 256, Norm2 threads 1024
fft size = 448K, ave time = 0.0417 msec, Norm1 threads 512, Norm2 threads 32
fft size = 448K, ave time = 0.0416 msec, Norm1 threads 512, Norm2 threads 64
fft size = 448K, ave time = 0.0417 msec, Norm1 threads 512, Norm2 threads 128
fft size = 448K, ave time = 0.0424 msec, Norm1 threads 512, Norm2 threads 256
fft size = 448K, ave time = 0.0428 msec, Norm1 threads 512, Norm2 threads 512
fft size = 448K, ave time = 0.0425 msec, Norm1 threads 512, Norm2 threads 1024
Best time for fft = 448K, time: 0.0394, t1 = 128, t2 = 256, t3 = 64
Using threads: norm1 256, mult 128, norm2 128.
Using up to 4119M GPU memory.
Starting stage 1 P-1, M7990427, B1 = 986, B2 = 124000, fft length = 448K
Doing 1452 iterations
M7990427, 0x32318b15f9d83ab6, n = 448K, CUDAPm1 v0.21
Stage 1 complete, estimated total time = 0:01
Starting stage 1 gcd.
M7990427 Stage 1 found no factor (P-1, B1=986, B2=124000, e=0, n=448K CUDAPm1 v0.21)
Starting stage 2.
Using b1 = 986, b2 = 124000, d = 420, e = 4, nrp = 96
Zeros: 4430, Ones: 8530, Pairs: 2981
Processing 1 - 96 of 96 relative primes.
Initializing pass... done. transforms: 1987, err = 0.02539, (0.71 real, 0.3550 ms/tran, ETA NA)
Transforms: 9204 M7990427, 0x456fdf3be182449c, n = 448K, CUDAPm1 v0.21 err = 0.02734 (0:03 real, 0.2873 ms/tran, ETA 0:02)
Transforms: 8928 M7990427, 0x2acd8bf807caa816, n = 448K, CUDAPm1 v0.21 err = 0.02734 (0:02 real, 0.2912 ms/tran, ETA 0:00)
Stage 2 complete, 20119 transforms, estimated total time = 0:05
Starting stage 2 gcd.
M7990427 has a factor: 10509037975912491881 (P-1, B1=986, B2=124000, e=4, n=448K CUDAPm1 v0.21)
C:\Users\Aaron\Documents\Visual Studio 2017\Projects\CUDAPm1\x64\Release>[/CODE][/QUOTE]
Looks like it works here!
W10 (1803) x64
CUDA10.0.130 (driver version 411.70)
GTX1080Ti
[code]
C:\CUDAPm1-CUDA10>CUDAPm1-CUDA10.exe 7990427 -b1 986 -b2 124000
CUDAPm1 v0.21
Assuming exponent is trial factored to 63 bits
Warning: Couldn't find .ini file. Using defaults for non-specified options.
CUDA reports 9312M of 11264M GPU memory free.
No GeForceGTX1080Ti_threads.txt file found. Running benchmark.
CUDA bench, testing various thread sizes for fft 512K, doing 15 passes.
fft size = 512K, square time = 0.0346 msec, threads 32
fft size = 512K, square time = 0.0360 msec, threads 64
fft size = 512K, square time = 0.0362 msec, threads 128
fft size = 512K, square time = 0.0363 msec, threads 256
fft size = 512K, square time = 0.0372 msec, threads 512
fft size = 512K, square time = 0.0379 msec, threads 1024
Best square time for fft = 512K, time: 0.0346, t = 32
fft size = 512K, ave time = 0.0454 msec, Norm1 threads 32, Norm2 threads 32
fft size = 512K, ave time = 0.0452 msec, Norm1 threads 32, Norm2 threads 64
fft size = 512K, ave time = 0.0450 msec, Norm1 threads 32, Norm2 threads 128
fft size = 512K, ave time = 0.0453 msec, Norm1 threads 32, Norm2 threads 256
fft size = 512K, ave time = 0.0452 msec, Norm1 threads 32, Norm2 threads 512
fft size = 512K, ave time = 0.0460 msec, Norm1 threads 32, Norm2 threads 1024
fft size = 512K, ave time = 0.0445 msec, Norm1 threads 64, Norm2 threads 32
fft size = 512K, ave time = 0.0445 msec, Norm1 threads 64, Norm2 threads 64
fft size = 512K, ave time = 0.0449 msec, Norm1 threads 64, Norm2 threads 128
fft size = 512K, ave time = 0.0451 msec, Norm1 threads 64, Norm2 threads 256
fft size = 512K, ave time = 0.0456 msec, Norm1 threads 64, Norm2 threads 512
fft size = 512K, ave time = 0.0465 msec, Norm1 threads 64, Norm2 threads 1024
fft size = 512K, ave time = 0.0452 msec, Norm1 threads 128, Norm2 threads 32
fft size = 512K, ave time = 0.0452 msec, Norm1 threads 128, Norm2 threads 64
fft size = 512K, ave time = 0.0453 msec, Norm1 threads 128, Norm2 threads 128
fft size = 512K, ave time = 0.0453 msec, Norm1 threads 128, Norm2 threads 256
fft size = 512K, ave time = 0.0461 msec, Norm1 threads 128, Norm2 threads 512
fft size = 512K, ave time = 0.0475 msec, Norm1 threads 128, Norm2 threads 1024
fft size = 512K, ave time = 0.0455 msec, Norm1 threads 256, Norm2 threads 32
fft size = 512K, ave time = 0.0455 msec, Norm1 threads 256, Norm2 threads 64
fft size = 512K, ave time = 0.0456 msec, Norm1 threads 256, Norm2 threads 128
fft size = 512K, ave time = 0.0456 msec, Norm1 threads 256, Norm2 threads 256
fft size = 512K, ave time = 0.0470 msec, Norm1 threads 256, Norm2 threads 512
fft size = 512K, ave time = 0.0477 msec, Norm1 threads 256, Norm2 threads 1024
fft size = 512K, ave time = 0.0459 msec, Norm1 threads 512, Norm2 threads 32
fft size = 512K, ave time = 0.0462 msec, Norm1 threads 512, Norm2 threads 64
fft size = 512K, ave time = 0.0463 msec, Norm1 threads 512, Norm2 threads 128
fft size = 512K, ave time = 0.0464 msec, Norm1 threads 512, Norm2 threads 256
fft size = 512K, ave time = 0.0474 msec, Norm1 threads 512, Norm2 threads 512
fft size = 512K, ave time = 0.0475 msec, Norm1 threads 512, Norm2 threads 1024
Best time for fft = 512K, time: 0.0445, t1 = 64, t2 = 32, t3 = 32
Using threads: norm1 256, mult 128, norm2 128.
Using up to 4124M GPU memory.
Starting stage 1 P-1, M7990427, B1 = 986, B2 = 124000, fft length = 512K
Doing 1452 iterations
M7990427, 0x32318b15f9d83ab6, n = 512K, CUDAPm1 v0.21
Stage 1 complete, estimated total time = 0:01
Starting stage 1 gcd.
M7990427 Stage 1 found no factor (P-1, B1=986, B2=124000, e=0, n=512K CUDAPm1 v0.21)
Starting stage 2.
Using b1 = 986, b2 = 124000, d = 420, e = 4, nrp = 96
Zeros: 4430, Ones: 8530, Pairs: 2981
Processing 1 - 96 of 96 relative primes.
Initializing pass... done. transforms: 1987, err = 0.00134, (0.53 real, 0.2650 ms/tran, ETA NA)
Transforms: 18132 M7990427, 0x2acd8bf807caa816, n = 512K, CUDAPm1 v0.21 err = 0.00146 (0:05 real, 0.3128 ms/tran, ETA 0:00)
Stage 2 complete, 20119 transforms, estimated total time = 0:05
Starting stage 2 gcd.
M7990427 has a factor: 10509037975912491881 (P-1, B1=986, B2=124000, e=4, n=512K CUDAPm1 v0.21)
[/code]Any more test cases that I should run? |
[QUOTE=VictordeHolland;500342]Looks like it works here!
...Any more test cases that I should run?[/QUOTE] You could try some run-of-the-mill manual P-1 assignments. Or get adventurous and try some larger ones. Note, run time can be quite long, and some might fail to complete. If you hit a case that fails, please share the details. If you want some verification candidates, here's an excerpt from the draft rewrite of the CUDAPm1 readme file.
[CODE]
Run CUDAPm1 on some exponents with known factors that should be found, and see whether you find them.
Easiest way is to select from the following list, exponents at or near the size you plan to run, and put them in the worktodo file.
The bounds necessary to find factors vary by exponent. CUDAPm1's automatic parameter selection will be enough to find most but not all.

Exponent   Min B1   Min B2      fft length  notes
4444091    7        2,557       256k
50001781   94,709   4,067,587   2688k
51558151   5,953    2,034,041   2880k
54447193   1,181    682,009     3072k
58610467   70,843   694,201     3200k
61012769   10,273   1,572,097   3360k
81229789   6,709    11,282,221  4704K
100000081  1,289    7,554,653   5600K
120002191  1,563    3,109,391   7168K
150000713  15,131   2,294,519   8640K
200000183  953      1,138,061   11200K
200001187  204,983  207,821     11200K
200003173  4,651    229,813     11200K
249500221  4        2.58951e+9  14336K      big bounds, much memory & time
249500501  307      167,381     14336K
290001377  2,551    34,354,769  16384K      takes days

PFactor=1,2,4444091,-1,70,2
PFactor=1,2,50001781,-1,74,2
PFactor=1,2,51558151,-1,74,2
PFactor=1,2,54447193,-1,74,2
PFactor=1,2,58610467,-1,74,2
PFactor=1,2,61012769,-1,74,2
PFactor=1,2,81229789,-1,75,2
PFactor=1,2,100000081,-1,76,2
Pfactor=1,2,120002191,-1,75,2
Pfactor=1,2,150000713,-1,75,2
Pfactor=1,2,200001187,-1,75,2
PFactor=1,2,249500501,-1,75,2
PFactor=1,2,290001377,-1,75,2

Exponent   Factor (may be composite) = Prime factors
4444091    1809798096458971047321927127 = 8888183 x 319974553 x 636358278473
50001781   4392938042637898431087689 = 3 x 182851 x 8008229
51558151   755277543419074012358186647
54447193   17261184235049628259201
58610467   69057033982979789260999
61012769   2018028590362685212673
81229789   355078783674010195200030259699844128700274440385857 = 488121804389130135740149369 x 727438890213848757119753
100000081  3441393510714285782119
120002191  100835659918276033441
150000713  1447762785107694357647
200000183  849003842550205126847
200001187  3050161780881530584679
200003173  14652109287435525414352647642348599 = 4320552944485007 x 3391257895852957657
249500221  5168661482381201657
249500501  3571511465549660434777661921959439 = 11607130072256471 x 307699788260867209
290001377  10645243382592701071676802590718709559 = 1436135993277492383 x 7412420155488583273
           or 90944796249039267769901814723364335322839708522092302667497 = * 170370076089478747961 * 371696926552024067119 * 1436135993277492383

Feel free to pick your own. Evaluate them at their equivalent of http://www.mersenne.ca/exponent/249500501[/CODE] |
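Any factor reported by a run, or listed in the readme excerpt above, can be checked independently of CUDAPm1: f divides 2^p - 1 exactly when 2^p is congruent to 1 modulo f. The sketch below uses Python's built-in three-argument `pow` for that, plus the standard necessary conditions on factors of Mersenne numbers with odd prime exponent (f ≡ 1 mod 2p and f ≡ ±1 mod 8). The function names are made up for illustration and are not part of any GIMPS tool.

```python
def divides_mersenne(p, f):
    """True if f divides 2**p - 1, i.e. 2**p is congruent to 1 modulo f."""
    return f > 1 and pow(2, p, f) == 1


def has_mersenne_factor_form(p, f):
    """Quick necessary (not sufficient) check for odd prime p.

    Every factor of 2**p - 1, prime or composite, is 1 mod 2p
    and is 1 or 7 mod 8.
    """
    return f % (2 * p) == 1 and f % 8 in (1, 7)


if __name__ == "__main__":
    # Smallest listed prime factor of M4444091 from the table above.
    print(divides_mersenne(4444091, 8888183))
```

The same call works for composite factors such as the products in the table, since a product of divisors that are pairwise coprime still divides the Mersenne number.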
[QUOTE=kriesel;500347]If you want some verification candidates, here's an excerpt from the draft rewrite of the CUDAPm1 readme file.[/QUOTE]
This is a great list. I want to include some more "quick" candidates as tests as part of the build process, beyond what I already have. (And I want to find out if Visual Studio can run tests post-compile... right now I just have Makefile rules for that on *nix) |
I ran the ones that take an hour at the most:
[code]
4,444,091    7        2,557
50,001,781   94,709   4,067,587
51,558,151   5,953    2,034,041
54,447,193   1,181    682,009
58,610,467   70,843   694,201
61,012,769   10,273   1,572,097
81,229,789   6,709    11,282,221
100,000,081  1,289    7,554,653
120,002,191  1,563    3,109,391
150,000,713  15,131   2,294,519
200,000,183  953      1,138,061
200,001,187  204,983  207,821
200,003,173  4,651    229,813

Pminus1=1,2,4444091,-1,7,2557
Pminus1=1,2,50001781,-1,94709,4067587
Pminus1=1,2,51558151,-1,5953,2034041
Pminus1=1,2,54447193,-1,1181,682009
Pminus1=1,2,58610467,-1,70843,694201
Pminus1=1,2,61012769,-1,10273,1572097
Pminus1=1,2,81229789,-1,6709,11282221
Pminus1=1,2,100000081,-1,1289,7554653
Pminus1=1,2,120002191,-1,1563,3109391
Pminus1=1,2,150000713,-1,15131,2294519
Pminus1=1,2,200000183,-1,953,1138061
Pminus1=1,2,200001187,-1,204983,207821
Pminus1=1,2,200003173,-1,4651,229813[/code]and they completed successfully:
[code]
M4444091 has a factor: 2843992382407199 (P-1, B1=7, B2=7, e=0, n=256K CUDAPm1 v0.21)
M50001781 has a factor: 4392938042637898431087689 (P-1, B1=94709, B2=4067587, e=12, n=2816K CUDAPm1 v0.21)
M51558151 has a factor: 755277543419074012358186647 (P-1, B1=5953, B2=2034041, e=12, n=2816K CUDAPm1 v0.21)
M54447193 has a factor: 17261184235049628259201 (P-1, B1=1181, B2=682009, e=12, n=3200K CUDAPm1 v0.21)
M58610467 has a factor: 69057033982979789260999 (P-1, B1=70843, B2=694201, e=12, n=3200K CUDAPm1 v0.21)
M61012769 has a factor: 2018028590362685212673 (P-1, B1=10273, B2=1572097, e=12, n=3456K CUDAPm1 v0.21)
M81229789 has a factor: 727438890213848757119753 (P-1, B1=6709, B2=11282221, e=12, n=4480K CUDAPm1 v0.21)
M100000081 has a factor: 3441393510714285782119 (P-1, B1=1289, B2=7554653, e=12, n=5760K CUDAPm1 v0.21)
M120002191 has a factor: 100835659918276033441 (P-1, B1=1563, B2=3109391, e=12, n=6912K CUDAPm1 v0.21)
M150000713 has a factor: 1447762785107694357647 (P-1, B1=15131, B2=2294519, e=12, n=8640K CUDAPm1 v0.21)
M200000183 has a factor: 849003842550205126847 (P-1, B1=953, B2=1138061, e=12, n=11200K CUDAPm1 v0.21)
M200001187 has a factor: 3050161780881530584679 (P-1, B1=204983, B2=207821, e=12, n=11200K CUDAPm1 v0.21)
M200003173 has a factor: 14652109287435525414352647642348599 (P-1, B1=4651, B2=229813, e=12, n=11200K CUDAPm1 v0.21)
[/code] |
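For anyone repeating this with a different selection of exponents, the conversion from bounds rows to worktodo entries shown above is mechanical. A small sketch follows, assuming only the `Pminus1=1,2,<exponent>,-1,<B1>,<B2>` layout visible in this post; the helper names are hypothetical and rows with non-integer bounds (such as `2.58951e+9`) are not handled.

```python
def parse_bounds_row(row):
    """Split 'exponent B1 B2' (thousands separators allowed) into three ints."""
    exponent, b1, b2 = (int(field.replace(",", "")) for field in row.split())
    return exponent, b1, b2


def pminus1_line(exponent, b1, b2):
    """Format one worktodo entry in the layout used in the post above."""
    return f"Pminus1=1,2,{exponent},-1,{b1},{b2}"


if __name__ == "__main__":
    rows = ["4,444,091 7 2,557", "50,001,781 94,709 4,067,587"]
    for row in rows:
        print(pminus1_line(*parse_bounds_row(row)))
```

Redirecting the output to the worktodo file gives one entry per row, in the same order as the table.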
[QUOTE=aaronhaviland;500367]I want to include some more "quick" candidates as tests as part of the build process, beyond what I already have. [/QUOTE]
Aaaand on that note, I've added some built-in self-tests into the code itself, instead of relying on the build process.
[CODE]-selftest   Run a quick selftest (ETA: 0:16)
-selftest2  Run a longer selftest (ETA: 17:22)[/CODE]So far I have 5 "quick" self tests (< 10s each on my hardware), and 2 "slow" self tests (~ 10m each on my hardware). Checkpoints, worktodo.txt, and results.txt I/O are completely disabled for these tests. |
[QUOTE=kriesel;500133]Yes. See for example [URL]https://www.mersenneforum.org/showpost.php?p=456324&postcount=2591[/URL] where 1024 squaring threads is bad, gives timings half what others do, in CUDALucas. There are also cases where 32 threads is bad. Compute capability 2.0 I think. CUDAPm1 issue #16.
There are also cases where certain fft lengths give bad results. As I recall these were found for old CUDA levels.[/QUOTE]
Check for anomalous thread timings: Commit 36ceb29
Check for anomalous fft timings: Commit 538118a
[QUOTE]CUDALucas was modified to trap for a select few bad-residue cases: 0x02, 0x00, and 0xfffffffffffffffd. The CUDALucas v2.06beta traps for its known bad residues. Since CUDAPm1 was derived from CUDALucas, years before, it has some of the same issues as well as some of its own. CUDAPm1's list of bad residues is longer.[/QUOTE]Added check for this. Commit a2c7f50 |
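As a rough illustration of the two kinds of checks discussed above, here is a sketch of a bad-residue trap and an "implausibly fast" benchmark filter. It does not reproduce the commits named in this post. The residue set contains only the three values quoted for CUDALucas (CUDAPm1's longer list is not given in this thread), and the 0.6 ratio against the median is an assumed threshold chosen so that a timing of about half the others would be flagged.

```python
from statistics import median

# Only the three residues named in the quoted text; not CUDAPm1's full list.
KNOWN_BAD_RESIDUES = {0x0, 0x2, 0xFFFFFFFFFFFFFFFD}


def is_known_bad_residue(residue):
    """True if a 64-bit residue matches one of the known bad values."""
    return residue in KNOWN_BAD_RESIDUES


def anomalously_fast(timings, ratio=0.6):
    """Return the keys whose timing is below ratio * median of all timings.

    timings maps a thread count (or fft length) to a time in msec.
    The ratio is an illustrative assumption, not a value from the commits.
    """
    typical = median(timings.values())
    return sorted(key for key, t in timings.items() if t < ratio * typical)
```

A benchmark would then discard any thread count returned by `anomalously_fast` before picking the minimum, so a broken configuration cannot win by finishing too quickly.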
Releasing all the above as v0.22
(Binaries uploaded: [URL]https://github.com/ah42/cuda-p1/releases/tag/0.22[/URL])
[LIST]
[*]First proper release since forking
[*]Originally based on code from [URL]https://sourceforge.net/projects/cudapm1/[/URL] (r52)
[*]Compute Dickman's function live, instead of using incorrect precomputed values
[*]Fix memory leaks in stage2
[*]Fix fencepost error causing invalid results
[*]Fix potential overflows
[*]Use smaller data types when possible
[*]Reduce kernel branching
[*]Update build for CUDA 10.0 / Compute Capability 7.5
[*]Split kernel code into individual files
[*]Replace GMP with MPIR for easier cross-platform builds.
[*]Automatically run threadbench if required
[*]Add VS2017 and eclipse build files.
[*]Implement internal self-test system
[*]Allow full memory allocation on 64-bit windows builds
[*]Contributions from kriesel:[LIST]
[*]Add test for known invalid residues
[*]Comment & code formatting/cleanup
[*]Add test for abnormally low threadbench timings
[*]Add test for abnormally low fftbench timings
[/LIST]
[/LIST] |
Now, that is a very good job, after such a long time, sir! Hats off and a bow. :bow:
We will give it a spin tonight when we reach home. |
Wow, great job!
|
[QUOTE=aaronhaviland;500475]Releasing all the above as v0.22
(Binaries uploaded: [URL]https://github.com/ah42/cuda-p1/releases/tag/0.22[/URL])
[LIST]
[*]First proper release since forking
[*]Originally based on code from [URL]https://sourceforge.net/projects/cudapm1/[/URL] (r52)
[*]Compute Dickman's function live, instead of using incorrect precomputed values
[*]Fix memory leaks in stage2
[*]Fix fencepost error causing invalid results
[*]Fix potential overflows
[*]Use smaller data types when possible
[*]Reduce kernel branching
[*]Update build for CUDA 10.0 / Compute Capability 7.5
[*]Split kernel code into individual files
[*]Replace GMP with MPIR for easier cross-platform builds.
[*]Automatically run threadbench if required
[*]Add VS2017 and eclipse build files.
[*]Implement internal self-test system
[*]Allow full memory allocation on 64-bit windows builds
[*]Contributions from kriesel:[LIST]
[*]Add test for known invalid residues
[*]Comment & code formatting/cleanup
[*]Add test for abnormally low threadbench timings
[*]Add test for abnormally low fftbench timings
[/LIST]
[/LIST][/QUOTE] Outstanding! I've updated my reference material to point to this (Aaron's post), and emailed James Heinrich with a link for updating his mirror. What's next, Aaron? Logging extensions, date/time stamp addition, and removal of CUDAPm1 v0.2x from every iteration or transforms progress record? What would other users like to see, assuming Aaron is open to suggestions? I'll test this in my production running and for changes in limits, after finishing out some v0.20 limits testing that is still ongoing. |