mersenneforum.org  

Old 2018-05-28, 21:14   #1
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5820₁₀ Posts
CUDAPm1-specific reference material

This thread is intended to hold only reference material specifically for CUDAPm1
(Suggestions are welcome. Discussion posts in this thread are not encouraged. Please use the reference material discussion thread http://www.mersenneforum.org/showthread.php?t=23383. Off-topic posts may be moved or removed, to keep the reference threads clean, tidy, and useful.)


Beginning users of CUDAPm1 may want to go directly to post 4 of this thread for a how-to / getting started guide. But first, consider whether gpuowl would be usable on the intended gpu. If so, use gpuowl instead, for its superior error detection, reliability, and performance.


Table of contents
  1. This post
  2. Run time scaling versus exponent for the NVIDIA GTX480 of CUDAPm1 v0.20 http://www.mersenneforum.org/showpos...27&postcount=2
  3. CUDAPm1 bug and wish list (with some workarounds) http://www.mersenneforum.org/showpos...34&postcount=3
  4. CUDAPm1 getting started guide http://www.mersenneforum.org/showpos...51&postcount=4
  5. Is this instance of P-1 software and hardware working correctly? How can we tell? http://www.mersenneforum.org/showpos...54&postcount=5
  6. An example of statistics or occasional error in action http://www.mersenneforum.org/showpos...80&postcount=6
  7. Limits on exponent versus gpu model or gpu memory or program behavior (CUDAPm1 v0.20) http://www.mersenneforum.org/showpos...65&postcount=7
  8. CUDAPm1 v0.20 Limits and anomalies versus gpu model https://www.mersenneforum.org/showpo...72&postcount=8
  9. CUDAPm1 v0.20 Limits and anomalies versus gpu model (continued) https://www.mersenneforum.org/showpo...73&postcount=9
  10. RTX20xx and CUDAPm1 v0.21 https://www.mersenneforum.org/showpo...9&postcount=10
  11. CUDAPm1 v0.22 release https://www.mersenneforum.org/showpo...5&postcount=11
  12. Example run https://www.mersenneforum.org/showpo...5&postcount=12
  13. CUDAPm1 v0.20 -h help output https://www.mersenneforum.org/showpo...4&postcount=13
  14. etc tbd

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2021-01-24 at 16:51 Reason: added advice to use gpuowl instead if possible
Old 2018-05-28, 21:17   #2
kriesel
 
CUDAPm1 run time scaling, and comparison to CUDALucas

Timings for an assortment of exponents, along with other considerations such as memory requirements, are tabulated and charted for reference for the NVIDIA GTX480. Note that only one trial per combination was tabulated, so there is no measure or indication of run-to-run reproducibility for the same inputs. Where issues were encountered, they are briefly identified. See the pdf attachment.

CUDALucas run times are also shown here, for consideration of how the CUDAPm1 run time scales with exponent, since the point of running CUDAPm1 is to maximize savings in total Mersenne prime hunt search time by efficiently finding factors that eliminate the need for a primality test or two or more.

This is a somewhat different way of looking at test speed than the GPU Lucas-Lehmer or trial factoring performance benchmarks at http://www.mersenne.ca/cudalucas.php etc. There is no P-1 factoring performance benchmarking data posted at http://www.mersenne.ca to my knowledge.

Data now span 10M to 700M


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
File Type: pdf cudapm1-runtime scaling.pdf (46.4 KB, 303 views)

Last fiddled with by kriesel on 2019-11-18 at 14:15
Old 2018-05-29, 03:01   #3
kriesel
 
CUDAPm1 bug and wish list

Here is the most current posted version of the list I am maintaining for CUDAPM1. As always, this is in appreciation of the authors' past contributions. Users may want to browse this for workarounds included in some of the descriptions, and for an awareness of some known pitfalls. Please respond with any comments, additions or suggestions you may have, preferably by PM to kriesel or in the separate discussion thread here.


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
File Type: pdf cudapm1 bug and wish list 2019-01-11.pdf (102.0 KB, 240 views)

Last fiddled with by kriesel on 2019-11-18 at 14:15 Reason: updated attachment for v0.22
Old 2018-06-03, 14:03   #4
kriesel
 
CUDAPm1 getting started guide

CUDAPm1 makes heavy use of gpu ram and is unforgiving of error. Thorough memory testing is recommended, using CUDALucas -memtest option to test as much of the memory as possible, or some other gpu ram test utility. Logging the memory test results to disk for later reference is recommended. Retesting at annual or semiannual intervals is recommended.

Read the readme file. Note that most of the following is based on experience with CUDAPm1 v0.20 on Windows and a variety of gpus ranging from the 1GB Quadro 2000 to the 11GB GTX 1080 Ti.
Then begin by confirming that the gpu you plan to run CUDAPm1 on has reliable memory.
Run CUDALucas -memtest on it, specifying testing as many 25MB blocks as you can get to fit and run successfully. A place to start is blocks = (GPU RAM in GB) * 1024 / 25 - 4. Specify the number of passes: 1 for a slow gpu, more for a fast gpu. Redirecting the output to a file allows review at any time later. For example, for an 8GB GTX1080, 2 passes, something like
Code:
cmd /k cudalucas -memtest 324 2 >>gtx1080-ramtest.txt
If you specify too many blocks, it will complain and reduce the number and make an attempt. Sometimes that fails; if so, reduce the amount some more and retry until it will run to completion. Be very strict; CUDAPm1 requires a lot of memory and extreme reliability and yet has less checking intrinsically. If the memory test detects ANY memory errors, it's probably not worth running CUDAPm1 or CUDALucas on that gpu.
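The starting-point rule of thumb above can be sketched in Python. The function name is made up for illustration; the 25MB block size and the margin of 4 blocks come from the rule as stated. Rounding to the nearest integer is an assumption, chosen because it reproduces the 324 used in the example command for an 8GB card.

```python
def memtest_start_blocks(gpu_ram_gb, block_mb=25, margin_blocks=4):
    """Starting guess for the block count to give CUDALucas -memtest.

    Implements blocks = (GPU RAM in GB) * 1024 / 25 - 4 from the text.
    Rounding to the nearest integer is an assumption, not a program rule.
    """
    return round(gpu_ram_gb * 1024 / block_mb - margin_blocks)


if __name__ == "__main__":
    # Memory sizes of some of the gpus discussed in this thread.
    for ram_gb in (1, 1.5, 2, 3, 4, 8, 11):
        print(f"{ram_gb:>4} GB -> start at {memtest_start_blocks(ram_gb)} blocks")
```

This is only a first guess; if the memory test refuses that many blocks, reduce the count and retry as described above.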

Run via cmd /k commandline so any error message will stick around long enough to read. (/k = keep the command prompt window after program termination) The error could be any of a number of things, like a mis-typed executable name, wrong dll for the exe, missing dll, permissions problem, device number mismatch, program bug that crashes CUDAPm1, gpu with a driver problem or that has gone to sleep from a thermal limit or turned off because of inadequate total system power, etc.

Create a user directory. Unzip the CUDAPm1 software in it.
Get the appropriate CUDA level cufft and cudart library files for your gpu and OS from
https://download.mersenne.ca/CUDA-DLLs and place them in the same directory.

(If all else fails, CUDA dlls or .so files can be obtained by downloading and installing the CUDA SDK for the applicable OS and CUDA version, then locating the few needed files and adding them to your path, or copying them to your working directory on other systems. Download the current version from https://developer.nvidia.com/cuda-downloads
Older versions can be obtained from the download archive, at https://developer.nvidia.com/cuda-toolkit-archive
These are typically gigabytes in the newer versions; 2.8GB @ CUDA v11.3.
Get the installation guide and other documentation pdfs also while you're there.

Note, not all OS version and CUDA version combinations are supported! See CUDA Toolkit compatibility vs CUDA level https://www.mersenneforum.org/showpo...1&postcount=11 and good luck!)

Edit the cudapm1.ini file to suit your system and preferences.

At this point it is probably worth a short test to see if you have things installed and configured properly, and whether the gpu is reliable in CUDAPm1. Substitute for %dev% whatever the CUDA device number of the gpu is. (Numbering starts at zero.)
Code:
cudapm1 -d %dev% -b2 5000000 -f 2688k 50001781 >>cudapm1test.txt
Start whatever else will normally be running in the case you want to tune for (such as prime95 running on the cpu, other gpus busy with their GIMPS work, no interactive use), then do your CUDAPm1 fft and threads benchmarking on one gpu at a time, in that context reflective of running GIMPS work undisturbed.

Download the attached cp.7z, unzip it, and then modify and run the Windows batch file cp.bat it contains, in stages. The batch file contains a lot of comments. A lot of the code is initially commented out; uncomment what is useful to you. Initial edits are in the first several lines, to make it match your gpu model and situation. The first stage is to run the fft benchmark over the range of fft lengths you may use at some point in the future.
CUDAPm1 overwrites fft files of previous runs, hence the renames in the batch file. (If we try to benchmark too many fft lengths in a single run, the program may crash before generating an fft output file, or it may overrun the storage space in the program and output some garbage. So it is broken up into multiple runs in the batch file.)
Manually merge the fft output files, if multiple, into "<gpu model> fft.txt".

The threadbench portion needs to be edited based on which fft lengths are found to be useful and included in the fft output files.
Then run the threadbench portion. This can be very time consuming; hours or days. Threadbench appends to the threads file, so there's no need to rename and merge later.

CUDALucas has the -threadbench option, but CUDAPm1 has no -threadbench option. A CUDAPm1 threadbench is performed for a single fft length by specifying -cufftbench (fftlength) (fftlength) (repetitions) (mask). Repetitions and mask are optional. Giving the same fftlength twice is what tells CUDAPm1 to do a threadbench instead of an fftbench.

The fft and threads files produced by CUDAPm1 differ from those produced by CUDALucas for the same gpu and should not be used for CUDALucas. Nor should CUDALucas fft or threads files be used in CUDAPm1; run CUDALucas to produce its own fft and threads files. Keep these apps and their related files in separate directories. In my experience (based on lots of deep testing and benchmarking on numerous gpu models and software versions), the benchmark result files differ when any of the following differ:
  • Software application
  • application version
  • CUDA level of the application
  • GPU model
  • GPU unit
  • variations in other system activity, especially affecting the display gpu
  • one run to the next, everything else held constant (minor)
I saw no appreciable effect of widely varying CUDA driver version, in CUDALucas on a GTX480. However, I saw reductions in gpuowl performance on AMD of up to 5% from upgrading the Windows Adrenalin driver. Which CUDA level performs best on a given gpu can vary with the fft length as well as application etc.; performance can fluctuate several percent versus CUDA level, other things held constant. It's not always the latest that's fastest.

Note there are periods where the gpu will go idle during a P-1 run. It's expected and normal. The gcd computation is done with a single cpu core, not the gpu, at the end of stage one and at the end of stage two. The memory of the gpu is still committed to the CUDAPm1 application while this is occurring. During resumption of a computation in progress from a save file (after a crash or requested stop) there is also a time where the gpu is loading, not computing.

Obtaining work assignments and submitting results are manual only at this time.
Select P-1 factoring at https://www.mersenne.org/manual_assignment/
Report results at https://www.mersenne.org/manual_result/

There are a number of known issues with the program. See the bug and wish list at http://www.mersenneforum.org/showpos...34&postcount=3 for descriptions and also for some workarounds. If you find a new issue in the program, not in that list, please report it to Kriesel via PM for inclusion in the bug and wish list, which is occasionally updated.


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
File Type: 7z cp.7z (4.2 KB, 218 views)

Last fiddled with by kriesel on 2021-05-12 at 17:33 Reason: added section on getting dll or .so files
Old 2018-06-03, 14:32   #5
kriesel
 
Is this instance of P-1 software and hardware working correctly? How can we tell?

(much of the following first appeared at http://mersenneforum.org/showpost.ph...7&postcount=19 and relates to p~80M)
How can we tell when a P-1 factoring program is running correctly, producing correct results? Can we tell?

False factors get spotted by PrimeNet or the user. Factors of Mersenne numbers have the following properties, which PrimeNet can screen for quickly:
  1. The factor reported is 1 or 7 mod 8.
  2. The factor reported is of the form 2 k p + 1, where k is a positive integer and p is the Mersenne exponent.
  3. The factor reported divides the Mersenne number exactly, leaving zero remainder.
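The three checks listed above are cheap to perform, as the following Python sketch illustrates. It is not PrimeNet's code; the function name is hypothetical, and p is assumed to be an odd prime exponent. The divisibility test uses modular exponentiation, so the Mersenne number itself never has to be formed.

```python
def screen_mersenne_factor(p, f):
    """Return True if f passes the three quick checks for a factor of 2**p - 1.

    p is assumed to be an odd prime exponent. Illustrative sketch only.
    """
    if f <= 1:
        return False
    # 1. A factor of a Mersenne number is 1 or 7 mod 8.
    if f % 8 not in (1, 7):
        return False
    # 2. f must be of the form 2*k*p + 1 with integer k >= 1.
    if (f - 1) % (2 * p) != 0:
        return False
    # 3. f divides 2**p - 1 exactly, which is the same as 2**p mod f == 1.
    return pow(2, p, f) == 1


if __name__ == "__main__":
    # 2**11 - 1 = 2047 = 23 * 89, so 23 and 89 pass; 199 has the right
    # form (2*9*11 + 1, and 7 mod 8) but does not divide 2047.
    for candidate in (23, 89, 199):
        print(candidate, screen_mersenne_factor(11, candidate))
```

A screen like this catches falsely reported factors. It does nothing for the opposite problem of factors that were missed.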
Missed factors don't reliably get spotted.
For given exponent, bounds, available memory, and program-selected optimization, there's an estimated probability of finding a factor if the P-1 computation is performed correctly. In enough trials, sampling hundreds or thousands of exponents, statistically some pretty improbable things occur and are expected to occur. Given enough sample groups of 100-200 attempts, we're guaranteed some will fall below expectation and some will exceed expectation. It seems to me people are likely to be happy with the latter and concerned about the former, and post about the former. So there could be a sort going on, biasing what appears in the forum threads, even for samples run on hardware and software functioning perfectly. And since hardware malfunctioning is likely to reduce the number of successes, it seems to me worthwhile to look into a low number of successes since it _might_ be a quiet and nonspecific indication of unreliable hardware or software.

What are the choices?

A) Ignore low productivity, and continue to run possibly bad hardware, perhaps returning false negatives, necessitating a number of longer primality tests that could have been avoided?

B) Run another sample of a hundred plus on the same hardware and software and observe whether the yield is also improbably low? Running and examining yield on another set of 130 plus would provide a second independent yield number to compare against the probability distribution. Since the two runs are independent events and the probability of no factors in a set is 0.9%, getting two sets of 130 with no factors in either has a probability of about 80 parts per million. A set of 130 in the current wavefront would take about a month on a GTX480. P-1 itself is a slow and vague check.

C) Rerun a substantial subset of the unproductive sample on a different GPU? As I understand it, P-1 should be deterministic; the same bounds run twice (whether on same or different hardware) if it is working correctly should return the same found factors twice, or return none twice. (And in the case of CUDAPm1, produce the same res64/iteration sequence twice.) If someone with low results in a significant sample size reruns 32 or more, randomly chosen, or chosen giving preference to those that had the most restarts during their run, on different hardware that's tested reliable recently, the odds are still 30% of no factors, and takes about a week. Rerunning 65 still has a probability of nearly 10% of no factors, and takes about two weeks on a GTX480.

D) Explicitly test the reliability in multiple ways for known test cases. If it's a prime95/mprime instance yielding no P-1 factors, rerun the self test. Try some LL DC or triple check in similar and larger exponents, to exercise the same fft length (although not the same memory footprint) separately. Assuming it's a CUDAPm1 instance, pause P-1 factoring for a while to run a maximum-size memory test with several repetitions on the hardware, to check whether its memory reliability is currently ok. If it fails the memory check, retry with underclocking. If it passes the memory check, try repeating P-1 on a number of exponents in the range of current interest, each with a known P-1 factor. (Factors found by TF can also be considered.) Try multiple exponents with known factors; while there is a reliability issue, some might work and some not. Running just one such test that succeeds, or a few, could provide unwarranted confidence, so run several. Consider the case where they succeed ~50% of the time due to infrequent error, and the first one, two, or three attempts happen to succeed. There is not currently a residue self-test capability in CUDAPm1 like there is in CUDALucas. Reliability probably should be retested about annually regardless of software and hardware.
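The probabilities quoted in options B and C can be checked with a few lines of Python. This sketch assumes each attempt independently finds a factor with probability 3.56%, the figure CUDAPm1 indicated for exponents near 81-83M; that value and the function name are assumptions for illustration, and real per-exponent probabilities vary.

```python
def prob_no_factor(attempts, p_factor):
    """Probability that `attempts` independent P-1 runs all find no factor."""
    return (1.0 - p_factor) ** attempts


# Assumed per-attempt probability of finding a factor (3.56%).
P_FACTOR = 0.0356

if __name__ == "__main__":
    one_set = prob_no_factor(130, P_FACTOR)
    print(f"130 attempts, no factor: {one_set:.2%}")
    print(f"two such sets in a row:  {one_set ** 2 * 1e6:.0f} ppm")
    print(f"32 reruns, no factor:    {prob_no_factor(32, P_FACTOR):.0%}")
    print(f"65 reruns, no factor:    {prob_no_factor(65, P_FACTOR):.0%}")
```

Under that assumption the results land close to the figures given in B and C, which illustrates why even a rerun of dozens of exponents gives only weak evidence either way.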

I feel some responsibility for ensuring my own future P-1 runs are functioning correctly. It's not simple to do so. I know from performing D it is possible to produce (a fraction of the expected rate of) P-1 factors on hardware with memory reliability issues. It can appear to be working when it is having problems and may be silently missing factors. Or it could just be the statistics.


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2019-11-18 at 14:16
Old 2018-06-05, 03:12   #6
kriesel
 
An example of statistics or occasional error in action

Here's an example from my own experience of P-1 factoring stats.


By its nature, P-1 seems to be a relatively unchecked case. Trial factoring gets back-stopped somewhat by P-1; if a trial factoring attempt has a calculation error that causes it to miss a factor, the later P-1 run could catch it (if the factor minus one is sufficiently smooth, which is estimated to be the case for about 20% of factors). If P-1 finds a factor that trial factoring should have found but didn't, that reveals there was an error in trial factoring. If P-1 attempts have errors that cause factors to be missed, the missed factors will typically be at a bit level untested by trial factoring, and the errors go undetected. If an LL test has an error that affects the residue, it will usually not change the primality conclusion, and is extremely likely to be detected eventually by a 64-bit-residue mismatch with another LL test run made with a different offset, as long as the error is not a systematic one. The LL tests that follow P-1 don't shed any new light on the correctness of the P-1 attempt. Presumably, if the improbable error of falsely computing and reporting a factor occurs in P-1, it would be caught when the factor is submitted, by PrimeNet verifying that the factor is of the correct form and indeed divides the Mersenne number.

P-1 in effect double checks trial factoring; LL test double checks LL test; nothing double checks P-1 for missed factors. For production intent that's ok _if_ the hardware is sound and the code sound, since double checking P-1 would cost a lot of cycles compared to the potential benefit of a relatively few additional factors.

(Assume equal code and hardware reliability per hour, a 3% error rate in an LL test run to completion, a 3.6% probability of a findable P-1 factor, and a P-1 run time of 0.03 times the LL run time. All of these are made-up numbers thought to be in the general neighborhood of current actuality. Then 200 candidates surviving trial factoring would lead to 200 * 0.036 = 7.2 P-1 factors found in 200 * 0.03 = 6 LL test times, saving 1.2 LL test times. The hourly error rate would reduce that by 200 candidates * 0.03 duration/candidate * 0.03 error-rate/duration = 0.18 factors, to 1.02 net. Spending 6 more LL test times to double check the 200 P-1 candidates to save 0.18 factors is a bad bargain we don't take in production. Hardware or software QA is another matter. Running the same exponent many times in development and QA can pay if it reduces the 0.03 failure rate for production work.)
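The rough arithmetic in the preceding paragraph can be laid out as a small Python sketch, which makes it easy to try other values. The function and key names are invented for illustration, and the default inputs are the same made-up numbers used in the text, with one LL test as the unit of time.

```python
def p1_double_check_tradeoff(candidates=200, p_factor=0.036,
                             p1_cost=0.03, error_rate=0.03):
    """Reproduce the rough cost/benefit arithmetic; time unit is one LL test.

    All default inputs are the made-up values from the text, not measurements.
    """
    factors_found = candidates * p_factor          # factors expected from P-1
    p1_time = candidates * p1_cost                 # time spent on P-1 runs
    net_saving = factors_found - p1_time           # LL test times saved
    factors_lost = candidates * p1_cost * error_rate  # factors missed to errors
    return {
        "factors_found": factors_found,
        "p1_time": p1_time,
        "net_saving": net_saving,
        "factors_lost_to_errors": factors_lost,
        "net_after_errors": net_saving - factors_lost,
    }


if __name__ == "__main__":
    for name, value in p1_double_check_tradeoff().items():
        print(f"{name:>24}: {value:.2f}")
```

Double checking every P-1 run would cost another `p1_time` to recover at most `factors_lost_to_errors`, which is the bad bargain described above.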

How do we know if a P-1 factoring program is working correctly? Other than frequency of halts and error messages, and trying some exponents with known factors, I think we currently don't, even if the number of factors found is plausible compared to expectations and the probability distributions. I think many users will not examine the question very closely. The statistical nature of finding factors provides a lot of cover for missing a fraction of the factors that ought to be found, so such misses can go undetected.

How do we know if a given piece of hardware is reliable enough for P-1? If it can perform LL tests for similar exponents, with their longer run times than P-1, successfully, it may be. But vram footprint may be larger in CUDAPM1 than in CUDALucas for the same fft length. That's implied by observed lower-maximum-fft-length limits in fft or thread benchmarking for CUDAPM1 than CUDALucas on GPUs with smaller memory capacity (36864k vs. 38880k fft length limits on 1GB). It's also confirmed by GPU-Z monitoring of gpu memory occupancy during P-1 and LL testing. So P-1 may enter into regions of bad memory before LL tests do, as exponent and fft length increases.

Run times of LL tests, P-1 attempts, and P-1 time per expected factor found, on the same hardware, and estimated probability of factors over a widely scattered variety of exponents, are illustrated in the attachment found at http://www.mersenneforum.org/showpos...4&postcount=23

The probability of finding a factor in a P-1 run is a function of exponent, bounds, etc. The values I've seen CUDAPm1 calculate, for a wide variety of exponents, are mostly in the range of 2.8% to 5%.

The P-1 factoring software needs to be ready and trusted for the bigger exponents coming later also.

CUDAPm1 can be run in a way that it selects bounds and other parameters itself. This is how I have run it. I reviewed my own results and logs a while back and found some interesting things.

The population of my P-1 tests, by approximate exponent size bin, and the corresponding CUDAPm1-indicated probabilities of factoring were:

41M     2.80% *   1 = 0.028 expected; 0 factored
43M     2.64% *  18 = 0.48  expected; 0 factored (but many of these had already not found a factor in a stage-1-only P-1 run; a bit of negative selection bias)
81-83M  3.56% * 150 = 5.34  expected; 2 factored
150M    4.81% *   1 = 0.048 expected; 0 factored
151M    4.79% *   1 = 0.048 expected; 0 factored
199M    5.05% *   1 = 0.050 expected; 0 factored (B1 = 1880000, B2 = 29140000)
200M    2.72% *   1 = 0.027 expected; 0 factored (B1 = 1405000, B2 = 17211250 generated by ll tests saved=1)
total expected ~6.02 to be factored.
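The expected number of factors is just the sum of probability times attempts over the bins. A minimal Python sketch follows; the tuple layout and names are chosen here for illustration, and the values are copied from the list above.

```python
# (exponent bin, CUDAPm1-indicated probability, attempts, factors found)
BINS = [
    ("41M",    0.0280,   1, 0),
    ("43M",    0.0264,  18, 0),
    ("81-83M", 0.0356, 150, 2),
    ("150M",   0.0481,   1, 0),
    ("151M",   0.0479,   1, 0),
    ("199M",   0.0505,   1, 0),
    ("200M",   0.0272,   1, 0),
]


def expected_factors(bins):
    """Sum of probability * attempts over all bins."""
    return sum(prob * attempts for _, prob, attempts, _ in bins)


def found_factors(bins):
    """Total number of factors actually found."""
    return sum(found for _, _, _, found in bins)


if __name__ == "__main__":
    print(f"expected: {expected_factors(BINS):.2f}")
    print(f"found:    {found_factors(BINS)}")
```

Nearly all of the expectation comes from the 81-83M bin, which is why the analysis that follows concentrates on it.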

Note that the probabilities calculated by CUDAPM1 v0.20 range from 2.64% to 5.05% for exponents from 41M to 200M.

The ratio of P-1 factor attempt time to LL test run time increases with exponent, when CUDAPM1 is allowed to choose its own bounds and other parameters, in calculations to attempt to maximize the probable savings of computing time.

Two exponents had factors found by P-1. That seemed too few and an unlikely outcome. Two found versus 5.34 expected from 81-83M is 0.375 of that expected, quite a bit lower. About 4.5 LL test durations were spent getting two factors, making two LL tests and two double checks unneeded (plus the 12% possibility of a triple check if an LL error occurs).

Computing for the 82M exponent group, a binomial distribution (150, .0356); https://www.easycalculation.com/stat...stribution.php
confirms 2 factors or less from 150 attempts is a low probability but not extremely low:
Code:
 r   d          cumulative <=r probability
 0   .0043509   .0043509
 1   .0240915   .0284424
 2   .0662542   .0946966
 3   .1206554   .2153510
 4   .1636804   .3790314
 5   .1764300   .5554614
 6   .1573918   .7128532
 7   .1195196   .8323728
 8   .0788639   .9112367
 9   .0459321   .9571688
10   .0239072   .9810760
Three through eight are all more likely outcomes if calculations occurred accurately, than the observed outcome of two.
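For anyone who would rather not depend on an online calculator, the same binomial figures can be computed with the Python standard library. This is a minimal sketch with invented function names; it treats every attempt as independent with the same probability, which is itself an approximation since the probability varies by exponent.

```python
from math import comb


def binom_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * p**r * (1.0 - p) ** (n - r)


def binom_cdf(r, n, p):
    """Probability of r or fewer successes in n independent trials."""
    return sum(binom_pmf(i, n, p) for i in range(r + 1))


if __name__ == "__main__":
    n, p = 150, 0.0356  # the ~82M exponent group
    print(" r   d          cumulative <=r")
    for r in range(11):
        print(f"{r:>2}   {binom_pmf(r, n, p):.7f}  {binom_cdf(r, n, p):.7f}")
```

Calling the same functions with n = 18 and p = 0.0264 gives the distribution for the 43M group.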

For the 43M exponent group, by comparison, things look ok; 0 found, and 0 most likely to find.
43M distribution (18, .0264)
Code:
r     d     cumulative <=r probability
0 .6178032  .6178032
1 .3015408  .9193440
2 .0695006  .9888446
3 .0100510  .9988956
In the course of reviewing my logs, to obtain the exponent-specific probabilities from CUDAPm1, and looking for anything unusual, I found 3 exponents had conspicuous errors.

CUDAPm1 prints interim progress lines with 64-bit values similar to CUDALucas interim residues. Repeating values are unusual. In 3 of the 81M exponents, the interim residues switched to 0x00 early in the run. For example in stage 1:
Iteration 550000 M81328073, 0x4c1a24bd6c974303, n = 4704K, CUDAPm1 v0.20 err = 0.07153 (7:21 real, 8.8207 ms/iter, ETA 1:12:56)
Iteration 600000 M81328073, 0x0000000000000000, n = 4704K, CUDAPm1 v0.20 err = 0.07617 (7:18 real, 8.7657 ms/iter, ETA 1:05:10)
...
stage 1 finishes and stage 2 runs to completion in normal timing with the abnormal value:
...
Processing 469 - 480 of 480 relative primes.
Inititalizing pass... done. transforms: 359, err = 0.00095, (1.60 real, 4.4590 ms/tran, ETA 4:05)
Transforms: 53752 M81328073, 0x0000000000000000, n = 4704K, CUDAPm1 v0.20 err = 0.00095 (4:04 real, 4.5543 ms/tran, ETA 0:00)

Stage 2 complete, 1895210 transforms, estimated total time = 2:23:39
Starting stage 2 gcd.
M81328073 Stage 2 found no factor (P-1, B1=725000, B2=16131250, e=2, n=4704K CUDAPm1 v0.20)

The following result entry omits the final 64-bit value or anything that would indicate error and so gives no indication anything went wrong.
M81328073 found no factor (P-1, B1=725000, B2=16131250, e=2, n=4704K, aid=6E6DD895294C2D938D25A7B4E3CF____ CUDAPm1 v0.20)
Perhaps that should be changed. Modifying CUDAPm1 to detect the condition, such as adapting the check for bad residues developed for CUDALucas, and halt or retry from last believed good save file is another possibility.
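Until something like that exists in the program, redirected console output can be scanned after the fact. The following Python sketch is an external log check, not part of CUDAPm1. The regular expression is based only on the interim line format shown above, and the function name is hypothetical.

```python
import re

# Matches the exponent and the 64-bit interim value in lines such as
# "Iteration 600000 M81328073, 0x0000000000000000, n = 4704K, ..."
INTERIM = re.compile(r"\bM(\d+), 0x([0-9a-fA-F]{16}),")


def find_zero_residues(lines):
    """Return (line_number, exponent) for each interim line whose value is zero."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = INTERIM.search(line)
        if match and int(match.group(2), 16) == 0:
            hits.append((number, int(match.group(1))))
    return hits


if __name__ == "__main__":
    sample = [
        "Iteration 550000 M81328073, 0x4c1a24bd6c974303, n = 4704K, CUDAPm1 v0.20 err = 0.07153 (7:21 real, 8.8207 ms/iter, ETA 1:12:56)",
        "Iteration 600000 M81328073, 0x0000000000000000, n = 4704K, CUDAPm1 v0.20 err = 0.07617 (7:18 real, 8.7657 ms/iter, ETA 1:05:10)",
        "M81328073 Stage 2 found no factor (P-1, B1=725000, B2=16131250, e=2, n=4704K CUDAPm1 v0.20)",
    ]
    for line_number, exponent in find_zero_residues(sample):
        print(f"line {line_number}: M{exponent} shows an all-zero interim value")
```

To check a saved log, pass an open file object as `lines`. A scan like this only catches the all-zero symptom described here; other kinds of silent error would still go unnoticed.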

What could account for the lower than expected number of factored exponents?
A) The estimates of probability made by CUDAPm1 could be optimistic.
B) The conspicuous errors prevented finding any factors on the 3 exponents clearly affected. This effect is small, ~0.107 factors (3 * 0.0356).
C) Additional errors on other exponents from the same cause as the conspicuous errors, that went unnoticed; magnitude unknown.
D) Additional errors on some other exponents from other causes, for which a factor would have been found if the algorithm had executed correctly; existence, type, magnitude unknown.
E) The probability distribution; finding 2 or fewer factors in 150 ~82M trials is less than 10% probable but can happen.
F) Hardware issue, software bug, environmental, something else?
G) Combinations of the preceding.

The 3 exponents with conspicuous errors were rerun from start and appeared to run normally. None yielded a factor. That doesn't mean none were there to be found; it could be the result of correct execution, or it could be a reproducible error.

Counts of restarts (per error type) per exponent tallied by the application would be useful to identify the candidates most likely to have had errors.


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2019-11-18 at 14:16
Old 2018-06-07, 17:28   #7
kriesel
 
Limits on exponent versus gpu model or gpu memory or program behavior (CUDAPm1 v0.20)

Experimenting with several different model gpus, some old, some new, in CUDAPm1 V0.20 (mostly the September 2013 CUDA 5.5 build for Windows), I've found none of the gpus capable of completing stage 2 for exponents in the higher 3/4 of the theoretical capability (2^31-1). I also found some interesting behaviors.

At least one model can compute and save in stage 1 a save file that it cannot resume from (Quadro 4000, 800M exponent).

Maximum successfully completed stage one and stage two exponents differ. This is not surprising in some cases, since stage 2 requires more memory. But it was surprising that some models (the GTX 1060 3GB, 4GB GTX 1050 Ti, and 8GB GTX 1070; >=32 bit addressing), showed decreasing limits in their stage one runs with increasing memory, and lower than the older 1GB Quadro 2000, 1.5GB GTX480, and 2GB Quadro 4000 (31 bit address range), whose limits trend upward with memory as expected.

Some exponents fail within these ranges on a particular gpu also. For example, several exponents around 84.2M, and one at 128M failed on a Quadro 2000, although the upper limit of its capability is above 177M.

Currently the main limiting factors seem to be inadequate memory for stage 2, failure to correctly complete the stage 1 gcd or stage 2 startup immediately after stage 1 gcd, and unknown bugs. (The gcd is done on a cpu core. A quiet termination in stage 2 due to excess round off error was mentioned by owftheevil, CUDAPm1's author, as a known issue years ago.)

Further runs, as I refine values for the respective limits, by binary search, may narrow the gap between current lower and upper bounds of feasibility versus gpu model and stage. In some cases running an exponent to obtain a single bit of refinement on the bound can take a week to a month. Most upper and lower bounds are now converged to within 1%, my usual arbitrary end point, and many are within 1M.
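The refinement procedure described above amounts to a bisection between the largest exponent known to complete and the smallest known to fail. Here is a minimal Python sketch of that idea. The `completes` callable is hypothetical and stands in for a full CUDAPm1 run, which may take weeks. The sketch also assumes a single clean threshold and ignores that real test exponents must be prime, so a real search would use the nearest prime to each trial value.

```python
def refine_limit(known_good, known_bad, completes, tolerance=0.01):
    """Bisect between an exponent that completed and one that failed.

    `completes(exponent)` is a hypothetical stand-in for a full run; it
    returns True if the run finished correctly. The loop stops once the
    gap is within `tolerance` (1% by default) of the upper bound.
    """
    while known_bad - known_good > tolerance * known_bad:
        trial = (known_good + known_bad) // 2
        if completes(trial):
            known_good = trial   # raise the lower bound
        else:
            known_bad = trial    # lower the upper bound
    return known_good, known_bad


if __name__ == "__main__":
    # Pretend the true limit is 300M, purely for demonstration.
    low, high = refine_limit(100_000_000, 800_000_000,
                             lambda e: e < 300_000_000)
    print(f"limit lies between {low:,} and {high:,}")
```

Each pass through the loop corresponds to one more bit of refinement, which is why the last few percent of convergence can take months of gpu time.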

Some preliminary numbers are as follows. Below these approximate bound values, most exponents can be run to completion in both stage 1 and stage 2.

Code:
CUDAPm1 V0.20
GPU model       GPU Memory GB   Least lower bound value (including 1-month run time limit)
Quadro 2000        1               177,500,083
GTX 480            1.5             289,999,981
Quadro 4000        2               338,000,009
Quadro 5000        2.5             311,000,077
Quadro K4000       3               404,000,123
GTX 1060 3GB       3               432,500,129
GTX 1050 Ti        4               384,000,031
Tesla C2075        5.25            376,000,133
GTX 1070           8               333,000,257
GTX 1080           8               377,000,051
GTX 1080 Ti       11               377,000,081
The above numbers are for B1 and B2 bounds selected by the program, with the number of primality tests saved if a factor is found set to two. Two saved is what is usually included in manual assignment records. The bounds CUDAPm1 picks for that are sometimes not high enough to match what PrimeNet wants as limits, as indicated by mersenne.ca exponent status pages. In most cases increasing the number of tests saved to 3 would be enough. Running with two saved is probably more efficient overall. The maximum exponents are likely to drop significantly at a higher number of tests saved. The good news is that even the lowly Quadro 2000 has several years of somewhat useful life remaining, since PrimeNet currently issues exponents around 92M, advancing at about 8M/year. It's not recommended to use the Quadro 2000 for P-1 though, since its bounds tend to be lower than needed.
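The remaining useful life mentioned above is a simple linear extrapolation, sketched here in Python. The limits are copied from the table, the wavefront and rate are the approximate figures from the text, and the assumption that the wavefront keeps advancing at a constant rate is only that, an assumption. The names are invented for illustration.

```python
# Approximate exponent limits from the table above (tests saved = 2).
LIMITS = {
    "Quadro 2000": 177_500_083,
    "GTX 480": 289_999_981,
    "Quadro 4000": 338_000_009,
    "Quadro 5000": 311_000_077,
    "Quadro K4000": 404_000_123,
    "GTX 1060 3GB": 432_500_129,
    "GTX 1050 Ti": 384_000_031,
    "Tesla C2075": 376_000_133,
    "GTX 1070": 333_000_257,
    "GTX 1080": 377_000_051,
    "GTX 1080 Ti": 377_000_081,
}

WAVEFRONT = 92_000_000        # "around 92M" per the text
ADVANCE_PER_YEAR = 8_000_000  # "about 8M/year" per the text


def years_of_headroom(limit, wavefront=WAVEFRONT, rate=ADVANCE_PER_YEAR):
    """Years until the wavefront reaches `limit`, assuming a constant rate."""
    return (limit - wavefront) / rate


if __name__ == "__main__":
    for model, limit in LIMITS.items():
        print(f"{model:<14} {years_of_headroom(limit):5.1f} years")
```

This ignores the caveat that bounds chosen on low-memory cards may already be lower than desired well before the exponent limit is reached.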

The attachment below tabulates and graphs the stage 1 and stage 2 lower and upper exponent bounds found to date, along with notes re the limiting behavior, extrapolated run times, and comparison to certain means of estimating bounds. (Ignore the 64M fft limit claimed there; the limit is 128M, at least for CUDA levels 5.5 and up and possibly even somewhat lower)


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
File Type: pdf cudapm1 exponent limits and reasons.pdf (156.6 KB, 163 views)

Last fiddled with by kriesel on 2020-07-02 at 18:55 Reason: updated limit values
Old 2018-10-24, 17:27   #8
kriesel
 
CUDAPm1 v0.20 Limits and anomalies versus gpu model

Here is an info dump of some tests made, anomalies seen, and fits to run times and B1 and B2 bounds versus exponent, separately for a selection of NVIDIA-based gpu models. This is the gory detail for which the preceding post in this thread is a summary or overview. Each attachment is for a separate gpu model. They are listed in order of increasing gpu memory size:

Code:
Quadro 2000 1 GB
GTX480 1.5 GB
Quadro 4000 2 GB
Quadro 5000 2.5 GB
GTX 1060 3 GB
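
For readers who want to fit their own timings the same way, here is a minimal sketch of a run time versus exponent fit. It assumes a power-law form t = a * p^b, fitted by least squares on the logarithms. That functional form is an assumption made for illustration and may differ from the fits used in the attachments. The data points are synthetic, generated from a known formula purely to exercise the code; they are not measurements from any gpu. The name `fit_power_law` is hypothetical.

```python
import math


def fit_power_law(exponents, run_times):
    """Least-squares fit of log(t) = log(a) + b * log(p); returns (a, b)."""
    xs = [math.log(p) for p in exponents]
    ys = [math.log(t) for t in run_times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope on the log-log plot
    a = math.exp(mean_y - b * mean_x)   # prefactor
    return a, b


# SYNTHETIC placeholder data (not real timings), generated from t = 1e-12 * p**2.1.
synthetic_p = [50_000_000, 100_000_000, 200_000_000, 400_000_000]
synthetic_t = [1e-12 * p ** 2.1 for p in synthetic_p]

a, b = fit_power_law(synthetic_p, synthetic_t)

if __name__ == "__main__":
    print(f"a = {a:.3e}, b = {b:.3f}")
```

Replace the synthetic lists with measured exponent and run time pairs for one gpu model at a time; mixing models in one fit would blur the per-model differences the attachments are meant to show.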

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
File Type: pdf CUDAPm1 on Quadro 2000.pdf (18.3 KB, 195 views)
File Type: pdf cudapm1-gtx480 runtime scaling.pdf (44.9 KB, 202 views)
File Type: pdf CUDAPm1 on Q4000.pdf (19.4 KB, 200 views)
File Type: pdf CUDAPm1 on GTX1060.pdf (18.9 KB, 207 views)
File Type: pdf CUDAPm1 on Q5000.pdf (18.3 KB, 184 views)

Last fiddled with by kriesel on 2019-11-18 at 14:19 Reason: updated q5000 file
Old 2018-10-24, 17:31   #9
kriesel
 
Default CUDAPm1 v0.20 Limits and anomalies versus gpu model (continued)

Working around the 5-attachment limit here, this is a continuation of the set begun in the previous post in this thread (and at least for now, the conclusion).
Code:
GTX 1050 Ti 4 GB
GTX 1070 8 GB
GTX 1080 8 GB
GTX 1080 Ti 11 GB
Tesla C2075 6 GB (part of which is used for ECC)

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
File Type: pdf CUDAPm1 on GTX1050Ti.pdf (93.8 KB, 166 views)
File Type: pdf CUDAPm1 on GTX1070.pdf (78.6 KB, 156 views)
File Type: pdf CUDAPm1 on GTX1080.pdf (63.8 KB, 167 views)
File Type: pdf CUDAPm1 on GTX1080 Ti.pdf (65.9 KB, 162 views)
File Type: pdf CUDAPm1 on Tesla C2075.pdf (104.8 KB, 150 views)

Last fiddled with by kriesel on 2020-07-02 at 19:46 Reason: added GTX1080 Ti attachment
Old 2018-11-15, 02:08   #10
kriesel
 
Default RTX20xx and CUDAPm1 V0.21

Aaron Haviland has continued updating his forked version of CUDAPm1, which contains a number of fixes relative to V0.20. See https://www.mersenneforum.org/showpo...&postcount=627 for a Windows executable, built and run on Win10 x64, that supports RTX20xx and requires CUDA 10. See his GitHub repository for source: https://github.com/ah42/cuda-p1


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2019-11-18 at 14:21
Old 2018-11-19, 19:23   #11
kriesel
 
Default V0.22 description link

Aaron Haviland has announced a proper release of V0.22 with numerous changes. See https://www.mersenneforum.org/showpo...&postcount=646
While this addresses some issues of V0.20, it also seems to introduce some new ones. Overall, V0.20 seems more dependable to me than V0.22.


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2020-04-25 at 15:54

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.