CUDAPm1 getting started guide
Posted by kriesel, 2018-06-03, 14:03 (post #4)

CUDAPm1 makes heavy use of gpu ram and is unforgiving of error. Thorough memory testing is recommended, using the CUDALucas -memtest option to test as much of the memory as possible, or some other gpu ram test utility. Log the memory test results to disk for later reference, and retest at annual or semiannual intervals.

Read the readme file. Note that most of the following is based on experience with CUDAPm1 v0.20 on Windows and a variety of gpus ranging from the 1GB Quadro 2000 to the 11GB GTX 1080 Ti.
Then begin by confirming that the gpu you plan to run CUDAPm1 on has reliable memory.
Run CUDALucas -memtest on it, testing as many 25MB blocks as you can get to fit and run successfully. A place to start is blocks = (GPU ram in GB)*1024 / 25 - 4. Also specify the number of passes: 1 for a slow gpu, more for a fast gpu. Redirecting the output to a file allows review at any time later. For example, for an 8GB GTX 1080 (8*1024/25 - 4 = 323.68, rounded to 324) and 2 passes, something like
cmd /k cudalucas -memtest 324 2 >>gtx1080-ramtest.txt
If you specify too many blocks, it will complain and reduce the number and make an attempt. Sometimes that fails; if so, reduce the amount some more and retry until it will run to completion. Be very strict; CUDAPm1 requires a lot of memory and extreme reliability and yet has less checking intrinsically. If the memory test detects ANY memory errors, it's probably not worth running CUDAPm1 or CUDALucas on that gpu.
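The starting block count from the formula above can be worked out with a few lines of Python. This is only a sketch: the function names are made up for illustration, and rounding to the nearest whole block is an assumption chosen because it reproduces the 324 used in the 8GB example.

```python
def memtest_blocks(gpu_ram_gb, reserve_blocks=4, block_mb=25):
    """Starting guess for the number of 25MB blocks: GB*1024/25 - 4.

    Rounding to the nearest integer is an assumption; it matches the
    324 blocks used in the 8GB example in the post.
    """
    return round(gpu_ram_gb * 1024 / block_mb - reserve_blocks)


def memtest_command(gpu_ram_gb, passes, log_name):
    """Build the cmd line shown in the post for a given card and pass count."""
    blocks = memtest_blocks(gpu_ram_gb)
    return f"cmd /k cudalucas -memtest {blocks} {passes} >>{log_name}"


if __name__ == "__main__":
    print(memtest_command(8, 2, "gtx1080-ramtest.txt"))
```

If the memory test refuses to run with the computed count, lower it by hand and retry as described above; the formula is only a place to start.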

Run via a cmd /k command line so any error message stays on screen long enough to read (/k keeps the command prompt window open after the program terminates). The error could be any of a number of things, like a mis-typed executable name, wrong dll for the exe, missing dll, permissions problem, device number mismatch, program bug that crashes CUDAPm1, gpu with a driver problem or that has gone to sleep from a thermal limit or turned off because of inadequate total system power, etc.

Create a user directory. Unzip the CUDAPm1 software in it.
Get the appropriate CUDA level cufft and cudart library files for your gpu and OS from and place them in the same directory.

(If all else fails, CUDA dlls or .so files can be obtained by downloading and installing the applicable OS and CUDA version’s CUDA SDK, then locating the files and adding the few needed files to your path or to your working directory on other systems. Download the current version from
Older versions can be obtained from the download archive, at
These are typically gigabytes in the newer versions; 2.8GB @ CUDA v11.3.
Get the installation guide and other documentation pdfs also while you're there.

Note, not all OS version and CUDA version combinations are supported! See CUDA Toolkit compatibility vs CUDA level and good luck!)

Edit the cudapm1.ini file to suit your system and preferences.

At this point it is probably worth a short test to see if you have things installed and configured properly, and whether the gpu is reliable in CUDAPm1. Substitute for %dev% whatever the CUDA device number of the gpu is. (Numbering starts at zero.)
cudapm1 -d %dev% -b2 5000000 -f 2688k 50001781 >>cudapm1test.txt
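For anyone who would rather script the short test than type it, the following Python sketch builds the same command and appends the program output to the log file. It uses only the flags shown in the command above; the function names, the default executable name and the decision to capture stderr as well are assumptions made for illustration.

```python
import subprocess


def cudapm1_test_args(device, exe="cudapm1"):
    """Argument list for the short test; 'device' replaces %dev% (numbering starts at 0)."""
    return [exe, "-d", str(device), "-b2", "5000000", "-f", "2688k", "50001781"]


def run_and_log(args, log_path="cudapm1test.txt"):
    """Run the command and append its output to log_path.

    stdout is appended as with >> in cmd; stderr is captured too, which
    plain >> would not do. Returns the program's exit code.
    """
    with open(log_path, "a") as log:
        return subprocess.run(args, stdout=log, stderr=subprocess.STDOUT).returncode


if __name__ == "__main__":
    # Show the command for device 0 without running it.
    print(" ".join(cudapm1_test_args(0)))
```

Calling run_and_log(cudapm1_test_args(0)) from the directory holding the executable and library files would run the test on device 0.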
Before benchmarking, start whatever else will normally be running in the case you want to tune for, such as prime95 running on the cpu, other gpus busy with their GIMPS work, and no interactive use. Then do your CUDAPm1 fft and threads benchmarking on one gpu at a time in that context, so that it reflects running GIMPS work undisturbed.

Download the Windows batch file cp.bat (zipped in cp.7z), unzip it, modify it, and then run it in stages. It contains a lot of comments. A lot of the code is initially commented out; uncomment what is useful to you. The initial edits are in the first several lines, to make it match your gpu model and situation. The first stage is to run the fft benchmark over the range of fft lengths you may use at some point in the future.
CUDAPm1 overwrites fft files of previous runs, hence the renames in the batch file. (If we try to benchmark too many fft lengths in a single run, the program may crash before generating an fft output file, or it may overrun the storage space in the program and output some garbage. So it is broken up into multiple runs in the batch file.)
Manually merge the fft output files, if multiple, into "<gpu model> fft.txt".
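The merge can also be scripted. The sketch below is not part of CUDAPm1 or cp.bat; it assumes the fft output files are plain text and that the merged file is simply their lines in the order given, with repeated lines (such as a header repeated in each part) dropped. Check the result against your own files before relying on it. All file names are hypothetical.

```python
from pathlib import Path


def merge_fft_files(part_paths, merged_path):
    """Concatenate fft output files in order, skipping repeated non-blank lines.

    Assumes plain text parts; returns the number of lines written.
    """
    seen = set()
    merged = []
    for part in part_paths:
        for line in Path(part).read_text().splitlines():
            if line.strip() and line in seen:
                continue  # same line already copied from an earlier part
            seen.add(line)
            merged.append(line)
    Path(merged_path).write_text("\n".join(merged) + "\n")
    return len(merged)
```

For example, merge_fft_files(["fft-part1.txt", "fft-part2.txt"], "GeForce GTX 1080 fft.txt") would write one combined file from two renamed partial runs.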

The threadbench portion needs to be edited based on which fft lengths are found to be useful and included in the fft output files.
Then run the threadbench portion. This can be very time consuming; hours or days. Threadbench appends to the threads file, so there's no need to rename and merge later.

CUDALucas has the -threadbench option, but CUDAPm1 has no -threadbench option. A CUDAPm1 threadbench is performed for a single fft length by specifying -cufftbench (fftlength) (fftlength) (repetitions) (mask). Repetitions and mask are optional. Giving the same fftlength twice is what tells CUDAPm1 to do a threadbench instead of an fftbench.
The fft and threads files produced by CUDAPm1 will differ from those produced by CUDALucas for the same gpu and should not be used for CUDALucas. Nor should CUDALucas fft or threads files be used in CUDAPm1. CUDALucas should be run to produce its own fft and threads files. Keep these apps and related files in separate directories from each other.
In my experience (based on lots of deep testing and benchmarking on numerous gpu models and software versions), the benchmark result files differ when any of the following differ:
  • Software application
  • Application version
  • CUDA level of the application
  • GPU model
  • GPU unit (individual cards of the same model)
  • Variations in other system activity, especially affecting the display gpu
  • One run to the next, everything else held constant (minor)
I saw no appreciable effect of widely varying CUDA driver version, in CUDALucas on a GTX480. However, I saw reductions in gpuowl performance on AMD of up to 5% from upgrading the Windows Adrenalin driver. Which CUDA level performs best on a given gpu can vary with the fft length as well as application etc.; performance can fluctuate several percent versus CUDA level, other things held constant. It's not always the latest that's fastest.
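Since a CUDAPm1 threadbench is just -cufftbench with the same fft length given twice, the command lines for a list of fft lengths can be generated mechanically. The sketch below only builds argument lists; the function names and the default executable name are assumptions, and the fft length is passed through unchanged in whatever units your CUDAPm1 version expects.

```python
def threadbench_args(fft_length, repetitions=None, mask=None, exe="cudapm1", device=None):
    """Arguments for one threadbench: -cufftbench n n [repetitions [mask]].

    repetitions and mask are optional and positional, so a mask can only
    be given together with repetitions.
    """
    if mask is not None and repetitions is None:
        raise ValueError("mask requires repetitions, since both are positional")
    args = [exe]
    if device is not None:
        args += ["-d", str(device)]
    # Same length twice selects threadbench rather than fftbench.
    args += ["-cufftbench", str(fft_length), str(fft_length)]
    if repetitions is not None:
        args.append(str(repetitions))
    if mask is not None:
        args.append(str(mask))
    return args


def threadbench_plan(fft_lengths, **kwargs):
    """One argument list per fft length found useful in the fft output file."""
    return [threadbench_args(n, **kwargs) for n in fft_lengths]
```

Each resulting list can be joined with spaces for a batch file or passed to a process launcher. Because threadbench appends to the threads file, the runs can be made one after another.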

Note there are periods where the gpu will go idle during a P-1 run. It's expected and normal. The gcd computation is done with a single cpu core, not the gpu, at the end of stage one and at the end of stage two. The memory of the gpu is still committed to the CUDAPm1 application while this is occurring. During resumption of a computation in progress from a save file (after a crash or requested stop) there is also a time where the gpu is loading, not computing.

Obtaining work assignments and submitting results are manual only at this time.
Select P-1 factoring at
Report results at

There are a number of known issues with the program. See the bug and wish list at for descriptions and also for some workarounds. If you find a new issue in the program, not in that list, please report it to Kriesel via PM for inclusion in the bug and wish list, which is occasionally updated.

Top of reference tree:
Attached file: cp.7z (4.2 KB)

Last fiddled with by kriesel on 2021-05-12 at 17:33 Reason: added section on getting dll or .so files