Experimenting with several different gpu models, some old, some new, in CUDAPm1 V0.20 (mostly the September 2013 CUDA 5.5 build for Windows), I've found none of the gpus capable of completing stage 2 for exponents in the upper 3/4 of the theoretical range (which extends to 2^31-1). There were also some interesting behaviors.

At least one model can compute and save, in stage 1, a save file that it cannot resume from (Quadro 4000, 800M exponent).

The maximum exponents successfully completed in stage 1 and in stage 2 differ. This is not surprising in some cases, since stage 2 requires more memory. But it was surprising that some models (the 3GB GTX 1060, 4GB GTX 1050 Ti, and 8GB GTX 1070; >=32 bit addressing) showed decreasing limits in their stage 1 runs with increasing memory, and limits lower than those of the older 1GB Quadro 2000, 1.5GB GTX 480, and 2GB Quadro 4000 (31 bit address range), whose limits trend upward with memory as expected.

Some exponents also fail within these ranges on a particular gpu. For example, several exponents around 84.2M and one at 128M failed on a Quadro 2000, although the upper limit of its capability is above 177M.

Currently the main limiting factors seem to be inadequate memory for stage 2, failure to correctly complete the stage 1 gcd or stage 2 startup immediately after stage 1 gcd, and unknown bugs. (The gcd is done on a cpu core. A quiet termination in stage 2 due to excess round off error was mentioned by owftheevil, CUDAPm1's author, as a known issue years ago.)

Further runs, as I refine the respective limits by binary search, may narrow the gap between the current lower and upper bounds of feasibility for each gpu model and stage. In some cases, running an exponent to obtain a single bit of refinement on a bound can take a week to a month. Most upper and lower bounds have now converged to within 1%, my usual arbitrary end point, and many are within 1M.
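The binary search described above can be sketched roughly as follows. This is only an illustration, not the procedure actually used: `runs_ok` is a hypothetical stand-in for a full CUDAPm1 run on a given gpu (which in reality takes days to weeks), the choice of the next prime at or above the midpoint is an assumption, and the sketch assumes feasibility is monotonic in the exponent, which the isolated failures mentioned earlier show is not strictly true.

```python
def is_prime(n):
    """Trial division; adequate for exponents below 2^31."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def next_prime(n):
    """Smallest prime >= n (Mersenne exponents must be prime)."""
    while not is_prime(n):
        n += 1
    return n


def refine_bounds(lower, upper, runs_ok, tolerance=0.01):
    """Bisect between a known-good and a known-bad exponent.

    lower   -- largest exponent known to complete
    upper   -- smallest exponent known to fail
    runs_ok -- hypothetical callable standing in for a real CUDAPm1 run
    Stops when the bounds are within `tolerance` (1% by default).
    """
    steps = 0
    while (upper - lower) / lower > tolerance:
        mid = next_prime((lower + upper) // 2)
        if mid >= upper:
            break  # no untested prime left between the bounds
        if runs_ok(mid):
            lower = mid   # each run yields one bit of refinement
        else:
            upper = mid
        steps += 1
    return lower, upper, steps


# Simulated example: pretend everything up to 300M works on this gpu.
lo, hi, steps = refine_bounds(100_000_000, 800_000_000,
                              lambda p: p <= 300_000_000)
print(lo, hi, steps)
```

Since each step costs one full run, the number of steps matters far more than the arithmetic; stopping at 1% rather than at adjacent primes keeps the step count small.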

Some preliminary numbers are as follows. Below these approximate bound values, most exponents can be run to completion in both stage 1 and stage 2.

Code:

CUDAPm1 V0.20
GPU model       GPU memory (GB)   Least lower bound value (including 1-month run time limit)
Quadro 2000     1                 177,500,083
GTX 480         1.5               289,999,981
Quadro 4000     2                 338,000,009
Quadro 5000     2.5               311,000,077
Quadro K4000    3                 404,000,123
GTX 1060 3GB    3                 432,500,129
GTX 1050 Ti     4                 384,000,031
Tesla C2075     5.25              376,000,133
GTX 1070        8                 333,000,257
GTX 1080        8                 377,000,051
GTX 1080 Ti     11                377,000,081

The above numbers are for B1 and B2 bounds selected by the program, with the number of primality tests saved if a factor is found set to two. Two tests saved is what is usually included in manual assignment records. The bounds CUDAPm1 picks for that are sometimes not high enough to match what PrimeNet wants as limits, as indicated by the mersenne.ca exponent status pages. In most cases, increasing the number of tests saved to 3 would be enough. Running with two saved is probably more efficient overall. The maximum exponents are likely to drop significantly at a higher number of tests saved. The good news is that even the lowly Quadro 2000 has several years of somewhat useful life remaining, since the current exponent issue by PrimeNet is around 92M and advancing at about 8M/year. It's not recommended to use the Quadro 2000 for P-1 though, since its bounds tend to be lower than needed.
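As a rough check on the remaining-life remark, the arithmetic can be sketched as below. The 92M wavefront and 8M/year rate are the approximate figures quoted above; treating the rate as constant is an assumption, and the function name is just for illustration.

```python
CURRENT_WAVEFRONT = 92_000_000   # approximate PrimeNet exponent issue (from the post)
ADVANCE_PER_YEAR = 8_000_000     # approximate rate of advance (from the post)

# A few limits taken from the table above.
limits = {
    "Quadro 2000": 177_500_083,
    "GTX 480": 289_999_981,
    "GTX 1060 3GB": 432_500_129,
}


def years_of_headroom(limit, current=CURRENT_WAVEFRONT, rate=ADVANCE_PER_YEAR):
    """Years until the wavefront reaches `limit`, assuming a constant rate."""
    return max(0.0, (limit - current) / rate)


for gpu, limit in limits.items():
    print(f"{gpu}: about {years_of_headroom(limit):.1f} years")
```

For the Quadro 2000 this works out to (177.5M - 92M) / 8M/year, or a little over ten years, if the rate of advance held steady.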

The attachment below tabulates and graphs the stage 1 and stage 2 lower and upper exponent bounds found to date, along with notes on the limiting behavior, extrapolated run times, and comparison to certain means of estimating bounds. (Ignore the 64M fft limit claimed there; the limit is 128M, at least for CUDA levels 5.5 and up, and possibly even somewhat lower.)

Top of reference tree:

https://www.mersenneforum.org/showpo...22&postcount=1