P-1 bounds determination

(The following originated as https://www.mersenneforum.org/showpo...3&postcount=36)

As far as I can determine, it's not PrimeNet that determines B1, B2, d, e, or NRP and dictates them to the applications. Rather, most applications optimize the bounds and other parameters (unless the user specifies them), and afterward report to PrimeNet, in the results record, which parameters were selected and used.
(Note, however, that if the bounds reported for a P-1 completed without finding a factor are insufficient, PrimeNet does not retire the P-1 factoring task for that exponent; it reissues the task to someone else. Much of the first attempt's computation is then duplicated elsewhere, which is inefficient.)

Unless the user specifies otherwise, the applications mprime, prime95, and CUDAPm1 (but not Gpuowl v5.0's PRP-1 or later Gpuowl versions' P-1) estimate P-1 factoring run time and the probability of saving some number of primality tests, seeking to optimize the probable savings in total computing time for the exponent. The probability of finding a P-1 factor is computed over combinations of many B1 values and several B2 values, for
  • a given prior TF level (number of bits trial factored to),
  • a given number of future primality tests potentially saved, typically 1 or 2,
  • available memory resource limits (system or GPU),
  • and probably the system's or GPU's performance characteristics / benchmark results.
The mprime, prime95, and CUDAPm1 programs try many combinations of B1 and B2 values while searching for that optimum; a toy sketch of such a search follows.
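To make the shape of that search concrete, here's a toy sketch in Python. It is emphatically not prime95's code: the factor-probability model is a crude first-order Dickman-rho estimate (stage 2 approximated as one extra prime in (B1, B2]), the cost model counts squarings naively, and the grid, constants, and function names are all my own assumptions. The real optimizer (see the math page link below) also weighs memory, Brent-Suyama E, and machine benchmarks.
Code:
from functools import lru_cache
from math import log

@lru_cache(maxsize=None)
def dickman_rho(u, steps=100):
    # rho(u) = 1 on [0,1]; rho'(t) = -rho(t-1)/t. Integrate by trapezoids.
    if u <= 1:
        return 1.0
    h = 1.0 / steps
    n = int(u * steps) + 1
    rho = [1.0] * (n + 1)
    for i in range(steps + 1, n + 1):
        g_prev = rho[i - 1 - steps] / ((i - 1) * h)
        g_curr = rho[i - steps] / (i * h)
        rho[i] = rho[i - 1] - 0.5 * h * (g_prev + g_curr)
    return max(rho[n], 0.0)

def smooth_prob(m_bits, p, B1, B2):
    # Chance a factor q ~ 2^m_bits of M_p falls to P-1: (q-1)/(2p) must be
    # B1-smooth (stage 1), or B1-smooth times one prime in (B1,B2] (stage 2).
    ln_rest = m_bits * log(2) - log(2 * p)
    if ln_rest <= 0:
        return 1.0
    prob = dickman_rho(round(ln_rest / log(B1), 2))   # stage 1 alone
    x, k = log(B1), 40                  # x runs over ln(t), t a stage 2 prime
    r = (log(B2) / x) ** (1.0 / k)
    for _ in range(k):                  # sum ~ integral of rho(.) dx/x
        x_next = x * r
        mid = 0.5 * (x + x_next)
        prob += dickman_rho(round((ln_rest - mid) / log(B1), 2)) * (x_next - x) / mid
        x = x_next
    return min(prob, 1.0)

def p1_success_prob(p, tf_bits, B1, B2):
    # ~1/m chance of a factor between 2^m and 2^(m+1) (per the math page);
    # everything below 2^tf_bits was already excluded by trial factoring.
    return sum(smooth_prob(m, p, B1, B2) / m for m in range(tf_bits, tf_bits + 60))

def p1_cost(B1, B2):
    # Squaring-equivalents: ~1.44*B1 bits in the stage 1 exponent, plus
    # roughly one multiply per stage 2 prime, via pi(x) ~ x/ln(x).
    return 1.44 * B1 + max(B2 / log(B2) - B1 / log(B1), 0.0)

def best_bounds(p, tf_bits, tests_saved=2):
    # Try many B1 and several B2 candidates; keep the best expected net
    # saving, measured against a primality test costing ~p squarings.
    best = (0.0, 0, 0)
    for B1 in range(100_000, 2_000_001, 200_000):
        for B2 in (mult * B1 for mult in (10, 20, 30, 60)):
            gain = (p1_success_prob(p, tf_bits, B1, B2) * tests_saved * p
                    - p1_cost(B1, B2))
            if gain > best[0]:
                best = (gain, B1, B2)
    return best

print(best_bounds(89794319, 76))  # compare with prime95's pick in the log below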

Alternatively, the user dictates the P-1 bounds in the worktodo line (or on the command line, as applicable). For mprime or prime95, explicit bounds can be specified in a Pminus1 worktodo line, but not in a Pfactor line. It seems like a lot of work, and a poor bet, for the average user to try to outdo the coded optimization algorithm created by the author of prime95, mprime, gwnum, and some bits of Gpuowl.
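For illustration, the two line types look roughly like this (the values are placeholders patterned on the examples below; see prime95's whatsnew.txt / undoc.txt for the authoritative syntax):
Code:
Pminus1=[aid],1,2,89794319,-1,730000,14782500
Pfactor=[aid],1,2,89794319,-1,76,2

The four values after the AID are k, b, n, c, describing k·b^n+c. Pminus1 then takes explicit B1 and B2, while Pfactor instead gives the TF level and the number of primality tests assumed saved, leaving bounds selection to the program.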

From experiments with prime95 on somewhat larger exponents, it appears the optimization calculation also occurs during prime95 Test > Status output generation, which lags considerably for P-1 work compared to other computation types; apparently previously computed optimal P-1 bounds are not cached. In my experience, prime95 status output without a stack of P-1 work assignments is essentially instantaneous, while the attached example takes 5 seconds, even immediately after a preceding one. With larger P-1 exponents or more P-1 assignments (deeper work caching, or dedicating more of a system to P-1 than the 1/4 in my example), I expect that 5 seconds to grow.
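If that's right, a memoization wrapper along these lines (illustrative Python continuing the earlier sketch; I'm not claiming anything like it exists in the prime95 sources) would make repeated status requests nearly free:
Code:
from functools import lru_cache

# Cache the grid search on its full argument tuple: a repeated status
# request with unchanged inputs becomes a dictionary lookup, not a recompute.
best_bounds = lru_cache(maxsize=None)(best_bounds)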

prime95.log:
Code:
Got assignment [aid redacted]: P-1 M89787821
Sending expected completion date for M89787821: Dec 05 2018
...
 [Thu Dec 06 09:17:24 2018 - ver 29.4]
Sending result to server: UID: Kriesel/emu, M89787821 completed P-1, B1=730000, B2=14782500, E=12, Wg4: 123E2311, AID: redacted

PrimeNet success code with additional info:
CPU credit is 7.3113 GHz-days.
The prime95 worktodo.txt record for a PrimeNet-issued P-1 assignment contains no B1 or B2 specification; it gives only k, b, n, c (for k·b^n+c), the TF level reached (76 bits here), and the number of primality tests assumed saved (2).
Code:
Pfactor=[aid],1,2,89794319,-1,76,2
George's description of the optimization process is in the P-1 factoring section of https://www.mersenne.org/various/math.php. It is also there to read in the source code.
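The core of stage 1 is simple enough to show in a few lines. Here is a toy demonstration on M67 (Python, illustrative only; production code uses FFT multiplication, checkpointing, error checks, and a stage 2):
Code:
from math import gcd

def primes_up_to(n):
    sieve = bytearray([1]) * (n + 1)
    sieve[0] = sieve[1] = 0
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(range(i * i, n + 1, i)))
    return [i for i in range(n + 1) if sieve[i]]

def p1_stage1(p, B1):
    # P-1 stage 1 on M_p = 2^p - 1: compute 3^(2p * E) mod M_p, where E is
    # the product of all prime powers <= B1, then take a GCD.
    N = (1 << p) - 1
    x = pow(3, 2 * p, N)       # every factor q of M_p has q = 2kp + 1,
                               # so the factor 2p of q-1 comes for free
    for q in primes_up_to(B1):
        e = q
        while e * q <= B1:     # raise each prime to its largest power <= B1
            e *= q
        x = pow(x, e, N)
    g = gcd(x - 1, N)
    return g if 1 < g < N else None

# M67 = 193707721 * 761838257287 (Cole's 1903 factorization). P-1 finds the
# first factor because 193707720 = 2^3 * 3^3 * 5 * 67 * 2677: apart from the
# free 67, everything is within B1 = 3000.
print(p1_stage1(67, 3000))     # -> 193707721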

CUDAPm1 example:
worktodo entry from manual assignment:
Code:
PFactor=[aid],1,2,292000031,-1,81,2
program output:
Code:
CUDAPm1 v0.20
------- DEVICE 1 -------
name                GeForce GTX 480
Compatibility       2.0
clockRate (MHz)     1401
memClockRate (MHz)  1848
totalGlobalMem      zu
totalConstMem       zu
l2CacheSize         786432
sharedMemPerBlock   zu
regsPerBlock        32768
warpSize            32
memPitch            zu
maxThreadsPerBlock  1024
maxThreadsPerMP     1536
multiProcessorCount 15
maxThreadsDim[3]    1024,1024,64
maxGridSize[3]      65535,65535,65535
textureAlignment    zu
deviceOverlap       1

CUDA reports 1426M of 1536M GPU memory free.
Index 91
Using threads: norm1 256, mult 128, norm2 32.
Using up to 1408M GPU memory.
Selected B1=1830000, B2=9607500, 2.39% chance of finding a factor
  Starting stage 1 P-1, M292000031, B1 = 1830000, B2 = 9607500, fft  length = 16384K
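That "Selected ... 2.39% chance" line is the bounds optimizer's probability estimate at work. The crude estimator sketched near the top of this post can be fed the same inputs; it is not CUDAPm1's actual model, so expect only ballpark (few-percent) agreement:
Code:
# exponent, TF bits, B1, B2 from the CUDAPm1 run above
print(p1_success_prob(292000031, 81, 1_830_000, 9_607_500))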
Aaron Haviland rewrote part of CUDAPm1's bounds-selection code in v0.22 (https://www.mersenneforum.org/showpo...&postcount=646), building on his earlier 2014 fork (https://www.mersenneforum.org/showpo...&postcount=592).

Gpuowl's PRP-1 implementation takes a somewhat different approach and requires user selection of B1. It defaults to B2 = p but allows another B2 to be specified by the user. See https://www.mersenneforum.org/showth...=22204&page=70, posts 765-767, for Preda's description of gpuowl v5.0 P-1 handling. (See posts 694-706 for his earlier B1-only development: https://www.mersenneforum.org/showth...=22204&page=64.) Gpuowl's P-1 bounds defaults, cost, and algorithm continue to evolve, with substantial performance increases in v6.11 and significant cost reduction in v7.0 from overlapping P-1 stage 1 with PRP squarings. As of v6.11, for p ~104M, B1 defaults to 1M and B2 defaults to 30 × B1. As of v7.0 I think, Gpuowl does not run stage 2 if the available GPU RAM is insufficient for 15 or more buffers; on a 16GB GPU, that threshold is above a 900M exponent in my experience.

Looking up an exponent on mersenne.ca provides separate bounds guidance for GPU or CPU use. For example, https://www.mersenne.ca/exponent/104089423 shows

PrimeNet B1=450000, B2=22000000
GPU72 B1=650000, B2=24000000
When I reserve a block of ~30 exponents to run on a GPU, I typically specify the GPU72 bounds for the last (largest) of the block. That way the bounds are at least sufficient for every exponent in the block, retiring the P-1 task for all of them, whether the eventual primality test runs on a CPU or a GPU. It's also about 40% faster than taking the Gpuowl defaults of B1=1000000, B2=30000000, letting me do more of them per day with a near-optimal probability-weighted saving of overall compute time from factors found. That helps reduce the number of exponents that get inadequately run by other GIMPS participants on CPUs with the default prime95 memory allocation, which does stage 1 only, no stage 2.
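A back-of-envelope check on that ~40% figure, using the same naive cost model as the sketch near the top of this post (real stage 2 cost depends on memory, prime pairing, and implementation, so treat this as a plausibility check only):
Code:
from math import log

def approx_pi(x):
    return x / (log(x) - 1)            # serviceable pi(x) approximation

def p1_work(B1, B2):
    # squaring-equivalents: stage 1 ~1.44*B1, stage 2 ~one multiply per prime
    return 1.44 * B1 + approx_pi(B2) - approx_pi(B1)

default = p1_work(1_000_000, 30_000_000)   # Gpuowl v6.11 default bounds
gpu72 = p1_work(650_000, 24_000_000)       # GPU72 bounds from mersenne.ca
print(f"{default / gpu72:.2f}x")           # ~1.35x with this model

That comes out around 1.35x, roughly consistent with the ~40% figure.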


(Code authors are welcome to weigh in re any errors, omissions, nuances etc.)


Top of this reference thread: https://www.mersenneforum.org/showpo...89&postcount=1
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
