2019-05-05, 20:23   #11
kriesel
 
Why don't we compute multiple P-1 runs at the same time, allowing multiple use of some interim values?

In P-1 factoring, it is common to get multiple assignments with closely spaced exponents.
For example, a recent reservation I made of 10 exponents for a single gpu was:
Code:
PFactor=(aid redacted),1,2,91347967,-1,77,2
PFactor=(aid redacted),1,2,91348007,-1,77,2
PFactor=(aid redacted),1,2,91348013,-1,77,2
PFactor=(aid redacted),1,2,91348031,-1,77,2
PFactor=(aid redacted),1,2,91348091,-1,77,2
PFactor=(aid redacted),1,2,91348093,-1,77,2
PFactor=(aid redacted),1,2,91348151,-1,77,2
PFactor=(aid redacted),1,2,91348177,-1,77,2
PFactor=(aid redacted),1,2,91348207,-1,77,2
PFactor=(aid redacted),1,2,91348241,-1,77,2
Note that the exponents were assigned in ascending order. Adjacent exponents in this batch are as little as 2 apart and at most 60 apart, and the whole set of 10 spans only 274. (Adjacent spacing is under 1 ppm of p; max minus min is under 3 ppm.)

There are multiple opportunities for increasing total throughput. The first is the subject of the post title.

1) For stage 1 P-1, powers of the small primes up to B1 are computed and multiplied together into one large exponent. B1 and B2 values for closely spaced exponents generally match from one exponent to the next, or are very close.
For closely spaced exponents, many of those prime powers will be the same for multiple exponents and can be reused, or slightly extended, rather than recomputed entirely.
Their products can also be reused rather than recomputed, up to the point where the product is reduced mod Mp and therefore becomes exponent-specific. (See the sketch at the end of this item.)

But these are just the partial computations of what power to raise each small prime to. If I recall correctly, prime95 precomputes that product ahead of time only up to about a million bits in size, and the rest is computed along the way. So what could be cached and reused is that precomputation, which is a small part of the whole.

Stage 1 P-1 memory requirements are small compared to the installed ram of a modern gpu. For example, stage 1 on 90M exponents occupies ~250 MB, while a GTX 1070 or above has 8GB or more of gpu ram. So on-gpu caching of the reusable megabit value is not an issue.
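
To make the reuse opportunity concrete, here is a minimal Python sketch, with helper names of my own invention; it is not how prime95, gpuowl, or CUDAPm1 actually organize the work. The point is that the big stage 1 exponent E depends only on B1, so it could in principle be computed once and shared across a batch of nearby exponents; only the final powering mod Mp and the gcd are exponent-specific.
Code:
import math
from sympy import primerange   # any prime sieve would do

def stage1_exponent(B1):
    """Product of q^floor(log_q(B1)) over primes q <= B1; independent of p."""
    E = 1
    for q in primerange(2, B1 + 1):
        qk = q
        while qk * q <= B1:
            qk *= q
        E *= qk
    return E

def stage1(p, E, base=3):
    """The exponent-specific part: powering mod Mp, then the gcd."""
    Mp = (1 << p) - 1
    # Any prime factor f of Mp satisfies f = 1 (mod 2p), so fold 2*p into the exponent.
    x = pow(base, 2 * p * E, Mp)
    return math.gcd(x - 1, Mp)

# One E, computed once, serves every exponent in a batch run to the same B1.
E = stage1_exponent(5)         # toy bound: E = 2^2 * 3 * 5 = 60; real B1 is ~700000+
for p in (29, 31, 37):         # stand-ins for a batch of closely spaced exponents
    print(p, stage1(p, E))     # p = 29 yields the factor 233 (233 - 1 = 2^3 * 29)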

2) Multiple stage 1 runs on different exponents could be performed essentially in parallel. With some phasing or stagger, throughput may be raised by having one run use gpu-host bandwidth, such as for writing save files, while the others use the gpu computing capability during that time.

3) Near or somewhat above the current LL double check wavefront, on newer gpus with 11GB or more of ram, it is practical to run two P-1 stage 2 runs in parallel without affecting NRP values. If these are phased, so that the gcd of one runs on the cpu while the other still uses the gpu, or the checkpoint save to disk of one runs while the other continues to compute on the gpu, increased throughput is obtained.

4) In stage 2, at or above the first-test wavefront, gpu ram gets rather fully committed. The gpu goes idle and stays idle while the stage 2 gcd is computed on a single cpu core. Some trial factoring could be slipped into that otherwise idle interval; a small illustration follows. The additional throughput per interval is greater for faster gpus, larger exponents, and slower cpu cores.
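
To illustrate the kind of work that could fill that gap, here is a minimal Python sketch of trial factoring; the function name and range handling are mine, and it bears no resemblance to the sieved, massively parallel kernels of mfaktc or mfakto. Candidate factors of M(p) have the form q = 2kp+1 with q = 1 or 7 (mod 8), and q divides M(p) exactly when 2^p = 1 (mod q).
Code:
def tf_range(p, bit_lo, bit_hi):
    """Return a factor q of 2^p - 1 with bit_lo <= q.bit_length() <= bit_hi, or None."""
    k = (1 << (bit_lo - 1)) // (2 * p) + 1   # first k with q above the lower bit level
    while True:
        q = 2 * k * p + 1
        if q.bit_length() > bit_hi:
            return None
        # Only q = 1 or 7 (mod 8) can divide 2^p - 1; skip other candidates cheaply.
        if (q & 7) in (1, 7) and pow(2, p, q) == 1:
            return q
        k += 1

print(tf_range(23, 2, 30))   # -> 47, since 47 = 2*1*23 + 1 divides 2^23 - 1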

The underutilization of the gpu during gcd is due to performing the gcd on a cpu core. This is true of both CUDAPm1 and some versions of gpuowl that support P-1. Recently GpuOwl was modified to run the gcd in parallel with other activity in most cases. (See #6 below.)

As far as I know, no one has attempted using multiple cpu cores to speed that, or attempted programming a gcd computation on a gpu in either CUDA or OpenCL.

5) The P-1 computation can be done somewhat incrementally: run stage 1 to a low value of B1 (perhaps half of the gputo72 target) and attempt a gcd; then, if no factor is found, extend the powers of the primes already included up to a higher B1 (perhaps the full gputo72 target), multiply in the additional primes up to that higher B1, and perform another gcd. Stage 2 could similarly be extended from a low value of B2 to a higher B2. If the low-B1 gcd or the low-B2 gcd yields a factor, computing time is saved. That saving would need to be large enough and common enough to more than compensate for the cost of the additional gcds. (A sketch of B1 extension follows.)
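
As an illustration of what B1 extension from a saved stage 1 residue could look like (helper names are mine; as noted below, neither GpuOwl nor CUDAPm1 does this today), here is a minimal Python sketch. Extending from B1_low to B1_high only requires raising the saved residue by the prime powers not already included, rather than redoing the whole powering.
Code:
import math
from sympy import primerange

def prime_power(q, B1):
    """Largest power of the prime q not exceeding B1."""
    qk = q
    while qk * q <= B1:
        qk *= q
    return qk

def stage1_residue(p, B1, base=3):
    """Stage 1 from scratch to bound B1: base^(2*p*E(B1)) mod Mp."""
    Mp = (1 << p) - 1
    x = pow(base, 2 * p, Mp)
    for q in primerange(2, B1 + 1):
        x = pow(x, prime_power(q, B1), Mp)
    return x

def extend_stage1(residue, p, B1_low, B1_high):
    """Continue a saved residue from B1_low to B1_high by applying only the
    prime-power factors of the exponent that are not already included."""
    Mp = (1 << p) - 1
    x = residue
    for q in primerange(2, B1_high + 1):
        lo = prime_power(q, B1_low) if q <= B1_low else 1
        hi = prime_power(q, B1_high)
        if hi > lo:
            x = pow(x, hi // lo, Mp)   # raise by the extra factor only
    return x

def gcd_step(residue, p):
    return math.gcd(residue - 1, (1 << p) - 1)

# Checkpoint at a low bound, take a gcd, and extend only if no factor was found.
p = 67
r = stage1_residue(p, 1000)
f = gcd_step(r, p)
if f == 1:
    r = extend_stage1(r, p, 1000, 3000)   # reuses the saved residue; no restart
    f = gcd_step(r, p)
print(f)   # the gcd picks up 193707721 = 2^3*3^3*5*67*2677 + 1, a factor of M67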

Neither GpuOwl nor CUDAPm1 has yet implemented B1 or B2 extension from an existing save file. Consequently a run to a higher B1 or B2 for the same exponent currently requires starting over, repeating a lot of computation.

6) Mihai Preda has implemented a different type of throughput increase in recent versions of gpuOwL P-1. The stage 1 gcd is performed on a cpu core in a separate thread, while in parallel the gpu begins the stage 2 computation. In most cases the stage 2 computation will be necessary. If the stage 1 gcd returns a factor (about 2% of the time), the stage 2 start is wasted effort, but no more so than an idle gpu would be (except for the increased power consumption).

Similarly, the stage 2 gcd is performed on a cpu core in a separate thread, while in parallel the gpu begins the computation of the next worktodo entry, if any is present. Typically the gcd is faster than the gpu stage computation. However, occasionally one might follow a large exponent with a small one, or there can be a gcd yet to run after the last worktodo entry's stage 2 gpu computation. The gpuowl program normally waits for a pending gcd to complete before terminating for lack of work. There are cases where a program error causes the program to terminate before the gcd is completed; correcting the problem and restarting from save files generally completes the gcd.
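
A minimal sketch of that overlap pattern, with Python stand-ins for the real pieces (the function names are placeholders rather than gpuowl internals; a print stands in for the gpu stage 2 kernel, and math.gcd stands in for the GMP gcd on a cpu thread):
Code:
import math
from concurrent.futures import ThreadPoolExecutor

def gcd_of(residue, p):
    """Stage 1 gcd, done on a cpu core (here, a worker thread)."""
    return math.gcd(residue - 1, (1 << p) - 1)

def run_stage2(p):
    """Placeholder for the gpu stage 2 computation on exponent p."""
    print(f"stage 2 running for M({p}) while the stage 1 gcd is still pending")
    return None   # pretend stage 2 found nothing

def p_minus_1(p, stage1_res, pool):
    # Hand the stage 1 gcd to a cpu thread...
    pending_gcd = pool.submit(gcd_of, stage1_res, p)
    # ...and start stage 2 immediately instead of idling until the gcd is done.
    stage2_result = run_stage2(p)
    # Before reporting, collect the gcd; ~2% of the time it already holds a
    # factor and the stage 2 work just started turns out to be unnecessary.
    factor = pending_gcd.result()
    return factor if factor > 1 else stage2_result

with ThreadPoolExecutor(max_workers=1) as pool:
    # Toy stage 1 residue for M(29) with B1 = 5: 3^(2*29*60) mod M(29).
    residue = pow(3, 2 * 29 * 60, (1 << 29) - 1)
    print(p_minus_1(29, residue, pool))   # -> 233
The stage 2 gcd overlapping the start of the next worktodo entry follows the same pattern.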


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
