kriesel ("TF79LL86GIMPS96gpu17", Mar 2017, US midwest), 2019-01-02 21:56, post #7
A crash program to attack one 100Mdigit candidate might be done as follows.
TF from the current bit level to the goal level minus one bit, on one gpu.
TF from goal level minus one to the final bit level, in parallel on a second gpu. (A toy sketch of the TF step appears after this paragraph.)
PRP/GC with P-1 built in on a third, AMD gpu with gpuowl 5.0. (Or it might be faster to run P-1 separately in parallel on CUDAPm1 and run PRP/GC with gpuowl v3.5 to 3.8.)
If the gpus are of equal or similar speed, the factoring runs finish much earlier than the primality test.
(TF time or P-1 time is typically somewhere around ~1/40 of the primality test time for the same exponent and hardware, as I recall.)
The odds of finding a factor by TF or P-1 are low but not zero, around 3% for P-1.
If gpu one or two finds a factor, the time spent on the others was unnecessary.
Therefore it's better to pipeline: one gpu screens candidate exponents by TF, and other gpus and cpus follow with P-1 or a primality test (LL or PRP, not both) in parallel, on exponents that were previously screened by TF.
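
For readers unfamiliar with what the TF step actually does: any factor q of a Mersenne number M_p = 2^p - 1 (p prime) has the form q = 2*k*p + 1 with q congruent to 1 or 7 mod 8, which is why trial factoring can march candidate factors up one bit level at a time. The Python below is only a toy sketch of that idea with a tiny exponent; the function name and bit levels are illustrative, and real TF programs such as mfaktc/mfakto also sieve candidates and run at much higher bit levels on gpus.

[CODE]
# Toy trial factoring of M_p = 2^p - 1: candidate factors have the form
# q = 2*k*p + 1 and must be 1 or 7 mod 8.  "Bit levels" just bound q between
# 2^start_bits and 2^end_bits.  Tiny exponent so it runs instantly.

def tf(p, start_bits, end_bits):
    """Return the first factor of 2^p - 1 found between the two bit levels,
    or None if that range contains no factor of the required form."""
    k = 1
    while True:
        q = 2 * k * p + 1
        k += 1
        if q.bit_length() <= start_bits:
            continue                      # below the starting bit level, already searched
        if q.bit_length() > end_bits:
            return None                   # past the goal bit level, give up
        if q % 8 not in (1, 7):
            continue                      # factors of M_p are always +-1 mod 8
        if pow(2, p, q) == 1:             # q divides 2^p - 1
            return q

print(tf(41, 0, 20))   # prints 13367 = 2*163*41 + 1, the smallest factor of M_41
[/CODE]

Real TF kernels use the same 2kp+1 and mod-8 filters, but test huge batches of candidate q in parallel on the gpu.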

Throwing multiple _cores_ at a single exponent is standard operating procedure in prime95/mprime when multiple cores are available. There is a requirement for a very high data rate between such cooperating cores; package-local cache is best, and memory transfer rate is typically the limiting factor in performance. (George experiments with different code sequences that trade more instructions for fewer memory accesses.) Even the package-to-package transfer rate on dual-Xeon pcs is slow by comparison, and shows up as a slowdown in benchmarks when a prime95 worker uses cores from both packages. (See https://www.mersenneforum.org/showpo...18&postcount=4 and https://www.mersenneforum.org/showpo...19&postcount=5 for benchmark results on several different hardware configurations.)

The available networking speeds between computers are slower still. So there's no point in attempting to distribute a single probable-prime (PRP) or Lucas-Lehmer iteration across multiple computers; it would produce a slowdown, not a speed-up. As I recall, a similar argument has been made against using multiple gpus on a single exponent: the PCIe communication is not fast enough, and SLI or Crossfire may also lack enough speed. That same PCIe limit would also constrain coprocessor cards bearing cpus and RAM.
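
To put rough numbers on that bandwidth argument, here is a back-of-envelope sketch of the sustained data rate a single squaring iteration of a ~100Mdigit-exponent test needs, versus what various links can deliver. Every figure below (FFT size, passes over the data, per-iteration time, link speeds) is an order-of-magnitude assumption for illustration, not a measurement.

[CODE]
# Back-of-envelope bandwidth check: can a link keep up with one iteration?
# All numbers are rough assumptions, not benchmarks.

fft_words      = 18_000_000    # ~18M double-precision words for a ~332M-bit exponent (assumed)
bytes_per_word = 8
passes         = 6             # forward FFT + inverse FFT + carries: several sweeps over the data
iter_seconds   = 0.010         # assume ~10 ms per squaring on a fast gpu of that era

data_bytes  = fft_words * bytes_per_word
needed_gbps = data_bytes * passes / iter_seconds / 1e9

links_gbps = {
    "on-device cache / local HBM": 400,   # order of magnitude only
    "dual-socket interconnect":     30,
    "PCIe 3.0 x16":                 16,
    "10 gigabit ethernet":          1.25,
}

print(f"working set: ~{data_bytes/1e6:.0f} MB, needs ~{needed_gbps:.0f} GB/s sustained")
for name, gbps in links_gbps.items():
    verdict = "ok" if gbps >= needed_gbps else "too slow"
    print(f"  {name:28s} ~{gbps:6.2f} GB/s  -> {verdict}")
[/CODE]

The conclusion matches the argument above: only memory local to the device doing the arithmetic is fast enough, so a single iteration has to stay on one gpu or one cpu package.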

If tackling a long run such as a 100Mdigit exponent primality test, the following are recommended:
  1. benchmarking,
  2. adjusting the number of cores per worker for a good throughput/latency tradeoff,
  3. testing for memory errors, and then doing at least one successful double check first,
  4. using PRP/GC for superior error detection and correction by repeating some iterations, as in gpuowl or prime95/mprime (not LL with the Jacobi check, which catches only ~50% of errors, and not naked LL with no Jacobi check, as in CUDALucas); a toy demonstration of the Gerbicz check appears after this list, and
  5. most importantly, first check the exponent's current status, and reserve the assignment. Check status at https://www.mersenne.org/report_exponent/ Reserve at https://www.mersenne.org/manual_assignment/ or https://www.mersenne.org/manual_gpu_assignment/
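
On item 4: the reason PRP/GC is so much safer is the Gerbicz check. For the PRP squaring chain u -> u^2 mod N, a running product of every L-th residue has to satisfy a simple algebraic identity, so a single bad squaring anywhere in a block is detected and the block can be redone. Below is a toy Python demonstration of that identity on a small Mersenne number; the function name, block length, and error injection are illustrative only, not how gpuowl or prime95 structure their code.

[CODE]
# Toy Gerbicz check for a base-3 PRP squaring chain mod N = 2^p - 1.
# d is the running product of every L-th residue; after each block,
# the new d must equal old_d^(2^L) * 3 (mod N).  Real programs checkpoint,
# roll back on failure, and run the expensive pow() check far less often.

def prp_with_gerbicz(p=127, L=8, blocks=4, inject_error_at=None):
    N = (1 << p) - 1
    u = 3                  # u_0, the PRP base
    d = 3                  # running product of checkpoint residues, starts at u_0
    all_ok = True
    for i in range(1, blocks * L + 1):
        u = u * u % N                      # one PRP squaring
        if i == inject_error_at:
            u ^= 1 << 20                   # simulate a hardware error: flip one bit
        if i % L == 0:                     # end of a block: run the Gerbicz check
            expected = pow(d, 1 << L, N) * 3 % N   # old_d^(2^L) * u_0
            d = d * u % N                          # extend the running product
            if d != expected:
                print(f"Gerbicz check failed at iteration {i}; redo the block")
                all_ok = False
    return all_ok

print(prp_with_gerbicz())                      # clean run: True
print(prp_with_gerbicz(inject_error_at=11))    # corrupted squaring is caught: False
[/CODE]

A bare LL run has no comparable self-check, which is why item 4 matters for a months-long 100Mdigit test.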
