mersenneforum.org Developer's corner

2021-03-02, 03:21   #13
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1BC4₁₆ Posts
LL with shift

See also Ernst's post on shift in Mlucas.
Think of the primality testing software's math routines as an emulation of a very wide word calculator. It dynamically adjusts to p bits width, where p is the prime exponent of Mp = 2^p - 1.

1) All the LL tests should use the same (before shifted) seed value, S0=4, not ten or the other known seed values, or bad seed values such as assorted other powers of two, so that the composite numbers' final test residues will be comparable. That's separate from considerations of shift. See https://www.mersenneforum.org/showpo...12&postcount=6

2) A very simple example with zero shift:
p=7, M7=127
S0=4 = 100.b
shift = 0, so the value in our emulated wide-word calculator is 4.
S1 = 4^2-2 = 14 = 1110.b
shift does not change from one iteration to the next.
S2 = 14^2-2 = 194 = 1 1000010 mod M7 = 67 = 1000011.b

3) Second run, same simple example except with shift 1:
p=7
S0=4
startingshift = 1, so the value in our simulated fft calculator lands one bit left, looking like 8 = 1000 but representing 100.0b. At this point previousshift=startingshift.
Square it, compute the new shift, and apply a shifted -2:
8^2 = 64 = 1000000 but representing 10000.00b.
shift = previousshift * 2 = 2
subtract 2 <
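The shift bookkeeping in the examples above can be sketched in a few lines of Python. This is only an illustration with plain integers, not code from any GIMPS program; the function name ll_with_shift and its parameters are made up for the example. It assumes the usual rule that squaring doubles the shift (mod p, since 2^p = 1 mod Mp) and that the -2 is applied at the new shift position.

```python
def ll_with_shift(p, shift=0, iterations=None):
    """Run LL iterations on M_p = 2^p - 1 with a shifted residue.

    Returns the *unshifted* residue after `iterations` steps
    (default p - 2, i.e. the full test). Hypothetical helper.
    """
    m = (1 << p) - 1
    if iterations is None:
        iterations = p - 2
    shift %= p
    s = (4 << shift) % m            # seed S0 = 4, moved left by `shift` bits
    for _ in range(iterations):
        s = (s * s) % m             # squaring doubles the shift
        shift = (2 * shift) % p     # 2^p = 1 (mod M_p), so shifts wrap mod p
        s = (s - (2 << shift)) % m  # subtract 2 at the new shift position
    # Undo the shift: multiplying by 2^(p - shift) divides by 2^shift mod M_p.
    return (s << ((p - shift) % p)) % m


# Every starting shift should give the same unshifted S2 for p = 7.
for start in range(7):
    print(start, ll_with_shift(7, start, iterations=2))
```

Whatever the starting shift, the unshifted residues agree, which is why runs with different shifts can still be compared by res64.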
2021-03-27, 21:58   #14
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7108₁₀ Posts
Challenges of large exponents

Server Support
TF
Up to 1G exponent at mersenne.org; 1G to 10G at mersenne.ca

P-1
Up to 1G exponent at mersenne.org; 1G to 2^32-1 (~4.29G) at mersenne.ca

LL
Up to 1G exponent at mersenne.org

PRP
Up to 1G exponent at mersenne.org, although note, currently the server is limited by its SSE2 instruction set to PRP proof file processing up to ~595.7M automatically. Processing higher exponents' proof files requires some manual intervention by George for the time being.

Software availability
TF
Mfaktx are capable of up to 2^32-1 (largest prime exponent 4,294,967,291). Mfakto is for AMD GPUs, Intel IGPs, or AMD IGPs; Mfaktc is for CUDA (NVIDIA) GPUs. Max supported bit level is 92 in mfakto, 95 in mfaktc; both are sufficient for the 91-92 bits thought optimal for recently released NVIDIA GPUs.

P-1
Mlucas added P-1 support at V20. It now supports exponents up to ~2^33. Its P-1 implementation does not include error checking by Jacobi symbol.

CUDAPm1 is theoretically capable of up to 2^31-1 (largest prime exponent 2,147,483,647) but in practice may reach to ~100Mdigit on a reasonably modern NVIDIA GPU.

Mprime/prime95 are capable of up to varying exponents depending on CPU type, with maximum 64M fft length on AVX512 allowing ~1169M.

Various versions of Gpuowl support P-1 either standalone or as part of a PRP run (but not both in the same version). The fft lengths supported are one determinant of the maximum exponent. The maximum exponent for P-1 may be less than for PRP, depending at least upon bounds and available GPU or system memory, and a bit more requirement for carry handling reducing the available bits/word on a given fft length. No gpuowl version to date supports gigadigit exponent P-1. I have performed complete P-1 runs to mersenne.ca recommended GPU72 bounds on local Radeon VII and Colab provided P100 GPUs up to 1Gbit. Current versions of gpuowl support fft lengths consistent with P-1 to about exponent 2^31-1. Most P-1 implementations are without offset, Jacobi symbol check, or GEC. GpuOwl V7.x supports both some GEC and Jacobi symbol checking during P-1 computations combined with PRP.

PRP
ONLY PRP with GEC & Proof should be considered for new first primality tests. ESPECIALLY for larger exponents.

Mprime/prime95 are capable of up to varying exponents depending on CPU type, with maximum 64M fft length on AVX512 allowing up to ~1169M.

Gpuowl in relatively current and efficient versions is capable of 120M fft lengths, nominally capable of 2172.36M per its help output.
Gpuowl around V6.5-84 was capable of PRP on gigadigit exponents and a bit more, 3339.40M indicated in the help output.

Mlucas below V20 is capable, with fft lengths up to 256M, of up to 2^32-1 (largest prime exponent 4,294,967,291). V20.x supports larger, including 512M fft lengths for F33 and so presumably up to ~M8,589,934,583 (which has a known factor). PRP proof capability is also being added. (V21?)

LL
LL should be considered for primality verification or reliability research purposes only, not for ordinary first tests! Even its use in LLDC is questionable.

Mprime/prime95 are capable of PRP & benchmarking up to varying exponents depending on cpu type, with maximum 64M fft length on AVX512 allowing ~1169M. However, through v30.7b8, LL is limited to ~922.6M on AVX512.

CUDALucas is nominally capable of up to 256M fft (for recent CUDA versions), which would be sufficient for ~2^32-1. However, its exponent variable is signed 32 bit, 2^31-1 (largest prime exponent 2,147,483,647), capping exponent for fft lengths 128M and above. In practice it is limited further, to ~1.43Gbits, determined empirically, as shown below.
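The two "largest prime exponent" figures quoted for 32-bit exponent variables can be checked independently. The following is a minimal sketch, not taken from any of the programs named here; the helper names are made up. It uses a Miller-Rabin test with bases 2, 7 and 61, which is known to be deterministic for all n below 2^32.

```python
def is_prime_u32(n):
    """Deterministic Miller-Rabin for n < 2^32 (bases 2, 7, 61)."""
    if n < 2:
        return False
    for q in (2, 3, 5, 7, 61):
        if n % q == 0:
            return n == q
    d, r = n - 1, 0
    while d % 2 == 0:      # write n - 1 = d * 2^r with d odd
        d //= 2
        r += 1
    for a in (2, 7, 61):
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = (x * x) % n
            if x == n - 1:
                break
        else:
            return False   # a is a witness that n is composite
    return True


def largest_prime_at_most(limit):
    """Largest prime <= limit, for limit < 2^32 (hypothetical helper)."""
    n = limit
    while not is_prime_u32(n):
        n -= 1
    return n


print(largest_prime_at_most(2**31 - 1))  # signed 32-bit exponent variable
print(largest_prime_at_most(2**32 - 1))  # unsigned 32-bit exponent variable
```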

Gpuowl in relatively current and efficient versions is capable of 120M fft lengths, though there may be issues in its upper reaches also. Nominally that is capable of 2172.36M.
Gpuowl around V6.5-84 was nominally capable of gigadigit exponents but did not include LL then.

Mlucas V19.x is capable, with fft lengths up to 256M, of up to 2^32-1 (largest prime exponent 4,294,967,291). V20.x supports larger, including 512M fft lengths for F33 and so presumably up to ~M8,589,934,583 (which has a known factor). Mlucas does not yet implement the Jacobi symbol check for LL testing.

Memory requirements
There are several requirements to consider: disk space, system ram, and sometimes GPU ram; also how much data may be transferred to and from the PrimeNet server in the case of PRP proof and certification.

TF: minor

P-1: large
Gpuowl typically requires at least 16 or 24 buffers available on the GPU for P-1 stage 2. This limits the maximum exponent according to GPU memory capacity. On 16GB GPUs, up to 1G exponents can be run in v6.11-380, and at least slightly higher. (M1,000,000,007 ran successfully to completion of both P-1 stages on a Radeon VII with -maxAlloc 13000 and 10 buffers in about a week.)
Mlucas v20.x uses multiples of 24 or 40 buffers, plus the equivalent of about 6 more for other data, so at OBD P-1 stage 2 requires ~64 GiB of system ram for 24 buffers minimum in stage 2; at F33, ~128 GiB of system ram. Stage 1 requirements are lower; ~16 GiB is adequate for stage 1 OBD.

PRP:
GPU-Z indicates usage of 2.7 GB GPU ram in Gpuowl v6.11-318 for exponent ~2^31.
Interim save files will be at least the binary packed size of the exponent, plus some words for housekeeping, such as file format, exponent, iteration, etc. Most applications save one or two residues in a save file. Mprime / prime95 for some cases will store up to 3. So a rough bound for prime95 save files is 3/8 p bytes each.
The number of save files saved is controlled by each program's settings.
For temporaries stored on disk for proof generation, their total size and timing depends on the application, exponent, and proof power.
Mprime / prime95 pre-allocates disk space sufficient for all the temporaries. Gpuowl adds individual temporaries files as the PRP computation progresses.
Mlucas ?

LL:
Storage requirements are similar to PRP without proof.
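The save file rule of thumb above (binary packed residue of p bits, up to 3 residues per file in prime95) works out as follows. This is a rough sketch only; the function names are made up, and the few housekeeping words per file are deliberately ignored.

```python
def residue_bytes(p):
    """Bytes needed to store one binary-packed residue of p bits."""
    return (p + 7) // 8


def save_file_bound_bytes(p, residues=3):
    """Rough upper bound for one save file holding `residues` residues.

    residues=3 is the prime95 worst case quoted above; housekeeping
    words (format, exponent, iteration, ...) are not counted.
    """
    return residues * residue_bytes(p)


p = 999999937
print(residue_bytes(p) / 1e6, "MB per residue")
print(save_file_bound_bytes(p) / 1e6, "MB for a 3-residue save file")
```

Multiply by the number of save files kept to get the disk space needed for interim files, before any proof temporaries.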

Run time scaling
Run time scaling for some specific hardware and software combinations can be found in the relevant software-specific reference threads.
TF: exponential in bit level, inversely proportional to exponent for a given bit level. Since the recommended bit level increases with exponent, the overall effect is that run time increases with exponent. Large exponents can require weeks of TF on a fast GPU to complete recommended levels.
P-1: to optimal bounds, about 1/30 of primality testing time.
Primality testing: scales as approximately exponent^2.1.
Run times of large exponents may exceed the probable hardware useful life, or the user's remaining life expectancy. A good backup regimen and transfer to replacement hardware is a workaround, if the user has enough patience and human life expectancy remaining. Very long undertakings may also need a succession plan for personnel.
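For a quick feel for these run times, an ETA can be estimated from an observed iteration time, and a known run time can be scaled to another exponent with the approximate exponent^2.1 rule. This is a back-of-envelope sketch; the function names are made up, and the iteration count is approximated by the exponent itself (LL needs p-2 iterations, PRP about p).

```python
SECONDS_PER_DAY = 86_400


def eta_days(exponent, iterations_done, us_per_iter):
    """Remaining run time in days at a constant iteration time."""
    remaining = exponent - iterations_done
    return remaining * us_per_iter / 1e6 / SECONDS_PER_DAY


def scaled_runtime(base_time, base_exponent, new_exponent, power=2.1):
    """Scale a known run time to another exponent; power 2.1 is approximate."""
    return base_time * (new_exponent / base_exponent) ** power


# Iteration time taken from the gpuowl v6.11-380 RX480 log later in this post.
print(round(eta_days(999999937, 1_000_000, 49236), 1), "days")
# Relative run time of a gigabit test versus a 100M test.
print(round(scaled_runtime(1.0, 100e6, 1e9), 1), "x")
```

The first figure lands close to the ETA gpuowl itself prints in that log (569d 07:00), which is a useful sanity check on the arithmetic.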

Software & Hardware combination reliability
Overall experience
Madpoo has checked error rate of primality tests on occasion. Most of the data points are at much smaller exponent than the current wavefronts. Run time was probably substantial at the time they were performed. Observed error rate was about 4% per exponent, 2% per LL test. These results date from before introduction to GIMPS testing of the LL Jacobi symbol check or PRP or GEC or proof generation. (See https://mersenneforum.org/showpost.p...1&postcount=67)
100M exponent experience
For the period October 1 2020 to March 26 2021, the observed rate of LL final residue error per test for exponents between 97M and 103M was ~0.94%. Presumably most of these were performed with mprime / prime95 with the Jacobi check, and some were with ECC system ram. (See first attachment.)
Extrapolating to approximately triple the exponent, ten times the run time, projects ~10% error rate at 100Mdigit; to ten times the exponent, 126 times the run time, projects a very high error rate for single-run gigabit exponents, ~30% chance of correct result, 70% chance of error in the result. Extrapolating further to gigadigit, 12.44 times the runtime of a gigabit test, near certainty of error, ~0.37 ppm chance of correct result.
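The extrapolation above can be reproduced with a simple model. This sketch assumes errors are independent and that the chance of error is constant per unit of run time, so a run k times as long is correct with probability (1 - e)^k; that model is an assumption made for the example, and the names are made up. The run time multiples (10, 126, 12.44) are the ones quoted above.

```python
BASE_ERROR_RATE = 0.0094   # observed LL error rate per test, 97M-103M


def chance_correct(runtime_multiple, base_error=BASE_ERROR_RATE):
    """Chance of a correct result for a run `runtime_multiple` times as long
    as a ~100M test, assuming independent errors at a constant rate."""
    return (1.0 - base_error) ** runtime_multiple


cases = [
    ("100Mdigit", 10),            # ~10x the run time of a 100M test
    ("gigabit", 126),             # ~126x
    ("gigadigit", 126 * 12.44),   # a further 12.44x beyond gigabit
]
for label, multiple in cases:
    ok = chance_correct(multiple)
    print(f"{label}: P(correct) = {ok:.3g}, P(error) = {1 - ok:.3g}")
```

Under these assumptions the figures come out near 9% error at 100Mdigit, about 30% chance of a correct result at gigabit, and well under 1 ppm at gigadigit, in line with the numbers quoted.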
Longer term average error rate was found to be higher, at ~2.5% for exponents 97M-103M. This is probably reflective both of runs without Jacobi symbol check and longer run times on slower hardware in previous years. (See second attachment.)
More detail on observed error rates can be found at https://www.mersenneforum.org/showpo...40&postcount=4 and its links.
Limited data on 100Mdigit
There are very few verified exponents from which to assess 100 Mdigit LL test error rate. The available data indicate about 21% error rate. Many of these were done long ago, probably with long run times and without the benefit of any Jacobi check. (See third attachment.)
Higher
There's very little data for above the 100Mdigit region. Most over 350M were done as paired LL runs with regular interim residue comparison, or as PRP with GEC. Extrapolations to gigabit / 300M digit indicate paired runs with frequent interim res64 comparison or other, new error checks or approaches will be required for LL testing. (See fourth attachment.)

CUDALucas is Lucas-Lehmer test capable only, not PRP. It lacks Jacobi symbol error check. The GEC that is so useful in keeping PRP runs reliable is not applicable to the LL residue sequence, so not applicable to CUDALucas. It is also substantially slower than recent Gpuowl (V6.11-3xx, v7.x-y) on the same hardware and assignment. Avoid CUDALucas. Use Gpuowl whenever possible.

100Mdigit experiments
Obtaining matching interim residues of 100Mdigit exponents is usually straightforward on available software and adequate reliability hardware. Matching interim residues were produced with independent runs on the first tries, among Mlucas and Gpuowl for LL, and Gpuowl for PRP. (See https://www.mersenneforum.org/showpost.php?p=546384 for details)

~gigabit or 300Mdigit experiments
Obtaining matching interim residues of 300Mdigit exponents involved 4 attempts to get two matching. Details follow; results are summarized at the above link.

interim residues (mostly unverified, except for green)
CUDALucas v2.06 on GTX1080Ti (LL, no Jacobi check, unverified; not recommended at 1.4 years estimated duration on a GTX1080Ti)
999999937 301,029,977 decimal digits
Code:
|  Jan 09  04:48:51  | M999999937     10000  0x567ad47461d3bb5f  | 57344K  0.21875  41.1682   41.16s  | 473:21:41:27   0.00%  |
|  Jan 20  08:14:38  | M999999937    100000  0xe776f4a0dcd3491d  | 57344K  0.18750  44.3713   44.37s  | 506:19:41:57   0.01%  |
|  Jan 20  19:26:39  | M999999937   1000000  0x141a108c13a86d5a  | 57344K  0.18750  45.3946   45.39s  | 516:19:59:55   0.10%  |
|  Jan 23  07:03:54  | M999999937   5000000  0x0811f10855dab84c  | 57344K  0.20313  44.9825   44.98s  | 516:10:56:38   0.50%  |
CUDALucas v2.06 on GTX1080 (LL, no Jacobi check, unverified; not recommended at 1.85 years estimated duration on a GTX1080)
Code:
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Mar 27  13:24:34  | M999999937     10000  0xd0ba62caab74a325  | 57344K  0.20313  58.3992  224.72s  | 671:06:41:49   0.00%  |
|  Mar 27  13:34:21  | M999999937     20000  0x3b0de44a71284153  | 57344K  0.20313  58.7176  587.17s  | 675:10:19:38   0.00%  |
|  Mar 27  13:44:08  | M999999937     30000  0xee988cb77112ea03  | 57344K  0.20313  58.7149  587.14s  | 676:19:10:58   0.00%  |
|  Mar 27  13:53:55  | M999999937     40000  0x906263cb2b36ac4c  | 57344K  0.20313  58.7086  587.08s  | 677:11:05:19   0.00%  |
|  Mar 27  14:03:43  | M999999937     50000  0x454a88b55988ce1e  | 57344K  0.21094  58.7162  587.16s  | 677:20:59:32   0.00%  |
(deleted this bad run's set of interim residue files)
Any residues produced after a known-bad residue (red or purple here) will also be bad. The grayed lines are preserved since they show iteration speed variation.

Gpuowl (which has the Jacobi check for LL in some versions) on RX480, not recommended at 1.56 years or more depending on gpuowl version; GTX1080 similar; Radeon VII marginally feasible at ~6 months
Code:
2019-02-08 17:34:48 gpuowl v6.2-e2ffe65
2019-02-08 17:34:48 condorella/rx480 -user kriesel -cpu condorella/rx480 -device 0 -fft +0
2019-02-08 17:34:48 condorella/rx480 999999937 FFT 73728K: Width 256x8, Height 256x8, Middle 9; 13.25 bits/word
2019-02-08 17:34:48 condorella/rx480 using long carry kernels
2019-02-08 17:34:54 condorella/rx480 OpenCL compilation in 5333 ms, with  "-DEXP=999999937u -DWIDTH=2048u -DSMALL_HEIGHT=2048u -DMIDDLE=9u  -I.  -cl-fast-relaxed-math -cl-std=CL2.0"
2019-02-08 17:37:08 condorella/rx480 999999937 OK      800  0.00%; 71.64 ms/sq; ETA 829d 05:22; c3c8e02da339fdfa (check 31.90s)
2019-02-08 17:48:10 condorella/rx480 999999937       10000  0.00%; 72.02 ms/sq; ETA 833d 14:01; a30a0c45e9fb828c
2019-02-08 21:16:26 condorella/rx480 999999937      100000  0.01%; 74.38 ms/sq; ETA 860d 18:51; 3efc806b68d92b86
gpuowl v6.10-9-g54cba1d on RX480 continuation from 102000 to 495600
2021-03-15 23:16:33 999999937 FFT 57344K: Width 256x8, Height 256x8, Middle 7; 17.03 bits/word

gpuowl v6.11-380-g79ea0cc on RX480 continuation
Code:
2021-03-16 06:23:01 condorella/rx480 999999937 FFT: 56M 4K:14:512 (17.03 bpw)
2021-03-16 06:23:01 condorella/rx480 Expected maximum carry32: 833C0000
2021-03-16 06:51:54 condorella/rx480 999999937 OK   600000   0.06%; 49222 us/it; ETA 569d 08:39; c974a7f641226218 (check 21.93s)
2021-03-16 09:36:25 condorella/rx480 999999937 OK   800000   0.08%; 49242 us/it; ETA 569d 11:23; 321dd77853313c0c (check 22.04s)
2021-03-16 12:20:54 condorella/rx480 999999937 OK  1000000   0.10%; 49236 us/it; ETA 569d 07:00;  e1b53ccf581f928b (check 22.08s)
(deleted this bad run's set of interim residue files)

gpuowl v6.11-380-g79ea0cc independent run from start on gtx 1080
Code:
2021-03-27 14:28:07 gpuowl v6.11-380-g79ea0cc
2021-03-27 14:28:07 config: -device 1 -user kriesel -cpu asr3/gtx1080 -maxAlloc 6500 -proof 9 -cleanup -yield -use NO_ASM
2021-03-27 14:28:07 device 1, unique id ''
2021-03-27 14:28:07 asr3/gtx1080 999999937 FFT: 56M 4K:14:512 (17.03 bpw)
2021-03-27 14:28:07 asr3/gtx1080 Expected maximum carry32: 833C0000
2021-03-27 14:28:16 asr3/gtx1080 OpenCL args "-DEXP=999999937u  -DWIDTH=4096u -DSMALL_HEIGHT=512u -DMIDDLE=14u -DPM1=0 -DCARRY64=1  -DWEIGHT_STEP_MINUS_1=0xf.57fb440c6997p-4  -DIWEIGHT_STEP_MINUS_1=-0xf.aa3b4ca84faap-5 -DNO_ASM=1   -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2021-03-27 14:28:22 asr3/gtx1080

2021-03-27 14:28:22 asr3/gtx1080 OpenCL compilation in 6.10 s
2021-03-27 14:28:24 asr3/gtx1080 999999937 LL        0 loaded: 0000000000000004
2021-03-27 14:29:14 asr3/gtx1080 Stopping, please wait..
2021-03-27 14:29:16 asr3/gtx1080 999999937 LL     1000   0.00%; 52191 us/it; ETA 604d 01:29; ddadfed64e080856
(Jacobi check on exit passed; continuing:)
2021-03-27 14:55:18 asr3/gtx1080 999999937 LL    10000   0.00%; 50442 us/it; ETA 583d 19:32; 567ad47461d3bb5f
2021-03-27 15:03:46 asr3/gtx1080 999999937 LL    20000   0.00%; 50782 us/it; ETA 587d 17:50; 78a2f270a1bba92d
2021-03-27 15:12:14 asr3/gtx1080 999999937 LL    30000   0.00%; 50794 us/it; ETA 587d 21:05; d79b942904525426
2021-03-27 15:20:42 asr3/gtx1080 999999937 LL    40000   0.00%; 50823 us/it; ETA 588d 05:02; 650ca8c106ac7d12
2021-03-27 15:29:11 asr3/gtx1080 999999937 LL    50000   0.01%; 50869 us/it; ETA 588d 17:34; 98420630c9a4b877
2021-03-27 15:37:39 asr3/gtx1080 999999937 LL    60000   0.01%; 50850 us/it; ETA 588d 12:03; adb31fa7cf23b0ba
2021-03-27 15:46:08 asr3/gtx1080 999999937 LL    70000   0.01%; 50847 us/it; ETA 588d 11:13; 20b88580b0d5c8a5
2021-03-27 15:54:38 asr3/gtx1080 999999937 LL    80000   0.01%; 51002 us/it; ETA 590d 06:05; 2b4ab3e44fcbdc84
2021-03-27 16:03:07 asr3/gtx1080 999999937 LL    90000   0.01%; 50957 us/it; ETA 589d 17:20; b9c6eeca4e553904
2021-03-27 16:11:37 asr3/gtx1080 999999937 LL   100000   0.01%; 50972 us/it; ETA 589d 21:24; e776f4a0dcd3491d
(there was a long dormant period for the partial run)
2022-04-19 07:41:53 asr3/gtx1080 999999937 OK   900000 (jacobi == -1)
2022-04-19 08:24:45 asr3/gtx1080 999999937 LL  1000000   0.10%; 51451 us/it; ETA 594d 21:37; 141a108c13a86d5a
2022-04-19 09:07:37 asr3/gtx1080 999999937 LL  1050000   0.11%; 51422 us/it; ETA 594d 12:53; b1b3f4ff1e3553fb
2022-04-19 09:07:37 asr3/gtx1080 999999937 OK  1000000 (jacobi == -1)
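Comparing interim res64 values between programs is easier after normalizing the formats (CUDALucas prints a 0x prefix, Mlucas prints upper case). A small sketch, with made-up helper names, using residues copied from the M999999937 logs above:

```python
def normalize_res64(text):
    """Lower-case, strip any 0x prefix, pad to 16 hex digits."""
    t = text.strip().lower()
    if t.startswith("0x"):
        t = t[2:]
    return t.zfill(16)


def first_mismatch(run_a, run_b):
    """First shared iteration where two runs disagree, or None."""
    for iteration in sorted(set(run_a) & set(run_b)):
        if normalize_res64(run_a[iteration]) != normalize_res64(run_b[iteration]):
            return iteration
    return None


cudalucas_1080ti = {10000: "0x567ad47461d3bb5f", 100000: "0xe776f4a0dcd3491d",
                    1000000: "0x141a108c13a86d5a"}
cudalucas_1080 = {10000: "0xd0ba62caab74a325", 20000: "0x3b0de44a71284153"}
gpuowl_gtx1080 = {10000: "567ad47461d3bb5f", 20000: "78a2f270a1bba92d",
                  100000: "e776f4a0dcd3491d", 1000000: "141a108c13a86d5a"}

print(first_mismatch(cudalucas_1080ti, gpuowl_gtx1080))  # None: runs agree
print(first_mismatch(cudalucas_1080, gpuowl_gtx1080))    # 10000: bad from the start
```

Once a mismatch is found, everything after it in at least one of the two runs is suspect, so only the iterations before the first mismatch count as cross-checked.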
CUDALucas V2.06beta on V100 (LL, no Jacobi check, unverified, ~6 months; interim residue extracted by ramgeis from an interim save file using a perl program provided by Kriesel) This run was preceded and followed by successful LL DC runs. However, it was not a paired run with interim residues cross-checked, so its reliability over the long run time is uncertain.
999999937 Iteration 110,000,000 Residue64 0x87ded9d3ab3e7f38

The PRP equivalent is feasible marginally with Gpuowl on a Radeon VII GPU, at ~0.5 years to complete; LL in Gpuowl on a Radeon VII is also possible but not recommended since it's only protected by an occasional Jacobi check's ~50% chance of detection of an error. Such long runs are likely to go wrong with undetected errors.
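The Jacobi check referred to above rests on a property of the LL sequence with seed 4: for every iteration i >= 1, the Jacobi symbol of (S_i - 2) with respect to Mp is -1. A corrupted residue behaves roughly like a random value and so passes about half the time, which is where the ~50% detection figure comes from. The sketch below uses small exponents and plain Python integers; the function names are made up and it is not how gpuowl or prime95 implement the check.

```python
def jacobi(a, n):
    """Jacobi symbol (a/n) for odd n > 0."""
    a %= n
    result = 1
    while a:
        while a % 2 == 0:
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a                       # quadratic reciprocity
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0


def ll_sequence(p):
    """LL residues S_0 .. S_(p-2) modulo M_p, seed 4."""
    m = (1 << p) - 1
    seq = [4 % m]
    for _ in range(p - 2):
        seq.append((seq[-1] * seq[-1] - 2) % m)
    return seq


def ll_jacobi_ok(s, p):
    """True if residue s passes the Jacobi check for exponent p."""
    m = (1 << p) - 1
    return jacobi((s - 2) % m, m) == -1


for p in (7, 11, 13):
    print(p, all(ll_jacobi_ok(s, p) for s in ll_sequence(p)[1:]))
```

Passing the check does not prove a residue is correct; it only fails to show it is wrong, so it is a much weaker safeguard than GEC is for PRP.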
Gpuowl v6.2-e2ffe65 PRP on RX480 GPU:
Code:
2019-02-08 17:48:10 condorella/rx480 999999937       10000  0.00%; 72.02 ms/sq; ETA 833d 14:01; a30a0c45e9fb828c
2019-02-08 21:16:26 condorella/rx480 999999937      100000  0.01%; 74.38 ms/sq; ETA 860d 18:51; 3efc806b68d92b86
continuation on gpuowl v6.11-380-g79ea0cc on RX480, quite a speed improvement, 1.51X over v6.2, but still not recommended at ~1.56 years on an RX480:
Code:
2021-03-16 12:20:54 condorella/rx480 999999937 OK  1000000   0.10%; 49236 us/it; ETA 569d 07:00; e1b53ccf581f928b (check 22.08s)
Mlucas v20.1.1 on dual-xeon-e5-2690 confirmed some of the gpuowl PRP interim res64s; iteration times differ because of a pause in prime95 competing for CPU cycles.
Code:
[2021-11-13 00:04:26] M999999937 Iter# = 10000 [ 0.00% complete] clocks = 01:12:27.209 [434.7210 msec/iter] Res64: A30A0C45E9FB828C. AvgMaxErr = 0.033546355. MaxErr = 0.046875000. Residue shift count = 118809714.
[2021-11-13 08:19:25] M999999937 Iter# = 100000 [ 0.01% complete] clocks = 00:33:06.763 [198.6764 msec/iter] Res64: 3EFC806B68D92B86. AvgMaxErr = 0.030404632. MaxErr = 0.039062500. Residue shift count = 423797417.
Mprime/prime95 through v30.7b8 will not run LL iterations on exponents above ~922.6M, even on AVX512 hardware, so there are no prime95 LL interim res64 results to compare.

1.25Gbit experiment
CUDALucas on GTX1080 wrote a save file and stopped upon request.
Code:
Using threads: square 512, splice 64.
Starting M1250000033 fft length = 73728K
SIGINT caught, writing checkpoint. Estimated time spent so far: 12:18
A resumption produced
Code:
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Mar 28  13:21:04  | M1250000033     10000  0xf40e592716cd5c7a  | 73728K  0.12500  75.3094   11.14s  | 1084:17:26:53   0.00%  |
This residue is unverified and conflicts with the one produced by gpuowl v6.11-380 on a Radeon VII gpu, and Mlucas self test.
Code:
2021-03-30 18:45:38 gpuowl v6.11-380-g79ea0cc
2021-03-30 18:45:38 config: -user kriesel -cpu asr2/radeonvii4 -d 4 -use NO_ASM -maxAlloc 15000 -cleanup -block 1000 -log 10000
2021-03-30 18:45:38 device 4, unique id ''
2021-03-30 18:45:38 asr2/radeonvii4 1250000033 FFT: 72M 4K:9:1K (16.56 bpw)
2021-03-30 18:45:38 asr2/radeonvii4 Expected maximum carry32: 6CC80000
2021-03-30 18:45:49 asr2/radeonvii4 OpenCL args "-DEXP=1250000033u -DWIDTH=4096u -DSMALL_HEIGHT=1024u -DMIDDLE=9u -DPM1=0 -DAMDGPU=1 -DWEIGHT_STEP_MINUS_1=0xb.819f9530b86cp-5 -DIWEIGHT_STEP_MINUS_1=-0x8.76945b26097f8p-5 -DNO_ASM=1  -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2021-03-30 18:45:54 asr2/radeonvii4 OpenCL compilation in 4.79 s
2021-03-30 18:48:54 asr2/radeonvii4 1250000033 LL    10000   0.00%; 17459 us/it; ETA 252d 13:58; 5a7f1ee464e7c654
2021-03-30 18:51:48 asr2/radeonvii4 1250000033 LL    20000   0.00%; 17428 us/it; ETA 252d 03:27; 1bdc8fcb27f794f5
2021-03-30 18:52:29 asr2/radeonvii4 1250000033 LL    22000   0.00%; 20088 us/it; ETA 290d 14:53; b2fc2c1ace615898
2021-03-30 18:52:29 asr2/radeonvii4 waiting for the Jacobi check to finish..
2021-03-30 19:11:16 asr2/radeonvii4 1250000033 OK    22000 (jacobi == -1)
Mlucas V20.1.1 self tests on 4 of 4 available radix sets on i5-1035G1 confirm the Gpuowl 10000 iter res64.
Code:
INFO: Maximum recommended exponent for FFT length (73728 Kdbl) = 1315825018; p[ = 1250000033]/pmax_rec = 0.9499743628.
Initial DWT-multipliers chain length = [long] in carry step.
M1250000033: using FFT length 73728K = 75497472 8-byte floats, initial residue shift count = 354702250
this gives an average   16.556846208042568 bits per digit
Using complex FFT radices        36        16        16        16        16        16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
10000 iterations of M1250000033 with FFT length 75497472 = 73728 K, final residue shift count = 1034108523
Res64: 5A7F1EE464E7C654. AvgMaxErr = 0.085587384. MaxErr = 0.113281250. Program: E20.1.1
This exponent (or >~1169M) is beyond the reach of mprime / prime95 through v30.7 in LL or PRP.
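The bits/word (bpw, "bits per digit") figures in the gpuowl and Mlucas logs above are simply the exponent divided by the FFT length in words. A minimal sketch, with a made-up function name, assuming the FFT length is given in K (1024 words):

```python
def bits_per_word(exponent, fft_length_k):
    """Average bits stored per FFT word for a given exponent and FFT length in K."""
    words = fft_length_k * 1024
    return exponent / words


print(bits_per_word(1250000033, 73728))            # Mlucas log: 16.5568...
print(round(bits_per_word(999999937, 57344), 2))   # gpuowl 56M fft: 17.03
print(round(bits_per_word(999999937, 73728), 2))   # gpuowl v6.2 72M fft: 13.25
```

Higher bits/word means less headroom for roundoff error at a given FFT length, which is one reason the largest exponent a program accepts for each FFT length differs between programs.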
CUDALucas on GTX1080Ti failed to match residues in 4 of 4 attempts by batch file, and rapidly crashed.
Code:
CUDALucas v2.06beta 64-bit build, compiled May  5 2017 @ 13:02:54
...

Using threads: square 32, splice 1024.
Starting M1250000033 fft length = 73728K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Nov 10  08:34:52  | M1250000033     10000  0x5b7eb1ec98fce96f  | 73728K  0.12305  51.9534  519.53s  | 751:15:14:39   0.00%  |

(program crash & auto restart by batch file)

Using threads: square 32, splice 1024.
Starting M1250000033 fft length = 73728K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Nov 10  08:52:32  | M1250000033     10000  0x904cf0517c82f343  | 73728K  0.11816  52.2756  522.75s  | 756:07:06:22   0.00%  |
|  Nov 10  09:01:15  | M1250000033     20000  0x10231dfa9495f5c3  | 73728K  0.12500  52.3162  523.16s  | 756:14:01:09   0.00%  |

(program crash & auto restart by batch file)

Using threads: square 32, splice 1024.
Starting M1250000033 fft length = 73728K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Nov 10  09:19:48  | M1250000033     10000  0xffffffffffffffff  | 73728K  0.11914  52.1692  521.69s  | 754:18:11:06   0.00%  |
|  Nov 10  09:28:32  | M1250000033     20000  0x0000000000000000  | 73728K  0.11719  52.4310  524.31s  | 756:15:29:04   0.00%  |
Illegal residue: 0x0000000000000000. See mersenneforum.org for help.

(program crash & auto restart by batch file)

Using threads: square 32, splice 1024.
Starting M1250000033 fft length = 73728K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Nov 10  09:47:19  | M1250000033     10000  0x1ac66811a2df6b09  | 73728K  0.12500  51.9302  519.30s  | 751:07:12:15   0.00%  |
|  Nov 10  09:56:03  | M1250000033     20000  0xa82b4d1e98e888a5  | 73728K  0.12109  52.3941  523.94s  | 754:15:35:19   0.00%  |

(batch file gives up after 4 attempts)
Mlucas v20.1.1 2022-03-20 PRP:
Code:
[2022-05-09 20:09:55] M1250000033 Iter# = 100000 [ 0.01% complete] clocks = 08:41:50.153 [313.1015 msec/iter] Res64: 95C8FAB6227CB3AE. AvgMaxErr = 0.128896556. MaxErr = 0.187500000. Residue shift count = 0.
1.4xGbit experiments
These were performed after the 2Gbit, 1.5Gbit, and 1.25Gbit experiments, to determine the approximate transition point for CUDALucas. The following are in exponent order, not chronological order. Note that after the initial experimentation on GTX1080, running PRP/GEC in gpuowl on the GTX1080 GPU showed a significant error rate. So the experiments may reflect that GPU's unreliability more than CUDALucas reliability. Retesting CUDALucas is being performed on a different GPU.

1.40G
(CUDALucas on GTX1080, system ram does not have ECC)
Code:
Using threads: square 1024, splice 32.
Starting M1400000197 fft length = 81920K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Mar 27  09:32:44  | M1400000197     10000  0x3064425057775e29  | 81920K  0.17188  85.1755  851.75s  | 1380:03:35:56   0.00%  |
exit at Sat 03/27/2021  9:47:05.96
(CUDALucas v2.06 on GTX1080Ti, system ram does not have ECC)
Code:
Starting M1400000197 fft length = 81920K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Nov 09  12:23:40  | M1400000197     10000  0xb98557f593be218e  | 81920K  0.16406  59.9727  599.72s  | 971:18:34:45   0.00%  |
|  Nov 09  12:33:49  | M1400000197     20000  0xb391b97d3409a2b5  | 81920K  0.17188  60.8700  608.70s  | 979:00:53:16   0.00%  |
Mlucas V20.1.1 on i5-1035G1: produced matching res64 23BEDBCBD2F66C97 for each of 4 usable radix sets. For example:
Code:
INFO: Maximum recommended exponent for FFT length (81920 Kdbl) = 1458483632; p[ = 1400000197]/pmax_rec = 0.9599012058.
Initial DWT-multipliers chain length = [long] in carry step.
M1400000197: using FFT length 81920K = 83886080 8-byte floats, initial residue shift count = 1225430952
this gives an average   16.689302885532378 bits per digit
Using complex FFT radices       320        16        16        16        32
Using 8 threads in carry step
10000 iterations of M1400000197 with FFT length 83886080 = 81920 K, final residue shift count = 1261367293
Res64: 23BEDBCBD2F66C97. AvgMaxErr = 0.135606808. MaxErr = 0.187500000. Program: E20.1.1
A completely independent run on Gpuowl v6.11-380 & Radeon VII matches.
Code:
2021-11-09 12:49:24 gpuowl v6.11-380-g79ea0cc
2021-11-09 12:49:24 config: -device 3 -user kriesel -cpu asr2/radeonii3 -block 1000 -log 10000 -use NO_ASM -proof 9
2021-11-09 12:49:24 device 3, unique id ''
2021-11-09 12:49:24 asr2/radeonii3 1400000197 FFT: 80M 4K:10:1K (16.69 bpw)
2021-11-09 12:49:24 asr2/radeonii3 Expected maximum carry32: 7E760000
2021-11-09  12:49:37 asr2/radeonii3 OpenCL args "-DEXP=1400000197u -DWIDTH=4096u  -DSMALL_HEIGHT=1024u -DMIDDLE=10u -DPM1=0 -DAMDGPU=1 -DCARRY64=1  -DWEIGHT_STEP_MINUS_1=0xf.6130165dfeb5p-6  -DIWEIGHT_STEP_MINUS_1=-0xc.665dab2c7ba2p-6 -DNO_ASM=1   -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2021-11-09 12:49:42 asr2/radeonii3 OpenCL compilation in 5.08 s
2021-11-09 12:53:13 asr2/radeonii3 1400000197 LL    10000   0.00%; 20439 us/it; ETA 331d 04:27; 23bedbcbd2f66c97
2021-11-09 12:56:38 asr2/radeonii3 1400000197 LL    20000   0.00%; 20449 us/it; ETA 331d 08:18; 75f7dcd66b5c154a
2021-11-09 12:56:38 asr2/radeonii3 waiting for the Jacobi check to finish..
The Jacobi check is uncharacteristically long here for three reasons:
• A very large exponent.
• The CPU is a Celeron G1840.
• The CPU cooler fan is seized (rpm=0); replacement ordered.

1.43G is about the highest that would run in CUDALucas on the GTX1080 long enough to produce any printed interim residues; 1.45Gbit and higher failed early.
Code:
Starting M1430000027 fft length = 81920K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Mar 27  11:28:40  | M1430000027     10000  0xa58244f8b0e73cde  | 81920K  0.26563  85.7222  857.22s  | 1418:18:31:51   0.00%  |
|  Mar 27  11:42:59  | M1430000027     20000  0x210de9223d1c0ca0  | 81920K  0.29688  85.9697  859.69s  | 1420:19:27:07   0.00%  |
|  Mar 27  11:57:19  | M1430000027     30000  0xf17cde78a6256ad9  | 81920K  0.28125  85.9585  859.58s  | 1421:10:07:19   0.00%  |
|  Mar 27  12:11:38  | M1430000027     40000  0x6863e2b56e6e5d89  | 81920K  0.28125  85.9554  859.55s  | 1421:17:01:24   0.00%  |
exit at Sat 03/27/2021 12:26:00.77
Compare the above 10000 iter residue to gpuowl v6.11-380's Jacobi checked output 19b5110e4fd08ef6. Note the ~20msec/iter and ~11 month ETA on a Radeon VII GPU set to operate at no more than 80% of nominal power rating, and with GPU ram clocked ~1180 MHz.
Code:
2021-11-09 13:30:05 gpuowl v6.11-380-g79ea0cc
2021-11-09 13:30:05 config: -device 3 -user kriesel -cpu asr2/radeonii3 -block 1000 -log 10000 -use NO_ASM -proof 9
2021-11-09 13:30:05 device 3, unique id ''
2021-11-09 13:30:05 asr2/radeonii3 worktodo.txt line ignored: ";DoubleCheck=1400000197"
2021-11-09 13:30:05 asr2/radeonii3 1430000027 FFT: 80M 4K:10:1K (17.05 bpw)
2021-11-09 13:30:05 asr2/radeonii3 Expected maximum carry32: A20A0000
2021-11-09 13:30:17 asr2/radeonii3 OpenCL args "-DEXP=1430000027u -DWIDTH=4096u -DSMALL_HEIGHT=1024u -DMIDDLE=10u -DPM1=0 -DAMDGPU=1 -DCARRY64=1 -DMAX_ACCURACY=1 -DWEIGHT_STEP_MINUS_1=0xe.f9d0543ab1a5p-4 -DIWEIGHT_STEP_MINUS_1=-0xf.7892905b96ebp-5 -DNO_ASM=1  -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2021-11-09 13:30:22 asr2/radeonii3 OpenCL compilation in 5.13 s
2021-11-09 13:33:50 asr2/radeonii3 1430000027 LL    10000   0.00%; 20141 us/it; ETA 333d 08:23; 19b5110e4fd08ef6
Or Mlucas V20.1.1 on i5-1035G1: res64 19B5110E4FD08EF6 for each of 4 usable radix sets, with the best ~849. ms/iter, corresponding to ~ 38.6 years run time to completion on this 15W CPU package laptop. For example:
Code:
INFO: Maximum recommended exponent for FFT length (81920 Kdbl) = 1458483632; p[ = 1430000027]/pmax_rec = 0.9804703979.
Initial DWT-multipliers chain length = [short] in carry step.
M1430000027: using FFT length 81920K = 83886080 8-byte floats, initial residue shift count = 492342517
this gives an average   17.046928727626799 bits per digit
Using complex FFT radices       320        16        16        16        32
Using 8 threads in carry step
10000 iterations of M1430000027 with FFT length 83886080 = 81920 K, final residue shift count = 891887602
Res64: 19B5110E4FD08EF6. AvgMaxErr = 0.217721720. MaxErr = 0.281250000. Program: E20.1.1
A CUDALucas v2.06 attempt on GTX1080Ti ran, but badly, from early on. 0xffffffffffffffff is generally a known-bad LL res64 value. Also repeating residues are highly unlikely to be correct.
Code:
Using threads: square 1024, splice 32.
Starting M1430000027 fft length = 81920K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Nov 09  12:56:20  | M1430000027     10000  0xffffffffffffffff  | 81920K  0.28125  60.7330  607.33s  | 1005:04:21:07   0.00%  |
|  Nov 09  13:06:27  | M1430000027     20000  0xffffffffffffffff  | 81920K  0.28125  60.7226  607.22s  | 1005:02:06:31   0.00%  |
|  Nov 09  13:16:35  | M1430000027     30000  0x24fc53761b0533fc  | 81920K  0.28125  60.7770  607.77s  | 1005:08:27:36   0.00%  |
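The two warning signs just described (a known-bad res64, or a res64 that repeats between checkpoints) are easy to screen for when reading logs. This is a rough Python sketch, not part of CUDALucas or any other GIMPS program; the function names are made up for illustration, and the set of bad values contains only the two called out in this post.

```python
# Known-bad interim LL res64 values mentioned in this post.
KNOWN_BAD_RES64 = {0x0000000000000000, 0xFFFFFFFFFFFFFFFF}


def parse_res64(text: str) -> int:
    """Accept '0x...' or bare hex, upper or lower case."""
    return int(text, 16)


def res64_match(a: str, b: str) -> bool:
    """Compare res64 values reported by different programs."""
    return parse_res64(a) == parse_res64(b)


def suspicious_checkpoints(residues):
    """Return (index, reason) for each checkpoint residue that looks wrong."""
    flagged = []
    seen = set()
    for i, text in enumerate(residues):
        value = parse_res64(text)
        if value in KNOWN_BAD_RES64:
            flagged.append((i, "known-bad value"))
        elif value in seen:
            flagged.append((i, "repeated residue"))
        seen.add(value)
    return flagged


# The three checkpoint residues from the CUDALucas run above.
run = ["0xffffffffffffffff", "0xffffffffffffffff", "0x24fc53761b0533fc"]
print(suspicious_checkpoints(run))
```

A clean screen proves nothing by itself; comparison against an independent run, as done in this post, is still needed.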
A CUDALucas retry from start on the GTX1080Ti produced
Code:
Starting M1430000027 fft length = 81920K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Nov 09  13:33:15  | M1430000027     10000  0x050701ca0c98d890  | 81920K  0.28125  60.5683  605.68s  | 1002:10:56:38   0.00%  |
ETA ~1002.5 days, or ~2.75 years. This res64 also disagrees with the 19B5110E4FD08EF6 from Mlucas.

An independent try on the same GTX1080Ti in gpuowl produced a res64 matching the Mlucas value:
Code:
2021-11-09 14:25:34 gpuowl v6.11-380-g79ea0cc
2021-11-09 14:25:34 config: -device 0 -user kriesel -cpu test/GTX1080Ti -maxAlloc 7500 -proof 9 -use NO_ASM -log 10000 -yield
2021-11-09 14:25:34 device 0, unique id ''
2021-11-09 14:25:34 test/GTX1080Ti 1430000027 FFT: 80M 4K:10:1K (17.05 bpw)
2021-11-09 14:25:34 test/GTX1080Ti Expected maximum carry32: A20A0000
2021-11-09 14:25:46 test/GTX1080Ti OpenCL args "-DEXP=1430000027u -DWIDTH=4096u -DSMALL_HEIGHT=1024u -DMIDDLE=10u -DPM1=0 -DCARRY64=1 -DMAX_ACCURACY=1 -DWEIGHT_STEP_MINUS_1=0xe.f9d0543ab1a5p-4 -DIWEIGHT_STEP_MINUS_1=-0xf.7892905b96ebp-5 -DNO_ASM=1  -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2021-11-09 14:25:46 test/GTX1080Ti

2021-11-09 14:25:46 test/GTX1080Ti OpenCL compilation in 0.02 s
2021-11-09 14:25:49 test/GTX1080Ti 1430000027 LL        0 loaded: 0000000000000004
2021-11-09 14:35:02 test/GTX1080Ti 1430000027 LL    10000   0.00%; 55245 us/it; ETA 914d 08:19; 19b5110e4fd08ef6
Note the 914. day = 2.5 year ETA.
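The ETA figures quoted throughout follow from the per-iteration time and the remaining iteration count. A small Python sketch, under the assumption that a test takes about p iterations in total (LL needs p-2, which makes no visible difference at this size); the names are illustrative.

```python
SECONDS_PER_DAY = 86400
DAYS_PER_YEAR = 365.25


def eta_days(exponent: int, iterations_done: int, us_per_iter: float) -> float:
    """Remaining run time in days, assuming ~exponent iterations in total."""
    remaining = exponent - iterations_done
    return remaining * us_per_iter * 1e-6 / SECONDS_PER_DAY


def eta_years(exponent: int, iterations_done: int, us_per_iter: float) -> float:
    return eta_days(exponent, iterations_done, us_per_iter) / DAYS_PER_YEAR


# gpuowl on the GTX1080Ti: 55245 us/it at iteration 10000 of M1430000027.
print(f"{eta_days(1430000027, 10000, 55245):.1f} days")
print(f"{eta_years(1430000027, 10000, 55245):.2f} years")
```

The result differs from the logged ETA by a few minutes, which is expected since the logged us/it is rounded.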

At 1.44G CUDALucas struggled and crashed on the GTX1080.
Code:
Using threads: square 32, splice 32.
Starting M1440000083 fft length = 82944K
Round off error at iteration = 100, err = 0.5 > 0.35, fft = 82944K.
Restarting from last checkpoint to see if the error is repeatable.

Using threads: square 32, splice 32.
Starting M1440000083 fft length = 82944K
Round off error at iteration = 100, err = 0.5 > 0.35, fft = 82944K.
The error persists.
Trying a larger fft until the next checkpoint.

Using threads: square 32, splice 128.
Starting M1440000083 fft length = 84672K
CUDALucas ran on the GTX1080Ti but oscillated among fft lengths, costing ~5% of run time.
Code:
Using threads: square 32, splice 128.
Starting M1440000083 fft length = 82944K
Round off error at iteration = 100, err = 0.5 > 0.35, fft = 82944K.
Restarting from last checkpoint to see if the error is repeatable.

Using threads: square 32, splice 128.
Starting M1440000083 fft length = 82944K
Round off error at iteration = 100, err = 0.5 > 0.35, fft = 82944K.
The error persists.
Trying a larger fft until the next checkpoint.

Using threads: square 32, splice 1024.
Starting M1440000083 fft length = 84672K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Nov 09  22:14:53  | M1440000083     10000  0x58961e35a62cf0a1  | 84672K  0.14063  66.4872  664.87s  | 1108:02:43:39   0.00%  |
Resettng fft.
Gpuowl on the same GTX1080Ti ran normally and passed the Jacobi check:
Code:
2021-11-11 04:29:06 gpuowl v6.11-380-g79ea0cc
2021-11-11 04:29:06 config: -device 0 -user kriesel -cpu test/GTX1080Ti -maxAlloc 7500 -proof 9 -use NO_ASM -log 10000 -yield
2021-11-11 04:29:06 device 0, unique id ''
2021-11-11 04:29:06 test/GTX1080Ti 1440000083 FFT: 80M 4K:10:1K (17.17 bpw)
2021-11-11 04:29:06 test/GTX1080Ti Expected maximum carry32: AFFF0000
2021-11-11 04:29:16 test/GTX1080Ti OpenCL args "-DEXP=1440000083u -DWIDTH=4096u -DSMALL_HEIGHT=1024u -DMIDDLE=10u -DPM1=0 -DCARRY64=1 -DMM_CHAIN=1u -DMM2_CHAIN=1u -DMAX_ACCURACY=1 -DWEIGHT_STEP_MINUS_1=0xc.84e9e985fb5f8p-4 -DIWEIGHT_STEP_MINUS_1=-0xe.0c13e5b57c4bp-5 -DNO_ASM=1  -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2021-11-11 04:29:22 test/GTX1080Ti

2021-11-11 04:29:22 test/GTX1080Ti OpenCL compilation in 6.22 s
2021-11-11 04:29:24 test/GTX1080Ti 1440000083 LL        0 loaded: 0000000000000004
2021-11-11 04:38:43 test/GTX1080Ti 1440000083 LL    10000   0.00%; 55823 us/it; ETA 930d 08:55; 757f1958eea39b48
2021-11-11 04:48:02 test/GTX1080Ti 1440000083 LL    20000   0.00%; 55897 us/it; ETA 931d 14:24; ca704e1ad71c4d7f
...
2021-11-11 05:15:59 test/GTX1080Ti 1440000083 LL    50000   0.00%; 55910 us/it; ETA 931d 19:07; cbe95431712b43dc
2021-11-11 05:20:37 test/GTX1080Ti Stopping, please wait..
2021-11-11 05:20:39 test/GTX1080Ti 1440000083 LL    55000   0.00%; 56128 us/it; ETA 935d 10:25; dcca548706633818
2021-11-11 05:20:39 test/GTX1080Ti waiting for the Jacobi check to finish..
2021-11-11 05:37:11 test/GTX1080Ti 1440000083 OK    55000 (jacobi == -1)
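The "(jacobi == -1)" line reflects a property of the LL sequence: with seed 4, the Jacobi symbol of (s_i - 2) with respect to Mp is -1 for every iteration i >= 1, so a residue that gives any other value must be wrong. Below is a toy Python sketch of that check on small exponents. It is not gpuowl's code, and the function names are invented for illustration.

```python
def jacobi(a: int, n: int) -> int:
    """Jacobi symbol (a/n) for odd positive n, by the standard binary algorithm."""
    assert n > 0 and n % 2 == 1
    a %= n
    result = 1
    while a != 0:
        while a % 2 == 0:
            a //= 2
            if n % 8 in (3, 5):      # (2/n) = -1 when n = 3 or 5 mod 8
                result = -result
        a, n = n, a                   # quadratic reciprocity
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0


def ll_residues(p: int):
    """Yield (iteration, s_i mod Mp) for i = 1 .. p-2, starting from s_0 = 4."""
    m = (1 << p) - 1
    s = 4
    for i in range(1, p - 1):
        s = (s * s - 2) % m
        yield i, s


def jacobi_check(s: int, p: int) -> bool:
    """True if the residue s passes the LL Jacobi check for exponent p."""
    return jacobi(s - 2, (1 << p) - 1) == -1


for i, s in ll_residues(7):
    print(i, s, jacobi_check(s, 7))
```

A passing check does not prove the residue correct. A random error is expected to be caught only about half the time, which is why PRP with GEC is preferred for very long runs.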
Mlucas v20.1.1 2022-03-20 gave a matching res64 at iteration 10000:
Code:
[2022-05-01 20:27:50] M1440000083 Iter# = 10000 [ 0.00% complete] clocks = 00:59:38.481 [357.8482 msec/iter] Res64: 757F1958EEA39B48. AvgMaxErr = 0.216583227. MaxErr = 0.281250000. Residue shift count = 744135935.
At 1.45G, CUDALucas simply crashed on the GTX1080.
Code:
Using threads: square 32, splice 32.
Starting M1450000043 fft length = 82944K
The 10M steps in test exponent sometimes correspond to different expected fft lengths.
CUDALucas on the GTX1080Ti does not produce repeatable results here, and likely all of its 10k-iteration res64 values are wrong.
Code:
CUDALucas v2.06beta 64-bit build, compiled May  5 2017 @ 13:02:54

Using threads: square 32, splice 128.
Starting M1450000043 fft length = 82944K

(program crash, batch file auto restarts task)

Using threads: square 32, splice 128.
Starting M1450000043 fft length = 82944K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Nov 10  17:59:54  | M1450000043     10000  0x23b2ba4d44537db3  | 82944K  0.28125  61.7818  617.81s  | 1036:20:11:26   0.00%  |
|  Nov 10  18:10:14  | M1450000043     20000  0xc7a4f2c40f16bbdd  | 82944K  0.29688  61.9543  619.54s  | 1038:06:45:33   0.00%  |

(program crash, batch file auto restarts task)

Using threads: square 32, splice 128.
Starting M1450000043 fft length = 82944K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Nov 10  18:32:00  | M1450000043     10000  0x61aa46b9c8b2d0a0  | 82944K  0.28906  61.8699  618.69s  | 1038:07:40:38   0.00%  |
|  Nov 10  18:42:23  | M1450000043     20000  0x0000000000000000  | 82944K  0.31250  62.2602  622.60s  | 1041:14:05:40   0.00%  |
Illegal residue: 0x0000000000000000. See mersenneforum.org for help.

(program exit, batch file auto restarts task)

Using threads: square 32, splice 128.
Starting M1450000043 fft length = 82944K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Nov 10  19:02:41  | M1450000043     10000  0x4fa6a82c75016f4f  | 82944K  0.29688  61.3545  613.54s  | 1029:16:04:31   0.00%  |

(program crash, batch file exits)
Gpuowl v6.11-380 LL:
Code:
2022-05-01 21:51:50 test/radeonvii 1450000043 LL    10000   0.00%; 19064 us/it; ETA 319d 22:26; 654ee6a96681050a
2022-05-01 22:20:23 test/radeonvii 1450000043 LL   100000   0.01%; 18998 us/it; ETA 318d 19:32; 96656182cba4bb39
Mlucas V20.1.1 2022-03-20 LL:
Code:
[2022-05-02 09:19:43] M1450000043 Iter# = 10000 [ 0.00% complete] clocks = 00:48:18.435 [289.8435 msec/iter] Res64: 654EE6A96681050A. AvgMaxErr = 0.254127446. MaxErr = 0.328125000. Residue shift count = 1420926634.
Gpuowl V6.11-380 on Radeon VII GPU (Windows 10) PRP:
Code:
2022-05-04 13:55:46 test/radeonvii 1450000043 OK    10000   0.00%; 18668 us/it; ETA 313d 07:07; c51ba6c28e5adfb9 (check 41.96s)
2022-05-04 13:59:34 test/radeonvii 1450000043 OK    20000   0.00%; 18631 us/it; ETA 312d 16:05; 5d83daadd9f021f4 (check 41.80s)
2022-05-04 14:10:58 test/radeonvii 1450000043 OK    50000   0.00%; 18625 us/it; ETA 312d 13:23; d9dd1f8b2addb16d (check 41.72s)
2022-05-04 14:30:00 test/radeonvii 1450000043 OK   100000   0.01%; 18657 us/it; ETA 313d 02:10; a1d5ce3bb8126dd7 (check 41.97s)
Mlucas PRP:
Code:
[2022-05-04 21:30:49] M1450000043 Iter# = 10000 [ 0.00% complete] clocks = 00:59:44.029 [358.4030 msec/iter] Res64: C51BA6C28E5ADFB9. AvgMaxErr = 0.254497920. MaxErr = 0.343750000. Residue shift count = 0.
[2022-05-04 22:30:44] M1450000043 Iter# = 20000 [ 0.00% complete] clocks = 00:59:26.868 [356.6868 msec/iter] Res64: 5D83DAADD9F021F4. AvgMaxErr = 0.255102744. MaxErr = 0.328125000. Residue shift count = 0.
1.5Gbit experiments
gpuowl v4.6 PRP on an RX480 (not recommended, at an estimated 7.8 years to completion; also moot, since the exponent has a known small factor). The B1 bound is 0, so the base is presumably 3.
Code:
2018-11-01 16:35:37 condorella-rx480       10000/1500000041 [ 0.00%], 164.51 ms/it [163.50, 174.58]; ETA 2856d 01:42;
Note also the long ETA: 2856 days, or ~7.8 years.
CUDALucas v2.06 on a GTX1080 failed on 1,500,000,043 with an error similar to that for 2Gbit (see below).
Code:
Using threads: square 512, splice 128.
Starting M1500000043 fft length = 86400K
Round off error at iteration = 100, err = 0.5 > 0.35, fft = 86400K.
Something is wrong! Quitting.
Gpuowl V6.11-380 PRP on a Radeon VII GPU:
Code:
2022-05-04 14:56:40 test/radeonvii 1500000043 OK    10000   0.00%; 20466 us/it; ETA 355d 07:35; d77ad2df221f4174 (check 45.60s)
2022-05-04 15:00:50 test/radeonvii 1500000043 OK    20000   0.00%; 20483 us/it; ETA 355d 14:34; d72694cae693cb89 (check 45.70s)
...
2022-05-04 15:13:22 test/radeonvii 1500000043 OK    50000   0.00%; 20480 us/it; ETA 355d 12:56; cac725527de70889 (check 45.81s)
...
2022-05-04 15:34:15 test/radeonvii 1500000043 OK   100000   0.01%; 20462 us/it; ETA 355d 05:11; 0ac335566f192292 (check 45.83s)
Mlucas v20.1.1 2022-03-20 PRP:
Code:
[2022-05-09 14:47:17] M1500000043 Iter# = 10000 [ 0.00% complete] clocks = 02:26:33.365 [879.3366 msec/iter] Res64: D77AD2DF221F4174. AvgMaxErr = 0.000414304. MaxErr = 0.000549316. Residue shift count = 158099465.
2Gbit experiments
https://www.mersenne.ca/exponent/2147483563 is theoretically just within range of CUDALucas. It would need a lot of TF and P-1 before a serious attempt at LL were worthwhile. Estimated time on a GTX1080 is several years to completion, with little or no chance of correct completion without the Jacobi check or better.
In initial attempts to obtain LL timing on a GTX 1080, CUDALucas halted repeatedly before producing any timing data. A few more attempts were made after tuning the application for the GPU on the system where it's currently installed. It initially selects a properly sized fft length, decides that is too large, drops to one that is too small, hits excessive round-off error, and terminates, even if a proper fft length is specified on the command line.
Code:
The fft length 131072K is too large for exponent 2147483563, decreasing to 116640K
Using threads: square 1024, splice 64.
Starting M2147483563 fft length = 116640K
Round off error at iteration = 100, err = 0.5 > 0.35, fft = 116640K.
Something is wrong! Quitting.
The relevant fft lengths and benchmark timings are
Code:
Device              GeForce GTX 1080
Compatibility       6.1
clockRate (MHz)     1797
memClockRate (MHz)  5005

fft    max exp  ms/iter
76832 1335757897  83.2664
81920 1422251777  83.9003
82944 1439645131  84.9825
84672 1468986017  91.3988
86400 1498314007  94.8596
93312 1615502269  96.7309
96768 1674025489  99.3981
98304 1700021251 103.3561
100352 1734668777 105.2496
102400 1769301077 110.3034
104976 1812840839 112.3060
110592 1907684153 113.6384
114688 1976791967 117.1181
115200 1985426669 124.0529
116640 2009707367 131.0044
131072 2147483647 131.6075
It would need ~9. years to complete on a GTX1080.
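The ~9 year figure can be reproduced from the benchmark table: pick the smallest fft length whose maximum exponent covers the candidate, then multiply its ms/iter by the iteration count. A Python sketch of that lookup, with the table copied from above and hypothetical function names:

```python
# (fft length in K words, max exponent, ms/iter) from the GTX 1080 benchmark above.
CUDALUCAS_GTX1080_BENCH = [
    (76832, 1335757897, 83.2664),
    (81920, 1422251777, 83.9003),
    (82944, 1439645131, 84.9825),
    (84672, 1468986017, 91.3988),
    (86400, 1498314007, 94.8596),
    (93312, 1615502269, 96.7309),
    (96768, 1674025489, 99.3981),
    (98304, 1700021251, 103.3561),
    (100352, 1734668777, 105.2496),
    (102400, 1769301077, 110.3034),
    (104976, 1812840839, 112.3060),
    (110592, 1907684153, 113.6384),
    (114688, 1976791967, 117.1181),
    (115200, 1985426669, 124.0529),
    (116640, 2009707367, 131.0044),
    (131072, 2147483647, 131.6075),
]


def smallest_adequate_fft(exponent, table=CUDALUCAS_GTX1080_BENCH):
    """Return (fft_k, ms_per_iter) for the first row covering the exponent, or None."""
    for fft_k, max_exp, ms_per_iter in table:   # rows are in ascending order
        if exponent <= max_exp:
            return fft_k, ms_per_iter
    return None


def years_to_complete(exponent, ms_per_iter):
    """Run time in years, assuming ~exponent iterations at a constant rate."""
    return exponent * ms_per_iter / 1000 / 86400 / 365.25


fft_k, ms = smallest_adequate_fft(2147483563)
print(fft_k, f"{years_to_complete(2147483563, ms):.2f} years")
```

By the same table, 82944K tops out just below 1440000083, which is consistent with the round-off errors at 82944K and the move to 84672K in the 1.44G logs earlier in this post.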
The issue was reproducible on 2,000,000,099 and some lower values.
Code:
Using threads: square 1024, splice 64.
Starting M2000000099 fft length = 115200K
Round off error at iteration = 100, err = 0.5 > 0.35, fft = 115200K.
Restarting from last checkpoint to see if the error is repeatable.

Using threads: square 1024, splice 64.
Starting M2000000099 fft length = 115200K
Round off error at iteration = 100, err = 0.5 > 0.35, fft = 115200K.
The error persists.
Trying a larger fft until the next checkpoint.

Using threads: square 1024, splice 64.
Starting M2000000099 fft length = 116640K
Round off error at iteration = 100, err = 0.5 > 0.35, fft = 116640K.
Something is wrong! Quitting.
CUDALucas on the GTX1080Ti and Mlucas: TBD.

Gpuowl V6.11-380 on a Radeon VII (subsequent Jacobi symbol check successful):
Code:
2022-05-02 09:12:03 test/radeonvii 2000000099 LL    10000   0.00%; 28213 us/it; ETA 653d 02:00; f7b1319c033334c2
https://www.mersenne.ca/exponent/2147483563 would need more TF and P-1 before any attempt at PRP. P-1 stage 1 appears possible on gpuowl v6.11-380, at over 6 days on a Radeon VII:
Code:
2021-03-26 13:57:40 asr2/radeonvii4 2147483563 P1 B1=11000000, B2=600000000; 15869712 bits; starting at 0
2021-03-26 14:03:21 asr2/radeonvii4 2147483563 P1    10000   0.06%; 34113 us/it; ETA 6d 06:17; d78cbf554970d8c1
P-1 stage 2 is unlikely to work on most available GPUs, since GPU ram of 16 GB is barely adequate (-maxAlloc 15000) for exponents near 1G.

Gpuowl V6.11-380 (subsequent Jacobi symbol check successful):
Code:
2022-05-02 10:31:42 test/radeonvii 2147483563 LL    10000   0.00%; 30706 us/it; ETA 763d 04:33; b2a28af43a012b86
2022-05-02 11:17:38 test/radeonvii 2147483563 LL   100000   0.00%; 30615 us/it; ETA 760d 21:28; 1605bbbbc380ca76
2022-05-02 12:08:42 test/radeonvii 2147483563 LL   200000   0.01%; 30662 us/it; ETA 762d 00:57; 4b360c36788550fc
Mlucas V20.1.1 2022-03-20 LL:
Code:
[2022-05-04 20:37:34] M2147483563 Iter# = 10000 [ 0.00% complete] clocks = 01:13:43.354 [442.3354 msec/iter] Res64: B2A28AF43A012B86. AvgMaxErr = 0.234474211. MaxErr = 0.296875000. Residue shift count = 880753044.
https://www.mersenne.ca/exponent/2147483743 would need a lot of TF and P-1 before a serious attempt at PRP testing. P-1 at this size is far beyond the capability of CUDAPm1, and requires a >16GB GPU in gpuowl. LL is slightly beyond the nominal range of CUDALucas, but LL and PRP are theoretically within reach of gpuowl or Mlucas, and TF is feasible in mfaktx. Projected PRP time on a Radeon VII is 2.63 years, so it is not recommended. Gpuowl V6.11-380 PRP timings:
Code:
2021-03-26 12:43:44 asr2/radeonvii4 2147483743 OK    20000   0.00%; 38636 us/it; ETA 960d 07:01; 304a08968e896b20 (check 23.14s)
2021-03-26 13:36:46 asr2/radeonvii4 2147483743 OK   100000   0.00%; 38636 us/it; ETA 960d 05:56; 0637bf5d33611a9d (check 22.85s)
Mlucas v20.1.1 2022-03-20 PRP:
Code:
[2022-05-01 22:01:47] M2147483743 Iter# = 10000 [ 0.00% complete] clocks = 01:11:35.048 [429.5048 msec/iter] Res64: FDB3067B242478FA. AvgMaxErr = 0.234596052. MaxErr = 0.312500000. Residue shift count = 1753623453.
[2022-05-01 23:14:04] M2147483743 Iter# = 20000 [ 0.00% complete] clocks = 01:11:42.295 [430.2295 msec/iter] Res64: 304A08968E896B20. AvgMaxErr = 0.235129663. MaxErr = 0.312500000. Residue shift count = 674700330.
Mlucas LL:
Code:
[2022-05-01 21:58:16] M2147483743 Iter# = 10000 [ 0.00% complete] clocks = 01:18:25.244 [470.5244 msec/iter] Res64: 94A87DFAE884EB75. AvgMaxErr = 0.234346936. MaxErr = 0.312500000. Residue shift count = 1552261820.
[2022-05-02 09:45:49] M2147483743 Iter# = 100000 [ 0.00% complete] clocks = 01:18:24.558 [470.4559 msec/iter] Res64: 3A997CAD52805EF0. AvgMaxErr = 0.234872706. MaxErr = 0.296875000. Residue shift count = 1308916996.
There may be some naive interest in attempting gigadigit exponents, since they would potentially qualify for the largest EFF prize. Sufficient TF is feasible with mfaktx and a fast GPU or a lot of patience. Some exponents have been trial factored to approximately sufficient bit depth, and others are in progress.

Until Mlucas ~v20.1, there was no GIMPS software suitable for P-1 factoring of gigadigit Mersenne numbers in preparation for a primality test attempt. As an experiment, a copy of Gpuowl v6.11-219 was modified to raise the P-1 exponent limit as a function of fft length, enough to permit P-1 attempts on gigabit exponents with its highest fft length. It was tested at a much smaller exponent and fft length, near the expanded P-1 bits/word limit, and failed to find a known factor, as one might expect. (George had identified a factor of 3 as a reason for slightly lower usable bits/word in the fft lengths for P-1 than for PRP or LL.) P-1 factoring of gigadigit exponents will require a larger fft length than the 192M largest ever offered in Gpuowl. Estimated run time for a gigadigit stage 1 P-1 run on a Radeon VII would be ~1-2 months per exponent. That is impractically long without robust error correction such as that introduced in Gpuowl V7.2 P-1. Gpuowl V7.2 P-1 stage 2 requires a minimum of 24 buffers, increasing the GPU memory requirement considerably. I estimate ~40GiB of GPU ram would be required.
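One way to see that a GPU memory estimate of this order is plausible: assume each stage 2 buffer holds one residue-sized array of 8-byte doubles (an assumption made for this sketch only, not something taken from the Gpuowl source), and take the 192M fft length as a lower bound on what a gigadigit exponent needs. The function name and defaults are illustrative.

```python
def stage2_buffers_gib(fft_mwords: float, buffers: int = 24, bytes_per_word: int = 8) -> float:
    """Memory for the stage 2 buffers in GiB.

    Assumes (hypothetically) one FFT-length array of 8-byte words per buffer;
    fft_mwords is the FFT length in units of M = 2**20 words.
    """
    return fft_mwords * 2**20 * bytes_per_word * buffers / 2**30


# 192M words is the largest fft length Gpuowl has offered; gigadigit P-1 needs more.
print(stage2_buffers_gib(192), "GiB for 24 buffers at 192M")
```

Under those assumptions the minimum 24 buffers already take 36 GiB at 192M, before any other allocations, so a somewhat longer fft lands in the neighborhood of the ~40GiB estimate.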

Mlucas V20 or later and a select few versions of gpuowl could nominally run the primality tests. (CUDALucas, mprime/prime95, cllucas cannot.) Primality test duration is too long on any available consumer-grade hardware. There are severe reliability issues with very long LL tests. PRP/GEC would be required. PRP proof generation would be strongly recommended. No software existing now (2022-05-02) offers the necessary combination of large enough fft lengths, PRP, GEC, and PRP proof generation.

Gpuowl v6.5-84 PRP3 (not recommended, at an estimated 6.3 years on a Radeon VII; lacks PRP proof capability).
M3,321,928,097 has 1,000,000,001 decimal digits. (This exponent is moot, as it has multiple known small factors.)
Code:
FFT 196608K: Width 512x8, Height 256x8, Middle 12; 16.50 bits/word
2020-05-24 17:21:43 radeonvii3 3,321,928,097       20000  0.00%; 60053 us/sq; ETA 2308d 21:54; 1a05f0ca51fb8e7a
2020-05-24 18:41:56 radeonvii3 3,321,928,097      100000  0.00%; 60122 us/sq; ETA 2311d 12:24; 5b2cb77f57840bcc
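The digit count quoted for M3,321,928,097 follows from the exponent alone: Mp = 2^p - 1 has floor(p * log10(2)) + 1 decimal digits, because 2^p is never a power of ten and so subtracting 1 cannot drop a digit. A short Python sketch (the function name is illustrative; double precision is adequate here because the fractional part is not close to an integer):

```python
import math


def mersenne_decimal_digits(p: int) -> int:
    """Number of decimal digits of 2**p - 1."""
    return math.floor(p * math.log10(2)) + 1


print(mersenne_decimal_digits(3321928097))
```

For small exponents the result can be cross-checked directly against len(str(2**p - 1)).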
A better test would have been a well trial factored OBD candidate such as 3321928171.
A Gpuowl PRP attempt on a Radeon VII GPU yielded
Code:
2021-11-12 13:37:12 gpuowl v6.5-84-g30c0508
2021-11-12 13:37:12 Note: no config.txt file found
2021-11-12 13:37:12 radeonvii3 config: -d 3 -user kriesel -cpu radeonvii3 -use NO_ASM -log 10000
2021-11-12 13:37:12 radeonvii3 3321928171 FFT 196608K: Width 512x8, Height 256x8, Middle 12; 16.50 bits/word
2021-11-12 13:37:12 radeonvii3 using long carry kernels
2021-11-12 13:37:13 radeonvii3 OpenCL args "-DEXP=3321928171u -DWIDTH=4096u -DSMALL_HEIGHT=2048u -DMIDDLE=12u -DWEIGHT_STEP=0xb.4fea9eebaf1ep-3 -DIWEIGHT_STEP=0xb.50b3cb11d398p-4 -DWEIGHT_BIGSTEP=0xc.5672a115506d8p-3 -DIWEIGHT_BIGSTEP=0xa.5fed6a9b15138p-4 -DNO_ASM=1 -DNO_ASM=1  -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2021-11-12 13:37:17 radeonvii3 OpenCL compilation in 4096 ms
2021-11-12 13:42:02 radeonvii3 3321928171 OK     2000  0.00%; 54123 us/sq; ETA 2080d 21:57; 56782571892f7722 (check 65.70s)
2021-11-12 13:49:18 radeonvii3 3321928171       10000  0.00%; 54518 us/sq; ETA 2096d 03:00; 2165debfa35a4d4f
Note the ~5.7 year run time estimate. This version does not have proof generation.
Mlucas v20.1.1 2022-03-20 PRP:
Code:
[2022-05-04 17:01:24] M3321928171 Iter# = 10000 [ 0.00% complete] clocks = 02:08:58.093 [773.8093 msec/iter] Res64: 2165DEBFA35A4D4F. AvgMaxErr = 0.141962889. MaxErr = 0.187500000. Residue shift count = 1082059918.
[2022-05-04 19:04:51] M3321928171 Iter# = 20000 [ 0.00% complete] clocks = 02:02:29.438 [734.9438 msec/iter] Res64: 9471341ED3F3D370. AvgMaxErr = 0.142553408. MaxErr = 0.187500000. Residue shift count = 106571623.
~2.7Gdigit, ~8.9G exponent
Mlucas v20.1.1 2021-12-02 LL, unverified interim residues. Both timing lines below were obtained in the same run on an i7-1165G7 laptop with 16 GiB non-ECC ram, Ubuntu/WSL/Win10; for the first line prime95 was stopped, and for the second prime95 was also running. Note that this Mlucas version explicitly does not support full iteration count computations exceeding 1M on exponents >~2^32 for LL, PRP, or Pepin. Run times to completion would be enormous on nearly all available computing hardware. The "faster" 3.62 second/iteration timing below corresponds to ~1,026 years to completion.
Code:
[2022-02-07 20:36:50] M8937021911 Iter# = 1000 [ 0.10% complete] clocks = 01:00:20.155 [3620.1556 msec/iter] Res64: 97D908FE3A52408B. AvgMaxErr = 0.274295593. MaxErr = 0.375000000. Residue shift count = 0.
[2022-02-08 10:19:21] M8937021911 Iter# = 10000 [ 1.00% complete] clocks = 01:37:03.718 [5823.7187 msec/iter] Res64: 2190CD0E45AEF927. AvgMaxErr = 0.283365601. MaxErr = 0.343750000. Residue shift count = 0.
The decline in bits/word is about 0.838 bits/word per factor of ten on exponent.
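The data behind the 0.838 figure is not given in this post. As a rough cross-check only, the CUDALucas benchmark table earlier in this post gives a similar slope when the maximum exponents of two fft lengths are compared. The Python sketch below uses hypothetical function names and skips the 131072K row, whose maximum exponent appears to be capped at 2^31-1 rather than set by bits/word.

```python
import math


def max_bits_per_word(fft_kwords: int, max_exponent: int) -> float:
    """Bits/word at the largest exponent a given FFT length (in K words) supports."""
    return max_exponent / (fft_kwords * 1024)


def decline_per_decade(row_a, row_b) -> float:
    """Drop in max bits/word per factor of ten in exponent between two table rows."""
    (fft_a, exp_a), (fft_b, exp_b) = row_a, row_b
    drop = max_bits_per_word(fft_a, exp_a) - max_bits_per_word(fft_b, exp_b)
    return drop / math.log10(exp_b / exp_a)


def extrapolate_bits_per_word(bpw_ref, exp_ref, exp_new, slope=0.838):
    """Apply the stated slope to move a known bits/word to another exponent."""
    return bpw_ref - slope * math.log10(exp_new / exp_ref)


# 76832K and 116640K rows of the CUDALucas table (fft length in K, max exponent).
print(f"{decline_per_decade((76832, 1335757897), (116640, 2009707367)):.3f} bits/word per decade")
```

Two rows from one program give only a crude estimate, so agreement to within a few percent of 0.838 is about as much as can be expected.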
The interim results above, along with additional combinations not tabulated here because of the length of this post, are summarized in the last attachment.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
recent LL primality test error rate near 100M exponent.pdf (38.2 KB, 211 views)
long term average error rate of LL tests near 100M exponent.pdf (48.9 KB, 222 views)
observed error rate on LL tests of 100Mdigit exponents-2021-06-30.pdf (26.4 KB, 163 views)
nth LL test probabilities as function of exponent.pdf (30.6 KB, 161 views)
large exponent interim residues.pdf (59.2 KB, 43 views)

Last fiddled with by kriesel on 2022-05-23 at 15:15 Reason: added mlucas & gpuowl interim results
