2020-09-08, 01:42   #27
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2·5²·97 Posts
Improving fallback to lower proof power etc.

This was a power 8 proof PRP run on a Radeon 5700XT under Windows 10 that went awry. Aside from the reported checksum error, I see a few additional issues with this run sequence.
Code:
2020-09-07 11:48:16 asr2/5700xt 99356317 OK 98400000  99.04%; 2212 us/it; ETA 0d 00:35; ed3540b5aa993e56 (check 1.11s)
2020-09-07 11:55:39 asr2/5700xt 99356317 OK 98600000  99.24%; 2213 us/it; ETA 0d 00:28; 4569ca1b42f97bf5 (check 1.11s)
2020-09-07 12:03:03 asr2/5700xt 99356317 OK 98800000  99.44%; 2212 us/it; ETA 0d 00:21; 52d0f07278cb3ea0 (check 1.10s)
2020-09-07 12:10:27 asr2/5700xt 99356317 OK 99000000  99.64%; 2213 us/it; ETA 0d 00:13; 214b5e72adcb0097 (check 1.10s)
2020-09-07 12:17:50 asr2/5700xt 99356317 OK 99200000  99.84%; 2212 us/it; ETA 0d 00:06; 8dc4afa02db98b6e (check 1.10s)
2020-09-07 12:23:36 asr2/5700xt CC 99356317 / 99356317, af767eb4030a____
2020-09-07 12:23:38 asr2/5700xt 99356317 OK 99356800 100.00%; 2215 us/it; ETA 0d 00:00; 5a424b4dc57d3ccf (check 1.07s)
2020-09-07 12:23:39 asr2/5700xt proof: building level 1, hash dc19c1ed5074bfed
2020-09-07 12:23:39 asr2/5700xt proof: building level 2, hash e1c39c39ef8fec8c
2020-09-07 12:23:40 asr2/5700xt proof: building level 3, hash 4dff4687239f51cc
2020-09-07 12:23:42 asr2/5700xt proof: building level 4, hash 7803518131602fc6
2020-09-07 12:23:45 asr2/5700xt proof: building level 5, hash e6e93fce0591589a
2020-09-07 12:23:51 asr2/5700xt proof: building level 6, hash 486bd862e2a3633f
2020-09-07 12:23:53 asr2/5700xt checksum 78d6fc30 (expected 9be1d4ca) in '.\99356317\proof\23286660'
2020-09-07 12:23:53 asr2/5700xt Exception NSt10filesystem7__cxx1116filesystem_errorE: filesystem error: checksum mismatch: No error
2020-09-07 12:23:53 asr2/5700xt Bye

>gpuowl-win
2020-09-07 17:44:04 gpuowl v6.11-364-g36f4e2a
2020-09-07 17:44:04 config: -user kriesel -cpu asr2/5700xt -d 2 -use NO_ASM -maxAlloc 7500
2020-09-07 17:44:04 device 2, unique id ''
2020-09-07 17:44:04 asr2/5700xt 99356317 FFT: 5.50M 1K:11:256 (17.23 bpw)
2020-09-07 17:44:04 asr2/5700xt Expected maximum carry32: 293D0000
2020-09-07 17:44:05 asr2/5700xt OpenCL args "-DEXP=99356317u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DPM1=0 -DAMDGPU=1 -DWEIGHT_STEP_MINUS_1=0xb.52db15a632b98p-4 -DIWEIGHT_STEP_MINUS_1=-0xd.42fc054606498p-5 -DNO_ASM=1  -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2020-09-07 17:44:13 asr2/5700xt OpenCL compilation in 8.11 s
2020-09-07 17:44:15 asr2/5700xt 99356317 OK 99200000 loaded: blockSize 400, 8dc4afa02db98b6e
2020-09-07 17:44:15 asr2/5700xt validating proof residues for power 8
2020-09-07 17:44:22 asr2/5700xt checksum 78d6fc30 (expected 9be1d4ca) in '.\99356317\proof\23286660'
2020-09-07 17:44:22 asr2/5700xt validating proof residues for power 9
2020-09-07 17:44:22 asr2/5700xt Can't open '.\99356317\proof\194056' (mode 'rb')
2020-09-07 17:44:22 asr2/5700xt validating proof residues for power 8
2020-09-07 17:44:27 asr2/5700xt checksum 78d6fc30 (expected 9be1d4ca) in '.\99356317\proof\23286660'
2020-09-07 17:44:27 asr2/5700xt validating proof residues for power 7
2020-09-07 17:44:30 asr2/5700xt checksum 78d6fc30 (expected 9be1d4ca) in '.\99356317\proof\23286660'
2020-09-07 17:44:30 asr2/5700xt validating proof residues for power 6
2020-09-07 17:44:30 asr2/5700xt Can't open '.\99356317\proof\1552443' (mode 'rb')
2020-09-07 17:44:30 asr2/5700xt Proof disabled because of missing checkpoints
2020-09-07 17:44:33 asr2/5700xt 99356317 OK 99200800  99.84%; 2199 us/it; ETA 0d 00:06; 924e6946b4f9fde2 (check 1.37s)
2020-09-07 17:50:17 asr2/5700xt CC 99356317 / 99356317, af767eb4030a5338
2020-09-07 17:50:18 asr2/5700xt 99356317 OK 99356400 100.00%; 2212 us/it; ETA 0d 00:00; abffad0e796314d6 (check 1.08s)
2020-09-07 17:50:18 asr2/5700xt {"status":"C", "exponent":"99356317", "worktype":"PRP-3", "res64":"af767eb4030a____", "residue-type":"1", "errors":{"gerbicz":"0"}, "fft-length":"5767168", "program":{"name":"gpuowl", "version":"v6.11-364-g36f4e2a"}, "user":"kriesel", "computer":"asr2/5700xt", "aid":"(redacted)", "timestamp":"2020-09-07 22:50:18 UTC"}
1) Gpuowl gives up, abandoning the run. It could skip to the next worktodo entry instead, putting hours or days of gpu time to productive use rather than leaving it idle until the user finds gpuowl halted.
2) There is a residue file for iteration 1552444, while the restart looks for 1552443 at power 6. It seems the iteration count is computed slightly differently between the initial run and the restart, or between the original power and the fallback power.
3) The first run had already computed to 100%, yet the restart recomputes from an indicated 99.84% to 100%. This is a minor production loss of 5 minutes 44 seconds.
4) The off-by-1, 1552444 vs. 1552443, prevents a power 6 proof from being generated in the restart.
5) Power 5, which would still save ~96% of a PRP DC, is neither attempted in the restart nor supported. (It might have the off-by-one, or more, issue too.) Admittedly this should be a rare case. Even power 4 would occasionally give a substantial savings over a complete DC after an error.
6) For power 8, topk would be the next multiple of 256 above p, which is 99356416 for p = 99356317. Topk/256 for power 8 would be 388111, and residues would be saved at iterations that are multiples of that. Four times 388111 = 1552444, the first saved residue usable for power 6. The initial run goes past 99356416 to 99356800, presumably because of block size 400. But the restart computes only to 99356400, one block less, for some reason. 99356400/256 = 388110.9375; four times that is 1552443.75, which apparently got truncated to 1552443 for the power 6 restart. Alternatively, the restart proof attempts compute the iteration count independently for each power, ignoring the history of the exponent's run and any need to keep the residue spacings for different proof powers related by powers of 2. If the restart omits the ceiling function, 99356317/256 = 388110.61328125; 4 times that is 1552442.453125, unlikely to produce 1552443 for power 6.
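The arithmetic in item 6 can be rechecked with a short Python sketch. It only reproduces the candidate calculations discussed above. The variable names are illustrative, and none of the formulas is taken from the gpuowl source; each is a guess at what the program might be doing.

```python
# Candidate explanations for the power 6 file name 1552443 (all hypotheses).
p = 99356317

def ceil_div(a, b):
    """Integer ceiling of a / b, avoiding floating point."""
    return -(-a // b)

# Power 8: topk is the next multiple of 256 at or above p.
topk_power8 = ceil_div(p, 256) * 256          # 99356416
spacing_power8 = topk_power8 // 256           # 388111

# Files the original power 8 run would have that suit power 6 (every 4th one).
from_power8_run = spacing_power8 * 4          # 1552444

# Hypothesis A: restart uses its final iteration 99356400, then truncates.
from_restart_end = 99356400 * 4 // 256        # 1552443

# Hypothesis B: restart takes the ceiling independently for power 6.
independent_power6 = ceil_div(p, 64)          # 1552443

# Hypothesis C: no ceiling at all, just truncation of p * 4 / 256.
no_ceiling = p * 4 // 256                     # 1552442

print(from_power8_run, from_restart_end, independent_power6, no_ceiling)
```

Under these assumptions, hypotheses A and B both give 1552443, so the log alone may not distinguish them; hypothesis C gives 1552442 and does not match the file the restart asked for.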
So I suspect there's no way currently to save the proof. I still have all the files generated, and have not yet reported the PRP result.

If a future version of gpuowl computed topk for its maximum supported power (currently 9) and then derived the specified power's residue-save interval from multiples of that, more of the saved iterations would be interchangeable among powers, improving fallback to lower powers after an error. As is, for p=99356317, topk/2^power and the alternatives are:
power   first residue save   proposed   nearest multiple from current power 8 default
9       194056               194056     na
8       388111               388112     388111
7       776222               776224     776222
6       1552443              1552448    1552444
5       3104885              3104896    3104888
4       6209770              6209792    6209776
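The three columns above can be regenerated with a small Python sketch. It assumes that the "first residue save" column is the ceiling of p/2^power taken separately for each power, that the proposed column scales the power 9 value by powers of 2, and that the last column scales the power 8 value the same way. The function names are hypothetical and are not gpuowl identifiers.

```python
# Sketch of the table: residue spacing per proof power for p = 99356317.
P = 99356317
MAX_POWER = 9   # maximum proof power stated in the post
RUN_POWER = 8   # power the original run used

def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)

def spacing_independent(p, power):
    # Assumed current behaviour: ceiling computed separately per power.
    return ceil_div(p, 2 ** power)

def spacing_from_base(p, power, base_power):
    # Spacing derived from the base power's spacing, scaled by a power of 2.
    # Returns None when the requested power is above the base power.
    if power > base_power:
        return None
    return ceil_div(p, 2 ** base_power) * 2 ** (base_power - power)

rows = [
    (k,
     spacing_independent(P, k),
     spacing_from_base(P, k, MAX_POWER),   # proposed
     spacing_from_base(P, k, RUN_POWER))   # from the power 8 run
    for k in range(MAX_POWER, 3, -1)
]

for row in rows:
    print(*row)
```

With the proposed scheme every lower power's spacing is an exact multiple of the power 9 spacing, so a run that saved residues for any power would already hold every file a lower power needs.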

Last fiddled with by kriesel on 2020-09-08 at 01:54