mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

kriesel 2020-04-28 16:06

[QUOTE=kruoli;544059]Okay, thank you for the information! Somehow I thought there had been a working LL in the past, but I guess I confused it with CudaLucas etc.

A few LL ran fine without any errors and matched (e.g. [URL="https://www.mersenne.org/report_exponent/?exp_lo=57234283&full=1"]M57234283[/URL]), but others produced errors (e.g. [URL="https://www.mersenne.org/report_exponent/?exp_lo=57234167&full=1"]M57234167[/URL], [URL="https://www.mersenne.org/report_exponent/?exp_lo=57234179&full=1"]M57234179[/URL], [URL="https://www.mersenne.org/report_exponent/?exp_lo=57233941&full=1"]M57233941[/URL], [URL="https://www.mersenne.org/report_exponent/?exp_lo=55297621&full=1"]M55297621[/URL]).

I uploaded the full logs and residue folders (I guess that's what they are), compressed, for both cards I ran it on [URL="http://mc.oliver-kruse.de/GIMPS/gpuOwl"]here[/URL].[/QUOTE]Very early gpuowl (before v0.7) implemented LL only, on AMD only. Those versions are limited in FFT length and so limited in exponent. One had a Jacobi check, which gives a 50% error-detection probability. See [URL]https://www.mersenneforum.org/showpost.php?p=488539&postcount=4[/URL]

Great job sharing logs etc for diagnostic use.

kruoli 2020-04-28 16:27

[QUOTE=kriesel;544081]Great job sharing logs etc for diagnostic use.[/QUOTE]

Uhm... Thank you? If you are being sarcastic: the logs are at the link in my post.
[QUOTE=kruoli;544059]...[URL="http://mc.oliver-kruse.de/GIMPS/gpuOwl"]here[/URL].[/QUOTE]

S485122 2020-04-28 16:40

[QUOTE=kruoli;544085]...
If you are sarcastic
...[/QUOTE]Not sarcastic, just paternalistic.

Jacob

kriesel 2020-04-28 17:58

[QUOTE=S485122;544086]Not sarcastic, just paternalistic.

Jacob[/QUOTE]Neither. Just sincerely appreciative of people who help the coding wizards improve the software. The more the better, in my opinion.

kriesel 2020-05-01 16:13

gpuowl-win v6.11-272-g07718b9 build
 
2 Attachment(s)
This is, for the moment, the latest commit available.
Untested except for help output.

kriesel 2020-05-01 16:32

[QUOTE=kruoli;544066]No, I have not tuned at all, because I did not see such an option in the "-h" menu. Maybe a bit foolish...[/QUOTE]For the tuning controls, look at the top of the source file gpuowl.cl, or in the "use flags list" text file I've started including in the .7z files I occasionally post, whichever is most convenient.

kriesel 2020-05-01 16:36

[QUOTE=preda;543924]Do you have another GPU of the same model that does not exhibit such errors? otherwise I'd suspect something amiss software-side (i.e. gpuowl, and the related OpenCL compilation).[/QUOTE]Quick update / recap on that:
Two rx550s showed the issue, in one pcie slot of one system, while the system's ram fan was underperforming and the ram was getting as hot as 100C. After a fan replacement reduced ram temps by about 25C, EE errors were still occurring. Then I powered the box down again to move it to the floor, and resumed, still with the second rx550 in place. In 5 days of running since, finishing one ~95M exponent PRP and part of another, zero EEs have appeared. So I think the case for it being a software issue is weak. The move to the floor lowered temps only about 1C. Ram temps are currently 65-70C (higher than in other systems of the same model with different cpu and gpu models installed; well within the 95C-or-higher Micron ram max operating temperature spec).

Prime95 2020-05-01 19:44

Note to all Linux users. If you are changing to the latest commit (recommended), upgrade to rocm 3.3.

kruoli 2020-05-01 20:24

[QUOTE=kriesel;544373]This is for the moment, the latest commit available.[/QUOTE]

For this build, I got:
[CODE]gpuowl-win.exe -prp 228479
2020-05-01 22:22:51 gpuowl v6.11-272-g07718b9
2020-05-01 22:22:51 Note: not found 'config.txt'
2020-05-01 22:22:51 config: -prp 228479
2020-05-01 22:22:51 device 0, unique id ''
2020-05-01 22:22:57 Intel(R) HD Graphics 630-0 228479 FFT: 128K 256:1:256 (1.74 bpw)
2020-05-01 22:22:57 Intel(R) HD Graphics 630-0 Expected maximum carry32: 00000
2020-05-01 22:22:57 Intel(R) HD Graphics 630-0 using long carry kernels
2020-05-01 22:22:57 Intel(R) HD Graphics 630-0 OpenCL args "-DEXP=228479u -DWIDTH=256u -DSMALL_HEIGHT=256u -DMIDDLE=1u -DWEIGHT_STEP=0x9.8f139e459cfc8p-3 -DIWEIGHT_STEP=0xd.640310ad3754p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DPM1=0 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-05-01 22:23:16 Intel(R) HD Graphics 630-0 OpenCL compilation in 19.11 s
2020-05-01 22:23:16 Intel(R) HD Graphics 630-0 Exception gpu_error: INVALID_BUFFER_SIZE clCreateBuffer at clwrap.cpp:285 makeBuf_
2020-05-01 22:23:16 Intel(R) HD Graphics 630-0 Bye[/CODE]

kriesel 2020-05-02 08:01

[QUOTE=kruoli;544405]For this build, I got:
[CODE]gpuowl-win.exe -prp 228479
2020-05-01 22:22:51 gpuowl v6.11-272-g07718b9
2020-05-01 22:22:51 Note: not found 'config.txt'
2020-05-01 22:22:51 config: -prp 228479
2020-05-01 22:22:51 device 0, unique id ''
2020-05-01 22:22:57 Intel(R) HD Graphics 630-0 228479 FFT: 128K 256:1:256 (1.74 bpw)
2020-05-01 22:22:57 Intel(R) HD Graphics 630-0 Expected maximum carry32: 00000
2020-05-01 22:22:57 Intel(R) HD Graphics 630-0 using long carry kernels
2020-05-01 22:22:57 Intel(R) HD Graphics 630-0 OpenCL args "-DEXP=228479u -DWIDTH=256u -DSMALL_HEIGHT=256u -DMIDDLE=1u -DWEIGHT_STEP=0x9.8f139e459cfc8p-3 -DIWEIGHT_STEP=0xd.640310ad3754p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DPM1=0 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-05-01 22:23:16 Intel(R) HD Graphics 630-0 OpenCL compilation in 19.11 s
2020-05-01 22:23:16 Intel(R) HD Graphics 630-0 Exception gpu_error: INVALID_BUFFER_SIZE clCreateBuffer at clwrap.cpp:285 makeBuf_
2020-05-01 22:23:16 Intel(R) HD Graphics 630-0 Bye[/CODE][/QUOTE]
Congratulations, you can apparently run mfakto on the hd630, because the Intel OpenCL is working. (But it does not have the DP and OpenCL 2.0 that gpuowl requires.)

Run gpuowl-win -h to see the program-generated help, which lists the detected available OpenCL devices by number and model description, and the FFT specifications.

kruoli 2020-05-02 16:09

In that case, it would be nice if the program reported that.

[CODE]-device <N> : select a specific device:
0 : Intel(R) HD Graphics 630- not-AMD
1 : Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz- not-AMD[/CODE]

The program lists both my processor and its integrated graphics as valid devices, even though neither of them works (the CPU gives a lot of OpenCL errors during kernel compilation).

I'm just trying things out, and I think the program should give proper feedback when used on unsupported hardware. That could be realized by calling [FONT="Courier New"]clGetPlatformInfo[/FONT] (to read out the OpenCL version) and [FONT="Courier New"]clGetDeviceInfo[/FONT] (with parameter [FONT="Courier New"]CL_DEVICE_DOUBLE_FP_CONFIG[/FONT]) and checking that information.

paulunderwood 2020-05-05 04:16

Different FFT sizes
 
I have been running two instances at 5632K. Now I have a new set of assignments at 6144K. Running two different FFT sizes has two effects: the smaller runs faster and the larger runs much slower. It is more efficient to run one instance. Will this imbalance be redressed when I have equal 6144K assignments?

Prime95 2020-05-05 04:41

[QUOTE=paulunderwood;544616]I have been running two instances at 5632K. Now I have a new set of assignments at 6144K. Running two different FFT sizes has two effects: the smaller runs faster and the larger runs much slower. It is more efficient to run one instance. Will this imbalance be redressed when I have equal 6144K assignments?[/QUOTE]

One would presume so. BTW, the latest commit supports exponents up to 106.6M in the 5.5M FFT.

paulunderwood 2020-05-05 06:37

[QUOTE=Prime95;544617]One would presume so. BTW, the latest commit supports exponents up to 106.6M in the 5.5M FFT.[/QUOTE]

[CODE]./gpuowl
./gpuowl: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by ./gpuowl)
[/CODE]

I don't have GLIBCXX_3.4.26 on my Debian Buster -- is there a work-around?

paulunderwood 2020-05-05 07:36

[QUOTE=paulunderwood;544620][CODE]./gpuowl
./gpuowl: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by ./gpuowl)
[/CODE]

I don't have GLIBCXX_3.4.26 on my Debian Buster -- is there a work-around?[/QUOTE]

I have two compilers. I think I installed gcc 9 manually for the source, and 8 is native. Anyway, for my purposes I hard-wired g++-8 into the makefile and all is hunky-dory now.

kriesel 2020-05-05 18:50

[QUOTE=kriesel;544376]Quick update / recap on that[/QUOTE]
For what it's worth, this matching LL DC with shift 0 was performed with gpuowl v6.11-264 on the RX480 in the same system that was previously having frequent GEC errors on rx550s.
[url]https://www.mersenne.org/report_exponent/?exp_lo=55487503&full=1[/url]

ewmayer 2020-05-05 20:24

[QUOTE=paulunderwood;544616]I have been running two instances at 5632K. Now I have a new set of assignments at 6144K. Running two different FFT sizes has two effects: the smaller runs faster and the larger runs much slower. It is more efficient to run one instance. Will this imbalance be redressed when I have equal 6144K assignments?[/QUOTE]

Paul, what expo(s) are you running that need 6144K? I'd be interested to see your per-job timings for the following 2-job setups:

1. Both @5632K;
2. One each @5632K and @6144K;
3. Both @6144K.

Presumably you already know the timing for the first 2 ... you could temporarily move one of your queued-up 6144K assignments to top of the worktodo file for the current 5632K run to get both @6144K.

If the slowdown for 2 compared to 1 and 3 really is as bad as you describe, I wonder if it's something to do with context-switching on the GPU between tasks that have different memory mappings: 2 jobs at the same FFT length have different run data and e.g. DWT weights, but the same memory profile and GPU resource usage.

[b]Edit:[/b] I tried the above three 2-jobs scenarios on my own Radeon7, using expos ~107M to trigger the 6M FFT length. Here are the per-iteration timings:

1. Both @5632K: 1470 us/iter for each, total throughput 1360 iter/sec;
2. One each @5632K,@6144K: 1530,1546 us/iter resp., total throughput 1300 iter/sec;
3. Both @6144K: 1615 us/iter for each, total throughput 1238 iter/sec.

So no anomalous slowdowns for me at any of these combos, and the per-iteration timings hew very closely to what one would expect based on an n*log(n) per-autosquaring scaling.
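As a sanity check on that scaling claim, here is a sketch (mine, not from the measurements themselves) that predicts the both-@6144K timing from the both-@5632K timing under n*log(n) squaring-cost scaling:

```python
import math

# Hedged sketch: predict the 6144K per-iteration time from the measured
# 5632K time, assuming per-squaring cost scales as n*log(n) in the FFT
# word count n. The 1470 us figure is the both-@5632K timing above.
n_small = 5632 * 1024   # 5.5M-FFT word count
n_large = 6144 * 1024   # 6M-FFT word count
t_small = 1470          # us/iter, measured with both runs @5632K

scale = (n_large * math.log(n_large)) / (n_small * math.log(n_small))
t_pred = t_small * scale  # ~1613 us/iter, close to the measured 1615
print(round(t_pred))
```

The prediction lands within a few microseconds of the measured 1615 us/iter, which is what "hew very closely" means quantitatively here.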

paulunderwood 2020-05-05 21:59

[QUOTE=ewmayer;544664]Paul, what expo(s) are you running that need 6144K? I'd be interested to see your per-job timings for the following 2-job setups:

1. Both @5632K;
2. One each @5632K and @6144K;
3. Both @6144K.

Presumably you already know the timing for the first 2 ... you could temporarily move one of your queued-up 6144K assignments to top of the worktodo file for the current 5632K run to get both @6144K.

If the slowdown for 2 compared to 1 and 3 really is as bad as you describe, I wonder if it's something to do with context-switching on the GPU between tasks that have different memory mappings: 2 jobs at same FFT length have different run data and e.g. DWT weights but have the same memory profile and GPU resources usage.

[b]Edit:[/b] I tried the above three 2-jobs scenarios on my own Radeon7, using expos ~107M to trigger the 6M FFT length. Here are the per-iteration timings:

1. Both @5632K: 1470 us/iter for each, total throughput 1360 iter/sec;
2. One each @5632K,@6144K: 1530,1546 us/iter resp., total throughput 1300 iter/sec;
3. Both @6144K: 1615 us/iter for each, total throughput 1238 iter/sec.

So no anomalous slowdowns for me at any of these combos, and the per-iteration timings hew very closely to what one would expect based on an n*log(n) per-autosquaring scaling.[/QUOTE]

1. Both @5632K; ---> 1489us/it each
2. One each @5632K and @6144K ----> the latter was ~2300us/it (very slow); the former 1125us/it

At the moment (with latest commit) it is running ~1200us/it (103.9M) and ~1800us/it (104.9M). They were running at the average earlier until I restarted them.

It is my last 103.9M exponent.

paulunderwood 2020-05-07 05:49

Was ~1200us/it (103.9M) and ~1800us/it (104.9M).

Now 1440us/it each -- both at 104.9M.

xx005fs 2020-05-07 16:56

PM1 Result not understood
 
It seems that PrimeNet won't be able to understand factored PM1 results out of GPUOWL.

[CODE]{"status":"F", "exponent":"98141611", "worktype":"PM1", "B1":"750000", "B2":"15000000", "fft-length":"5767168", "factors":"["****"]", "program":{"name":"gpuowl", "version":"v6.11-258-gb92cdfd"}, "computer":"TITAN V-0", "aid":"******", "timestamp":"2020-05-06 07:29:29 UTC"}[/CODE]

S485122 2020-05-07 18:23

[QUOTE=paulunderwood;544778]Was ~1200us/it (103.9M) and ~1800us/it (104.9M).

Now 1440us/it each -- both at 104.9M.[/QUOTE]"us" ? Usually it is capitalised as "US", but it is not a unit (AFAIK.) Or do you (and preceding posters) mean µs ?

Jacob

paulunderwood 2020-05-07 18:35

[QUOTE=S485122;544825]"us" ? Usually it is capitalised as "US", but it is not a unit (AFAIK.) Or do you (and preceding posters) mean µs ?

Jacob[/QUOTE]

Yes, I meant µs. But how do I generate mu with the keyboard easily?

nvm: I found [URL="https://www.maketecheasier.com/quickly-type-special-characters-linux/"]this on how to do it in Gnome[/URL] without having to remember and use Unicode code points. Thanks for prompting me!

PhilF 2020-05-07 21:05

[QUOTE=xx005fs;544813]It seems that for factored PM1 results out of GPUOWL, primenet won't be able to understand it.

[CODE]{"status":"F", "exponent":"98141611", "worktype":"PM1", "B1":"750000", "B2":"15000000", "fft-length":"5767168", "factors":"["****"]", "program":{"name":"gpuowl", "version":"v6.11-258-gb92cdfd"}, "computer":"TITAN V-0", "aid":"******", "timestamp":"2020-05-06 07:29:29 UTC"}[/CODE][/QUOTE]

Hmmm. I have reported P-1 factors from gpuOwL before with no problem, by copying/pasting the result into the manual submission form.

ewmayer 2020-05-07 21:41

I figured us = microseconds was clear from the context.

@PaulU: Just noticed per-iter timings of the 2 jobs on my R7 also went askew early this a.m. ... as of midnight both were PRPing expos ~103.9M @5.5M FFT, each running ~1470 us/iter. Around 1am one job's PRP finished and that task started a PRP of an expo ~104.9M, still at 5.5M FFT, but the per-iter time of that job dropped to 1265 us/iter right from the beginning, while at the same time the per-iter times of the other ongoing job with p ~ 103.9M jumped to 1664 us/iter. I killed and restarted both jobs first thing this morning by way of daily kworker-task CPU-cycle parasitism control; the timing disparity continued after both were restarted.

Looking closely at the two OpenCL args lists for the 2 jobs, the FFT params are the same; the main diffs are the expected ones in the various DWT-weights-associated constants of the 2 expos ... the only salient-appearing diff I see is that the p ~ 104.9M job sports an extra "-DMM2_CHAIN=1u" arg which the other one lacks. Whatever that means code-branch and memory-map-wise, it apparently caused the ROCm priority management engine to give a higher priority to that job. Total throughput for 2 jobs running ~1470 us/iter each was ~1360 iter/sec; with the timing disparity it is ~1390 iter/sec, so I've actually gained a few % total throughput.

preda 2020-05-07 22:22

[QUOTE=xx005fs;544813]It seems that for factored PM1 results out of GPUOWL, primenet won't be able to understand it.

[CODE]{"status":"F", "exponent":"98141611", "worktype":"PM1", "B1":"750000", "B2":"15000000", "fft-length":"5767168", "factors":"["****"]", "program":{"name":"gpuowl", "version":"v6.11-258-gb92cdfd"}, "computer":"TITAN V-0", "aid":"******", "timestamp":"2020-05-06 07:29:29 UTC"}[/CODE][/QUOTE]

There was a bug, an extra set of quotes around the factors array: "["****"]". The bug has been fixed (you can upgrade), and this result can probably be submitted by manually dropping the extra quotes, to: "factors":["****"]
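If anyone wants to patch an affected result line by hand before submitting, a minimal sketch of the repair (the field values here are placeholders trimmed from the post above):

```python
import json
import re

# Hedged sketch: strip the stray quotes that the buggy gpuowl versions
# wrapped around the "factors" JSON array, so the line parses cleanly.
broken = ('{"status":"F", "exponent":"98141611", "worktype":"PM1", '
          '"factors":"["****"]", '
          '"program":{"name":"gpuowl", "version":"v6.11-258-gb92cdfd"}}')

# Turn "factors":"[...]" into "factors":[...] by dropping the outer quotes.
fixed = re.sub(r'"factors":"(\[.*?\])"', r'"factors":\1', broken)
result = json.loads(fixed)   # parses now
print(result["factors"])
```

The same one-character-pair fix works on a full result line; nothing else in the JSON needs touching.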

LaurV 2020-05-08 06:24

[QUOTE=S485122;544825]"us" ? Usually it is capitalised as "US", but it is not a unit (AFAIK.) Or do you (and preceding posters) mean µs ?
Jacob[/QUOTE]
I assume that was a nitpicking/joke. If it was not, then you should learn that "u" is the right/standard/accepted abbreviation for "micro" in all domains I ever touched, and where typing µ or µ or \(\mu\) would be tedious. Including computer science and software manufacturing (see the famous uVision from Keil, or uTorrent, etc). In my daily work I measure the electric potential in uV (microvolts), current in uA (microamperes), and thickness of bonding wires in um (micrometers, or microns).

kriesel 2020-05-08 16:12

gpuowl-win v6.11-278-ga39cc1a build
 
2 Attachment(s)
The usual shower of compile warnings, tested only as far as included help output, etc.

kriesel 2020-05-08 16:46

Third RX550 (a 2GB model) showed a transient EE issue on a different system. This is an open-frame setup with no known temperature issues.[CODE]2020-05-07 23:10:36 gpuowl v6.11-272-g07718b9
2020-05-07 23:10:36 config: -user kriesel -cpu asr2/rx550 -d 1 -use NO_ASM
2020-05-07 23:10:36 device 1, unique id ''
2020-05-07 23:10:36 asr2/rx550 worktodo.txt line ignored: ""
2020-05-07 23:10:36 asr2/rx550 107000389 FFT: 6M 1K:12:256 (17.01 bpw)
2020-05-07 23:10:36 asr2/rx550 Expected maximum carry32: 25260000
2020-05-07 23:10:37 asr2/rx550 OpenCL args "-DEXP=107000389u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=12u -DWEIGHT_STEP=0xf.eb7509fc7be48p-3 -DIWEIGHT_STEP=0x8.0a52bc152d0dp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DPM1=0 -DAMDGPU=1 -DNO_ASM=1 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-05-07 23:10:39 asr2/rx550 OpenCL compilation in 2.29 s
2020-05-07 23:10:48 asr2/rx550 107000389 OK 0 loaded: blockSize 400, 0000000000000003
2020-05-07 23:11:09 asr2/rx550 107000389 OK 800 0.00%; 17645 us/it; ETA 21d 20:27; 4f39fc137c27de54 (check 7.30s)
2020-05-08 00:13:46 asr2/rx550 107000389 EE 200000 0.19%; 18837 us/it; ETA 23d 06:49; 65fe4f6dd6c92d4e (check 8.97s)
2020-05-08 00:13:55 asr2/rx550 107000389 EE 800 loaded: blockSize 400, 79b18fd6bfda22f9 (expected 4f39fc137c27de54)
2020-05-08 00:13:55 asr2/rx550 Exiting because "error on load"
2020-05-08 00:13:55 asr2/rx550 Bye

C:\Users\ken\Documents\gpuowl-v6.11-272>gpuowl-win
2020-05-08 01:03:03 gpuowl v6.11-272-g07718b9
2020-05-08 01:03:03 config: -user kriesel -cpu asr2/rx550 -d 1 -use NO_ASM
2020-05-08 01:03:03 device 1, unique id ''
2020-05-08 01:03:03 asr2/rx550 worktodo.txt line ignored: ""
2020-05-08 01:03:03 asr2/rx550 107000389 FFT: 6M 1K:12:256 (17.01 bpw)
2020-05-08 01:03:03 asr2/rx550 Expected maximum carry32: 25260000
2020-05-08 01:03:04 asr2/rx550 OpenCL args "-DEXP=107000389u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=12u -DWEIGHT_STEP=0xf.eb7509fc7be48p-3 -DIWEIGHT_STEP=0x8.0a52bc152d0dp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DPM1=0 -DAMDGPU=1 -DNO_ASM=1 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-05-08 01:03:11 asr2/rx550 OpenCL compilation in 7.20 s
2020-05-08 01:03:20 asr2/rx550 107000389 OK 800 loaded: blockSize 400, 4f39fc137c27de54
2020-05-08 01:03:41 asr2/rx550 107000389 OK 1600 0.00%; 17649 us/it; ETA 21d 20:34; 00cff77f1e4010a4 (check 7.32s)
2020-05-08 02:02:08 asr2/rx550 107000389 OK 200000 0.19%; 17639 us/it; ETA 21d 19:18; 65fe4f6dd6c92d4e (check 7.32s)
2020-05-08 03:01:04 asr2/rx550 107000389 OK 400000 0.37%; 17645 us/it; ETA 21d 18:29; bbdb6f6d3790a362 (check 7.32s)
2020-05-08 04:00:00 asr2/rx550 107000389 OK 600000 0.56%; 17639 us/it; ETA 21d 17:20; 902382f6237a1979 (check 7.32s)
2020-05-08 04:58:56 asr2/rx550 107000389 OK 800000 0.75%; 17645 us/it; ETA 21d 16:31; 8087f982145cff93 (check 7.32s)
2020-05-08 05:57:51 asr2/rx550 107000389 OK 1000000 0.93%; 17639 us/it; ETA 21d 15:22; 6d75bf2bfb36a594 (check 7.32s)
2020-05-08 06:56:47 asr2/rx550 107000389 OK 1200000 1.12%; 17645 us/it; ETA 21d 14:34; 48c73046fff69459 (check 7.32s)
2020-05-08 07:55:43 asr2/rx550 107000389 OK 1400000 1.31%; 17639 us/it; ETA 21d 13:25; e84d6918ae180382 (check 7.32s)
2020-05-08 08:54:39 asr2/rx550 107000389 OK 1600000 1.50%; 17645 us/it; ETA 21d 12:37; 0e04b83a1aa1b2f2 (check 7.32s)
2020-05-08 09:53:34 asr2/rx550 107000389 OK 1800000 1.68%; 17639 us/it; ETA 21d 11:28; 8d7a21d8a97a586f (check 7.32s)
2020-05-08 10:52:31 asr2/rx550 107000389 OK 2000000 1.87%; 17645 us/it; ETA 21d 10:39; c213cfc1386c1fca (check 7.32s)
[/CODE]

ewmayer 2020-05-09 02:50

1 Attachment(s)
[QUOTE=ewmayer;544843]@PaulU: Just noticed per-iter timings of the 2 jobs on my R7 also went askew early this a.m. ... as of midnight both were PRPing expos ~103.9M @5.5M FFT, each run ~1470 us/iter. Around 1am one job's PRP finished and that task started a PRP of an expo ~104.9M, still at 5.5M FFT, but the per-iter time of that job dropped to 1265 us/iter right from the beginning, at the same time the per-iter times of the other ongoing job with p ~ 103.9M jumped to 1664 us/iter. I killed and restarted both jobs first thing this morning by way of daily kworker-task CPU-cycle parasitism control, the timing disparity continued after both were restarted.

Looking closely at the two OpenCL args lists for the 2 jobs, FFT params same, main diffs are the expected ones in the various DWT-weights-associated consts of the 2 expos ... the only salient-appearing diff I see is that the p ~ 104.9M job sports an extra "-DMM2_CHAIN=1u" arg which the other one lacks. Whatever that means code-branch and memory-map-wise, it caused the ROCm priority management engine to apparently give a higher priority to that job. Total throughput for 2 jobs running ~1470 us/iter each was ~1360 iter/sec, with the timing disparity it is ~1390 iter/sec, so I've actually gained a few % total throughput.[/QUOTE]

Let's call the aforementioned 2 instances run0 and run1, after the subdir names in which I run them. Earlier today run0 finished its p ~104.9M job and started one with p ~103.9M, at which point the 2 run timings again equalized at 1470 us/iter. Just now run1 finished a p ~103.9M job and started one with p ~104.9M, at which point I expected the timing skew to resume, this time in favor of run1 ... but no, timings remain unchanged, identical. But I see this latest p ~104.9M job lacks the extra "-DMM2_CHAIN=1u" OpenCL arg of the earlier one ... likely because it has p just below 104.9M, while the earlier job had p slightly above 104.9M.

Preda, I'm guessing -DMM2_CHAIN is an accuracy-related flag, which kicks in at the higher p-ranges of each FFT length? If so, what is the precise breakover point at 5.5M FFT?

Prime95 2020-05-09 03:48

[QUOTE=ewmayer;544932]
Preda, I'm guessing -DMM2_CHAIN is an accuracy-related flag, which kicks in at the higher p-ranges of each FFT length? If so, what is the precise breakover point at 5.5M FFT?[/QUOTE]

As you get closer and closer to the FFT limit, there are several improved-accuracy-but-slower versions. The flags are MM2_CHAIN=1,2,3 and MM_CHAIN=1,2,3. At a later date (I hope) there will also be an ULTRA_TRIG=1.

From FFTConfig.h: 5.5M FFT supports 18.489 bits-per-FFT-word which gets the slowest code.

From FFTconfig.cpp: {0.06964, 0.14050, 0.03840, 0.02710, 0.01719, 0.00497},
which says 0.00497 bpw from the max we ease up a little bit, at 0.00497+0.01719 bpw from the max we ease up a little more, and so forth.

ewmayer 2020-05-09 19:57

[QUOTE=Prime95;544933]As you get closer and closer to the FFT limit, there are several improved-accuracy-but-slower versions. The flags are MM2_CHAIN=1,2,3 and MM_CHAIN=1,2,3. At a later date (I hope) there will also be an ULTRA_TRIG=1.

From FFTConfig.h: 5.5M FFT supports 18.489 bits-per-FFT-word which gets the slowest code.

From FFTconfig.cpp: {0.06964, 0.14050, 0.03840, 0.02710, 0.01719, 0.00497},
which says 0.00497 bpw from the max we ease up a little bit, at 0.00497+0.01719 bpw from the max we ease up a little more, and so forth.[/QUOTE]

Thanks, but ITYM e.g. "within 0.00497 bpw of max we ease up a lot, within (0.00497+0.01719) we ease up a little less", etc. Because the math only works for me when I add all 6 ease-up fractions to get 0.29780, and (letting n = 5.5*2^20) observe that (18.489 - 0.29780)*n = 104911706.5..., which lies between the 2 exponents (104892731 and 104972429) just-on-either-side of the first (MM2_CHAIN=1) ease-up threshold.
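That threshold arithmetic can be sketched as follows (my reading of the constants George quoted above; which MM2_CHAIN/MM_CHAIN value each step corresponds to is not computed here):

```python
# Hedged sketch of the ease-up exponent thresholds for the 5.5M FFT,
# using the constants quoted above: 18.489 max bpw from FFTConfig.h and
# the margin list from FFTconfig.cpp.
FFT_WORDS = 5632 * 1024  # n = 5.5 * 2^20
MAX_BPW = 18.489
MARGINS = [0.06964, 0.14050, 0.03840, 0.02710, 0.01719, 0.00497]

thresholds = []
remaining = sum(MARGINS)  # 0.29780: total ease-up headroom below max bpw
for m in MARGINS:
    # The first accuracy step engages at (MAX_BPW - sum of all margins)
    # bits per word; each later step drops one margin from the front.
    thresholds.append(int((MAX_BPW - remaining) * FFT_WORDS))
    remaining -= m

print(thresholds[:2])  # first two step-up exponents
```

The first two values come out to 104911706 and 105313332, matching the hand calculation above and the observed behavior of the two ~104.9M jobs.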

And as I noted, on my system having one run at MM2_CHAIN=1 and the other with no ease-up counterintuitively gave me 2% more total throughput than with both runs using expos below the threshold, so I'd like to try forcing both of my current runs (which are below-threshold) to use MM2_CHAIN=1 to see what the resulting total throughput is. May I presume that forcing MM2_CHAIN=1 for an expo that does not need it is safe to do?

Prime95 2020-05-09 21:12

[QUOTE=ewmayer;544979]
And as I noted, on my system having one run at MM2_CHAIN=1 and the other with no ease-up counterintuitively gave me 2% more total throughput than with both runs using expos below the threshold, so I'd like to try forcing both of my current runs (which are below-threshold) to use MM2_CHAIN=1 to see what the resulting total throughput is. May I presume that forcing MM2_CHAIN=1 for an expo that does not need it is safe to do?[/QUOTE]

Yes, adding -use MM2_CHAIN=1 is perfectly safe

ewmayer 2020-05-09 21:55

[QUOTE=Prime95;544986]Yes, adding -use MM2_CHAIN=1 is perfectly safe[/QUOTE]
Cool - did this for my run of 104892731; the expected timing skew between the 2 runs resumed, and total throughput again went from ~1360 iter/sec to ~1390 iter/sec.

Then I also switched the other run (p = 103923257) to using the flag; timings again equalize, but at 1410 us/iter, meaning total throughput ~1420 iter/sec, a gain of 4.5% [!] over both runs using default settings. That's nearly as much gain as I get from upping my sclk setting from 4 to 5, but the latter ups the wattage by a massive 60W, and temps increase proportionally. Wattage currently is a mere 5-10W higher than before the switch to both runs using MM2_CHAIN=1.
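For the record, the throughput bookkeeping behind those numbers is just (a trivial sketch; the exact gain depends on rounding of the quoted timings):

```python
# Hedged sketch: total throughput of n concurrent instances, each
# taking t microseconds per iteration, is n * 1e6 / t iterations/sec.
def total_iters_per_sec(us_per_iter, n_jobs=2):
    return n_jobs * 1_000_000 / us_per_iter

base = total_iters_per_sec(1470)   # both runs at defaults
tuned = total_iters_per_sec(1410)  # both runs with MM2_CHAIN=1
print(f"gain: {100 * (tuned / base - 1):.1f}%")
```

With unrounded values the gain is a bit over 4%; the 4.5% figure comes from the rounded ~1360 and ~1420 iter/sec totals.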

Say I start making it the default ... if a run hits an expo which needs an even-higher extra-accuracy setting, will that automatically kick in, thus overriding the user's setting of the flag?

Prime95 2020-05-09 22:35

[QUOTE=ewmayer;544990]
Say I start making it the default ... if a run hits an expo which needs an even-higher extra-accuracy setting, will that automatically kick in, thus overriding the user's setting of the flag?[/QUOTE]

If you specify the MM2_CHAIN setting and a different MM2_CHAIN setting is auto-generated, I do not know which one will win.

ewmayer 2020-05-09 23:03

[QUOTE=Prime95;544995]If you specify the MM2_CHAIN setting and a different MM2_CHAIN setting is auto-generated, I do not know which one will win.[/QUOTE]

Guess I'll find out once the 5.5M wavefront gets closer to 106M ... if my calculations are correct, the next step-up to MM2_CHAIN=2 is for p > 105313332, so we're getting pretty close.

Related question for you & Mihai: Can the program determine at runtime if the MM2_CHAIN setting needs upping? Because ROEs are not necessarily monotonic with exponent (much depends on the particular DWT weights and their rounding-to-double), I found it useful in Mlucas to allow runtime detection of such conditions, culminating in an upping of FFT length (and reset of the same-FFT-length ease-up params) if the highest setting proved insufficient for the exponent under test. But that all relies on per-iteration ROE data sampling.

Oh, have you tried forcing MM2_CHAIN=1 in your own runs? It would be useful to see how broadly the "this runs faster" effect applies. Just Radeon VIIs? Just some subset thereof?

Prime95 2020-05-10 01:48

[QUOTE=ewmayer;544996]Guess I'll find out once the 5.5M wavefront gets closer to 106M ... if my calculations are correct the next step-up to MM2_CHAIN=2 is for p >105313332, so we're getting pretty close.[/quote]

The next step is the addition of MM_CHAIN=1

[quote]
Related question for you & Mihai: Can the program determine at runtime if the MM2_CHAIN setting needs upping? Because ROEs are not necessarily monotonic with exponent (much depends on the particular DWT weights and their rounding-to-double) I found it useful in Mlucas to allow runtime-detection of such conditions, culminating in an upping of FFT length (and reset of the same-FFT-length ease-up params) if the highest setting proved insufficient for the exponent under test. But that all relies on per-iteration ROE data sampling.[/quote]

I've found the average ROE does increase fairly predictably.

[quote]
Oh, have you tried forcing MM2_CHAIN=1 in your own runs? It would be useful to see how broadly the "this runs faster" effect applies. Just Radeon VIIs? Just some subset thereof?[/QUOTE]

Ah, the mysteries of the rocm optimizer. Preda and I generally time MIDDLE=10 for selecting the default optimization. Last time I tested (in rocm 3.1), no MM2_CHAIN was faster than MM2_CHAIN=1 for MIDDLE=10.

ewmayer 2020-05-13 21:13

Cross-posting from the "R7 @ newegg for $500" thread - the new build is alive, with the same Ubuntu 19.10 image I used to upgrade my Haswell system to host a Radeon VII (but that system remains on ROCm 2.10 for now), ROCm 3.3 installed, and the latest gpuowl built, but I'm having OpenCL issues - first hit a missing-shared-lib error on program invocation, which Paul Underwood helped me look into. Here is the OpenCL-install info from the system as of last night:
[code]apt-cache search libOpenCL
ocl-icd-libopencl1 - Generic OpenCL ICD Loader
libopencl-clang-dev - thin wrapper for clang -- development files
libopencl-clang9 - thin wrapper for clang
nvidia-libopencl1-331 - Transitional package for nvidia-libopencl1-340
nvidia-libopencl1-331-updates - Transitional package for nvidia-libopencl1-340
nvidia-libopencl1-340 - NVIDIA OpenCL Driver and ICD Loader library
nvidia-libopencl1-340-updates - Transitional package for nvidia-libopencl1-340
nvidia-libopencl1-384 - Transitional package for nvidia-headless-390[/code]
But none of the above was actually installed:
[code]apt list --installed | grep libopencl1

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.[/code]
Nothing further - so I did 'sudo apt install ocl-icd-libopencl1', which produces this entry in the above listing:
[code]ocl-icd-libopencl1/eoan,now 2.2.11-1ubuntu1 amd64 [installed][/code]
and solves the missing-shared-lib problem; now gpuowl starts but immediately coredumps:
[code]2020-05-13 13:31:31 gpuowl v6.11-278-ga39cc1a
2020-05-13 13:31:31 Note: not found 'config.txt'
2020-05-13 13:31:31 device 0, unique id 'df7080c172fd5d6e'
2020-05-13 13:31:31 df7080c172fd5d6e 104954387 FFT: 5.50M 1K:11:256 (18.20 bpw)
2020-05-13 13:31:31 df7080c172fd5d6e Expected maximum carry32: 50D10000
Segmentation fault (core dumped)[/code]

Prime95 2020-05-13 21:16

Did you install libncurses5? rocm-dev?

Does clinfo work?

ewmayer 2020-05-13 21:53

1 Attachment(s)
[QUOTE=Prime95;545303]Did you install libncurses5? rocm-dev?[/quote]
I did the same install I used for the Haswell system, which IIRC was geared toward ROCm 3.0 (or maybe it was 3.1), which I later overrode to 2.10 to be able to run:
[i]
wget -qO - [url]http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key[/url] | sudo apt-key add -
echo 'deb [arch=amd64] [url]http://repo.radeon.com/rocm/apt/debian/[/url] xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update && sudo apt install rocm-dev
[/i]
[QUOTE]Does clinfo work?[/QUOTE]

'clinfo' gives
[code]Command 'clinfo' not found, but can be installed with:

sudo apt install clinfo[/code]
so I did that; then 'clinfo' coredumped. As noted above, I already had the latest rocm-dev, but I also grabbed libncurses5 per your suggestion, and now clinfo gives the expected dumpage (compressed txt file attached), and we are looking good on the running-gpuowl front: getting ~1355 us/iter for each of 2 runs @5.5M FFT, expos ~105M (i.e. I didn't need to use the force-MM2_CHAIN=1 speedup trick, since that is the default for the expos queued up on this new build). That is appreciably faster than the 1410 us/iter I'm getting for my 2 jobs on the Haswell-system Radeon card; I wonder if ROCm 3.3 (new build) versus 2.10 (Haswell) might be the difference.

Prime95 2020-05-14 01:12

install libncurses5

kriesel 2020-05-14 09:57

Test= no-go, DoubleCheck= ok
 
[CODE]2020-05-14 04:53:20 gpuowl v6.11-278-ga39cc1a
2020-05-14 04:53:20 config: -user kriesel -cpu asr2/radeonvii3 -d 3 -use NO_ASM -maxAlloc 15000
2020-05-14 04:53:20 device 3, unique id ''
2020-05-14 04:53:20 asr2/radeonvii3 worktodo.txt line ignored: "Test=(AID),91493761,77,1"
2020-05-14 04:53:20 asr2/radeonvii3 Bye
[/CODE]

kriesel 2020-05-16 13:45

Mihai,

Please add pseudorandom shift to gpuowl. Its absence is interfering with doublecheck sampling of higher exponents. (I'm attempting to fill in double checks for LL and for PRP from the current state to where there's at least one of each for every million-exponent-range bin up to 200M, well ahead of the first-test wavefront. [URL]https://www.mersenneforum.org/showpost.php?p=501177&postcount=3[/URL] [URL]https://www.mersenneforum.org/showpost.php?p=501181&postcount=6[/URL]) As Radeon VIIs become more common in the GIMPS fleet, and further conversion from cudalucas to gpuowl occurs on NVIDIA, the issue will become more common in LL and PRP at the wavefront also. It's tedious to check shifts one by one, and I missed a few.

[CODE]2020-05-14 18:31:17 asr2/radeonvii2-w2 140000177 OK 139800000 99.86%; 2590 us/it; ETA 0d 00:09; 420066ee63e325a2 (check 1.42s)
2020-05-14 18:39:59 asr2/radeonvii2-w2 140000177 OK 140000000 100.00%; 2604 us/it; ETA 0d 00:00; d33ef20fe4d7b3c8 (check 1.54s)
{"status":"C", "exponent":"140000177", "worktype":"PRP-3", "res64":"892fa228d6b157__", "residue-type":"1", "errors":{"gerbicz":"0"}, "fft-length":"8388608", "program":{"name":"gpuowl", "version":"v6.11-278-ga39cc1a"}, "user":"kriesel", "computer":"asr2/radeonvii2-w2", "timestamp":"2020-05-14 23:40:02 UTC"}[/CODE](zero shift matches mrh.org's zero shift earlier run, so PrimeNet server rejects this doublecheck submission) [URL]https://www.mersenne.org/report_exponent/?exp_lo=140000177&exp_hi=&full=1[/URL]

[CODE]2020-05-15 10:07:43 asr2/radeonvii 152171251 OK 152150000 99.99%; 2624 us/it; ETA 0d 00:01; 09166e3101f3f7a1 (check 1.53s) 28 errors
{"status":"C", "exponent":"152171251", "worktype":"PRP-3", "res64":"d4e28827ea97dd__", "residue-type":"1", "errors":{"gerbicz":"28"}, "fft-length":"8388608", "program":{"name":"gpuowl", "version":"v6.11-278-ga39cc1a"}, "user":"kriesel", "computer":"asr2/radeonvii", "timestamp":"2020-05-15 15:08:42 UTC"}[/CODE](zero shift matches Mihai's zero shift earlier run, so PrimeNet server rejects this doublecheck submission) [URL]https://www.mersenne.org/report_exponent/?exp_lo=152171251&exp_hi=&full=1[/URL]
The good news is that the PRP res64s on that one match to the extent they can be checked, despite 28 GEC errors detected and calculations redone from the previous check.
Well done, Dr. Gerbicz, Mihai, George, et al.

LaurV 2020-05-17 13:16

[QUOTE=kriesel;545523]Please add pseudorandom shift to gpuowl. Its absence is interfering with doublecheck sampling of higher exponents.[/QUOTE]
+1. As it is now, the owl is not very appealing... Besides some double-checking of old work, I can't do much with it, and soon I will [URL="https://www.mersenneforum.org/showthread.php?p=545606"]switch back[/URL] to my "forzes" and mfaktc, putting the "sevens" back in the store room.

kriesel 2020-05-17 13:53

[QUOTE=kriesel;545523]Mihai,

Please add pseudorandom shift to gpuowl. Its absence is interfering with doublecheck sampling of higher exponents. (I'm attempting to fill in double checks for LL and for PRP from the current state to where there's at least one of each for every million-exponent-range bin up to 200M, well ahead of the first-test wavefront. [URL]https://www.mersenneforum.org/showpost.php?p=501177&postcount=3[/URL] [URL]https://www.mersenneforum.org/showpost.php?p=501181&postcount=6[/URL])[/QUOTE]
Another one, matching Roland Clarkson's first test but rejected by the server:
[CODE]2020-05-15 00:28:04 asr2/radeonvii2 121642771 OK 121600000 99.96%; 2151 us/it; ETA 0d 00:02; f394cb39ecc84d04 (check 1.16s)

{"status":"C", "exponent":"121642771", "worktype":"PRP-3", "res64":"a3569f57e1792d__", "residue-type":"1", "errors":{"gerbicz":"0"}, "fft-length":"7340032", "program":{"name":"gpuowl", "version":"v6.11-278-ga39cc1a"}, "user":"kriesel", "computer":"asr2/radeonvii2", "timestamp":"2020-05-15 05:29:38 UTC"}

[/CODE][URL]https://www.mersenne.org/report_exponent/?exp_lo=121642771&exp_hi=&full=1[/URL]

kriesel 2020-05-20 00:04

gpuowl-win v6.11-285 build
 
2 Attachment(s)
Here it is. See the commit descriptions at github for what's new or changed. [url]https://github.com/preda/gpuowl[/url]

preda 2020-05-20 03:35

[QUOTE=kriesel;545611]Another one, matching Roland Clarkson's first test but rejected by the server:
[CODE]2020-05-15 00:28:04 asr2/radeonvii2 121642771 OK 121600000 99.96%; 2151 us/it; ETA 0d 00:02; f394cb39ecc84d04 (check 1.16s)

{"status":"C", "exponent":"121642771", "worktype":"PRP-3", "res64":"a3569f57e1792d__", "residue-type":"1", "errors":{"gerbicz":"0"}, "fft-length":"7340032", "program":{"name":"gpuowl", "version":"v6.11-278-ga39cc1a"}, "user":"kriesel", "computer":"asr2/radeonvii2", "timestamp":"2020-05-15 05:29:38 UTC"}

[/CODE][URL]https://www.mersenne.org/report_exponent/?exp_lo=121642771&exp_hi=&full=1[/URL][/QUOTE]

How do you get the assignments? I understand that the manual assignment page is smart enough to only return DC assignments that had a non-zero shift initially. If the server does not do that (but I was under the impression it did), maybe you could also verify that the initial LL had non-zero shift before starting the gpuowl DC on it.

LaurV 2020-05-20 04:46

Mihai, you miss the point. It doesn't matter how and where we got those assignments from. The tool doesn't work properly.

Assuming there is a yet-to-be-uncovered error in gpuOwl's multiplication routines, always starting with shift zero means always operating on the same data, therefore always producing the same incorrect result, and this we cannot check.

This is most relevant for LL tests, or for P-1, where there is no GC. A random (or user-specified) shift at startup is a must-have. It ensures not only the sanity of the tests, allowing re-testing/DCing/etc. of ANY former result (including results produced by gpuOwl itself) and therefore adding [U]more utility[/U] to the tool, but also the [U]sanity of the code itself[/U]. That is where we (and Ken) are barking.

This is by no means an attempt to undermine your work. You made an amazing effort to implement all the FFT and related machinery yourself from scratch, to make it faster, and to share it with the community, and we really appreciate you for it. As a programmer myself, I can testify to the huge effort and knowledge needed for such a task. Now, why don't you want to make it... properly? :razz:

This should include shifts, proper file-names, and keeping the history, as cudaLucas does. Until then, many of us will still prefer cudaLucas, even if they don't have the time/guts/whatever to say so publicly here.

Everybody wants to use gpuOwl, because it is faster. But as it is now, its usefulness is quite limited: it can only be used to double check old runs which were NOT done with a zero shift. We are too paranoid to use it for new tests, and assuming the happiest case where more and more people start to use it, we will reach a point where gpuOwl users will have to WAIT for other people to complete P95 tests, to have something to DC for themselves. As an R7 is about 6-7 times faster at PRP than a 10-core i7 processor, it is enough that 1/7 of the users put their cards to work, and we are in the mud. This may seem far-fetched and long in the future, because many users don't have an R7, but they also don't have 10-core CPUs. The future may arrive sooner than most of us imagine. I already have a [U]list[/U] of tests which were PRP'd and DC'd in parallel runs on two cards, and I could not report the DC because of the same shift. I don't cry for credit or candies, but first of all, this is a waste of resources, and it slows the project down in the long run, as somebody will have to re-do in the future the work I already did and cannot report.

Also, every time gpuOwl produces a mismatch, we will still need to wait for the (slow) P95 run for the TC. And there are many other situations.

But this is not even the worst of it. The worst is that, despite having done TWO RUNS in parallel, I am still not confident that the result is correct. I am only confident that there was no hardware error, as both runs produced the same final residue, so my hardware is sane. But [U][B]I cannot be sure[/B][/U] (and I mean the general "I" here) that the FFT implementation is correct, because both instances started with the same shift, so they dealt with the same data throughout the test. If there is an error in the code, then both runs have the error. And my paranoia won't let me sleep... hehe...

Your job must be to offer an [U]alternative[/U] to P95, not a secretary to it.

kriesel 2020-05-20 09:38

For us hard-core GIMPSters, the throughput gap between gpuowl and CPUs will become much larger than what LaurV has stated. One modest 4-core CPU can support several Radeon VIIs (or Radeon Pro VIIs when they come out), given a suitable power supply, motherboard and chassis. Something like that is what George and Ernst are now doing, and others too. The power/performance efficiency of the Radeon VII will drive it that way.
We need nonzero shift in gpuowl, both PRP and LL. You've done it before in LL. Please bring it back.
Other error detection measures would be very welcome too. (You've done the Jacobi check before too.)

kriesel 2020-05-20 09:49

[QUOTE=preda;545913]How do you get the assignments? I understand that the manual assignment page is smart enough to only return DC assignments that had a non-zero shift initially. If the server does not do that (but I was under the impression it did), maybe you could also verify that the initial LL had non-zero shift before starting the gpuowl DC on it.[/QUOTE]I think you understand correctly. Where there is only one completed first test in a million-exponent-range bin, I will sometimes run a test without an assignment if I cannot get one. See [URL]https://mersenneforum.org/showpost.php?p=545929&postcount=14[/URL] Some of that gets done when I'm too tired or not fully awake yet. I've overlooked the 0-offset coincidence problem at times. (Chalsall's "Never send a man to do a computer's job" comes to mind here.) A pseudorandom offset in gpuowl, as in essentially all other GIMPS production primality testers, would make that a nonissue and make gpuowl their equal. These have it:
mprime/prime95;
mlucas;
cudalucas
(I think cllucas did not, and was not used much.)

axn 2020-05-20 11:08

[QUOTE=kriesel;545930]We need nonzero shift in gpuowl, both PRP and LL. You've done it before in LL. Please bring it back.
Other error detection measures would be very welcome too. (You've done the Jacobi check before too.)[/QUOTE]
LL is not the future. PRP is. Non-zero shift is a relic of the LL days without an effective error check. It is completely unnecessary with PRP/GEC.

IMO, it is high time we made PRP the default test type and started forcing everyone to use it instead of first-time LL tests. [Yes, I know why it can't happen -- damn older clients].
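For readers who haven't seen it, the Gerbicz error check (GEC) that axn refers to rests on a simple identity: for the PRP-3 sequence x_i = 3^(2^i) mod Mp, the running product d of every L-th residue satisfies d_{t+1} = 3 * d_t^(2^L) mod Mp, so it can be recomputed independently of the main squaring chain. A minimal sketch with toy parameters (this is an illustration of the principle, not gpuowl's actual implementation; real tests use p ~ 10^8 and much larger blocks):

```python
# Gerbicz check (GEC) sketch for a PRP-3 test of Mp = 2^p - 1.
# x_i = 3^(2^i) mod Mp;  d_t = product of x_0, x_L, ..., x_{tL}.
# Identity: d_{t+1} == 3 * d_t^(2^L) (mod Mp), so an independent
# recomputation of d catches errors in the main squaring chain.

p = 11                 # toy exponent
Mp = (1 << p) - 1      # 2047
L = 4                  # checkpoint block size (toy value)

x = 3                  # x_0
d = x                  # running product d_0 = x_0
for t in range(5):     # five blocks of L squarings
    for _ in range(L):
        x = x * x % Mp          # the main PRP squaring chain
    d_next = d * x % Mp         # cheap update: multiply in x_{(t+1)L}
    # independent verification: L squarings of d, times x_0 = 3
    check = pow(d, 1 << L, Mp) * 3 % Mp
    assert d_next == check, "Gerbicz check failed: error in squarings"
    d = d_next
print("all Gerbicz checks passed")
```

An undetected hardware error would have to corrupt both the squaring chain and the product accumulation in exactly compensating ways, which is why the check is considered so strong.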

Jan S 2020-05-20 11:59

@axn:

But LL double/triple checks are still here. This morning I started one triple check via gpuowl, but when I read this thread, I stopped it.

preda 2020-05-20 12:13

[QUOTE=LaurV;545918]Mihai, you miss the point. It doesn't matter how and where we got those assignments from. The tool doesn't work properly.
[...]
Your job must be to offer an [U]alternative[/U] to P95, not a secretary to it.[/QUOTE]

Laur, I do consider all feature requests and I try not to reject dogmatically. (I also try to keep the scope, and the code-size, to a minimum.)

Talking about LL and PRP (i.e. ignoring P-1 for a moment), I think the offset is most useful for LL. For LL with gpuowl, the focus is on double-checking past LL results. The majority of past LL was done with a non-zero offset in mprime. Validating an mprime non-zero-offset result with gpuowl is very strong; stronger than validating different offsets with a single program.

So, what is the use-case that is a pain point for you, that is not covered? Are you doing first-time LL with GPUs? If so, maybe you should try to do PRP instead of LL.

The number of first-time LL tests done with gpuowl should be a tiny minority, and that minority can be checked with mprime without any difficulty.

preda 2020-05-20 12:30

Jacobi is back
 
Hi, in a recent commit [url]https://github.com/preda/gpuowl/commit/7b1f67e96124cfbd0c7a8821fea13478c192bb3d[/url]
I try to bring back the Jacobi check for LL. This is how it works and what changes for LL:

1. When: by default a Jacobi check will be done every 1M iterations. This can be configured with the -jacobi <step> command line argument, giving it a number of iterations. The check is rather slow (on the order of 1 minute) and takes up one CPU core, so I think it shouldn't be done too often (thus the default of 1M iterations).

2. Savefiles: an LL state is only ever saved after a successful Jacobi check. There is no possibility of an LL save that did not pass Jacobi. This, combined with the above point about the frequency of Jacobi, means that the frequency of saves is reduced (by default, every 1M iterations). The Jacobi check is also triggered on exit (Ctrl-C); thus, if the user is willing to wait the ~1 minute after Ctrl-C, the savefile will be up-to-date. OTOH, if there's a power cut, no luck.

3. Moving backwards: the check is done in the background on the CPU while the LL test keeps advancing. In the eventuality that the background Jacobi fails, the test should automatically resume from the most recent savepoint.

4. Logging: the log-lines for LL now contain these codes:
"LL": a simple not-checked log line of LL
"OK": an iteration that passed Jacobi
"EE": an iteration that failed Jacobi

There may be bugs, as usual.
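For reference, the math behind the LL Jacobi check: with s_0 = 4 and s_{i+1} = s_i^2 - 2 mod Mp, the Jacobi symbol (s_i - 2 | Mp) equals -1 for every i >= 1, so any checkpoint residue can be tested in isolation, and a random corruption flips the symbol with probability about 1/2 (hence the 50% detection figure quoted elsewhere in this thread). A small self-contained demo with a toy exponent (an illustration of the principle, not gpuowl's code):

```python
# LL Jacobi error check: for the Lucas-Lehmer sequence s_0 = 4,
# s_{i+1} = s_i^2 - 2 (mod Mp), the Jacobi symbol (s_i - 2 | Mp)
# is -1 for all i >= 1.  A random error flips it with prob. ~1/2.

def jacobi(a, n):
    """Jacobi symbol (a|n) for odd n > 0, via binary reciprocity."""
    assert n > 0 and n % 2 == 1
    a %= n
    result = 1
    while a:
        while a % 2 == 0:              # pull out factors of 2
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a                    # quadratic reciprocity swap
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0

p = 7                                  # toy exponent; M7 = 127 is prime
Mp = (1 << p) - 1
s = 4
for i in range(1, p - 1):              # the p-2 LL iterations
    s = (s * s - 2) % Mp
    assert jacobi(s - 2, Mp) == -1, f"Jacobi check failed at iter {i}"
print("LL says prime" if s == 0 else "LL says composite")
```

Because the property holds at every iteration, the check can be run on a snapshot of the residue in the background, exactly as point 3 above describes.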

axn 2020-05-20 12:31

[QUOTE=Jan S;545944]@axn:

But LL double/triple checks are still here. This morning I started one triple check via gpuowl, but when I read this thread, I stopped it.[/QUOTE]

Sure. But it doesn't make sense to invest time/effort in a dead end. There are cudalucas, cllucas, older versions of gpuowl, etc. for that purpose. Also, looking at the points preda raised, it might be possible to double-check with zero shift if the original had a non-zero one. I say "might" because I don't know whether the server will accept or reject it; it should accept it, but I don't know.

preda 2020-05-20 12:35

[QUOTE=Jan S;545944]@axn:

But LL double/triple checks are still here. This morning I started one triple check via gpuowl, but when I read this thread, I stopped it.[/QUOTE]

I think LL double-check with gpuowl is fine. Because it double-checks a previous LL that was done with non-zero offset, and with a different software, thus the zero-offset gpuowl check is as strong as can be.

The only problem appears when attempting to double-check gpuowl LL with gpuowl, that is not a good idea.

preda 2020-05-20 12:43

[QUOTE=LaurV;545918]
This should include shifts and proper file-names, and keeping the history, as cudaLucas is doing[/QUOTE]

What is the problem with the file-names?

LaurV 2020-05-20 13:31

[QUOTE=preda;545946]Hi, in a recent commit ...

1. Jacobi check will be done every 1M iterations.
2. Savefiles:
3. Moving backwards:
4. Logging: [/QUOTE]
Hi Mihai, thanks for the fast answer. I totally agree with what you said. 1M is often enough; for a 100M or 332M test respectively, I would run it every 5M or 10M or more iterations, and 20 minutes over 3 days, or in the larger case 33 minutes over two or three weeks, is not too much. I could happily afford this time if it ensures the sanity of the test within reasonable limits.
Points 3 and 4 already work like that in the version I run.

I was never either pro or contra GC/JC, and personally it won't hurt nor help me. (You may have noticed that I didn't make any comment related to GC/JC.)
My "test case" has been, for years now, TWO cards running in parallel and checking the residues at every checkpoint. This way, no time is lost when a mismatch happens, and I consider that such a procedure can't be further optimized, no matter what other people think. I still do this with cudaLucas, but now it is [U]your fault[/U] because you made gpuOwl faster :razz: and I can't stop wishing to switch to it, and therefore I can't stop bothering you to make it to my liking :smile:

What does one need for my "way"?

First, two identical cards. Or however many cards one has, they have to come in pairs. This I have.

Second, a fast software. That is for now, gpuOwl.

Third, because the two instances run in the same computer, they should operate on different sets of data. This will expose any bugs in the software: as long as both copies run on different data and get the same result, the software is "sane", the FFT squaring is "sane". Otherwise, we don't know, unless a P95 (or other) test is run and we can compare the results. GC is still prone to errors, and when you have a hundred million bits iterated a hundred million times, the errors are usually cleverer than us. The chance is negligible, but not zero. Additionally, how do you "convince" PrimeNet to accept your DCs? If both were done as "no-shift" tests, they are not two different tests. This just ensures the hardware is sane, as I already said, but it says nothing about the sanity of the software.

Fourth, a history. Sometimes, against all our precautions, one card gets faster than the other, because we do other work on the computer, or play games, or watch videos, etc., and when a mismatch happens, one card may be "more than two checkpoints" ahead of the other. At that moment, the only way to continue the test is to start both instances from scratch, wasting a lot of time: days, or even weeks. Because you don't know which one was wrong, and the program only keeps the last two checkpoints. All checkpoints should be kept, the way cudaLucas does it, and deleted manually by the user at the end of the test. You can provide an option to delete them automatically for the lazy users, but I strongly DO consider that manually checking your folders and cleaning your checkpoints once or twice per month (one 332M test takes about 17 days on an R7, and about double that in all the other cases) is not "too much" for the user. I mean, how lazy can one be?

You may say that such errors won't happen often, and that their chance of happening when one card is more advanced than the other is slim, especially with GC active, but trust me, these things DO happen. They may be "extremely rare", but every time such a thing happened, losing a week or two of [U]two[/U] cards would be totally pissing off, and I love my monitor so much, I don't want to break it with my head.

In fact, in such a situation, it would be better to let both tests finish and report both residues, in the hope that one of them is good and you didn't waste the time, at least for one card. But this is another can of worms: you will end up with everybody reporting two results, one correct and one fake, and claiming they ran two tests in parallel when they in fact did only one (some people may do this, for credits, whatever).

And I could argue like that endlessly... The moral is that we need a history, i.e. instead of deleting old files, rename them "exponent.iteration.residue.whatever", where the first 3 fields are must-haves, so our comparison/resume tools work with minimum changes :blush:.

LaurV 2020-05-20 13:42

[QUOTE=preda;545950]What is the problem with the file-names?[/QUOTE]
Crosspost. I already replied, but to make it clear:

The file names should contain the exponent, the iteration, and the residue. This is a must, for easy sorting, comparing, etc. Old files should not be deleted, but renamed properly and kept in the folder. The residue is needed in the name because (in the case of shift/offset) the contents of the files are different and cannot be used for comparison. We discussed this in the past, and you came up with the idea of putting the header inside the file. Which is very good, but why would I need to open all the >50MB files and "cat" them to get the residues? If I have the same files in both folders, it means the test is running smoothly. If one folder has 55418387.12000000.adef1234cdeb9876.ll and the other folder has 55418387.12000000.def1234cdeb98765.ll instead, I know immediately that one card is in the weeds, and I can stop and resume both from the last good checkpoint. I don't need any tool for that, just sharp eyes and fast fingers.

preda 2020-05-20 13:49

[QUOTE=LaurV;545954]
My "test case" is for years now, TWO cards running in parallel, and checking the residues at every checkpoint. This way, there is no time lost when the mismatch happens, and I consider that such procedure can't be further optimized, no matter what other people think.[/QUOTE]

OK, I understand. At some point I was doing something similar, but on a single GPU, by running every iteration twice and comparing the residues for equality. That allows, as you say, detecting errors as early as possible, but at the cost of halving the throughput.

Running PRP, you'd detect the errors just as well, and you'd double the capacity. Give it a try -- if you succeed in producing a failure of the check (i.e. a non-detected error), as you suspect may be possible, that would be a momentous achievement (but also much more difficult than simply finding the next Mersenne prime, IMO :)

LaurV 2020-05-20 13:56

[QUOTE=preda;545958]Give it a try -- if you succeed in producing a failure of the check (i.e. a non-detected error), as you suspect may be possible, that would be a momentous achievement (but also much more difficult than simply finding the next Mersenne prime, IMO :)[/QUOTE]
How can I? Do you mean to run a million PRP tests, report the results, and then wait for somebody to run P95 on them? :w00t:
The problem is exactly THAT: you run two tests that will always match, as long as there are no glitches in the hardware, no matter what you do in the software, because you always do the same thing applied to the same data. You don't know whether they have an error unless you have a reference (an etalon). Maybe that is why no error has been found up to now, and not because GC is so strong... (ranting here... I understand the math part).

preda 2020-05-20 14:00

[QUOTE=LaurV;545957]Crosspost. I already replied, but to make it clear:

The file names should contain the exponent, the iteration, the residue. This is a must, for easily sorting, comparing, etc. Old files should not be deleted, but renamed properly and kept in the folder. The residue is needed in the name because (in case of shift/offset) the content of the files are different and can not be used for comparison. We discussed this in the past, and you came with the idea of putting the file header inside. Which is very good, but why would I need to open all >50MB files and "cat" them to get the residues? If I have the same files in both folders, it means the test is running smooth. If one folder has 55418387.12000000.adef1234cdeb9876.ll and the other folder has 55418387.12000000.def1234cdeb98765.ll instead, I know immediately that one card is in the weeds and I can stop and resume both from the last, I don't need any tool for that, just sharp eyes and fast fingers.[/QUOTE]

For PRP, any savefile is validated before being written, thus the possibility of having different savefile residues for the same iteration of the same exponent simply does not exist.

I'll think about doing something better about keeping those files around.

It shouldn't be hard to make a script tool (bash, perl, etc.) that renames the files, adding the residue (easily parsed from the first line) to the file-name, if desired.

(but anyway the proper fix is to switch to PRP)
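The script preda suggests is indeed short. A Python sketch follows; note that the savefile naming (`<exponent>.<iteration>.ll`) and the assumption that a 16-hex-digit res64 appears on the first text line are taken from LaurV's example above, not from any particular gpuowl version, so adapt the glob and the regex to what your savefiles actually contain:

```python
# Sketch of the rename tool preda describes: append the res64,
# parsed from the savefile's first text line, to the file name,
# yielding names like 55418387.12000000.adef1234cdeb9876.ll that
# can be eyeball-compared across two folders.  The header layout
# is an ASSUMPTION here; adjust the regex/glob to your gpuowl
# version's actual savefile format and naming.

import re
from pathlib import Path

def rename_with_residue(folder, pattern=r"\b([0-9a-fA-F]{16})\b"):
    """Rename <exp>.<iter>.ll files to <exp>.<iter>.<res64>.ll."""
    for f in Path(folder).glob("*.ll"):
        with open(f, "rb") as fh:
            header = fh.readline().decode("ascii", errors="replace")
        m = re.search(pattern, header)     # first 16-hex-digit token
        if m is None:
            continue                        # no residue found; skip
        res64 = m.group(1).lower()
        if res64 in f.name:
            continue                        # already renamed earlier
        f.rename(f.with_name(f"{f.stem}.{res64}.ll"))
```

Run it periodically in each instance's folder, e.g. `rename_with_residue(".")`, along the lines of the batch loop LaurV mentions below (check, rename, sleep, repeat).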

preda 2020-05-20 14:03

[QUOTE=LaurV;545959]How can I? Do you mean to run a million PRP tests, report the result, and then wait for somebody to run P95 on them? :w00t:
The problem is exactly THAT: you run two tests that will always match, as long as there are no glitches in the hardware, no matter what you do in the software, because you always do the same thing, applied to the same data. You don't know if they have an error, unless you have an etalon. Maybe that is why there was no error found up to now, and not because GC is so strong... (ranting here...I understand the math part).[/QUOTE]

Well, I guess you run it twice and detect a difference. Running it with a different FFT setup would be stronger than just a different offset.

(no, I don't actually recommend doing that, would be a waste of valuable resources)

LaurV 2020-05-20 14:13

[QUOTE=preda;545960]For PRP, any savefile is validated before being written, thus the possibility of having different savefile residues for the same iteration of the same exponent simply does not exist.
[/QUOTE]
Running in circles here. You assume GC never fails; if that were the case, all this discussion would be futile.

[QUOTE]
Shouldn't be hard to make a script tool (bash, perl, etc) that would rename the files adding the residue which is easily parsed from the first line to the file-name if desired.
[/QUOTE]I said in the forum that I already made one (a batch file: check if the file exists, rename it, sleep 10 minutes or so, repeat).

[QUOTE]
(but anyway the proper fix is to switch to PRP)[/QUOTE]

I promise you I will switch to PRP and [U]continue to run 2 instances of the same test in parallel[/U] until I find that GC failure :razz:, if you give me the shift and the history (to be able to resume efficiently and to convince PrimeNet to accept my DCs; otherwise I only get candies for half of the effort and waste the other half, hehe).

And when I catch you in Thailand, after all this craziness with corona ends, I will force you to drink all the beer I find in the fridge.

kriesel 2020-05-20 15:40

[QUOTE=axn;545938]LL is not the future. PRP is. Non-zero shift is a relic of LL days without effective error check. It is completely unnecessary with PRP/GEC.

IMO, it is high time we made PRP the default test type and started forcing everyone to use it instead of first-time LL tests. [Yes, I know why it can't happen -- damn older clients].[/QUOTE]I agree that a PRP first test with the excellent GEC is preferable to LL with the Jacobi check and its 50% error detection probability, or to LL without Jacobi as in CUDALucas and most gpuowl LL-supporting versions.
But the realities are that a great deal of LL is still being done by various programs.
[URL]https://www.mersenne.org/assignments/?exp_lo=90208999&exp_hi=92000000&execm=1&exdchk=1&exp1=1&extf=1[/URL] is almost all LL first tests. A check I did several months ago showed a nearly 50/50 mix of PRP and LL in recent results. And a great deal of LL was done in the past.
Even in PRP, shift has advantages. Suppose that the exponent under test is close enough to the limit of an fft length that roundoff error becomes an issue. A different shift may avoid the case where roundoff error repeatedly generates a Gerbicz error.
I think the case for preferring or requiring PRP first testing is stronger when pseudorandom or specifiable nonzero shift becomes available in gpuowl.
DC is still necessary for PRP for multiple reasons. (Errors have been observed outside the GEC check; users make manual reporting errors, and there is no PrimeNet API connection for gpu programs; there is no reliable built-in validation code to confirm actual work done; some rare few users submit falsified results intentionally.)

Gpuowl can't DC gpuowl without differing shift. The result is not accepted by the server.
As I recall, Ernst opined (from his Mlucas development) that adding shift has little if any effect on performance. It may be that Mihai and George choose to spend their time now on obtaining further performance. (And we are very appreciative of their efforts and results in this area.) Diminishing returns will occur. Perhaps they'll add shift later. When they do, I hope it is to both LL and PRP in gpuowl. I think the ideal would be pseudorandom shift by default, with the user able to specify a particular shift for QA test purposes.

kriesel 2020-05-20 16:07

Two gpuowl instances running on the same gpu reportedly helps total testing throughput. But not always. Test what you run. GTX10x0 seems not to benefit in PRP.
A particularly severe case of lowered throughput I saw recently on a Radeon VII follows.
1 instance, 48M fft PRP, 10510 us/iter, 95.15 iter/sec;
1 instance, 8M fft LL, 1382 us/iter, 723.6 iter/sec;
These two run together: 8M fft LL, 6610 us/iter (151.3 iter/sec, 20.9% of solo throughput), plus 48M fft PRP, 52438 us/iter (19.07 iter/sec, 20.04% of solo throughput), combining for just 40.94% of solo throughput.
It's probably best to run same computation type, same fft size, or perhaps very similar size.
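The percentages in the post above follow directly from the quoted us/iter figures; a quick sanity-check of the arithmetic (all input numbers are from the post itself, this is not a new benchmark):

```python
# Recompute the shared-GPU throughput figures quoted above from the
# raw us/iter numbers (all values taken from the post itself).

def ips(us_per_iter):
    """Iterations per second from microseconds per iteration."""
    return 1e6 / us_per_iter

solo_ll, solo_prp = ips(1382), ips(10510)       # ~723.6 and ~95.15 it/s
shared_ll, shared_prp = ips(6610), ips(52438)   # the two run together

rel_ll = shared_ll / solo_ll        # LL fraction of its solo speed
rel_prp = shared_prp / solo_prp     # PRP fraction of its solo speed
combined = rel_ll + rel_prp         # total, relative to one solo instance
print(f"{rel_ll:.1%} + {rel_prp:.1%} = {combined:.2%} of solo throughput")
```

The combined figure comes out at about 41%, matching the post to rounding.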

LaurV 2020-05-20 16:58

My bad wording. Sorry. I meant 2 instances, each in its own card.

kriesel 2020-05-20 17:53

[QUOTE=LaurV;545985]My bad wording. Sorry. I meant 2 instances, each in its own card.[/QUOTE]I thought that was clear. And now it's clearer.
The 2 on one card I wrote of is about performance tweaking.

CUDALucas and LL will be with us for a while yet, not only because of user inertia, but because some GPUs cannot run gpuowl yet can run CUDALucas: any NVIDIA GPU that does not support a sufficient subset of OpenCL. Some, such as old Teslas and Quadros, have DP performance that is stronger relative to their 32-bit-int performance than others.

kriesel 2020-05-20 18:04

Jacobi returns, and Middle15 appears
 
2 Attachment(s)
Win7 x64 build of gpuowl v6.11-288-g20c4213 attached.
I didn't notice until now, but sometime between 6.11-219 and 6.11-255, the ffts longer than 96M (up to 192M) were dropped.

ewmayer 2020-05-20 19:45

Mihai, with all (and great) respect, I'm trying to understand your resistance to supporting shifted residues. Please let me know if any of the following are incorrect:

1. You previously supported shift, at least for LL, perhaps also for PRP (and if you only had it for LL, note that it's even more trivial to support for PRP, because one does not need to do the per-iteration "at which bit location does the -2 get injected?" computation LL needs, one only needs to update the shift count via doubling-mod-p);

2. You noted that supporting shift did not adversely affect performance.

So why not simply reactivate the old shift-supporting code segments? Has the main codebase changed so much in the meanwhile that doing so would be a major pain?

Knowing that 2 tests, whether using the same or different codes, run at the same FFT length are using fundamentally different residue-word data is a big deal, confidence-in-result-wise. "Gerbicz check is totally foolproof" is [a] a surmise, based on limited run numbers with independent-program DCs for cross-checking, and [b] is implementation-dependent, in similar fashion to "a mathematically foolproof cryptography scheme can be nullified by a flawed software implementation".

If it's relatively trivial to support, why not do so? Restore the hopefully-modest amounts of code needed to support it, and then move on. Shift-related code is simple enough that once deployed, it doesn't need further attention. The next time I expect to need to revisit my shift-related code in Mlucas will be if and when a major vendor puts out a 1024-or-more-bit SIMD architecture, at which point I'll need to make some modest changes to the LL-test carry routines to inject the per-iteration -2 into the proper one of the 16 doubles in the corresponding carry-step SIMD vector-of-residue-words datum.
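For readers unfamiliar with the mechanics ewmayer describes: a residue stored with a circular shift of s bits is just r * 2^s mod Mp, and since 2^p == 1 (mod Mp), squaring sends the shift count to 2s mod p; that doubling is the only per-iteration bookkeeping PRP needs. A toy demonstration (illustrative only, not code from any of the programs discussed):

```python
# Residue shift bookkeeping for PRP mod Mp = 2^p - 1.  Storing the
# residue rotated left by s bits equals storing r * 2^s mod Mp.
# Because 2^p == 1 (mod Mp), squaring the shifted residue simply
# doubles the shift count modulo p -- nearly free per iteration.

p = 13                          # toy exponent
Mp = (1 << p) - 1               # 8191
r = 2020                        # some residue
s = 5                           # current shift count

shifted = r * pow(2, s, Mp) % Mp

# square both the true residue and the shifted one
r2 = r * r % Mp
s2 = (2 * s) % p                # the only extra work: double the shift
assert shifted * shifted % Mp == r2 * pow(2, s2, Mp) % Mp

# "unshifting" recovers the true residue at any checkpoint
assert shifted * pow(2, (p - s) % p, Mp) % Mp == r % Mp
```

The LL-specific wrinkle ewmayer mentions is that the per-iteration "- 2" must be injected at the bit position given by the current shift of the rotated residue, which is the extra computation PRP avoids.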

kriesel 2020-05-20 20:39

Reviewing my notes and some old posts, I see Mihai implemented shift for LL back at gpuowl v0.3, indicating a 0.5% overhead then, and dropped it by v0.6, when he first implemented the Jacobi check with a stated overhead of 0.2%. (See links at [URL]https://www.mersenneforum.org/showpost.php?p=489083&postcount=7[/URL])

If those values are still roughly current, I would be fine with a 0.7% overhead or a bit more to gain some reliability through differing shifts and error detection combined in the same build. People like LaurV who run dual runs probably would be too. I think some P-1 error checking at 2% overhead, perhaps more, would be ok.

For LL, perhaps it could be made optional, such as by -use no_offset,no_jacobi, with offset and jacobi check on by default. It would be good if those only affected LL worktodo lines and did not prevent gpuowl from running mixed work in the same worktodo file; LL followed by P-1 and PRP etc. (Providing no_GEC to PRP is unappealing.) Some of us would be happy to benchmark the reliability options if/when available.

Way back when gpuowl was announced, Mihai closed [URL]https://www.mersenneforum.org/showpost.php?p=457032&postcount=1[/URL] with[CODE]I'll be looking for problems to fix. I hope you'll enjoy!
cheers,
Mihai[/CODE]Yes, immensely. It seems to me a lively 3 years and great progress to date. Looking forward to what comes next.

kriesel 2020-05-20 21:52

[QUOTE=kriesel;545974]Two gpuowl instances running on the same gpu reportedly helps total testing throughput. But not always. Test what you run. (terribly disparate instances detail... 40.94% of solo throughput.
It's probably best to run same computation type, same fft size, or perhaps very similar size.[/QUOTE]
One vs two instances on a gpu, gpuowl-win v6.11-278, 9M PRP/9M PRP, radeonvii gpu, i7-4790 cpu

1a 1467.4 us/it, 681.48 iter/sec
1b 1465.6 us/it, 682.31 iter/sec
average 681.9 iter/sec
2 together 2884.4 us/it +2876.8 us/it, 346.69 iter/sec + 347.61 iter/sec = 694.30 iter/sec,
2 instances/gpu yielded 1.01819 times the speed of a single instance, i.e. 1.82% faster.
Equivalent to 217.3 GhzD/day. Seems low. Maybe thermally throttling?
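For reference, the arithmetic behind those figures; just a quick recomputation of the us/it-to-iter/sec conversion and the speedup ratio:

```python
# Recomputes the one-vs-two-instance throughput comparison from the
# benchmark numbers above (us/it -> iter/sec, then the speedup ratio).
solo = [1467.4, 1465.6]                     # us/it, two solo runs
duo = [2884.4, 2876.8]                      # us/it, two concurrent runs

solo_avg = sum(1e6 / t for t in solo) / 2   # average solo iter/sec
duo_total = sum(1e6 / t for t in duo)       # combined iter/sec of both

speedup = duo_total / solo_avg
print(round(solo_avg, 1), round(duo_total, 1), round(speedup, 5))
```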

kriesel 2020-05-21 17:40

Gpuowl-win v6.11-288 LL Jacobi test
 
First trial of GpuOwL v6.11-288 LL with Jacobi check, continuation of an existing run.
Happy/relieved to see the initial Jacobi check pass, although this gpu has successfully completed PRP wavefront exponents with no GEC errors.
[CODE]2020-05-20 18:27:57 gpuowl v6.11-288-g20c4213
2020-05-20 18:27:57 config: -user kriesel -cpu asr2/radeonvii2 -d 2 -use NO_ASM -maxAlloc 15000
2020-05-20 18:27:57 device 2, unique id ''
2020-05-20 18:27:57 asr2/radeonvii2 154155713 FFT: 8M 1K:8:512 (18.38 bpw)
2020-05-20 18:27:57 asr2/radeonvii2 Expected maximum carry32: 70B40000
2020-05-20 18:27:59 asr2/radeonvii2 OpenCL args "-DEXP=154155713u -DWIDTH=1024u -DSMALL_HEIGHT=512u -DMIDDLE=8u -DWEIGHT_STEP=0xc.528658a63b438p-3 -DIWEIGHT_STEP=0xa.633a
f6ee9bb58p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DPM1=0 -DAMDGPU=1 -DCARRY64=1 -DMM_CHAIN=2u -DMM2_CHAIN=3u -DNO_ASM=1 -cl-fast
-relaxed-math -cl-std=CL2.0 "
2020-05-20 18:28:09 asr2/radeonvii2 OpenCL compilation in 10.16 s
2020-05-20 18:28:10 asr2/radeonvii2 154155713 LL 60507000 loaded: 223f2e70d392347e
2020-05-20 18:30:29 asr2/radeonvii2 154155713 LL 60600000 39.31%; 1492 us/it; ETA 1d 14:47; 70cdcc47cb7d66f7
2020-05-20 18:32:48 asr2/radeonvii2 154155713 LL 60700000 39.38%; 1397 us/it; ETA 1d 12:16; fa7b795be968fc04
2020-05-20 18:35:08 asr2/radeonvii2 154155713 LL 60800000 39.44%; 1397 us/it; ETA 1d 12:13; 3d74bd4d1e787fca
2020-05-20 18:37:28 asr2/radeonvii2 154155713 LL 60900000 39.51%; 1396 us/it; ETA 1d 12:10; 771738771fa1b18f
2020-05-20 18:39:47 asr2/radeonvii2 154155713 LL 61000000 39.57%; 1396 us/it; ETA 1d 12:08; 663deec0ffbc5691
2020-05-20 18:42:07 asr2/radeonvii2 154155713 LL 61100000 39.64%; 1396 us/it; ETA 1d 12:05; 8b7abb01cae7c971
2020-05-20 18:42:07 asr2/radeonvii2 154155713 OK 61000000 (jacobi == -1)
2020-05-20 18:44:27 asr2/radeonvii2 154155713 LL 61200000 39.70%; 1398 us/it; ETA 1d 12:06; 1d4fad0f5f424457
...
2020-05-21 04:42:25 asr2/radeonvii2 154155713 LL 86900000 56.37%; 1395 us/it; ETA 1d 02:04; b558d286cd9d5735
2020-05-21 04:44:45 asr2/radeonvii2 154155713 LL 87000000 56.44%; 1395 us/it; ETA 1d 02:01; 62c4e005f393521f
2020-05-21 04:47:04 asr2/radeonvii2 154155713 LL 87100000 56.50%; 1394 us/it; ETA 1d 01:58; d31bd8c703cd82e2
2020-05-21 04:49:23 asr2/radeonvii2 154155713 LL 87200000 56.57%; 1395 us/it; ETA 1d 01:57; 4ebde4106568146f
2020-05-21 04:49:23 asr2/radeonvii2 154155713 OK 87000000 (jacobi == -1)
2020-05-21 04:51:43 asr2/radeonvii2 154155713 LL 87300000 56.63%; 1396 us/it; ETA 1d 01:55; deb4718bd97c793e[/CODE]until it hit a repeatable Jacobi error between 87M and 88M:
[CODE]2020-05-21 05:07:59 asr2/radeonvii2 154155713 LL 88000000 57.09%; 1395 us/it; ETA 1d 01:38; 8a332b7fbfafa5f7
2020-05-21 05:10:19 asr2/radeonvii2 154155713 LL 88100000 57.15%; 1394 us/it; ETA 1d 01:35; 76a8832229ab7d36
2020-05-21 05:12:38 asr2/radeonvii2 154155713 LL 88200000 57.21%; 1395 us/it; ETA 1d 01:33; 687d9f07557de7fc
2020-05-21 05:12:38 asr2/radeonvii2 154155713 EE 88000000 (jacobi == 1)
2020-05-21 05:12:39 asr2/radeonvii2 154155713 LL 87000000 loaded: 62c4e005f393521f
2020-05-21 05:14:58 asr2/radeonvii2 154155713 LL 87100000 56.50%; 1394 us/it; ETA 1d 01:58; d31bd8c703cd82e2
2020-05-21 05:17:18 asr2/radeonvii2 154155713 LL 87200000 56.57%; 1394 us/it; ETA 1d 01:56; 4ebde4106568146f
2020-05-21 05:19:37 asr2/radeonvii2 154155713 LL 87300000 56.63%; 1394 us/it; ETA 1d 01:54; deb4718bd97c793e
2020-05-21 05:21:56 asr2/radeonvii2 154155713 LL 87400000 56.70%; 1395 us/it; ETA 1d 01:52; 78c200210067f63a
2020-05-21 05:24:16 asr2/radeonvii2 154155713 LL 87500000 56.76%; 1395 us/it; ETA 1d 01:49; e2e6daf06bc84732
2020-05-21 05:26:35 asr2/radeonvii2 154155713 LL 87600000 56.83%; 1394 us/it; ETA 1d 01:47; 421b198d26a5456f
2020-05-21 05:28:55 asr2/radeonvii2 154155713 LL 87700000 56.89%; 1395 us/it; ETA 1d 01:45; e6f1cee228a832b9
2020-05-21 05:31:14 asr2/radeonvii2 154155713 LL 87800000 56.96%; 1395 us/it; ETA 1d 01:42; 8f22ca3882842947
2020-05-21 05:33:34 asr2/radeonvii2 154155713 LL 87900000 57.02%; 1395 us/it; ETA 1d 01:40; 956c469d245e11f3
2020-05-21 05:35:53 asr2/radeonvii2 154155713 LL 88000000 57.09%; 1395 us/it; ETA 1d 01:38; 8a332b7fbfafa5f7
2020-05-21 05:38:13 asr2/radeonvii2 154155713 LL 88100000 57.15%; 1394 us/it; ETA 1d 01:35; 76a8832229ab7d36
2020-05-21 05:40:32 asr2/radeonvii2 154155713 LL 88200000 57.21%; 1395 us/it; ETA 1d 01:33; 687d9f07557de7fc
2020-05-21 05:40:32 asr2/radeonvii2 154155713 EE 88000000 (jacobi == 1)
2020-05-21 05:40:32 asr2/radeonvii2 154155713 LL 87000000 loaded: 62c4e005f393521f
2020-05-21 05:42:52 asr2/radeonvii2 154155713 LL 87100000 56.50%; 1394 us/it; ETA 1d 01:58; d31bd8c703cd82e2
2020-05-21 05:45:11 asr2/radeonvii2 154155713 LL 87200000 56.57%; 1395 us/it; ETA 1d 01:56; 4ebde4106568146f
2020-05-21 05:47:31 asr2/radeonvii2 154155713 LL 87300000 56.63%; 1394 us/it; ETA 1d 01:54; deb4718bd97c793e
2020-05-21 05:49:50 asr2/radeonvii2 154155713 LL 87400000 56.70%; 1395 us/it; ETA 1d 01:52; 78c200210067f63a
2020-05-21 05:52:10 asr2/radeonvii2 154155713 LL 87500000 56.76%; 1395 us/it; ETA 1d 01:49; e2e6daf06bc84732
2020-05-21 05:54:29 asr2/radeonvii2 154155713 LL 87600000 56.83%; 1395 us/it; ETA 1d 01:47; 421b198d26a5456f
2020-05-21 05:56:49 asr2/radeonvii2 154155713 LL 87700000 56.89%; 1395 us/it; ETA 1d 01:45; e6f1cee228a832b9
2020-05-21 05:59:08 asr2/radeonvii2 154155713 LL 87800000 56.96%; 1395 us/it; ETA 1d 01:43; 8f22ca3882842947
2020-05-21 06:01:28 asr2/radeonvii2 154155713 LL 87900000 57.02%; 1395 us/it; ETA 1d 01:40; 956c469d245e11f3
2020-05-21 06:03:47 asr2/radeonvii2 154155713 LL 88000000 57.09%; 1394 us/it; ETA 1d 01:37; 8a332b7fbfafa5f7
2020-05-21 06:06:06 asr2/radeonvii2 154155713 LL 88100000 57.15%; 1394 us/it; ETA 1d 01:35; 76a8832229ab7d36
2020-05-21 06:08:26 asr2/radeonvii2 154155713 LL 88200000 57.21%; 1395 us/it; ETA 1d 01:33; 687d9f07557de7fc
2020-05-21 06:08:26 asr2/radeonvii2 154155713 EE 88000000 (jacobi == 1)
2020-05-21 06:08:26 asr2/radeonvii2 154155713 LL 87000000 loaded: 62c4e005f393521f
2020-05-21 06:10:46 asr2/radeonvii2 154155713 LL 87100000 56.50%; 1394 us/it; ETA 1d 01:58; d31bd8c703cd82e2
2020-05-21 06:13:05 asr2/radeonvii2 154155713 LL 87200000 56.57%; 1395 us/it; ETA 1d 01:56; 4ebde4106568146f
2020-05-21 06:15:25 asr2/radeonvii2 154155713 LL 87300000 56.63%; 1395 us/it; ETA 1d 01:54; deb4718bd97c793e
2020-05-21 06:17:44 asr2/radeonvii2 154155713 LL 87400000 56.70%; 1394 us/it; ETA 1d 01:51; 78c200210067f63a
2020-05-21 06:20:04 asr2/radeonvii2 154155713 LL 87500000 56.76%; 1395 us/it; ETA 1d 01:49; e2e6daf06bc84732
2020-05-21 06:22:23 asr2/radeonvii2 154155713 LL 87600000 56.83%; 1395 us/it; ETA 1d 01:47; 421b198d26a5456f
2020-05-21 06:24:43 asr2/radeonvii2 154155713 LL 87700000 56.89%; 1395 us/it; ETA 1d 01:45; e6f1cee228a832b9
2020-05-21 06:27:02 asr2/radeonvii2 154155713 LL 87800000 56.96%; 1395 us/it; ETA 1d 01:42; 8f22ca3882842947
2020-05-21 06:29:22 asr2/radeonvii2 154155713 LL 87900000 57.02%; 1395 us/it; ETA 1d 01:40; 956c469d245e11f3
2020-05-21 06:31:41 asr2/radeonvii2 154155713 LL 88000000 57.09%; 1394 us/it; ETA 1d 01:37; 8a332b7fbfafa5f7
2020-05-21 06:34:00 asr2/radeonvii2 154155713 LL 88100000 57.15%; 1394 us/it; ETA 1d 01:35; 76a8832229ab7d36
2020-05-21 06:34:00 asr2/radeonvii2 154155713 EE 88000000 (jacobi == 1)
2020-05-21 06:34:01 asr2/radeonvii2 154155713 LL 87000000 loaded: 62c4e005f393521f
2020-05-21 06:36:20 asr2/radeonvii2 154155713 LL 87100000 56.50%; 1395 us/it; ETA 1d 01:59; d31bd8c703cd82e2
2020-05-21 06:38:40 asr2/radeonvii2 154155713 LL 87200000 56.57%; 1395 us/it; ETA 1d 01:56; 4ebde4106568146f
2020-05-21 06:40:59 asr2/radeonvii2 154155713 LL 87300000 56.63%; 1394 us/it; ETA 1d 01:54; deb4718bd97c793e
2020-05-21 06:43:19 asr2/radeonvii2 154155713 LL 87400000 56.70%; 1395 us/it; ETA 1d 01:52; 78c200210067f63a
2020-05-21 06:45:38 asr2/radeonvii2 154155713 LL 87500000 56.76%; 1395 us/it; ETA 1d 01:50; e2e6daf06bc84732
2020-05-21 06:47:58 asr2/radeonvii2 154155713 LL 87600000 56.83%; 1395 us/it; ETA 1d 01:47; 421b198d26a5456f
2020-05-21 06:50:17 asr2/radeonvii2 154155713 LL 87700000 56.89%; 1395 us/it; ETA 1d 01:45; e6f1cee228a832b9
2020-05-21 06:52:37 asr2/radeonvii2 154155713 LL 87800000 56.96%; 1396 us/it; ETA 1d 01:43; 8f22ca3882842947
2020-05-21 06:54:56 asr2/radeonvii2 154155713 LL 87900000 57.02%; 1395 us/it; ETA 1d 01:41; 956c469d245e11f3
2020-05-21 06:57:16 asr2/radeonvii2 154155713 LL 88000000 57.09%; 1395 us/it; ETA 1d 01:38; 8a332b7fbfafa5f7
2020-05-21 06:59:35 asr2/radeonvii2 154155713 LL 88100000 57.15%; 1395 us/it; ETA 1d 01:35; 76a8832229ab7d36
2020-05-21 06:59:35 asr2/radeonvii2 154155713 EE 88000000 (jacobi == 1)
2020-05-21 06:59:36 asr2/radeonvii2 154155713 LL 87000000 loaded: 62c4e005f393521f
2020-05-21 06:59:43 asr2/radeonvii2 Stopping, please wait..
2020-05-21 06:59:43 asr2/radeonvii2 154155713 LL 87005000 56.44%; 1457 us/it; ETA 1d 03:11; b349261c8446786d
2020-05-21 06:59:43 asr2/radeonvii2 waiting for the Jacobi check to finish..
2020-05-21 07:02:09 asr2/radeonvii2 154155713 OK 87005000 (jacobi == -1)
2020-05-21 07:02:09 asr2/radeonvii2 Exiting because "stop requested"
2020-05-21 07:02:09 asr2/radeonvii2 Bye[/CODE]Reducing the Jacobi interval after a fail or two in a row might be helpful. This particular cpu/gpu combination would need a Jacobi interval of about 200K or above, so that gpu run time exceeds cpu Jacobi-check time and at most one Jacobi check is pending at a time in a continuing run. [CODE]2020-05-21 11:12:22 asr2/radeonvii2 OpenCL compilation in 8.66 s
2020-05-21 11:12:23 asr2/radeonvii2 154155713 LL 87503000 loaded: b5116bfd98e830ad
2020-05-21 11:14:38 asr2/radeonvii2 154155713 LL 87600000 56.83%; 1395 us/it; ETA 1d 01:47; 421b198d26a5456f
2020-05-21 11:16:58 asr2/radeonvii2 154155713 LL 87700000 56.89%; 1396 us/it; ETA 1d 01:46; e6f1cee228a832b9
2020-05-21 11:17:04 asr2/radeonvii2 Stopping, please wait..
2020-05-21 11:17:04 asr2/radeonvii2 154155713 LL 87704000 56.89%; 1477 us/it; ETA 1d 03:16; 79d3368b77502fd3
2020-05-21 11:17:04 asr2/radeonvii2 waiting for the Jacobi check to finish..
2020-05-21 11:19:24 asr2/radeonvii2 154155713 OK 87704000 (jacobi == -1)[/CODE]Manually starting and stopping after 100K or 200K intervals, then smaller intervals after 87.9M, I coaxed it further past 87M. Jacobi checks were ok to 87.973M; not at 88M.[CODE]2020-05-21 12:00:12 gpuowl v6.11-288-g20c4213
2020-05-21 12:00:12 config: -user kriesel -cpu asr2/radeonvii2 -d 2 -use NO_ASM -maxAlloc 15000
2020-05-21 12:00:12 device 2, unique id ''
2020-05-21 12:00:12 asr2/radeonvii2 154155713 FFT: 8M 1K:8:512 (18.38 bpw)
2020-05-21 12:00:12 asr2/radeonvii2 Expected maximum carry32: 70B40000
2020-05-21 12:00:14 asr2/radeonvii2 OpenCL args "-DEXP=154155713u -DWIDTH=1024u -DSMALL_HEIGHT=512u -DMIDDLE=8u -DWEIGHT_STEP=0xc.528658a63b438p-3 -DIWEIGHT_STEP=0xa.633a
f6ee9bb58p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DPM1=0 -DAMDGPU=1 -DCARRY64=1 -DMM_CHAIN=2u -DMM2_CHAIN=3u -DNO_ASM=1 -cl-fast
-relaxed-math -cl-std=CL2.0 "
2020-05-21 12:00:25 asr2/radeonvii2 OpenCL compilation in 11.03 s
2020-05-21 12:00:26 asr2/radeonvii2 154155713 LL 87952000 loaded: 64a16470a62c62a6
2020-05-21 12:00:55 asr2/radeonvii2 Stopping, please wait..
2020-05-21 12:00:56 asr2/radeonvii2 154155713 LL 87973000 57.07%; 1406 us/it; ETA 1d 01:51; 6cdf9822b4a86ada
2020-05-21 12:00:56 asr2/radeonvii2 waiting for the Jacobi check to finish..
2020-05-21 12:03:22 asr2/radeonvii2 154155713 OK 87973000 (jacobi == -1)
2020-05-21 12:03:22 asr2/radeonvii2 Exiting because "stop requested"[/CODE] I saved a copy of the ll file at 87.952M iterations.
The above was with the default fft selected by the program, 8M 1K:8:512, which seemed ambitious at 18.38 bits/word.
Trying different ffts was mostly foiled by the following -fft specs being rejected as too small (0K or 128K), even though the first two were copied and pasted from the help output:
-fft 1K:4:1K
-fft 512:8:1K
-fft +1[CODE]
2020-05-21 12:03:45 gpuowl v6.11-288-g20c4213
2020-05-21 12:03:45 config: -user kriesel -cpu asr2/radeonvii2 -d 2 -use NO_ASM -maxAlloc 15000 -fft 1K:4:1K
2020-05-21 12:03:45 device 2, unique id ''
2020-05-21 12:03:45 asr2/radeonvii2 154155713 FFT: 0M 1K:4K:1K (inf bpw)
2020-05-21 12:03:45 asr2/radeonvii2 FFT size too small for exponent (inf bits/word).
2020-05-21 12:03:45 asr2/radeonvii2 Exiting because "FFT size too small"
2020-05-21 12:03:45 asr2/radeonvii2 Bye

2020-05-21 12:04:54 gpuowl v6.11-288-g20c4213
2020-05-21 12:04:54 config: -user kriesel -cpu asr2/radeonvii2 -d 2 -use NO_ASM -maxAlloc 15000 -fft +1
2020-05-21 12:04:54 device 2, unique id ''
2020-05-21 12:04:54 asr2/radeonvii2 154155713 FFT: 128K 256:1:256 (1176.11 bpw)
2020-05-21 12:04:54 asr2/radeonvii2 FFT size too small for exponent (1176.11 bits/word).
2020-05-21 12:04:54 asr2/radeonvii2 Exiting because "FFT size too small"
2020-05-21 12:04:54 asr2/radeonvii2 Bye

2020-05-21 12:08:51 gpuowl v6.11-288-g20c4213
2020-05-21 12:08:51 config: -user kriesel -cpu asr2/radeonvii2 -d 2 -use NO_ASM -maxAlloc 15000 -fft 512:8:1K
2020-05-21 12:08:51 device 2, unique id ''
2020-05-21 12:08:51 asr2/radeonvii2 154155713 FFT: 0M 512:8K:1K (inf bpw)
2020-05-21 12:08:51 asr2/radeonvii2 FFT size too small for exponent (inf bits/word).
2020-05-21 12:08:51 asr2/radeonvii2 Exiting because "FFT size too small"
2020-05-21 12:08:51 asr2/radeonvii2 Bye[/CODE] -fft 4K:4:256 was accepted and ran. This got it through 88M with a correct Jacobi and a different res64 8d2dd3005435c15e.
I'm continuing the exponent with that last fft choice, to check speed and luck. If that doesn't work I'll try 9M next.

Edit: That didn't take long. [CODE]2020-05-21 12:39:37 gpuowl v6.11-288-g20c4213
2020-05-21 12:39:37 config: -user kriesel -cpu asr2/radeonvii2 -d 2 -use NO_ASM -maxAlloc 15000 -fft 4K:4:256
2020-05-21 12:39:37 device 2, unique id ''
2020-05-21 12:39:37 asr2/radeonvii2 154155713 FFT: 8M 4K:4:256 (18.38 bpw)
2020-05-21 12:39:37 asr2/radeonvii2 Expected maximum carry32: 70B40000
2020-05-21 12:39:39 asr2/radeonvii2 OpenCL args "-DEXP=154155713u -DWIDTH=4096u -DSMALL_HEIGHT=256u -DMIDDLE=4u -DWEIGHT_STEP=0xc.528658a63b438p-3 -DIWEIGHT_STEP=0xa.633a
f6ee9bb58p-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DPM1=0 -DAMDGPU=1 -DCARRY64=1 -DMM_CHAIN=1u -DMM2_CHAIN=2u -DNO_ASM=1 -cl-fast
-relaxed-math -cl-std=CL2.0 "
2020-05-21 12:39:46 asr2/radeonvii2 OpenCL compilation in 6.65 s
2020-05-21 12:39:47 asr2/radeonvii2 154155713 LL 88000000 loaded: 8d2dd3005435c15e
2020-05-21 12:42:32 asr2/radeonvii2 154155713 LL 88100000 57.15%; 1654 us/it; ETA 1d 06:21; a1378029fe3d2180
2020-05-21 12:45:18 asr2/radeonvii2 154155713 LL 88200000 57.21%; 1655 us/it; ETA 1d 06:19; cdbb9de9a9320c46
2020-05-21 12:47:03 asr2/radeonvii2 Stopping, please wait..
2020-05-21 12:47:04 asr2/radeonvii2 154155713 LL 88264000 57.26%; 1656 us/it; ETA 1d 06:19; 30218bc01cbfa191
2020-05-21 12:47:04 asr2/radeonvii2 waiting for the Jacobi check to finish..
2020-05-21 12:49:25 asr2/radeonvii2 154155713 EE 88264000 (jacobi == 1)[/CODE]Trying 9M; different res64 at 88.1M and onward, and only ~3% slower than the default 8M fft spec.
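As a side note on the W:M:H specs above: the sizes and bits/word gpuowl prints are consistent with fft size = 2 * width * middle * height. That relation is my observation from these logs, not a statement of gpuowl internals; a quick Python check:

```python
# Relates W:M:H fft specs to the sizes and bits/word printed in the logs
# above, under the observed relation fft_size = 2 * width * middle * height.
def fft_size(spec):
    def val(s):
        return int(s[:-1]) * 1024 if s.endswith("K") else int(s)
    w, m, h = (val(part) for part in spec.split(":"))
    return 2 * w * m * h

exponent = 154155713
for spec in ("1K:8:512", "4K:4:256"):       # both report "8M ... 18.38 bpw"
    n = fft_size(spec)
    print(spec, n // (1024 * 1024), "M", round(exponent / n, 2), "bpw")
```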

Prime95 2020-05-21 19:18

Ken,

Judging from these flags -DMM_CHAIN=2u -DMM2_CHAIN=3u the exponent is near the upper limits of that FFT length. Our gpuowl defaults may be too aggressive.

Assuming you saved the problematic files:
Can you try getting past the bad iteration with "ULTRA_TRIG=1"?
Can you get past with "MM_CHAIN=3"?

ATH 2020-05-21 20:14

[QUOTE=kriesel;545970]I agree that PRP first test with the excellent GEC is preferable over LL with Jacobi check and its 50% error detection probability or LL without Jacobi as in CUDALucas and most gpuowl LL-supporting versions.
But the realities are that a great deal of LL is still being done by various programs.
[/QUOTE]

I checked today, May 21st, around 06:00 UTC; here are the LL and PRP numbers:


[CODE] LL PRP LL-DC PRP-DC
50000000 735 0 19033 0
51000000 226 0 19389 0
52000000 2685 0 16974 0
53000000 7196 0 12482 0
54000000 13839 0 5648 0
55000000 15560 0 4097 0
56000000 16364 0 3419 0
57000000 17324 0 2347 0
58000000 18143 0 1645 0
59000000 19212 0 504 0
60000000 19259 0 410 1
61000000 19165 0 433 0
62000000 19083 0 434 0
63000000 19328 0 396 0
64000000 19138 0 402 0
65000000 19075 0 404 0
66000000 19163 0 330 0
67000000 18767 0 627 0
68000000 18998 0 570 0
69000000 18584 0 744 0
70000000 18870 0 622 2
71000000 18975 0 423 0
72000000 18633 0 661 0
73000000 18528 0 985 0
74000000 18535 0 789 1
75000000 18695 0 771 91
76000000 18499 0 545 269
77000000 18303 3 655 233
78000000 18461 3 540 184
79000000 18374 12 584 206
80000000 18531 90 576 207
81000000 18532 49 697 257
82000000 18326 153 508 311
83000000 18184 646 415 95
84000000 17352 1395 376 17
85000000 16062 2496 474 48
86000000 14688 3901 439 55
87000000 13157 5706 324 26
88000000 11390 7422 318 23
89000000 9821 9002 262 34
90000000 5765 12929 223 20
91000000 8306 9079 104 26
92000000 2032 16016 3 6
93000000 4834 13058 4 12
94000000 1576 16983 3 35
95000000 890 17133 2 2
96000000 246 1486 8 8
97000000 244 176 3 0
98000000 2081 4042 19 1
99000000 2446 4488 17 2
100000000 505 842 16 1
101000000 495 506 2 1
102000000 1003 684 8 2
103000000 766 701 6 1
104000000 221 127 1 0
105000000 395 243 2 1
106000000 963 473 2 1
107000000 707 462 2 1
108000000 11 4 1 1
109000000 16 1 1 0
110000000 15 1 1 0
[/CODE]

kriesel 2020-05-21 21:02

[QUOTE=Prime95;546145]Ken,

Judging from these flags -DMM_CHAIN=2u -DMM2_CHAIN=3u the exponent is near the upper limits of that FFT length. Our gpuowl defaults may be too aggressive.

Assuming you saved the problematic files:
Can you try getting past the bad iteration with "ULTRA_TRIG=1"?
Can you get past with "MM_CHAIN=3"?[/QUOTE]
Will do. I also have an interesting case at 94.95M 5M fft on an RX550, 18.11 bits/word, that has been problematic in both v6.11-257 and -288.

preda 2020-05-21 23:22

[QUOTE=kriesel;546132]
Trying different fft was mostly foiled by the following -fft specs not being accepted, called too small, 0K or 128K, even though the first two were copied and pasted from the help output:
-fft 1K:4:1K
-fft 512:8:1K
-fft +1[/QUOTE]

Ken, thanks for reporting this FFT-spec parsing bug, should be fixed now.

-fft +1 does not work anymore -- there is no format for "relative to the default" anymore. So -fft +1 is the same as -fft 1, which is an invalid FFT size.

I plan to reduce the default Jacobi step to 500K. It's safe to set it manually to a smaller value, it won't start a new background check if there is one ongoing.

kriesel 2020-05-22 03:03

[QUOTE=ATH;546150]I checked today May 21st around 06:00 UTC, here are the LL and PRP numbers:[/QUOTE]So, totals,
[CODE] 677277 130312 101680 2181
83.86% 16.14% 97.90% 2.10%
LL PRP LL-DC PRP-DC[/CODE]Not enough PRP-DC for statistics yet.
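The percentage split above can be reproduced from the column totals; a quick recomputation, nothing more:

```python
# Checks the LL/PRP percentage split quoted above from the column totals.
ll, prp = 677277, 130312
ll_dc, prp_dc = 101680, 2181

first = ll + prp          # first-time tests
dc = ll_dc + prp_dc       # double-checks
print(f"{100 * ll / first:.2f}% {100 * prp / first:.2f}%")
print(f"{100 * ll_dc / dc:.2f}% {100 * prp_dc / dc:.2f}%")
```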

preda 2020-05-22 03:20

[QUOTE=kriesel;546172]So, totals,
[CODE] 677277 130312 101680 2181
83.86% 16.14% 97.90% 2.10%
LL PRP LL-DC PRP-DC[/CODE]Not enough PRP-DC for statistics yet.[/QUOTE]

I count about 3000 PRP-DC, that's enough for some statistics. What is the error count in those PRP-DC?

kriesel 2020-05-22 03:36

[QUOTE=Prime95;546145]
Can you try getting past the bad iteration with "ULTRA_TRIG=1"?
Can you get past with "MM_CHAIN=3"?[/QUOTE]
MM_CHAIN=3 not tried; ULTRA_TRIG handled both 154M stops seen earlier for 8M fft.

[CODE]2020-05-21 22:16:45 gpuowl v6.11-288-g20c4213
2020-05-21 22:16:45 config: -user kriesel -cpu asr2/radeonvii2 -d 2 -use NO_ASM,ULTRA_TRIG=1 -maxAlloc 15000
2020-05-21 22:16:45 device 2, unique id ''
2020-05-21 22:16:45 asr2/radeonvii2 154155713 FFT: 8M 1K:8:512 (18.38 bpw)
2020-05-21 22:16:45 asr2/radeonvii2 Expected maximum carry32: 70B40000
2020-05-21 22:16:49 asr2/radeonvii2 OpenCL args "-DEXP=154155713u -DWIDTH=1024u -DSMALL_HEIGHT=512u -DMIDDLE=8u -DWEIGHT_STEP=0xc.528658a63b438p-3 -DIWEIGHT_STEP=0xa.633a
f6ee9bb58p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DPM1=0 -DAMDGPU=1 -DCARRY64=1 -DMM_CHAIN=2u -DMM2_CHAIN=3u -DNO_ASM=1 -DULTRA_T
RIG=1 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-05-21 22:16:59 asr2/radeonvii2 OpenCL compilation in 9.73 s
2020-05-21 22:17:00 asr2/radeonvii2 154155713 LL 87952000 loaded: 64a16470a62c62a6
2020-05-21 22:18:07 asr2/radeonvii2 154155713 LL 88000000 57.09%; 1404 us/it; ETA 1d 01:48; 503baed50ef43276
2020-05-21 22:21:34 asr2/radeonvii2 154155713 LL 88100000 57.15%; [U]2065[/U] us/it; ETA 1d 13:53; 109820c89c99e3ce
2020-05-21 22:21:34 asr2/radeonvii2 154155713 OK 88000000 (jacobi == -1)
2020-05-21 22:23:54 asr2/radeonvii2 154155713 LL 88200000 57.21%; 1403 us/it; ETA 1d 01:42; 6669371a0075ba93
2020-05-21 22:26:14 asr2/radeonvii2 154155713 LL 88300000 57.28%; 1402 us/it; ETA 1d 01:38; 4245f339fae6d84a
2020-05-21 22:27:13 asr2/radeonvii2 Stopping, please wait..
2020-05-21 22:27:13 asr2/radeonvii2 154155713 LL 88342000 57.31%; 1406 us/it; ETA 1d 01:42; ad659b2e1c5725b0
2020-05-21 22:27:13 asr2/radeonvii2 waiting for the Jacobi check to finish..
2020-05-21 22:29:29 asr2/radeonvii2 154155713 OK 88342000 (jacobi == -1)
2020-05-21 22:29:29 asr2/radeonvii2 Exiting because "stop requested"
2020-05-21 22:29:29 asr2/radeonvii2 Bye[/CODE]The underlined timing is anomalous and due to my scrolling the console window.

ATH 2020-05-22 03:49

I can find 19 "PRP Bad" results, which is a lot more than I thought, but they come from only 4 different users:

[M]77230663[/M]
[M]78410041[/M]
[M]79062083[/M]
[M]79075979[/M]
[M]79078529[/M]
[M]79087627[/M]
[M]79109357[/M]
[M]79143199[/M]
[M]81062081[/M]
[M]81069221[/M]
[M]81085727[/M]
[M]81705373[/M]
[M]82054421[/M]
[M]82561769[/M]
[M]82821223[/M]
[M]83018863[/M]
[M]83404301[/M]
[M]83509891[/M]
[M]86670349[/M]


First-time PRP tests are called "[B]PRP Unverified (Reliable)[/B]"; anyone know what they are called if they are not reliable? "PRP Unverified (Unreliable)"? "PRP Unverified (Suspect)"? I cannot find any version other than Reliable for the first-time PRP tests.


Edit: I found "[B]PRP Unverified[/B]" without any parentheses; does that mean they are not reliable? I hope not, because there are thousands of those. I guess it is an older way to write the result?

kriesel 2020-05-22 04:14

[QUOTE=preda;546174]I count about 3000 PRP-DC, that's enough for some statistics. What is the error count in those PRP-DC?[/QUOTE]
Totals I posted are from a spreadsheet summation of what ATH posted, 50M thru 110M exponents. Copy/paste; create sums. Copy/paste results to thread.

19 / 2181 from ATH's posts is a much higher PRP error rate than expected.

This one was reported with an error count of 20 from a flaky Radeon VII running gpuowl: [M]655685803[/M], and it's called "Unverified (Reliable)".

preda 2020-05-22 07:26

[QUOTE=ATH;546176]I can find 19 "PRP Bad" results, which is a lot more than I thought, but they come from only 4 different users:

[M]77230663[/M]
[M]78410041[/M]
[M]79062083[/M]
[M]79075979[/M]
[M]79078529[/M]
[M]79087627[/M]
[M]79109357[/M]
[M]79143199[/M]
[M]81062081[/M]
[M]81069221[/M]
[M]81085727[/M]
[M]81705373[/M]
[M]82054421[/M]
[M]82561769[/M]
[M]82821223[/M]
[M]83018863[/M]
[M]83404301[/M]
[M]83509891[/M]
[M]86670349[/M]


First-time PRP tests are called "[B]PRP Unverified (Reliable)[/B]"; anyone know what they are called if they are not reliable? "PRP Unverified (Unreliable)"? "PRP Unverified (Suspect)"? I cannot find any version other than Reliable for the first-time PRP tests.


Edit: I found "[B]PRP Unverified[/B]" without any parentheses; does that mean they are not reliable? I hope not, because there are thousands of those. I guess it is an older way to write the result?[/QUOTE]

Thanks for the list!

It would be useful to have the software & version for all of the bad PRP results (it would be good for the software to be displayed directly in the "exponent status" table). Right now I know that when offset!=0 the software is most likely mprime, for sure not gpuowl. When offset==0 I don't know -- was there an early version of mprime/PRP that produced offset==0? (otherwise it's gpuowl).

It seems to me that all bad results with offset!=0 originate from a single user, "Sir Rutherford J. Pinkerton III" (many of them). This raises the questions: did mprime generate the bad results, and which version? Do we understand why it happened (i.e. was there a bug affecting that version)? Why are all of them from this single user (if it was a bug, why was nobody else affected)?

For the bad results with offset==0, I would like to know the software and version. Was it gpuowl producing those? that would be a bit of a surprise to me, but better to know if there's a bug vs. blissful ignorance.

Bad results with non-zero offset, *all* originating from Sir Rutherford J. Pinkerton III
[QUOTE]
86670349 2019-01-08
79143199 2019-01-24
79109357 2019-01-19
79087627 2019-01-18
79078529 2019-01-16
79075979 2019-01-15
79062083 2019-01-09
78410041 2019-02-09
[/QUOTE]

Bad results with zero offset:
[QUOTE]
83509891 2018-04-16 Milwizzle
83404301 2018-04-03 Milwizzle
83018863 2018-03-14 Milwizzle
82821223 2018-03-05 Milwizzle
82561769 2018-02-23 Milwizzle
82054421 2018-02-05 Milwizzle
81705373 2018-05-12 Milwizzle
81085727 2018-08-22 George Woltman
81069221 2018-08-20 George Woltman
81062081 2018-08-20 George Woltman
77230663 2019-09-29 nokno
[/QUOTE]

R. Gerbicz 2020-05-22 07:28

[QUOTE=ATH;546176]I can find 19 "PRP Bad" which is a lot more than I thought, but it is only 4 different users:
[/QUOTE]

Look also at the bad(!) residues' dates; as far as I can see, only one of them came in March 2019 or later: [url]https://www.mersenne.org/report_exponent/?exp_lo=77230663&full=1[/url] .

And what happened in February: [url]https://mersenneforum.org/showthread.php?p=508163#post508163[/url]
With a proper implementation of the error check covering all iterations, you should never see an error.
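The check being referred to is the Gerbicz error check; here is my toy-scale sketch of it (illustrative only, not mprime's or gpuowl's implementation): keep a running product d of every B-th residue; each new d must satisfy d == d_old^(2^B) * 3 (mod N), and any single miscomputed squaring in a block breaks that identity with overwhelming probability.

```python
# Minimal Gerbicz-check sketch for a base-3 PRP test mod a small
# Mersenne number. d is the product of the checkpoint residues u_{iB};
# since d_old^(2^B) = d_new / u_0, the invariant below must hold.
p, B = 31, 8                 # toy exponent and checkpoint spacing
N = (1 << p) - 1
u = 3                        # PRP residue, u_{i+1} = u_i^2 mod N
d = u                        # running product of checkpoint residues
for block in range(10):
    d_prev = d
    for _ in range(B):
        u = u * u % N        # B plain squarings
    d = d_prev * u % N       # fold the new checkpoint into the product
    # Any single error inside the block breaks this congruence.
    assert d == pow(d_prev, 1 << B, N) * 3 % N
print(u)
```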

PhilF 2020-05-22 14:11

[QUOTE=ATH;546176]Edit: I found "[B]PRP Unverified[/B]" without any parentheses, does that mean they are not reliable? I hope not because there are thousands of those, I guess it is an older way to write the result?[/QUOTE]

I have noticed my older CPU-based PRP tests are listed as (Reliable). But my more recent GPU-based PRP tests (using gpuOwl and a shift of 0), submitted manually, are simply listed as Unverified.

kriesel 2020-05-22 14:26

Any ideas what's up with this one? It used to be fine in 6.11-257, 5M fft[CODE]
2020-04-21 16:06:06 roa/rx550 94955227 OK 45000000 47.39%; 14196 us/it; ETA 8d 04:59; cf193f810b74508f (check 5.90s)
2020-04-21 16:53:31 roa/rx550 94955227 OK 45200000 47.60%; 14195 us/it; ETA 8d 04:11; 5e5d2f79fcd4c510 (check 5.90s)
2020-04-21 17:40:56 roa/rx550 94955227 OK 45400000 47.81%; 14195 us/it; ETA 8d 03:24; d5fd846bbdc9b33b (check 5.92s)
2020-04-21 18:28:21 roa/rx550 94955227 OK 45600000 48.02%; 14195 us/it; ETA 8d 02:37; 6ce4060b3179317f (check 5.90s)
2020-04-21 19:15:46 roa/rx550 94955227 OK 45800000 48.23%; 14195 us/it; ETA 8d 01:49; ae609bd1a5f3fea0 (check 5.92s)
2020-04-21 20:03:11 roa/rx550 94955227 OK 46000000 48.44%; 14195 us/it; ETA 8d 01:02; cc6f06c61df1792d (check 5.92s)
2020-04-21 20:50:36 roa/rx550 94955227 OK 46200000 48.65%; 14195 us/it; ETA 8d 00:15; bab3f41612444053 (check 5.90s)
2020-04-21 21:38:00 roa/rx550 94955227 OK 46400000 48.86%; 14194 us/it; ETA 7d 23:27; f02f6569aecceee9 (check 5.90s)
2020-04-21 22:25:25 roa/rx550 94955227 OK 46600000 49.08%; 14195 us/it; ETA 7d 22:41; 28ae28232765506f (check 5.91s)
2020-04-21 23:12:50 roa/rx550 94955227 OK 46800000 49.29%; 14194 us/it; ETA 7d 21:52; d9dd697f81777087 (check 5.92s)
2020-04-22 00:00:15 roa/rx550 94955227 OK 47000000 49.50%; 14195 us/it; ETA 7d 21:06; c89bd4f70a72fa7a (check 5.93s)
[/CODE]Maybe the RX550 2GB gpu got damaged somewhat? The system went back to the seller after it failed to boot on April 22; two motherboard VRMs and the cpu were bad. Upon return after repair, still on v6.11-257:[CODE]2020-05-18 20:20:59 roa/rx550 94955227 FFT: 5M 1K:10:256 (18.11 bpw)
2020-05-18 20:20:59 roa/rx550 Expected maximum carry32: 48210000
2020-05-18 20:21:01 roa/rx550 OpenCL args "-DEXP=94955227u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xe.cff594044f17p-3 -DIWEIGHT_STEP=0x8.a4359b7df0158p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DPM1=0 -DAMDGPU=1 -DNO_ASM=1 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-05-18 20:21:12 roa/rx550 OpenCL compilation in 11.02 s
2020-05-18 20:21:20 roa/rx550 94955227 OK 47000000 loaded: blockSize 400, c89bd4f70a72fa7a
2020-05-18 20:21:37 roa/rx550 94955227 OK 47000800 49.50%; 14000 us/it; ETA 7d 18:29; dc7a27da26e7abc8 (check 5.83s)
2020-05-18 20:36:09 roa/rx550 Stopping, please wait..
2020-05-18 20:36:20 roa/rx550 94955227 EE 47063200 49.56%; 14070 us/it; ETA 7d 19:11; d7d7ac6572a3fc43 (check 5.83s)
2020-05-18 20:36:27 roa/rx550 94955227 OK 47000800 loaded: blockSize 400, dc7a27da26e7abc8
2020-05-18 20:36:27 roa/rx550 Exiting because "stop requested"
2020-05-18 20:36:27 roa/rx550 Bye
2020-05-18 20:39:01 config: -device 0 -user kriesel -cpu roa/rx550 -use NO_ASM
2020-05-18 20:39:01 config:
2020-05-18 20:39:01 config: ;UNROLL_HEIGHT,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT2,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_MIDDLE,CARRY32,ORIGINAL_METHOD,LESS_ACCURATE
2020-05-18 20:39:01 device 0, unique id ''
2020-05-18 20:39:01 roa/rx550 94955227 FFT: 5M 1K:10:256 (18.11 bpw)
2020-05-18 20:39:01 roa/rx550 Expected maximum carry32: 48210000
2020-05-18 20:39:03 roa/rx550 OpenCL args "-DEXP=94955227u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xe.cff594044f17p-3 -DIWEIGHT_STEP=0x8.a4359b7df0158p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DPM1=0 -DAMDGPU=1 -DNO_ASM=1 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-05-18 20:39:13 roa/rx550 OpenCL compilation in 9.63 s
2020-05-18 20:39:20 roa/rx550 94955227 EE 47000800 loaded: blockSize 400, 45c7eadb491ccd1d (expected dc7a27da26e7abc8)
2020-05-18 20:39:20 roa/rx550 Exiting because "error on load"
2020-05-18 20:39:20 roa/rx550 Bye[/CODE]Updated to 6.11-288, tried a larger fft and ULTRA_TRIG=1; still trouble:[CODE]
2020-05-21 21:44:18 gpuowl v6.11-288-g20c4213
2020-05-21 21:44:18 config: -d 1 -user kriesel -cpu roa/rx550 -use NO_ASM,ULTRA_TRIG=1 -maxAlloc 1500 -fft 1K:11:256
2020-05-21 21:44:18 device 1, unique id ''
2020-05-21 21:44:18 roa/rx550 94955227 FFT: 5.50M 1K:11:256 (16.46 bpw)
2020-05-21 21:44:18 roa/rx550 Expected maximum carry32: 184C0000
2020-05-21 21:44:20 roa/rx550 OpenCL args "-DEXP=94955227u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0xb.97dc04c1d123p-3 -DIWEIGHT_STEP=0xb.0a7bf824a91fp-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DPM1=0 -DAMDGPU=1 -DNO_ASM=1 -DULTRA_TRIG=1 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-05-21 21:44:29 roa/rx550 OpenCL compilation in 8.65 s
2020-05-21 21:44:37 roa/rx550 94955227 OK 51550000 loaded: blockSize 400, 9c860c40de33a4cf
2020-05-21 21:44:58 roa/rx550 94955227 OK 51550800 54.29%; 17109 us/it; ETA 8d 14:17; 0ce53d8d9ae59540 (check 7.07s) 24 errors
2020-05-21 21:59:05 roa/rx550 94955227 EE 51600000 54.34%; 17081 us/it; ETA 8d 13:42; f789a22e21e9407b (check 7.07s) 24 errors
2020-05-21 21:59:13 roa/rx550 94955227 OK 51550800 loaded: blockSize 400, 0ce53d8d9ae59540
2020-05-21 22:13:39 roa/rx550 94955227 OK 51600000 54.34%; 17458 us/it; ETA 8d 18:15; f789a22e21e9407b (check 7.07s) 25 errors
2020-05-21 22:28:00 roa/rx550 94955227 OK 51650000 54.39%; 17071 us/it; ETA 8d 13:21; 843af679abee43e9 (check 7.07s) 25 errors
2020-05-21 22:42:20 roa/rx550 94955227 OK 51700000 54.45%; 17071 us/it; ETA 8d 13:07; 5fe9ccd16267b99b (check 7.08s) 25 errors
2020-05-21 22:56:41 roa/rx550 94955227 OK 51750000 54.50%; 17071 us/it; ETA 8d 12:52; 208e56e41e0a40ee (check 7.09s) 25 errors
2020-05-21 23:11:02 roa/rx550 94955227 EE 51800000 54.55%; 17083 us/it; ETA 8d 12:47; 670a56bd99ed976f (check 7.05s) 25 errors
2020-05-21 23:11:10 roa/rx550 94955227 OK 51750000 loaded: blockSize 400, 208e56e41e0a40ee
2020-05-21 23:25:31 roa/rx550 94955227 EE 51800000 54.55%; 17072 us/it; ETA 8d 12:39; 670a56bd99ed976f (check 7.07s) 26 errors
2020-05-21 23:25:38 roa/rx550 94955227 OK 51750000 loaded: blockSize 400, 208e56e41e0a40ee
2020-05-21 23:39:59 roa/rx550 94955227 OK 51800000 54.55%; 17071 us/it; ETA 8d 12:39; 670a56bd99ed976f (check 7.10s) 27 errors
2020-05-21 23:54:20 roa/rx550 94955227 EE 51850000 54.60%; 17071 us/it; ETA 8d 12:24; a65c3c95ed2aec74 (check 7.06s) 27 errors
2020-05-21 23:54:27 roa/rx550 94955227 OK 51800000 loaded: blockSize 400, 670a56bd99ed976f
2020-05-22 00:08:48 roa/rx550 94955227 OK 51850000 54.60%; 17072 us/it; ETA 8d 12:25; a65c3c95ed2aec74 (check 7.07s) 28 errors
2020-05-22 00:23:09 roa/rx550 94955227 OK 51900000 54.66%; 17083 us/it; ETA 8d 12:19; 6b998180a02fdb4e (check 7.07s) 28 errors
2020-05-22 00:37:30 roa/rx550 94955227 OK 51950000 54.71%; 17071 us/it; ETA 8d 11:56; 156ed12844fefe6e (check 7.08s) 28 errors
2020-05-22 00:51:50 roa/rx550 94955227 OK 52000000 54.76%; 17071 us/it; ETA 8d 11:41; 53dfaf72c1f11375 (check 7.07s) 28 errors
2020-05-22 01:06:11 roa/rx550 94955227 EE 52050000 54.82%; 17071 us/it; ETA 8d 11:27; 5244891010efead5 (check 7.05s) 28 errors
2020-05-22 01:06:19 roa/rx550 94955227 OK 52000000 loaded: blockSize 400, 53dfaf72c1f11375
2020-05-22 01:20:39 roa/rx550 94955227 EE 52050000 54.82%; 17072 us/it; ETA 8d 11:28; 5244891010efead5 (check 7.05s) 29 errors
2020-05-22 01:20:47 roa/rx550 94955227 OK 52000000 loaded: blockSize 400, 53dfaf72c1f11375
2020-05-22 01:35:08 roa/rx550 94955227 OK 52050000 54.82%; 17072 us/it; ETA 8d 11:28; 5244891010efead5 (check 7.09s) 30 errors
2020-05-22 01:49:28 roa/rx550 94955227 EE 52100000 54.87%; 17071 us/it; ETA 8d 11:13; 9fa1556aabca1139 (check 7.05s) 30 errors
2020-05-22 01:49:36 roa/rx550 94955227 OK 52050000 loaded: blockSize 400, 5244891010efead5
2020-05-22 02:03:57 roa/rx550 94955227 OK 52100000 54.87%; 17071 us/it; ETA 8d 11:13; 9fa1556aabca1139 (check 7.09s) 31 errors
2020-05-22 02:18:18 roa/rx550 94955227 OK 52150000 54.92%; 17071 us/it; ETA 8d 10:59; 124a2bbd1257aa91 (check 7.07s) 31 errors
2020-05-22 02:32:40 roa/rx550 94955227 EE 52200000 54.97%; 17071 us/it; ETA 8d 10:45; af8a0d4f2c184d08 (check 7.05s) 31 errors
2020-05-22 02:32:48 roa/rx550 94955227 EE 52150000 loaded: blockSize 400, 10993d18dd3e57da (expected 124a2bbd1257aa91)
2020-05-22 02:32:48 roa/rx550 Exiting because "error on load"
2020-05-22 02:32:48 roa/rx550 Bye[/CODE]

kriesel 2020-05-22 15:47

gpuowl-win v6.11-292-gecab9ae build
 
2 Attachment(s)
Here it is, in the usual form.

ATH 2020-05-22 16:50

[QUOTE=PhilF;546205]I have noticed my older CPU-based PRP tests are listed as (Reliable). But my more recent GPU-based PRP tests (using gpuOwl and a shift of 0) and submitted manually, are simply listed as Unverified.[/QUOTE]

Of course, the manual submissions do not carry enough information to tell whether the results are reliable or not. Thanks.

Prime95 2020-05-22 17:27

[QUOTE=ATH;546219]Of course, the manual submissions do not carry enough information to tell whether the results are reliable or not. Thanks.[/QUOTE]

The "(reliable)" indicates Gerbicz checking was used.

Issues:
1) The first prime95 Gerbicz implementation had a bug which could miss an error.
2) Gpuowl JSON was not reporting "Gerbicz:0" in some versions a few months ago.
3) If you report manually with the screen output rather than the results.json.txt file then the web page will not know that prime95 used Gerbicz checking.

Summary: The PRPDC data needs to be looked at very carefully. I'll look at it myself later today.
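As a purely illustrative sketch of issues 2 and 3 above: a server-side filter could mark a PRP result "reliable" only when the submitted JSON explicitly records zero Gerbicz errors. The field names below are inferred from the "Gerbicz:0" mention and are otherwise assumptions, not Primenet's actual schema:

```python
import json

def is_reliable(result_line: str) -> bool:
    """Illustrative sketch only: treat a PRP result as 'reliable' when the
    submitted JSON explicitly records that Gerbicz checking ran with zero
    errors. Field names are assumptions based on the discussion above."""
    try:
        r = json.loads(result_line)
    except json.JSONDecodeError:
        return False  # screen-output paste, not JSON: reliability unknown
    errs = r.get("errors", {})
    # A missing key means the client did not report it (issue 2 above).
    return errs.get("gerbicz") == 0

print(is_reliable('{"worktype":"PRP-3","errors":{"gerbicz":0}}'))  # True
print(is_reliable("M57234283 is not prime."))                      # False
```

A paste of screen output (issue 3) fails the JSON parse outright, so it can never be marked reliable under this rule.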

Prime95 2020-05-22 17:36

[QUOTE=kriesel;546206]Update to 6.11-288, tried larger fft, ULTRA_TRIG=1, still trouble:[/QUOTE]

ULTRA_TRIG=1 and MIDDLE=11 are a problem -- needs some research on my part. That's why ULTRA_TRIG is not used anywhere by default.

We are looking at changing the FFT crossovers to be more conservative. This is made a bit more difficult by a ROCm optimizer bug that sometimes corrupts the output of our sin/cos routines.

FYI, there are 6 crossover points within each FFT length, as gpuowl cycles through several progressively more accurate implementations of some code paths (controlled by the MM_CHAIN and MM2_CHAIN variables).
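Those per-length sub-crossovers could be pictured like this. The exponent cutoffs and chain values below are invented, not gpuowl's real tables; the sketch only shows how an exponent near the top of an FFT length lands on an aggressive pair like the -DMM_CHAIN=2u -DMM2_CHAIN=3u quoted earlier in the thread:

```python
# Illustrative sketch only: pick an (MM_CHAIN, MM2_CHAIN) accuracy variant
# for an exponent within one FFT length. The six sub-crossovers mentioned
# above are modeled as an ascending list; all numbers here are invented.
FFT_5M_CROSSOVERS = [
    # (max_exponent, MM_CHAIN, MM2_CHAIN) -- least to most accurate
    (89_000_000, 0, 0),
    (91_000_000, 1, 1),
    (92_500_000, 1, 2),
    (94_000_000, 2, 2),
    (95_500_000, 2, 3),
    (97_000_000, 3, 3),
]

def pick_variant(exponent):
    """Return the cheapest variant whose sub-crossover covers the exponent,
    or None if the exponent needs a larger FFT length."""
    for max_exp, mm, mm2 in FFT_5M_CROSSOVERS:
        if exponent <= max_exp:
            return {"MM_CHAIN": mm, "MM2_CHAIN": mm2}
    return None

print(pick_variant(94_955_227))  # {'MM_CHAIN': 2, 'MM2_CHAIN': 3}
```

Making the crossovers more conservative amounts to lowering the max_exponent entries, so borderline exponents either get a more accurate variant or get bumped to the next FFT length.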

We'll make it better soon.

kriesel 2020-05-22 18:11

[QUOTE=ATH;546219]Of course, the manual submissions do not carry enough information to tell whether the results are reliable or not. Thanks.[/QUOTE]
Current versions of gpuowl report the number of GEC error detections; some earlier versions did not.

ewmayer 2020-05-22 19:35

[QUOTE=Prime95;546145]Judging from these flags -DMM_CHAIN=2u -DMM2_CHAIN=3u the exponent is near the upper limits of that FFT length. Our gpuowl defaults may be too aggressive.

Assuming you saved the problematic files:
Can you try getting past the bad iteration with "ULTRA_TRIG=1"?
Can you get past it with "MM_CHAIN=3"?[/QUOTE]

George & Mihai, would it possible to reduce the odds of the program simply quitting even in the presence of such errors (whether caused by the ROCm optimizer or overaggressive breakpoints, or hardware glitchage) by using the nearness of the expo to the default max-p for the given FFT length via something like the following, which is more or less the way I do things in Mlucas? Note that even in the absence of per-iter-and-per-convolution-output-datum ROE checking, we can use the bits-per-word to make a reasonable inference as to whether excess ROE might be the cause of a J-check or G-check failure.

[1] If given expo is not running at max-accuracy-needed settings for the FFT length in question but is near the default crossover-to-next-more-accurate MM_CHAIN/MM2_CHAIN settings, switch to the next-more accurate such settings;

[2] If given expo is already running at max-accuracy-needed settings for the FFT length in question, reset MM_CHAIN/MM2_CHAIN settings to least-accurate mode and switch run to next-larger FFT length;

[3] If the run is already at a larger-than-default FFT length as a result of [2], assume either the user's hardware is faulty or some kind of transient data glitches are occurring, and switch *back* to the default FFT length and MM_CHAIN/MM2_CHAIN settings.
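Steps [1]-[3] amount to a small escalation state machine. The sketch below is an assumption-laden illustration, not anything gpuowl or Mlucas implements as written: "chain" stands in for the MM_CHAIN/MM2_CHAIN accuracy level, and the FFT-doubling in step [2] is a simplification of the real length ladder.

```python
def next_settings(state):
    """One escalation step implementing [1]-[3] above. 'chain' stands in
    for the MM_CHAIN/MM2_CHAIN accuracy level (0 = least accurate); the
    FFT doubling in step [2] simplifies the real FFT-length ladder."""
    MAX_CHAIN = 3  # assumed most-accurate setting
    s = dict(state)
    if s["escalated_fft"]:
        # [3] Already at a larger-than-default FFT: suspect HW/transients,
        # fall back to the defaults and keep running for diagnostic logs.
        s.update(fft=s["default_fft"], chain=s["default_chain"],
                 escalated_fft=False)
    elif s["chain"] < MAX_CHAIN:
        # [1] Not yet at max accuracy for this FFT length: step up.
        s["chain"] += 1
    else:
        # [2] Already at max accuracy: reset accuracy, next FFT length.
        s.update(chain=0, fft=s["fft"] * 2, escalated_fft=True)
    return s

state = {"fft": 5 * 2**20, "default_fft": 5 * 2**20,
         "chain": 2, "default_chain": 2, "escalated_fft": False}
state = next_settings(state)  # [1]: accuracy stepped up
state = next_settings(state)  # [2]: accuracy maxed, larger FFT
state = next_settings(state)  # [3]: back to defaults, keep running
```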

The strategy in [3] might strike some as bizarre, but:

[a] I've actually seen such stuff occur not-infrequently on my favorite flaky-hardware test system, my old Haswell;

[b] If all the above workarounds fail due to the user's hardware or software install being borked, the run flailing around endlessly makes the user no worse off than he would be if the run simply aborted, and - this is important - we may get some valuable diagnostic data from the logfile capturing the flailing-around.

We need to operate under the assumption that all manner of errors might bork an ongoing run, and code our software to be robust in the face of such - users get a whole lot less upset at the program continuing to run, perhaps in slightly suboptimal mode for the given expo, than they do at waking up or checking in to find it simply quit hours or days before.

Prime95 2020-05-22 20:57

[QUOTE=ATH;546176]I can find 19 "PRP Bad" which is a lot more than I thought, but it is only 4 different users?[/QUOTE]

Sir Rutherford's results came from prime95's suspect Gerbicz implementation.
My 3 were from when I switched to PRP but forgot to upgrade prime95 to a Gerbicz-enabled version.
The other 2 users were running pre-Gerbicz versions of prime95.

In summary, nothing to see here.

Prime95 2020-05-22 21:26

[QUOTE=ewmayer;546228]George & Mihai, would it possible to reduce the odds of the program simply quitting even in the presence of such errors.[/QUOTE]

Current plan is to recompute the default crossovers to be more conservative.

Your idea to restart PRP with the next MM_CHAIN setting is worthwhile. Preda can comment on how difficult that would be to implement.

Other ideas include:
1) More conservative default settings for LL and P-1 as opposed to PRP.
2) A command line parameter for those that want to be more aggressive.

Thanks for everyone's patience. It is a work in progress.

kriesel 2020-05-22 22:13

[QUOTE=ewmayer;546228]We need to operate under the assumption that all manner of errors might bork an ongoing run, and code our software to be robust in the face of such - users get a whole lot less upset at the program continuing to run, perhaps in slightly suboptimal mode for the given expo, than they do at waking up or checking in to find it simply quit hours or days before.[/QUOTE]Amen to that. Even trying the next longer fft would be worthwhile. After running through whatever fallback list is settled on for "if the default or user-entered settings generate errors too frequently, try successive alternate approaches on the first worktodo entry", comment the problematic entry out and give the next entry a try. You'll probably want an option to lock out that substitution process, such as for testing.[QUOTE=Prime95;546241]Current plan is to recompute the default crossovers to be more conservative.


Your idea to restart PRP with the next MM_CHAIN setting is worthwhile. Preda can comment on how difficult that would be to implement.

Other ideas include:
1) More conservative default settings for LL and P-1 as opposed to PRP.
2) A command line parameter for those that want to be more aggressive.

Thanks for everyone's patience. It is a work in progress.[/QUOTE]Great progress for 3 years, especially considering the low license cost. And hardware manufacturers and the march up the exponent scale virtually guarantee it will remain a work in progress for quite some time.

preda 2020-05-22 22:53

[QUOTE=ewmayer;546228]George & Mihai, would it possible to reduce the odds of the program simply quitting even in the presence of such errors (whether caused by the ROCm optimizer or overaggressive breakpoints, or hardware glitchage) by using the nearness of the expo to the default max-p for the given FFT length via something like the following, which is more or less the way I do things in Mlucas? Note that even in the absence of per-iter-and-per-convolution-output-datum ROE checking, we can use the bits-per-word to make a reasonable inference as to whether excess ROE might be the cause of a J-check or G-check failure.
[/QUOTE]

There is no need to push the FFT limits into the danger zone; the goal is to have them set safely enough that it's extremely unlikely to hit an error because of FFT size. In this situation, adapting the size dynamically would just "hide the error" in the FFT limits. I'd rather hear the error loud and clear and fix it.

I do have plenty of errors on my poor GPUs, and (in my case) they've never been because of FFT limits yet (unfortunately, they're more serious HW errors). Anyway, adding the adaptive-FFT on top of that would be very confusing.

There is one way to reliably detect FFT errors, though: if the same error is hit in the same place *with the same residue*, i.e. the error is deterministic, then it's unlikely to be HW.
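That retry test could look something like the sketch below. The `run_block` interface is hypothetical, standing in for re-running one Gerbicz block from a saved checkpoint and reporting (passed, iteration, residue64):

```python
def classify_gerbicz_failure(run_block, checkpoint, retries=1):
    """Sketch of the retry test above. `run_block` is a hypothetical
    callable that re-runs one Gerbicz block from `checkpoint` and returns
    (passed, iteration, residue64). Failing in the same place with the
    same residue on retry suggests a deterministic (FFT/code) error
    rather than a hardware glitch."""
    first = run_block(checkpoint)
    if first[0]:
        return "transient"             # passed on retry: likely HW glitch
    for _ in range(retries):
        if run_block(checkpoint) != first:
            return "nondeterministic"  # differs on retry: likely HW
    return "deterministic"             # same place, same residue: likely FFT

# Demo with a stand-in that always fails identically:
always_same = lambda _cp: (False, 52_150_000, 0x10993d18dd3e57da)
print(classify_gerbicz_failure(always_same, None))  # deterministic
```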

ewmayer 2020-05-22 23:18

In my experience one can never be 100% safe with exponent limits ... one wants to set them as aggressively as reasonably possible to avoid wasted work, but we simply cannot eliminate the possibility of one iteration of a particular test hitting an 'unlucky' combination of FFT inputs which causes one or more convolution outputs to have an anomalously high fractional part. With per-iteration and per-output ROE checking those will result in a repeatable-on-retry ROE, which the code is set up to handle and is no big deal; we just need the breakpoints to be set so as to make this a rare event.

People have done rigorous ROE bounding for these kinds of computations, and the result is invariably that to be 100% safe, one must use exponent bounds *drastically* lower than we actually do. But that's not my call outside of my own code.

preda 2020-05-23 01:54

[QUOTE=ewmayer;546250]In my experience one can never be 100% safe with exponent limits ... one wants to set them as aggressively as reasonably possible to save wasted work, but we simply cannot eliminate the possibility of one iteration of a particular test hitting an 'unlucky' combination of FFT inputs which cause one or more convolution outputs to have an anomalously high fractional part. With per-iter-and-per-output-ROE checking those will result in a repeatable-on-retry ROE, which is the code is set up to handle, is no big deal, we just the breakpoints to be set so as to make this a rare event.

People have done rigorous ROE bounding for these kinds of computations, and the result is invariably that to be 100% safe, one must use exponent bounds *drastically* lower than we actually do. But, not my call outside of my own code.[/QUOTE]

It can be modelled statistically with a distribution. Then you don't talk about being 100% safe, but about the probability of having at least one roundoff overflow over the number of iterations of the whole test. E.g. the FFT bounds could be set such that the probability of at least one roundoff overflow over the 100M iterations of a test is under 0.001, which would mean I expect one roundoff problem in 1000 tests, and that's it.
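This budget is straightforward arithmetic: with an independent per-iteration overflow probability p, the chance of at least one overflow in N iterations is 1 - (1-p)^N, which is approximately p*N for small p, so a 0.001-per-test target over 100M iterations implies roughly p <= 1e-11 per iteration. A small sketch:

```python
import math

def p_at_least_one(p_iter, n_iters):
    """P(at least one roundoff overflow in n_iters independent iterations),
    i.e. 1 - (1 - p_iter)**n_iters, computed with log1p/expm1 so tiny
    probabilities don't vanish in floating point."""
    return -math.expm1(n_iters * math.log1p(-p_iter))

def p_iter_budget(p_test, n_iters):
    """Largest per-iteration probability keeping the whole test under p_test."""
    return -math.expm1(math.log1p(-p_test) / n_iters)

n = 100_000_000
print(p_iter_budget(0.001, n))    # ~1.0e-11 per iteration
print(p_at_least_one(1e-11, n))   # ~0.001 per test
```

The log1p/expm1 forms matter here: computing `1 - (1 - 1e-11)**n` directly would lose most of the significant digits of such a small per-iteration probability.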

