[QUOTE=preda;543102]I see in this case that the residue was correct when the error was reported (because the line with "OK" on the same iteration has the same residue). Is this a pattern -- do you see the same for the previous errors?
The most likely explanation is still GPU error (either memory-related or processor related). Do you have another similar GPU to try on, for comparison? The roundoff being large is most likely a red herring here.[/QUOTE] I now have -use STATS on 3 gpus, for the time being, and will report anything that seems interesting. The performance drag is considerable. That 30 hours to catch one EE with stats could have been run in 25 hours without stats. I may update an rx480 instance and add it to the test; that's on the same dual-hex-core system so might be same cpu core but more likely not unless I start setting core affinities. More likely the same complement of system ram though. [CODE]Preda asks, Were the earlier occurrences' res64s correct despite the EEs?
yes 6, unknown 2, no 0; the ayes have it.
1,2, can't tell from logs v6.11-257
2020-04-14 04:59:31 condorella/rx550 94741139 OK 6800000 7.18%; 13712 us/it; ETA 13d 22:58; cab0b7a0fb0cc066 (check 5.65s)
2020-04-14 05:45:17 condorella/rx550 94741139 EE 7000000 7.39%; 13710 us/it; ETA 13d 22:09; 5e731e02beb738ea (check 5.61s)
2020-04-14 05:45:23 condorella/rx550 94741139 OK 6800000 loaded: blockSize 400, cab0b7a0fb0cc066
2020-04-14 05:54:37 condorella/rx550 94741139 OK 6840000 7.22%; 13711 us/it; ETA 13d 22:46; 4f7b98cea0650fb9 (check 5.63s) 1 errors
2020-04-14 06:22:07 condorella/rx550 94741139 OK 6960000 7.35%; 13714 us/it; ETA 13d 22:24; a47542d527e8a188 (check 5.63s) 1 errors
2020-04-14 06:49:37 condorella/rx550 94741139 EE 7080000 7.47%; 13711 us/it; ETA 13d 21:53; b71198a3d710f35b (check 5.62s) 1 errors
2020-04-14 06:49:43 condorella/rx550 94741139 OK 6960000 loaded: blockSize 400, a47542d527e8a188
2020-04-14 07:09:07 condorella/rx550 94741139 OK 7040000 7.43%; 13716 us/it; ETA 13d 22:08; b7ef942604ff7e9d (check 5.62s) 2 errors
2020-04-14 07:27:29 condorella/rx550 94741139 OK 7120000 7.52%; 13710 us/it; ETA 13d 21:42; 5c248da8f1d53306 (check 5.64s) 2 errors
2020-04-14 07:45:51 condorella/rx550 94741139 OK 7200000 7.60%; 13713 us/it; ETA 13d 21:27; 679505ffa3183075 (check 5.63s) 2 errors
3,4 yes
2020-04-14 10:31:16 condorella/rx550 94741139 OK 7920000 8.36%; 13706 us/it; ETA 13d 18:32; 65631a9de20e1074 (check 5.80s) 2 errors
2020-04-14 10:49:38 condorella/rx550 94741139 EE 8000000 8.44%; 13714 us/it; ETA 13d 18:27; d5fca8bd937ae862 (check 5.62s) 2 errors
2020-04-14 10:49:44 condorella/rx550 94741139 OK 7920000 loaded: blockSize 400, 65631a9de20e1074
2020-04-14 10:58:58 condorella/rx550 94741139 EE 7960000 8.40%; 13708 us/it; ETA 13d 18:27; 5512a7950572a594 (check 5.61s) 3 errors
2020-04-14 10:59:04 condorella/rx550 94741139 OK 7920000 loaded: blockSize 400, 65631a9de20e1074
2020-04-14 11:08:18 condorella/rx550 94741139 OK 7960000 8.40%; 13716 us/it; ETA 13d 18:38; 5512a7950572a594 (check 5.62s) 4 errors
2020-04-14 11:17:32 condorella/rx550 94741139 OK 8000000 8.44%; 13712 us/it; ETA 13d 18:24; d5fca8bd937ae862 (check 5.73s) 4 errors
2020-04-14 11:26:45 condorella/rx550 94741139 OK 8040000 8.49%; 13706 us/it; ETA 13d 18:06; 8443b148d25a6ac2 (check 5.63s) 4 errors
5 yes
2020-04-14 15:17:31 condorella/rx550 94741139 OK 9040000 9.54%; 13708 us/it; ETA 13d 14:20; 1d49e02d84ebc14b (check 5.80s) 4 errors
2020-04-14 15:26:44 condorella/rx550 94741139 EE 9080000 9.58%; 13709 us/it; ETA 13d 14:13; 071b118bfd9c270e (check 5.61s) 4 errors
2020-04-14 15:26:50 condorella/rx550 94741139 OK 9040000 loaded: blockSize 400, 1d49e02d84ebc14b
2020-04-14 15:36:04 condorella/rx550 94741139 OK 9080000 9.58%; 13714 us/it; ETA 13d 14:19; 071b118bfd9c270e (check 5.62s) 5 errors
2020-04-14 15:45:18 condorella/rx550 94741139 OK 9120000 9.63%; 13708 us/it; ETA 13d 14:02; 6a7e1626bbbcb964 (check 5.63s) 5 errors
6 yes
2020-04-14 20:49:58 condorella/rx550 94741139 OK 10440000 11.02%; 13712 us/it; ETA 13d 09:06; d2720a83125e79ee (check 5.62s) 5 errors
2020-04-14 20:59:11 condorella/rx550 94741139 EE 10480000 11.06%; 13717 us/it; ETA 13d 09:03; 947499a4cc1fd4fe (check 5.61s) 5 errors
2020-04-14 20:59:18 condorella/rx550 94741139 OK 10440000 loaded: blockSize 400, d2720a83125e79ee
2020-04-14 21:08:32 condorella/rx550 94741139 OK 10480000 11.06%; 13722 us/it; ETA 13d 09:10; 947499a4cc1fd4fe (check 5.63s) 6 errors
7,8 yes
2020-04-15 00:31:38 condorella/rx550 94741139 OK 11360000 11.99%; 13710 us/it; ETA 13d 05:33; 98dbac1057665909 (check 5.63s) 6 errors
2020-04-15 00:40:52 condorella/rx550 94741139 EE 11400000 12.03%; 13717 us/it; ETA 13d 05:33; a0464605ba0b9bf6 (check 5.61s) 6 errors
2020-04-15 00:40:58 condorella/rx550 94741139 OK 11360000 loaded: blockSize 400, 98dbac1057665909
2020-04-15 00:50:12 condorella/rx550 94741139 OK 11400000 12.03%; 13715 us/it; ETA 13d 05:30; a0464605ba0b9bf6 (check 5.63s) 7 errors
2020-04-15 00:59:25 condorella/rx550 94741139 EE 11440000 12.07%; 13705 us/it; ETA 13d 05:07; 59fd18a546ca4936 (check 5.61s) 7 errors
2020-04-15 00:59:31 condorella/rx550 94741139 OK 11400000 loaded: blockSize 400, a0464605ba0b9bf6
2020-04-15 01:08:45 condorella/rx550 94741139 OK 11440000 12.07%; 13712 us/it; ETA 13d 05:17; 59fd18a546ca4936 (check 5.72s) 8 errors
[/CODE] |
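For anyone who wants to repeat the "did the EE res64 match the later OK res64" tally on their own logs, here is a minimal Python sketch. It assumes the progress-line layout seen in the log above; the regular expression, the function name `classify_ees` and the `SAMPLE` excerpt are illustrative choices, not part of gpuOwl. An EE counts as "yes" when a later OK line at the same iteration carries the same res64, "no" when it carries a different one, and "unknown" when no OK line at that iteration was logged.

```python
import re

# Matches progress lines such as
# "2020-04-14 10:49:38 condorella/rx550 94741139 EE 8000000 8.44%; ...; d5fca8bd937ae862 (check 5.62s)".
# "loaded:" lines have no percentage field, so they do not match and are skipped.
LINE_RE = re.compile(
    r"^\S+ \S+ \S+ \d+ (OK|EE)\s+(\d+) [\d.]+%;.*; ([0-9a-f]{16}) \(check"
)

def classify_ees(lines):
    """Return a list of (iteration, verdict) pairs, one per EE line."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            events.append((m.group(1), int(m.group(2)), m.group(3)))

    verdicts = []
    for idx, (status, iteration, res64) in enumerate(events):
        if status != "EE":
            continue
        verdict = "unknown"  # no later OK line at this iteration found yet
        for later_status, later_iter, later_res in events[idx + 1:]:
            if later_status == "OK" and later_iter == iteration:
                verdict = "yes" if later_res == res64 else "no"
                break
        verdicts.append((iteration, verdict))
    return verdicts

# Excerpt copied from the log above (EE #1 and EEs #3, 4).
SAMPLE = """\
2020-04-14 05:45:17 condorella/rx550 94741139 EE 7000000 7.39%; 13710 us/it; ETA 13d 22:09; 5e731e02beb738ea (check 5.61s)
2020-04-14 05:45:23 condorella/rx550 94741139 OK 6800000 loaded: blockSize 400, cab0b7a0fb0cc066
2020-04-14 05:54:37 condorella/rx550 94741139 OK 6840000 7.22%; 13711 us/it; ETA 13d 22:46; 4f7b98cea0650fb9 (check 5.63s) 1 errors
2020-04-14 10:49:38 condorella/rx550 94741139 EE 8000000 8.44%; 13714 us/it; ETA 13d 18:27; d5fca8bd937ae862 (check 5.62s) 2 errors
2020-04-14 10:58:58 condorella/rx550 94741139 EE 7960000 8.40%; 13708 us/it; ETA 13d 18:27; 5512a7950572a594 (check 5.61s) 3 errors
2020-04-14 11:08:18 condorella/rx550 94741139 OK 7960000 8.40%; 13716 us/it; ETA 13d 18:38; 5512a7950572a594 (check 5.62s) 4 errors
2020-04-14 11:17:32 condorella/rx550 94741139 OK 8000000 8.44%; 13712 us/it; ETA 13d 18:24; d5fca8bd937ae862 (check 5.73s) 4 errors
"""

for iteration, verdict in classify_ees(SAMPLE.splitlines()):
    print(iteration, verdict)
```

On this excerpt the first EE stays "unknown" because the log interval changed after the reload and iteration 7000000 was never reported again, which is the same reason EEs 1 and 2 are listed as "can't tell from logs".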
[QUOTE=kriesel;543107]I now have -use STATS on 3 gpus, for the time being, and will report anything that seems interesting. The performance drag is considerable. That 30 hours to catch one EE with stats could have been run in 25 hours without stats. I may update an rx480 instance and add it to the test; that's on the same dual-hex-core system so might be same cpu core but more likely not unless I start setting core affinities. More likely the same complement of system ram though.
[CODE]Preda asks, Were the earlier occurrences' res64s correct despite the EEs? yes 6, unknown 2, no 0; the ayes have it. [/CODE] [/QUOTE] Interesting. Are you running with any -use options, in particular CARRYM32 ? I'm trying to understand what would produce the "spurious errors" you see. The main computation is correct, the errors are affecting the check only (or something related to the check). I don't see any particular benefit in running this with STATS. There is no danger of a genuine overflow for those exponents. |
[QUOTE=preda;543113]Interesting. Are you running with any -use options, in particular CARRYM32 ?[/QUOTE]NO_ASM and sometimes STATS. That's it.
I gave up for now on trying to keep up on the rapidly shifting availability of optimization choices. They also made operation fragile, since they sometimes would be illegal and terminate the program when fft length changed from one worktodo exponent to the next. What was optimal for one fft length was not allowed on another and the program would terminate, costing hours or days. [QUOTE=kriesel;542963]All on the same RX550 gpu and host system, that has reliably run without GEC errors for multiple 5M PRP first-tests on v6.11-134:
90710093 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.30 bits/word
92858651 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.71 bits/word
93461911 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.83 bits/word
93873049 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.90 bits/word
94418047 began FFT 5120K: Width 256x4, Height 64x4, Middle 10; 18.01 bits/word, ran ok. finished at 6M because of problems encountered at 7M on 131.5M, requiring -fft +6
All runs [B]-use NO_ASM[/B]
V6.11-134 on rx550 from start on 94741139; no GEC to 2.8M iterations; this was at 6M fft, 18370 us/iter due to problems seen with 7M fft (leftover -fft +6 config.txt content)
V6.11-257 continuation on same rx550, 5M fft 1K:5:512; 14310 us/iter to 6.0544M iterations
V6.11-257 continuation on same RX550, no fft specification; 1K:10:256 chosen by program; ~13712 us/iter, 9 GEC errors by 22.1M iterations.
V6.11-259 continuation with -use STATS on same RX550, 1K:10:256, ~16542 us/iter, no additional GEC through 25.64M iterations.
V6.11-259 continuation without -use STATS underway now, 13750 usec/iter[/QUOTE]
V6.11-257 folder's entire config.txt:[CODE]-device 1 -user kriesel -cpu condorella/rx550 -yield -maxAlloc 3600 -use NO_ASM [/CODE]
V6.11-259 folder's entire config.txt:[CODE]-device 1 -user kriesel -cpu condorella/rx550 -yield -maxAlloc 3600 -use NO_ASM,STATS [/CODE]
A look through the Windows system log showed nothing relevant around the times of the EE occurrences. |
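The bits/word figures in the quoted list can be checked with a few lines of Python. This is a small sketch that assumes "FFT 5120K" means 5120 × 1024 words and that bits/word is simply the exponent divided by that length; the function name `bits_per_word` is illustrative. Under that assumption the logged values are reproduced to two decimals.

```python
def bits_per_word(exponent: int, fft_k: int) -> float:
    """Average bits stored per FFT word, assuming a length of fft_k * 1024 words."""
    return exponent / (fft_k * 1024)

# (exponent, bits/word as logged) for the 5120K runs listed above
RUNS = [
    (90710093, 17.30),
    (92858651, 17.71),
    (93461911, 17.83),
    (93873049, 17.90),
    (94418047, 18.01),
]

for exponent, logged in RUNS:
    computed = bits_per_word(exponent, 5120)
    print(f"{exponent}: computed {computed:.2f} bits/word, logged {logged:.2f}")
```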
[QUOTE=preda;543113]I don't see any particular benefit in running this with STATS. There is no danger of a genuine overflow for those exponents.[/QUOTE]I've continued STATS a while. If nothing else, it provides confirmation of the margins built in against roundoff error.
The RX480 instance has been updated to v6.11-264 and has not shown any issues on the same system. I own multiple RX550s, and this may be one of the older ones. EEs #11 and 12 also matched their respective ok res64s:[CODE]2020-04-19 08:41:38 condorella/rx550 94741139 OK 35760000 37.74%; 16554 us/it; ETA 11d 07:13; f181b958d4edbb40 (check 6.79s) 10 errors
2020-04-19 08:52:40 condorella/rx550 Roundoff: N=40500, max 0.507109, avg 0.205147, sdev 0.012284 (0.059880, 0.061539), max-round 0.401694
2020-04-19 08:52:40 condorella/rx550 Carry: N=40499, max 39fed29e, avg 2b55373f; CarryM: N=1, max 8d57bc50, avg 8d57bc50
2020-04-19 08:52:47 condorella/rx550 94741139 EE 35800000 37.79%; 16553 us/it; ETA 11d 07:01; 09e2502ff060aa59 (check 6.80s) 10 errors
2020-04-19 08:52:54 condorella/rx550 94741139 OK 35760000 loaded: blockSize 400, f181b958d4edbb40
2020-04-19 09:03:56 condorella/rx550 Roundoff: N=40928, max 0.292142, avg 0.205081, sdev 0.012278 (0.059870, 0.061527), max-round 0.401532
2020-04-19 09:03:56 condorella/rx550 Carry: N=40926, max 39fed29e, avg 2b52fd08; CarryM: N=2, max 7f8fee13, avg 6ae2001d
2020-04-19 09:04:03 condorella/rx550 94741139 OK 35800000 37.79%; 16549 us/it; ETA 11d 06:57; 09e2502ff060aa59 (check 6.79s) 11 errors
2020-04-19 09:15:05 condorella/rx550 Roundoff: N=40500, max 0.308341, avg 0.205111, sdev 0.012159 (0.059278, 0.060903), max-round 0.399648
2020-04-19 09:15:05 condorella/rx550 Carry: N=40499, max 3ff202a2, avg 2b532f47; CarryM: N=1, max 79fdf537, avg 79fdf537
2020-04-19 09:15:11 condorella/rx550 94741139 OK 35840000 37.83%; 16551 us/it; ETA 11d 06:47; 72081dbe2707fc8b (check 6.80s) 11 errors
...
2020-04-19 12:02:25 condorella/rx550 94741139 OK 36440000 38.46%; 16554 us/it; ETA 11d 04:05; 371ec60215fd8888 (check 6.83s) 11 errors
2020-04-19 12:13:28 condorella/rx550 Roundoff: N=40500, max 0.317265, avg 0.205178, sdev 0.012361 (0.060245, 0.061923), max-round 0.402951
2020-04-19 12:13:28 condorella/rx550 Carry: N=40499, max 39e4bbc8, avg 2b541ee3; CarryM: N=1, max 88c9f1c4, avg 88c9f1c4
2020-04-19 12:13:34 condorella/rx550 94741139 OK 36480000 38.50%; 16552 us/it; ETA 11d 03:53; cf6fab81c6c3e53d (check 6.79s) 11 errors
2020-04-19 12:24:37 condorella/rx550 Roundoff: N=40500, max 0.507318, avg 0.205169, sdev 0.012261 (0.059762, 0.061414), max-round 0.401350
2020-04-19 12:24:37 condorella/rx550 Carry: N=40499, max 4031f5fd, avg 2b513507; CarryM: N=1, max 7ca22e72, avg 7ca22e72
2020-04-19 12:24:43 condorella/rx550 94741139 EE 36520000 38.55%; 16557 us/it; ETA 11d 03:46; a73fe45b46915805 (check 6.78s) 11 errors
2020-04-19 12:24:51 condorella/rx550 94741139 OK 36480000 loaded: blockSize 400, cf6fab81c6c3e53d
2020-04-19 12:35:56 condorella/rx550 Roundoff: N=40928, max 0.289129, avg 0.205117, sdev 0.012271 (0.059823, 0.061479), max-round 0.401450
2020-04-19 12:35:56 condorella/rx550 Carry: N=40926, max 4031f5fd, avg 2b4ebaf7; CarryM: N=2, max 9fd2a241, avg 7a4f0e2d
2020-04-19 12:36:03 condorella/rx550 94741139 OK 36520000 38.55%; 16643 us/it; ETA 11d 05:10; a73fe45b46915805 (check 6.79s) 12 errors[/CODE]
I've just migrated this run in progress again, to v6.11-268 on the same gpu, same config.txt. |
Gpuowl-win v6.11-268-g0d07d21 build
2 Attachment(s)
Under test now.
|
[QUOTE=preda;543106]More precisely, given reverse-weight "w" and FFT-output word "x", the error is computed as:
abs(FMA(x, w, -rint(x * w))); which, arguably, can be larger than 0.5.[/QUOTE] If we agree that x is a given convolution output, which is expected to be an integer using exact arithmetic, and the fractional part of x as computed using inexact arithmetic is the absolute value of the difference between x and nearest_int(x), that is by definition in [0,0.5]. If your actual way of computing said fractional error itself introduces addition error so that your frac(x) != abs(x - nearest_int(x)), that is a separate issue. But I appreciate the clarification. Since gpuOwl is not using stats about such fractional errors to decide if the FFT length needs to be upped, accurately computing frac(x) is less important than for programs which do make use of same. |
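The difference between the two definitions can be reproduced on a CPU. The following is a minimal Python sketch, not gpuOwl's code: it assumes rint() rounds ties to even (Python's round() does the same), emulates the FMA with exact rational arithmetic followed by one rounding, and uses the illustrative values x = 25.0 and w = 0.1. The helper names `fma`, `roundoff_fma` and `roundoff_plain` are made up for this example.

```python
from fractions import Fraction

def fma(a: float, b: float, c: float) -> float:
    """Emulate a fused multiply-add: a*b + c evaluated exactly, then rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def roundoff_fma(x: float, w: float) -> float:
    """abs(FMA(x, w, -rint(x * w))), with round() standing in for rint()."""
    nearest = float(round(x * w))  # nearest integer to the *rounded* product
    return abs(fma(x, w, -nearest))

def roundoff_plain(x: float, w: float) -> float:
    """abs(p - nearest_int(p)) for the rounded product p; this is at most 0.5."""
    p = x * w
    return abs(p - round(p))

x, w = 25.0, 0.1
print(roundoff_plain(x, w))  # distance measured on the rounded product
print(roundoff_fma(x, w))    # distance measured on the exact product
```

Here the stored value of 0.1 is slightly above one tenth, so the exact product is slightly above 2.5 while the rounded product is exactly 2.5. rint() then picks 2, and the FMA reports the distance of the exact product from 2, which lands one unit in the last place above 0.5. The excess in this construction is tiny, so it only demonstrates that the FMA form is not bounded by 0.5.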
[QUOTE=ewmayer;543200]Since gpuOwl is not using stats about such fractional errors to decide if the FFT length needs to be upped, accurately computing frac(x) is less important than for programs which do make use of same.[/QUOTE]
Well, we are using the stats to decide when the FFT length needs to be upped. However, the current code is just as likely to compute the fractional part low as high, and we're using the average max roundoff error, so it all works out in the end. Over the last few weeks we've managed to increase the maximum exponent that can be tested with a 5M FFT by over a million. I had to do this because I'm oh so close to being assigned exponents that would have pushed me into the 5.5M FFT. I know, very selfish :) |
[QUOTE=kriesel;543176]I've just migrated this run in progress again, to v6.11-268 on the same gpu, same config.txt.[/QUOTE]This was still V6.11-259, same rx550 and system:
For EEs # 13, 14, 15, res64s also repeated, and max > 0.50 were observed.[CODE]2020-04-19 18:35:46 condorella/rx550 94741139 OK 37800000 39.90%; 16558 us/it; ETA 10d 21:54; c6ff48ecd994dbaf (check 6.81s) 12 errors
2020-04-19 18:46:49 condorella/rx550 Roundoff: N=40500, [B]max 0.507793[/B], avg 0.205110, sdev 0.012265 (0.059799, 0.061453), max-round 0.401355
2020-04-19 18:46:49 condorella/rx550 Carry: N=40499, max 3938aaef, avg 2b4f1494; CarryM: N=1, max 8da124ce, avg 8da124ce
2020-04-19 18:46:55 condorella/rx550 94741139 [B]EE[/B] 37840000 39.94%; 16556 us/it; ETA 10d 21:41; 55bb210f90f33b08 (check 6.77s) 12 errors
2020-04-19 18:47:03 condorella/rx550 94741139 OK 37800000 loaded: blockSize 400, c6ff48ecd994dbaf
2020-04-19 18:58:05 condorella/rx550 Roundoff: N=40928, max 0.285702, avg 0.205056, sdev 0.012282 (0.059894, 0.061553), max-round 0.401561
2020-04-19 18:58:05 condorella/rx550 Carry: N=40926, max 3938aaef, avg 2b4d3aac; CarryM: N=2, max 8184df04, avg 6c12c45b
2020-04-19 18:58:12 condorella/rx550 94741139 OK 37840000 39.94%; 16561 us/it; ETA 10d 21:45; 55bb210f90f33b08 (check 6.80s) 13 errors
2020-04-19 19:09:14 condorella/rx550 Roundoff: N=40500, max 0.280020, avg 0.205015, sdev 0.012088 (0.058962, 0.060569), max-round 0.398423
2020-04-19 19:09:14 condorella/rx550 Carry: N=40499, max 3a6f97e3, avg 2b4b4bdd; CarryM: N=1, max 80be9c3a, avg 80be9c3a
...
2020-04-19 21:23:08 condorella/rx550 94741139 OK 38360000 40.49%; 16551 us/it; ETA 10d 19:12; 7a4b392aea8ba6b3 (check 6.79s) 13 errors
2020-04-19 21:34:10 condorella/rx550 Roundoff: N=40500, [B]max 0.505375[/B], avg 0.205216, sdev 0.012323 (0.060051, 0.061719), max-round 0.402390
2020-04-19 21:34:10 condorella/rx550 Carry: N=40499, max 39289dae, avg 2b50054a; CarryM: N=1, max 89e4790c, avg 89e4790c
2020-04-19 21:34:17 condorella/rx550 94741139 [B]EE[/B] 38400000 40.53%; 16552 us/it; ETA 10d 19:03; 72bd6fa0e937b804 (check 6.77s) 13 errors
2020-04-19 21:34:24 condorella/rx550 94741139 OK 38360000 loaded: blockSize 400, 7a4b392aea8ba6b3
2020-04-19 21:45:26 condorella/rx550 Roundoff: N=40928, max 0.308075, avg 0.205175, sdev 0.012357 (0.060226, 0.061904), max-round 0.402885
2020-04-19 21:45:26 condorella/rx550 Carry: N=40926, max 39289dae, avg 2b4e41e9; CarryM: N=2, max 87679ef6, avg 75444180
2020-04-19 21:45:33 condorella/rx550 94741139 OK 38400000 40.53%; 16549 us/it; ETA 10d 19:00; 72bd6fa0e937b804 (check 6.84s) 14 errors
2020-04-19 21:56:35 condorella/rx550 Roundoff: N=40500, max 0.297735, avg 0.205119, sdev 0.012226 (0.059606, 0.061249), max-round 0.400741
2020-04-19 21:56:35 condorella/rx550 Carry: N=40499, max 3b844536, avg 2b5043a2; CarryM: N=1, max 8422e711, avg 8422e711
...
2020-04-20 01:50:43 condorella/rx550 94741139 OK 39280000 41.46%; 16541 us/it; ETA 10d 14:49; 656629c4657b02f0 (check 6.84s) 14 errors
2020-04-20 02:01:44 condorella/rx550 Roundoff: N=40500, [B]max 0.503906[/B], avg 0.205072, sdev 0.012220 (0.059587, 0.061229), max-round 0.400584
2020-04-20 02:01:44 condorella/rx550 Carry: N=40499, max 3d252f79, avg 2b52cd29; CarryM: N=1, max 7af459d9, avg 7af459d9
2020-04-20 02:01:51 condorella/rx550 94741139 [B]EE[/B] 39320000 41.50%; 16542 us/it; ETA 10d 14:40; 4b750a9575434d29 (check 6.77s) 14 errors
2020-04-20 02:01:58 condorella/rx550 94741139 OK 39280000 loaded: blockSize 400, 656629c4657b02f0
2020-04-20 02:13:00 condorella/rx550 Roundoff: N=40928, max 0.302707, avg 0.205027, sdev 0.012249 (0.059741, 0.061392), max-round 0.401003
2020-04-20 02:13:00 condorella/rx550 Carry: N=40926, max 3d252f79, avg 2b513065; CarryM: N=2, max 82276967, avg 704e0507
2020-04-20 02:13:07 condorella/rx550 94741139 OK 39320000 41.50%; 16543 us/it; ETA 10d 14:40; 4b750a9575434d29 (check 6.80s) 15 errors
2020-04-20 02:24:09 condorella/rx550 Roundoff: N=40500, max 0.297502, avg 0.205151, sdev 0.012245 (0.059690, 0.061338), max-round 0.401078
2020-04-20 02:24:09 condorella/rx550 Carry: N=40499, max 3de87af2, avg 2b53fed7; CarryM: N=1, max 7d3fca8a, avg 7d3fca8a[/CODE]
I will swap out the RX550 for a different unit after a trial of v6.11-268 if it also produces such EE occurrences. |
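To check whether every Roundoff max above 0.5 lines up with an EE, the STATS lines can be paired with the progress line that follows them. This is a rough Python sketch that assumes the log layout shown above; the regular expressions and the names `strip_bold` and `roundoff_vs_status` are illustrative and not part of gpuOwl. It also strips the [B] tags that were added for emphasis in the post.

```python
import re

ROUNDOFF_RE = re.compile(r"Roundoff: N=\d+, max ([\d.]+),")
STATUS_RE = re.compile(r" (OK|EE)\s+(\d+) [\d.]+%;")  # "loaded:" lines do not match

def strip_bold(line: str) -> str:
    """Remove the forum's [B]...[/B] emphasis tags from a pasted log line."""
    return re.sub(r"\[/?B\]", "", line)

def roundoff_vs_status(lines):
    """Pair each Roundoff max with the next OK/EE progress line.

    Returns a list of (iteration, status, roundoff_max) tuples.
    """
    pairs = []
    pending_max = None
    for raw in lines:
        line = strip_bold(raw)
        m = ROUNDOFF_RE.search(line)
        if m:
            pending_max = float(m.group(1))
            continue
        s = STATUS_RE.search(line)
        if s and pending_max is not None:
            pairs.append((int(s.group(2)), s.group(1), pending_max))
            pending_max = None
    return pairs

# Excerpt copied from the log above (EE #13 and its retry).
SAMPLE = """\
2020-04-19 18:46:49 condorella/rx550 Roundoff: N=40500, [B]max 0.507793[/B], avg 0.205110, sdev 0.012265 (0.059799, 0.061453), max-round 0.401355
2020-04-19 18:46:49 condorella/rx550 Carry: N=40499, max 3938aaef, avg 2b4f1494; CarryM: N=1, max 8da124ce, avg 8da124ce
2020-04-19 18:46:55 condorella/rx550 94741139 [B]EE[/B] 37840000 39.94%; 16556 us/it; ETA 10d 21:41; 55bb210f90f33b08 (check 6.77s) 12 errors
2020-04-19 18:47:03 condorella/rx550 94741139 OK 37800000 loaded: blockSize 400, c6ff48ecd994dbaf
2020-04-19 18:58:05 condorella/rx550 Roundoff: N=40928, max 0.285702, avg 0.205056, sdev 0.012282 (0.059894, 0.061553), max-round 0.401561
2020-04-19 18:58:12 condorella/rx550 94741139 OK 37840000 39.94%; 16561 us/it; ETA 10d 21:45; 55bb210f90f33b08 (check 6.80s) 13 errors
"""

for iteration, status, roundoff_max in roundoff_vs_status(SAMPLE.splitlines()):
    flag = "  <-- max > 0.5" if roundoff_max > 0.5 else ""
    print(iteration, status, roundoff_max, flag)
```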
1 Attachment(s)
The biggest issue I see now with gpuOwl+LL is the fact that when something like this happens, you need to restart both of them from scratch. Even if you were able to detect when the "thing" happens, if the cards are just a bit "out of phase" (which they ARE, because one is always a bit faster), there is no way to know which one is good and which one is bad, and there is no way to resume from THAT specific iteration. We tried old switches that used to work, like -saveStep or variants; they are not in the help anymore, but we hoped... and hoped...
[ATTACH]22071[/ATTACH] We want checkpoint files, called "exponent.iteration.residue.whatever"; otherwise we are doomed to waste two R7s for a few hours on average every time a naughty bit humps here and there... And non-zero shift... And vanilla ice cream... |
[QUOTE=LaurV;543250]The biggest issue I see now with gpuOwl+LL is the fact that when something like this happens, you need to restart both of them from scratch[/QUOTE]You don't do backups?
You could try a tiebreaker third gpu. And/or one running CUDALucas on a suitable gpu. From the CUDALucas.ini file, [CODE]# SaveAllCheckpoints is the same as the -s option. When active, CUDALucas will
# save each checkpoint separately in the folder specified in the "SaveFolder"
# option above. This is a binary option; set to 1 to activate, 0 to de-activate.
SaveAllCheckpoints=1
# This option is the name of the folder where the separate checkpoint files are
# saved. This option is only checked if SaveAllCheckpoints is activated.
SaveFolder=savefiles[/CODE] [URL]https://www.mersenneforum.org/showpost.php?p=489059&postcount=2[/URL] |
As usual, unrelated. Lots of clutter.
[QUOTE=LaurV;543250]gpuOwl+LL[/QUOTE] |