[QUOTE=kriesel;540709]I'd be skeptical about the performance advantage of running too disparate parallel runs. I've seen it reduce throughput. PRP & LL in tandem, for example, which is different code from different versions.
Did you use -maxAlloc for your P-1 run? If not, start, and if doing parallel runs the limit will need to be lower than if the P-1 stage 2 has the gpu ram to itself.[/QUOTE] I was just running 2 separate PRP-assignment jobs - for the PRPs there is a marked throughput boost from 2-job-running (cf. my timings in post #1956) - one of which just happened to start on a PRP-assignment for which p-1 had not yet been done. Not using -maxAlloc. [QUOTE]So can we count you as another fan of P-1 save files?[/QUOTE] I'm a fan of doing whatever works for increasing users' overall throughput! :) That of course includes minimizing wasted time resulting from run-crashes/BSODs/system-resets/etc. |
GpuOwl P-1 error detection and handling
Gpuowl stage 1 needs a res64 error check. The run below was with v6.11-134.[CODE]2020-03-25 00:57:17 roa/radeonvii 550000007 FFT 36864K: Width 256x4, Height 256x8, Middle 9; 14.57 bits/word
2020-03-25 00:57:25 roa/radeonvii OpenCL args "-DEXP=550000007u -DWIDTH=1024u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xa.c7166b9401b18p-3 -DIWEIGHT_STEP=0xb.e05b1786463ap-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DAMDGPU=1 -DCARRY32=1 -DCHEBYSHEV_METHOD=1 -DLESS_ACCURATE=1 -DMERGED_MIDDLE=1 -DNO_ASM=1 -DORIG_MIDDLEMUL2=1 -DT2_SHUFFLE_HEIGHT=1 -DT2_SHUFFLE_REVERSELINE=1 -DUNROLL_MIDDLEMUL2=1 -DWORKINGIN5=1 -DWORKINGOUT3=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-03-25 00:57:31 roa/radeonvii OpenCL compilation in 5.88 s
2020-03-25 00:57:34 roa/radeonvii 550000007 P1 B1=5070000, B2=152100000; 7315345 bits; starting at 0
2020-03-25 00:59:08 roa/radeonvii 550000007 P1 10000 0.14%; 9433 us/it; ETA 0d 19:09; c8b2127abc38054b
2020-03-25 01:00:43 roa/radeonvii 550000007 P1 20000 0.27%; 9433 us/it; ETA 0d 19:07; 6f401486d14cb20f
2020-03-25 01:02:17 roa/radeonvii 550000007 P1 30000 0.41%; 9431 us/it; ETA 0d 19:05; 18a926611c75b118
2020-03-25 01:02:37 roa/radeonvii saved
...
2020-03-25 15:08:30 roa/radeonvii saved
2020-03-25 15:09:08 roa/radeonvii 550000007 P1 5390000 73.68%; 9582 us/it; ETA 0d 05:07; 4cfe624d31a00e27
2020-03-25 15:10:42 roa/radeonvii 550000007 P1 5400000 73.82%; 9428 us/it; ETA 0d 05:01; [COLOR=Red][B]0000000000000000[/B][/COLOR]
2020-03-25 15:12:16 roa/radeonvii 550000007 P1 5410000 73.95%; 9424 us/it; ETA 0d 04:59; 0000000000000000
2020-03-25 15:13:32 roa/radeonvii saved[/CODE]Fourteen hours into the computation, an error occurred that zeroed the residue. The program did not detect the error. It continued powering the zero residue for the remaining iterations, periodically overwriting both of its save files with bad interim results, for 5 more hours. It then appears to skip the stage 1 GCD under the error condition detected at the end of the iterations. 
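For illustration, a minimal sketch of the kind of res64 check meant here. The function names are hypothetical and are not taken from the gpuowl source; the assumption is only that the 64-bit residue printed in the log is available to the host code at each progress checkpoint, before the save files are overwritten.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper, not gpuowl's actual API.
// A res64 of zero is treated as an error: a correct stage 1 residue is
// astronomically unlikely to have all 64 low bits equal to zero.
bool res64LooksValid(std::uint64_t res64) {
  return res64 != 0;
}

// Throws so the caller can roll back to the last good checkpoint instead of
// continuing to iterate on (and save) a zeroed residue.
void checkStage1Residue(std::uint64_t res64, std::uint32_t iteration) {
  if (!res64LooksValid(res64)) {
    throw std::runtime_error("P1 residue is zero at iteration " +
                             std::to_string(iteration));
  }
}
```

In the run above, a check like this at iteration 5400000 would have stopped the run while the save from 15:08:30 was presumably still good.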
Resume proceeds despite the bad input from the latter part of stage 1, also skipping the stage 1 GCD.[CODE]2020-03-25 20:10:24 roa/radeonvii saved
2020-03-25 20:11:58 roa/radeonvii 550000007 P1 7310000 99.93%; 9581 us/it; ETA 0d 00:01; 0000000000000000
2020-03-25 20:12:50 roa/radeonvii saved
2020-03-25 20:12:51 roa/radeonvii 550000007 P1 7315345 100.00%; 9913 us/it; ETA 0d 00:00; [COLOR=red][B]0000000000000000[/B][/COLOR]
2020-03-25 20:12:56 roa/radeonvii P-1 (B1=5070000, B2=152100000, D=30030): primes 8202674, expanded 8746218, doubles 1277965 (left 5804395), singles 5646744, total 6924709 (84%)
2020-03-25 20:12:56 roa/radeonvii 550000007 P2 using blocks [169 - 5065] to cover 6924709 primes
2020-03-25 20:12:57 roa/radeonvii 550000007 P2 using 38 buffers of 288.0 MB each
2020-03-25 20:31:18 roa/radeonvii 550000007 P2 38/2880: 92454 primes; setup 2.16 s, 11.881 ms/prime
2020-03-25 20:31:18 roa/radeonvii Exception St12domain_error: GCD invalid input
2020-03-25 20:31:18 roa/radeonvii waiting for background GCDs..
2020-03-25 20:31:18 roa/radeonvii Bye
C:\Users\ken\Documents\gpuowl-v6.11-134-g1e0ce1d>g611
C:\Users\ken\Documents\gpuowl-v6.11-134-g1e0ce1d>gpuowl-win
2020-03-26 09:35:49 gpuowl v6.11-134-g1e0ce1d
2020-03-26 09:35:49 config: -device 1 -user kriesel -cpu roa/radeonvii -yield -maxAlloc 16000 -use NO_ASM,UNROLL_MIDDLEMUL2,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_HEIGHT,CARRY32,CHEBYSHEV_METHOD,ORIG_MIDDLEMUL2,LESS_ACCURATE
2020-03-26 09:35:49 config:
2020-03-26 09:35:49 config: ;NO_ASM,ORIG_SLOWTRIG
2020-03-26 09:35:49 config: ;40M NO_ASM,UNROLL_MIDDLEMUL2,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_HEIGHT,CARRY32,CHEBYSHEV_METHOD,ORIG_MIDDLEMUL2,LESS_ACCURATE
2020-03-26 09:35:49 device 1, unique id ''
2020-03-26 09:35:49 roa/radeonvii 550000007 FFT 36864K: Width 256x4, Height 256x8, Middle 9; 14.57 bits/word
2020-03-26 09:35:58 roa/radeonvii OpenCL args "-DEXP=550000007u -DWIDTH=1024u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xa.c7166b9401b18p-3 -DIWEIGHT_STEP=0xb.e05b1786463ap-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DAMDGPU=1 -DCARRY32=1 -DCHEBYSHEV_METHOD=1 -DLESS_ACCURATE=1 -DMERGED_MIDDLE=1 -DNO_ASM=1 -DORIG_MIDDLEMUL2=1 -DT2_SHUFFLE_HEIGHT=1 -DT2_SHUFFLE_REVERSELINE=1 -DUNROLL_MIDDLEMUL2=1 -DWORKINGIN5=1 -DWORKINGOUT3=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-03-26 09:36:05 roa/radeonvii OpenCL compilation in 6.68 s
2020-03-26 09:36:08 roa/radeonvii 550000007 P1 B1=5070000, B2=152100000; 7315345 bits; starting at 7315344
2020-03-26 09:36:09 roa/radeonvii 550000007 P2 B1=5070000, B2=152100000, starting at 38
2020-03-26 09:36:14 roa/radeonvii P-1 (B1=5070000, B2=152100000, D=30030): primes 8202674, expanded 8746218, doubles 1277965 (left 5804395), singles 5646744, total 6924709 (84%)
2020-03-26 09:36:14 roa/radeonvii 550000007 P2 using blocks [169 - 5065] to cover 6924709 primes
2020-03-26 09:36:14 roa/radeonvii 550000007 P2 using 38 buffers of 288.0 MB each
2020-03-26 09:54:35 roa/radeonvii 550000007 P2 76/2880: 92460 primes; setup 2.30 s, 11.868 ms/prime
[/CODE]Since there is no periodic permanently retained save file from before the error occurred, and both the stage 1 save files are from after the unhandled error, the entire run is a loss (~33 hours wall clock). Stage 2 should not proceed from bad input from stage 1, but it does, without warning. Error checks, and a field for "passed last error check" in the save file could handle that. |
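The "passed last error check" field suggested above could look roughly like the sketch below. Everything in it is an assumption for illustration: the header layout, the field names and the `decideResume` function are hypothetical and do not describe gpuowl's real save-file format.

```cpp
#include <cstdint>

// Hypothetical stage 1 save-file header; field names are illustrative only.
struct P1SaveHeader {
  std::uint32_t exponent;
  std::uint32_t b1;
  std::uint32_t nextIteration;   // first iteration still to be done
  std::uint64_t res64;           // low 64 bits of the saved residue
  bool passedLastErrorCheck;     // set only if the residue was validated
};

enum class ResumeAction { ContinueStage1, StartStage2, RejectSave };

// Decide what a resume should do with a loaded save file.
// A save that failed its check, or that holds a zero res64, is rejected
// rather than being fed into the rest of stage 1 or into stage 2.
ResumeAction decideResume(const P1SaveHeader& h, std::uint32_t totalBits) {
  if (!h.passedLastErrorCheck || h.res64 == 0) {
    return ResumeAction::RejectSave;
  }
  return h.nextIteration >= totalBits ? ResumeAction::StartStage2
                                      : ResumeAction::ContinueStage1;
}
```

On `RejectSave` the caller would fall back to an older retained save file, if one exists, or restart stage 1. Either way stage 2 never starts from a residue that is known to be bad.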
[QUOTE=kriesel;540929]Gpuowl stage 1 needs a res64 error check.[/QUOTE]
Hi Ken, I agree that was a loss. I'll look into improving this. |
Gpuowl-win v6.11-219-ge70ec99 build
2 Attachment(s)
Built, produced a help output, no other testing yet.
|
1 Attachment(s)
[QUOTE=preda;540307]I would like to start using __attribute__((overloadable)) in gpuowl OpenCL source, but before that I'd like to find out whether it's supported everywhere we care.
The attribute is described here: [URL]https://clang.llvm.org/docs/AttributeReference.html#overloadable[/URL]
I would like confirmation that it works on these platforms:
- windows (with whatever OpenCL windows uses for AMD GPUs -- catalyst?)
- Nvidia
- amdgpuPro (the other driver for Linux vs. ROCm)
To check the attribute, simply add "__attribute__((overloadable))" to some function between the return type and function name, e.g.: in gpuowl.cl
Replace
T2 mul(T2 a, T2 b) ...
with
T2 __attribute__((overloadable)) mul(T2 a, T2 b) ...
And recompile, and afterwards *run* the resulting gpuowl to check the OpenCL compilation that happens at startup.
Thanks!
Note: the title should read "__attribute__((overloadable))", double parens.[/QUOTE]
AOK on AMD RX480 /Win7 x64:
[CODE]// complex mul
T2 __attribute__((overloadable)) mul(T2 a, T2 b) { return U2(mad1(a.x, b.x, -a.y * b.y), mad1(a.x, b.y, a.y * b.x)); }
Driver version as indicated by GPU-Z: 25.20.14007.1000 (Adrenalin 18.10.21/Win 64)
[/CODE][CODE]2020-03-26 17:16:48 gpuowl v6.11-219-ge70ec99-dirty
2020-03-26 17:16:48 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500
2020-03-26 17:16:48 device 0, unique id ''
2020-03-26 17:16:48 condorella/rx480 97685813 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 16.94 bits/word
2020-03-26 17:16:51 condorella/rx480 OpenCL args "-DEXP=97685813u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0x8.598a6d26b0dap-3 -DIWEIGHT_STEP=0xf.546b91e1254f8p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DPM1=1 -DAMDGPU=1 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-03-26 17:16:55 condorella/rx480 OpenCL compilation in 4.81 s
2020-03-26 17:16:56 condorella/rx480 97685813 P1 B1=1000000, B2=27000000; 1442134 bits; starting at 0
2020-03-26 17:17:34 condorella/rx480 97685813 P1 10000 0.69%; 3785 us/it; ETA 0d 01:30; 6bd301fd8aadd98a[/CODE]Also on Win7 x64, NVIDIA GTX1080, NVIDIA driver version 378.92:[CODE]C:\Users\ken\Documents\gpuowl-v6.11-219-ge70ec99\overloadable test>gpuowl-win
2020-03-26 18:05:21 gpuowl v6.11-219-ge70ec99-dirty
2020-03-26 18:05:21 config: -device 0 -user kriesel -cpu emu/gtx1080 -yield -maxAlloc 7500 -use NO_ASM
2020-03-26 18:05:21 device 0, unique id ''
2020-03-26 18:05:21 emu/gtx1080 97685953 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 16.94 bits/word
2020-03-26 18:05:23 emu/gtx1080 OpenCL args "-DEXP=97685953u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0x8.598138082486p-3 -DIWEIGHT_STEP=0xf.547c79820ff18p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DPM1=1 -DNO_ASM=1 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-03-26 18:05:28 emu/gtx1080
2020-03-26 18:05:28 emu/gtx1080 OpenCL compilation in 5.26 s
2020-03-26 18:05:29 emu/gtx1080 97685953 P1 B1=1000000, B2=270000000; 1442134 bits; starting at 0
2020-03-26 18:06:18 emu/gtx1080 97685953 P1 10000 0.69%; 4908 us/it; ETA 0d 01:57; 4577ae6cbb52f038
2020-03-26 18:07:07 emu/gtx1080 97685953 P1 20000 1.39%; 4917 us/it; ETA 0d 01:57; fc2022db22907e71[/CODE] |
[QUOTE=preda;539360]Yes. All gpuowl does on savefile is write the file and close it. From this point on, it's the OS's job to persist the file to disk. It turns out often the OS is lazy and prefers to keep the data in RAM for a while longer, and if a OS crash happens in this window, the savefile isn't properly persisted.[/QUOTE]This page on fflush suggests you could force the commit to disk for critical info: [url]https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/fflush?view=vs-2019[/url]
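A rough sketch of what forcing the commit could look like, assuming the save file is written through C stdio. `writeDurably` is a made-up name, not a gpuowl function. `fflush` only moves data from the C runtime buffer to the OS, so the sketch follows it with `_commit` on Windows or `fsync` on POSIX systems to ask the OS to write its cache to the device.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <io.h>      // _commit
#else
#include <stdio.h>   // fileno
#include <unistd.h>  // fsync
#endif

// Hypothetical helper: write a buffer and request that it reach the disk
// before reporting success. Returns false on any failure.
bool writeDurably(const std::string& path, const std::string& data) {
  std::FILE* f = std::fopen(path.c_str(), "wb");
  if (!f) return false;
  bool ok = std::fwrite(data.data(), 1, data.size(), f) == data.size();
  ok = ok && std::fflush(f) == 0;        // C runtime buffer -> OS
#ifdef _WIN32
  ok = ok && _commit(_fileno(f)) == 0;   // OS cache -> disk
#else
  ok = ok && fsync(fileno(f)) == 0;      // OS cache -> disk
#endif
  return std::fclose(f) == 0 && ok;      // always close, even after an error
}
```

The extra sync costs some time per save, so it probably only makes sense for the periodic checkpoints and not for anything written every few seconds.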
|
I recommend no P-1 testing until further notice. I'm investigating a bug.
|
[QUOTE=Prime95;540993]I recommend no P-1 testing until further notice. I'm investigating a bug.[/QUOTE]Do you have any guidance on what versions are thought affected or unaffected?
|
[QUOTE=preda;540961]Hi Ken, I agree that was a loss. I'll look into improving this.[/QUOTE]
Thanks. For reference, [URL]https://mersenneforum.org/showpost.php?p=537396&postcount=1838[/URL] [URL]https://mersenneforum.org/showpost.php?p=537580&postcount=1853[/URL] [URL]https://mersenneforum.org/showpost.php?p=537628&postcount=1856[/URL] [URL]https://mersenneforum.org/showpost.php?p=537647&postcount=1859[/URL] [URL]https://mersenneforum.org/showpost.php?p=540929&postcount=1982[/URL] |
[QUOTE=kriesel;540995]Do you have any guidance on what versions are thought affected or unaffected?[/QUOTE]
In the current version CARRYM64 is broken. If testing near the upper limit of an FFT (I'm working on what "near" means), use CARRY64 option. |
[QUOTE=Prime95;541007]In the current version CARRYM64 is broken. If testing near the upper limit of an FFT (I'm working on what "near" means), use CARRY64 option.[/QUOTE]
George, any update on the exponent ranges in question? |