mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

ewmayer 2020-03-24 01:30

[QUOTE=kriesel;540709]I'd be skeptical about the performance advantage of running too disparate parallel runs. I've seen it reduce throughput. PRP & LL in tandem, for example, which is different code from different versions.
Did you use -maxAlloc for your P-1 run? If not, start, and if doing parallel runs the limit will need to be lower than if the P-1 stage 2 has the gpu ram to itself.[/QUOTE]
I was just running 2 separate PRP-assignment jobs, one of which happened to start on a PRP assignment for which P-1 had not yet been done. For PRPs there is a marked throughput boost from running 2 jobs (cf. my timings in post #1956). I was not using -maxAlloc.

[QUOTE]So can we count you as another fan of P-1 save files?[/QUOTE]

I'm a fan of doing whatever works for increasing users' overall throughput! :) That of course includes minimizing wasted time resulting from run-crashes/BSODs/system-resets/etc.

kriesel 2020-03-26 15:35

GpuOwl P-1 error detection and handling
 
Gpuowl stage 1 needs a res64 error check. The run below was made with v6.11-134.[CODE]2020-03-25 00:57:17 roa/radeonvii 550000007 FFT 36864K: Width 256x4, Height 256x8, Middle 9; 14.57 bits/word
2020-03-25 00:57:25 roa/radeonvii OpenCL args "-DEXP=550000007u -DWIDTH=1024u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xa.c7166b9401b18p-3 -DIWEIGHT_STEP=0xb.e05b1786463ap-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DAMDGPU=1 -DCARRY32=1 -DCHEBYSHEV_METHOD=1 -DLESS_ACCURATE=1 -DMERGED_MIDDLE=1 -DNO_ASM=1 -DORIG_MIDDLEMUL2=1 -DT2_SHUFFLE_HEIGHT=1 -DT2_SHUFFLE_REVERSELINE=1 -DUNROLL_MIDDLEMUL2=1 -DWORKINGIN5=1 -DWORKINGOUT3=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-03-25 00:57:31 roa/radeonvii OpenCL compilation in 5.88 s
2020-03-25 00:57:34 roa/radeonvii 550000007 P1 B1=5070000, B2=152100000; 7315345 bits; starting at 0
2020-03-25 00:59:08 roa/radeonvii 550000007 P1 10000 0.14%; 9433 us/it; ETA 0d 19:09; c8b2127abc38054b
2020-03-25 01:00:43 roa/radeonvii 550000007 P1 20000 0.27%; 9433 us/it; ETA 0d 19:07; 6f401486d14cb20f
2020-03-25 01:02:17 roa/radeonvii 550000007 P1 30000 0.41%; 9431 us/it; ETA 0d 19:05; 18a926611c75b118
2020-03-25 01:02:37 roa/radeonvii saved
...
2020-03-25 15:08:30 roa/radeonvii saved
2020-03-25 15:09:08 roa/radeonvii 550000007 P1 5390000 73.68%; 9582 us/it; ETA 0d 05:07; 4cfe624d31a00e27
2020-03-25 15:10:42 roa/radeonvii 550000007 P1 5400000 73.82%; 9428 us/it; ETA 0d 05:01; [COLOR=Red][B]0000000000000000[/B][/COLOR]
2020-03-25 15:12:16 roa/radeonvii 550000007 P1 5410000 73.95%; 9424 us/it; ETA 0d 04:59; 0000000000000000
2020-03-25 15:13:32 roa/radeonvii saved[/CODE]Fourteen hours into the computation, an error occurred that zeroed the residue. The program did not detect it. It kept powering the zero residue for the remaining iterations, periodically overwriting its two save files with bad interim results, for 5 more hours. It then appears to skip the stage 1 GCD because of the error condition detected once the iterations finish.
A resumed run also proceeds despite the bad result from the latter part of stage 1, again skipping the stage 1 GCD.[CODE]2020-03-25 20:10:24 roa/radeonvii saved
2020-03-25 20:11:58 roa/radeonvii 550000007 P1 7310000 99.93%; 9581 us/it; ETA 0d 00:01; 0000000000000000
2020-03-25 20:12:50 roa/radeonvii saved
2020-03-25 20:12:51 roa/radeonvii 550000007 P1 7315345 100.00%; 9913 us/it; ETA 0d 00:00; [COLOR=red][B]0000000000000000[/B][/COLOR]
2020-03-25 20:12:56 roa/radeonvii P-1 (B1=5070000, B2=152100000, D=30030): primes 8202674, expanded 8746218, doubles 1277965 (left 5804395), singles 5646744, total 6924709 (84%)
2020-03-25 20:12:56 roa/radeonvii 550000007 P2 using blocks [169 - 5065] to cover 6924709 primes
2020-03-25 20:12:57 roa/radeonvii 550000007 P2 using 38 buffers of 288.0 MB each
2020-03-25 20:31:18 roa/radeonvii 550000007 P2 38/2880: 92454 primes; setup 2.16 s, 11.881 ms/prime
2020-03-25 20:31:18 roa/radeonvii Exception St12domain_error: GCD invalid input
2020-03-25 20:31:18 roa/radeonvii waiting for background GCDs..
2020-03-25 20:31:18 roa/radeonvii Bye
C:\Users\ken\Documents\gpuowl-v6.11-134-g1e0ce1d>g611

C:\Users\ken\Documents\gpuowl-v6.11-134-g1e0ce1d>gpuowl-win
2020-03-26 09:35:49 gpuowl v6.11-134-g1e0ce1d
2020-03-26 09:35:49 config: -device 1 -user kriesel -cpu roa/radeonvii -yield -maxAlloc 16000 -use NO_ASM,UNROLL_MIDDLEMUL2,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_HEIGHT,CARRY32,CHEBYSHEV_METHOD,ORIG_MIDDLEMUL2,LESS_ACCURATE
2020-03-26 09:35:49 config:
2020-03-26 09:35:49 config: ;NO_ASM,ORIG_SLOWTRIG
2020-03-26 09:35:49 config: ;40M NO_ASM,UNROLL_MIDDLEMUL2,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_HEIGHT,CARRY32,CHEBYSHEV_METHOD,ORIG_MIDDLEMUL2,LESS_ACCURATE
2020-03-26 09:35:49 device 1, unique id ''
2020-03-26 09:35:49 roa/radeonvii 550000007 FFT 36864K: Width 256x4, Height 256x8, Middle 9; 14.57 bits/word
2020-03-26 09:35:58 roa/radeonvii OpenCL args "-DEXP=550000007u -DWIDTH=1024u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xa.c7166b9401b18p-3 -DIWEIGHT_STEP=0xb.e05b1786463ap-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DAMDGPU=1 -DCARRY32=1 -DCHEBYSHEV_METHOD=1 -DLESS_ACCURATE=1 -DMERGED_MIDDLE=1 -DNO_ASM=1 -DORIG_MIDDLEMUL2=1 -DT2_SHUFFLE_HEIGHT=1 -DT2_SHUFFLE_REVERSELINE=1 -DUNROLL_MIDDLEMUL2=1 -DWORKINGIN5=1 -DWORKINGOUT3=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-03-26 09:36:05 roa/radeonvii OpenCL compilation in 6.68 s
2020-03-26 09:36:08 roa/radeonvii 550000007 P1 B1=5070000, B2=152100000; 7315345 bits; starting at 7315344
2020-03-26 09:36:09 roa/radeonvii 550000007 P2 B1=5070000, B2=152100000, starting at 38
2020-03-26 09:36:14 roa/radeonvii P-1 (B1=5070000, B2=152100000, D=30030): primes 8202674, expanded 8746218, doubles 1277965 (left 5804395), singles 5646744, total 6924709 (84%)
2020-03-26 09:36:14 roa/radeonvii 550000007 P2 using blocks [169 - 5065] to cover 6924709 primes
2020-03-26 09:36:14 roa/radeonvii 550000007 P2 using 38 buffers of 288.0 MB each
2020-03-26 09:54:35 roa/radeonvii 550000007 P2 76/2880: 92460 primes; setup 2.30 s, 11.868 ms/prime
[/CODE]Because no periodic, permanently retained save file exists from before the error, and both stage 1 save files were written after the undetected error, the entire run is a loss (~33 hours wall clock).
Stage 2 should not proceed from bad stage 1 input, but it does, without warning. Error checks, plus a "passed last error check" field in the save file, could handle that.
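
A minimal Python sketch of the kind of check proposed above, not GpuOwl's actual code. It treats an all-zero res64 as a failed check and keeps the last known-good checkpoint instead of overwriting it. The `Checkpoint` class, its `passed_last_check` field, and both function names are hypothetical, chosen only for illustration. A zero res64 test catches only the failure seen in this log; it is not a general error check.

```python
from dataclasses import dataclass
from typing import Optional


def res64_is_suspect(res64: str) -> bool:
    """True when the logged 64-bit residue is all zeros (the symptom in the log)."""
    return int(res64, 16) == 0


@dataclass
class Checkpoint:
    iteration: int
    res64: str
    passed_last_check: bool  # hypothetical save-file field


def next_checkpoint(iteration: int, res64: str,
                    last_good: Optional[Checkpoint]) -> Optional[Checkpoint]:
    """Return the checkpoint to keep on disk.

    A suspect residue never replaces the last known-good checkpoint.
    """
    if res64_is_suspect(res64):
        return last_good
    return Checkpoint(iteration, res64, passed_last_check=True)


# Values taken from the stage 1 log lines around the failure.
good = next_checkpoint(5390000, "4cfe624d31a00e27", None)
kept = next_checkpoint(5400000, "0000000000000000", good)
print(kept)
```

With a check like this, a resume would restart from iteration 5390000 rather than from a zeroed residue, and stage 2 could refuse to start when `passed_last_check` is false.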

preda 2020-03-26 19:41

[QUOTE=kriesel;540929]Gpuowl stage 1 needs a res64 error check.[/QUOTE]

Hi Ken, I agree that was a loss. I'll look into improving this.

kriesel 2020-03-26 21:52

Gpuowl-win v6.11-219-ge70ec99 build
 
2 Attachment(s)
Built; it produces help output. No other testing yet.

kriesel 2020-03-26 23:09

1 Attachment(s)
[QUOTE=preda;540307]I would like to start using __attribute__((overloadable)) in gpuowl OpenCL source, but before that I'd like to find out whether it's supported everywhere we care.

The attribute is described here:
[URL]https://clang.llvm.org/docs/AttributeReference.html#overloadable[/URL]

I would like confirmation that it works on these platforms:
- windows (with whatever OpenCL windows uses for AMD GPUs -- catalyst?)
- Nvidia
- amdgpuPro (the other driver for Linux vs. ROCm)

To check the attribute, simply add "__attribute__((overloadable))" to some function between the return type and function name, e.g.:

in gpuowl.cl
Replace

T2 mul(T2 a, T2 b) ...
with
T2 __attribute__((overloadable)) mul(T2 a, T2 b) ...

And recompile, and afterwards *run* the resulting gpuowl to check the OpenCL compilation that happens at startup.
Thanks!

Note: the title should read "__attribute__((overloadable))", double parens.[/QUOTE]


AOK on AMD RX480 /Win7 x64:
[CODE]// complex mul
T2 __attribute__((overloadable)) mul(T2 a, T2 b) { return U2(mad1(a.x, b.x, -a.y * b.y), mad1(a.x, b.y, a.y * b.x)); }

Driver version as indicated by GPU-Z: 25.20.14007.1000 (Adrenalin 18.10.21/Win 64)
[/CODE][CODE]2020-03-26 17:16:48 gpuowl v6.11-219-ge70ec99-dirty
2020-03-26 17:16:48 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500
2020-03-26 17:16:48 device 0, unique id ''
2020-03-26 17:16:48 condorella/rx480 97685813 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 16.94 bits/word
2020-03-26 17:16:51 condorella/rx480 OpenCL args "-DEXP=97685813u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0x8.598a6d26b0dap-3 -DIWEIGHT_STEP=0xf.546b91e1254f8p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DPM1=1 -DAMDGPU=1 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-03-26 17:16:55 condorella/rx480 OpenCL compilation in 4.81 s
2020-03-26 17:16:56 condorella/rx480 97685813 P1 B1=1000000, B2=27000000; 1442134 bits; starting at 0
2020-03-26 17:17:34 condorella/rx480 97685813 P1 10000 0.69%; 3785 us/it; ETA 0d 01:30; 6bd301fd8aadd98a[/CODE]Also on Win7 x64, NVIDIA GTX1080, NVIDIA driver version 378.92:[CODE]C:\Users\ken\Documents\gpuowl-v6.11-219-ge70ec99\overloadable test>gpuowl-win
2020-03-26 18:05:21 gpuowl v6.11-219-ge70ec99-dirty
2020-03-26 18:05:21 config: -device 0 -user kriesel -cpu emu/gtx1080 -yield -maxAlloc 7500 -use NO_ASM
2020-03-26 18:05:21 device 0, unique id ''
2020-03-26 18:05:21 emu/gtx1080 97685953 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 16.94 bits/word
2020-03-26 18:05:23 emu/gtx1080 OpenCL args "-DEXP=97685953u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0x8.598138082486p-3 -DIWEIGHT_STEP=0xf.547c79820ff18p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DPM1=1 -DNO_ASM=1 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-03-26 18:05:28 emu/gtx1080

2020-03-26 18:05:28 emu/gtx1080 OpenCL compilation in 5.26 s
2020-03-26 18:05:29 emu/gtx1080 97685953 P1 B1=1000000, B2=270000000; 1442134 bits; starting at 0
2020-03-26 18:06:18 emu/gtx1080 97685953 P1 10000 0.69%; 4908 us/it; ETA 0d 01:57; 4577ae6cbb52f038
2020-03-26 18:07:07 emu/gtx1080 97685953 P1 20000 1.39%; 4917 us/it; ETA 0d 01:57; fc2022db22907e71[/CODE]

kriesel 2020-03-26 23:47

[QUOTE=preda;539360]Yes. All gpuowl does on savefile is write the file and close it. From this point on, it's the OS's job to persist the file to disk. It turns out often the OS is lazy and prefers to keep the data in RAM for a while longer, and if an OS crash happens in this window, the savefile isn't properly persisted.[/QUOTE]According to this page on fflush, it sounds like you could force the commit to disk for critical info: [url]https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/fflush?view=vs-2019[/url]
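
A rough Python illustration of the idea, assuming nothing about how gpuowl writes its save files. Python's `flush()` plays the role of C's `fflush` (it empties the user-space buffer), and `os.fsync` is the extra step that asks the OS to write the file through to the storage device. The function name `save_durably`, the `.tmp` suffix and the file name are made up for the example.

```python
import os
import tempfile


def save_durably(path: str, data: bytes) -> None:
    """Write data to a temporary file, force it to disk, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to commit the file to the device
    # Replace the old save file only after the new one is fully written.
    os.replace(tmp_path, path)


# Demo in a throwaway directory.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "checkpoint.bin")
save_durably(demo_path, b"example residue bytes")
with open(demo_path, "rb") as f:
    saved = f.read()
print(len(saved), "bytes saved")
```

Writing to a temporary name first means a crash during the write leaves the previous save file untouched. The cost is a short stall on every save while the device catches up.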

Prime95 2020-03-26 23:48

I recommend no P-1 testing until further notice. I'm investigating a bug.

kriesel 2020-03-27 00:04

[QUOTE=Prime95;540993]I recommend no P-1 testing until further notice. I'm investigating a bug.[/QUOTE]Do you have any guidance on what versions are thought affected or unaffected?

kriesel 2020-03-27 00:34

[QUOTE=preda;540961]Hi Ken, I agree that was a loss. I'll look into improving this.[/QUOTE]
Thanks. For reference,
[URL]https://mersenneforum.org/showpost.php?p=537396&postcount=1838[/URL]
[URL]https://mersenneforum.org/showpost.php?p=537580&postcount=1853[/URL]
[URL]https://mersenneforum.org/showpost.php?p=537628&postcount=1856[/URL]
[URL]https://mersenneforum.org/showpost.php?p=537647&postcount=1859[/URL]
[URL]https://mersenneforum.org/showpost.php?p=540929&postcount=1982[/URL]

Prime95 2020-03-27 01:19

[QUOTE=kriesel;540995]Do you have any guidance on what versions are thought affected or unaffected?[/QUOTE]

In the current version CARRYM64 is broken. If testing near the upper limit of an FFT (I'm working on what "near" means), use CARRY64 option.
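
For orientation, the "bits/word" figure that gpuowl prints on its FFT log line is one way to see how close an exponent sits to the top of an FFT size. The sketch below only reproduces that figure from the logs earlier in this thread, assuming "K" means 1024 words. It does not say where "near" begins, which is still open; the function name is made up.

```python
def bits_per_word(exponent: int, fft_k: int) -> float:
    """Average number of exponent bits stored per FFT word, for an FFT of fft_k K words."""
    return exponent / (fft_k * 1024)


# Exponents and FFT sizes copied from the log lines in this thread.
big = bits_per_word(550000007, 36864)   # logged as 14.57 bits/word
small = bits_per_word(97685813, 5632)   # logged as 16.94 bits/word
print(f"{big:.2f} {small:.2f}")
```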

ewmayer 2020-03-27 19:41

[QUOTE=Prime95;541007]In the current version CARRYM64 is broken. If testing near the upper limit of an FFT (I'm working on what "near" means), use CARRY64 option.[/QUOTE]

George, any update on the exponent ranges in question?


All times are UTC.