mersenneforum.org

gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

kriesel 2018-01-23 19:42

[QUOTE=kracker;478125]Latest build for windows as of right now (commit 74f1a38)[/QUOTE]

Thanks for the build, kracker.

A quick check run on the RX550 seems to indicate that, in this build, the previously observed long log lag is gone or reduced to seconds. (The file time doesn't update, but that's common; what matters is the log contents.) Millisec/iteration times look good so far too. The run has now gone more than half an hour past the point where issues appeared before, though not yet as long as the interval between occurrences.

preda 2018-01-23 20:19

It seems GpuOwl does not run on Nvidia GPUs. At first sight this seems to be caused by some quirk of Nvidia's OpenCL driver, not by a fault of GpuOwl.

Nevertheless, maybe I could fix this by a step-by-step trial and error on Nvidia hardware, but I don't have an Nvidia GPU.

For the device identification problem, I'll keep thinking of a solution (other than variable device order id).

kriesel 2018-01-23 23:24

[QUOTE=preda;478186]It seems GpuOwl does not run on Nvidia GPUs. At first sight this seems to be caused by some quirk of Nvidia's OpenCL driver, not by a fault of GpuOwl.

Nevertheless, maybe I could fix this by a step-by-step trial and error on Nvidia hardware, but I don't have an Nvidia GPU.

For the device identification problem, I'll keep thinking of a solution (other than variable device order id).[/QUOTE]

Just for clarity, I see it as a potentially unstable device-to-number mapping, for which mapping by one method and confirming by one or more independent methods would be good. I think the way to do this is to define and support a short list of confirmation methods, and to keep a little text table of device/property/value trios. The user would need to update the table when moving or changing GPU cards; for example, repositioning cards to tune cooling changes the PCIe slot mapping. That then makes it possible for the user to also specify halt on mismatch, or warn on mismatch and attempt to continue. A mismatch may also develop hours, days, or weeks after the user launches a job, if he is using a batch file that implements restarts.
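
A minimal sketch of such a confirmation table, to make the idea concrete. Everything here is an assumption for illustration: the one-trio-per-line text format, the property names (`name`, `pcie_slot`), the function names, and the `observed` dictionary standing in for whatever the driver reports at launch. None of it is gpuOwL code.

```python
# Hypothetical sketch: confirm a device-number mapping against a user-maintained
# table of (device, property, value) trios. Not part of gpuOwL.

def parse_table(text):
    """Parse lines of 'device property value' into a list of trios."""
    trios = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # allow comments and blank lines
        if not line:
            continue
        device, prop, value = line.split(None, 2)
        trios.append((int(device), prop, value))
    return trios

def find_mismatches(trios, observed):
    """Return (device, property, expected, actual) for every trio that differs."""
    return [(device, prop, expected, observed.get(device, {}).get(prop))
            for device, prop, expected in trios
            if observed.get(device, {}).get(prop) != expected]

def check_devices(trios, observed, on_mismatch="halt"):
    """Return True if the run may proceed; 'warn' continues despite mismatches."""
    mismatches = find_mismatches(trios, observed)
    for device, prop, expected, actual in mismatches:
        print(f"device {device}: {prop} expected {expected!r}, found {actual!r}")
    return not mismatches or on_mismatch == "warn"

# Example table; the property names and values are illustrative assumptions.
TABLE = """
0 name GeForce GTX 1050 Ti
0 pcie_slot 1
1 name RX 550
"""

# Pretend the two cards were swapped between slots after the table was written.
OBSERVED = {0: {"name": "RX 550", "pcie_slot": "1"},
            1: {"name": "GeForce GTX 1050 Ti"}}

if __name__ == "__main__":
    ok = check_devices(parse_table(TABLE), OBSERVED, on_mismatch="halt")
    print("proceed" if ok else "halting")
```

Because the check is a plain function, a restart batch file could rerun it on every relaunch, which would cover the case where the mapping changes long after the job was first started.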

I wonder if airsquirrels might have some insights to share, learned while porting a very early version of GpuOwL from OpenCL to CUDA. See [URL]http://www.mersenneforum.org/showpost.php?p=458485&postcount=107[/URL], where a speed difference of a few percent is stated.

If GpuOwL v1.9x could be made to work on NVIDIA, it could be the only PRP code for Mersenne numbers on that popular GPU type (and therefore the fastest).

Old Quadro 2000s or GTX 750s can be found on eBay for $40-100. I like the Quadro 2000 for some spots where its single-slot width and PCIe-only power let it fit where others won't. It is also less likely to interfere with cables, since it's short. It shows up on my systems as only OpenCL 1.1, though.

All the following is from kracker's latest Windows build of 74f1a38.
[CODE]gpuOwL v1.9- GPU Mersenne primality checker
GeForce GTX 1050 Ti, 6x1468MHz
OpenCL compilation error -11 (args -I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=77959589u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFGT_61=1 -DLOG_ROOT2=49u -save-temps=isa/M61_4M)
Error in processing command line: Don't understand command line argument "-save-temps=isa/M61_4M"!
OpenCL compilation error -11 (args -I. -cl-fast-relaxed-math -DEXP=77959589u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFGT_61=1 -DLOG_ROOT2=49u -save-temps=isa/M61_4M)
Error in processing command line: Don't understand command line argument "-save-temps=isa/M61_4M"!

Bye
[/CODE]What's the GpuOwL-supported syntax for the dump folder name? I tried foldername, .\foldername, and the full path, and all failed as shown. It looks like NVIDIA's OpenCL has some issue with the option or its syntax. (Windows uses \ while Linux uses /.)

Without it, for -fft M61, it produces
[CODE]gpuOwL v1.9- GPU Mersenne primality checker
GeForce GTX 1050 Ti, 6x1468MHz
OpenCL compilation error -11 (args -I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=77959589u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFGT_61=1 -DLOG_ROOT2=49u )
In file included from <kernel>:1:
./gpuowl.cl:92:10: fatal error: 'nttshared.h' file not found
#include "nttshared.h"
^

OpenCL compilation error -11 (args -I. -cl-fast-relaxed-math -DEXP=77959589u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFGT_61=1 -DLOG_ROOT2=49u )
In file included from <kernel>:1:
./gpuowl.cl:92:10: fatal error: 'nttshared.h' file not found
#include "nttshared.h"
^


Bye[/CODE]Is something missing that the gpuowl.cl file needs? OK, found nttshared.h on the gpuowl GitHub.

With that file in place, it produces
[CODE]gpuOwL v1.9- GPU Mersenne primality checker
GeForce GTX 1050 Ti, 6x1468MHz


OpenCL compilation in 6331 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=77959589u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFGT_61=1 -DLOG_ROOT2=49u "
Note: using long carry kernels
PRP-3: FFT 4M (1024 * 2048 * 2) of 77959589 (18.59 bits/word) [2018-01-23 16:23:54 Central Standard Time]
Starting at iteration 0
error -9999 (fftH)
[/CODE]

kriesel 2018-01-23 23:29

(never mind; 4 vs 3gb)
 
[QUOTE=kriesel;478174]
OpenCL seems confused about the memory capacity of the 3GB GTX1050Ti, reporting 4GB. At one point GPU-Z was reporting 3.8GB in use.
...
Oddly, at one point GPU-Z indicated about 3.8GB of memory in use on the 3GB GTX1050Ti.
[/QUOTE]
Nope, that was _me_ confusing the FOUR GB 1050Ti with the GTX1060-3GB that's in a different system.

airsquirrels 2018-01-24 02:05

[QUOTE=preda;478186]It seems GpuOwl does not run on Nvidia GPUs. At first sight this seems to be caused by some quirk of Nvidia's OpenCL driver, not by a fault of GpuOwl.

Nevertheless, maybe I could fix this by a step-by-step trial and error on Nvidia hardware, but I don't have an Nvidia GPU.

For the device identification problem, I'll keep thinking of a solution (other than variable device order id).[/QUOTE]

As GPUs can be a bit hard to come by right now, I could send you 1 or 2 1060s if it would help.

It has been some time since I did the quick CUDA port on Nvidia, but I don't think it would be too difficult, at least for an initial version. It may actually be easier, and nearly as performant, to adapt the OpenCL code to work with the native CUDA in Nvidia's toolkit.

kriesel 2018-01-24 03:01

[QUOTE=preda;478186]

For the device identification problem, I'll keep thinking of a solution (other than variable device order id).[/QUOTE]

In CUDALucas 2.06beta, flashjh implemented checking UUID, which produces responses like
UUID GPU-9b15b648-ccfe-f878-b7cb-2bba3cffd5b1
(not available in Windows 32-bit builds).
[URL="https://sourceforge.net/p/cudalucas/wiki/Home/"]https://sourceforge.net/projects/cudalucas/files/2.06Beta/[/URL]

preda 2018-01-24 04:15

Related to "-dump folder": this relies on the non-standard -save-temps OpenCL option, which works on ROCm and AMDGPU-pro, but as seen does not work on Nvidia. So that's the reason for the failure there.

For M61, indeed you need nttshared.h which is included by the .cl for the M61 case. But that still doesn't work on Nvidia...

In my experience, M61 is not faster than DP, at least on AMD GPUs.
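
To make the -dump failure concrete, here is a rough sketch of how a build-option string like the ones in the logs above could be assembled so that -save-temps is only appended where it is expected to work. It is not gpuOwL's actual code: `build_options`, `SAVE_TEMPS_PLATFORMS`, and the idea of matching on a platform-name substring are all assumptions made for illustration.

```python
# Hypothetical sketch: only pass the non-standard -save-temps option to OpenCL
# compilers expected to accept it. Not gpuOwL's actual implementation.

# Per the thread, -save-temps works on ROCm and AMDGPU-pro and is rejected by
# NVIDIA's compiler. Matching on a platform-name substring is an assumption.
SAVE_TEMPS_PLATFORMS = ("AMD",)

def build_options(defines, platform_name, dump_dir=None, config="M61_4M"):
    """Return the option string for the OpenCL compiler."""
    opts = ["-I.", "-cl-fast-relaxed-math", "-cl-std=CL2.0"]
    opts += [f"-D{key}={value}" for key, value in defines.items()]
    if dump_dir and any(tag in platform_name for tag in SAVE_TEMPS_PLATFORMS):
        opts.append(f"-save-temps={dump_dir}/{config}")
    return " ".join(opts)

# The defines shown in the GTX 1050 Ti log earlier in the thread.
DEFINES = {"EXP": "77959589u", "WIDTH": "1024u", "HEIGHT": "2048u",
           "LOG_NWORDS": "22u", "FGT_61": "1", "LOG_ROOT2": "49u"}

if __name__ == "__main__":
    # Platform name strings below are illustrative.
    print(build_options(DEFINES, "NVIDIA CUDA", dump_dir="isa"))
    print(build_options(DEFINES, "AMD Accelerated Parallel Processing", dump_dir="isa"))
```

With this arrangement, -dump on an unsupported platform would be silently ignored rather than aborting compilation; whether to warn the user instead is a design choice the sketch leaves open.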

preda 2018-01-24 04:21

[QUOTE=airsquirrels;478201]As GPUs can be a bit hard to come by right now, I could send you 1 or 2 1060s if it would help.
[/QUOTE]
Thanks for the generous offer! But let me think it over. I see there is demand to get it working on Nvidia, so it seems I need to take it seriously :)

kriesel 2018-01-24 05:18

[QUOTE=preda;478205]Related to "-dump folder": this relies on the non-standard -save-temps OpenCL option, which works on ROCm and AMDGPU-pro, but as seen does not work on Nvidia. So that's the reason for the failure there.

For M61, indeed you need nttshared.h which is included by the .cl for the M61 case. But that still doesn't work on Nvidia...

In my experience, M61 is not faster than DP, at least on AMD GPUs.[/QUOTE]

Thanks for the info.

Attempting -fft M61 was useful in that it turned up the nttshared.h issue. I used the NVIDIA cards for testing command options etc., and as practice for when I try it on the RX550. That GPU is now at >10 hours and >3 million iterations of rock-solid ms/iter output. Looks like that bug is squashed.

I still plan to try simultaneous runs of -fft DP and -fft M61, as mentioned in
[url]http://www.mersenneforum.org/showpost.php?p=478055&postcount=264[/url]
although, since a dual run of -fft DP and mfakto showed substantially lower combined throughput, I won't be surprised if the DP/M61 combination also suffers. That could be a sign of a well-tuned application.

kriesel 2018-01-25 04:58

[QUOTE=preda;478205]Related to "-dump folder": this relies on the non-standard -save-temps OpenCL option, which works on ROCm and AMDGPU-pro, but as seen does not work on Nvidia. So that's the reason for the failure there.

For M61, indeed you need nttshared.h which is included by the .cl for the M61 case. But that still doesn't work on Nvidia...

In my experience, M61 is not faster than DP, at least on AMD GPUs.[/QUOTE]
(Received back my HD620-equipped laptop today, from warranty repair, and after letting it warm up, unpacked it and tried gpuowl v1.9 on it. Will try some things on a real discrete AMD GPU later.)
[CODE]gpuowl-v1.9-74f1a38>gpuowl --help
gpuOwL v1.9- GPU Mersenne primality checker
Command line options:

-size 2M|4M|8M : override FFT size.
-fft DP|SP|M61|M31 : choose FFT variant [default DP]:
DP : double precision floating point.
SP : single precision floating point.
M61 : Fast Galois Transform (FGT) modulo M(61).
M31 : FGT modulo M(31).
-user <name> : specify the user name.
-cpu <name> : specify the hardware name.
-legacy : use legacy kernels
-dump <path> : dump compiled ISA to the folder <path> that must exist.
-verbosity <level> : change amount of information logged. [0-2, default 0].
-device <N> : select specific device among:
0 : Intel(R) HD Graphics 620, 24x1050MHz
1 : Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz, 4x2700MHz
[/CODE]-dump also seems not to work on the Intel IGP.
[CODE]gpuowl -user kriesel -cpu falcon-hd620 -dump isa -verbosity 2 -device 0
gpuOwL v1.9- GPU Mersenne primality checker
Intel(R) HD Graphics 620, 24x1050MHz[/CODE]-fft DP silently crashes the program.
[CODE]gpuowl.exe
gpuOwL v1.9- GPU Mersenne primality checker
Intel(R) HD Graphics 620, 24x1050MHz
[/CODE]-fft M61 runs, after some warnings:
[CODE]gpuowl -device 0 -fft M61
gpuOwL v1.9- GPU Mersenne primality checker
Intel(R) HD Graphics 620, 24x1050MHz
warning: Linking two modules of different data layouts: '' is 'e-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024' whereas '<origin>' is 'e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32:64'

warning: Linking two modules of different target triples: ' is 'spir64' whereas '<origin>' is 'vISA_64'

warning: Linking two modules of different data layouts: '' is 'e-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024' whereas '<origin>' is 'e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32:64'

warning: Linking two modules of different target triples: ' is 'spir64' whereas '<origin>' is 'vISA_64'

fcl build 1 succeeded.
bcl build succeeded.

OpenCL compilation in 28284 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=83780327u -DWIDTH=2048u -DHEIGHT=2048u -DLOG_NWORDS=23u -DFGT_61=1 -DLOG_ROOT2=55u "
Note: using long carry kernels
PRP-3: FFT 8M (2048 * 2048 * 2) of 83780327 (9.99 bits/word) [2018-01-24 18:41:01 Central Standard Time]
Starting at iteration 0
OK 0 / 83780327 [ 0.00%], 0.00 ms/it; ETA 0d 00:00; 0000000000000003 [18:46:39]
OK 1000 / 83780327 [ 0.00%], 697.29 ms/it; ETA 676d 03:31; cc32f90fcf7bb85a [19:03:53]

Stopping, please wait..
OK 1500 / 83780327 [ 0.00%], 669.39 ms/it; ETA 649d 02:01; 7577c10a6a191001 [19:15:11]

Bye
[/CODE]It runs slowly, and at great cost to Prime95 (V29.4b7 64-bit) throughput. The IGP manages approx. 270 ms/iter for M61 4M at best. Meanwhile the two Intel CPU cores/workers drop from ~33 ms/iter each (60 iter/sec total) running Prime95 only, to 50 ms/iter each (40 iter/sec for the 2 cores combined). Adding ~3 iter/sec from the IGP gives 43 iter/sec (72% of the Prime95-only rate), a net loss of 17 iter/sec. Or, with Prime95 reduced to one IA core: 40 ms/iter = 25 iter/sec, plus ~4 iter/sec from the IGP, is 29 iter/sec combined (48%). I think it's the shared memory arrangement; CPU speed drops from ~2.7 GHz to 1.6 or 1.7, CPU utilization from 90+% to 50 or 25%, and wattage from 13 W to 5.5 or 4, as the IGP goes from 0.1 W at idle to 8 or 9 W of the 15 W TDP budget. Primality testing is probably just too memory-intensive for shared-memory GPUs. Trial factoring, maybe?
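
The throughput arithmetic in the paragraph above can be rechecked with a few lines. This is only a sketch using the approximate figures quoted in the post; the function and variable names are made up for illustration, and the baseline is taken as the rounded 60 iter/sec used there.

```python
# Recompute the combined-throughput figures from the approximate numbers quoted
# in the post. All names here are illustrative.

def iters_per_sec(ms_per_iter, workers=1):
    """Convert a per-worker ms/iter figure into total iterations per second."""
    return workers * 1000.0 / ms_per_iter

BASELINE = 60.0  # Prime95 alone on two cores, rounded as in the post (~33 ms/iter each)

two_cores_plus_igp = iters_per_sec(50, workers=2) + 3  # 40 from the CPU + ~3 from the IGP
one_core_plus_igp = iters_per_sec(40, workers=1) + 4   # 25 from the CPU + ~4 from the IGP

if __name__ == "__main__":
    for label, rate in [("2 cores + IGP", two_cores_plus_igp),
                        ("1 core + IGP", one_core_plus_igp)]:
        share = 100 * rate / BASELINE
        print(f"{label}: {rate:.0f} iter/sec, {share:.0f}% of baseline, "
              f"loss {BASELINE - rate:.0f} iter/sec")
```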
[CODE]gpuowl -device 0 -fft M61 -dump isa
gpuOwL v1.9- GPU Mersenne primality checker
Intel(R) HD Graphics 620, 24x1050MHz
warning: Linking two modules of different data layouts: '' is 'e-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024' whereas '<origin>' is 'e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32:64'

warning: Linking two modules of different target triples: ' is 'spir64' whereas '<origin>' is 'vISA_64'

warning: Linking two modules of different data layouts: '' is 'e-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024' whereas '<origin>' is 'e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32:64'

warning: Linking two modules of different target triples: ' is 'spir64' whereas '<origin>' is 'vISA_64'

WARNING: -s is not supported on the Intel OpenCL GPU device.
fcl build 1 succeeded.
bcl build succeeded.

OpenCL compilation in 28378 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=77973559u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFGT_61=1 -DLOG_ROOT2=49u -save-temps=isa/M61_4M"
Note: using long carry kernels
PRP-3: FFT 4M (1024 * 2048 * 2) of 77973559 (18.59 bits/word) [2018-01-24 19:24:47 Central Standard Time]
Starting at iteration 0
OK 0 / 77973559 [ 0.00%], 0.00 ms/it; ETA 0d 00:00; 0000000000000003 [19:27:10]
OK 1000 / 77973559 [ 0.00%], 653.80 ms/it; ETA 590d 00:46; 04b3acbff2710af0 [19:40:33]
OK 5000 / 77973559 [ 0.01%], 292.32 ms/it; ETA 263d 19:00; debea46fa265c7e3 [20:02:30]
OK 10000 / 77973559 [ 0.01%], 289.25 ms/it; ETA 261d 00:15; f3ea22ad73c203e0 [20:29:03]
OK 20000 / 77973559 [ 0.03%], 270.90 ms/it; ETA 244d 09:58; c94ad5f9c8a09a3b [21:16:27][/CODE]I'm puzzled by the gpuowl ms/iter declining as the computation progresses. I also note that the reported rate disagrees with the wall clock: in the 8M -fft M61 run, iterations 0-1000 took 1034 seconds (1034 ms/iter), while the program reports 697 ms/iter for that interval.
Similarly, for the 10k-20k interval of the 4M run, 270 ms/iter is reported for a 2844-second duration (284.4 ms/iter).
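
The wall-clock figures can be recomputed directly from the bracketed timestamps on the progress lines. A small sketch, assuming the timestamps are local time of day in HH:MM:SS and that at most one midnight rollover separates the two lines (the progress lines carry no date); `wall_ms_per_iter` is a made-up helper name.

```python
from datetime import datetime, timedelta

def wall_ms_per_iter(t0, it0, t1, it1):
    """Wall-clock ms/iter between two progress lines, from their HH:MM:SS stamps."""
    fmt = "%H:%M:%S"
    elapsed = datetime.strptime(t1, fmt) - datetime.strptime(t0, fmt)
    if elapsed < timedelta(0):
        # The stamps carry no date, so assume a single rollover past midnight.
        elapsed += timedelta(days=1)
    return elapsed.total_seconds() * 1000.0 / (it1 - it0)

if __name__ == "__main__":
    # 8M run: iteration 0 at 18:46:39, iteration 1000 at 19:03:53 (log reports 697.29 ms/it)
    print(wall_ms_per_iter("18:46:39", 0, "19:03:53", 1000))
    # 4M run: iteration 10000 at 20:29:03, iteration 20000 at 21:16:27 (log reports 270.90 ms/it)
    print(wall_ms_per_iter("20:29:03", 10000, "21:16:27", 20000))
```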

I note the date and time zone have been removed from the progress lines, including when the time of day rolls over at midnight.

henryzz 2018-01-25 09:30

It is the shared TDP that is slowing down your CPU, not the shared memory.


All times are UTC.