2018-01-25, 04:58   #285
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1011100010101₂ Posts

Quote:
Originally Posted by preda
Related to "-dump folder": this relies on the non-standard -save-temps OpenCL option, which works on ROCm and AMDGPU-pro, but as seen does not work on Nvidia. So that's the reason for the failure there.

For M61, indeed you need nttshared.h which is included by the .cl for the M61 case. But that still doesn't work on Nvidia...

In my experience, M61 is not faster than DP, at least on AMD GPUs.
(I got my HD620-equipped laptop back from warranty repair today. After letting it warm up, I unpacked it and tried gpuowl v1.9 on it. I will try some things on a real discrete AMD GPU later.)
Code:
gpuowl-v1.9-74f1a38>gpuowl --help
gpuOwL v1.9- GPU Mersenne primality checker
Command line options:

-size 2M|4M|8M : override FFT size.
-fft DP|SP|M61|M31  : choose FFT variant [default DP]:
                DP  : double precision floating point.
                SP  : single precision floating point.
                M61 : Fast Galois Transform (FGT) modulo M(61).
                M31 : FGT modulo M(31).
-user <name>  : specify the user name.
-cpu  <name>  : specify the hardware name.
-legacy       : use legacy kernels
-dump <path>  : dump compiled ISA to the folder <path> that must exist.
-verbosity <level> : change amount of information logged. [0-2, default 0].
-device <N>   : select specific device among:
    0 : Intel(R) HD Graphics 620, 24x1050MHz
    1 : Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz,  4x2700MHz
-dump also seems not to work on the Intel IGP:
Code:
gpuowl -user kriesel -cpu falcon-hd620 -dump isa -verbosity 2 -device 0
gpuOwL v1.9- GPU Mersenne primality checker
Intel(R) HD Graphics 620, 24x1050MHz
-fft DP (the default) silently crashes the program:
Code:
gpuowl.exe
gpuOwL v1.9- GPU Mersenne primality checker
Intel(R) HD Graphics 620, 24x1050MHz
-fft M61 runs, after some warnings:
Code:
gpuowl -device 0 -fft M61
gpuOwL v1.9- GPU Mersenne primality checker
Intel(R) HD Graphics 620, 24x1050MHz
warning: Linking two modules of different data layouts: '' is 'e-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024' whereas '<origin>' is 'e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32:64'

warning: Linking two modules of different target triples: ' is 'spir64' whereas '<origin>' is 'vISA_64'

warning: Linking two modules of different data layouts: '' is 'e-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024' whereas '<origin>' is 'e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32:64'

warning: Linking two modules of different target triples: ' is 'spir64' whereas '<origin>' is 'vISA_64'

fcl build 1 succeeded.
bcl build succeeded.

OpenCL compilation in 28284 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0  -DEXP=83780327u -DWIDTH=2048u -DHEIGHT=2048u -DLOG_NWORDS=23u -DFGT_61=1 -DLOG_ROOT2=55u "
Note: using long carry kernels
PRP-3: FFT 8M (2048 * 2048 * 2) of 83780327 (9.99 bits/word) [2018-01-24 18:41:01 Central Standard Time]
Starting at iteration 0
OK        0 / 83780327 [ 0.00%], 0.00 ms/it; ETA 0d 00:00; 0000000000000003 [18:46:39]
OK     1000 / 83780327 [ 0.00%], 697.29 ms/it; ETA 676d 03:31; cc32f90fcf7bb85a [19:03:53]

Stopping, please wait..
OK     1500 / 83780327 [ 0.00%], 669.39 ms/it; ETA 649d 02:01; 7577c10a6a191001 [19:15:11]

Bye
It runs slowly, and at great cost to Prime95 (V29.4b7 64-bit) throughput. The IGP manages approximately 270 ms/iter for M61 4M at best. With Prime95 alone, the two Intel CPU cores/workers run at ~33 ms/iter each (60 iter/sec total). With gpuowl running on the IGP they slow to 50 ms/iter each (40 iter/sec for the 2 cores combined), plus ~3 iter/sec from the IGP, for 43 iter/sec combined: a net loss of 17 iter/sec, leaving 72% of the Prime95-only throughput. Reducing Prime95 to one IA core gives 40 ms/iter = 25 iter/sec, plus ~4 iter/sec from the IGP, for 29 iter/sec combined (48%).

I think it's the shared memory arrangement: CPU speed drops from ~2.7 GHz to 1.6 or 1.7 GHz, CPU utilization from 90+% to 50% or 25%, and wattage from 13 W to 5.5 or 4 W, as the IGP goes from 0.1 W at idle to 8 or 9 W of the 15 W TDP budget. Primality testing is probably just too memory intensive for shared-memory GPUs. Trial factoring, maybe?
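For anyone who wants to redo the throughput comparison above with their own timings, here is a minimal Python sketch. It only converts ms/iter to iter/sec and sums the workers; the function and variable names are mine, and it assumes the IGP holds its best-case 270 ms/iter in both scenarios, which the runs shown here do not guarantee.

```python
def iters_per_sec(ms_per_iter):
    """Convert a per-iteration time in milliseconds to iterations per second."""
    return 1000.0 / ms_per_iter

# Approximate figures quoted in the post.
IGP_MS = 270.0  # assumed: best observed gpuowl M61 4M timing on the HD 620

# Prime95 alone, two workers at ~33 ms/iter each.
baseline = 2 * iters_per_sec(33.0)
# Two Prime95 workers at 50 ms/iter each, plus gpuowl on the IGP.
two_workers = 2 * iters_per_sec(50.0) + iters_per_sec(IGP_MS)
# One Prime95 worker at 40 ms/iter, plus gpuowl on the IGP.
one_worker = iters_per_sec(40.0) + iters_per_sec(IGP_MS)

for label, total in [
    ("Prime95 only, 2 workers", baseline),
    ("2 workers + IGP", two_workers),
    ("1 worker + IGP", one_worker),
]:
    print(f"{label:26s} {total:5.1f} iter/sec ({total / baseline:.0%} of baseline)")
```

Without the rounding used in the post, the two combined cases come out at about 72% and 47% of the Prime95-only rate, so the conclusion does not change.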
Code:
gpuowl -device 0 -fft M61 -dump isa
gpuOwL v1.9- GPU Mersenne primality checker
Intel(R) HD Graphics 620, 24x1050MHz
warning: Linking two modules of different data layouts: '' is 'e-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024' whereas '<origin>' is 'e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32:64'

warning: Linking two modules of different target triples: ' is 'spir64' whereas '<origin>' is 'vISA_64'

warning: Linking two modules of different data layouts: '' is 'e-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024' whereas '<origin>' is 'e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32:64'

warning: Linking two modules of different target triples: ' is 'spir64' whereas '<origin>' is 'vISA_64'

WARNING: -s is not supported on the Intel OpenCL GPU device.
fcl build 1 succeeded.
bcl build succeeded.

OpenCL compilation in 28378 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0  -DEXP=77973559u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFGT_61=1 -DLOG_ROOT2=49u  -save-temps=isa/M61_4M"
Note: using long carry kernels
PRP-3: FFT 4M (1024 * 2048 * 2) of 77973559 (18.59 bits/word) [2018-01-24 19:24:47 Central Standard Time]
Starting at iteration 0
OK        0 / 77973559 [ 0.00%], 0.00 ms/it; ETA 0d 00:00; 0000000000000003 [19:27:10]
OK     1000 / 77973559 [ 0.00%], 653.80 ms/it; ETA 590d 00:46; 04b3acbff2710af0 [19:40:33]
OK     5000 / 77973559 [ 0.01%], 292.32 ms/it; ETA 263d 19:00; debea46fa265c7e3 [20:02:30]
OK    10000 / 77973559 [ 0.01%], 289.25 ms/it; ETA 261d 00:15; f3ea22ad73c203e0 [20:29:03]
OK    20000 / 77973559 [ 0.03%], 270.90 ms/it; ETA 244d 09:58; c94ad5f9c8a09a3b [21:16:27]
I'm puzzled that the ms/iter reported by gpuowl declines as the computation progresses. I also note that the wall clock time is 1034 seconds for the 1000 iterations that the program reports as executed at 697 ms/iter (the 0-1000 iteration interval of the 8M -fft M61 run).
Similarly, for the 10k-20k interval on 4M, 270 ms/iter is reported for a 2844-second duration (284.4 ms/iter).
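The wall-clock check above can be scripted. This is a rough Python sketch that takes two consecutive progress lines in the format shown in this post, computes wall-clock ms/iter from the bracketed timestamps, and returns it next to the figure gpuowl printed. The regular expression and function names are my own and only assume the v1.9 line layout seen here. Since the lines carry no date, it assumes at most one midnight rollover between the two lines.

```python
import re
from datetime import datetime, timedelta

# Matches e.g. "OK  1000 / 83780327 [ 0.00%], 697.29 ms/it; ETA ...; <res64> [19:03:53]"
LINE_RE = re.compile(
    r"OK\s+(\d+)\s*/\s*\d+.*?([\d.]+) ms/it;.*\[(\d\d:\d\d:\d\d)\]"
)

def parse(line):
    """Return (iteration, reported ms/iter, time of day) from one progress line."""
    m = LINE_RE.search(line)
    if m is None:
        raise ValueError("not a progress line: " + line)
    return int(m.group(1)), float(m.group(2)), datetime.strptime(m.group(3), "%H:%M:%S")

def wall_vs_reported(prev_line, line):
    """Return (wall-clock ms/iter, reported ms/iter) for the interval ending at `line`."""
    i0, _, t0 = parse(prev_line)
    i1, reported, t1 = parse(line)
    # Modulo one day turns a negative difference (midnight rollover) positive.
    elapsed = (t1 - t0) % timedelta(days=1)
    return elapsed.total_seconds() * 1000.0 / (i1 - i0), reported

a = "OK        0 / 83780327 [ 0.00%], 0.00 ms/it; ETA 0d 00:00; 0000000000000003 [18:46:39]"
b = "OK     1000 / 83780327 [ 0.00%], 697.29 ms/it; ETA 676d 03:31; cc32f90fcf7bb85a [19:03:53]"
wall, reported = wall_vs_reported(a, b)
print(f"wall {wall:.1f} ms/iter vs reported {reported:.2f} ms/iter")
```

For the 8M interval this gives 1034.0 ms/iter wall clock against the reported 697.29. The difference could include time spent outside the timed iterations, but that is a guess on my part.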

I note that the date and time zone no longer appear on the progress lines, even when the time of day rolls over at midnight.

Last fiddled with by kriesel on 2018-01-25 at 05:00