gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

LaurV 2020-02-02 05:16

[QUOTE=kriesel;536423]CUDALucas has not had any significant development in years, so naturally has fallen behind.[/QUOTE]
This is a bit misleading.

First, CudaLucas was never intended to run on AMD cards. For native/cuda/nvidia cards it is still faster than anything else. At least, for everything I run in my rigs, old cards (like the 580 and classic/black Titans) and new cards (like the 1080Ti and 2080Ti) included.

Second, there is "almost nothing" to improve in CudaLucas (well, there are some minor things, hence the quotes, but the big picture won't change much). This toy is just a "square, subtract 2, repeat" tool, which uses Nvidia's CUDA FFT libraries (cuFFT) to do the squaring. These libraries [U]indeed fell behind[/U], as you said. They have not been updated by Nvidia for ages, and if we could convince them to make (or make ourselves :chappy:) a cuFFT library a hundred times faster than the current one, all CudaLucas would need is a recompilation. :razz: For the owl, Preda wrote the libraries from scratch, and they are well tuned for OpenCL, but Nvidia cards are not so good at emulating OpenCL; they are faster when native CUDA is used.
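
For readers who have not seen it, the "square, subtract 2, repeat" loop is the Lucas-Lehmer test. Below is a minimal Python sketch of that iteration, using Python's built-in big integers instead of an FFT, so it only illustrates the arithmetic and says nothing about how CudaLucas is written. The function name `lucas_lehmer` is made up for this example.

```python
def lucas_lehmer(p: int) -> bool:
    """Return True if the Mersenne number 2**p - 1 is prime (p must be an odd prime)."""
    m = (1 << p) - 1           # the Mersenne number under test
    s = 4                      # standard starting value
    for _ in range(p - 2):     # p - 2 rounds of "square, subtract 2"
        s = (s * s - 2) % m
    return s == 0              # prime exactly when the final residue is zero


# Small exponents only; real tests need FFT-based squaring to be practical.
for p in (3, 5, 7, 11, 13):
    print(p, lucas_lehmer(p))
```

The squaring `s * s` is the step that the GPU programs replace with an FFT-based multiplication, which is why the speed of the FFT code dominates everything else.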

xx005fs 2020-02-02 05:26

[QUOTE=LaurV;536442]First, CudaLucas was never intended to run on AMD cards. For native/cuda/nvidia cards it is still faster than anything else.[/QUOTE]

This statement is a bit misleading, since with the new gpuowl updates it has become significantly more efficient in its memory bandwidth usage. I am seeing significant speedups on GPUs with a high DP ratio, like the K80, P100, V100 and Titan V. There is indeed not much difference on the GTX and RTX cards, because most of them are DP-bound rather than memory-bound.

nomead 2020-02-02 07:43

[QUOTE=LaurV;536442]First, CudaLucas was never intended to run on AMD cards. For native/cuda/nvidia cards it is still faster than anything else. At least, for everything I run in my rigs, old cards (like the 580 and classic/black Titans) and new cards (like the 1080Ti and 2080Ti) included. [/QUOTE]
Nope, on my RTX 2080 at least, the current version of gpuowl is about 20-30% faster than cudalucas, varying a bit from one FFT size to another. The big improvement came at the beginning of December 2019, and smaller optimizations have accumulated since then, so if you tested gpuowl before that, please test again.

LaurV 2020-02-02 07:58

:rakes::surrender

You may be totally right... We didn't move to such new fancy things yet.. :sad:
Edit @nomead, crosspost, I was replying to xx, but what you say is really tempting, BRB soon :smile:

xx005fs 2020-02-02 08:20

[QUOTE=nomead;536452]Nope, on my RTX 2080 at least, the current version of gpuowl is about 20-30% faster than cudalucas, varying a bit from one FFT size to another. The big improvement came at the beginning of December 2019, and smaller optimizations have accumulated since then, so if you tested gpuowl before that, please test again.[/QUOTE]

Interesting. I saw only around a 5% improvement going from CUDALucas to gpuowl on my 1070. Did the RTX series get a DP ratio higher than 1/32?

preda 2020-02-02 20:33

It seems that your OpenCL compiler does not like __attribute__((opencl_unroll_hint(1))). To work around that, simply pass "-use UNROLL_ALL" (and none of the other UNROLL_ options), or, if running on an Nvidia card, don't pass any UNROLL option at all.

[QUOTE=JCoveiro;536429]Thanks!

But first, I just want to say that there is a bug in the program.

[B]I'm using gpuowl v6.11-134-g1e0ce1d.[/B]

#####################################

Running the batch outputs the following errors:

[B]Error#1[/B]
Running the Windows batch file at:
2020-02-01 23:55:14 config: -time -iters 10000 -use NO_ASM,UNROLL_NONE
outputs some errors, followed by:
2020-02-01 23:55:14 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build

[B]Error#2[/B]
Running the Windows batch file at:
2020-02-01 23:55:14 config: -time -iters 10000 -use NO_ASM,UNROLL_WIDTH
outputs some errors, followed by:
2020-02-01 23:55:15 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build

[B]Error#3[/B]
Running the Windows batch file at:
2020-02-01 23:55:15 config: -time -iters 10000 -use NO_ASM,UNROLL_HEIGHT
outputs some errors, followed by:
2020-02-01 23:55:15 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build

[B]Error#4[/B]
Running the Windows batch file at:
2020-02-01 23:55:15 config: -time -iters 10000 -use NO_ASM,UNROLL_MIDDLEMUL1
outputs some errors, followed by:
2020-02-01 23:55:16 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build

[B]Error#5[/B]
Running the Windows batch file at:
2020-02-01 23:55:16 config: -time -iters 10000 -use NO_ASM,UNROLL_MIDDLEMUL2
outputs some errors, followed by:
2020-02-01 23:55:16 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build

#####################################

[B]Here are some more details on Error#1:[/B]

[CODE]2020-02-01 23:55:14 config: -time -iters 10000 -use NO_ASM,UNROLL_NONE
2020-02-01 23:55:14 device 0, unique id ''
2020-02-01 23:55:14 GeForce GTX 1660-0 99753809 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 17.30 bits/word
2020-02-01 23:55:14 GeForce GTX 1660-0 OpenCL args "-DEXP=99753809u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0xd.064531a6f6b48p-3 -DIWEIGHT_STEP=0x9.d3e00e7c301p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DNO_ASM=1 -DUNROLL_NONE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-02-01 23:55:14 GeForce GTX 1660-0 OpenCL compilation error -11 (args -DEXP=99753809u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0xd.064531a6f6b48p-3 -DIWEIGHT_STEP=0x9.d3e00e7c301p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DNO_ASM=1 -DUNROLL_NONE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0 -DNO_ASM=1)
2020-02-01 23:55:14 GeForce GTX 1660-0 <kernel>:1386:3: error: expected identifier or '('
for (i32 s = 4; s >= 0; s -= 2) {
^
<kernel>:1394:3: error: expected identifier or '('
for (i32 s = 4; s >= 0; s -= 2) {
^
<kernel>:1404:3: error: expected identifier or '('
for (i32 s = 3; s >= 0; s -= 3) {
^
<kernel>:1412:3: error: expected identifier or '('
for (i32 s = 3; s >= 0; s -= 3) {
^
<kernel>:1422:3: error: expected identifier or '('
for (i32 s = 6; s >= 0; s -= 2) {
^
<kernel>:1430:3: error: expected identifier or '('
for (i32 s = 6; s >= 0; s -= 2) {
^
<kernel>:1440:3: error: expected identifier or '('
for (i32 s = 6; s >= 0; s -= 3) {
^
<kernel>:1448:3: error: expected identifier or '('
for (i32 s = 6; s >= 0; s -= 3) {
^
<kernel>:1458:3: error: expected identifier or '('
for (i32 s = 5; s >= 2; s -= 3) {
^
<kernel>:1502:3: error: expected identifier or '('
for (i32 s = 5; s >= 2; s -= 3) {
^
<kernel>:2478:3: error: expected identifier or '('
for (i32 i = 0; i < MIDDLE; ++i) {
^

2020-02-01 23:55:14 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build
2020-02-01 23:55:14 GeForce GTX 1660-0 Bye
[/CODE][/QUOTE]
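
One way to see why the unroll attribute is the likely culprit is to pull the diagnostics out of the log and look at the source line each one points to. The Python sketch below does that for a two-entry excerpt of the log quoted above. The helper name `kernel_errors` and the regular expression are illustrative assumptions about this particular log layout, not part of gpuowl.

```python
import re

# Matches compiler diagnostics of the form "<kernel>:LINE:COL: error: MESSAGE".
DIAGNOSTIC = re.compile(r"<kernel>:(\d+):(\d+): error: (.*)")


def kernel_errors(log_text):
    """Return (line, column, message, source_line) for each compile error in the log."""
    lines = log_text.splitlines()
    found = []
    for i, line in enumerate(lines):
        match = DIAGNOSTIC.search(line)
        if match:
            # The compiler echoes the offending source line right after the diagnostic.
            source = lines[i + 1].strip() if i + 1 < len(lines) else ""
            found.append((int(match.group(1)), int(match.group(2)), match.group(3), source))
    return found


# Excerpt copied from the log above (first and last diagnostics only).
sample = """\
2020-02-01 23:55:14 GeForce GTX 1660-0 <kernel>:1386:3: error: expected identifier or '('
for (i32 s = 4; s >= 0; s -= 2) {
^
<kernel>:2478:3: error: expected identifier or '('
for (i32 i = 0; i < MIDDLE; ++i) {
^
"""

errors = kernel_errors(sample)
all_on_for_loops = all(src.startswith("for ") for _, _, _, src in errors)
print(len(errors), "errors; all on for-loops:", all_on_for_loops)
```

Every diagnostic in the full log lands on a `for` statement, which is consistent with the compiler rejecting an attribute placed in front of the loop.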

preda 2020-02-02 20:34

As the error says, you can't use "WORKINGOUT4" with that FFT size.

Did you try running the program without any -use options? Does that work?

[QUOTE=JCoveiro;536435]I have found another bug, while trying to test M47 (a lower exponent).

[CODE]2020-02-02 01:36:38 gpuowl v6.11-134-g1e0ce1d
2020-02-02 01:36:38 Note: not found 'config.txt'
2020-02-02 01:36:38 config: -use UNROLL_ALL,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE,CARRY64,FANCYMIDDLEMUL1,LESS_ACCURATE
2020-02-02 01:36:38 device 0, unique id ''
2020-02-02 01:36:38 GeForce GTX 1660-0 43112609 FFT 2304K: Width 8x8, Height 256x8, Middle 9; 18.27 bits/word
2020-02-02 01:36:39 GeForce GTX 1660-0 OpenCL args "-DEXP=43112609u -DWIDTH=64u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xd.3ca600d8f455p-3 -DIWEIGHT_STEP=0x9.ab80a96f8aeap-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DCARRY64=1 -DFANCYMIDDLEMUL1=1 -DLESS_ACCURATE=1 -DT2_SHUFFLE=1 -DUNROLL_ALL=1 -DWORKINGIN4=1 -DWORKINGOUT4=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-02-02 01:36:39 GeForce GTX 1660-0 OpenCL compilation error -11 (args -DEXP=43112609u -DWIDTH=64u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xd.3ca600d8f455p-3 -DIWEIGHT_STEP=0x9.ab80a96f8aeap-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DCARRY64=1 -DFANCYMIDDLEMUL1=1 -DLESS_ACCURATE=1 -DT2_SHUFFLE=1 -DUNROLL_ALL=1 -DWORKINGIN4=1 -DWORKINGOUT4=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0 -DNO_ASM=1)
2020-02-02 01:36:39 GeForce GTX 1660-0 <kernel>:2009:2: error: WORKINGOUT4 not compatible with this FFT size
#error WORKINGOUT4 not compatible with this FFT size
^

2020-02-02 01:36:39 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build
2020-02-02 01:36:39 GeForce GTX 1660-0 Bye[/CODE][/QUOTE]
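
As a side note on "that FFT size": the figures in the log lines can be cross-checked with a little arithmetic. The Python sketch below assumes the FFT length is 2 x WIDTH x SMALL_HEIGHT x MIDDLE words. The factor of 2 is inferred from the logged numbers in this thread, not taken from the gpuowl source, and the function names are made up for the example.

```python
def fft_size_k(width, small_height, middle):
    """FFT length in K words, assuming length = 2 * WIDTH * SMALL_HEIGHT * MIDDLE."""
    return 2 * width * small_height * middle // 1024


def bits_per_word(exponent, width, small_height, middle):
    """Average number of exponent bits carried by each FFT word."""
    return exponent / (2 * width * small_height * middle)


# Values taken from the -DEXP, -DWIDTH, -DSMALL_HEIGHT and -DMIDDLE arguments in the two logs.
for exponent, width, small_height, middle in (
    (43112609, 64, 2048, 9),     # log reports "FFT 2304K ... 18.27 bits/word"
    (99753809, 1024, 256, 11),   # log reports "FFT 5632K ... 17.30 bits/word"
):
    size = fft_size_k(width, small_height, middle)
    bits = bits_per_word(exponent, width, small_height, middle)
    print(f"{exponent}: FFT {size}K, {bits:.2f} bits/word")
```

Both results agree with the "FFT ...K" and "bits/word" values printed by the program, so the WIDTH 64, SMALL_HEIGHT 2048 layout is what "this FFT size" refers to in the WORKINGOUT4 error.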

JCoveiro 2020-02-02 20:43

[QUOTE=preda;536513]As the error says, you can't use "WORKINGOUT4" with that FFT size.

Did you try running the program without any -use options? Does that work?[/QUOTE]

Yes. It runs without -use options.
I was just testing the "optimized settings" for Nvidia cards, but it seems that I can't use WORKINGOUT4.

Going to test again and publish the results for the GTX1660.

wfgarnett3 2020-02-06 09:29

[QUOTE=LaurV;536442]
First, CudaLucas was never intended to run on AMD cards. For native/cuda/nvidia cards it is still faster than anything else.[/QUOTE]

Same here -- CUDALucas is faster than gpuOwL on my EVGA Geforce 1050 2GB and also doesn't slow down Prime95 running on the CPU concurrently.

However, even though iteration times are a couple of milliseconds slower on gpuOwL than on CUDALucas (plus a couple of milliseconds of slowdown to Prime95 if it is running too), gpuOwL is the overall time saver over CUDALucas for me, since gpuOwL eliminates the need for a double-check.

I only did one PRP double-check with gpuOwL and I occasionally do LL double-checks with CUDALucas.

Since the 1/32 double-precision ratio is terrible I mostly stick with Trial Factoring using mfaktc.

kriesel 2020-02-06 14:08

[QUOTE=wfgarnett3;536864]since gpuOwL eliminates the need for a double-check[/QUOTE]But it doesn't. There is a PRP DC work type for good reasons:
1) errors may occur outside the code that the GEC covers, both in the software and in the manual reporting process, and some have already been confirmed to occur;
2) PRP DC guards against someone forging PRP first-test submissions;
3) PRP GEC itself has a very low error rate, but not zero. Gerbicz himself has given error rate estimates.

[QUOTE]Since the 1/32 double-precision ratio is terrible I mostly stick with Trial Factoring using mfaktc.[/QUOTE]Good choice.

kriesel 2020-02-06 17:01

CUDALucas still has its place:
1) it is faster than gpuowl on a few GPU models;
2) it will run on older NVIDIA GPUs that are entirely incapable of running gpuowl because they don't support the OpenCL level that gpuowl requires;
3) relatively current gpuowl versions don't do LL, so they can't do LLDC (although gpuowl v0.5 and v0.6 can, with a 4M FFT).
It would be great if CUDALucas had the Jacobi check.

