[QUOTE=kriesel;536423]CUDALucas has not had any significant development in years, so naturally has fallen behind.[/QUOTE]
This is a bit misleading. First, CudaLucas was never intended to run on AMD cards. For native/cuda/nvidia cards it is still faster than anything else. At least, for everything I run in my rigs, old cards (like 580 and classic/black Titans) and new cards (like 1080Ti and 2080Ti) included. Second, there is "almost nothing" to improve in CudaLucas (well, there are some minor things, hence the quotes, but the big picture won't change much). This toy is just a "square, subtract 2, repeat" tool, which uses Nvidia's CUDA FFT library (cuFFT) to do the squaring. These libraries, [U]indeed, fell behind[/U], as you said. They have not been updated by Nvidia for ages, and if we could convince them to make (or make ourselves :chappy:) a cuFFT library a hundred times faster than the current one, all CL would need is a recompilation. :razz: As for the owl, Preda wrote the libraries from scratch, and they are well tuned for OpenCL, but Nvidia cards are not so good at emulating OpenCL; they are faster when native CUDA is used. |
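For readers who have not seen it, the "square, subtract 2, repeat" loop is the Lucas-Lehmer test. Here is a minimal sketch in Python, assuming plain big-integer arithmetic in place of the FFT-based squaring that CUDALucas does on the GPU. The function name `lucas_lehmer` is just illustrative, not anything from the CUDALucas source.

```python
def lucas_lehmer(p):
    """Lucas-Lehmer test for M_p = 2**p - 1, where p is an odd prime.

    Illustrative only: real programs do the squaring with a floating-point
    FFT (cuFFT in CUDALucas); here Python's built-in big integers are used.
    """
    m = (1 << p) - 1          # the Mersenne number M_p
    s = 4                     # starting value of the sequence
    for _ in range(p - 2):    # p - 2 iterations in total
        s = (s * s - 2) % m   # square, subtract 2, reduce mod M_p
    return s == 0             # M_p is prime exactly when the final residue is 0


if __name__ == "__main__":
    for p in (3, 5, 7, 11, 13):
        print(p, lucas_lehmer(p))
```

Nearly all of the run time goes into the squaring step, which is why the speed of the FFT library matters so much more than the rest of the program.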
[QUOTE=LaurV;536442]First, CudaLucas was never intended to run on AMD cards. For native/cuda/nvidia cards it is still faster than anything else.[/QUOTE]
This statement is a bit misleading, since with the new gpuowl updates it has become significantly more efficient in its memory bandwidth usage. I am seeing significant speedups on GPUs with a high DP ratio, like the K80, P100, V100 and Titan V. There is indeed not much difference for the GTX and RTX cards, because most of them are DP bound rather than memory bound. |
[QUOTE=LaurV;536442]First, CudaLucas was never intended to run on AMD cards. For native/cuda/nvidia cards it is still faster than anything else. At least, for everything I run in my rigs, old cards (like 580 and classic/black Titans) and new cards (like 1080Ti and 2080Ti) included. [/QUOTE]
Nope, on my RTX 2080 at least, the current version of gpuowl is about 20-30% faster than cudalucas, varying a bit from one FFT size to another. The big improvement came at the beginning of December 2019, and smaller optimizations have accumulated since then, so if you tested gpuowl before that, please test again. |
:rakes::surrender
You may be totally right... We didn't move to such new fancy things yet.. :sad: Edit @nomead, crosspost, I was replying to xx, but what you say is really tempting, BRB soon :smile: |
[QUOTE=nomead;536452]Nope, on my RTX 2080 at least, the current version of gpuowl is about 20-30% faster than cudalucas, varying a bit from one FFT size to another. The big improvement came at the beginning of December 2019, and smaller optimizations have accumulated since then, so if you tested gpuowl before that, please test again.[/QUOTE]
Interesting. I saw only around 5% improvement going from CUDALucas to gpuowl on my 1070. Did the RTX series get a DP ratio higher than 1/32? |
It seems that your OpenCL compiler does not like __attribute__((opencl_unroll_hint(1))). To work around that, simply pass "-use UNROLL_ALL" (and none of the other UNROLL_ options), or, if running on an Nvidia card, don't pass any UNROLL option at all.
[QUOTE=JCoveiro;536429]Thanks! But first, just want to say that there is a bug in the program.
[B]I'm using gpuowl v6.11-134-g1e0ce1d.[/B]
#####################################
Running the batch outputs the following errors:

[B]Error#1[/B]
Running the Windows batch file at:
2020-02-01 23:55:14 config: -time -iters 10000 -use NO_ASM,UNROLL_NONE
outputs some errors, followed by:
2020-02-01 23:55:14 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build

[B]Error#2[/B]
Running the Windows batch file at:
2020-02-01 23:55:14 config: -time -iters 10000 -use NO_ASM,UNROLL_WIDTH
outputs some errors, followed by:
2020-02-01 23:55:15 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build

[B]Error#3[/B]
Running the Windows batch file at:
2020-02-01 23:55:15 config: -time -iters 10000 -use NO_ASM,UNROLL_HEIGHT
outputs some errors, followed by:
2020-02-01 23:55:15 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build

[B]Error#4[/B]
Running the Windows batch file at:
2020-02-01 23:55:15 config: -time -iters 10000 -use NO_ASM,UNROLL_MIDDLEMUL1
outputs some errors, followed by:
2020-02-01 23:55:16 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build

[B]Error#5[/B]
Running the Windows batch file at:
2020-02-01 23:55:16 config: -time -iters 10000 -use NO_ASM,UNROLL_MIDDLEMUL2
outputs some errors, followed by:
2020-02-01 23:55:16 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build
#####################################
[B]Here are some more details on Error#1:[/B]
[CODE]2020-02-01 23:55:14 config: -time -iters 10000 -use NO_ASM,UNROLL_NONE
2020-02-01 23:55:14 device 0, unique id ''
2020-02-01 23:55:14 GeForce GTX 1660-0 99753809 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 17.30 bits/word
2020-02-01 23:55:14 GeForce GTX 1660-0 OpenCL args "-DEXP=99753809u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0xd.064531a6f6b48p-3 -DIWEIGHT_STEP=0x9.d3e00e7c301p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DNO_ASM=1 -DUNROLL_NONE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-02-01 23:55:14 GeForce GTX 1660-0 OpenCL compilation error -11 (args -DEXP=99753809u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0xd.064531a6f6b48p-3 -DIWEIGHT_STEP=0x9.d3e00e7c301p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DNO_ASM=1 -DUNROLL_NONE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0 -DNO_ASM=1)
2020-02-01 23:55:14 GeForce GTX 1660-0 <kernel>:1386:3: error: expected identifier or '('
  for (i32 s = 4; s >= 0; s -= 2) {
  ^
<kernel>:1394:3: error: expected identifier or '('
  for (i32 s = 4; s >= 0; s -= 2) {
  ^
<kernel>:1404:3: error: expected identifier or '('
  for (i32 s = 3; s >= 0; s -= 3) {
  ^
<kernel>:1412:3: error: expected identifier or '('
  for (i32 s = 3; s >= 0; s -= 3) {
  ^
<kernel>:1422:3: error: expected identifier or '('
  for (i32 s = 6; s >= 0; s -= 2) {
  ^
<kernel>:1430:3: error: expected identifier or '('
  for (i32 s = 6; s >= 0; s -= 2) {
  ^
<kernel>:1440:3: error: expected identifier or '('
  for (i32 s = 6; s >= 0; s -= 3) {
  ^
<kernel>:1448:3: error: expected identifier or '('
  for (i32 s = 6; s >= 0; s -= 3) {
  ^
<kernel>:1458:3: error: expected identifier or '('
  for (i32 s = 5; s >= 2; s -= 3) {
  ^
<kernel>:1502:3: error: expected identifier or '('
  for (i32 s = 5; s >= 2; s -= 3) {
  ^
<kernel>:2478:3: error: expected identifier or '('
  for (i32 i = 0; i < MIDDLE; ++i) {
  ^
2020-02-01 23:55:14 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build
2020-02-01 23:55:14 GeForce GTX 1660-0 Bye
[/CODE][/QUOTE] |
As the error says, you can't use "WORKINGOUT4" with that FFT size.
Did you try running the program without any -use options? Does that work? [QUOTE=JCoveiro;536435]I have found another bug, while trying to test M47 (a lower exponent).
[CODE]2020-02-02 01:36:38 gpuowl v6.11-134-g1e0ce1d
2020-02-02 01:36:38 Note: not found 'config.txt'
2020-02-02 01:36:38 config: -use UNROLL_ALL,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE,CARRY64,FANCYMIDDLEMUL1,LESS_ACCURATE
2020-02-02 01:36:38 device 0, unique id ''
2020-02-02 01:36:38 GeForce GTX 1660-0 43112609 FFT 2304K: Width 8x8, Height 256x8, Middle 9; 18.27 bits/word
2020-02-02 01:36:39 GeForce GTX 1660-0 OpenCL args "-DEXP=43112609u -DWIDTH=64u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xd.3ca600d8f455p-3 -DIWEIGHT_STEP=0x9.ab80a96f8aeap-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DCARRY64=1 -DFANCYMIDDLEMUL1=1 -DLESS_ACCURATE=1 -DT2_SHUFFLE=1 -DUNROLL_ALL=1 -DWORKINGIN4=1 -DWORKINGOUT4=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-02-02 01:36:39 GeForce GTX 1660-0 OpenCL compilation error -11 (args -DEXP=43112609u -DWIDTH=64u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xd.3ca600d8f455p-3 -DIWEIGHT_STEP=0x9.ab80a96f8aeap-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DCARRY64=1 -DFANCYMIDDLEMUL1=1 -DLESS_ACCURATE=1 -DT2_SHUFFLE=1 -DUNROLL_ALL=1 -DWORKINGIN4=1 -DWORKINGOUT4=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0 -DNO_ASM=1)
2020-02-02 01:36:39 GeForce GTX 1660-0 <kernel>:2009:2: error: WORKINGOUT4 not compatible with this FFT size
 #error WORKINGOUT4 not compatible with this FFT size
 ^
2020-02-02 01:36:39 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build
2020-02-02 01:36:39 GeForce GTX 1660-0 Bye[/CODE][/QUOTE] |
[QUOTE=preda;536513]As the error says, you can't use "WORKINGOUT4" with that FFT size.
Did you try running the program without any -use options? Does that work?[/QUOTE] Yes. It runs without -use options. I was just testing the "optimized settings" for Nvidia cards, but it seems that I can't use WORKINGOUT4. Going to test again and publish the results for the GTX 1660. |
[QUOTE=LaurV;536442]
First, CudaLucas was never intended to run on AMD cards. For native/cuda/nvidia cards it is still faster than anything else.[/QUOTE] Same here -- CUDALucas is faster than gpuOwL on my EVGA GeForce 1050 2GB and also doesn't slow down Prime95 running on the CPU concurrently. However, even with iteration times a couple of milliseconds slower on gpuOwL than on CUDALucas (plus a couple of milliseconds of slowdown to Prime95 if it is running too), gpuOwL eliminates the need for a double-check, which makes it the overall time saver over CUDALucas for me. I only did one PRP double-check with gpuOwL, and I occasionally do LL double-checks with CUDALucas. Since the 1/32 double-precision ratio is terrible, I mostly stick with Trial Factoring using mfaktc. |
[QUOTE=wfgarnett3;536864]since gpuOwL eliminates the need for a double-check[/QUOTE]But it doesn't. There is a PRP DC work type for good reasons:
1) errors may occur outside the code that the GEC covers, both in the software and in the manual reporting process, and some have already been confirmed to occur;
2) PRP DC guards against someone forging PRP first-test submissions;
3) the PRP GEC itself has a very low error rate, but not zero. Gerbicz himself has given error rate estimates.
[QUOTE]Since the 1/32 double-precision ratio is terrible I mostly stick with Trial Factoring using mfaktc.[/QUOTE]Good choice. |
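To make the GEC (Gerbicz error check) point concrete, here is a toy sketch in Python. It assumes a base-3 PRP test of M_p = 2^p - 1 (the residue after p squarings of 3 should be 9) and verifies after every block of squarings. Real implementations verify far less often and work on FFT data, so this is not how gpuowl does it. The names `prp_with_gec`, `block` and `corrupt_at` are made up for illustration, and `corrupt_at` is only a hook to simulate a hardware error.

```python
def prp_with_gec(p, block=4, corrupt_at=None):
    """Toy base-3 PRP test of M_p = 2**p - 1 with a Gerbicz-style check.

    Returns (is_prp, check_ok). is_prp is only meaningful when check_ok
    is True. corrupt_at (hypothetical) flips the residue at that iteration.
    """
    n = (1 << p) - 1
    x = 3                      # x_i = 3**(2**i) mod n, starting with x_0 = 3
    d = 3                      # running product of x at block boundaries (d_0 = x_0)
    for i in range(1, p + 1):
        x = x * x % n          # the normal PRP squaring
        if i == corrupt_at:
            x = (x + 1) % n    # simulated hardware error
        if i % block == 0:
            d_prev = d
            d = d * x % n      # fold the new block-boundary residue into d
            # Gerbicz identity: d_new == 3 * d_prev**(2**block)  (mod n)
            if d != 3 * pow(d_prev, 1 << block, n) % n:
                return False, False   # error detected, result untrustworthy
    return x == 9 % n, True


if __name__ == "__main__":
    print(prp_with_gec(13))                 # clean run
    print(prp_with_gec(13, corrupt_at=5))   # injected error inside a checked block
```

In this sketch the squarings after the last block boundary are never verified, and nothing is checked once the residue leaves the loop. That is the kind of gap point 1) is about.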
CUDALucas still has its place:
it is faster than gpuowl on a few GPU models;
it will run on older NVIDIA GPUs that are entirely incapable of running gpuowl because they don't support the OpenCL level that gpuowl requires;
relatively current gpuowl versions don't do LL, so they can't do LLDC (although gpuowl v0.5 and v0.6 can, with a 4M FFT).
It would be great if CUDALucas had the Jacobi check. |