mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

preda 2018-08-20 12:52

[QUOTE=kriesel;494261]gets it down to[CODE]C:\users\ken\documents\gpuowl-cuda-build>nvcc -O2 -DREV=\"91c52fa\" -o cudaowl-V38-91c52fa-W64.exe common.cpp gpuowl.cpp CudaGpu.cu NoTF.cpp -lcufft
common.cpp
gpuowl.cpp
c:\users\ken\documents\gpuowl-cuda-build\timeutil.h(13): error C2079: 'tv' uses undefined struct 'Timer::timeMicros::timeval'
c:\users\ken\documents\gpuowl-cuda-build\timeutil.h(14): error C3861: 'gettimeofday': identifier not found[/CODE]And then there's this: [URL]https://blog.habets.se/2010/09/gettimeofday-should-never-be-used-to-measure-time.html[/URL][/QUOTE]

I replaced sys/time.h with std::chrono in timeutil.h (and added the missing #include <string> in checkpoint.h). Please retry.

preda 2018-08-20 13:16

[QUOTE=kriesel;494262]Interesting; 1312/162.4=~ 8.08. On CUDA code I typically see a ratio of 12-15. Higher primality test efficiency would drive the ratio lower.[/QUOTE]
Maybe because the DP/SP ratio is worse on consumer Nvidia vs. AMD? (if that's true, AMD would be better at PRP, and Nvidia better at TF, relatively).

kriesel 2018-08-20 13:33

[QUOTE=preda;494278]Maybe because the DP/SP ratio is worse on consumer Nvidia vs. AMD? (if that's true, AMD would be better at PRP, and Nvidia better at TF, relatively).[/QUOTE]Please provide any references (links or tools) for AMD DP/SP ratios. I've found that GPU-Z gives values for NVIDIA but not for AMD.

preda 2018-08-20 14:08

[QUOTE=kriesel;494281]Please provide any references (links or tools) for AMD DP/SP ratios. I've found that GPU-Z gives values for NVIDIA but not for AMD.[/QUOTE]

[url]https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_RX_Vega_Series[/url]
indicates 1/16 for Vega & Polaris (e.g. for Vega64, 12665 GFlops SP, 792 GFlops DP)

[url]https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_10_series[/url]
indicates 1/32 for "10 series" (e.g. GTX 1080, 8228 GFlops SP, 257 GFlops DP).

kriesel 2018-08-20 14:11

latest CUDA Win64 attempt link error
 
[QUOTE=preda;494277]I replaced sys/time.h with std::chrono in timeutil.h (and added the missing #include <string> in checkpoint.h). Please retry.[/QUOTE]
On CUDA:

[CODE]C:\users\ken\documents\gpuowl-cuda-build>nvcc -O2 -DREV=\"ae3be65\" -o cudaowl-V38-ae3be65-W64.exe common.cpp gpuowl.cpp CudaGpu.cu NoTF.cpp -lcufft
common.cpp
gpuowl.cpp
CudaGpu.cu
NoTF.cpp
Creating library cudaowl-V38-ae3be65-W64.lib and object cudaowl-V38-ae3be65-W64.exp
tmpxft_00003d78_00000000-15_gpuowl.obj : error LNK2019: unresolved external symbol "class std::unique_ptr<class TF,struct std::default_delete<class TF> > __cdecl makeTF(struct Args &)" (?makeTF@@YA?AV?$unique_ptr@VTF@@U?$default_delete@VTF@@@std@@@std@@AEAUArgs@@@Z) referenced in function "bool __cdecl doTF(unsigned int,int,int,struct Args &,class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > const &)" (?doTF@@YA_NIHHAEAUArgs@@AEBV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@@Z)
cudaowl-V38-ae3be65-W64.exe : fatal error LNK1120: 1 unresolved externals[/CODE]

kriesel 2018-08-20 14:25

[QUOTE=preda;494284][URL]https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_RX_Vega_Series[/URL]
indicates 1/16 for Vega & Polaris (e.g. for Vega64, 12665 GFlops SP, 792 GFlops DP)

[URL]https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_10_series[/URL]
indicates 1/32 for "10 series" (e.g. GTX 1080, 8228 GFlops SP, 257 GFlops DP).[/QUOTE]
GFLOPS ratios there for AMD Vega or the RX 480 give ~16:1; your earlier posted gpuOwL GhzD/day outputs gave ~8.08, nearly a 2:1 discrepancy. Are the PRP and TF code counting ops in comparable ways? [url]http://www.mersenneforum.org/showpost.php?p=494260&postcount=625[/url]

preda 2018-08-20 22:51

[QUOTE=kriesel;494287]GFLOPS ratios there for AMD Vega or RX480 give ~16:1; your earlier posted gpuOwL GhzDay/d outputs gave ~8.08, nearly a 2:1 ratio. Are PRP and TF code counting ops in comparable ways? [url]http://www.mersenneforum.org/showpost.php?p=494260&postcount=625[/url][/QUOTE]

No; the GFLOPS numbers for TF and for PRP are probably not comparable and not very meaningful, because they are based on the GIMPS reward GFLOPS figures, which in turn are based on the performance of prime95 on CPU.

In particular, the TF number seems inflated compared to the PRP one.

kriesel 2018-08-21 02:04

[QUOTE=preda;494331]No; the GFLOPS numbers for TF and for PRP are probably not comparable and not very meaningful, because they are based on the GIMPS reward GFLOPS figures, which in turn are based on the performance of prime95 on CPU.

In particular, the TF number seems inflated compared to the PRP one.[/QUOTE]
FYI, I tabulated the variety of CPUs I had, for TF and primality GhzD/day ratings, and found TF/LL ratios of roughly 0.72 to 1.25 (all but one was Intel). Doing the same for GPUs, I got 12-15. All the inputs are GIMPS ratings. The 8.08 TF/PRP on GPU stood out compared to those figures.
It's known behavior that the indicated TF GhzD/day performance varies on the same hardware and software with exponent and bit level; the LL throughput rating likewise varies with exponent/FFT length.

Any insights on the linking problem for my last CUDA build above? [url]http://www.mersenneforum.org/showpost.php?p=494285&postcount=632[/url]

preda 2018-08-21 04:57

[QUOTE=kriesel;494346]
Any insights on the linking problem for my last CUDA build above? [url]http://www.mersenneforum.org/showpost.php?p=494285&postcount=632[/url][/QUOTE]

I started my CUDA GPU and tried it. I can't reproduce the problem, but I attempted a fix; could you retry with a fresh checkout?

preda 2018-08-21 11:33

GpuOwl CUDA plan
 
GpuOwl now has a very basic native-CUDA PRP implementation using cuFFT. This implementation is extremely slow, so slow that it's useless. It could be called a proof-of-concept but not more.

To make it faster, I would need to port the OpenCL code to CUDA, i.e. write 1:1 equivalent code using CUDA syntax and idioms.

Contemplating this task of porting the OpenCL backend to equivalent CUDA, I realized that it is a job for a compiler, not a developer. In other words, it's a waste of effort for a human to do a translation that can (and should) be done by a compiler/transpiler simply accepting this-or-that syntax on input.

And it turns out, *in theory*, it does not need to be done. Because in theory, Nvidia GPUs do support OpenCL (being standard and all). *in theory*

So I tried to run GpuOwl's OpenCL on a GTX 1080 with CUDA 9.2 drivers (i.e. recent). Unfortunately it does not work: I tried everything I could think of, and it still doesn't work.

Some things do work: the GPU is detected correctly, the kernels are compiled, and I can get the compiled PTX for inspection.

The problems start when executing the kernels. Initially I got CL_OUT_OF_RESOURCES at some point while launching the kernels. To debug this, I chose to wait for each kernel to finish executing (conveniently provided by GpuOwl's "-time" option). And it turns out that CUDA-OpenCL does not like the kernel "tailFused": the clFinish() following it fails every time with CL_INVALID_COMMAND_QUEUE.

To debug further, I removed that particular kernel from the execution; then I started to get CL_INVALID_COMMAND_QUEUE on the clFinish() after another kernel:
[CODE]
2018-08-21 21:15:51 390 gpuowl-OpenCL 3.9-6b6ea9a-mod
2018-08-21 21:15:51 390 FFT 4608K: Width 512 (64x8), Height 512 (64x8), Middle 9; 17.19 bits/word
2018-08-21 21:15:51 390 Note: using short carry kernels
2018-08-21 21:15:51 390 GeForce GTX 1080-20x1797-
2018-08-21 21:15:51 390

2018-08-21 21:15:51 390 OpenCL compilation in 11 ms, with "-DEXP=81103181u -DWIDTH=512u -DSMALL_HEIGHT=512u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0 "
size 443232
2018-08-21 21:15:52 390 PRP M(81103181), FFT 4608K, 17.19 bits/word, 261 GHz-day
run transposeIn
2018-08-21 21:15:52 390 time 884
run fftP
2018-08-21 21:15:52 390 time 30385
run transposeW
2018-08-21 21:15:52 390 time 1010
run fftMiddleIn
2018-08-21 21:15:52 390 time 431
run fftMiddleOut
2018-08-21 21:15:52 390 time 405
run transposeH
2018-08-21 21:15:52 390 time 502
run fftW
2018-08-21 21:15:52 390 time 728
run carryA
2018-08-21 21:15:52 390 time 28783
run carryB
2018-08-21 21:15:52 390 time 71
run fftP
2018-08-21 21:15:52 390 time 748
run transposeW
2018-08-21 21:15:52 390 time 1142
run fftMiddleIn
2018-08-21 21:15:52 390 time 557
run fftP
2018-08-21 21:15:52 390 time 811
run transposeW
2018-08-21 21:15:52 390 time 383
run fftMiddleIn
2018-08-21 21:15:52 390 time 405
run mulFused
2018-08-21 21:15:52 390 error -36
openowl: clwrap.cpp:246: void finish(cl_queue): Assertion `check(clFinish(q))' failed.
[/CODE]

The timings are also worrisome (30 ms for one execution of a single kernel), but I assume this is because the PTX is compiled to ISA on the first launch of each kernel.

Maybe a GPU timer trips, and this causes the clFinish() to bail out with the misleading CL_INVALID_COMMAND_QUEUE.

At this point I reach the conclusion that OpenCL on Nvidia doesn't quite work. I think Nvidia has some serious bugs here that make their OpenCL unusable.

OTOH it does not seem a good use of my time to port the OpenCL code to CUDA just because Nvidia can't or won't fix their OpenCL implementation.

So at this point I'm ready to give up on attempting to run on Nvidia GPUs. It's not for lack of trying. But really, IMO, porting everything over from OpenCL to CUDA just because Nvidia wants it so is not a good use of my time.

I would gladly revisit this if/when OpenCL starts running on Nvidia.

Fredrik 2018-08-21 19:41

For what it's worth, I for one like the cudaowl project. Before cudaowl, I actually managed to use the OpenCL version on a GTX 960; that was version v1.10-cd3c8ed (we discussed this on GitHub too, in issue #3).

I had to use the -longTail option then, so it sounds like the tailFused kernel was giving trouble back then too.

That OpenCL version was a bit faster than the current cudaowl: I got 11.64 ms/it on OpenCL and 14.34 ms/it on the current cudaowl, for exponent 75000001 at 4096K FFT. Your 30 ms/iteration on the much more powerful GTX 1080 is surprisingly bad. Maybe you could test the 1.10 version on it?

Would it be possible to bring back something like the -longTail option?

