[QUOTE=kriesel;494261]gets it down to[CODE]C:\users\ken\documents\gpuowl-cuda-build>nvcc -O2 -DREV=\"91c52fa\" -o cudaowl-V38-91c52fa-W64.exe common.cpp gpuowl.cpp CudaGpu.cu NoTF.cpp -lcufft
common.cpp
gpuowl.cpp
c:\users\ken\documents\gpuowl-cuda-build\timeutil.h(13): error C2079: 'tv' uses undefined struct 'Timer::timeMicros::timeval'
c:\users\ken\documents\gpuowl-cuda-build\timeutil.h(14): error C3861: 'gettimeofday': identifier not found[/CODE]And then there's this: [URL]https://blog.habets.se/2010/09/gettimeofday-should-never-be-used-to-measure-time.html[/URL][/QUOTE]I replaced sys/time.h with std::chrono in timeutil.h (and added the missing include <string> in checkpoint.h). Please retry.
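For reference, a minimal std::chrono-based replacement for a gettimeofday()-style microsecond timer (a sketch of the approach, not necessarily the exact code now in timeutil.h). Using steady_clock also addresses the monotonicity concern raised in the linked blog post:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Microsecond timestamp from a monotonic clock; a portable stand-in
// for the gettimeofday()-based Timer::timeMicros() that MSVC rejects.
// steady_clock never jumps backwards, unlike wall-clock time.
inline int64_t timeMicros() {
  using namespace std::chrono;
  return duration_cast<microseconds>(
      steady_clock::now().time_since_epoch()).count();
}
```

The difference of two such timestamps gives an elapsed interval that is immune to NTP adjustments and daylight-saving jumps.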
[QUOTE=kriesel;494262]Interesting; 1312/162.4 ≈ 8.08. On CUDA code I typically see a ratio of 12-15. Higher primality-test efficiency would drive the ratio lower.[/QUOTE]
Maybe because the DP/SP ratio is worse on consumer Nvidia vs. AMD? (If that's true, AMD would be better at PRP, and Nvidia better at TF, relatively.)
[QUOTE=preda;494278]Maybe because the DP/SP ratio is worse on consumer Nvidia vs. AMD? (if that's true, AMD would be better at PRP, and Nvidia better at TF, relatively).[/QUOTE]Please provide any references (links or tools) for AMD DP/SP ratios. I've found that GPU-Z gives values for NVIDIA but not for AMD.
[QUOTE=kriesel;494281]Please provide any references (links or tools) for AMD DP/SP ratios. I've found that GPU-Z gives values for NVIDIA but not for AMD.[/QUOTE]
[url]https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_RX_Vega_Series[/url] indicates 1/16 for Vega & Polaris (e.g. for Vega64, 12665 GFlops SP, 792 GFlops DP). [url]https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_10_series[/url] indicates 1/32 for the "10 series" (e.g. GTX 1080, 8228 GFlops SP, 257 GFlops DP).
latest CUDA Win64 attempt link error
[QUOTE=preda;494277]I replaced sys/time.h with std::chrono in timeutil.h (and added the missing include <string> in checkpoint.h). Please retry.[/QUOTE]
On CUDA: [CODE]C:\users\ken\documents\gpuowl-cuda-build>nvcc -O2 -DREV=\"ae3be65\" -o cudaowl-V38-ae3be65-W64.exe common.cpp gpuowl.cpp CudaGpu.cu NoTF.cpp -lcufft
common.cpp
gpuowl.cpp
CudaGpu.cu
NoTF.cpp
   Creating library cudaowl-V38-ae3be65-W64.lib and object cudaowl-V38-ae3be65-W64.exp
tmpxft_00003d78_00000000-15_gpuowl.obj : error LNK2019: unresolved external symbol "class std::unique_ptr<class TF,struct std::default_delete<class TF> > __cdecl makeTF(struct Args &)" (?makeTF@@YA?AV?$unique_ptr@VTF@@U?$default_delete@VTF@@@std@@@std@@AEAUArgs@@@Z) referenced in function "bool __cdecl doTF(unsigned int,int,int,struct Args &,class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > const &)" (?doTF@@YA_NIHHAEAUArgs@@AEBV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@@Z)
cudaowl-V38-ae3be65-W64.exe : fatal error LNK1120: 1 unresolved externals[/CODE]
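For what it's worth, the demangled name in the error shows the missing symbol is `std::unique_ptr<TF> makeTF(Args&)`, which NoTF.cpp is presumably meant to supply as a no-op when TF support is compiled out. A hypothetical stub of that shape (TF and Args here are bare placeholders, not the real GpuOwl declarations) would look like:

```cpp
#include <memory>

// Placeholder types standing in for the real GpuOwl declarations;
// only the function's shape matters for resolving the link error.
class TF {};
struct Args {};

// Hypothetical no-TF stub: returning an empty unique_ptr lets the
// caller (doTF) detect that trial-factoring is unavailable and bail.
std::unique_ptr<TF> makeTF(Args &) { return nullptr; }
```

If the real NoTF.cpp already defines this, the link failure would instead point at NoTF.cpp not making it into the link, or at a signature mismatch between the declaration and definition.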
[QUOTE=preda;494284][URL]https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_RX_Vega_Series[/URL] indicates 1/16 for Vega & Polaris (e.g. for Vega64, 12665 GFlops SP, 792 GFlops DP). [URL]https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_10_series[/URL] indicates 1/32 for "10 series" (e.g. GTX 1080, 8228 GFlops SP, 257 GFlops DP).[/QUOTE]
GFLOPS ratios there for AMD Vega or RX480 give ~16:1; your earlier posted gpuOwL GHz-days/day outputs gave ~8.08, nearly a 2:1 discrepancy. Are PRP and TF code counting ops in comparable ways? [url]http://www.mersenneforum.org/showpost.php?p=494260&postcount=625[/url]
[QUOTE=kriesel;494287]GFLOPS ratios there for AMD Vega or RX480 give ~16:1; your earlier posted gpuOwL GhzDay/d outputs gave ~8.08, nearly a 2:1 ratio. Are PRP and TF code counting ops in comparable ways? [url]http://www.mersenneforum.org/showpost.php?p=494260&postcount=625[/url][/QUOTE]
No, the GFLOPS numbers for TF and for PRP are probably not comparable and not very meaningful. This is because they are based on the GIMPS reward GFLOPS numbers, which in turn are based on the performance of prime95 on CPU. In particular, the TF number seems inflated compared to the PRP one.
[QUOTE=preda;494331]No, probably the GFLOPs numbers for TF and for PRP are not comparable and not too meaningful. This is because they are based on the GIMPS reward GFLOPS numbers, which are based on the performance on prime95 on CPU.
Particularly the TF number seems inflated compared to the PRP.[/QUOTE]FYI, I tabulated the variety of CPUs I had, for TF and primality GHz-days/day ratings, and found TF/LL ratios of roughly 0.72 to 1.25 (all but one was Intel). Doing the same for GPUs, I got 12-15. All the inputs are GIMPS ratings. The 8.08 TF/PRP on GPU stood out compared to those figures. It's known behavior that the indicated TF GHz-days/day performance varies with exponent and bit level on the same hardware and software; LL throughput rating also varies with exponent/FFT length. Any insights on the linking problem for my last CUDA build above? [url]http://www.mersenneforum.org/showpost.php?p=494285&postcount=632[/url]
[QUOTE=kriesel;494346]
Any insights on the linking problem for my last CUDA build above? [url]http://www.mersenneforum.org/showpost.php?p=494285&postcount=632[/url][/QUOTE]I started my CUDA GPU and tried it. I can't reproduce the problem, but I attempted a fix; could you retry with a fresh checkout?
GpuOwl CUDA plan
GpuOwl now has a very basic native-CUDA PRP implementation using cuFFT. This implementation is extremely slow, so slow that it's useless. It could be called a proof-of-concept but not more.
To make it faster, I would need to port the OpenCL code to CUDA: writing 1:1 equivalent code, but using CUDA syntax and idioms. Contemplating this task, I realized that it is a job for a compiler, not for a developer. In other words, it's a waste of effort for a human to do a translation that can (and should) be done by compilers/transpilers, simply parsing this-or-that syntax on input.

And it turns out that, *in theory*, it does not need to be done at all, because in theory Nvidia GPUs do support OpenCL (it being a standard and all). *In theory.*

So I tried to run GpuOwl's OpenCL on a GTX 1080 with CUDA 9.2 drivers (i.e. recent), and unfortunately it does not work, as in: I tried everything I could think of and it still doesn't work. Some things do work: the GPU is detected correctly, the kernels are compiled, and I can get the compiled PTX for inspection. The problems start when executing the kernels.

Initially I got CL_OUT_OF_RESOURCES at some point while launching the kernels. To debug this, I chose to wait for each kernel to finish executing (conveniently provided by GpuOwl's "-time" option). It turns out that CUDA-OpenCL does not like the kernel "tailFused" -- the clFinish() following it fails every time with CL_INVALID_COMMAND_QUEUE. In the interest of debugging, I removed this particular kernel from the execution, and then started to get CL_INVALID_COMMAND_QUEUE on clFinish() after another kernel.
[CODE]2018-08-21 21:15:51 390 gpuowl-OpenCL 3.9-6b6ea9a-mod
2018-08-21 21:15:51 390 FFT 4608K: Width 512 (64x8), Height 512 (64x8), Middle 9; 17.19 bits/word
2018-08-21 21:15:51 390 Note: using short carry kernels
2018-08-21 21:15:51 390 GeForce GTX 1080-20x1797-
2018-08-21 21:15:51 390
2018-08-21 21:15:51 390 OpenCL compilation in 11 ms, with "-DEXP=81103181u -DWIDTH=512u -DSMALL_HEIGHT=512u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0 " size 443232
2018-08-21 21:15:52 390 PRP M(81103181), FFT 4608K, 17.19 bits/word, 261 GHz-day
run transposeIn
2018-08-21 21:15:52 390 time 884
run fftP
2018-08-21 21:15:52 390 time 30385
run transposeW
2018-08-21 21:15:52 390 time 1010
run fftMiddleIn
2018-08-21 21:15:52 390 time 431
run fftMiddleOut
2018-08-21 21:15:52 390 time 405
run transposeH
2018-08-21 21:15:52 390 time 502
run fftW
2018-08-21 21:15:52 390 time 728
run carryA
2018-08-21 21:15:52 390 time 28783
run carryB
2018-08-21 21:15:52 390 time 71
run fftP
2018-08-21 21:15:52 390 time 748
run transposeW
2018-08-21 21:15:52 390 time 1142
run fftMiddleIn
2018-08-21 21:15:52 390 time 557
run fftP
2018-08-21 21:15:52 390 time 811
run transposeW
2018-08-21 21:15:52 390 time 383
run fftMiddleIn
2018-08-21 21:15:52 390 time 405
run mulFused
2018-08-21 21:15:52 390 error -36
openowl: clwrap.cpp:246: void finish(cl_queue): Assertion `check(clFinish(q))' failed.[/CODE]
The timings are also worrisome: 30 ms for one execution of a single kernel -- but I assume this is because the PTX is compiled to ISA on the first launch of a particular kernel. Maybe a GPU timer flips, and this causes the clFinish() to bail out with the misleading INVALID_QUEUE.

At this point I reach the conclusion that OpenCL on Nvidia doesn't quite work. I think Nvidia has some serious bugs here that make their OpenCL unusable. OTOH it does not seem a good use of my time to port the OpenCL to CUDA just because Nvidia can't or won't fix their OpenCL implementation.

So at this point I'm ready to give up on attempting to run on Nvidia GPUs. It's not for lack of trying. But really, IMO, porting everything over from OpenCL to CUDA just because Nvidia wants it so is not a good use of my time. I would gladly revisit this if/when OpenCL starts running on Nvidia.
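For readers decoding the log: error -36 is CL_INVALID_COMMAND_QUEUE. A tiny lookup for the codes mentioned in this thread (the numeric values are the standard ones from the Khronos cl.h header, declared here locally so the snippet doesn't need the OpenCL SDK):

```cpp
#include <string>

// OpenCL error codes seen in this thread; values match Khronos cl.h.
constexpr int kClSuccess             = 0;
constexpr int kClOutOfResources      = -5;   // CL_OUT_OF_RESOURCES
constexpr int kClInvalidCommandQueue = -36;  // CL_INVALID_COMMAND_QUEUE

// Map an OpenCL status code to a human-readable name for log output.
std::string clErrorName(int err) {
  switch (err) {
    case kClSuccess:             return "CL_SUCCESS";
    case kClOutOfResources:      return "CL_OUT_OF_RESOURCES";
    case kClInvalidCommandQueue: return "CL_INVALID_COMMAND_QUEUE";
    default:                     return "unknown (" + std::to_string(err) + ")";
  }
}
```

Per the OpenCL spec, once a command queue reports CL_INVALID_COMMAND_QUEUE the queue is effectively dead, which is consistent with the assertion firing on the very next clFinish().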
For what it's worth, I for one like the cudaowl project. Before cudaowl, I actually managed to use the OpenCL version on a GTX-960; this was version v1.10-cd3c8ed (we discussed this on GitHub too, issue #3).
Back then I had to use the -longTail option, so it sounds like the tailFused kernel gave trouble then too. That OpenCL version was a bit faster than the current cudaowl: I got 11.64 ms/it on OpenCL and 14.34 ms/it on the current cudaowl, for exponent 75000001 at 4096K FFT. Your 30 ms/iteration on that much more powerful GTX-1080 is surprisingly bad; maybe you could test the 1.10 version on it? Would it be possible to bring back something like the -longTail option?