
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

preda 2018-06-14 11:20

Today I start work on a CUDA PRP tester ("port of gpuOwl"). Will keep you posted with the progress :). I am excited. The biggest change (from OpenCL) is that I want to use cuFFT instead of "rolling my own". This way I get easy access to the hard work they put into cuFFT, and get NPOT FFT support out of the box.
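[An illustrative aside: cuFFT gets its NPOT flexibility from decomposing the transform size into small radices (2, 3, 5, 7). The helper below is hypothetical, just to show how a size like 4480K breaks down into those radices:]

```python
# cuFFT is efficient for sizes that factor into small radices (2, 3, 5, 7).
def small_radix_factorization(n, radices=(2, 3, 5, 7)):
    """Return ({radix: exponent}, leftover); leftover == 1 means the size
    is fully covered by the small radices."""
    factors = {}
    for p in radices:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
    return factors, n

# A power-of-two size and an NPOT size (in K, i.e. x1024 words):
print(small_radix_factorization(4096 * 1024))  # ({2: 22}, 1)
print(small_radix_factorization(4480 * 1024))  # ({2: 17, 5: 1, 7: 1}, 1)
```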

M344587487 2018-06-14 11:34

[QUOTE=preda;489808]Today I start work on a CUDA PRP tester ("port of gpuOwl"). Will keep you posted with the progress :). I am excited. The biggest change (from OpenCL) is that I want to use cuFFT instead of "rolling my own". This way I get easy access to the hard work they put into cuFFT, and get NPOT FFT support out of the box.[/QUOTE]

Cool, please do let us know how you think the APIs compare. Do you plan on integrating CUDA into gpuOwl, or will it be its own executable?

preda 2018-06-14 11:39

Own executable. I'll try to share logic related to savefiles and worktodo handling, log format, high-level PRP logic. OTOH the kernels and the memory layout would be quite different. So separate executables, and different FFT sizes offered.

henryzz 2018-06-14 12:51

What is the limitation that stops the gpuOwl OpenCL code running on Nvidia cards? Is it the lack of v 2.0 support?

preda 2018-06-14 13:58

No, gpuOwl does not require 2.0; OpenCL 1.2 should be enough. The problem is that Nvidia's OpenCL implementation seems to be... somehow broken, or different enough that it doesn't work in practice. Not usable.

Also, talking about performance: on Nvidia it looks like the performance is achieved with CUDA, not with OpenCL, unfortunately.

M344587487 2018-06-14 14:13

[QUOTE=henryzz;489819]What is the limitation that stops the gpuOwl OpenCL code running on Nvidia cards? Is it the lack of v 2.0 support?[/QUOTE]

Someone correct me if I'm wrong, but CUDA works better on Nvidia hardware because they designed it around their hardware and put effort into making it good. They could have made their OpenCL support better, but instead decided to leverage their position in the market to lock people into their proprietary ecosystem and stifle competition.

kriesel 2018-06-14 18:04

[QUOTE=preda;489812]Own executable. I'll try to share logic related to savefiles and worktodo handling, log format, high-level PRP logic. OTOH the kernels and the memory layout would be quite different. So separate executables, and different FFT sizes offered.[/QUOTE]
Super news that work is beginning on the CUDA port!

Please add back the date preceding the time of day. Eventually, on slow gpus and high exponents, runs will approach a day per 500,000-iteration check interval, log entry, and console update. As is, one can usually figure out what day a log entry is from by looking at the times and counting midnight rollovers, but it's a bit of a nuisance to do so, and more so the longer the run.

An RX550 kicks out only about 7 status lines per day in stabilized runs on 8M-FFT-sized exponents (~150M), and takes a lot of days to complete one. (gpuOwl v1.9)
[CODE]
OK 105500000 / 149447533 [70.59%], 22.62 ms/it [22.49, 24.27] CV 1.6%, check 15.18s; ETA 11d 12:07; 7d258759562620a5 [23:18:04]
OK 106000000 / 149447533 [70.93%], 22.65 ms/it [22.49, 49.54] CV 4.0%, check 15.29s; ETA 11d 09:19; 5442f469a0cb7552 [02:27:03]
OK 106500000 / 149447533 [71.26%], 22.62 ms/it [22.49, 23.84] CV 1.3%, check 15.10s; ETA 11d 05:52; fdc37179da8eabcb [05:35:49]
OK 107000000 / 149447533 [71.60%], 22.62 ms/it [22.49, 24.27] CV 1.4%, check 15.01s; ETA 11d 02:43; 47ff58ea51bfded6 [08:44:34]
OK 107500000 / 149447533 [71.93%], 22.62 ms/it [22.49, 24.27] CV 1.5%, check 15.16s; ETA 10d 23:35; bd5bd3c72ad1f11d [11:53:20]
OK 108000000 / 149447533 [72.27%], 22.62 ms/it [22.49, 24.24] CV 1.4%, check 15.02s; ETA 10d 20:26; 905dcdd0712f9ec6 [15:02:05]
OK 108500000 / 149447533 [72.60%], 22.62 ms/it [22.49, 24.24] CV 1.4%, check 14.93s; ETA 10d 17:16; 910f8d75624594bc [18:10:50]
OK 109000000 / 149447533 [72.94%], 22.62 ms/it [22.49, 23.84] CV 1.3%, check 14.91s; ETA 10d 14:08; 5dfaf65ac604ce3d [21:19:35]
OK 109500000 / 149447533 [73.27%], 22.62 ms/it [22.49, 24.24] CV 1.5%, check 15.16s; ETA 10d 10:59; 4c6685acfb8b24e2 [00:28:19][/CODE]Seven times 149447533 is about 1046M, a bit outside the mersenne.org range but nominally inside the CUDA 64M FFT range (1.143G). Since the number of bits per word drops slightly as FFT length increases, that limit will be reached at somewhat lower exponents than these linear estimates suggest. There are slower gpus. Such long run times (years) are definitely not recommended, but will probably occur occasionally.
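[As a sanity check on these figures: the ETA in each log line is just a linear extrapolation of the remaining iterations at the reported ms/it. A hypothetical helper, reproducing the last line's ETA from its own numbers:]

```python
def eta(exponent, iters_done, ms_per_it):
    """Remaining wall time as (days, hours, minutes); a PRP test runs
    about one iteration per exponent bit."""
    secs = int((exponent - iters_done) * ms_per_it / 1000)
    days, rem = divmod(secs, 86_400)
    return days, rem // 3_600, rem % 3_600 // 60

print(eta(149_447_533, 109_500_000, 22.62))  # (10, 11, 0), ~ the logged "ETA 10d 10:59"
```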

kriesel 2018-06-14 18:16

[QUOTE=henryzz;489819]What is the limitation that stops the gpuOwl OpenCL code running on Nvidia cards? Is it the lack of v 2.0 support?[/QUOTE]
I can confirm from testing that gpuOwL, developed for AMD, could run on an Intel IGP but failed to run on several NVIDIA gpu models that had functioning OpenCL drivers. It seems to be due to something odd about NVIDIA's OpenCL implementation.

Reportedly, an early gpuOwL conversion to CUDA by airsquirrels gave indications that performance was close between OpenCL and CUDA, at power-of-two FFT length(s).

The advantage of going CUDA on NVIDIA is that non-power-of-two (NPOT) FFT performance seems to be better in CUDA than in OpenCL implementations. CUDALucas and clLucas illustrate this, along with Preda's recent NPOT testing using OpenCL during development of gpuOwL v2.

preda 2018-06-14 22:52

[QUOTE=kriesel;489839]
Please add back, the date preceding the time of day.[/QUOTE]
Yes, this is already done in recent commits.

LaurV 2018-06-17 05:52

[QUOTE=M344587487;489825]Someone correct me if I'm wrong, CUDA works better on nvidia hardware because they designed it around their hardware and put effort into making it good. They could have made OpenCL support better but instead decided to leverage their position in the market to lock people into their proprietary eco-system and stifle competition.[/QUOTE]
True. OpenCL on NV cards is lousy, no matter how you put it, and we had a lot of bad experiences years ago when we tried to port miners to NV cards; they were about 3 times slower than their AMD counterparts.

preda 2018-06-25 10:46

cudaOwl
 
I added an initial CUDA backend to gpuOwl. I expect this to be rough, buggy, and not yet optimized, but it's a start.

The approach I ended with was to use most of the same codebase, but split out two backends, OpenCL and CUDA.

[I'm thinking, should I rename the previous gpuOwl to openOwl for symmetry with cudaOwl?]

So the savefile format, and much of the logic, is shared between cudaOwl and gpuOwl.

There are some notable differences though:
- gpuOwl supports "offset extension", which means varying the offset (aka "shift") when a PRP error is encountered. Unfortunately it's not a big deal: this trick achieves only about 0.5% exponent extension for a given FFT size. It was motivated by the severe lack of FFT size choice in openOwl. (cudaOwl doesn't have "offset".)

- cudaOwl has a rich choice of FFT sizes (unlike openOwl). FFT selection is controlled with the "-fft" argument, which allows specifying hard sizes such as 4096K or 4M, or delta steps from the "default" size for the exponent, such as +1 or -1.

A few nice things:
- it's possible to switch the savefile between CUDA/OpenCL in midflight.
- it's possible to change the FFT size in midflight.

Not so nice:
the performance on the GTX 1080 is disappointing: 5.9 ms/it at the PRP wavefront, 4480K FFT. (So I don't think it's such a good idea to do PRP or LL on Nvidia yet. Probably TF is a better fit for the 32-bit-oriented hardware.)
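[To put 5.9 ms/it in wall-clock terms: a PRP test takes roughly exponent x ms/it, one squaring per iteration. The 77M wavefront exponent below is an illustrative assumption, not a figure from the post:]

```python
def prp_days(exponent, ms_per_it):
    """Approximate whole-test wall time in days (about one iteration
    per exponent bit)."""
    return exponent * ms_per_it / 1000 / 86_400

print(round(prp_days(77_000_000, 5.9), 1))  # ~5.3 days per test at 5.9 ms/it
```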


All times are UTC.