mersenneforum.org

gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

airsquirrels 2019-01-09 00:01

[QUOTE=preda;505076]While OpenCL *may* be portable to some degree, the performance is not portable (and thus, IMO, the whole point of OpenCL "portability" is moot). I mean that even if it would run on an FPGA, it would probably run extremely slow before perf tuning.

In practice, I strongly expect GpuOwl to not run at all on an FPGA. It does use LDS (Local Data Share) which likely is not available on FPGA. It uses DP FP heavily, which may not be present as specialized hardware sub-elements on the FPGA, and thus would be very expensive and rather slow to implement on plain FPGA.

For FPGA, I think a different design that plays into FPGA's strengths is needed. And maybe some specialized DP units would help too.[/QUOTE]

I have access to quite a few different FPGAs, and would happily provide them to anyone who wants to develop such a thing for trial factoring or prime searching.

Happy to donate a few hundred of our Acorn boards to the cause, or a dozen VU9Ps. I do also have access to HBM FPGAs, but I don’t expect them to beat NVIDIA GPUs with the same memory bandwidth, since bandwidth seems to be the issue.

GP2 2019-01-09 00:52

[QUOTE=airsquirrels;505341]but I don’t expect them to beat NVIDIA GPUs with the same memory bandwidth, since bandwidth seems to be the issue.[/QUOTE]

OK, for LL testing there would be issues, but what about factoring?

Factoring doesn't need memory bandwidth. Doesn't need DP either.

Would there be any hope of running mfakto (OpenCL) on an FPGA?

preda 2019-01-09 03:41

[QUOTE=GP2;505346]OK, for LL testing there would be issues, but what about factoring?

Factoring doesn't need memory bandwidth. Doesn't need DP either.

Would there be any hope of running mfakto (OpenCL) on an FPGA?[/QUOTE]

I don't have experience with FPGA development, so I'm not able to help here. But this is what I would see as an approach:

- extract tiny, streamlined, simplified OpenCL components from a trial factorer, e.g. a very basic sieve or a simple modular exponentiation
- test and adapt each one for the FPGA in isolation
- repeat with the next component
- when all the basic pieces work, put them together into an FPGA TFer

Starting with mfakto as a whole may not work as easily. Anyway, somebody with more FPGA experience should try, I guess.
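
As an illustration of what one such small, isolated component might look like, here is a minimal Python sketch of the modular-exponentiation check at the heart of trial factoring. It is not taken from mfakto; the function names (`pow2_mod`, `candidates`, `trial_factor`) are made up for this example. It relies on two standard facts: q divides 2^p - 1 exactly when 2^p mod q = 1, and any factor of a Mersenne number with odd prime exponent p has the form q = 2kp + 1 with q = +-1 (mod 8).

```python
def pow2_mod(p, q):
    """Compute 2**p mod q by left-to-right square-and-multiply."""
    r = 1
    for bit in bin(p)[2:]:      # bits of p, most significant first
        r = (r * r) % q         # square once per bit
        if bit == "1":
            r = (r * 2) % q     # multiply by the base (2) on set bits
    return r


def candidates(p, k_max):
    """Yield q = 2*k*p + 1 for k = 1..k_max, keeping only q = +-1 (mod 8)."""
    for k in range(1, k_max + 1):
        q = 2 * k * p + 1
        if q % 8 in (1, 7):
            yield q


def trial_factor(p, k_max):
    """Return every candidate q that divides 2**p - 1."""
    return [q for q in candidates(p, k_max) if pow2_mod(p, q) == 1]


if __name__ == "__main__":
    print(trial_factor(11, 10))  # 2**11 - 1 = 2047 = 23 * 89
```

A real trial factorer would also sieve out candidates with small prime factors before the exponentiation step; that sieve is the other component mentioned above and is left out here.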

clarke 2019-01-09 04:58

[QUOTE=kriesel;505338]It's not broken, it's just your setup is incompatible.
[CODE]gpuOwL v1.9- GPU Mersenne primality checker
Radeon 500 Series 8 @f:0.0, gfx804 1203MHz

OpenCL compilation in 2147 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=76812401u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFP_DP=1 "
PRP-3: FFT 4M (1024 * 2048 * 2) of 76812401 (18.31 bits/word) [2018-01-23 12:43:49 Central Standard Time]
Starting at iteration 25373000
OK 25373000 / 76812401 [33.03%], 0.00 ms/it; ETA 0d 00:00; 6d6a6ebc97092826 [12:43:57]
OK 25374000 / 76812401 [33.03%], 11.97 ms/it; ETA 7d 03:05; bb937b8a48c69d60 [12:44:17]
OK 25375000 / 76812401 [33.03%], 11.97 ms/it; ETA 7d 03:04; b81a6f51602c2bd8 [12:44:36]
OK 25380000 / 76812401 [33.04%], 11.96 ms/it; ETA 7d 02:50; 60bcb33b85922094 [12:45:44]
OK 25390000 / 76812401 [33.05%], 12.00 ms/it; ETA 7d 03:26; 516093b7988f8ac4 [12:47:52]
OK 25400000 / 76812401 [33.07%], 12.00 ms/it; ETA 7d 03:23; 5313239afe8bcffe [12:50:00]
OK 25420000 / 76812401 [33.09%], 12.00 ms/it; ETA 7d 03:20; d04bc7fd72b07e36 [12:54:07]
OK 25440000 / 76812401 [33.12%], 12.00 ms/it; ETA 7d 03:15; 679e6f34ac35a983 [/CODE][/QUOTE]
Your setup is an RX 5xx with OpenCL 2.0. Indeed, something is wrong at my end:
[code]
gpuOwL v1.9- GPU Mersenne primality checker
AMD Radeon HD 5800 Series 20 @1:0.0, Cypress 850MHz
OpenCL compilation error -11 (args -I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=76812401u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFP_DP=1 )
Error: aclBinary init failure

".\gpuowl.cl", line 67: warning: OpenCL extension is now part of core
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
^


OpenCL compilation in 2771 ms, with "-I. -cl-fast-relaxed-math -DEXP=76812401u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFP_DP=1 "
PRP-3: FFT 4M (1024 * 2048 * 2) of 76812401 (18.31 bits/word) [2019-01-09 07:49:05]
Starting at iteration 0
OK 0 / 76812401 [ 0.00%], 0.00 ms/it; ETA 0d 00:00; 0000000000000003 [07:49:21]
EE 1000 / 76812401 [ 0.00%], 27.18 ms/it; ETA 24d 03:55; c89d15ae90d209ec [07:50:03]
EE 1000 / 76812401 [ 0.00%], 27.18 ms/it; ETA 24d 03:55; c89d15ae90d209ec [07:50:45] (1 errors)
EE 1000 / 76812401 [ 0.00%], 27.15 ms/it; ETA 24d 03:12; c89d15ae90d209ec [07:51:28] (2 errors)
[/code]
Wondering if somebody has run 1.9 with 5xxx/6xxx series successfully.

kriesel 2019-01-09 11:25

[QUOTE=clarke;505365]Wondering if somebody has run 1.9 with 5xxx/6xxx series successfully.[/QUOTE]Have you tried mfakto? (Might keep your HD5870 usefully busy while you look for a solution or save for a new card)

clarke 2019-01-09 19:44

[QUOTE=kriesel;505390]Have you tried mfakto? (Might keep your HD5870 usefully busy while you look for a solution or save for a new card)[/QUOTE]
Yep, thank you, mfakto works well. For now I'll try to figure out whether the different OpenCL releases (15.7.1 vs. 15.11.1) make a difference for gpuowl.

SELROC 2019-01-15 10:23

[QUOTE=clarke;505365]Your setup is an RX 5xx with OpenCL 2.0. Indeed, something is wrong at my end:
[code]
gpuOwL v1.9- GPU Mersenne primality checker
AMD Radeon HD 5800 Series 20 @1:0.0, Cypress 850MHz
OpenCL compilation error -11 (args -I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=76812401u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFP_DP=1 )
Error: aclBinary init failure

".\gpuowl.cl", line 67: warning: OpenCL extension is now part of core
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
^


OpenCL compilation in 2771 ms, with "-I. -cl-fast-relaxed-math -DEXP=76812401u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFP_DP=1 "
PRP-3: FFT 4M (1024 * 2048 * 2) of 76812401 (18.31 bits/word) [2019-01-09 07:49:05]
Starting at iteration 0
OK 0 / 76812401 [ 0.00%], 0.00 ms/it; ETA 0d 00:00; 0000000000000003 [07:49:21]
EE 1000 / 76812401 [ 0.00%], 27.18 ms/it; ETA 24d 03:55; c89d15ae90d209ec [07:50:03]
EE 1000 / 76812401 [ 0.00%], 27.18 ms/it; ETA 24d 03:55; c89d15ae90d209ec [07:50:45] (1 errors)
EE 1000 / 76812401 [ 0.00%], 27.15 ms/it; ETA 24d 03:12; c89d15ae90d209ec [07:51:28] (2 errors)
[/code]Wondering if somebody has run 1.9 with 5xxx/6xxx series successfully.[/QUOTE]


It may be that your FFT size is too small for the exponent.
Try specifying the argument "-fft 5M".

kriesel 2019-01-15 15:04

[QUOTE=SELROC;505962]It may be that your FFT size is too small for the exponent.
Try specifying the argument "-fft 5M".[/QUOTE]
Belay that; V1.9 predates the 5M FFT, which Preda implemented in V2.0. The purpose of running V1.9 was to try to get back to a version not requiring OpenCL 2.0. If FFT size is an issue he could try -fft M61 instead of DP. It's slower than 4M DP but gives about 7% higher max exponent for the 4M size, and is faster than 8M DP. But the OpenCL version appears to still be an issue for his old GPU's driver at V1.9. The 4M DP transform in gpuOwL was capable of a 78M exponent, as I recall.
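
To make the FFT-size numbers concrete, the bits-per-word figure printed in the logs above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: `bits_per_word` is a made-up helper, and the 78M limit is the figure recalled above, not a value read from the gpuOwL source.

```python
def bits_per_word(exponent, width, height):
    """Average bits stored per FFT word for a WIDTH x HEIGHT x 2 transform."""
    n_words = width * height * 2   # 1024 * 2048 * 2 = 2**22 words ("4M")
    return exponent / n_words


if __name__ == "__main__":
    # Exponent from the logs; gpuOwL prints "18.31 bits/word" for it.
    print(f"{bits_per_word(76812401, 1024, 2048):.2f}")
    # The recalled ~78M ceiling for 4M DP, expressed in bits per word.
    print(f"{bits_per_word(78_000_000, 1024, 2048):.2f}")
```

By this measure 76812401 sits below the recalled 4M DP ceiling, which suggests the FFT size alone does not explain the errors.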

SELROC 2019-01-15 15:18

[QUOTE=kriesel;505982]Belay that; V1.9 predates the 5M FFT, which Preda implemented in V2.0. The purpose of running V1.9 was to try to get back to a version not requiring OpenCL 2.0. If FFT size is an issue he could try -fft M61 instead of DP. It's slower than 4M DP but gives about 7% higher max exponent for the 4M size, and is faster than 8M DP. But the OpenCL version appears to still be an issue for his old GPU's driver at V1.9. The 4M DP transform in gpuOwL was capable of a 78M exponent, as I recall.[/QUOTE]


So there is no hope for this version?

preda 2019-01-24 09:24

P-1
 
It is my pleasure to announce.. P-1 in GpuOwl. Good old classic P-1.

1. worktodo.txt

PFactor=90551623
PFactor=AID,1,2,90551623,-1,77,2
PFactor=N/A,1,2,90551623,-1,77,2

(in all the PFactor cases above, only the exponent and the AID are used)

By default the P-1 task is processed with B1=1M and B2=30 * B1. These can be overridden by prepending the limits to any PFactor line above, with this syntax:
B1=2000000;PFactor=90551623
B1=500000,B2=10000000;PFactor=90551623
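
For anyone scripting worktodo.txt entries, the line formats above can be read with a short Python sketch. This is not GpuOwl's own parser: `parse_pfactor` and its return layout are invented for illustration, and it assumes that when only B1 is overridden, B2 still defaults to 30 times that new B1.

```python
def parse_pfactor(line, default_b1=1_000_000, b2_ratio=30):
    """Parse one PFactor worktodo line into a dict (illustrative only)."""
    b1 = b2 = None
    line = line.strip()

    # Optional "B1=...[,B2=...];" prefix carrying the bounds.
    if ";" in line:
        limits, line = line.split(";", 1)
        for item in limits.split(","):
            key, _, value = item.partition("=")
            if key.strip() == "B1":
                b1 = int(value)
            elif key.strip() == "B2":
                b2 = int(value)

    kind, _, rest = line.partition("=")
    if kind != "PFactor":
        raise ValueError(f"not a PFactor line: {line!r}")

    fields = rest.split(",")
    if len(fields) == 1:
        # Short form: PFactor=<exponent>
        aid, exponent = None, int(fields[0])
    else:
        # Long form: PFactor=<AID>,1,2,<exponent>,-1,...
        aid = None if fields[0] == "N/A" else fields[0]
        exponent = int(fields[3])

    if b1 is None:
        b1 = default_b1
    if b2 is None:
        b2 = b2_ratio * b1  # assumption: the ratio applies to the overridden B1
    return {"aid": aid, "exponent": exponent, "b1": b1, "b2": b2}
```

For example, `parse_pfactor("B1=500000,B2=10000000;PFactor=90551623")` yields the exponent 90551623 with both bounds taken from the prefix, while the long forms contribute only the AID and the exponent.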


The P-1 in GpuOwl always has E=2 (a stage 2 parameter). The D parameter ("block size") is normally computed automatically based on the amount of memory available on the GPU. It can also be specified on the command line, e.g. -D 210. The block size D must be a multiple of 210. Good values are D=2310 (though that would not fit in a GPU with 8GB of RAM) and D=210 or small multiples of 210.

P-1 does not save its progress to a savefile. If it is stopped (crash etc.), the progress is lost.

At this stage I'm very interested in bug reports. Most importantly, in situations where a factor that should be detected given the B1/B2 is not found.
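
For readers new to the method, stage 1 of classic P-1 on a Mersenne number can be sketched in a few lines of Python. This is a toy illustration, not GpuOwl's implementation: the helper names are invented, base 3 is an assumption, and stage 2 (with its B2, D and E parameters) is left out. The idea is to raise the base to E = 2 * p * (product of the largest prime powers q^k <= B1) modulo M_p = 2^p - 1 and then take a GCD. Any prime factor f of M_p for which f - 1 divides E shows up in that GCD.

```python
from math import gcd


def primes_up_to(n):
    """Simple sieve of Eratosthenes returning all primes <= n."""
    is_prime = [True] * (n + 1)
    is_prime[0:2] = [False, False]
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            for j in range(i * i, n + 1, i):
                is_prime[j] = False
    return [i for i, flag in enumerate(is_prime) if flag]


def pminus1_stage1(p, b1, base=3):
    """Return gcd(base**E - 1, 2**p - 1) for the stage 1 exponent E."""
    m = (1 << p) - 1
    exp = 2 * p                  # every factor of M_p is 1 mod 2p
    for q in primes_up_to(b1):
        qk = q
        while qk * q <= b1:      # largest power of q not exceeding B1
            qk *= q
        exp *= qk
    x = pow(base, exp, m)
    return gcd(x - 1, m)


if __name__ == "__main__":
    # 2**29 - 1 = 233 * 1103 * 2089; the result is a multiple of 233.
    print(pminus1_stage1(29, 5))
```

With p = 29 and B1 = 5, the factor 233 is caught because 233 - 1 = 2^3 * 29 divides E = 2 * 29 * 60. A useful bug report would follow the same reasoning: show a known factor f where f - 1 is smooth enough for the chosen B1/B2 and yet no factor is reported.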

preda 2019-01-24 09:29

GpuOwl v6.1, just committed to GitHub, has P-1. It needs GMP (for the GCD done on the CPU, as before with PRP-1).

I must say, it was rather hard for me to understand P-1 stage 2. (After the fact it doesn't look so terrible; I think I could explain it simply now.)

I found Alexander Kruppa's thesis useful:
[url]https://tel.archives-ouvertes.fr/file/index/docid/477005/filename/thesis.ps[/url]
(although even that was not easy reading).

