mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

kriesel 2018-01-19 18:16

gpuowl readme
 
Hi,
I recently pulled down the Windows executable zip file for gpuowl 1.9 from [URL]http://www.mersenneforum.org/showpost.php?p=471663&postcount=226[/URL], unzipped it, and read its README.md, which says in part:
[CODE]## gpuowl -help outputs:

```
gpuOwL v0.6 GPU Lucas-Lehmer primality checker; Mon Aug 21 23:47:40 2017
Command line options:
-logstep <N> : to log every <N> iterations (default 20000)
-savestep <N> : to persist checkpoint every <N> iterations (default 500*logstep == 10000000)
-checkstep <N> : do Jacobi-symbol check every <N> iterations (default 50*logstep == 1000000)
-uid user/machine : set UID: string to be prepended to the result line
-cl "<OpenCL compiler options>", e.g. -cl "-save-temps=tmp/ -O2"
-selftest : perform self tests from 'selftest.txt'
Self-test mode does not load/save checkpoints, worktodo.txt or results.txt.
-time kernels : to benchmark kernels (logstep must be > 1)
-legacy : use legacy kernels

-device <N> : select specific device among:
0 : 64x1630MHz gfx900; OpenCL 1.2
```[/CODE]
So I wrote a tiny batch file to do:
[CODE]if not exist gpuowlstart.txt gpuowl -selftest >>gpuowlstart.txt
gpuowl -help >>gpuowlstart.txt[/CODE]
which ran in a flash, too fast for a meaningful selftest. Opening the resulting gpuowlstart.txt, I see it contains only:
[CODE]gpuOwL v1.9- GPU Mersenne primality checker
Argument '-selftest' not understood
gpuOwL v1.9- GPU Mersenne primality checker
Argument '-help' not understood
[/CODE]Ok, I go check [URL]https://github.com/preda/gpuowl[/URL] for more current README.md information. It is the same there, with no info specific to v1.0-1.9.

Please update it to cover gpuowl 1.x also.

Thanks!

kriesel 2018-01-19 18:27

Try null parameters:
[CODE]$ gpuowl
gpuOwL v1.9- GPU Mersenne primality checker
Radeon 500 Series 8 @f:0.0, gfx804 1203MHz
Can't open 'worktodo.txt' (mode 'r')

Bye[/CODE]Try --help:

[CODE]$ gpuowl --help
gpuOwL v1.9- GPU Mersenne primality checker
Command line options:

-size 2M|4M|8M : override FFT size.
-fft DP|SP|M61|M31 : choose FFT variant [default DP]:
DP : double precision floating point.
SP : single precision floating point.
M61 : Fast Galois Transform (FGT) modulo M(61).
M31 : FGT modulo M(31).
-user <name> : specify the user name.
-cpu <name> : specify the hardware name.
-legacy : use legacy kernels
-dump <path> : dump compiled ISA to the folder <path> that must exist.
-verbosity <level> : change amount of information logged. [0-2, default 0].
-device <N> : select specific device among:
0 : Radeon 500 Series 8 @f:0.0, gfx804 1203MHz
1 : 12 @0:0.0, Intel(R) Xeon(R) CPU E5645 @ 2.40GHz 2394MHz

$[/CODE]Ah, options are very different from v0.6.

What are the legacy kernels?
What became of the self test?

[CODE]gpuOwL v1.9- GPU Mersenne primality checker
Radeon 500 Series 8 @f:0.0, gfx804 1203MHz
OpenCL compilation error -11 (args -I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEX
P=76812401u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFP_DP=1 )
C:\Users\ken\AppData\Local\Temp\\OCL4520T3.cl:1:10: fatal error: 'gpuowl.cl' fil
e not found
#include "gpuowl.cl"
^
1 error generated.

error: Clang front-end compilation failed!
Frontend phase failed compilation.
Error: Compiling CL to IR
[/CODE]Read some more of the forum, download gpuowl.cl from github since it was not in the zip file, try again.

Ok, now it seems to be working, with initial output showing 11.94-12.04 ms/iteration on the RX550 for PRP-3: FFT 4M (1024 * 2048 * 2) of 76812401 (18.31 bits/word).
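As a sanity check on those numbers (my arithmetic, not gpuowl output), the bits/word figure follows directly from the exponent and fft size:

```python
# Sanity-check the reported FFT geometry (numbers from the log above).
exponent = 76812401
width, height = 1024, 2048
fft_size = width * height * 2      # words in the transform
print(fft_size // 2**20, "M")      # 4 M
print(round(exponent / fft_size, 2), "bits/word")   # 18.31 bits/word
```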

kriesel 2018-01-19 19:46

(Later...) Whoa, what happened in that middle line? Over a minute per iteration computed (>5000 times the preceding and following lines), momentarily projecting a runtime of over 150 years!

[CODE]OK 80000 / 76812401 [ 0.10%], 11.99 ms/it; ETA 10d 15:27; 6ee0f8a8a97d7812 [2018-01-19 13:19:36 Central Standard Time]
OK 100000 / 76812401 [ 0.13%], 63362.75 ms/it; ETA 56258d 15:28; 3fb24c04ec7569db [2018-01-19 13:23:43 Central Standard Time]
OK 150000 / 76812401 [ 0.20%], 11.99 ms/it; ETA 10d 15:25; 10bf91703f69c302 [2018-01-19 13:33:50 Central Standard Time][/CODE]

Also, for this run begun at 1:02pm, logging to gpuowl.log appears to not be occurring.

kriesel 2018-01-19 23:29

[QUOTE=kriesel;477927](Later...) Whoa, what happened in that middle line? Over a minute per iteration computed (>5000 times the preceding and following lines), momentarily projecting a runtime of over 150 years!

[CODE]OK 80000 / 76812401 [ 0.10%], 11.99 ms/it; ETA 10d 15:27; 6ee0f8a8a97d7812 [2018-01-19 13:19:36 Central Standard Time]
OK 100000 / 76812401 [ 0.13%], 63362.75 ms/it; ETA 56258d 15:28; 3fb24c04ec7569db [2018-01-19 13:23:43 Central Standard Time]
OK 150000 / 76812401 [ 0.20%], 11.99 ms/it; ETA 10d 15:25; 10bf91703f69c302 [2018-01-19 13:33:50 Central Standard Time][/CODE]Also, for this run begun at 1:02pm, logging to gpuowl.log appears to not be occurring.[/QUOTE]

More iteration time anomalies are being displayed as the run continues, around every 70 to 90 minutes:
[CODE]OK 400000 / 76812401 [ 0.52%], 11.99 ms/it; ETA 10d 14:35; 1048549c8bed4e64 [2018-01-19 14:24:25 Central Standard Time]
OK 450000 / 76812401 [ 0.59%], 25352.29 ms/it; ETA 22407d 03:22; 22b21a753ab5f8b5 [2018-01-19 14:34:31 Central Standard Time]
OK 500000 / 76812401 [ 0.65%], 11.99 ms/it; ETA 10d 14:15; a7265bc29bf827f1 [2018-01-19 14:44:39 Central Standard Time][/CODE][CODE]OK 800000 / 76812401 [ 1.04%], 11.99 ms/it; ETA 10d 13:08; 6dd5e2686143154e [2018-01-19 15:44:59 Central Standard Time]
OK 900000 / 76812401 [ 1.17%], 12682.14 ms/it; ETA 11142d 19:40; e6f4ebf42ae5ab20 [2018-01-19 16:05:05 Central Standard Time]
OK 1000000 / 76812401 [ 1.30%], 11.99 ms/it; ETA 10d 12:28; 4562268d31760723 [2018-01-19 16:25:11 Central Standard Time]
[/CODE][CODE]OK 1145000 / 76812401 [ 1.49%], 11.95 ms/it; ETA 10d 11:12; 4243e97c3e1113e1 [2018-01-19 16:56:16 Central Standard Time]
OK 1150000 / 76812401 [ 1.50%], 253415.12 ms/it; ETA 221923d 00:32; 43c530d6fa88a445 [2018-01-19 16:57:24 Central Standard Time]
OK 1160000 / 76812401 [ 1.51%], 11.99 ms/it; ETA 10d 11:56; 059216a2fa1889b9 [2018-01-19 16:59:32 Central Standard Time][/CODE]It's apparently a discrepancy in computing the iteration times; the wall clock interval per 100,000 iterations does not appear to be fluctuating.

Log output apparently commits to disk only at program exit. I'd prefer it to flush to disk at least hourly, so that less log output is lost in the case of a system or application crash.
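Something along these lines is what I have in mind (an assumed design sketch, not gpuowl's implementation):

```python
import os, time

# Sketch of the preferred behavior (assumed, not gpuowl's code): flush
# the log to disk on a timer, so a crash loses at most the last interval
# of output rather than many hours of it.
class FlushingLog:
    def __init__(self, path, flush_every_s=3600):
        self.f = open(path, "a")
        self.every = flush_every_s
        self.last_flush = time.monotonic()

    def write(self, line):
        self.f.write(line + "\n")
        if time.monotonic() - self.last_flush >= self.every:
            self.f.flush()                 # push stdio buffer to the OS
            os.fsync(self.f.fileno())      # commit to disk
            self.last_flush = time.monotonic()

    def close(self):
        self.f.flush()
        self.f.close()
```

With flush_every_s at an hour, a crash costs at most the last hour of log lines instead of everything since launch.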

Built-in logging is a great feature, in my opinion.

For comparison, cllucas at half the fft length, on the same GPU, does:
[CODE]Iteration 10000 M( 37156667 )C, 0x67ad7646a1fad514, n = 2048K, clLucas v1.04 err = 0.0781 (1:20 real, 8.0871 ms/iter, ETA 83:25:53)
[/CODE]Scaling for iteration times is estimated as the 1.1 power of power-of-2 fft length (from a lot of CUDALucas testing). That extrapolates cllucas to about 17.3msec/iter, ~44% longer.

CUDALucas runs around 35 ms/iter for 4M fft length on a Quadro 2000 (2.92 times as long!); the mersenne.ca benchmarks indicate the Quadro 2000 and RX550 are quite comparable in LL performance near 75M. Time to check into OpenCL on NVIDIA, perhaps.

preda 2018-01-20 09:59

Thanks for the feedback! It seems there is a bug in the time-per-iteration computation; I'll look into that.

What OS and driver are you using? I'll try to investigate why the log is not flushed, but I suspect it's caused by different behavior Linux vs. Windows.

Yes the readme needs updating too. And other important things need to be done:
- new FFT sizes allowing efficient transition to "after 4M FFT"
- automatic communication with primenet... I don't know how hard this is.

kriesel 2018-01-20 14:00

2 Attachment(s)
[QUOTE=preda;477958]Thanks for the feedback! it seems there is one bug in the time-per-iteration computation, I'll look into that.

What OS and driver are you using? I'll try to investigate why the log is not flushed, but I suspect it's caused by different behavior Linux vs. Windows.

Yes the readme needs updating too. And other important things need to be done:
- new FFT sizes allowing efficient transition to "after 4M FFT"
- automatic communication with primenet... I don't know how hard this is.[/QUOTE]

You're very welcome. GPUOwL has been an impressively fast development, and that's been fun to watch. Early test results seem to indicate its performance is good. (Does it run on NVIDIA GPUs?)

Windows 7 is the OS for the reported observations and for most of my other GPU-equipped systems; the rest run Vista. See one of the attachments for driver info for what I reported. As often happens, the situation with logging is less simple than I first thought; see the log-lag attachment.

Time and calendar computations have long been the bane of programmers' lives. Counters wrap, standards abound, and timekeeping and the calendars' histories are riddled with rules changes and special cases. (There's an xkcd comic about that, which I can't find at the moment.) Happy hunting.

Aside from the occasional very long ms/iter values, there seems to be a difference of about 7 seconds between ms/iter * iteration delta and the wall-clock time difference output by GPUOwL. Perhaps that's the time to update the save files? My impression is that other applications include the save-file time in determining their iteration rates, while gpuowl apparently does not. (Are you not using a memory buffer/asynchronous write to allow iteration to continue during the save to disk?)

Something else I find interesting: gpuowl seems resistant to command-line redirection of its console output to a file. (Is it writing to stderr rather than stdout?) Built-in logging makes that less important, though the log lag observed makes it more significant.
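A quick way to test the stderr guess (illustrative only; the tiny child process below just stands in for gpuowl):

```python
import subprocess, sys

# Run a child that logs to stderr only, as gpuowl may be doing.
child = [sys.executable, "-c", "import sys; sys.stderr.write('log line\\n')"]
r = subprocess.run(child, capture_output=True, text=True)
print(repr(r.stdout))   # '' -- a plain `prog > out.txt` captures nothing
print(repr(r.stderr))   # 'log line\n' -- `prog > out.txt 2>&1` gets it
```

If that's what's happening, appending 2>&1 to the redirection at the command line should capture the output.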

Documentation of a moving target is not easy, but efforts to catch it up soon (and periodically) would be appreciated. In other applications the authors moved on and it never got done; years later the apps are still in use. It does not get easier with the passage of time.

As I understand it, gpuowl now has 2M, 4M, and 8M fft lengths, so one is not stuck doing DC when the first-time tests suitable for 4M are soon exhausted, but the exponents whose suitable fft length falls between 4M and 6M could benefit a lot from a 6M fft length (or other sufficiently fast intermediate lengths). When implementing many fft lengths, selecting the right one has presented challenges; from what I've read, cudalucas handles fft length selection, for speed within the error constraints, better than cllucas thus far.

When you eventually get around to trying primenet linkage, see
[URL]http://mersenneforum.org/showpost.php?p=430818&postcount=406[/URL]
and there's a tabulation I made of other folks' efforts at
[URL]http://www.mersenneforum.org/showthread.php?t=22450&page=3[/URL]
It may be useful to look at a number of cases of how others have approached it.

There's [URL="https://www.explainxkcd.com/wiki/index.php/1205:_Is_It_Worth_the_Time%3F"]https://www.explainxkcd.com/wiki/index.php/1205:_Is_It_Worth_the_Time%3F[/URL] Of course all the numbers change when considering n users of gpuowl, depending on how you weight the value of your time versus theirs, and n. (It's logical to weight an hour of code authors' time as more valuable than users' time, since the number of coders able to handle gpu programming, ffts, etc, and willing to spend the amounts of time required to make code effective and reliable for mersenne hunting is a small number compared to the user base.)

kriesel 2018-01-20 17:10

log lag update
 
1 Attachment(s)
Now 12+ hours. Values of ms/iter around 6347 have become common (9 of the last 16). Note the log contents halt mid-record.
The latest version of GPU-Z and the RX550 seem to be failing to communicate about clocks and temperature.

kriesel 2018-01-20 22:10

[QUOTE=preda;471318]

...
Also, a dynamic step of the Gerbicz verification is implemented, which starts with a very small step of 2K iterations at program start (allowing the user to see that the program functions correctly) and ramps up towards 500K if no errors are encountered, or back down if errors are detected.
...
[/QUOTE]

The behavior I see in my logs is that it starts with 1k iterations.
I have sequences that differ near the start; iteration counts differ from line to line by 1k, 4k, 5k, 10k, 20k, ..., 50k, ..., 100k:
[CODE]OK 0 / 76812401 [ 0.00%], 0.00 ms/it; ETA 0d 00:00; 0000000000000003 [2018-01-19 13:02:44 Central Standard Time]
OK 1000 / 76812401 [ 0.00%], 11.96 ms/it; ETA 10d 15:06; aadc1acf24bf7d60 [2018-01-19 13:03:04 Central Standard Time]
OK 5000 / 76812401 [ 0.01%], 11.94 ms/it; ETA 10d 14:44; 3db0edb3db578456 [2018-01-19 13:03:59 Central Standard Time]
OK 10000 / 76812401 [ 0.01%], 12.04 ms/it; ETA 10d 16:46; 261173187b4c2ca3 [2018-01-19 13:05:07 Central Standard Time]
OK 20000 / 76812401 [ 0.03%], 11.99 ms/it; ETA 10d 15:41; 32413ccc78fbad77 [2018-01-19 13:07:14 Central Standard Time]
OK 40000 / 76812401 [ 0.05%], 11.99 ms/it; ETA 10d 15:36; 45e38194e8bad318 [2018-01-19 13:11:21 Central Standard Time]
...[/CODE]Next launch, ramping less rapidly:
1k,2k,5k,10k,..,20k,...40k,50k...,100k...,200k...
[CODE]OK 1142000 / 76812401 [ 1.49%], 0.00 ms/it; ETA 0d 00:00; 3db0227e36cac730 [2018-01-19 16:55:25 Central Standard Time]
OK 1143000 / 76812401 [ 1.49%], 11.96 ms/it; ETA 10d 11:26; 1b8377874165d88d [2018-01-19 16:55:45 Central Standard Time]
OK 1145000 / 76812401 [ 1.49%], 11.95 ms/it; ETA 10d 11:12; 4243e97c3e1113e1 [2018-01-19 16:56:16 Central Standard Time]
OK 1150000 / 76812401 [ 1.50%], 253415.12 ms/it; ETA 221923d 00:32; 43c530d6fa88a445 [2018-01-19 16:57:24 Central Standard Time]
OK 1160000 / 76812401 [ 1.51%], 11.99 ms/it; ETA 10d 11:56; 059216a2fa1889b9 [2018-01-19 16:59:32 Central Standard Time]
OK 1170000 / 76812401 [ 1.52%], 11.99 ms/it; ETA 10d 11:58; f517fce8185f4163 [2018-01-19 17:01:39 Central Standard Time]
OK 1180000 / 76812401 [ 1.54%], 11.99 ms/it; ETA 10d 11:55; 2e9b327003b87b79 [2018-01-19 17:03:47 Central Standard Time]
OK 1200000 / 76812401 [ 1.56%], 11.99 ms/it; ETA 10d 11:50; 505bc0afc72a4862 [2018-01-19 17:07:54 Central Standard Time]
[/CODE]Is it possible to save the ending Gerbicz step size from one run to begin the next? If not, would you consider it as an option, for hardware and installations that are stable? There's a slight execution speed advantage, and it reduces screen clutter. (My test run has worked its way up to 200k in under 24 hours, with no errors flagged yet, but if halted and restarted will reset to 1k interval and start the slow climb again from there. It should be very stable, as it's a brand new GPU on a fresh Windows install, patched to current, then gpuowl installed and run.)

M344587487 2018-01-21 09:19

[QUOTE=preda;477958]...
- automatic communication with primenet.. I don't know how hard this is.[/QUOTE]

Mlucas does it with a babysitting python script; maybe it can be modified for PRP easily enough:
[url]http://www.mersenneforum.org/mayer/README.html#reserve[/url]

kriesel 2018-01-21 18:39

Log lag update and iteration time
 
1 Attachment(s)
The log updated to disk last night, ~21 hours after the previous update, and is now 5 million iterations and about 17 hours behind the program's progress.

As the Gerbicz check interval increased, the anomalous ms/iter values became less extreme but more frequent. Now that the Gerbicz check has advanced to 500k intervals, the ms/iter times are consistently too high, with values ~2534.03n + 11.99, where n is 1 or 2 and ~12 ms is the expected iteration time. That corresponds to time errors of 1,267,015,000 or ~1.18 x 2^30 msec (~352 hours or 14 2/3 days) somewhere, occurring once or twice per 500k output interval. Taking the most common values of the 200k intervals, 6347.07 - 11.99 = 6335.08 ms/iter corresponds to an excess of 1,267,016,000 msec, very close to the 500k value; 1.208*2^20 seconds.
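The arithmetic above, spelled out:

```python
# Reproduce the 500k-interval excess computed above.
expected = 11.99                      # normal ms/iter
excess_per_iter = 2534.03             # anomalous surplus over expected, n = 1
excess_ms = excess_per_iter * 500_000
print(round(excess_ms))               # 1267015000 ms
print(round(excess_ms / 3_600_000))   # 352 hours
print(round(excess_ms / 2**30, 2))    # 1.18 (x 2^30 ms)
```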

kriesel 2018-01-21 20:45

[QUOTE=preda;471318]...
Anyway, now it's possible to select among these 4 transforms:
-fft DP : the old double precision floating point
-fft SP : single precision FP
-fft M61 : FGT(M61)
-fft M31 : FGT(M31)

Of these, SP is very fast but also useless at 2M FFT-size and up (it may prove useful for something at lower FFT sizes).

M31 has about 5 bits-per-word usable at 4M FFT size. It's not much use by itself, but can be tested.

M61 has deeper word bits than DP. So it can be used for real work. Unfortunately it's also slower than DP. Part of the slowness may be from poor compiler optimizations and that aspect may improve in the future, hopefully.

...[/QUOTE]

I've found that with NVIDIA CUDA code, running multiple instances per GPU often produces greater overall throughput. Has anyone tried this with gpuowl, or OpenCL in general, yet (other than mfakto, where it's known to raise throughput and device utilization)?

It seems to me from my limited NVIDIA testing that dissimilar code instances can produce significant increases in total throughput. For GpuOwL, it might be interesting to try one of
-fft DP
with one of
-fft M61
to work both the DP and integer units.


All times are UTC.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.