mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

kriesel 2019-09-06 15:09

[QUOTE=AlsXZ;525329]Hello! I just started playing with gpuOwL and have some questions. As I understand it, the software is optimized for AMD GPUs. I'm trying to run it on an Nvidia GeForce GTX 1050 Ti on Linux. It works only with -use NO_ASM (without it I get errors), and I got these results for a 92M exponent: ETA 16.5 days and 15590 us/sq. What do you think: is that an OK speed for this GPU, or is something going wrong?[/QUOTE]Times don't sound drastically wrong to me without really checking. I get about 9 ms/iter in CUDALucas on 51M on a GTX 1050 Ti, so I would expect around 17 ms/iter in CUDALucas at 92M. Try some other -use choices, like ORIG_X2. Try different FFTs: -fft +1, etc. In my experience, on AMD or NVIDIA, the fastest FFT is generally, but not always, the default. See the last attachment at [url]https://www.mersenneforum.org/showpost.php?p=488535&postcount=2[/url] for examples. (My experience is on Windows.) Run a PRP double check to confirm your setup's reliability. Have fun!

preda 2019-09-06 15:13

[QUOTE=AlsXZ;525329]Hello! I just started playing with gpuOwL and have some questions. As I understand it, the software is optimized for AMD GPUs. I'm trying to run it on an Nvidia GeForce GTX 1050 Ti on Linux. It works only with -use NO_ASM (without it I get errors), and I got these results for a 92M exponent: ETA 16.5 days and 15590 us/sq. What do you think: is that an OK speed for this GPU, or is something going wrong?[/QUOTE]

I don't know how fast the GTX 1050 Ti is with cudaLucas; that would give a comparison point. GpuOwl is indeed optimized for AMD.

preda 2019-09-06 15:24

[QUOTE=kriesel;525330][LIST][*]save files for P-1 & resume from them. Higher exponents or slower gpus can take days or weeks. Probably save under different extensions than for PRP, and distinguish by stage.
- yes, looking into this.
[*]By default, setting P-1 bounds automatically to fit gputo72 bounds based on exponent, rather than the current fixed B1 and B2 defaults. See [URL]https://www.mersenneforum.org/showpost.php?p=525112&postcount=1306[/URL] The pdf attachment at [URL]https://www.mersenneforum.org/showpost.php?p=522257&postcount=23[/URL] gives simple power fits for B1 and B2. Rounding, perhaps rounding up, would be good. Exponent-scaled bounds seem straightforward to implement and would make running P-1 more automatic, with less user involvement.

- yes it makes sense to provide a sensible default for the bounds
[*]less cpu usage on NVIDIA in P-1. See near the end of [URL]https://www.mersenneforum.org/showpost.php?p=525251&postcount=1325[/URL] for a description of considerably higher cpu usage on NVIDIA than on AMD in gpuowl P-1: an additional core, continuously. I have no idea why it does that.

- I have a suspicion why that is: Nvidia chose busy-wait as the default for CUDA, i.e. spinning a CPU core at 100% just to gain a tiny bit of latency when reading events from the GPU. A very wasteful default, IMO. Anyway, with CUDA this default can be changed to a non-busy wait, but I have no idea how to do that for OpenCL-on-Nvidia.
[*]If there's a constraint on stage 1 or stage 2 feasible exponent, due to program limits or gpu memory or whatever, log to console and log file what the issue is and skip a worktodo item that exceeds the limit and try to continue with the next worktodo entry. It would in some cases be possible to run stage 1 and not stage 2.

- a problem is: if that line is not done, it should remain in worktodo.txt. But if it stays in worktodo.txt, GpuOwl will always attempt it first. (i.e. the way worktodo.txt is handled now would need to be extended, and it's not clear to me how)[/LIST][/QUOTE]

answered inline

preda 2019-09-06 15:33

Recent changes
 
Some recent changes to savefiles:

- instead of putting all the .owl files in the current directory, a sub-folder named after the exponent (e.g. "95123123") is now created per exponent, and the savefiles are put in that folder.

So, if upgrading with an exponent in progress, it would be good to move the savefile into this folder so that GpuOwl finds and continues it instead of starting from the beginning (because the new folder is empty).

- the "previous" savefile is now named 95123123-old.owl instead of 95123123-prev.owl
- the "temporary/new" savefile is now 95123123-new.owl instead of 95123123-temp.owl
- there are no persistent checkpoints created anymore (before they were created every 20M iters)
- if the normal checkpoint 95123123.owl can't be loaded, the 95123123-old.owl is automatically attempted

The "folder per exponent" is partly in preparation for P-1 savefiles.
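The upgrade step described above can be sketched in shell. The exponent 95123123 and the savefile names follow the examples in this post; the `touch` lines just create stand-ins for existing savefiles:

```shell
# Hypothetical upgrade scenario: exponent 95123123 is in progress and its
# savefiles sit in the current directory, as older GpuOwl versions left them.
touch 95123123.owl 95123123-old.owl   # stand-ins for existing savefiles

# New layout: one sub-folder per exponent. Move the savefiles there so
# GpuOwl resumes the run instead of restarting from iteration 0.
mkdir -p 95123123
mv 95123123*.owl 95123123/
```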

kriesel 2019-09-06 15:38

[QUOTE=preda;525335]answered inline[/QUOTE]
Thanks for the update.

On the last one, gpuowl could convert the problematic worktodo entry to a comment. Then it's still there for the user to refer to and react to, and it's readily and efficiently skipped next time by gpuowl.
Suppose a line is present that can be run in P-1 stage 1 but not stage 2:
gpuowl could run stage 1, then convert the line to a comment.

Or if an exponent is mistyped with too many digits for gpuowl to handle, comment out that worktodo line and go on to the next instead of halting.

A single leading character is already enough to keep gpuowl from running a worktodo line; ";" or "#", for example, although it does complain a bit. Mfaktc, mfakto, and CUDAPm1 support the following as comment identifiers:
//
\\
#
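The comment-out idea could be sketched as follows. This is a hypothetical helper, not actual GpuOwl code; "#" is arbitrarily chosen as the comment marker, and the reason string is made up:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper (not part of GpuOwl): prefix an unusable worktodo
// entry with "#" plus a reason, so it stays visible to the user but is
// skipped the next time the file is read.
std::string commentOut(const std::string& line, const std::string& reason) {
    return "# " + reason + ": " + line;
}

// Rewrite the whole worktodo text, commenting out one problematic entry
// and leaving every other line untouched.
std::string skipEntry(const std::string& worktodo, const std::string& bad,
                      const std::string& reason) {
    std::istringstream in(worktodo);
    std::ostringstream out;
    std::string line;
    while (std::getline(in, line))
        out << (line == bad ? commentOut(line, reason) : line) << '\n';
    return out.str();
}
```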

kriesel 2019-09-06 15:43

[QUOTE=preda;525336]Some recent changes to savefiles:
...
- there are no persistent checkpoints created anymore (before they were created every 20M iters)
...
The "folder per exponent" is partly in preparation for P-1 savefiles.[/QUOTE]I'm sorry to see the checkpoints go. If/when a Mersenne prime is found with gpuowl, they allow rapid parallel confirmation. Other applications give the user the choice to save them or not.
At what commit or version did this change occur?

Glad to hear you're working toward P-1 save files.

preda 2019-09-06 15:57

[QUOTE=kriesel;525339]I'm sorry to see the checkpoints go. If/when a mersenne prime is found with gpuowl, they allow rapid parallel confirmation. Other applications give the user a choice to save them or not.
At what commit or version did this change occur?
[/QUOTE]

The most recent commit, now tagged v6.8: v6.8-0-gab732a0

About prime confirmation, that would most likely be done with LL tests. For rapid confirmation of PRP, the VDF proof system is much better.

PontiacGTX 2019-09-06 16:19

[QUOTE=preda;525193]uint64_t is not a type in OpenCL. Use "unsigned long" instead, which is guaranteed (in OpenCL) to be 64bits.[/QUOTE]
Thanks, I was trying to figure out why I couldn't use unsigned long long in an OpenCL kernel. I didn't know it uses unsigned long as the equivalent 64-bit integer type.
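For reference, a minimal OpenCL C kernel using the 64-bit type. OpenCL C guarantees that long/ulong are 64 bits, while uint64_t and unsigned long long are not part of the language. This is an illustrative kernel fragment; it needs an OpenCL host program to actually run:

```c
// OpenCL C: "ulong" (unsigned long) is guaranteed to be 64 bits,
// unlike host-side "unsigned long", whose width varies by platform.
__kernel void add64(__global const ulong* a, __global const ulong* b,
                    __global ulong* out) {
    size_t i = get_global_id(0);
    out[i] = a[i] + b[i];
}
```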

kriesel 2019-09-06 16:29

[QUOTE=preda;525335]answered inline[/QUOTE][QUOTE][COLOR=Blue]less cpu usage on NVIDIA in P-1. See near end of [URL="https://www.mersenneforum.org/showpost.php?p=525251&postcount=1325"]https://www.mersenneforum.org/showpo...postcount=1325[/URL] for a description of considerably higher cpu usage on NVIDIA than on AMD in gpuowl P-1, an additional core continuously.. I have no idea why it does that.[/COLOR]

- I have a suspicion why is that: Nvidia chose the "busy-wait" by default for CUDA. i.e. spin a CPU core 100% just to gain a tiny bit of latency when reading events from the GPU. A very wasteful default IMO. Anyway, with CUDA this default can be configured to "non-busy" wait, but I have no idea how to do that for OpenCL-on-Nvidia.[/QUOTE]Impact varies. On a system capable of and configured for hyperthreading it's not too bad. On a system where hyperthreading is either not implemented or not enabled (such as turning it off in the BIOS), the impact can be considerable. Some of my systems are old processors with fast new gpus. These need to run multiple cores per worker in mprime/prime95, to avoid exponent expiration before completion. It's also often more efficient to run multiple cores per worker; there's only so much cache to go around. Without hyperthreading, cpu intensive activities like P-1 gcd or the spin-wait stop an entire worker, multiple cores, completely, for the duration even while they use only one of those cores.

Re mitigating the 100% core with OpenCL, looks like NVIDIA introduced the issue around driver 270, shrugged and left it up to the developers:
[URL]https://github.com/openmm/openmm/issues/1541[/URL] mentions getting it to 10% or 4%;
[URL]https://devtalk.nvidia.com/default/topic/978985/cuda-programming-and-performance/opencl-busy-wait-still-not-fixed/2[/URL]
[URL]https://devtalk.nvidia.com/default/topic/494659/execute-kernels-without-100-cpu-busy-wait-/[/URL]
I'm generally at driver 378 or above; the latest gpus require a 4xx-series driver.
It's always something or other.
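The two waiting strategies under discussion can be illustrated in portable C++. This is a conceptual sketch, not NVIDIA driver code; in CUDA the blocking behavior is selected with cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync), and per the links above no equivalent knob is exposed for OpenCL on NVIDIA:

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Busy-wait: the waiting thread spins, polling a flag. This is what the
// NVIDIA OpenCL runtime reportedly does while waiting on GPU events,
// keeping one CPU core at 100% for slightly lower wakeup latency.
bool busyWaitDemo() {
    std::atomic<bool> done{false};
    std::thread gpu([&] { done.store(true); });  // stand-in for GPU work
    while (!done.load()) { /* burns a full core while waiting */ }
    gpu.join();
    return true;
}

// Blocking wait: the waiting thread sleeps until notified, freeing the
// core. This is the behavior cudaDeviceScheduleBlockingSync selects.
bool blockingWaitDemo() {
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
    std::thread gpu([&] {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
    });
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [&] { return done; });  // sleeps, near-zero CPU
    gpu.join();
    return done;
}
```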

kriesel 2019-09-06 16:33

[QUOTE=preda;525342]most recent commit now, tagged v6.8: v6.8-0-gab732a0[/QUOTE]Thanks. Will build and try later. Do you regard 6.7-4 as suitable for general use?[QUOTE]About prime confirmation, that would most likely be done with LL tests. For rapid confirmation of PRP, the VDF proof system is much better.[/QUOTE]It's likely that in the case of a PRP-found prime, both PRP and multiple LL on different software and systems would be done. Historically 3 or more confirmation runs are done by different persons.

Are you planning to also implement VDF?

preda 2019-09-06 21:46

[QUOTE=kriesel;525347]Thanks. Will build and try later. Do you regard 6.7-4 as suitable for general use?[/QUOTE] Yes I think it's fine.

[QUOTE]
Are you planning to also implement VDF?[/QUOTE] I would love to. It's a bigger chunk of work, and also requires wider agreement (on data, data formats) in order to allow implementing independent verifiers. (the verifiers would need to be independent to some degree in order to guarantee "no cheating" by the verifier)

