mersenneforum.org

gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

preda 2018-09-20 15:15

GpuOwl 4.x news (experimental)
 
News about 4.x (the freshest):

First, it's not well tested or stable yet. It has major changes, and it's slower! So using 4.x at this point is not recommended. (I'll announce when it becomes more stable and better tuned.)

I added a new P-1 first stage. This is "the first half" of a P-1 test. It's not very strong by itself, without the other half (the "second stage"). The main reason for introducing P-1 is that it's used by PRP-1 (below).
An example P-1 in worktodo.txt would be:
PFactor=B1:1000000,80000001
specifying the B1 bound for P-1, and the exponent. Anyway, the main reason to add such tasks is to debug or self-test.

I'm starting work on PRP-1. A worktodo line looks like:
PRPF=90000001
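
For illustration only, here is a minimal Python sketch of how the two example lines above could be split into fields. The function name and the returned dict layout are hypothetical and are not GpuOwl code; only the two line formats come from the examples above.

```python
def parse_worktodo_line(line):
    """Parse one of the two experimental worktodo formats shown above.

    Hypothetical helper for illustration; not part of GpuOwl.
    """
    kind, sep, args = line.strip().partition("=")
    if not sep:
        raise ValueError("expected KIND=ARGS: %r" % line)
    if kind == "PFactor":
        # Format: PFactor=B1:<bound>,<exponent>
        b1_part, exponent = args.split(",")
        key, _, b1 = b1_part.partition(":")
        if key != "B1":
            raise ValueError("expected B1:<bound>: %r" % line)
        return {"kind": kind, "B1": int(b1), "exponent": int(exponent)}
    if kind == "PRPF":
        # Format: PRPF=<exponent>
        return {"kind": kind, "exponent": int(args)}
    raise ValueError("unknown task kind: %r" % kind)


pfactor_task = parse_worktodo_line("PFactor=B1:1000000,80000001")
prpf_task = parse_worktodo_line("PRPF=90000001")
```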

The PRP-1 test needs a "K set" file, which contains a list of Ks (iteration numbers) at which to do the GCD test. The "K set" file is specified with the -kset option on the command line, like this:
-kset kset-2M-2M.txt

The first line of the Kset file holds the B1 value for which the K set was generated:
$ head -4 kset-2M-2M.txt
B1=2000000
650179
803657
863201

$ head -4 kset-1M-800.txt
B1=1000000
540907
599713
633377

When a PRP-1 test is started, the B1 is read from the kset file given with "-kset" on the command line ("kset.txt" by default). Then the PRP-1 task starts a P-1 sub-task with the same B1. When this P-1 completes, the PRP-1 proper starts.
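
A minimal Python sketch of reading that layout, assuming the file is exactly what the "head -4" output shows: a "B1=<value>" line followed by one iteration number per line. The function name is hypothetical and this is not how GpuOwl itself reads the file.

```python
def parse_kset(text):
    """Return (B1, list of Ks) from the text of a kset file.

    Assumes the layout shown above: first line "B1=<int>", then one K per line.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("B1="):
        raise ValueError("first line must look like B1=<value>")
    b1 = int(lines[0][len("B1="):])
    ks = [int(ln) for ln in lines[1:]]
    return b1, ks


# The first four lines of kset-2M-2M.txt as quoted above.
sample = "B1=2000000\n650179\n803657\n863201\n"
b1, ks = parse_kset(sample)
```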

The PRP-1 can only produce "type-4" residues as per [url]http://www.mersenneforum.org/showpost.php?p=468378&postcount=209[/url]

To unify code between PRP-1 and PRP I'm planning to change PRP to type-4 as well (it is type-1 now). This change only affects the value of the final residue, and the double-checking (which must consider the type). The residues at intermediate iterations remain unchanged.

We still have to discuss the details of what a PRP-1 result looks like.
In the lucky case, when a factor is found, it contains just a
factors=["24324"]
together with the exponent and B1. (and the common stuff: user, cpu, AID, timestamp).
The nice thing is that such a "positive" result does not need double-checking.

If a factor is not found, the PRP-1 result would contain:
- PRP status, usually "NP" (not prime)
- B1
- information allowing an exact reconstruction of the exponent x of 3^x used by P-1. Normally this is x==powerSmooth(B1), but for a Mersenne number M(p), 2*p is also included. Also, in my case I prefer to boost the power of "2", deviating from powerSmooth(B1).
What I propose is:
- the baseline is 2*p*powerSmooth(B1), and a key "B1bias" would indicate which primes deviate from the baseline and by what amount, e.g.
b1bias={"2":20, "7":-1}, meaning:
given p and B1, compute x:
x = 2 * p * powerSmooth(B1) * 2^20 * 7^-1
- res64 of 3^x, which allows double-checking the first-stage P-1
- res64 of base^(2^(p-1)) (this being type-4), where base==3^x above.
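
To make the proposed "b1bias" reconstruction concrete, here is a small Python sketch. It assumes powerSmooth(B1) means the product, over all primes q <= B1, of the largest power of q not exceeding B1 (that is, lcm(1..B1)); the function names are hypothetical. It is only practical for small B1 as a check, since for B1=1M the integer x has more than a million bits.

```python
from math import isqrt, prod


def primes_up_to(n):
    """Sieve of Eratosthenes: all primes <= n."""
    if n < 2:
        return []
    sieve = bytearray([1]) * (n + 1)
    sieve[0] = sieve[1] = 0
    for i in range(2, isqrt(n) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytes(len(range(i * i, n + 1, i)))
    return [i for i, flag in enumerate(sieve) if flag]


def power_smooth_exponents(b1):
    """Map each prime q <= b1 to the largest k with q**k <= b1 (assumed definition)."""
    exps = {}
    for q in primes_up_to(b1):
        k, qk = 1, q
        while qk * q <= b1:
            qk *= q
            k += 1
        exps[q] = k
    return exps


def p1_exponent(p, b1, b1bias):
    """x = 2 * p * powerSmooth(b1), adjusted by the per-prime deltas in b1bias."""
    exps = power_smooth_exponents(b1)
    for q in (2, p):                       # the extra 2*p for Mersenne M(p)
        exps[q] = exps.get(q, 0) + 1
    for prime_str, delta in b1bias.items():  # keys are strings, as in the proposal
        q = int(prime_str)
        exps[q] = exps.get(q, 0) + delta
        if exps[q] < 0:
            raise ValueError("bias removes more of %d than is present" % q)
    return prod(q ** e for q, e in exps.items())


# Toy values only: p=11, B1=10, so powerSmooth(10) = 2^3 * 3^2 * 5 * 7 = 2520.
x = p1_exponent(11, 10, {"2": 20, "7": -1})
```

With the toy values, x equals 2 * 11 * 2520 * 2^20 / 7, matching the formula above term by term.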


About the Kset file: one that is good for B1=1M and exponents around 90M is included as the default "kset.txt" in the sources. Other K sets may be generated using the tool kselect (in the sources).

I'll stop here, it looks rather complicated already.

SELROC 2018-09-21 08:02

[QUOTE=SELROC;496443]Hello Mihai, I am testing version 4.1 from yesterday, yes it is still slower by around 0.30 ms/it. The performance is better in version 3.6, this should help you track down the modifications that made the slowdown.[/QUOTE]


Even more, 0.50-0.80 ms/it slower depending on the exponent.




Side note: In version 3.5 a 332M exponent is really slow to load the checkpoint savefile.

preda 2018-09-21 11:20

[QUOTE=SELROC;496494]Even more, 0.50-0.80 ms/it slower depending on the exponent.[/QUOTE]
Are you using ROCm 1.9? I suppose you're doing normal PRP? (not the experimental new stuff)

[QUOTE]
Side note: In version 3.5 a 332M exponent is really slow to load the checkpoint savefile.[/QUOTE]
That may be normal. It's because the reconstruction of "data" from "check-bits" is done during loading. This does blockSize (used to be default 400) MULs, each MUL is about 2x normal iteration, so the expected time would be about:
800 x time-per-it. Is what you see slower than that?
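
As a rough worked version of that estimate in Python (the 5 ms/it figure is a made-up placeholder, not a measurement, and the function name is hypothetical):

```python
def expected_load_seconds(block_size, ms_per_iteration, mul_cost=2.0):
    """Estimated checkpoint load time: block_size MULs, each about mul_cost iterations."""
    return block_size * mul_cost * ms_per_iteration / 1000.0


# blockSize 400 and a hypothetical 5 ms/it give 800 x 5 ms = 4 seconds.
load_seconds = expected_load_seconds(400, 5.0)
```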

SELROC 2018-09-21 12:10

[QUOTE=preda;496508]Are you using ROCm 1.9? I suppose you're doing normal PRP? (not the experimental new stuff)[/QUOTE]

This is still amdgpu, PRP3




[QUOTE]That may be normal. It's because the reconstruction of "data" from "check-bits" is done during loading. This does blockSize (used to be default 400) MULs, each MUL is about 2x normal iteration, so the expected time would be about:

800 x time-per-it. Is what you see slower than that?[/QUOTE]


No, but I said it just to be sure I'm not hitting some bug.

preda 2018-09-22 02:35

[QUOTE=SELROC;496514]This is still amdgpu, PRP3[/QUOTE]
OK. Let's wait a bit before investigating this perf regression more -- maybe you'll move to ROCm at some point, and maybe it does not manifest on ROCm anymore, and then there's nothing to do on GpuOwl side.

ROCm 1.9 does seem to be easier to install on recent kernels, such as 4.18.


[QUOTE]No, but I said it just to be sure I'm not hitting some bug.[/QUOTE]
Yes, I'll look into improving loading time (I have an idea, need to implement it)

preda 2018-09-22 02:47

After more thinking about how to integrate PRP-1 (AKA "PRPF"), this is what I'm planning:

- offer two main task types (in worktodo.txt), PRP and TF. (i.e. do not offer P-1 as a standalone worktype yet).

- add a new command line argument, "-kset", which specifies a "Kset" file. This file contains a list of Ks, and the B1. One such file, "kset.txt", is committed in the sources.

- if no -kset is passed, implicitly B1=0 and only classic PRP(3) is done. (but the residue reported is type-4, for unification with the new PRP-1)

- if -kset is passed, with B1 > 0, then the PRP task is "upgraded" to PRP-1. First a P-1(B1) is done, followed by PRP-1.
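
The selection rule in the list above can be sketched in a few lines of Python. This is only an illustration of the plan as described here; the function name and the stage labels are hypothetical.

```python
def plan_prp_task(kset_b1=None):
    """Return the stages to run for a PRP task, given the B1 from the -kset file.

    kset_b1 is None when no -kset was passed, which is treated as B1=0.
    """
    b1 = kset_b1 or 0
    if b1 > 0:
        # PRP is "upgraded": first P-1 to B1, then PRP-1.
        return ["P-1(B1=%d)" % b1, "PRP-1"]
    # Classic PRP(3); the reported residue is still type-4.
    return ["PRP"]


classic = plan_prp_task()
upgraded = plan_prp_task(1000000)
```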

The result from PRP-1 can be either:
- a factor found
- or, no factor found, but PRP completed and shows "composite".
- or, exceptionally, "prime".

The result from PRP-1, being new, is still pending discussion about how to submit to GIMPS and can't be reported yet.

SELROC 2018-09-22 04:55

[QUOTE=preda;496544]OK. Let's wait a bit before investigating this perf regression more -- maybe you'll move to ROCm at some point, and maybe it does not manifest on ROCm anymore, and then there's nothing to do on GpuOwl side.

ROCm 1.9 does seem to be easier to install on recent kernels, such as 4.18.



Yes, I'll look into improving loading time (I have an idea, need to implement it)[/QUOTE]


I think the switch to ROCm will not happen soon for me. The message from the ROCm forum was that they do not support gen1 type hardware, so I am unable to use it at the moment.

SELROC 2018-09-22 07:03

As a "workaround" I use version 3.5 for production and version 4.1 for tests.

xx005fs 2018-09-23 16:44

Is there any difference between compiling the CUDA branch of gpuowl and the regular OpenCL branch, and what should I use to compile it?

preda 2018-09-25 13:47

[QUOTE=xx005fs;496634]Is there any difference between compiling the CUDA branch of gpuowl and the regular OpenCL branch, and what should I use to compile it?[/QUOTE]

You should try doing a "make" in that branch. Probably specifying the target like:
"make cudaowl"

Have a look in the Makefile.
In general you need the CUDA toolkit (nvcc) and also a general C++ compiler (g++).

I can't help much on the CUDA side.

preda 2018-09-25 14:13

The PRP-1 (head version) of GpuOwl is gradually becoming more stable.

If you want to try PRP-1, you should specify on the command line:
- a B1 bound (for exponents around 90M I use B1=1000000, but feel free to try what you like; see [url]https://www.mersenne.ca/prob.php[/url]. For 332M exponents I'd use 2M or 3M).

The exponent will start with a P-1 first stage to B1. Don't interrupt it: there is no savefile for this stage, so it would restart from the beginning. On the other hand, it should only take on the order of 1h - 2h.

At the end of the P-1 first-stage, one GCD is done and a factor may be found (with small probability, around 1.5%).

After that, the PRP-1 proper starts. It needs a "Kset file" which is just a list of iteration numbers to test. A couple are included with the sources (kset.txt, kset-1M-800.txt). The kset-1M-800.txt is good for any B1 >= 1M. These "kset" files contain 800000 lines, i.e. when using them the PRP-1 will do 800000 additional MULs vs. a classic PRP. These files were generated by me with the "kselect" tool, for use with exponents around 90M.
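
For a feel of what 800000 extra MULs means, here is a back-of-the-envelope Python estimate. It assumes a classic PRP on exponent p costs about p iterations, and it leaves the cost of one MUL relative to one iteration as a parameter (1.0 here, which is an assumption). The function name is hypothetical.

```python
def kset_overhead_fraction(kset_lines, exponent, mul_cost=1.0):
    """Extra work of PRP-1 over classic PRP, as a fraction of the ~exponent iterations."""
    return kset_lines * mul_cost / exponent


# 800000 Ks on a 90M exponent: a bit under 1% if a MUL costs about one iteration.
overhead = kset_overhead_fraction(800_000, 90_000_000)
```

If a MUL costs closer to two iterations, pass mul_cost=2.0 and the estimate doubles.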

You specify the "kset file" on the command line with "-kset".

Starting from around iteration 3M you should see GCD tests appearing. If one of them finds a factor, you're lucky and the test ends. But don't put your hopes too high -- for a 90M exponent factored to 76 bits, the chances of a factor in this "second stage" of the test are around 2 - 2.5%, so not exactly huge. (Combined with the P-1 first stage, I would expect "up to 4%" probability of a factor.)
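
The "up to 4%" figure can be checked with a short Python calculation. It assumes the second-stage chance applies when the first stage found nothing, so the overall chance is 1 - (1 - p1) * (1 - p2); the function name is hypothetical and the inputs are the rough percentages quoted in this post.

```python
def combined_factor_probability(p_first_stage, p_second_stage):
    """Chance that at least one of the two stages finds a factor."""
    return 1.0 - (1.0 - p_first_stage) * (1.0 - p_second_stage)


low = combined_factor_probability(0.015, 0.020)   # about 3.5%
high = combined_factor_probability(0.015, 0.025)  # about 4.0%
```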

It makes sense to apply PRP-1 to exponents that didn't have any P-1 test done yet. Otherwise there is duplication of the first stage (coupled with about 0 probability of factors up to the bounds already tested).

The bound B1=0 is used to indicate "classic PRP" (no PRP-1), and is on by default.

All this must be confusing. Also, being new, it may contain many new bugs. Let me know what doesn't work or is not clear.

(and the first person ever to find a PRP-1 factor gets... a mention in this thread! woohoo!)

