mersenneforum.org

gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

preda 2019-05-05 20:58

[QUOTE=SELROC;515840]Experimenting with -block sizes for a 332M exponent:


1. the GEC time with block 400K is ~2.11 sec.
2. the GEC time with block 1000K is ~4.25 sec.


The GEC time varies with block size.[/QUOTE]

Yes, because a check involves doing "block-size" additional iterations. E.g. with block=400, 400 additional iterations are done every 400^2 = 160K iterations, while with block=1000, 1000 additional iterations are done every 1000^2 = 1M iterations.
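
As a rough illustration of the arithmetic above, here is a minimal Python sketch. The function name `gec_overhead` and its return fields are hypothetical and are not part of gpuowl; the only thing taken from the post is that a check costs "block-size" extra iterations and happens every block-size^2 iterations.

```python
def gec_overhead(block: int) -> dict:
    """Cost of the check as described above: `block` extra iterations
    every block**2 iterations (illustrative helper, not gpuowl code)."""
    interval = block * block          # iterations between two checks
    extra = block                     # additional iterations per check
    return {
        "interval": interval,
        "extra_iterations": extra,
        "overhead_fraction": extra / interval,  # simplifies to 1 / block
    }


if __name__ == "__main__":
    for block in (400, 1000, 2000):
        info = gec_overhead(block)
        print(
            f"block={block}: check every {info['interval']:,} iterations, "
            f"{info['extra_iterations']} extra iterations per check, "
            f"relative overhead {info['overhead_fraction']:.3%}"
        )
```

Under this model a larger block makes each individual check longer, but the total share of extra iterations shrinks as 1/block.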

SELROC 2019-05-06 06:15

[QUOTE=preda;515866]Yes, because a check involves doing "block-size" additional iterations. E.g. with block=400, 400 additional iterations are done every 400^2 = 160K iterations, while with block=1000, 1000 additional iterations are done every 1000^2 = 1M iterations.[/QUOTE]


That is, if I use a block of 2000, the GEC should take about 8-9 seconds. There is a trade-off to apply here.
Do I want long, infrequent checks or short, frequent checks?


When using a block of 1000 and a log of 20000, gpuowl sometimes fails to display the OK... (check ...) output, probably because the check falls in between log lines.

preda 2019-05-06 10:18

[QUOTE=SELROC;515914]That is, if I use a block of 2000, the GEC should take about 8-9 seconds. There is a trade-off to apply here.
Do I want long, infrequent checks or short, frequent checks?


When using a block of 1000 and a log of 20000, gpuowl sometimes fails to display the OK... (check ...) output, probably because the check falls in between log lines.[/QUOTE]

The check is done at block-size^2 (squared). So if block=1000, the check is done every 1M. With a log of 20000, you will hit every 1M.

But what happens, for example, with a block of 400 and a log of 100K? The check will be done every 160K, and will be displayed correctly even if it doesn't hit a 'log' multiple of 100K (at least that's the plan).
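
To see where checks and log lines coincide in the two cases above, here is a small Python sketch. The helper names `check_and_log_points` and `first_coincidence` are made up for illustration and say nothing about how gpuowl schedules its output internally; the sketch only applies the rule that a check happens every block^2 iterations.

```python
from math import gcd


def check_and_log_points(block: int, log_step: int, limit: int):
    """Return (all check iterations up to `limit`,
    the subset that is also a multiple of `log_step`)."""
    check_step = block * block
    checks = list(range(check_step, limit + 1, check_step))
    on_log = [k for k in checks if k % log_step == 0]
    return checks, on_log


def first_coincidence(block: int, log_step: int) -> int:
    """First iteration that is both a check point and a log point
    (the least common multiple of the two steps)."""
    check_step = block * block
    return check_step * log_step // gcd(check_step, log_step)


if __name__ == "__main__":
    for block, log_step in ((1000, 20_000), (400, 100_000)):
        checks, on_log = check_and_log_points(block, log_step, 2_000_000)
        print(f"block={block}, log={log_step}")
        print(f"  checks fall on: {checks}")
        print(f"  also a log multiple: {on_log}")
        print(f"  first coincidence: {first_coincidence(block, log_step)}")
```

With block=1000 every check lands on a log multiple, while with block=400 and a log of 100K most checks fall between log lines, which is the case where the check has to be displayed on its own.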

SELROC 2019-05-06 10:37

[QUOTE=preda;515923]The check is done at block-size^2 (squared). So if block=1000, the check is done every 1M. With a log of 20000, you will hit every 1M.

But what happens, for example, with a block of 400 and a log of 100K? The check will be done every 160K, and will be displayed correctly even if it doesn't hit a 'log' multiple of 100K (at least that's the plan).[/QUOTE]


Ok so if you confirm that the check is displayed, I may have missed it.
Experimenting further ...

SELROC 2019-05-06 13:01

[QUOTE=preda;515923]The check is done at block-size^2 (squared). So if block=1000, the check is done every 1M. With a log of 20000, you will hit every 1M.

But what happens, for example, with a block of 400 and a log of 100K? The check will be done every 160K, and will be displayed correctly even if it doesn't hit a 'log' multiple of 100K (at least that's the plan).[/QUOTE]


PS: The Mersenne number [url]https://www.mersenne.org/report_exponent/?exp_lo=332252533&full=1[/url] is composite.
The computation took about 15 days 10 hours on a Radeon VII.

Prime95 2019-05-06 14:53

[QUOTE=SELROC;515931]PS: The Mersenne number [url]https://www.mersenne.org/report_exponent/?exp_lo=332252533&full=1[/url] is composite.
The computation took about 15 days 10 hours on a Radeon VII.[/QUOTE]

Impressive speed.

IMO, far too little P-1 factoring was done. Were these bounds chosen to be optimal for all-Radeon testing? Might this indicate that prime95 should be used for P-1 prior to GPU PRP testing?

SELROC 2019-05-06 14:58

[QUOTE=Prime95;515935]Impressive speed.

IMO, far too little P-1 factoring was done. Were these bounds chosen to be optimal for all-Radeon testing? Might this indicate that prime95 should be used for P-1 prior to GPU PRP testing?[/QUOTE]




I am currently using ROCm 2.3 which has a performance regression. I bet that with ROCm 2.4 (if they fix the issue) the ETA for 332M will be around 13-14 days.

SELROC 2019-05-06 15:34

[QUOTE=Prime95;515935]Impressive speed.

IMO, far too little P-1 factoring was done. Were these bounds chosen to be optimal for all-Radeon testing? Might this indicate that prime95 should be used for P-1 prior to GPU PRP testing?[/QUOTE]


GpuOwl now supports P-1, so should I do P-1 before PRP?

R. Gerbicz 2019-05-06 15:51

[QUOTE=SELROC;515914]That is, if I use a block of 2000, the GEC should take about 8-9 seconds. There is a trade-off to apply here.
Do I want long, infrequent checks or short, frequent checks?
[/QUOTE]

Trade-off is a good word, because a larger block size L means more work at rollbacks: you need to redo L^2 iterations.
It isn't easy to choose the best L for a card/CPU that is not very faulty, but for example,
if you have 0.2 rollbacks per p (so basically one per 5 tests) at p~1e8, then your optimal L value is 1000.
For the curious, the exact formula is:

L=(2*p/#rollback)^(1/3),
where #rollback is the average number of rollbacks for p (so this could be even higher than 1, for a faulty card).

PS: Don't choose L > sqrt(p), because then L^2 > p and you would never make a check, though this also depends on your implementation.
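
The formula above can be evaluated directly. The following Python sketch is only an illustration: the function name `optimal_block` is hypothetical, and capping the result at sqrt(p) is an assumption made here to respect the PS, not something the formula itself does.

```python
from math import sqrt


def optimal_block(p: float, rollbacks: float) -> float:
    """Evaluate L = (2*p / #rollback)^(1/3) from the post above.

    p         -- the exponent being tested
    rollbacks -- average number of rollbacks per test of size p
    The result is capped at sqrt(p) (assumption of this sketch), since a
    larger L would mean no check ever happens during the test.
    """
    if p <= 0 or rollbacks <= 0:
        raise ValueError("p and rollbacks must be positive")
    unconstrained = (2.0 * p / rollbacks) ** (1.0 / 3.0)
    return min(unconstrained, sqrt(p))


if __name__ == "__main__":
    # The worked example from the post: 0.2 rollbacks per test at p ~ 1e8
    # gives L of about 1000.
    print(round(optimal_block(1e8, 0.2)))
    # A much more reliable card pushes L up until the sqrt(p) cap applies.
    print(optimal_block(1e8, 1e-6))
```

In practice the rollback rate has to be estimated from the card's history, so the result is best treated as an order-of-magnitude guide rather than an exact setting.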

Prime95 2019-05-06 18:38

[QUOTE=SELROC;515937]GpuOwl now supports P-1, so should I do P-1 before PRP?[/QUOTE]

Yes, P-1 is highly recommended.

It's a question of "relative" performance. That is, if on your machine GpuOwl is 5x faster than prime95 at PRP but only 2x faster at P-1, then GpuOwl should be doing PRP 100% of the time. Use prime95 on your CPU to do all your P-1 work prior to having the GPU do the PRP.

Your P-1 bounds are "suspicious" in that when prime95 is used to double-check your result years down the road, I think prime95 will want to redo the P-1 to higher bounds.

IIUC, Preda has a somewhat different view on optimal P-1 bounds in that he believes Gerbicz error checking in the initial PRP test means double-checking is not necessary.

For reference, prime95 would use these bounds (assuming 1.8GB of memory).
Saving 1 LL/PRP test: B1=1,115,000 B2=14,495,000
Saving 2 LL/PRP tests: B1=2,395,000 B2=35,326,000
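
The "relative performance" argument earlier in this post can be sketched numerically. Everything in this Python example is hypothetical: the CPU times are invented round numbers, only the 5x and 2x speedup ratios come from the post, and the model (steady-state time per exponent equals the load of the busiest device) is a simplifying assumption.

```python
def bottleneck_time(assignment: dict, times: dict) -> float:
    """Time per exponent in steady state: the total load of the busiest
    device when each task is given to the device named in `assignment`."""
    load = {}
    for task, device in assignment.items():
        load[device] = load.get(device, 0.0) + times[device][task]
    return max(load.values())


# Hypothetical CPU (prime95) times per exponent, in arbitrary units.
cpu = {"PRP": 100.0, "P-1": 4.0}
# GPU (GpuOwl) assumed 5x faster at PRP but only 2x faster at P-1.
gpu = {"PRP": cpu["PRP"] / 5, "P-1": cpu["P-1"] / 2}
times = {"cpu": cpu, "gpu": gpu}

plans = {
    "GPU does PRP, CPU does P-1": {"PRP": "gpu", "P-1": "cpu"},
    "GPU does both": {"PRP": "gpu", "P-1": "gpu"},
    "CPU does PRP, GPU does P-1": {"PRP": "cpu", "P-1": "gpu"},
}

results = {name: bottleneck_time(plan, times) for name, plan in plans.items()}
best = min(results, key=results.get)

if __name__ == "__main__":
    for name, t in results.items():
        print(f"{name}: {t:g} units per exponent")
    print("best plan:", best)
```

With these made-up numbers, keeping the GPU on PRP and leaving P-1 to the CPU gives the shortest time per exponent, because every unit of GPU time spent on P-1 displaces PRP work where the GPU's advantage is larger.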

SELROC 2019-05-06 19:05

[QUOTE=Prime95;515949]Yes, P-1 is highly recommended.

It's a question of "relative" performance. That is, if on your machine GpuOwl is 5x faster than prime95 at PRP but only 2x faster at P-1, then GpuOwl should be doing PRP 100% of the time. Use prime95 on your CPU to do all your P-1 work prior to having the GPU do the PRP.

Your P-1 bounds are "suspicious" in that when prime95 is used to double-check your result years down the road, I think prime95 will want to redo the P-1 to higher bounds.

IIUC, Preda has a somewhat different view on optimal P-1 bounds in that he believes Gerbicz error checking in the initial PRP test means double-checking is not necessary.

For reference, prime95 would use these bounds (assuming 1.8GB of memory).
Saving 1 LL/PRP test: B1=1,115,000 B2=14,495,000
Saving 2 LL/PRP tests: B1=2,395,000 B2=35,326,000[/QUOTE]


The ETA for 332M exponent is 2 months on Radeon RX580 and 15 days on Radeon VII.


I am using primenet.py to get assignments and return results; gpuowl runs in parallel, and the worktodo.txt file is checked every 2 hours. The assignment type is 153.



To do P-1, do I have to get a different assignment type?

