mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing > GpuOwl

Old 2019-12-02, 13:41   #1497
storm5510
Random Account
 

Quote:
Originally Posted by preda View Post
I re-enabled it for now, as I don't think I have a very strong reason to disable it yet.

I think a block size of 400 is a rather nice overall value (note, this is a bit smaller than the old default of 500). Why do you need a custom block size, and what value do you usually set it to?

As I have 2 GPUs (an XFX and an Asrock) that sometimes generate errors (about 1-2 per day), I have come to appreciate a smaller block size, and I added a bit of logic to adaptively vary the default check-step depending on the number of errors so far: it starts at 200'000 and is roughly halved for each additional error, down to a floor of 20'000.

And there is one more reason for the smallish block size: regarding the (future) PRP proof, the plan right now is to have the proof cover (for exponent E) a region from the beginning up to an iteration that is a multiple of 1024 * block-size (such that any halving step in this region hits a block-size boundary and can be checked). This leaves a "tail" of up to 1024 * blockSize iterations at the end that is not covered by the proof and will need to be re-run by the checker, so it's good for the tail not to be too large.
After reading all the above, I don't think I want to change what I have, for now. It runs very well.

I have only used it for P-1 tests. I just have to make sure the "F" in "PFactor" is a capital. I think PrimeNet issues these in lower case. It took me quite a while to figure out how to customize the bounds. Once done, no problems...
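As an aside, the adaptive check-step preda describes above (start at 200'000, roughly halve per error, floor at 20'000) can be sketched as follows. This is an illustrative model only, not GpuOwl's actual code; the function name and the halve-by-shift are assumptions:

```python
def check_step(error_count, initial=200_000, floor=20_000):
    """Illustrative: roughly halve the Gerbicz check interval for each
    error seen so far, never dropping below the floor."""
    return max(floor, initial >> error_count)

# 0 errors -> 200000, 1 -> 100000, 2 -> 50000, ... many errors -> 20000
```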
Old 2019-12-02, 15:31   #1498
Prime95
P90 years forever!
 

Quote:
Originally Posted by preda View Post
I think a block size of 400 is a rather nice overall value (note, this is a bit smaller than the old default of 500). Why do you need a custom block size, and what value do you usually set it to?
I use a block size of 1000. My Radeons have been pretty solid. Most go for a month or more without errors. I increase voltage or reduce mem speed if a Radeon gives me more than a couple of errors in a week.

I chose 1000 because Mr. Gerbicz's original threads used that value, calculating a 0.2% overhead. A block size of 400 has a 0.5% overhead.

I understand frequent errors make a smaller block size desirable. Prime95 automatically reduces the block size when an error occurs. I'm not suggesting this feature -- it's a bit of overkill.
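The overheads quoted (0.2% at block size 1000, 0.5% at 400) are consistent with a cost of roughly two extra modular multiplications per block of L squarings, i.e. overhead ≈ 2/L. This is a back-of-the-envelope model inferred from the numbers given, not a statement about either program's internals:

```python
def gec_overhead(block_size, extra_muls_per_block=2):
    """Assumed model: the Gerbicz check adds about two extra modular
    multiplications (updating the running products) per block."""
    return extra_muls_per_block / block_size

# gec_overhead(1000) -> 0.002 (0.2%); gec_overhead(400) -> 0.005 (0.5%)
```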

The P-1 error I was getting: GPU->host read failed (check 61e4 vs 3f07)
Old 2019-12-02, 17:23   #1499
R. Gerbicz
 

Quote:
Originally Posted by Prime95 View Post
A block size of 400 has a 0.5% overhead.
These calculations should also include the cost of the (possible) rollbacks, when you are redoing iterations. Of course the task is to minimize this (expected!) cost of error checks plus rollbacks.
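Gerbicz's point can be made concrete with a toy model (all parameter values below are hypothetical): if checking every C iterations costs k extra multiplications, and an error at per-iteration rate p forces redoing up to ~C iterations, the expected per-iteration overhead is k/C + p*C, minimized at C = sqrt(k/p):

```python
import math

def expected_overhead(C, k=2.0, p=1e-6):
    """Toy model: check cost amortized per iteration, plus expected
    rollback rework (an error loses roughly C iterations)."""
    return k / C + p * C

C_opt = math.sqrt(2.0 / 1e-6)  # minimizer of the toy model, ~1414 iterations
```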
Old 2019-12-02, 20:14   #1500
preda
 

Quote:
Originally Posted by Prime95 View Post
I use a block size of 1000. My Radeons have been pretty solid. Most go for a month or more without errors. I increase voltage or reduce mem speed if a Radeon gives me more than a couple of errors in a week.

I chose 1000 because Mr. Gerbicz's original threads used that value, calculating a 0.2% overhead. A block size of 400 has a 0.5% overhead.

I understand frequent errors make a smaller block size desirable. Prime95 automatically reduces the block size when an error occurs. I'm not suggesting this feature -- it's a bit of overkill.

The P-1 error I was getting: GPU->host read failed (check 61e4 vs 3f07)
The difference between 0.2% and 0.5% is minor though. How does Prime95 reduce/change the block size? -- are you sure you're not reducing the "check size" (how often the check is done) while keeping the block size the same?

The P-1 error -- strange, I don't understand why you were getting it; it seems the memory transfer (reading from the GPU) or the synchronization around it (i.e. waiting for it to finish) was failing.
Old 2019-12-02, 21:43   #1501
Prime95
P90 years forever!
 

Quote:
Originally Posted by preda View Post
The difference between 0.2% and 0.5% is minor though. How does Prime95 reduce/change the block size? -- are you sure you're not reducing the "check size" (how often the check is done) while keeping the block size the same?
Once you pass a Gerbicz error check (or fail and rollback to a save file that passed a check) you are essentially in a virgin state where you can select any block size you want going forward.
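For reference, the check being passed here can be sketched on small numbers. With x_{n+1} = x_n^2 mod N and d the running product of block-boundary residues, the identity d_new == d_old^(2^L) * x_0 (mod N) holds at every block boundary. A minimal illustration, checking every block (far more often than a real implementation would):

```python
def squarings_with_gec(a, N, blocks, L):
    """Sketch of Gerbicz-checked repeated squaring (x -> x^2 mod N).
    Illustrative only; not GpuOwl's or Prime95's implementation."""
    x = x0 = a % N
    d = x0                     # running product of block-boundary residues
    for _ in range(blocks):
        for _ in range(L):
            x = pow(x, 2, N)
        d_prev, d = d, (d * x) % N
        # Gerbicz identity: d == d_prev^(2^L) * x0 (mod N)
        assert d == (pow(d_prev, 2 ** L, N) * x0) % N
    return x

r = squarings_with_gec(3, 2**61 - 1, blocks=4, L=10)
```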
Old 2019-12-02, 21:51   #1502
R. Gerbicz
 

Quote:
Originally Posted by Prime95 View Post
Once you pass a Gerbicz error check (or fail and rollback to a save file that passed a check) you are essentially in a virgin state where you can select any block size you want going forward.
Yeah, with f(n) = a^(2^n) mod N it is trivial that f(s+t) = f(t)^(2^s), so you can start a new block length (new L) at an error-checked residue at iteration t, using a new "base" = f(t). (Why? Because you trust that at iteration t you have, with high probability, a good residue.)

The only difference is that at the error check you need to multiply by f(t) instead of the smallish a=3. So for the leading-wavefront exponents you'd need ~100-250 more mulmods (almost nothing in computation time).
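The identity f(s+t) = f(t)^(2^s) is easy to check numerically with small parameters (the modulus, base, and values of s and t below are arbitrary):

```python
N = 2**31 - 1   # small modulus for illustration
a = 3

def f(n):       # f(n) = a^(2^n) mod N
    return pow(a, 2**n, N)

s, t = 5, 7
assert f(s + t) == pow(f(t), 2**s, N)   # f(s+t) = f(t)^(2^s)
```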
Old 2019-12-02, 22:38   #1503
ewmayer
2ω=0
 

Quote:
Originally Posted by R. Gerbicz View Post
Yeah, with f(n) = a^(2^n) mod N it is trivial that f(s+t) = f(t)^(2^s), so you can start a new block length (new L) at an error-checked residue at iteration t, using a new "base" = f(t). (Why? Because you trust that at iteration t you have, with high probability, a good residue.)

The only difference is that at the error check you need to multiply by f(t) instead of the smallish a=3. So for the leading-wavefront exponents you'd need ~100-250 more mulmods (almost nothing in computation time).
Why so many mulmod-equivalents? Just forward-FFT the pure-integer f(t) read from the savefile and do a 2-input FFT-modmul as usual. Or were you referring to a pure-integer modmul? (If so, why?)
Old 2019-12-03, 01:00   #1504
kriesel
 

Quote:
Originally Posted by ATH View Post
How do you specify the PRP type in gpuOwL?

I just finished my first gpuowl test using Google Colab, but it was a PRP DC and I forgot to think about the PRP type, so it finished as the wrong type:
https://mersenne.org/M87000929

I found a type-1 result to DC for the next one, so that should be ok, but how do I choose the type? Is the type it uses fixed in the different versions?

Could I continue from the last savefile of 87000929 and finish it as a type 4, if the difference between types is only at the end?
According to undoc.txt from Prime95:
type 1: a^(n-1)
type 4: a^((n+1)/2)
There's a whole reference thread on gpuowl in https://mersenneforum.org/showthread.php?t=24607
https://www.mersenneforum.org/showpo...83&postcount=7 and https://www.mersenneforum.org/showpo...3&postcount=15 are about gpuowl versions and residue types.
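The two residue types are arithmetically related: since 2 * (n+1)/2 = (n-1) + 2, squaring a type-4 residue yields a^2 times the type-1 residue mod n. A small numeric check (p = 13 and a = 3 chosen arbitrarily for illustration):

```python
p = 13
n = 2**p - 1                       # Mersenne number being tested
a = 3
type1 = pow(a, n - 1, n)           # type 1: a^(n-1) mod n
type4 = pow(a, (n + 1) // 2, n)    # type 4: a^((n+1)/2) mod n
assert pow(type4, 2, n) == (a * a * type1) % n
```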

Last fiddled with by kriesel on 2019-12-03 at 01:00
Old 2019-12-03, 05:18   #1505
kracker
 

I've been having a weird issue with gpuowl. I have a system (RX570) which I run headless most of the time... When an assignment finishes, gpuowl writes the result and then does nothing -- until I log in with RDP, at which point gpuowl immediately starts the next assignment. I tried running mfakto... zero issues.
Code:
2019-12-02 02:25:51 core 92912081 P2 2880/2880: setup 1128 ms; 5931 us/prime, 9223 primes
2019-12-02 02:25:51 core waiting for background GCDs..
2019-12-02 02:25:51 core 92912087 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.72 bits/word
2019-12-02 02:25:51 core OpenCL args "-DEXP=92912087u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x9.b3f5913600238p-3 -DIWEIGHT_STEP=0xd.311c9cb7274a8p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1  -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-12-02 02:25:53 core OpenCL compilation in 2060 ms
2019-12-02 02:26:39 core 92912081 P2 GCD: no factor
2019-12-02 02:26:39 core {"exponent":"92912081", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-11-gfaaa2f2"}, "timestamp":"2019-12-02 10:26:39 UTC", "user":"kracker", "computer":"core", "aid":"----", "fft-length":5242880, "B1":720000, "B2":13680000}
2019-12-02 06:07:51 core 92912087 P1 B1=720000, B2=13680000; 1038539 bits; starting at 0
2019-12-02 06:08:38 core 92912087 P1    10000   0.96%; 4698 us/sq; ETA 0d 01:21; b195a86475b0f7e5
Old 2019-12-03, 09:51   #1506
R. Gerbicz
 

Quote:
Originally Posted by ewmayer View Post
Why so many mulmod-equivalents? Just forward-FFT the pure-integer f(t) read from the savefile and do a 2-input FFT-modmul as usual.
I see it, you're right.
Old 2019-12-05, 22:58   #1507
kriesel
 
Multiple instances not always better

It's been reported that on Radeon VII, running two instances improves total throughput and throughput per watt-hour.
I found a case where two very different instances result in ~95% of single-instance throughput.
This case combines very different gpuowl versions, computation types (LL vs. PRP3), exponents, and consequently fft lengths.
Windows 10, Lenovo Thinkstation D30, XFX Radeon VII; stock settings.

gpuowl v0.6 alone 1.005ms/iter (50330737 LL DC, 4M fft) = 995. iter/sec alone

v6.11 alone: 1.193 ms/iter (89260099 PRP, 5M fft) = 838 iter/sec

Two disparate instances run together:
gpuowl v0.6 2.161 ms/iter; 463 iter/sec; throughput 463/995 = 0.4651 of solo;
v6.11 2.458 ms/iter; 407 iter/sec; throughput 407/838 = 0.4855 of solo;
combined, 0.4651 + 0.4855 = 0.9506 < 1. Noticeably slower running together.
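The combined-throughput arithmetic above, spelled out for anyone who wants to rerun it with their own timings (computed from ms/iter, as in the post):

```python
solo_ms   = {"v0.6": 1.005, "v6.11": 1.193}   # ms/iter running alone
shared_ms = {"v0.6": 2.161, "v6.11": 2.458}   # ms/iter running together
fractions = {k: solo_ms[k] / shared_ms[k] for k in solo_ms}
combined = sum(fractions.values())  # < 1 means a net throughput loss
```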

Last fiddled with by kriesel on 2019-12-05 at 22:59