[QUOTE=preda;531848]I re-enabled it for now, as I don't have a very strong reason to disable it yet.

I think a block size of 400 is a rather nice, overall good value (note, this is a bit smaller than the old default of 500). Why do you need a custom block size, and what value do you usually set it to?

As I have 2 GPUs (an XFX and an ASRock) that sometimes generate errors (about 1-2 per day), I have come to appreciate a smaller block size, and I added a bit of logic to adaptively vary the default check-step depending on the number of errors so far: starting with a check-step of 200'000 and roughly halving it for each additional error, down to 20'000.

There is one more reason for the smallish block size: for the (future) PRP proof, the current plan is to have the proof cover, for exponent E, the region from the beginning up to an iteration that is a multiple of 1024 * blockSize (so that any halving step in this region hits a block-size boundary and can be checked). This leaves a "tail" of up to 1024 * blockSize iterations at the end that is not covered by the proof and will need to be re-run by the checker, so it's good for the tail not to be too large.[/QUOTE]
After reading all the above, I don't think I want to change what I have, for now. It runs very well. I have only used it for P-1 tests. I just have to make sure the "F" in "PFactor" is a capital; I think [I]PrimeNet[/I] issues these in lower case. It took me quite a while to figure out how to customize the bounds. Once done, no problems... :smile:
[QUOTE=preda;531848]
I think a block size of 400 is a rather nice, overall good value (note, this is a bit smaller than the old default of 500). Why do you need a custom block size, and what value do you usually set it to?[/QUOTE]
I use a block size of 1000. My Radeons have been pretty solid; most go for a month or more without errors. I increase voltage or reduce memory speed if a Radeon gives me more than a couple of errors in a week.

I chose 1000 because Mr. Gerbicz's original threads used that value, calculating a 0.2% overhead. A block size of 400 has a 0.5% overhead. I understand frequent errors make a smaller block size desirable. Prime95 automatically reduces the block size when an error occurs -- I'm not suggesting this feature, it's a bit of overkill.

The P-1 error I was getting: GPU->host read failed (check 61e4 vs 3f07)
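The overhead figures quoted above are consistent with the Gerbicz check costing about two extra multiplications per block of L squarings, i.e. overhead ~ 2/L. That constant is my inference from the numbers in the post (the exact cost depends on the implementation):

```python
def gerbicz_overhead(block_size: int) -> float:
    # Assumes ~2 extra modular multiplications per block of
    # block_size squarings; the exact constant is implementation-dependent.
    return 2.0 / block_size

print(f"L=1000: {gerbicz_overhead(1000):.1%}")  # matches the quoted 0.2%
print(f"L= 400: {gerbicz_overhead(400):.1%}")   # matches the quoted 0.5%
```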
[QUOTE=Prime95;531855] A block size of 400 has a 0.5% overhead.
[/QUOTE] These calculations should also include the cost of the (possible) rollbacks, i.e. the redone iterations. Of course, the task is to minimize the (expected!) cost of error checks + rollbacks.
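A back-of-the-envelope model of that expected cost (my own sketch, not from the thread): with check overhead ~2/L per iteration and a per-iteration error probability p, each error forces redoing roughly one block of L iterations, so the expected extra work per iteration is about 2/L + p*L, minimized at L = sqrt(2/p):

```python
import math

def expected_overhead(L: float, p: float) -> float:
    # Check cost ~2/L per iteration, plus expected rollback work:
    # each error (probability p per iteration) redoes ~L iterations.
    return 2.0 / L + p * L

def optimal_block(p: float) -> float:
    # d/dL (2/L + p*L) = 0  =>  L = sqrt(2/p)
    return math.sqrt(2.0 / p)

# At one error per ~500'000 iterations, L = 1000 is exactly optimal.
print(optimal_block(2e-6))
```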
[QUOTE=Prime95;531855]I use a block size of 1000. My Radeons have been pretty solid. Most go for a month or more without errors. I increase voltage or reduce mem speed if a Radeon gives me more than a couple of errors in a week.
I chose 1000 because Mr. Gerbicz's original threads used that value, calculating a 0.2% overhead. A block size of 400 has a 0.5% overhead. I understand frequent errors make a smaller block size desirable. Prime95 automatically reduces the block size when an error occurs. I'm not suggesting this feature -- a bit of overkill. The P-1 error I was getting: GPU->host read failed (check 61e4 vs 3f07)[/QUOTE]
The difference between 0.2% and 0.5% is minor though. How does Prime95 reduce/change the block size? Are you sure you're not reducing the "check size" (how often the check is done) while keeping the block size the same?

The P-1 error is strange; I don't understand why you were getting it. It seems the memory transfer (reading from the GPU) or the synchronization around it (i.e. waiting for it to finish) was failing.
[QUOTE=preda;531872]The difference between 0.2% and 0.5% is minor though. How does Prime95 reduce/change the block size? -- are you sure you're not reducing the "check size" (how often the check is done) while keeping the block size the same?[/QUOTE]
Once you pass a Gerbicz error check (or fail and rollback to a save file that passed a check) you are essentially in a virgin state where you can select any block size you want going forward. |
[QUOTE=Prime95;531878]Once you pass a Gerbicz error check (or fail and rollback to a save file that passed a check) you are essentially in a virgin state where you can select any block size you want going forward.[/QUOTE]
Yeah, with f(n)=a^(2^n) mod N it is trivial that f(s+t)=f(t)^(2^s), so you can start a new block length (a new L) at an error-checked residue at iteration t, using a new "base" f(t). (Why? Because you trust that, with high probability, the residue at iteration t is good.) The only difference is that at the error check you need to multiply by f(t) rather than by the smallish a=3. So on the leading-wavefront exponents you'd need ~100-250 more mulmods (almost nothing in computation time).
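A quick numeric sanity check of the identity f(s+t)=f(t)^(2^s), with a small modulus rather than a wavefront-sized one (my own illustration, not gpuowl code):

```python
# f(n) = a^(2^n) mod N; restarting at iteration t with new base f(t)
# must land on the same residue as continuing from the original base a.
N, a = 2**31 - 1, 3
t, s = 100, 57

f_t = pow(a, 2**t, N)          # error-checked residue at iteration t
lhs = pow(a, 2**(s + t), N)    # f(s+t), continuing from base a
rhs = pow(f_t, 2**s, N)        # restart at t with new base f(t), run s more
assert lhs == rhs
```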
[QUOTE=R. Gerbicz;531879]Yeah, with f(n)=a^(2^n) mod N
it is trivial that f(s+t)=f(t)^(2^s), so you can start a new block length (a new L) at an error-checked residue at iteration t, using a new "base" f(t). (Why? Because you trust that, with high probability, the residue at iteration t is good.) The only difference is that at the error check you need to multiply by f(t) rather than by the smallish a=3. So on the leading-wavefront exponents you'd need ~100-250 more mulmods (almost nothing in computation time).[/QUOTE]
Why so many mulmod-equivalents? Just forward-FFT the pure-integer f(t) read from the savefile and do a 2-input FFT-modmul as usual. Or were you referring to a pure-integer modmul? (If so, why?)
[QUOTE=ATH;531727]How do you specify the PRP type in gpuOwL?

I just finished my first gpuowl test using Google Colab, but it was a PRP DC and I forgot to think about the PRP type, so it finished with the wrong type: [URL]https://mersenne.org/M87000929[/URL]

I found a type-1 result to DC for the next one, so that should be OK, but how do I choose the type? Is the type fixed per gpuowl version? Could I continue from the last savefile of 87000929 and finish it as type 4, if the difference between types is only at the end?

According to undoc.txt from Prime95:
type 1: a^(n-1)
type 4: a^((n+1)/2)[/QUOTE]
There's a whole reference thread on gpuowl at [URL]https://mersenneforum.org/showthread.php?t=24607[/URL]
[URL]https://www.mersenneforum.org/showpost.php?p=489083&postcount=7[/URL] and [URL]https://www.mersenneforum.org/showpost.php?p=519603&postcount=15[/URL] are about gpuowl versions and residue types.
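On the two residue types from undoc.txt: they differ only in the final exponent, and one squaring relates them, since (a^((n+1)/2))^2 = a^(n+1) = a^2 * a^(n-1). A small-exponent illustration (my own sketch with a=3, not gpuowl code):

```python
p = 31            # small illustrative exponent; n = 2^p - 1
n = 2**p - 1

type1 = pow(3, n - 1, n)         # "type 1" residue: 3^(n-1) mod n
type4 = pow(3, (n + 1) // 2, n)  # "type 4" residue: 3^((n+1)/2) mod n

# One extra squaring relates them: (3^((n+1)/2))^2 = 3^(n+1) = 9 * 3^(n-1)
assert pow(type4, 2, n) == (9 * type1) % n
```

This is why the two types only diverge at the very end of the run; the bulk of the squarings is identical.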
I've been having a weird issue with gpuowl. I have a system (RX570) which I run headless most of the time... When an assignment finishes, gpuowl will write the result, then do nothing... until I log in with RDP, at which point gpuowl immediately starts the next assignment. I tried running mfakto... zero issues.
[code]
2019-12-02 02:25:51 core 92912081 P2 2880/2880: setup 1128 ms; 5931 us/prime, 9223 primes
2019-12-02 02:25:51 core waiting for background GCDs..
2019-12-02 02:25:51 core 92912087 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.72 bits/word
2019-12-02 02:25:51 core OpenCL args "-DEXP=92912087u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x9.b3f5913600238p-3 -DIWEIGHT_STEP=0xd.311c9cb7274a8p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-12-02 02:25:53 core OpenCL compilation in 2060 ms
2019-12-02 02:26:39 core 92912081 P2 GCD: no factor
2019-12-02 02:26:39 core {"exponent":"92912081", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-11-gfaaa2f2"}, "timestamp":"2019-12-02 10:26:39 UTC", "user":"kracker", "computer":"core", "aid":"----", "fft-length":5242880, "B1":720000, "B2":13680000}
2019-12-02 06:07:51 core 92912087 P1 B1=720000, B2=13680000; 1038539 bits; starting at 0
2019-12-02 06:08:38 core 92912087 P1 10000 0.96%; 4698 us/sq; ETA 0d 01:21; b195a86475b0f7e5
[/code]
[QUOTE=ewmayer;531880]Why so many mulmod-equivalents? Just forward-FFT the pure-integer f(t) read from the savefile and do a 2-input FFT-modmul as usual.[/QUOTE]
I see it, you're right. |
Multiple instances not always better
It's been reported that on Radeon VII, running two instances improves total throughput, and throughput per watt-hour.
I found a case where two very different instances together achieve only ~95% of single-instance throughput. This case combines very different gpuowl versions, computations (LL vs. PRP3), exponents, and therefore FFT lengths. Windows 10, Lenovo ThinkStation D30, XFX Radeon VII; stock settings.

Each instance alone:
gpuowl v0.6: 1.005 ms/iter (50330737 LL DC, 4M FFT) = 995 iter/sec
gpuowl v6.11: 1.193 ms/iter (89260099 PRP, 5M FFT) = 838 iter/sec

The two disparate instances run together:
gpuowl v0.6: 2.161 ms/iter = 463 iter/sec; throughput 463/995 = 0.4651 of solo
gpuowl v6.11: 2.458 ms/iter = 407 iter/sec; throughput 407/838 = 0.4855 of solo

Combined: 0.4651 + 0.4855 = 0.9506 < 1. Noticeably slower running together.
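The combined-throughput arithmetic above can be reproduced directly from the ms/iter figures:

```python
# ms/iter for each gpuowl instance, alone and when run concurrently
solo_v06, solo_v611 = 1.005, 1.193
dual_v06, dual_v611 = 2.161, 2.458

# Each instance's throughput as a fraction of its own solo rate;
# the sum is the combined throughput relative to one solo instance.
frac = solo_v06 / dual_v06 + solo_v611 / dual_v611
print(f"combined throughput: {frac:.4f} of solo")  # just under 1
```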