mersenneforum.org  

2020-03-27, 21:32   #1992
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
3×2,281 Posts

Quote:
Originally Posted by ewmayer
George, any update on the exponent ranges in question?
Sort of.

Short answer: taking the "maximum exponent for the FFT length" and subtracting log2(3) = 1.585 bits per word compensates for the mul-by-3 in P-1. Long answer: you can go a little higher than that, because the maximum exponent has some carry32 "head room" during PRP.

Here is the long answer:

What you are worried about is the absolute value of the 32-bit carry exceeding 0x80000000. I studied 500K iterations of 24518003 in a 1.25M FFT (18.706 bits-per-word). The maximum carry32 was 0x32420000. That is fine for PRP, but not for the mul-by-3 step in P-1, where 3 * 0x32420000 = 0x96C60000 exceeds 0x80000000.

Next (well, actually first) I tried to calculate a reasonable max exponent for 1.25M, 2.5M, 5M, 10M, 20M, 40M, 80M FFT lengths. We can store roughly 0.261 fewer bits per FFT word for each doubling of the FFT length.

The formula for expected max carry32 during the mul-by-3 P-1 step should be:

3 * 0x32420000 * 2^(BPW - 18.706) * 2 ^ (log2(FFTLEN/1.25M) * .261)

If this max exceeds 0x70000000 I'd be worried. I'm thinking less than 0x67000000 should be very safe. It's all a matter of how much protection you want from an outlier value (much the same as protecting against outlier round off errors).

Let's see if an example works. Going for a fairly safe max carry32 of 0x70000000 in a 5M FFT:

0x70000000 = 3 * 0x32420000 * 2^BPW * 2^-18.706 * 2^(2 * .261)
BPW = log2 (0x70000000 / 3 / 0x32420000) + 18.706 - .522
BPW = 17.755
max exp for 5M FFT = 93.1M

Similarly, for a 5.5M FFT, max exp = 102.2M.
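
The worked example can be reproduced with a short script. This is only a sketch under the assumptions stated in the post: the reference point (max carry32 of 0x32420000 at 18.706 bits per word in a 1.25M FFT), a loss of 0.261 bits per word per FFT doubling, and "M" meaning 2^20 words. The function and constant names are made up for illustration and are not taken from gpuowl.

```python
import math

# Reference measurement from the post (500K PRP iterations of 24518003).
REF_CARRY = 0x32420000          # largest |carry32| observed
REF_BPW = 18.706                # bits per word at the reference point
REF_FFT_LEN = 1.25 * 2**20      # 1.25M words, assuming M = 2**20
BPW_LOSS_PER_DOUBLING = 0.261   # bits per word lost per FFT doubling
MUL = 3                         # the mul-by-3 in the P-1 step


def expected_max_carry(bpw, fft_len):
    """Expected max carry32 in the mul-by-3 step, per the formula in the post."""
    doublings = math.log2(fft_len / REF_FFT_LEN)
    return (MUL * REF_CARRY
            * 2 ** (bpw - REF_BPW)
            * 2 ** (doublings * BPW_LOSS_PER_DOUBLING))


def max_exponent(fft_len, carry_limit=0x70000000):
    """Solve the same formula for bits per word; return (bpw, max exponent)."""
    doublings = math.log2(fft_len / REF_FFT_LEN)
    bpw = (math.log2(carry_limit / MUL / REF_CARRY)
           + REF_BPW
           - doublings * BPW_LOSS_PER_DOUBLING)
    return bpw, bpw * fft_len


if __name__ == "__main__":
    for fft_m in (5, 5.5):
        bpw, exponent = max_exponent(fft_m * 2**20)
        print(f"{fft_m}M FFT: BPW = {bpw:.3f}, max exp = {exponent / 1e6:.1f}M")
```

With the 0x70000000 limit this gives about 17.755 bits per word (93.1M) for a 5M FFT and about 102.2M for 5.5M, matching the figures above. Passing carry_limit=0x67000000 gives the more conservative bound.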

2020-03-27, 23:33   #1993
ewmayer
2ω=0
Sep 2002
República de California
2·5,591 Posts

Thanks for the explainer. Looking at my own scalar-double carry macro - here all vars are doubles, x is the convolution output we are normalizing, wi_re is the inverse DWT weight (the 1/n is absorbed into that), prp_mult is your 3, cy is carryin from next-lower iFFT term (and re-used for carryout):
Code:
x *= wi_re;\
temp = DNINT(x);\
frac = fabs(x-temp);\
temp = temp*prp_mult + cy;\
cy   = DNINT(temp*baseinv[i]);\
x = (temp-cy*base[i])*wt_re;\
I'm guessing using all-doubles is not a good option for your target hardware.
Question: In the 4th of those 6 lines, both temp and cy can be of either sign, but temp, the rounded-but-not-yet-wordsize-normalized iFFT term, is nearly always going to be much larger in magnitude than the +cy carryin. So we should be able to infer the expected sign of the next line's carryout computation from it, and use that to see whether - in your case - the integer result overflowed into the sign bit, yes?
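
A rough sketch of that sign check, for illustration only. It uses exact Python integers rather than the doubles of the macro above, it is not gpuowl's OpenCL carry code, and the helper names and the 18-bit word size are assumptions. It also only catches wraps that flip the sign bit; a carry large enough to wrap all the way back to the expected sign would go unnoticed.

```python
def wrap_int32(v):
    """Reduce an exact integer to what a signed 32-bit register would hold."""
    return (v + 2**31) % 2**32 - 2**31


def carry_out(temp, bits):
    """Round-to-nearest carry of temp with respect to base 2**bits."""
    base = 1 << bits
    return (temp + base // 2) >> bits   # floor((temp + base/2) / base)


def sign_mismatch(temp, stored_carry):
    """True if a nonzero stored carry disagrees in sign with temp."""
    return stored_carry != 0 and (temp > 0) != (stored_carry > 0)


BITS = 18  # assumed word size, for illustration only

# An un-normalized term whose exact carry equals the largest carry32 seen in PRP,
# and the same term after the mul-by-3 of the P-1 step.
temp_prp = 0x32420000 << BITS
temp_pm1 = 3 * temp_prp

if __name__ == "__main__":
    for name, temp in (("PRP", temp_prp), ("P-1", temp_pm1)):
        exact = carry_out(temp, BITS)
        stored = wrap_int32(exact)
        print(name, hex(exact), stored, sign_mismatch(temp, stored))
```

In the P-1 case the exact carry 0x96C60000 no longer fits in 32 signed bits, so the stored value comes out negative while temp is positive, which is the mismatch the question is about.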
Quote:
similarly for a 5.5M FFT, max exp = 102.2M
Ruh-roh - I've been doing p-1 work at 5.5M for exp ~= 103M, using the checkin of last week. Should I halt my runs and restart with CARRY64 enabled?

Last fiddled with by ewmayer on 2020-03-27 at 23:36

2020-03-27, 23:39   #1994
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
15273₈ Posts

Quote:
Originally Posted by ewmayer
Ruh-roh - I've been doing p-1 work at 5.5M for exp ~= 103M, using the checkin of last week. Should I halt my runs and restart with CARRY64 enabled?
Yes, but I would not redo those P-1.

2020-03-28, 00:16   #1995
ewmayer
2ω=0
Sep 2002
República de California
25656₈ Posts

Quote:
Originally Posted by Prime95
Yes, but I would not redo those P-1.
I don't see CARRY64 in the readme - is that an undocumented cmd-line flag?

One of my 2 runs just finished a PRP and started p-1 on the next expo - I killed it, deleted the savefiles and restarted the p-1 job @6144K. Assuming that finds no factor, will the ensuing PRP of the same expo automatically switch back to 5632K?

Oh, small UI suggestion: -fft 6144 for the above gave an "FFT too small" error, i.e. the UI needs the raw FFT length, in this case 6291456. It was a little annoying to have the resulting run immediately echo something to the effect of "starting run with FFT length 6144K". Could the -fft option be fiddled to use FFT length in K?

2020-03-28, 01:45   #1996
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
1101010111011₂ Posts

Quote:
Originally Posted by ewmayer
I don't see CARRY64 in the readme - is that an undocumented cmd-line flag?
-use CARRY64

2020-03-28, 01:48   #1997
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
3×2,281 Posts

Quote:
Originally Posted by ewmayer
One of my 2 runs just finished a PRP and started p-1 on the next expo - I killed it, deleted the savefiles and restarted the p-1 job @6144K. Assuming that finds no factor, will the ensuing PRP of the same expo automatically switch back to 5632K?
I don't know.

Quote:
Oh, small UI suggestion: -fft 6144 for the above gave an "FFT too small" error, i.e. the UI needs the raw FFT length, in this case 6291456. It was a little annoying to have the resulting run immediately echo something to the effect of "starting run with FFT length 6144K". Could the -fft option be fiddled to use FFT length in K?
-fft 6M works, as does -fft 6144K.
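
To see that the three spellings name the same length, here is a small illustrative parser. It is a sketch only: parse_fft_size is a hypothetical helper, not gpuowl's actual option handling, and it assumes K = 1024 and M = 1024 * 1024, which is consistent with 6144K = 6291456 above.

```python
def parse_fft_size(text):
    """Convert '6M', '6144K' or a raw word count into an FFT length in words."""
    text = text.strip().upper()
    multipliers = {"K": 1024, "M": 1024 * 1024}
    if text and text[-1] in multipliers:
        # float() allows fractional sizes such as '5.5M'
        return int(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


if __name__ == "__main__":
    for spelling in ("6M", "6144K", "6291456", "6144"):
        print(spelling, "->", parse_fft_size(spelling))
```

Under this reading a bare 6144 is taken as 6144 words, which would explain the "FFT too small" error.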

2020-03-28, 11:48   #1998
preda
"Mihai Preda"
Apr 2015
10000010100₂ Posts

Quote:
Originally Posted by ewmayer
One of my 2 runs just finished a PRP and started p-1 on the next expo - I killed it, deleted the savefiles and restarted the p-1 job @6144K. Assuming that finds no factor, will the ensuing PRP of the same expo automatically switch back to 5632K?
I suppose you passed a -fft command line argument. That affects all tasks, so it will affect the PRP as well (i.e. it will not switch back to the default FFT).

Last fiddled with by preda on 2020-03-28 at 11:49

2020-03-28, 16:49   #1999
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7411₈ Posts
3 strikes you're out, game over until tomorrow

gpuowl could handle error cases more gracefully. Luckily I stumbled across this one while handling something else. Otherwise it could have cost nearly a day's throughput on that gpu.

Please consider commenting out a problematic worktodo line and continuing on with the next in such a case, instead of killing the run.
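
As a sketch of the requested behaviour (not how gpuowl works today): everything below is hypothetical, including the run_task callback, the use of "#" as a comment marker, and the placeholder entries, which do not follow the real worktodo line format.

```python
def run_worktodo(lines, run_task, max_errors=3):
    """Run each entry; comment out an entry that fails max_errors times in a row.

    run_task(line) is a caller-supplied function returning True on success.
    Returns the lines that should remain in the file afterwards: completed
    entries are dropped, failed ones are kept but commented out.
    """
    remaining = []
    for line in lines:
        # Pass blank lines and already-commented lines through untouched.
        if not line.strip() or line.lstrip().startswith("#"):
            remaining.append(line)
            continue
        errors = 0
        while errors < max_errors and not run_task(line):
            errors += 1
        if errors >= max_errors:
            remaining.append("# skipped after %d errors: %s" % (errors, line))
    return remaining


if __name__ == "__main__":
    # Placeholder entries only; real worktodo lines have their own format.
    entries = ["entry-for-131500093", "entry-for-another-exponent"]
    fails_on_first_entry = lambda line: "131500093" not in line
    print(run_worktodo(entries, fails_on_first_entry))
```

The point is only that three sequential errors on one entry would park that entry and move on, instead of ending the whole run.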

Also, since config.txt optimization content is fft length dependent, what's optimal for one fft length can be fatal for another.

Please consider fft-length-specific enhancement to config.txt, as mentioned before.
Code:
2020-03-28 10:23:18 condorella/rx480 CC 94418041 / 94418041, 4d816a6edf6393__
2020-03-28 10:23:20 condorella/rx480 {"exponent":"94418041", "worktype":"PRP-3", "status":"C", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-03-28 15:23:20 UTC", "user":"kriesel", "computer":"condorella/rx480", "aid":"(redacted)", "fft-length":5242880, "res64":"4d816a6edf6393__", "residue-type":1, "errors":{"gerbicz":0}}
2020-03-28 10:23:21 condorella/rx480 131500093 FFT 7168K: Width 256x4, Height 64x8, Middle 7; 17.92 bits/word
2020-03-28 10:23:22 condorella/rx480 OpenCL args "-DEXP=131500093u -DWIDTH=1024u -DSMALL_HEIGHT=512u -DMIDDLE=7u -DWEIGHT_STEP=0x8.7b964bd91a558p-3 -DIWEIGHT_STEP=0xf.16e489ea55fc8p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DNO_ASM=1  -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-03-28 10:23:25 condorella/rx480 OpenCL compilation in 3.68 s
2020-03-28 10:23:28 condorella/rx480 131500093 OK        0 loaded: blockSize 400, 0000000000000003
2020-03-28 10:23:35 condorella/rx480 131500093 EE      800   0.00%; 5251 us/it; ETA 7d 23:49; 6781adfa7991c92a (check 2.29s)
2020-03-28 10:23:37 condorella/rx480 131500093 OK        0 loaded: blockSize 400, 0000000000000003
2020-03-28 10:23:44 condorella/rx480 131500093 EE      800   0.00%; 5251 us/it; ETA 7d 23:48; 6781adfa7991c92a (check 2.29s)
1 errors
2020-03-28 10:23:46 condorella/rx480 131500093 OK        0 loaded: blockSize 400, 0000000000000003
2020-03-28 10:23:53 condorella/rx480 131500093 EE      800   0.00%; 5255 us/it; ETA 7d 23:58; 6781adfa7991c92a (check 2.30s)
2 errors
2020-03-28 10:23:53 condorella/rx480 3 sequential errors, will stop.
2020-03-28 10:23:53 condorella/rx480 Exiting because "too many errors"
2020-03-28 10:23:53 condorella/rx480 Bye
C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>g611

C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>title gpuowl-v6.11-134-g1e0ce1d/rx480

C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>gpuowl-win
2020-03-28 11:27:31 gpuowl v6.11-134-g1e0ce1d

Last fiddled with by kriesel on 2020-03-28 at 16:49

2020-03-28, 20:49   #2000
ewmayer
2ω=0
Sep 2002
República de California
10101110101110₂ Posts

@George - thanks, I missed the K and M suffix options in my perusal of the readme.

Quote:
Originally Posted by preda
I suppose you passed a -fft command line argument. That affects all tasks, so it will affect the PRP as well (i.e. it will not switch back to the default FFT).
I checked at the end of the forced-6144K p-1 run; it indeed started the ensuing PRP at the same FFT length, so I killed it and restarted sans -fft flag.

All runs now restarted using -use CARRY64 -- thanks, George. Also, you'll be pleased to hear that after the latest BSOD-style crash of the Haswell system which hosts my Radeon VII, I finally got round to trying the disable-C-states trick you recommend in the BIOS Overclock submenu. It seems to work like a charm: the system has been rock-stable since, uptime 4 days and counting, which is really long for this system.

More details on what happens for me with -use CARRY64, in the context of 2 side-by-side PRP runs @5632K:

o Initially, each run going at a steady 1386 us/iter at my sclk=4 setting;
o Stop run 0 & restart with -use CARRY64, after a few more 200Kiter intervals, both jobs are up to 1402 us/iter, which seems weird since only one is using the slower-but-safe carry option;
o Stop run 1 & restart with -use CARRY64, after a few more 200Kiter intervals, both jobs are up to 1420 us/iter, a 2.5% hit to throughput.
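
The percentages behind those timings, as a quick check. The helper name is made up for illustration; the three us/iter figures are the ones listed above.

```python
def slowdown_percent(before_us, after_us):
    """Relative increase in time per iteration, in percent."""
    return (after_us - before_us) / before_us * 100.0


if __name__ == "__main__":
    for after in (1402, 1420):
        pct = slowdown_percent(1386, after)
        print(f"1386 -> {after} us/iter: {pct:.2f}% more time per iteration")
```

Going from 1386 to 1420 us/iter is about 2.45% more time per iteration, i.e. the roughly 2.5% quoted above; the intermediate 1402 us/iter is a bit over 1%.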

Since the bug only affects p-1 runs, would it be difficult to tweak things so that a user-specified -use CARRY64 is only operative in p-1 testing? Or maybe allow separate specification by job type, e.g. -use CARRY64 means both worktypes, -pm1 CARRY64 means apply to p-1, -prp CARRY64 means apply to prp runs?

2020-03-29, 09:04   #2001
preda
"Mihai Preda"
Apr 2015
414₁₆ Posts

Quote:
Originally Posted by ewmayer
Since the bug only affects p-1 runs, would it be difficult to tweak things so that a user-specified -use CARRY64 is only operative in p-1 testing? Or maybe allow separate specification by job type, e.g. -use CARRY64 means both worktypes, -pm1 CARRY64 means apply to p-1, -prp CARRY64 means apply to prp runs?
I just committed a change that makes CARRY64 the default for P-1, and CARRY32 the default for PRP on AMD. The rationale is that if CARRY32 is not appropriate, this will be visible for PRP, and is thus safe; on P-1 we use the safe default (i.e. CARRY64) until we have a better solution there.
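
In rough terms the new defaults amount to the following. This is an illustrative sketch, not the committed code: carry_mode is a hypothetical function, it only covers the AMD case described here, and letting an explicit -use choice override the default is an assumption.

```python
def carry_mode(worktype, user_choice=None):
    """Pick the carry width for a task on an AMD GPU.

    worktype is "P-1" or "PRP"; user_choice stands for an explicit
    -use CARRY32 / -use CARRY64 and is assumed to take precedence.
    """
    if user_choice is not None:
        return user_choice
    # P-1 has no built-in check, so it gets the safe (wider) carry.
    # PRP keeps the faster 32-bit carry, since a bad carry would show up there.
    return "CARRY64" if worktype == "P-1" else "CARRY32"


if __name__ == "__main__":
    for worktype in ("P-1", "PRP"):
        print(worktype, "->", carry_mode(worktype))
```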

2020-03-29, 10:10   #2002
preda
"Mihai Preda"
Apr 2015
2²×3²×29 Posts

Quote:
Originally Posted by kriesel
Gpuowl stage 1 needs a res64 error check.
But Ken, what is the appropriate action to take on error?

Let's say that during P-1 stage 1, a residue==0 is detected. This is not the result of an inappropriate FFT size; it indicates a hardware error. But what to do in this situation? Given that there's no way to check P-1, what makes sense is for gpuowl to simply stop doing any P-1 (on that GPU). It can't reliably roll back to any trusted point. Assuming that a residue != 0 is correct, when the GPU sometimes produces res==0, is not a good way to go. I would just discard the whole test as corrupted.
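
Sketched as code, that policy would look something like this. It is hypothetical: the exception and function names are invented for illustration, and gpuowl has no such check at this point in the thread.

```python
class CorruptedTest(Exception):
    """Raised when a P-1 run can no longer be trusted."""


def check_stage1_residue(res64, exponent):
    """Abort the whole P-1 test on a zero residue instead of rolling back.

    There is no trusted earlier point to return to, so the only safe
    reaction sketched here is to give up on this test entirely.
    """
    if res64 == 0:
        raise CorruptedTest(
            "P-1 stage 1 on %d produced res64 == 0; discarding the test" % exponent
        )
    return res64
```

A non-zero residue passing this check says nothing about its correctness, which is the reason for discarding rather than retrying.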

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.