mersenneforum.org gpuOwL: an OpenCL program for Mersenne primality testing

2020-03-27, 21:32   #1992
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

3×2,281 Posts

Quote:
 Originally Posted by ewmayer George, any update on the exponent ranges in question?
Sort of.

Short answer: take the maximum exponent for the FFT length and allow log2(3) = 1.585 fewer bits per word; that compensates for the mul-by-3 in P-1. Long answer: you can go a little higher than that, because the maximum exponent has some carry32 "head room" during PRP.

What you are worried about is the absolute value of the 32-bit carry exceeding 0x80000000. I studied 500K iterations of 24518003 in a 1.25M FFT (18.706 bits-per-word). The maximum carry32 was 0x32420000. Fine for PRP, not so for the mul-by-3 step in P-1.

Next (well, actually first) I tried to calculate a reasonable max exponent for 1.25M, 2.5M, 5M, 10M, 20M, 40M, and 80M FFTs. We can store roughly 0.261 fewer bits per FFT word for each doubling of the FFT length.

The formula for expected max carry32 during the mul-by-3 P-1 step should be:

3 * 0x32420000 * 2^(BPW - 18.706) * 2 ^ (log2(FFTLEN/1.25M) * .261)

If this max exceeds 0x70000000 I'd be worried. I'm thinking less than 0x67000000 should be very safe. It's all a matter of how much protection you want from an outlier value (much the same as protecting against outlier round off errors).
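
A minimal Python sketch of the formula above, for anyone who wants to plug in their own exponent and FFT length. The constants are the ones quoted in this post; the function names and the wording of the classification labels are my own assumptions, not anything from gpuowl.

```python
import math

# Reference measurement from the post: 500K iterations of 24518003 in a 1.25M FFT.
REF_MAX_CARRY = 0x32420000       # largest carry32 seen during PRP
REF_BPW = 18.706                 # bits per word of the reference run
REF_FFT_LEN = 1.25 * 2**20       # 1.25M words
BPW_LOSS_PER_DOUBLING = 0.261    # fewer bits per word per FFT-length doubling
MUL = 3                          # the mul-by-3 step in P-1


def expected_max_carry32(exponent, fft_len):
    """Estimated max |carry32| for the mul-by-3 P-1 step (formula from the post)."""
    bpw = exponent / fft_len
    doublings = math.log2(fft_len / REF_FFT_LEN)
    return (MUL * REF_MAX_CARRY
            * 2 ** (bpw - REF_BPW)
            * 2 ** (doublings * BPW_LOSS_PER_DOUBLING))


def classify(carry):
    """Apply the thresholds quoted in the post; the labels are illustrative."""
    if carry >= 0x80000000:
        return "overflow"
    if carry > 0x70000000:
        return "worrying"
    if carry < 0x67000000:
        return "very safe"
    return "fairly safe"


# The reference exponent itself: fine for PRP, but the mul-by-3 pushes it past 2^31.
print(classify(expected_max_carry32(24518003, REF_FFT_LEN)))
```

For the reference run this reports "overflow", which matches "fine for PRP, not so for the mul-by-3 step". An exponent near 103M in a 5.5M FFT comes out as "worrying" under the same estimate.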

Let's see if an example works. Going for a fairly safe max carry32 of 0x70000000 in a 5M FFT:

0x70000000 = 3 * 0x32420000 * 2^BPW * 2^-18.706 * 2^(2 * .261)
BPW = log2 (0x70000000 / 3 / 0x32420000) + 18.706 - .522
BPW = 17.755
max exp for 5M FFT = 93.1M

similarly for a 5.5M FFT, max exp = 102.2M
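
The worked example can be reproduced by solving the formula for BPW. This is a sketch under the same constants as the post; `max_bpw` and `max_exponent` are hypothetical helper names, and the default limit of 0x70000000 is just the value used in the example.

```python
import math

REF_MAX_CARRY = 0x32420000       # max carry32 observed in the 1.25M reference run
REF_BPW = 18.706                 # bits per word of that run
REF_FFT_LEN = 1.25 * 2**20       # 1.25M words
BPW_LOSS_PER_DOUBLING = 0.261    # fewer bits per word per FFT-length doubling


def max_bpw(fft_len, carry_limit=0x70000000, mul=3):
    """Largest bits-per-word that keeps the estimated carry32 at carry_limit."""
    doublings = math.log2(fft_len / REF_FFT_LEN)
    return (math.log2(carry_limit / mul / REF_MAX_CARRY)
            + REF_BPW
            - doublings * BPW_LOSS_PER_DOUBLING)


def max_exponent(fft_len, carry_limit=0x70000000, mul=3):
    """Max exponent = bits per word times number of FFT words."""
    return max_bpw(fft_len, carry_limit, mul) * fft_len


for m in (5, 5.5):
    n = m * 2**20
    print(f"{m}M FFT: BPW = {max_bpw(n):.3f}, max exp = {max_exponent(n) / 1e6:.1f}M")
```

This gives BPW = 17.755 and 93.1M for the 5M FFT, and 102.2M for the 5.5M FFT, in line with the figures above. Passing a smaller `carry_limit` such as 0x67000000 gives the more conservative bound.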

2020-03-27, 23:33   #1993
ewmayer
2ω=0

Sep 2002
República de California

2·5,591 Posts

Quote:
 Originally Posted by Prime95 Sort of. Short answer is "maximum exponent for the FFT length" - "log2(3) = 1.585 fewer bits per word" compensates for the mul-by-3 in P-1. Long answer is you can go a little higher than that because the maximum exponent has some carry32 "head room" during PRP. Here is the long answer: What you are worried about is the absolute value of the 32-bit carry exceeding 0x80000000. I studied 500K iterations of 24518003 in a 1.25M FFT (18.706 bits-per-word). The maximum carry32 was 0x32420000. Fine for PRP, not so for the mul-by-3 step in P-1. Next (well actually first) I tried to calculate a reasonable max exponent for 1.25M, 2.5M, 5M, 10M, 20M, 40M, 80M exponents. We can store roughly 0.261 fewer bits per FFT word for each doubling of the FFT length. The formula for expected max carry32 during the mul-by-3 P-1 step should be: 3 * 0x32420000 * 2^(BPW - 18.706) * 2 ^ (log2(FFTLEN/1.25M) * .261) If this max exceeds 0x70000000 I'd be worried. I'm thinking less than 0x67000000 should be very safe. It's all a matter of how much protection you want from an outlier value (much the same as protecting against outlier round off errors). Let's see if an example works. Going for a fairly safe max carry32 of 0x70000000 in a 5M FFT: 0x70000000 = 3 * 0x32420000 * 2^BPW * 2^-18.706 * 2^(2 * .261) BPW = log2 (0x70000000 / 3 / 0x32420000) + 18.706 - .522 BPW = 17.755 max exp for 5M FFT = 93.1M
Thanks for the explainer. Looking at my own scalar-double carry macro - here all vars are doubles, x is the convolution output we are normalizing, wi_re is the inverse DWT weight (the 1/n is absorbed into that), prp_mult is your 3, cy is carryin from next-lower iFFT term (and re-used for carryout):
Code:
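x *= wi_re;                     /* apply inverse DWT weight (1/n folded in) */\
temp = DNINT(x);                /* round to nearest integer */\
frac = fabs(x-temp);            /* roundoff error of this term */\
temp = temp*prp_mult + cy;      /* multiply by prp_mult, add carry-in */\
cy   = DNINT(temp*baseinv[i]);  /* carry-out to the next word */\
x = (temp-cy*base[i])*wt_re;    /* normalized word, re-weighted */\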
I'm guessing using all-doubles is not a good option for your target hardware.
Question: In the 4th of those 6 lines, both temp and cy can be of either sign, but temp, the rounded-but-not-yet-wordsize-normalized iFFT term, is nearly always going to be much larger in magnitude than the +cy carryin. So we should be able to infer the expected sign of the next line's carryout from it and, in your case, use that to tell whether the integer result has overflowed into the sign bit, yes?
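
To make the overflow concern concrete, here is a minimal Python sketch of the six macro lines for a single word, plus a range check on the carry. The variable names follow the macro; `carry_step` and `fits_int32` are hypothetical names, the weights are set to 1.0 for simplicity, and Python's `round()` stands in for DNINT (the two may treat exact ties differently).

```python
def carry_step(x, cy, wi_re, wt_re, base, prp_mult=3):
    """One word of the normalize-and-carry pass, mirroring the six macro lines."""
    x *= wi_re                       # apply inverse DWT weight
    temp = round(x)                  # nearest integer
    frac = abs(x - temp)             # roundoff error of this term
    temp = temp * prp_mult + cy      # apply the multiplier, add carry-in
    cy = round(temp / base)          # carry-out to the next word
    x = (temp - cy * base) * wt_re   # normalized word, re-weighted
    return x, cy, frac


def fits_int32(cy):
    """True if the carry would survive being stored in a signed 32-bit integer."""
    return -0x80000000 <= cy <= 0x7FFFFFFF


# Same convolution output, with and without the mul-by-3 (18-bit word).
x_in = float(2**48)
_, cy_plain, _ = carry_step(x_in, 0, 1.0, 1.0, 2**18, prp_mult=1)
_, cy_mul3, _ = carry_step(x_in, 0, 1.0, 1.0, 2**18, prp_mult=3)
print(fits_int32(cy_plain), fits_int32(cy_mul3))
```

With these illustrative inputs the plain carry is 2^30 and fits, while the mul-by-3 carry is 3·2^30 and does not, which is the situation the carry32 head-room estimate is guarding against.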
Quote:
 similarly for a 5.5M FFT, max exp = 102.2M
Ruh-roh - I've been doing p-1 work at 5.5M for exp ~= 103M, using the checkin of last week. Should I halt my runs and restart with CARRY64 enabled?

Last fiddled with by ewmayer on 2020-03-27 at 23:36

2020-03-27, 23:39   #1994
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

15273₈ Posts

Quote:
 Originally Posted by ewmayer Ruh-roh - I've been doing p-1 work at 5.5M for exp ~= 103M, using the checkin of last week. Should I halt my runs and restart with CARRY64 enabled?
Yes, but I would not redo those P-1.

2020-03-28, 00:16   #1995
ewmayer
2ω=0

Sep 2002
República de California

25656₈ Posts

Quote:
 Originally Posted by Prime95 Yes, but I would not redo those P-1.
I don't see CARRY64 in the readme - is that an undocumented cmd-line flag?

One of my 2 runs just finished a PRP and started p-1 on the next expo - I killed it, deleted savefiles and restarted the p-1 job @6144K - assuming that finds no factor, will the ensuing PRP of the same expo automatically switch back to 5632K?

Oh, small UI suggestion: -fft 6144 for the above gave an "FFT too small" error, i.e. the UI needs the raw FFT length, in this case 6291456. It was a little annoying to have the resulting run immediately echo something to the effect of "starting run with FFT length 6144K". Could the -fft option be fiddled to use FFT length in K?

2020-03-28, 01:45   #1996
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

1101010111011₂ Posts

Quote:
 Originally Posted by ewmayer I don't see CARRY64 in the readme - is that an undocumented cmd-line flag?
-use CARRY64

2020-03-28, 01:48   #1997
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

3×2,281 Posts

Quote:
 Originally Posted by ewmayer One of my 2 runs just finished a PRP and started p-1 on the next expo - I killed it, deleted savefiles and restarted the p-1 job @6144K - assuming that finds no factor, will the ensuing PRP of the same expo automatically switch back to 5632K?
I don't know.

Quote:
 Oh, small UI suggestion: -fft 6144 for the above gave an "FFT too small" error, i.e. the UI needs the raw FFT length, in this case 6291456. It was a little annoying to have the resulting run immediately echo something to the effect of "starting run with FFT length 6144K". Could the -fft option be fiddled to use FFT length in K?
-fft 6M works, as does -fft 6144K.
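
For illustration, the K and M suffixes map to raw FFT lengths as in the Python sketch below. This is not gpuowl's parsing code; `parse_fft_size` is a hypothetical helper, and accepting fractional values such as "5.5M" is my own assumption.

```python
def parse_fft_size(spec):
    """Turn '6M', '6144K' or a raw word count into a raw FFT length."""
    spec = spec.strip().upper()
    scale = {"K": 1024, "M": 1024 * 1024}.get(spec[-1:], 1)
    digits = spec[:-1] if scale != 1 else spec   # drop the suffix if there is one
    return int(float(digits) * scale)


for spec in ("6M", "6144K", "6291456", "6144"):
    print(spec, "->", parse_fft_size(spec))
```

The first three all give 6291456, while a bare 6144 is taken as a raw length of 6144 words, which is why it was rejected as too small.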

2020-03-28, 11:48   #1998
preda

"Mihai Preda"
Apr 2015

10000010100₂ Posts

Quote:
 Originally Posted by ewmayer One of my 2 runs just finished a PRP and started p-1 on the next expo - I killed it, deleted savefiles and restarted the p-1 job @6144K - assuming that finds no factor, will the ensuing PRP of the same expo automatically switch back to 5632K?
I suppose you passed a -fft command line argument. Then it will affect all the tasks, thus will affect the PRP as well. (i.e. will not switch back to the default FFT)

Last fiddled with by preda on 2020-03-28 at 11:49

2020-03-28, 16:49   #1999
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7411₈ Posts

3 strikes you're out, game over until tomorrow

gpuowl could handle error cases more gracefully. Luckily I stumbled across this one while handling something else. Otherwise it could have cost nearly a day's throughput on that gpu.

Please consider commenting out a problematic worktodo line and continuing on with the next in such a case, instead of killing the run.

Also, since config.txt optimization content is fft length dependent, what's optimal for one fft length can be fatal for another. Please consider fft-length-specific enhancement to config.txt, as mentioned before.
Code:
2020-03-28 10:23:18 condorella/rx480 CC 94418041 / 94418041, 4d816a6edf6393__
2020-03-28 10:23:20 condorella/rx480 {"exponent":"94418041", "worktype":"PRP-3", "status":"C", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-03-28 15:23:20 UTC", "user":"kriesel", "computer":"condorella/rx480", "aid": "(redacted)", "fft-length":5242880, "res64":"4d816a6edf6393__", "residue-type":1, "errors":{"gerbicz":0}}
2020-03-28 10:23:21 condorella/rx480 131500093 FFT 7168K: Width 256x4, Height 64x8, Middle 7; 17.92 bits/word
2020-03-28 10:23:22 condorella/rx480 OpenCL args "-DEXP=131500093u -DWIDTH=1024u -DSMALL_HEIGHT=512u -DMIDDLE=7u -DWEIGHT_STEP=0x8.7b964bd91a558p-3 -DIWEIGHT_STEP=0xf.16e489ea55fc8p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-03-28 10:23:25 condorella/rx480 OpenCL compilation in 3.68 s
2020-03-28 10:23:28 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-03-28 10:23:35 condorella/rx480 131500093 EE 800 0.00%; 5251 us/it; ETA 7d 23:49; 6781adfa7991c92a (check 2.29s)
2020-03-28 10:23:37 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-03-28 10:23:44 condorella/rx480 131500093 EE 800 0.00%; 5251 us/it; ETA 7d 23:48; 6781adfa7991c92a (check 2.29s) 1 errors
2020-03-28 10:23:46 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-03-28 10:23:53 condorella/rx480 131500093 EE 800 0.00%; 5255 us/it; ETA 7d 23:58; 6781adfa7991c92a (check 2.30s) 2 errors
2020-03-28 10:23:53 condorella/rx480 3 sequential errors, will stop.
2020-03-28 10:23:53 condorella/rx480 Exiting because "too many errors"
2020-03-28 10:23:53 condorella/rx480 Bye

C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>g611
C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>title gpuowl-v6.11-134-g1e0ce1d/rx480
C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>gpuowl-win
2020-03-28 11:27:31 gpuowl v6.11-134-g1e0ce1d

Last fiddled with by kriesel on 2020-03-28 at 16:49

2020-03-28, 20:49   #2000
ewmayer
2ω=0

Sep 2002
República de California

10101110101110₂ Posts

@George - thanks, I missed the K and M suffix options in my perusal of the readme.

Quote:
 Originally Posted by preda I suppose you passed a -fft command line argument. Then it will affect all the tasks, thus will affect the PRP as well. (i.e. will not switch back to the default FFT)
I checked at end of the forced-6144K p-1 run; it indeed started the ensuing PRP at the same FFT length, so killed and restarted sans -fft flag.

All runs now restarted using -use CARRY64 -- thanks, George. Also, you'll be pleased to hear that after the latest BSOD-style crash of the Haswell system which hosts my Radeon VII, I finally got round to trying the disable-C-states trick you recommend in the BIOS Overclock submenu - seems to work like a charm, system has been rock-stable since, uptime 4 days and counting, which is really long for this system.

More details on what happens for me with -use CARRY64, in the context of 2 side-by-side PRP runs @5632K:

o Initially, each run going at a steady 1386 us/iter at my sclk=4 setting;
o Stop run 0 & restart with -use CARRY64, after a few more 200Kiter intervals, both jobs are up to 1402 us/iter, which seems weird since only one is using the slower-but-safe carry option;
o Stop run 1 & restart with -use CARRY64, after a few more 200Kiter intervals, both jobs are up to 1420 us/iter, a 2.5% hit to throughput.

Since the bug only affects p-1 runs, would it be difficult to tweak things so that a user-specified -use CARRY64 is only operative in p-1 testing? Or maybe allow separate specification by job type, e.g. -use CARRY64 means both worktypes, -pm1 CARRY64 means apply to p-1, -prp CARRY64 means apply to prp runs?

2020-03-29, 09:04   #2001
preda

"Mihai Preda"
Apr 2015

414₁₆ Posts

Quote:
 Originally Posted by ewmayer Since the bug only affects p-1 runs, would it be difficult to tweak things so that a user-specified -use CARRY64 is only operative in p-1 testing? Or maybe allow separate specification by job type, e.g. -use CARRY64 means both worktypes, -pm1 CARRY64 means apply to p-1, -prp CARRY64 means apply to prp runs?
I just committed a change that makes CARRY64 the default for P-1, and CARRY32 the default for PRP on AMD. The rationale is that if CARRY32 is not appropriate, this will be visible in PRP, so it is safe there; on P-1 we use the safe default (i.e. CARRY64) until we have a better solution.

2020-03-29, 10:10   #2002
preda

"Mihai Preda"
Apr 2015

2²×3²×29 Posts

Quote:
 Originally Posted by kriesel Gpuowl stage 1 needs a res64 error check.
But Ken, what is the appropriate action to take on error?

Let's say that during P-1 stage 1, a residue==0 is detected. This is not the result of an inappropriate FFT size; it indicates a hardware error. But what to do in this situation, given there's no way to check P-1? Basically, what makes sense is for gpuowl to simply stop doing any P-1 (on that GPU). It can't reliably roll back to any trusted point. Assuming that a residue that is != 0 is a correct one, in the situation where the GPU sometimes produces res==0, is not a good way to go. I would just discard the whole test as corrupted.
