mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

preda 2020-02-13 08:46

OK, I checked myself. The problem with M15000031 is that by default it gets too small an FFT size for P-1 (it's at the border). If the FFT size is manually increased, the factor is found. I'll keep an eye on improving the default FFT size.


[QUOTE=kriesel;537444]Latest gpuowl commit is still missing the 15M[/QUOTE]

kriesel 2020-02-13 09:29

[QUOTE=preda;537482]Try with a smaller -maxAlloc. Can you check the free memory on the GPU -- how much is free before and during the gpuowl run?[/QUOTE]Other exponents ran ok with -maxAlloc 7500, allocating as much as 7272MB into P2 buffers. The way I run Colab, I normally can't check the gpu ram free during a gpuowl run. And the time window for doing so during a run that crashes so quickly in P2 is small. It appears from nvidia-smi output at session start, that since Colab gpus are on headless linux VMs, the initial occupied gpu ram is 0.

GPU model Idle Active
T4 0/15079 MiB 5939 gpuowl
P100 0/16280 MiB 293 mfaktc
P4 ? ?
K80 ? ?

Getting those last two is a matter of waiting to hit them in the Colab gpu model lottery. Models are listed in probability order, most frequent recently first.
I've added logging of idle and active nvidia-smi output to Google drive files in the Colab script. Colab "screen" output to the browser is lost when a new session is launched, the page is closed, or the data scrolls out of the 5000-line buffer. Based on my recent experience with my first Colab accounts, it could take ~5 weeks to get all four models allocated. Perhaps it will be quicker on this newer account.



preda 2020-02-13 10:20

Ken, I understand it's not easy to get all this information, and maybe it's not even needed. The situation is that I'm not yet convinced that there is a problem with the way GpuOwl handles maxAlloc or buffer allocation. I'm not convinced because I can imagine alternative explanations for the observed behavior. The alternative explanation is: maybe the GPU, even if it is reporting 8GB, does not have all of that actually available. Maybe it has less than 7.5GB actually available to be allocated in contiguous blocks of 9MB (for some reason). Thus, GpuOwl will fail if run with maxAlloc 7.5G, but that's not necessarily a bug in the program.

(all that because OpenCL does not offer a normal/reliable way to query actual free GPU memory)

[QUOTE=kriesel;537492]Other exponents ran ok with -maxAlloc 7500, allocating as much as 7272MB into P2 buffers. The way I run Colab, I normally can't check the gpu ram free during a gpuowl run. And the time window for doing so during a run that crashes so quickly in P2 is small. It appears from nvidia-smi output at session start, that since Colab gpus are on headless linux VMs, the initial occupied gpu ram is 0.

GPU model Idle Active
T4 0/15079 MiB 5939 gpuowl
P100 0/16280 MiB 293 mfaktc
P4 ? ?
K80 ? ?

To get those last two also is a matter of waiting to hit them in the Colab gpu model lottery. Models are listed in probability order, most frequent recently first.
I've added logging idle and active nvidia-smi output to Google drive files into the Colab script. Colab "screen" output to the browser is lost when a new session is launched, the page closed, or the data scrolls out of the 5000 line buffer. Based on my recent experience with my first Colab accounts it could take ~5 weeks to get all the 4 models allocated. Perhaps it will be quicker on this newer account.


[/QUOTE]

kriesel 2020-02-13 11:04

[QUOTE=preda;537493]Ken, I understand it's not easy to get all this information, and maybe it's not even needed. The situation is that I'm not yet convinced that there is a problem with the way GpuOwl handles maxAlloc or buffer allocation. I'm not convinced because I can imagine alternative explanations for the observed behavior. The alternative explanation is: maybe the GPU, even if it is reporting 8GB, does not have all of that actually available. Maybe it has less than 7.5GB actually available to be allocated in contiguous blocks of 9MB (for some reason). Thus, GpuOwl will fail if run with maxAlloc 7.5G, but that's not necessarily a bug in the program.

(all that because OpenCL does not offer a normal/reliable way to query actual free GPU memory)[/QUOTE]Some of the gpu models with ECC have odd usable ram amounts, because ECC storage is carved out of the power-of-two total ram complement; a Tesla C2075's nominal 6GB is 5.25GB (5376MB) net, for example.
I have two Colab accounts "instrumented" now to catch idle and active nvidia-smi output, which includes allocated & total MiB gpu ram, assuming the script additions aren't too buggy.
I agree that it seems unlikely to be a maxAlloc problem at a 20M test exponent; a smaller exponent succeeded on the P4 with 7272MB allocated and 7500 maxAlloc, while the 20M failed with maxAlloc set at 7500, 7300, and 7000. But we'll see.

FYI, gtx1080, gpuowl v6.11-134 P-1, stage 2, 443M exponent, 18 buffers x 224MB, nvidia-smi shows 7527/8192MiB active, vs 107 idle.

ewmayer 2020-02-13 21:04

[QUOTE=preda;537490]OK I checked myself, the problem with M15000031 is that by default it gets a too small FFT size for P-1 (it's at the border). If FFT size is manually increased the factor is found. I'll keep an eye on improving the default FFT size.[/QUOTE]

Shouldn't the too-small-FFT-size manifest via excessive-fractional-parts (a.k.a. roundoff errors) detected during the round-and-carry step?

mrh 2020-02-13 21:21

[QUOTE=preda;537490]OK I checked myself, the problem with M15000031 is that by default it gets a too small FFT size for P-1 (it's at the border). If FFT size is manually increased the factor is found. I'll keep an eye on improving the default FFT size.[/QUOTE]

Ah, that's good to know. I didn't think of that.

kriesel 2020-02-13 22:04

Are empty worktodo lines (just a newline) not allowed?[CODE]2020-02-13 19:19:40 colab2-TeslaT4 {"exponent":"10000831", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.11-145-g6146b6d-dirty"}, "timestamp":"2020-02-13 19:19:40 UTC", "user":"kriesel", "computer":"colab2-TeslaT4", "fft-length":524288, "B1":30000, "B2":500000, "factors":["646560662529991467527"]}
2020-02-13 19:19:40 colab2-TeslaT4 worktodo.txt line ignored: ""
terminate called after throwing an instance of 'char const*'
[/CODE]
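As a side note, the factor in that JSON line can be sanity-checked in a few lines (an illustrative Python sketch; the exponent and factor are taken from the log above):

```python
# Sanity check of the factor reported above: a true factor q of 2^p - 1
# must satisfy 2^p == 1 (mod q), and any divisor of a Mersenne number
# with prime exponent p has the form q = 2*k*p + 1.
p = 10000831
q = 646560662529991467527

assert pow(2, p, q) == 1          # q divides 2^p - 1
assert (q - 1) % (2 * p) == 0     # q has the 2*k*p + 1 form
k = (q - 1) // (2 * p)
print(f"q = 2*{k}*{p} + 1")
```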

preda 2020-02-14 07:19

[QUOTE=ewmayer;537526]Shouldn't the too-small-FFT-size manifest via excessive-fractional-parts (a.k.a. roundoff errors) detected during the round-and-carry step?[/QUOTE]

There is no excessive-fractional-parts detection in the round-and-carry. It does have overhead, and for PRP it isn't needed as the GEC provides better cover. This does leave P-1 unprotected.

kriesel 2020-02-14 17:21

[QUOTE=preda;537546]There is no excessive-fractional-parts detection in the round-and-carry. It does have overhead, and for PRP it isn't needed as the GEC provides better cover. This does leave P-1 unprotected.[/QUOTE]Gpuowl P-1 needs error detection. While for production wavefront running on fast gpus the individual run times are short, P-1 run times in the upper reaches of the exponent range are not (~4 days each on a Tesla P100 or Radeon VII near 10[SUP]9[/SUP]), and the cost of a P-1 computational error is missed factors and needless primality tests.

Error check possibilities include (in random order):
[LIST=1][*]the aforementioned excessive-fractional-parts detection in the round-and-carry[*]res64 check for problem values in stage 1 (0 is probably bad; check the full residue. I've had this occur with a recent commit of Gpuowl and unsuitable-for-the-fft-length -use options. [URL]https://mersenneforum.org/showpost.php?p=537396&postcount=1838[/URL]) A check for res64 of 1 occurs in CUDAPm1. And see #7[*]res64 check for repeats of the same value in stage 1. And see #7[*]Jacobi checks in selected portions of the computation, perhaps as an option since it is expensive, requiring computing both the correct Jacobi and the actual. There is a whole thread about that, at [URL]https://mersenneforum.org/showthread.php?t=23470&highlight=jacobi[/URL][*]screening -use options for suitability for the selected fft length (and this applies also to PRP)[*]initial-powering correctness check (a variation for P-1 of [URL]https://www.mersenneforum.org/showpost.php?p=515172&postcount=9[/URL]) This covers a broader range of erroneous results than merely detecting a zero result as in #2 above, but only in the relatively few iterations before the modulo kicks in. 
This will quickly detect very seriously pathological runs and may detect less drastic issues due to too-small fft length or inappropriate -use option or seriously malfunctioning hardware where the powering is failing within the initial dozens of iterations, by turning the initial iterations into a fast very low cost inline self-test.[*]statistical checks on stage 1 res64, and perhaps on other measures (see #3 of [URL]https://www.mersenneforum.org/showpost.php?p=515641&postcount=10[/URL])[QUOTE]In some pathological cases as little as 5 interim residues may be enough to identify an issue[/QUOTE]The statistical check detects 0 res64, repeating res64, short-period cyclic patterns, and a lot more, although it detects them a bit later than specific checks would.[*]automatic built-in generic short self-test for the same fft length, run with the same -use options as the P-1 worktodo run will use, immediately before the start or resume of the P-1 computation. This is a variation of [URL]https://www.mersenneforum.org/showpost.php?p=517162&postcount=14[/URL][*]"Gerbicz refers to a possible check on P-1, when there's a known factor, which is a very considerable restriction, in [URL="http://www.mersenneforum.org/showpost.php?p=470624&postcount=252"]http://www.mersenneforum.org/showpos...&postcount=252[/URL]" [URL]https://mersenneforum.org/showthread.php?t=23467[/URL][/LIST]What's the P-1 factoring error rate? We don't know. [URL]https://mersenneforum.org/showpost.php?p=509937&postcount=3[/URL]
Are there more P-1 error check possibilities?
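One of these, the known-factor consistency check (#9), can be sketched at toy scale (illustrative Python, not Gpuowl code; p=29 with its known factor 233 stands in for realistic sizes):

```python
# Sketch of check #9: given a known factor q of M(p) = 2^p - 1, run the
# stage-1 powering mod M(p) as usual and, in parallel, a cheap shadow
# computation mod q. A correct big computation, reduced mod q, must
# match the shadow; a mismatch flags a computation error.
from math import gcd

p, q = 29, 233                    # 233 is a known factor of 2^29 - 1
Mp = (1 << p) - 1
E = 2**5 * 3**3 * 5 * 7 * p       # toy stage-1 exponent (B1-smooth, times p)

big = pow(3, E, Mp)               # the "expensive" P-1 stage-1 residue
shadow = pow(3, E, q)             # cheap verification mod the known factor
assert big % q == shadow          # consistency check

# and stage 1 does find the factor here, since q-1 = 2^3 * 29 divides E:
assert gcd(big - 1, Mp) % q == 0
```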

ewmayer 2020-02-14 21:44

[QUOTE=preda;537546]There is no excessive-fractional-parts detection in the round-and-carry. It does have overhead, and for PRP it isn't needed as the GEC provides better cover. This does leave P-1 unprotected.[/QUOTE]

Does that imply that you had implemented per-output ROE checking at some point? If so, what kind of performance hit did you see?

I'm also just beginning to delve into the underlying code here, so you can answer this more quickly: what is the underlying hardware instruction set used by your code, and does it include the needed round() instruction? If so, what is the hardware latency and pipelineability of that, and how do you expect it to compare to the old coders' trick (developed before IEEE floating-point standardization and widespread use of dedicated round() instructions) of rnd(x) = (x + c) - c, where c = 0.75*2^[#significand bits in a floating datum] needing just an add and a sub?
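For reference, the add-and-sub trick can be demonstrated with IEEE-754 doubles (a Python sketch; Python floats are doubles, so c = 0.75*2^53 for the 53-bit significand):

```python
# The "old coders' trick": adding and then subtracting the magic constant
# c = 0.75 * 2^53 forces rounding to the nearest integer (ties to even),
# valid for |x| up to roughly 2^51. One add plus one sub, no round() call.
C = 0.75 * 2.0**53                # 6755399441055744.0

def rnd(x: float) -> float:
    return (x + C) - C

assert rnd(2.5) == 2.0            # ties round to even
assert rnd(3.5) == 4.0
assert rnd(-1.3) == -1.0
assert rnd(7.49) == 7.0
```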

If the % hit with-ROE-checking is significant even after choosing the best of the above options, how difficult would it be to deploy a special round-and-carry-with-ROE-checking routine, which would be invoked only during p-1 testing (and any other future modmul sequence for which a Gerbicz-style check is unavailable)? In order to gauge how many PRP tests the addition of ROE checking here would save, we need some data re. missed p-1 factors - perhaps there could be a dedicated near-term QA effort comparing factors found for a decently large representative set of expos, we could compare factors found by trying each expo twice using the same stage bounds:

[1] Using gpuOwl with default settings, i.e. default FFT length and no ROE checking;

[2] Using gpuOwl with next-larger-than-default FFT length, or - probably better - Prime95/mprime with default FFT param and same p-1 stage bounds as were used in [1].

Or perhaps there are already some stats in hand here, based on early-DCs of first-time PRP tests? Or do those PRP-DC runs skip the p-1 step?

kriesel 2020-02-15 00:19

[QUOTE=ewmayer;537604]If the % hit with-ROE-checking is significant even after choosing the best of the above options, how difficult would it be to deploy a special round-and-carry-with-ROE-checking routine, which would be invoked only during p-1 testing (and any other future modmul sequence for which a Gerbicz-style check is unavailable)?[/QUOTE]Such as the contemplated return of LL to gpuowl, for instance.
[QUOTE]In order to gauge how many PRP tests the addition of ROE checking here would save, we need some data re. missed p-1 factors - perhaps there could be a dedicated near-term QA effort comparing factors found for a decently large representative set of expos, we could compare factors found by trying each expo twice using the same stage bounds:

[1] Using gpuOwl with default settings, i.e. default FFT length and no ROE checking;

[2] Using gpuOwl with next-larger-than-default FFT length, or - probably better - Prime95/mprime with default FFT param and same p-1 stage bounds as were used in [1].

Or perhaps there are already some stats in hand here, based on early-DCs of first-time PRP tests? Or do those PRP-DC runs skip the p-1 step?[/QUOTE]Madpoo may be able to dig up stats on those, if/when he has the time and inclination. To ordinary users, it's not clear what software was used to produce P-1 results on the server. One approach is to slam all the CUDAPm1 selftest candidates through a gpuowl install. However, above 432M, that's testing gpuowl for finding factors it already found. Also the sample size is too small. Best bet seems to me to determine whether prime95 finds factors that gpuowl does not, by running gpuowl on a suitably sized set of prime95-found-factors exponents.

preda 2020-02-15 06:33

[QUOTE=kriesel;537580]Gpuowl P-1 needs error detection.[/QUOTE]

I'm thinking of a way to use GEC with P-1 first-stage, which would not be a waste if somebody is planning to continue with PRP on the same exponent (if a factor is not found).

The idea is to use right-to-left binary exponentiation, which can be covered to a large degree by the error check. The residue thus computed can be saved and used to start the PRP from that point on.

(Right now P-1 first-stage uses left-to-right binary exponentiation, which is more efficient but can't use the error check.)

[url]https://en.wikipedia.org/wiki/Modular_exponentiation[/url]
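Roughly, the two exponentiation orders look like this (an illustrative Python sketch, not gpuowl's kernels; the point is that right-to-left keeps a single repeated-squaring chain that an error check can cover):

```python
# Left-to-right squares one accumulator (mostly squarings, cheaper).
# Right-to-left keeps a second running product, but every squaring acts
# on the same chain b, b^2, b^4, ..., the shape a Gerbicz-style check
# can cover.
def pow_left_to_right(b, e, n):
    r = 1
    for bit in bin(e)[2:]:        # most significant bit first
        r = r * r % n
        if bit == '1':
            r = r * b % n
    return r

def pow_right_to_left(b, e, n):
    r, sq = 1, b
    while e:
        if e & 1:
            r = r * sq % n        # fold in the current power of b
        sq = sq * sq % n          # b -> b^2 -> b^4 -> ... (checkable chain)
        e >>= 1
    return r

M = (1 << 89) - 1
assert pow_left_to_right(3, 123456789, M) == pow(3, 123456789, M)
assert pow_right_to_left(3, 123456789, M) == pow(3, 123456789, M)
```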

Prime95 2020-02-15 07:33

@Ernst: The GCN timings doc is here -- [url]https://github.com/CLRX/CLRX-mirror/wiki/GcnTimings[/url]

ROE error checking would be slower but it would be useful debugging option to sanity check FFT length selections.

kriesel 2020-02-15 16:05

[QUOTE=kriesel;537488]It appears from nvidia-smi output at session start, that since Colab gpus are on headless linux VMs, the initial occupied gpu ram is 0.
T4 0/15079 MiB
P100 0/16280 MiB
K80 ?
[/QUOTE]
P4 [B]0/7611 MiB[/B]
Stay tuned for K80, haven't got one in a while.

kriesel 2020-02-15 17:25

[QUOTE=preda;537628]I'm thinking of a way to use GEC with P-1 first-stage, which would not be a waste if somebody is planning to continue with PRP on the same exponent (if a factor is not found).

The idea is to use right-to-left binary exponentiation, which can use to a large degree the error check. The residue thus computed can be saved and used to start the PRP from this point on.

(right now P-1 first-state uses left-to-right binary exponentiation, which is more efficient but can't use the error check).

[URL]https://en.wikipedia.org/wiki/Modular_exponentiation[/URL][/QUOTE]
Interesting. I had thought from [URL]https://www.mersenneforum.org/showthread.php?p=470624#post470624[/URL] and [URL]https://www.mersenneforum.org/showpost.php?p=468879&postcount=245[/URL] that a known factor was required to apply the GEC to P-1 or LL. Left-to-right exponentiation was also an obstacle for applying the Jacobi symbol check to much of P-1, along with the additional check effort of computing both correct Jacobi and actual Jacobi for P-1 progress to a given point.
If the performance hit when applied to P-1 stage 1 is not too bad, it would be a tremendous advance, since P-1 is currently by its nature quite thin on error checks compared to primality testing.

ewmayer 2020-02-15 19:27

[QUOTE=Prime95;537629]@Ernst: The GCN timings doc is here -- [url]https://github.com/CLRX/CLRX-mirror/wiki/GcnTimings[/url]

ROE error checking would be slower but it would be useful debugging option to sanity check FFT length selections.[/QUOTE]

Thanks, George - you'll need to let me know if I'm reading that correctly: I see a V_RNDNE_F64 instruction with latency DPFACTOR*4 = 8 cycles on Radeon. All other things being equal, that equals the 4+4 cycle latency needed for the DNINT(x) = (x + c) - c "hand-rolled round" alternative. In practice, other operations (e.g. computing DWT weights) can be interleaved with the round to help hide the latency.

If Mihai could add ROE checking to just the carry step used in the current p-1 stage, that would be great - even if a GEC-enhanced p-1 is coming down the pike, it's always useful to have multiple checks, to catch both FFT-length-related errors and "other" ones - on flaky hardware, in my case my aging Haswell quad, I've found sudden emission of fatal ROEs, nonreproducible on interval-retry, to be a reliable indicator of upcoming system-needs-rebootness.

Prime95 2020-02-16 07:28

[QUOTE=ewmayer;537652]All other things being equal, that equals the 4+4 cycle latency needed for the DNINT(x) = (x + c) - c "hand-rolled round" alternative.[/QUOTE]

Yes, and you cannot get OpenCL to generate a V_RNDNE_F64 instruction unless you resort to __asm syntax.

kriesel 2020-02-17 02:07

small detail
 
[CODE]2020-02-16 20:01:42 asrock/radeonvii 4444091 FFT 224K: Width 8x8, Height 64x4, Middle 7; 19.37 bits/word
2020-02-16 20:01:42 asrock/radeonvii OpenCL args "-DEXP=4444091u -DWIDTH=64u -DSMALL_HEIGHT=256u -DMIDDLE=7u -DWEIGHT_STEP=0x
c.571b3d76085f8p-3 -DIWEIGHT_STEP=0xa.5f5fa9671576p-4 -DWEIGHT_BIGSTEP=0xc.5672a115506d8p-3 -DIWEIGHT_BIGSTEP=0xa.5fed6a9b151
38p-4 -DAMDGPU=1 -DCHEBYSHEV_METHOD_FMA=1 -DCHEBYSHEV_MIDDLEMUL2=1 -DMERGED_MIDDLE=1 -DMORE_ACCURATE=1 -DNO_ASM=1 -DT2_SHUFFL
E_HEIGHT=1 -DT2_SHUFFLE_MIDDLE=1 -DWORKINGIN1A=1 -DWORKINGOUT1A=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-02-16 20:01:43 asrock/radeonvii OpenCL compilation error -11 (args -DEXP=4444091u -DWIDTH=64u -DSMALL_HEIGHT=256u -DMIDD
LE=7u -DWEIGHT_STEP=0xc.571b3d76085f8p-3 -DIWEIGHT_STEP=0xa.5f5fa9671576p-4 -DWEIGHT_BIGSTEP=0xc.5672a115506d8p-3 -DIWEIGHT_B
IGSTEP=0xa.5fed6a9b15138p-4 -DAMDGPU=1 -DCHEBYSHEV_METHOD_FMA=1 -DCHEBYSHEV_MIDDLEMUL2=1 -DMERGED_MIDDLE=1 -DMORE_ACCURATE=1
-DNO_ASM=1 -DT2_SHUFFLE_HEIGHT=1 -DT2_SHUFFLE_MIDDLE=1 -DWORKINGIN1A=1 -DWORKINGOUT1A=1 -I. -cl-fast-relaxed-math -cl-std=CL
2.0 -DNO_ASM=1)
2020-02-16 20:01:43 asrock/radeonvii C:\Users\User\AppData\Local\Temp\\OCL8000T3.cl:1942:2: error: WORKINGOUT1 not compatible
with this FFT size
#error [B]WORKINGOUT1[/B] not compatible with this FFT size
^
1 error generated.

error: Clang front-end compilation failed!
Frontend phase failed compilation.
Error: Compiling CL to IR[/CODE]Seems like that error message should refer to WORKINGOUT1[B]A[/B].


Is there a mapping for which fft lengths are supported by the various -use options?

kriesel 2020-02-17 15:07

[QUOTE=kriesel;537643]P4 [B]0/7611 MiB[/B]
Stay tuned for K80, haven't got one in a while.[/QUOTE]
[B]K80 0/11441 MiB[/B]

kriesel 2020-02-17 17:10

gpuowl v6.11-134 P-1 memory allocation error
 
Win7 x64, asrock motherboard, asrock Radeon VII shakedown cruise on known-factor P-1 test candidates; hit an error that stopped the show, found crashed several hours later.[CODE]C:\Users\User\Documents\gpuowl-v6.11-134\radeonvii>gpuowl-win
2020-02-17 01:28:32 gpuowl v6.11-134-g1e0ce1d
2020-02-17 01:28:32 config: -device 1 -user kriesel -cpu asrock/radeonvii -yield -maxAlloc 16000 -use NO_ASM
2020-02-17 01:28:32 config:
2020-02-17 01:28:32 config: :not compatible with 224K fft: ,WORKINGOUT1A
2020-02-17 01:28:32 config: :best for 4608K: ,MERGED_MIDDLE,WORKINGIN1A,WORKINGOUT1A,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_MIDDLE,CHEB
YSHEV_METHOD_FMA,CHEBYSHEV_MIDDLEMUL2,MORE_ACCURATE
2020-02-17 01:28:32 device 1, unique id ''
2020-02-17 01:28:32 asrock/radeonvii 150000713 FFT 8192K: Width 256x8, Height 256x8; 17.88 bits/word
2020-02-17 01:28:32 asrock/radeonvii using long carry kernels
2020-02-17 01:28:36 asrock/radeonvii OpenCL args "-DEXP=150000713u -DWIDTH=2048u -DSMALL_HEIGHT=2048u -DMIDDLE=1u -DWEIGHT_ST
EP=0x8.af5a78e9513b8p-3 -DIWEIGHT_STEP=0xe.bcf3fa7f78dc8p-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1
e3ea8bd8p-4 -DAMDGPU=1 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-02-17 01:28:48 asrock/radeonvii OpenCL compilation in 12.10 s
2020-02-17 01:28:49 asrock/radeonvii 150000713 P1 B1=30030, B2=2400000; 43305 bits; starting at 0
2020-02-17 01:29:07 asrock/radeonvii 150000713 P1 10000 23.09%; 1804 us/it; ETA 0d 00:01; 2b8e087d5548b0be
2020-02-17 01:29:25 asrock/radeonvii 150000713 P1 20000 46.18%; 1799 us/it; ETA 0d 00:01; 34138e15eb0339f5
2020-02-17 01:29:43 asrock/radeonvii 150000713 P1 30000 69.28%; 1799 us/it; ETA 0d 00:00; 24160ea83dc597ec
2020-02-17 01:30:01 asrock/radeonvii 150000713 P1 40000 92.37%; 1801 us/it; ETA 0d 00:00; 59711be3f706d044
2020-02-17 01:30:08 asrock/radeonvii saved
2020-02-17 01:30:08 asrock/radeonvii 150000713 P1 43305 100.00%; 2084 us/it; ETA 0d 00:00; 986ed60328d8a1dc
2020-02-17 01:30:08 asrock/radeonvii P-1 (B1=30030, B2=2400000, D=30030): primes 173054, expanded 217731, doubles 36995 (left
102423), singles 99064, total 136059 (79%)
2020-02-17 01:30:08 asrock/radeonvii 150000713 P2 using blocks [1 - 80] to cover 136059 primes
GNU MP: Cannot reallocate memory (old_size=8 new_size=18750104)

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
[/CODE]Then there's a Windows popup containing[CODE]Problem signature:
Problem Event Name: APPCRASH
Application Name: gpuowl-win.exe
Application Version: 0.0.0.0
Application Timestamp: 00000000
Fault Module Name: gpuowl-win.exe
Fault Module Version: 0.0.0.0
Fault Module Timestamp: 00000000
Exception Code: 40000015
Exception Offset: 000000000003eff1
OS Version: 6.1.7601.2.1.0.256.48
Locale ID: 1033
Additional Information 1: 8095
Additional Information 2: 8095155e3f9bfb3a0fc1b40b27c9d8c8
Additional Information 3: ab9e
Additional Information 4: ab9eae2761104200d519e0bef6c90ec9

Read our privacy statement online:
http://go.microsoft.com/fwlink/?linkid=104288&clcid=0x0409

If the online privacy statement is not available, please read our privacy statement offline:
C:\Windows\system32\en-US\erofflps.txt
[/CODE]When that's closed, one more line of console output appears from gpuowl:[CODE]2020-02-17 11:07:12 asrock/radeonvii 150000713 P2 using 231 buffers of 64.0 MB each[/CODE]
It's repeatable. Will try with the latest commit later.

kriesel 2020-02-17 17:51

Radeon VII 48M FFT tune
 
[CODE]Asrock Radeon VII on Win7 X64, Asrock 6-pcie motherboard on open air miner frame
gpuowl V6.11-134-g1e0ce1d
tune 852348659 PRP, 48M fft
stock settings, no OC or undervolt, etc.

NO_ASM 10309
NO_ASM 10302

UNROLL_ALL 10300
UNROLL_NONE 10096
UNROLL_WIDTH 10097
UNROLL_HEIGHT 10095 *
UNROLL_MIDDLEMUL1 10154
UNROLL_MIDDLEMUL2 10158

WORKINGIN 38066
WORKINGIN 38063
WORKINGIN1 10457
WORKINGIN1A 10038 *
WORKINGIN2 11576
WORKINGIN3 10361
WORKINGIN4 10784
WORKINGIN5 10292

WORKINGOUT 25480
WORKINGOUT0 11645
WORKINGOUT1 10202
WORKINGOUT1A 10116 *
WORKINGOUT2 16205
WORKINGOUT3 10207
WORKINGOUT4 10952
WORKINGOUT5 10750

mistakenly used workingout1 a while...
NO_ASM
NO_ASM 10291
...,UNROLL_WIDTH,UNROLL_HEIGHT 10098 *
...,UNROLL_WIDTH,UNROLL_MIDDLEMUL1 10159
...,UNROLL_HEIGHT,UNROLL_MIDDLEMUL1 10157
...,UNROLL_WIDTH,UNROLL_HEIGHT,UNROLL_MIDDLEMUL1 10157

NO_ASM,MERGED_MIDDLE,WORKINGIN1A,WORKINGOUT1 9961
...,T2_SHUFFLE_WIDTH 9906
...,T2_SHUFFLE_MIDDLE 9879
...,T2_SHUFFLE_HEIGHT 9645
...,T2_SHUFFLE_REVERSELINE 9965
...,T2_SHUFFLE 9485 *

NO_ASM 10311
NO_ASM 10296
...,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_MIDDLE 9594
...,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_WIDTH 9626
...,T2_SHUFFLE_WIDTH,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_MIDDLE 9878
...,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE 9518
...,T2_SHUFFLE_WIDTH,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_HEIGHT,SHUFFLE_REVERSELINE 9484

correct to workingout1a
NO_ASM 10296
NO_ASM 10299
...,CARRY32 9289 *
...,CARRY64 9443

NO_ASM 10301
NO_ASM 10294
...,FANCY_MIDDLEMUL1 error no middlemul1 for the 48M fft
...,MORE_SQUARES_MIDDLEMUL1 9274 *
...,CHEBYSHEV_METHOD EE
...,CHEBYSHEV_METHOD_FMA EE
...,ORIGINAL_METHOD 9290
...,ORIGINAL_TWEAKED 9288

NO_ASM 10310
NO_ASM 10304
...,ORIG_MIDDLEMUL2 9277 *
...,CHEBYSHEV_MIDDLEMUL2 EE

NO_ASM 10288
NO_ASM 10286
...,ORIG_SLOWTRIG 9438
...,NEW_SLOWTRIG 9278
...,MORE_ACCURATE 9278
...,LESS_ACCURATE 9230 *

NO_ASM,MERGED_MIDDLE,WORKINGIN1A,WORKINGOUT1A,UNROLL_HEIGHT,T2_SHUFFLE,CARRY32,MORE_SQUARES_MIDDLEMUL1,ORIG_MIDDLEMUL2,LESS_ACCURATE

gain from tuning
10305/9230 = ~1.1165
without -time, it's a bit faster, ~9161us/it[/CODE]This yields an estimated run time of 3 months (90.3 days to be precise). The upper limit of mersenne.org would take ~4.2 months.
The same gpu in the same conditions but with a different tune for 4.5M fft has produced a matching PRP DC in its first attempt.

kriesel 2020-02-17 18:24

gpuowl-win v6.11-147-g3b8b00e build
 
2 Attachment(s)
Just built, only -h run so far.

kriesel 2020-02-18 00:18

memory allocation error in v6.11-147
 
[CODE]C:\Users\User\Documents\gpuowl-v6.11-147\radeonvii>gpuowl-win
2020-02-17 18:14:27 gpuowl v6.11-147-g3b8b00e
2020-02-17 18:14:27 config: -device 1 -user kriesel -cpu asrock/radeonvii -yield -maxAlloc 155000 -use NO_ASM,MERGED_MIDDLE,UNRO
LL_HEIGHT,T2_SHUFFLE,CARRY32,MORE_SQUARES_MIDDLEMUL1,ORIG_MIDDLEMUL2,LESS_ACCURATE
2020-02-17 18:14:27 config:

2020-02-17 18:14:28 device 1, unique id ''
2020-02-17 18:14:28 asrock/radeonvii 24000577 FFT 1280K: Width 8x8, Height 256x4, Middle 10; 18.31 bits/word
2020-02-17 18:14:29 asrock/radeonvii OpenCL args "-DEXP=24000577u -DWIDTH=64u -DSMALL_HEIGHT=1024u -DMIDDLE=10u -DWEIGHT_STEP=0x
c.e5beac96a0b88p-3 -DIWEIGHT_STEP=0x9.eca8ba4660afp-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p
-4 -DPM1=1 -DAMDGPU=1 -DCARRY32=1 -DLESS_ACCURATE=1 -DMERGED_MIDDLE=1 -DMORE_SQUARES_MIDDLEMUL1=1 -DNO_ASM=1 -DORIG_MIDDLEMUL2=1
-DT2_SHUFFLE=1 -DUNROLL_HEIGHT=1 -cl-fast-relaxed-math -cl-std=CL2.0"
2020-02-17 18:14:41 asrock/radeonvii OpenCL compilation in 11.44 s
2020-02-17 18:14:41 asrock/radeonvii 24000577 P1 B1=300000, B2=9000000; 432351 bits; starting at 432350
2020-02-17 18:14:41 asrock/radeonvii 24000577 P1 432351 100.00%; 84768 us/it; ETA 0d 00:00; 55a8d888497469ec
2020-02-17 18:14:41 asrock/radeonvii P-1 (B1=300000, B2=9000000, D=30030): primes 576492, expanded 615799, doubles 105850 (left
373895), singles 364792, total 470642 (82%)
2020-02-17 18:14:41 asrock/radeonvii 24000577 P2 using blocks [10 - 300] to cover 470642 primes
GNU MP: Cannot allocate memory (size=164912)

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
2020-02-17 18:14:42 asrock/radeonvii 24000577 P2 using 1440 buffers of 10.0 MB each[/CODE]It was able to complete after -use was reduced to merely NO_ASM. But it's still missing the 15M test factor.[CODE]
{"exponent":"15000031", "worktype":"PM1", "status":"[B]NF[/B]", "program":{"name":"gpuowl", "version":"v6.11-147-g3b8b00e"}, "timestamp":"2020-02-17 23:16:22 UTC", "user":"kriesel", "computer":"asrock/radeonvii", "fft-length":786432, "B1":180000, "B2":3780000}[/CODE]But the 81M test exponent also fails with the cannot-allocate-memory error (size=2637840).

preda 2020-02-18 11:32

[QUOTE=kriesel;537797]2020-02-17 18:14:27 config: -device 1 -user kriesel -cpu asrock/radeonvii -yield -maxAlloc 155000 -use NO_ASM,MERGED_MIDDLE,UNRO
LL_HEIGHT,T2_SHUFFLE,CARRY32,MORE_SQUARES_MIDDLEMUL1,ORIG_MIDDLEMUL2,LESS_ACCURATE
[/QUOTE]

You realize you're using an unrealistically large maxAlloc; I don't know if this is what's causing the mem alloc error.

I fixed the auto FFT size for P-1.

kriesel 2020-02-18 14:50

[QUOTE=preda;537830]You realize you're using an unrealistically large maxAlloc; I don't know if this is what's causing the mem alloc error.

I fixed the auto FFT size for P-1.[/QUOTE]
I ran into trouble at 16000. 155 was supposed to be a reduction. And 1440 x 10 should fit within 15500 or 16000.

axn 2020-02-18 17:20

[QUOTE=kriesel;537841]I ran into trouble at 16000. 155 was supposed to be a reduction. And 1440 x 10 should fit within 15500 or 16000.[/QUOTE]

You did notice the extra 0, right?

kriesel 2020-02-18 23:54

[QUOTE=axn;537852]You did notice the extra 0, right?[/QUOTE]Yes. Fixed previously. The modern extra-sensitive-touchpad laptops are creating havoc with my interactive use, and in this case, after highlighting 60 for overwrite with 55, apparently only the 6 had been highlighted and I didn't notice. I learned touch typing around 1970, when most of the typewriters were manual and it was an honor to get to use an IBM Selectric powered typewriter with the flying spinning ball head. One of the things I remember being taught is thumbs over the space bar. Unfortunately that puts them hovering over the often too-sensitive touchpad on a laptop, giving all manner of unintended cursor moves. Normally on my old 17" display laptop I would double-tap the upper left corner and it would indicate touchpad off with a tiny LED there, but that laptop is out of action currently. The touchpad is now turned off by the Windows 10 control on this laptop, which I'm using to access the rest; the wireless mouse is SO much better behaved.

I have one laptop that has a touch screen also, and developed the "bubbles" problem where it senses its own display bezel as touches! It became increasingly sensitive, to the point where it made interactive use almost impossible. Disabling the touch screen device was what made it usable again. [URL]https://forums.lenovo.com/t5/Lenovo-Yoga-Series-Notebooks/Yoga-13-touch-screen-quot-bubble-quot-issue/td-p/1362239[/URL]

ewmayer 2020-02-19 20:36

Just started in on a batch of p's ~= 96.4M on my Radeon 7 ... that is close to the upper limit of what can be done @5120K using Prime95 and Mlucas, but I notice gpuOwl more conservatively defaults to 5632K. Without per-iteration ROE checking, the Gerbicz check should still catch residue corruption by excessive ROE in some output during the current G-check interval, so I'd like to test that out.

Is there a way to force it to try 5120K, and if so can this be done mid-run by ctrl-c and restarting with the needed FFT-length command-line flag?

EDIT: The readme is your friend... just killed current run, restarted with '-fft 5120K' ... that has proceeded for another million iterations, so far, so good. Has anyone reading this seen a case where an exponent close to the gpuOwl-set upper limit for a given FFT length hits an ROE-error-disguised-as-Gerbicz-check-error and causes the run to switch to the next-larger FFT length as a result?

ewmayer 2020-02-19 21:17

[QUOTE=ewmayer;537946]EDIT: The readme is your friend... just killed current run, restarted with '-fft 5120K' ... that has proceeded for another million iterations, so far, so good. Has anyone reading this seen a case where an exponent close to the gpuOwl-set upper limit for a given FFT length hits an ROE-error-disguised-as-Gerbicz-check-error and causes the run to switch to the next-larger FFT length as a result?[/QUOTE]

Once again in answer to my own question - predictably, literally seconds after posting my above edit, my run @5120K hit its first G-check error. The code retried 3 more times, then after the 4th attempt, quit with "3 sequential errors, will stop."

So this seems like a straightforward code fiddle - instead of just barfing, when a run hits a repeatable G-check error as mine did, if the exponent is close to or above the default limit for the FFT length in question, simply switch to the next-larger FFT length.

One related question regarding running near the exponent limit for a given FFT length - the OpenCL args echoed by the program on runstart do not say anything re. the carry-chain length used, but I see a user option "-carry long|short". Which choice gives better accuracy, and how can one tell what the default choice is for a given exponent and FFT length?

And another followup question regarding the "n errors" field output at each checkpoint - my force-5120K run started with "0 errors", then quickly cycled through 1,2,3,4 errors as it hit repeatable G-check errors due to a roundoff-corrupted residue. It then aborted. On restart sans the -fft flag it again defaulted to 5632K and is happily chugging along, but the errors field is now stuck at "2 errors". How did we go from 4 to 2? And shouldn't a repeatable G-check error count as 1 error?

preda 2020-02-20 09:47

[QUOTE=ewmayer;537949]Once again in answer to my own question - predictably, literally seconds after posting my above edit, my run @5120K hit its first G-check error. The code retried 3 more times, then after the 4th attempt, quit with "3 sequential errors, will stop."

So this seems like a straightforward code fiddle - instead of just barfing, when a run hits a repeatable G-check error as mine did, if the exponent is close to or above the default limit for the FFT length in question, simply switch to the next-larger FFT length.

One related question regarding running near the exponent limit for a given FFT length - the OpenCL args echoed by the program on runstart do not say anything re. the carry-chain length used, but I see a user option "-carry long|short". Which choice gives better accuracy, and how can one tell what the default choice is for a given exponent and FFT length?

And another followup question regarding the "n errors" field output at each checkpoint - my force-5120K run started with "0 errors", then quickly cycled through 1,2,3,4 errors as it hit repeatable G-check errors due to a roundoff-corrupted residue. It then aborted. On restart sans the -fft flag it again defaulted to 5632K and is happily chugging along, but the errors field is now stuck at "2 errors". How did we go from 4 to 2? And shouldn't a repeatable G-check error count as 1 error?[/QUOTE]

In general I try to keep things simple, not putting too much smarts in the automatic-dynamic FFT size. For example, in this case, just using the default would have been OK. If the user wants more control, it is possible to be explicit about the desired FFT size as you did. OTOH the behavior "let the user explicitly specify a FFT size, but dynamically increase it as needed" seems too complex (tricky) to me.

About carry size, -long provides better accuracy but it's so much slower that it's practically never used nowadays. Basically moving to the next-upper FFT size might well be faster than the -long carry. The default is always short carry.

About the number of errors (2 vs 4), this is a bit tricky: a savefile is only ever created with valid data that passed the check "right now". The number of errors is incremented in RAM, but can only be written as part of a valid savefile. What probably happened in your case is this: an error is hit (count becomes 1), the code backtracks, a check earlier than the error point passes OK and this saves (with count 1), it again hits the error point, backtracks, and eventually hits 3 consecutive errors and stops.
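
A toy model of that accounting (illustrative Python, not gpuowl code; the class and method names are mine):

```python
# The error count lives in RAM and is bumped immediately on any G-check
# failure, but it is only persisted inside a savefile that itself passed
# the check. Replaying the sequence described above shows how RAM can see
# 4 errors while the last valid savefile holds only 2.

class ErrorAccounting:
    def __init__(self):
        self.ram_errors = 0      # incremented on every error, RAM only
        self.saved_errors = 0    # written only with a valid savefile

    def hit_error(self):
        self.ram_errors += 1

    def good_checkpoint(self):
        # a check passed "right now": persist the current count
        self.saved_errors = self.ram_errors

s = ErrorAccounting()
s.hit_error()        # 1st G-check failure, backtrack
s.good_checkpoint()  # earlier check passes, savefile written (count 1)
s.hit_error()        # fails again at the error point
s.good_checkpoint()  # another earlier check passes (count 2)
s.hit_error()        # three consecutive failures follow...
s.hit_error()        # ...run stops; counts 3 and 4 are never persisted
# On restart the loaded savefile reports "2 errors" though RAM saw 4.
```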

Anyway improvements clearly can be made; but I'd like to identify the changes that have a clear behavior, a clear benefit, and not excessive cost before proceeding.

ewmayer 2020-02-20 19:34

[QUOTE=preda;537979]In general I try to keep things simple, not putting too much smarts in the automatic-dynamic FFT size. For example, in this case, just using the default would have been OK. If the user wants more control, it is possible to be explicit about the desired FFT size as you did. OTOH the behavior "let the user explicitly specify a FFT size, but dynamically increase it as needed" seems too complex (tricky) to me.[/quote]
My own code leverages the restart-from-interrupt logic to do this - hit an ROE, first retry the current iteration interval at the same FFT length to see if it is reproducible, then if a larger FFT length is indicated based on that, restart from the last good savefile at the larger FFT length.
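
A control-flow-only sketch of that retry policy (hedged: run_interval and next_fft are hypothetical stand-ins for the real machinery, not Mlucas or gpuowl identifiers):

```python
# On a roundoff error, rerun the same interval at the same FFT length;
# only if the error reproduces, restart from the last good savefile at
# the next-larger FFT length and carry on from there.

def run_with_retry(run_interval, next_fft, fft):
    while True:
        if run_interval(fft):          # interval completed cleanly
            return fft
        if not run_interval(fft):      # error reproduced at same length
            fft = next_fft(fft)        # bump FFT length, keep savefile

# Demo: an exponent that always misbehaves at 5120K but is fine at 5632K.
calls = []
def flaky_interval(fft):
    calls.append(fft)
    return fft >= 5632

final_fft = run_with_retry(flaky_interval, lambda f: f + 512, 5120)
```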

[quote]About carry size, -long provides better accuracy but it's so much slower that it's practically never used nowadays. Basically moving to the next-upper FFT size might well be faster than the -long carry. The default is always short carry.[/quote]
Thanks - your terminology had me confused, because it sounds very similar to a carry-related accuracy-vs-speed option I implement in my code, but apparently refers to a very different thing. In my current code, rather than computing all of the DWT weights from scratch or via my older 2-small-table-multiply scheme, I start with a high-accuracy DWT weight computed that way, but for the next few outputs use a simple recurrence to generate the needed weights: just "multiply up" each successive weight by the constant 2^(#smallwords/N), and if the result >= 2, multiply by 0.5. But accuracy degrades here with increasing recurrence-chain length, so the code allows 3 different chain lengths, long|medium|short. At runstart it uses some simple "how close is the exponent to the upper limit for the given FFT length?" logic to set the initial chain length; if the run hits a dangerous ROE it will first try switching to the next-shorter chain length, and only if it hits an ROE while already using the short length will it switch to the next-larger FFT, and revert the chain length to long. The performance hit from the shorter chain lengths is small enough - around 2% - that the next-larger FFT should always be a last resort.
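
The recurrence itself fits in a few lines (a toy model of the scheme just described; weight_chain and its parameters are illustrative, not actual Mlucas identifiers):

```python
# Generate a short chain of DWT weights from one accurately computed
# starting weight: multiply by the constant 2^(smallwords/N) each step,
# folding back below 2. Accuracy drifts as the chain grows, which is
# why the real code offers long/medium/short chain lengths.

def weight_chain(w0, smallwords, N, chain_len):
    step = 2.0 ** (smallwords / N)
    weights = [w0]
    w = w0
    for _ in range(chain_len - 1):
        w *= step
        if w >= 2.0:      # keep every weight in [1, 2)
            w *= 0.5
        weights.append(w)
    return weights

ws = weight_chain(1.0, 3, 8, 4)   # e.g. 3 "small" words per 8-word group
```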

You're right regarding the slowness of -carry long for your code - current expo running at 5632K at 755 us/iter. Halting and restarting with -fft 5120K as I did yesterday cuts timing to 708 us, but is more or less guaranteed to abort with G-check error resulting from incorrectly-rounded output due to excessive ROE. Using '-fft 5120K -carry long' might be safe in terms of ROE, but blows up the per-squaring time to 960 us, so I'm back to the default 5632K here.

[quote]About the number of errors (2 vs 4), this is a bit tricky: a savefile is only ever created with valid data that passed the check "right now". The number of errors is incremented in RAM, but can only be written as part of a valid savefile. What probably happened in your case is this: an error is hit (count becomes 1), the code backtracks, a check earlier than the error point passes OK and this saves (with count 1), it again hits the error point, backtracks, and eventually hits 3 consecutive errors and stops.[/quote]
Do you expect my run results to be OK, or should I queue it up for early-DC just to be on the safe side?

[quote]Anyway improvements clearly can be made; but I'd like to identify the changes that have a clear behavior, a clear benefit, and not excessive cost before proceeding.[/QUOTE]
Of course - being on both sides of the coder/user divide, I know that my job as a user is to say "gimme, gimme", and yours as a coder is to choose your battles very carefully.

kriesel 2020-02-20 22:27

[QUOTE=ewmayer;538031]You're right regarding the slowness of -carry long for your code - current expo running at 5632K at 755 us/iter. Halting and restarting with -fft 5120K as I did yesterday cuts timing to 708 us, but is more or less guaranteed to abort with G-check error resulting from incorrectly-rounded output due to excessive ROE. Using '-fft 5120K -carry long' might be safe in terms of ROE, but blows up the per-squaring time to 960 us, so I'm back to the default 5632K here.[/QUOTE]You're right to test carry length. It apparently behaves as advertised on Vega and Radeon VII, but with some older gpuowl versions and older gpu models (RX550, RX480, or both) -carry long was faster; I think for 4M fft length.

kriesel 2020-02-21 03:19

gpuowl-win v6.11-148-gfc93773 build
 
1 Attachment(s)
Here it is, fresh from -h and no more testing than that. This commit should have the P-1 fft size fix mentioned in [url]https://mersenneforum.org/showpost.php?p=537830&postcount=1868[/url]

preda 2020-02-28 12:15

ROCm 3.1
 
I tried ROCm 3.1, and while OpenCL superficially seems to work, the execution is broken. I'm moving back to 2.10. I could instead attempt to debug what exactly is broken (where the bug is), but that's time-intensive and I'm not motivated, as I don't see a significant upside to 3.1. I also have a feeling that chances are >50% that the problem is not in gpuowl, in which case just waiting for ROCm 3.x to mature may fix it.

ewmayer 2020-02-28 20:16

@Mihai: Ken noted he is testing the gpuowl-win v6.11-148-gfc93773 build - I am using v6.11-142-gf54af2e, are there any significant speedups (e.g. from George's sincos-computation work) in the newer build?

preda 2020-02-28 21:16

[QUOTE=ewmayer;538546]@Mihai: Ken noted he is testing the gpuowl-win v6.11-148-gfc93773 build - I am using v6.11-142-gf54af2e, are there any significant speedups (e.g. from George's sincos-computation work) in the newer build?[/QUOTE]

Not much changed, I expect pretty much the same performance between those two versions.

preda 2020-03-01 12:18

ROC 3.1
 
[QUOTE=preda;538501]I tried ROCm 3.1, and while OpenCL superficially seems to work, the execution is broken. I'm moving back to 2.10 . I could instead attempt to debug what exactly is broken (where is the bug), but that's time intensive and I'm not motivated, as I don't see a significant upside to 3.1. I also have a feeling that chances are >50% that the problem is not in gpuowl, in which case just waiting for ROCm 3.x to mature may fix it.[/QUOTE]

Hi, the most recent gpuowl commit may be running correctly on ROCm 3.1 (and in my case I also see about 1% speed-up)
[url]https://github.com/preda/gpuowl/commit/76751fd19dda3c839062225361bfeaa6a496a8df[/url]

It seems there is a codegen bug in 3.1, which I managed to work around somehow:
[url]https://github.com/RadeonOpenCompute/ROCm/issues/1032[/url]

Prime95 2020-03-01 20:41

Congrats on getting gpuowl working under rocm 3.1.
Warning to rocm 2.10 users: 3.1 is slower here. The occupancy of carryFused went from 7 to 6.

Can you post the register usage for carryFused, tailFused, and two fftMiddle routines in rocm 3.1?

P.S. With your last sin/cos change, you can delete the comments on MORE_ACCURATE and LESS_ACCURATE. You should also be able to change P-1 back to using the newer trig code.

Prime95 2020-03-02 04:46

The MIDDLE=1 FFTs appear to be broken.

Question: Would gpuowl benefit from a MIDDLE=8 step? The whole reason for MIDDLE=1 was to do 4 passes over memory instead of 6. Now that we support MERGED_MIDDLE, both MIDDLE=1 and MIDDLE=8 would do 4 passes over memory.

If you think it would help, I'll write a middle=8 routine for you.

preda 2020-03-02 10:11

[QUOTE=Prime95;538656]The occupancy of carryFused went from 7 to 6.

Can you post the register usage for carryFused, tailFused, and two fftMiddle routines in rocm 3.1?
[/QUOTE]

This is the difference in occupancy/VGPRs between 2.10 (on left) and 3.1 (on right)

[CODE]
fftHout : Occupancy: = 7 | fftHout : Occupancy: = 6
fftMiddleOut : Occupancy: = 3 | fftMiddleOut : Occupancy: = 4
carryA : Occupancy: = 10 | carryA : Occupancy: = 9
carryFused : Occupancy: = 7 | carryFused : Occupancy: = 6
square : Occupancy: = 9 | square : Occupancy: = 10

---------------------

isEqual : NumVgprs: = 7 | isEqual : NumVgprs: = 6
isNotZero : NumVgprs: = 5 | isNotZero : NumVgprs: = 4
fftW : NumVgprs: = 32 | fftW : NumVgprs: = 31
fftHin : NumVgprs: = 35 | fftHin : NumVgprs: = 36
fftHout : NumVgprs: = 36 | fftHout : NumVgprs: = 39
k_fftP : NumVgprs: = 33 | k_fftP : NumVgprs: = 34
fftMiddleIn : NumVgprs: = 63 | fftMiddleIn : NumVgprs: = 62
fftMiddleOut : NumVgprs: = 65 | fftMiddleOut : NumVgprs: = 64
carryA : NumVgprs: = 15 | carryA : NumVgprs: = 26
carryM : NumVgprs: = 15 | carryM : NumVgprs: = 18
carryB : NumVgprs: = 15 | carryB : NumVgprs: = 14
carryFused : NumVgprs: = 36 | carryFused : NumVgprs: = 38
carryFusedMul : NumVgprs: = 39 | carryFusedMul : NumVgprs: = 37
transposeW : NumVgprs: = 105 | transposeW : NumVgprs: = 102
transposeH : NumVgprs: = 103 | transposeH : NumVgprs: = 101
square : NumVgprs: = 27 | square : NumVgprs: = 24
tailFusedMul : NumVgprs: = 120 | tailFusedMul : NumVgprs: = 118
tailFusedMulLow : NumVgprs: = 122 | tailFusedMulLow : NumVgprs: = 120
tailFusedMulDelta: NumVgprs: = 122 | tailFusedMulDelta: NumVgprs: = 120
[/CODE]

As you say, the occupancy of carryFused went one notch down (it is exactly 1 VGPR over; it may be possible to win that one back). OTOH I see a speedup of about 1.2% on ROCm 3.1 compared to 2.10.

preda 2020-03-02 10:26

[QUOTE=Prime95;538685]The MIDDLE=1 FFTs appear to be broken.
[/QUOTE]
Do you know why it's broken? (or which change broke it?)

[QUOTE]
Question: Would gpuowl benefit from a MIDDLE=8 step? The whole reason for MIDDLE=1 was to do 4 passes over memory instead of 6. Now that we support MERGED_MIDDLE, both MIDDLE=1 and MIDDLE=8 would do 4 passes over memory.

If you think it would help, I'll write a middle=8 routine for you.[/QUOTE]

Do I understand correctly that only power-of-two FFTs would benefit from a middle=8? The wavefront is not on a power of two ATM, but will get there. Would a middle=4 make sense? As for how well it would work (compared to middle=1), I guess we have to try it to know.

Prime95 2020-03-02 13:39

[QUOTE=preda;538694]Do you know why it's broken? (or which change broke it?)

Do I understand correctly that only powers-of-two FFTs would benefit from a middle=8? The wavefront is not on a power of two ATM, but will get there. Would a middle=4 make sense? About how well it would work (compared to middle=1), I guess we have to try it to know.[/QUOTE]

I did some digging: MIDDLE=1 requires ORIG_SLOWTRIG. Apparently transpose uses k,n values outside the range 0 to pi/2.

I timed a 3.5M, 4M, 4.5M FFT all with ORIG_SLOWTRIG. Times were 509us, 803us, 651us respectively. So we definitely need to do something!

I agree we need both a MIDDLE=4 and MIDDLE=8. This would let us eliminate the fft_HEIGHT and fft_WIDTH routines that are built on fft8. It has been my experience that fft8 is substantially slower than fft4, probably due to the extra VGPRs required.

We also need MIDDLE=13,14,15 so we have a complete set of MIDDLE values from 4 to 15.

Prime95 2020-03-02 13:43

[QUOTE=preda;538692]As you say, the occupancy of carryFused went one notch down. (it is exactly 1 VGPR over, it may be possible to win that one back). [/QUOTE]

I'll upgrade a machine to 3.1 and work on this. I remember it took a full day fighting the optimizer to save that one VGPR register.

kriesel 2020-03-02 17:03

config.txt
 
On a given gpu model in gpuowl, what is optimal for one fft length is not optimal for another, and sometimes breaks correct function at another fft length. That is caught by the PRP GEC, stopping progress until the user intervenes, and sometimes produces all-zeroes meaningless computation in P-1 stage 1.
Perhaps the config.txt syntax could be extended to support a regardless-of-fft-length line, optional fft-length-specific -use lines, and optionally a default safe-but-slower line for fft lengths that have not been benchmarked and tuned yet? Maybe something like
[CODE]all: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500
4608K: -use NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT1,T2_SHUFFLE_WIDTH,T2_SHUFFLE_HEIGHT,UNROLL_MIDDLEMUL2,UNROLL_MIDDLEMUL1,CARRY32,CHEBYSHEV_METHOD_FMA,CHEBYSHEV_MIDDLEMUL2,LESS_ACCURATE
5120K: -use NO_ASM,UNROLL_HEIGHT,UNROLL_WIDTH,MERGED_MIDDLE,WORKINGIN1,WORKINGOUT1,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_WIDTH,CARRY32,MORE_SQUARES_MIDDLEMUL1,CHEBYSHEV_MIDDLEMUL2,NEW_SLOWTRIG
default: -use NO_ASM
[/CODE]worktodo line to reproduce (there are others): [CODE]B1=1020000,B2=29580000;PFactor=0,1,2,96580489,-1,77,2[/CODE]The following -use options are the result of optimization runs for 5120K but break 5632K P-1: [CODE]-device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500 -use NO_ASM,UNROLL_HEIGHT,UNROLL_WIDTH,MERGED_MIDDLE,WORKINGIN1,WORKINGOUT1,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_WIDTH,CARRY32,MORE_SQUARES_MIDDLEMUL1,CHEBYSHEV_MIDDLEMUL2,NEW_SLOWTRIG[/CODE]This results in an all-zero repeating res64 at the start of stage 1:[CODE]2020-03-02 10:22:05 device 0, unique id ''
2020-03-02 10:22:05 condorella/rx480 96580489 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 16.75 bits/word
2020-03-02 10:22:07 condorella/rx480 OpenCL args "-DEXP=96580489u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0x9.893b9e4410c28p-3 -DIWEIGHT_STEP=0xd.6c37c4b92b54p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DCARRY32=1 -DCHEBYSHEV_MIDDLEMUL2=1 -DMERGED_MIDDLE=1 -DMORE_SQUARES_MIDDLEMUL1=1 -DNEW_SLOWTRIG=1 -DNO_ASM=1 -DT2_SHUFFLE_HEIGHT=1 -DT2_SHUFFLE_WIDTH=1 -DUNROLL_HEIGHT=1 -DUNROLL_WIDTH=1 -DWORKINGIN1=1 -DWORKINGOUT1=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-03-02 10:22:10 condorella/rx480 OpenCL compilation in 3.03 s
2020-03-02 10:22:10 condorella/rx480 96580489 P1 B1=1020000, B2=29580000; 1471504 bits; starting at 0
2020-03-02 10:22:48 condorella/rx480 96580489 P1 10000 0.68%; 3789 us/it; ETA 0d 01:32; 0000000000000000
2020-03-02 10:23:26 condorella/rx480 96580489 P1 20000 1.36%; 3797 us/it; ETA 0d 01:32; 0000000000000000
2020-03-02 10:23:41 condorella/rx480 Stopping, please wait..
2020-03-02 10:23:41 condorella/rx480 Exiting because "stop requested"
2020-03-02 10:23:41 condorella/rx480 Bye[/CODE]This is repeatable for other exponents.
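
A toy parser showing how the proposed syntax could be resolved at startup (the all:/fft-length:/default: keys are just the proposal above, not an existing gpuowl feature, and select_args is a hypothetical helper):

```python
# Resolve a command line from the proposed per-FFT-length config.txt:
# the "all:" line always applies, then either the matching fft-length
# line or the "default:" fallback is appended.

def select_args(config_text, fft):
    lines = {}
    for line in config_text.strip().splitlines():
        key, _, args = line.partition(":")
        lines[key.strip()] = args.strip()
    common = lines.get("all", "")
    specific = lines.get(fft, lines.get("default", ""))
    return (common + " " + specific).strip()

cfg = """\
all: -device 0 -maxAlloc 7500
5120K: -use NO_ASM,CARRY32
default: -use NO_ASM
"""
args_5120 = select_args(cfg, "5120K")   # tuned line for 5120K
args_other = select_args(cfg, "6144K")  # falls back to the safe default
```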

Prime95 2020-03-02 21:27

[QUOTE=Prime95;538727]I'll upgrade a machine to 3.1 and work on this.[/QUOTE]

Giant mistake.

Upgrading did not work. Had to reinstall the OS. Rocm 3.1 doesn't work even on a fresh install (clinfo cannot find any of the GPUs). The entire process has bricked one of the GPUs. Not happy.

The dmesg error on the bricked GPU is "Direct firmware load for amdgpu/vega20_ta.bin failed with error -2".

Six hours wasted, more struggles ahead.


Correction: I get the dmesg error on the two working GPUs. No error message for the bricked GPU.

PhilF 2020-03-02 21:51

Holy moly.

This is a perfect example why it's called "the bleeding edge".

This is especially true anytime AMD drivers are involved. I knew their Windows drivers could be radioactive to the point of having to reinstall the OS, but I had no idea Linux could be lethally irradiated also.

preda 2020-03-03 12:06

[QUOTE=Prime95;538726]I did some digging: MIDDLE=1 requires ORIG_SLOWTRIG. Apparently transpose uses k,n values outside the range 0 to pi/2.

I timed a 3.5M, 4M, 4.5M FFT all with ORIG_SLOWTRIG. Times were 509us, 803us, 651us respectively. So we definitely need to do something!

I agree we need both a MIDDLE=4 and MIDDLE=8. This would let us eliminate the fft_HEIGHT and fft_WIDTH routines that are built on fft8. It has been my experience that fft8 is substantially slower than fft4, probably due to the extra VGPRs required.

We also need MIDDLE=13,14,15 so we have a complete set of MIDDLE values from 4 to 15.[/QUOTE]

Thank you! I just merged the changes. Indeed this is a massive speedup on power-of-2 FFTs, probably bringing them in line with the other sizes. (And thanks for the FFT-size display fix.)

I added a few asserts() to .cl (enabled with -use DEBUG) that allow checking the angle range.

mrh 2020-03-03 17:37

I've been wanting to do some P-1 runs where the B1/B2 (mostly B2) values are too big for u32. I was thinking of turning them all into u64's, unless that's a bad idea or I'm missing something. What do you think?

kriesel 2020-03-03 19:40

[QUOTE=Prime95;538760]Giant mistake.

... The entire process has bricked one of the GPUs. Not happy.
[/QUOTE]What? It damaged a gpu?

Prime95 2020-03-03 20:58

[QUOTE=kriesel;538826]What? It damaged a gpu?[/QUOTE]

Yes. The GPU is no longer recognized at boot. Tomorrow, I'll try moving the card to a different machine. If that machine does not recognize the card, I will have to RMA it.

ewmayer 2020-03-03 21:23

[QUOTE=Prime95;538829]Yes. The GPU is no longer recognized at boot. Tomorrow, I'll try moving the card to a different machine. If that machine does not recognize the card, I will have to RMA it.[/QUOTE]

I'm wondering how that could even happen - you think some kind of GPU firmware corruption may have taken place? (E.g. onboard EPROM gets borked). If so, is a user reflash a possibility?

Or do you think actual hardware damage, excess voltage frying transistors or whatnot, may have occurred?

Prime95 2020-03-03 21:36

[QUOTE=ewmayer;538832]I'm wondering how that could even happen - you think some kind of GPU firmware corruption may have taken place? (E.g. onboard EPROM gets borked). If so, is a user reflash a possibility?

Or do you think actual hardware damage, excess voltage frying transistors or whatnot, may have occurred?[/QUOTE]

I agree -- quite mysterious. I'd guess it was either a coincidence (there were several reboots) or some kind of firmware corruption. Since the card isn't recognized by linux, I do not know how one would re-initialize the firmware.

ewmayer 2020-03-03 21:44

[QUOTE=Prime95;538834]I agree -- quite mysterious. I'd guess it was either a coincidence (there were several reboots) or some kind of firmware corruption. Since the card isn't recognized by linux, I do not know how one would re-initialize the firmware.[/QUOTE]

This might be a fruitful thing to look further into - I would not be surprised if such firmware-restoration were a manufacturer-only thing, but professional refurbisher/resellers of this kind of tech gear probably know more about how one might go about it. Do we have any folks like that on the forum? I recall PhilF [url=https://mersenneforum.org/showthread.php?t=24979&page=2]noted here[/url] that he had snagged a used Radeon VII for $400 on eBay and was not leery about buying used because "I'm very good at refurbishing used computer equipment" - just PMed him to weigh in here.

PhilF 2020-03-03 22:54

[QUOTE=ewmayer;538836]This might be a fruitful thing to look further into - I would not be surprised if such firmware-restoration were a manufacturer-only thing, but professional refurbisher/resellers of this kind of tech gear probably know more about how one might go about it. Do we have any folks like that on the forum? I recall PhilF [url=https://mersenneforum.org/showthread.php?t=24979&page=2]noted here[/url] that he had snagged a used Radeon VII for $400 on eBay and was not leery about buying used because "I'm very good at refurbishing used computer equipment" - just PMed him to weigh in here.[/QUOTE]

Here's my reply to Ernst:

Hello Ernst.

What I am better at is fixing physical damage or on-board power supply issues. However, I did find this:

[url]https://www.asus.com/Graphics-Cards/RADEONVII-16G/HelpDesk_Download/[/url]

This downloadable, stand-alone flash tool may be ASUS specific, I don't know. He may need to seek a similar file for his brand of card. But, the bottom line is that with the monitor hooked up to a working card in the same machine as the bricked card, a stand-alone BIOS flashing tool like this might be able to flash it.

However, this applies only if George is living right, and while flashing it he holds his mouth just right on a Tuesday. :)

Worst case, if it is out of warranty, bricked Radeon VII's go for surprisingly high dollars on eBay.
-Phil

PhilF 2020-03-04 00:23

Sorry, the download link didn't come through in the last message. Here it is, just scroll down a bit and click on Show All to find the BIOS flash tool.

[url]https://www.asus.com/Graphics-Cards/RADEONVII-16G/HelpDesk_Download/[/url]

kriesel 2020-03-04 01:15

Asrock:
[url]https://www.asrock.com/Graphics-Card/AMD/Phantom%20Gaming%20X%20Radeon%20VII%2016G/index.asp#BIOS[/url]
[url]https://www.asrock.com/support/BIOSVGA.asp?cat=VII16G[/url]

LaurV 2020-03-04 10:36

Re-flashing may void your warranty.

Please check carefully.

Only re-flash if you know what you are doing and you know this will solve the issue. Otherwise, you may send the card back and get the answer "it doesn't work because you damaged it when re-flashing it, therefore you are screwed and need to pay for a new one"...
YMMV

preda 2020-03-04 10:42

[QUOTE=mrh;538817]I've been wanting to do some P-1 runs where the B1/B2 (mostly B2) values are too big for u32. I was thinking of turning them all into u64's, unless this is bad idea or I'm missing something. What do you think?[/QUOTE]

For what exponent range? You should use the P-1 probability calculator
[url]https://www.mersenne.ca/prob.php[/url]
to get an idea of what to expect.

For an intuition, there is a range of the ratio between B1 and B2 that makes sense. I'd say this range is 10 to 100. If B2 is much, much larger than B1, you're wasting power.

Based on that, for a B2=4'000'000'000 (i.e. around 2^32), you'd want a B1 of at least 40'000'000 (but B1=100'000'000 would probably make more sense). Now you see that you could probably do a full PRP for less than that.

So unless you have a very special reason, doing P-1 with B2 > 2^32 seems like a bad idea to me. Maybe you could explain more about what you're trying to achieve.
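
That rule of thumb in miniature (a purely illustrative helper, not part of gpuowl or any calculator):

```python
# Sanity-check candidate P-1 bounds against the ratio heuristic above:
# B2/B1 should land roughly in the 10..100 range.

def ratio_ok(B1, B2, lo=10, hi=100):
    return lo <= B2 / B1 <= hi

ratio_ok(1_000_000, 4_000_000_000)     # ratio 4000: B2 far too large for this B1
ratio_ok(100_000_000, 4_000_000_000)   # ratio 40: in the sensible range
ratio_ok(40_000_000, 4_000_000_000)    # ratio 100: right at the boundary
```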

kriesel 2020-03-04 13:27

untested windows build of gpuowl 6.11-163
 
2 Attachment(s)
[QUOTE=preda;538800]Thank you! I just merged the changes. Indeed this is a massive speedup on powers-of-2 FFTs probably bringing them in line with the other sizes. (and thanks for the FFT-size display fix)

I added a few asserts() to .cl (enabled with -use DEBUG) that allow checking the angle range.[/QUOTE]
Here it is, without any testing beyond -h.

Fan Ming 2020-03-04 15:23

Any plan to add LL back? It would be quicker than CUDALucas and thus really helpful for LL double checks.

kriesel 2020-03-04 18:05

[QUOTE=Fan Ming;538866]Any plan to add LL back? It would be quicker than CUDALucas and thus really helpful for LL double checks.[/QUOTE]Preda mused about doing so a while back. One can run gpuowl v0.5 or v0.6 for 4M fft length LL DC on AMD in the meantime. I think wringing more performance out of the ffts is a good use of development time. Those gains should be applicable to LL also if/when it is reimplemented in gpuOwl.
[URL]https://www.mersenneforum.org/showpost.php?p=489083&postcount=7[/URL]
Prime95/mprime are also quite good at LL DC and have the Jacobi check, which CUDALucas lacks.

kriesel 2020-03-04 23:08

[QUOTE=Prime95;538829]Yes. The GPU is no longer recognized at boot. Tomorrow, I'll try moving the card to a different machine. If that machine does not recognize the card, I will have to RMA it.[/QUOTE]What was the power rating of the supply driving this card that's now bad?
I had a GTX480 fail, a Quadro 2000 fail, two PCIe extender pads become questionable or dead, and an Asrock Radeon VII that is now causing systems to fail to start, bringing up Windows Startup Repair instead, all first seen on the same mining frame powered by a Rosewill Tokamak 1200W (platinum). RMA initiated on the Asrock, which lasted less than 3 weeks.

Prime95 2020-03-05 00:20

The rig has a 1000W power supply. Today I took the card over to another machine that has a speaker. I put it in as the only GPU. Powered on and got one long, three short beeps -- bad GPU.

My guess is that it died in one of the power off / power on cycles in the upgrading / system reinstall process.

PhilF 2020-03-05 00:33

[QUOTE=Prime95;538760]The dmesg error on the bricked GPU is "Direct firmware load for amdgpu/vega20_ta.bin failed with error -2".[/QUOTE]

I think this message is very telling. An interruption during any firmware update can be (and usually is) fatal.

But that isn't your fault. I can't imagine that a failed firmware update that was silently forced upon you during an AMD software upgrade would void any warranty.

Prime95 2020-03-05 00:39

[QUOTE=PhilF;538910]I think this message is very telling. An interruption during any firmware update can be (and usually is) fatal.

But that isn't your fault. I can't imagine that a failed firmware update that was silently forced upon you during an AMD software upgrade would void any warranty.[/QUOTE]

I posted an update somewhere that the firmware error relates to the two working GPUs.

preda 2020-03-06 11:11

[QUOTE=Prime95;538908]The rig has a 1000W power supply. Today I took the card over to another machine that has a speaker. I put it in as the only GPU. Powered on and and got one long, three short beeps -- bad GPU.

My guess is that it died in one of the power off / power on cycles in the upgrading / system reinstall process.[/QUOTE]

I had one R7 die similarly. I didn't exactly understand why, but it did happen when I was rebooting repeatedly while moving GPUs around. Luckily, it was under warranty and was not one of the GPUs that I "enhanced" by removing the logo, changing the thermal pads etc, so I was able to RMA it.

Fan Ming 2020-03-06 13:17

[QUOTE=kriesel;538885]Preda mused about doing so a while back. One can run gpuowl v0.5 or v0.6 for 4M fft length LL DC on AMD in the meantime. I think wringing more performance out of the ffts is a good use of development time. Those gains should be applicable to LL also if/when it is reimplemented in gpuOwl.
[URL]https://www.mersenneforum.org/showpost.php?p=489083&postcount=7[/URL]
Prime95/mprime are also quite good at LL DC and have the Jacobi check, which CUDALucas lacks.[/QUOTE]

I know it's not hard to modify the code of gpuowl to do LL tests. I've already made an LL version of a relatively new version of gpuowl (not the newest; it lacks the Jacobi check, but when running on Google Colab the results are reliable, so that's not a significant problem), and reproduced the LL residue of M1000003 successfully. I'm running a real DC at ~50M now. However, the program works well only when merged middle is [B]not[/B] used. Once the merged-middle option is used, it gives wrong results. IDK what happened, since I haven't learned the details of merged middle, and I also want an official (new) gpuowl with an LL test.

Fan Ming 2020-03-06 14:10

[QUOTE=Fan Ming;539006]I know it's not hard to modify the code of gpuowl to do LL tests. I've already modified an LL version of a relatively new version of gpuowl (not the newest, and without the Jacobi check; however, when running on Google Colab the results are reliable, so that's not a significant problem), and reproduced the LL residue of M1000003 successfully. I'm running a real DC ~50M now. However, the program works well only when merged middle was [B]not[/B] used. Once the merged middle option was used, it gave wrong results. IDK what happened since I haven't learned the details of merged middle, and I also want an official (new) gpuowl with an LL test.[/QUOTE]

Reason found. When Middle = 1 the program will not use middle (useMiddle == false and useMergedMiddle == false); however, the macro MERGED_MIDDLE is still defined in the OpenCL program. I don't know if the newest commit fixed this problem.
Now the right result is produced when using merged middle. Significantly faster than CUDALucas.
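The pitfall described above (a runtime flag disabled while its preprocessor macro stays defined in the compiled kernel source) can be sketched in a few lines. This is an illustrative reconstruction, not gpuowl's actual code; the function and flag names here are made up:

```python
# Illustrative sketch: the host must keep the OpenCL preprocessor defines
# consistent with its runtime flags. Defining MERGED_MIDDLE while
# useMergedMiddle is false is exactly the mismatch that caused wrong results.
def build_defines(use_middle: bool, use_merged_middle: bool) -> str:
    defines = ["-DEXP=50840131u"]            # illustrative exponent
    if use_middle:
        defines.append("-DMIDDLE=1")
    if use_merged_middle:
        defines.append("-DMERGED_MIDDLE=1")  # only when actually in use
    return " ".join(defines)

# With Middle == 1 neither path is in effect, so neither macro is defined:
print(build_defines(False, False))   # -DEXP=50840131u
```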

paulunderwood 2020-03-06 17:11

kworker hell fix
 
After running gpuOwl for a week or so, Linux's "kworker" creeps up and uses nearly a full core, which detracts from LLR crunching on the CPU. A simple fix is to stop gpuOwl and immediately resume it. :cool:

kriesel 2020-03-06 19:02

[QUOTE=Fan Ming;539006]I know it's not hard to modify the code of gpuowl to do LL tests. I've already modified an LL version of a relatively new version of gpuowl (not the newest, without the Jacobi check[/QUOTE]Mihai showed a way to add the Jacobi check back at V0.6 Addition of Jacobi check to LL flavor of gpuOwL [URL="http://www.mersenneforum.org/showpost.php?p=465145&postcount=46"]http://www.mersenneforum.org/showpos...5&postcount=46[/URL] 2017-08-08

ewmayer 2020-03-06 21:14

[QUOTE=Fan Ming;539006]I know it's not hard to modify the code of gpuowl to do LL tests. I've already modified an LL version of a relatively new version of gpuowl (not the newest, and without the Jacobi check; however, when running on Google Colab the results are reliable, so that's not a significant problem), and reproduced the LL residue of M1000003 successfully.[/QUOTE]
Including residue shift, or not? Residue shift in the context of LL is a bit trickier than in PRP because one needs to efficiently compute the proper bit position at which to inject the -2 into each iteration's autosquare result. I've only been running gpuOwl for ~1 month, so perhaps one of the old hands can tell me whether previous versions of the code which supported LL also supported shift.
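The bit-position bookkeeping described above can be sketched in a few lines of Python. This is an illustrative toy (the names and structure are mine, not gpuOwl's): if the residue is stored shifted left by s bits mod Mp = 2^p - 1, squaring doubles the shift, so the -2 must be injected shifted by 2s.

```python
# Toy sketch of a shifted Lucas-Lehmer iteration (not gpuOwl's code).
# Plain LL: x <- x^2 - 2 (mod Mp). If we store y = x * 2^s (mod Mp),
# then y^2 = x^2 * 2^(2s), so the -2 must be injected at bit 2s.
def ll_shifted(p, s0=3):
    """Run the LL test on Mp = 2^p - 1, tracking both a plain and a
    shifted residue; returns the two final residues."""
    Mp = (1 << p) - 1
    x = 4                       # plain residue, for reference
    s = s0 % p
    y = (4 << s) % Mp           # shifted residue
    for _ in range(p - 2):
        x = (x * x - 2) % Mp
        s2 = (2 * s) % p                  # the shift doubles on squaring
        y = (y * y - (2 << s2)) % Mp      # inject -2 at the shifted position
        s = s2
        assert y == (x << s) % Mp         # shifted residue tracks the plain one
    return x, y

# M7 = 127 is prime, so both residues finish at 0:
print(ll_shifted(7))   # (0, 0)
```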

[QUOTE=paulunderwood;539031]After running gpuOwl for a week or so Linux "kworker" creeps up and uses nearly a full core which detracts from LLR crunching on the CPU. A simple fix is stop gpuOwl and immediately resume it. :cool:[/QUOTE]
Thanks for the reminder - noticed this a.m. that my Mlucas job on the Haswell system which hosts my Radeon VII was running ~5% slower than it should - and cool weather moved in yesterday, so thermal throttling was not the culprit. Just did a 'top', spotted a kworker task eating ~10% CPU time.

Fan Ming 2020-03-07 02:55

Double check succeeded:
[M]50840131[/M]

Fan Ming 2020-03-07 02:58

[QUOTE=ewmayer;539042]Including residue shift, or not? Residue shift in the context of LL is a bit trickier than in PRP because one needs to efficiently compute the proper bit position at which to inject the -2 into each iteration's autosquare result. I've only been running gpuOwl for ~1 month, so perhaps one of the old hands can tell me whether previous versions of the code which supported LL also supported shift.
[/QUOTE]

IIRC I've seen some earlier version of gpuowl support residue shift. Residue shift is indeed a little tricky and I haven't considered it yet. I would like to wait for the "official" add-back of LL :).

Fan Ming 2020-03-07 03:00

[QUOTE=kriesel;539036]Mihai showed a way to add the Jacobi check back at V0.6 Addition of Jacobi check to LL flavor of gpuOwL [URL="http://www.mersenneforum.org/showpost.php?p=465145&postcount=46"]http://www.mersenneforum.org/showpos...5&postcount=46[/URL] 2017-08-08[/QUOTE]

Yes, it would not be hard to port the Jacobi check to current versions, I guess.

mrh 2020-03-08 21:41

[QUOTE=preda;538851]For what exponent range? you should use the P-1 probability calculator
[url]https://www.mersenne.ca/prob.php[/url]
to get an idea of what to expect.

For an intuition, there is a range of the ratio between B1 and B2 that makes sense. I'd say this range is 10 to 100. If B2 is much-much larger than B1, you're wasting power.

Based on that, for a B2=4'000'000'000 (i.e. around 2^32), you'd want a B1 of at least 40'000'000 (but probably B1=100'000'000 would make more sense). Now you see that probably you could do a full PRP for less than that.

So unless you have a very special reason, doing P-1 with B2 > 2^32 seems like a bad idea to me. Maybe you could explain more about what you're trying to achieve.[/QUOTE]

Good points. I was trying to factor a few already known composites that I'm interested in factoring. There is no real good reason to do so, other than for fun.

preda 2020-03-08 23:35

[QUOTE=mrh;539185]Good points. I was trying to factor a few already known composites that I'm interested in factoring. There is no real good reason to do so, other than for fun.[/QUOTE]

I heard that elliptic curves (ECM) are a more efficient factorization tool (compared to P-1) past a certain point; but I don't know much about that.

P-1 only finds factors p where p-1 is highly composite (smooth). If you're unlucky and the factors you're searching for don't have highly composite p-1, P-1 simply won't find them.

preda 2020-03-10 02:07

small updates
 
In recent commits, I
- dropped WORKINGOUT0,1,1A,2 because they were not "best" in any setup
- made WORKINGOUT5 the new default on AMD, because it's faster on ROCm 3.1 (one can specify -use WORKINGOUT3 to get the old behavior on AMD)
- factored out the common code of fftMiddleOut(), which should make future changes easier
- reported another basic ROCm 3.1 bug, see [url]https://github.com/RadeonOpenCompute/ROCm/issues/1039[/url]

ewmayer 2020-03-10 02:15

[QUOTE=preda;539192]I heard that elliptic curves (ECM) are a more efficient factorization tool (compared to P-1) past a certain point; but I don't know much about that.

P-1 only finds factors p where p-1 is highly composite (smooth). If you're unlucky and the factors you're searching for don't have highly composite p-1, P-1 simply won't find them.[/QUOTE]

If the number in question has multiple smallish factors, p-1 is a good bet to find at least one of them, though not nec. the smallest. A good way to think about p-1 is that it is like running a single super-efficient ECM curve - thus bounds can be quite a bit deeper than for each ECM curve. But unlike ECM, p-1 depends on a hoped-for smoothness property of one of the factors, i.e. it can't be Monte-Carlo'd like ECM can, where each distinct curve yields a different underlying group order which turns up a factor if it happens to be smooth.
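The "hoped-for smoothness" can be made concrete with a toy stage-1 sketch. This is illustrative only - nothing like gpuowl's FFT-based implementation - but it shows the mechanism: raise a base to the product of all prime powers up to B1, then take a gcd.

```python
# Toy P-1 stage 1 (illustrative, not gpuowl's code): finds a factor q of N
# whenever q-1 is B1-smooth, i.e. every prime power dividing q-1 is <= B1.
from math import gcd

def primes_upto(n):
    """Simple sieve of Eratosthenes."""
    sieve = bytearray([1]) * (n + 1)
    sieve[:2] = b"\x00\x00"
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(sieve[i * i :: i]))
    return [i for i in range(n + 1) if sieve[i]]

def pminus1_stage1(N, B1, base=3):
    """Return a non-trivial factor of N, or None if stage 1 misses."""
    a = base
    for p in primes_upto(B1):
        pe = p                      # largest power of p not exceeding B1
        while pe * p <= B1:
            pe *= p
        a = pow(a, pe, N)           # a = base^E mod N for a B1-smooth E
    g = gcd(a - 1, N)
    return g if 1 < g < N else None

# M37 = 2^37 - 1 has the factor 223, and 222 = 2 * 3 * 37 is 100-smooth,
# while the cofactor's q-1 contains the primes 167 and 1039:
print(pminus1_stage1(2**37 - 1, B1=100))   # 223
```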

ewmayer 2020-03-10 21:39

By the way, what does it mean if I get an "exponent/[exponent].owl file invalid" message on program restart-after-interrupt? The run in question seemed to continue OK from where it left off, so the message has me puzzled.

preda 2020-03-10 22:17

[QUOTE=ewmayer;539333]By the way, what does it mean if I get an "exponent/[exponent].owl file invalid" message on program restart-after-interrupt? The run in question seemed to continue OK from where it left off, so the message has me puzzled.[/QUOTE]

What kind of interrupt? -- normal exit after Ctrl-C (or kill -INT) should not produce an invalid savefile.

Maybe the program loaded from the backup savefile (x-old.owl) and continued from there. Can you reproduce it -- do you get the "invalid" every time when you stop/restart?

ewmayer 2020-03-10 22:48

Interrupt was a machine crash - this is my ever-flaky Haswell system. Here's the relevant logfile excerpt, with my [crash] annotation and the error message bolded... note that said annotation replaces a string of unprintable chars my editor warns me about on opening the file:
[quote]2020-03-10 11:23:48 gfx906+sram-ecc-0 102991709 OK 10200000 9.90%; 753 us/it; ETA 0d 19:25; 36e1b632ae34878c (check 0.43s)
2020-03-10 11:26:21 gfx906+sram-ecc-0 102991709 OK 10400000 10.10%; 753 us/it; ETA 0d 19:21; 6d3f77a36e3131f3 (check 0.42s)
[crash and reboot]
2020-03-10 11:32:26 Note: not found 'config.txt'
2020-03-10 11:32:26 device 0, unique id ''
2020-03-10 11:32:27 gfx906+sram-ecc-0 102991709 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 17.86 bits/word
2020-03-10 11:32:27 gfx906+sram-ecc-0 OpenCL args "-DEXP=102991709u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0x1.1a6c87b447f22p+0 -DIWEIGHT_STEP=0x1.d018bc69e315ep-1 -DWEIGHT_BIGSTEP=0x1.ae89f995ad3adp+0 -DIWEIGHT_BIGSTEP=0x1.306fe0a31b715p-1 -DAMDGPU=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-03-10 11:32:30 gfx906+sram-ecc-0 warning: argument unused during compilation: '-I .'

2020-03-10 11:32:30 gfx906+sram-ecc-0 OpenCL compilation in 2.68 s
[b]2020-03-10 11:32:30 gfx906+sram-ecc-0 '/home/ewmayer/gpuowl/run0/102991709/102991709.owl' invalid[/b]
2020-03-10 11:32:31 gfx906+sram-ecc-0 102991709 OK 10400000 loaded: blockSize 400, 6d3f77a36e3131f3
2020-03-10 11:32:32 gfx906+sram-ecc-0 102991709 OK 10400800 10.10%; 739 us/it; ETA 0d 19:00; 05b55ba4212cf24f (check 0.41s)
2020-03-10 11:35:02 gfx906+sram-ecc-0 102991709 OK 10600000 10.29%; 754 us/it; ETA 0d 19:22; b9ab5fb12a1b7357 (check 0.42s)
2020-03-10 11:37:33 gfx906+sram-ecc-0 102991709 OK 10800000 10.49%; 754 us/it; ETA 0d 19:18; 14d7a3f361dfd1cb (check 0.43s)[/quote]
First time I'd seen the 'invalid' error message, but now that I grep for it in the gpuowl.log file, I see 12 occurrences including the latest one above. These are far from the only program restarts in the log, but as it happens, each of the 'invalid's coincides with a crash that left one of the aforementioned unprintable-char sequences in the log. So data corruption is likely responsible for the symptoms - is the 'invalid' telling me the program found a corrupted primary restart file and resorted to the secondary one?

paulunderwood 2020-03-11 00:59

[QUOTE=ewmayer;539344]Interrupt was a machine crash - this is my ever-flaky Haswell system. [snip][/QUOTE]

My Haswell won't boot unless I give it more voltage: vcore 1v-->1.1v (?)

ewmayer 2020-03-11 02:43

[QUOTE=paulunderwood;539352]My Haswell won't boot unless I give it more voltage: vcore 1v-->1.1v (?)[/QUOTE]

Good thought - mine's never failed to boot, but does not-infrequently freeze immediately on reboot. I'll check this setting after the next BSOD. Especially in very warm weather like we're having in NoCal currently, it shouldn't be very long.

Prime95 2020-03-11 03:14

My first Haswell would crash when going from a high power state to a low power state (e.g. stopping prime95). I worked around the problem by disabling C states.

preda 2020-03-11 03:44

[QUOTE=ewmayer;539344]So data corruption is likely responsible for the symptoms - is the 'invalid' telling me the program found a corrupted primary restart file and resorted to the secondary one?[/QUOTE]

Yes. All gpuowl does for the savefile is write the file and close it. From that point on, it's the OS's job to persist the file to disk. It turns out the OS is often lazy and prefers to keep the data in RAM for a while longer, and if an OS crash happens in this window, the savefile isn't properly persisted.
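For reference, a sketch of what "properly persisted" would take (illustrative, not gpuowl's actual savefile code): an explicit fsync plus an atomic rename, so a crash leaves either the complete old file or the complete new one, never a partial write.

```python
# Illustrative crash-safe save (not gpuowl's code): flush + fsync forces
# the OS to push the data to disk before the rename makes it visible.
import os

def save_atomically(path: str, data: bytes) -> None:
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()              # push Python's buffer down to the OS
        os.fsync(f.fileno())   # force the OS to persist it to disk
    os.replace(tmp, path)      # atomic on POSIX: old or new, never partial
```

The write-and-close approach plus a backup savefile (the x-old.owl fallback mentioned above) is a reasonable lighter-weight alternative, since fsync on every checkpoint costs throughput.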

ewmayer 2020-03-11 19:44

[QUOTE=Prime95;539358]My first Haswell would crash when going from a high power state to a low power state (e.g. stopping prime95). I worked around the problem by disabling C states.[/QUOTE]

Yes, I get this a lot too with Mlucas on the Haswell - e.g. if I see that a given job has switched to a larger FFT length due to ROE >= 0.4375, but the error is an outlier in the context of the run, I go to kill the job in order to restart it and force the default lower FFT length - crash.

Had another crash overnight, had a look at core voltage in the BIOS, 1.03-something - but I couldn't see a way to change it in the setup menu for [url=https://us.msi.com/Motherboard/support/Z87-G41-PC-Mate]my MSI MoBo[/url] - there is a [manual] setting option for it, but when I enabled that and scrolled to the actual setting field beneath it, the latter permitted no changes. Maybe somewhere in the Overclock submenu? Ah, now that I actually RTFM, under the OC submenu I see a bunch of stuff including

VCCIN voltage - Set the CPU input voltage
CPU core voltage/CPU ring voltage/CPU GT voltage
Overvoltage protection

George, what model MoBo do you have? I see nothing resembling "C states" in the MSI manual.

Prime95 2020-03-11 23:15

[QUOTE=ewmayer;539449]George, what model MoBo do you have? I see nothing resembling "C states" in the MSI manual.[/QUOTE]

I'm not home right now. Probably Gigabyte or MSI.

C states are discussed on page 3-23 of your motherboard's manual. It could be that different Intel chipsets expose different BIOS settings.

Also, look for a setting that boosts voltage when AVX instructions are detected. IIRC, that should be set to add 0.1V.

preda 2020-03-14 06:47

Radeon VII performance comparison
 
Hi, I propose a standard setup for performance numbers on RadeonVII which would make them easier to compare.

So the proposed "standard" RadeonVII gpuowl setup for perf measurements would be:
- sclk 4 (about 1520 MHz)
- memory at 1180

Also, the GPU should not be cold -- the measurements should be in "stable-state", e.g. after at least a few minutes of running. Also the GPU should not be "thermal throttling", the fan should be high enough to keep the GPU relatively cool. I think that a GPU with the hot-spot temperature (the highest temperature among the three reported) less than 90 (possibly even a few degrees higher) would not be throttling.

Please include the FFT size with perf measurements. Of course the most important FFT size is the "wavefront", which ATM is 5M, but the wavefront FFT does change over time, so better to be specific.

For example,
On ROCm 3.1 I get 664 us/it at 5M FFT.

This is with sclk 4, RAM 1180, temperature 90, power about 185W; the fan at about 2500 RPM. Linux kernel 5.4.24.

If 1180 RAM is too fast for some GPUs, we could settle on a lower value that's acceptable for almost everybody (maybe 1150?).

kriesel 2020-03-14 15:28

[QUOTE=preda;539687]Hi, I propose a standard setup for performance numbers on RadeonVII which would make them easier to compare.

So the proposed "standard" RadeonVII gpuowl setup for perf measurements would be:
- sclk 4 (about 1520 [B]M[/B]Hz)
- memory at 1180

Also, the GPU should not be cold -- the measurements should be in "stable-state", e.g. after at least a few minutes of running. Also the GPU should not be "thermal throttling", the fan should be high enough to keep the GPU relatively cool. I think that a GPU with the hot-spot temperature (the highest temperature among the three reported) less than 90 (possibly even a few degrees higher) would not be throttling[/QUOTE]
sclk presumes Linux. As far as I know there is no Windows equivalent.
Some GPUs cannot run a 1180 MHz memory clock reliably, or a 1520 MHz GPU clock.
Whatever the performance run parameters are, all relevant parameters should be stated along with the timing, so that the timing is not meaningless.

PhilF 2020-03-14 16:08

[QUOTE=preda;539687]Hi, I propose a standard setup for performance numbers on RadeonVII which would make them easier to compare.

So the proposed "standard" RadeonVII gpuowl setup for perf measurements would be:
- sclk 4 (about 1520 MHz)
- memory at 1180

Also, the GPU should not be cold -- the measurements should be in "stable-state", e.g. after at least a few minutes of running. Also the GPU should not be "thermal throttling", the fan should be high enough to keep the GPU relatively cool. I think that a GPU with the hot-spot temperature (the highest temperature among the three reported) less than 90 (possibly even a few degrees higher) would not be throttling.

Please include with perf measurements the FFT size. Of course the most important FFT size is the "wavefront" which ATM is 5M, but the wavefront FFT does change overtime so better to be specific.

For example,
On ROCm 3.1 I get 664 us/it at 5M FFT.

This is with sclk 4, RAM 1180, temperature 90, power about 185W; the fan at about 2500 RPM. Linux kernel 5.4.24.

If 1180 RAM is too fast for some GPUs, we could settle on a lower value that's acceptable for almost everybody (maybe 1150?).[/QUOTE]

My Radeon VII won't run except at the stock memory clock speed. Even 1050 MHz produces errors.

ewmayer 2020-03-14 19:45

[QUOTE=kriesel;539701]sclk presumes Linux. As far as I know there is no Windows equivalent.
Some GPUs cannot run a 1180 MHz memory clock reliably, or a 1520 MHz GPU clock.
Whatever the performance run parameters are, all relevant parameters should be stated along with the timing, so that the timing is not meaningless.[/QUOTE]

I've still only ever gotten the manual sclk-setting working under Ubuntu 19.10 ... Preda may recall [url=https://mersenneforum.org/showthread.php?t=24979&page=9]my flailings-about in trying to fiddle the mem-clocking[/url]. In post #64 of that thread I tabulated per-iter times @5632K FFT on my system, mem-clock at the default 1001MHz:
[code]--setsclk 5: 757 us/iter, temp = 70C, watts = 400 [~120 of those are baseline, including an ongoing 4-thread Mlucas job on the CPU]
--setsclk 4: 792 us/iter, temp = 65C, watts = 350
--setsclk 3: 848 us/iter, temp = 63C, watts = 300[/code]
The temps are not meaningful except in a comparative sense - it seems e.g. Win and Linux interfaces take temps from different sensors on the GPU. I've seen Win users talking about temps being routinely in the 90-100C range, whereas in my setup the card starts throttling any time whichever sensor is measuring the above rocm-smi-displayed temp gets close to 80C.

kriesel 2020-03-14 21:03

[QUOTE=ewmayer;539719]it seems e.g. Win and Linux interfaces take temps from different sensors on the GPU. I've seen Win users talking about temps being routinely in the 90-100C range, whereas in my setup the card starts throttling any time whichever sensor is measuring the above rocm-smi-displayed temp gets close to 80C.[/QUOTE]
There are lots of sensors on a RadeonVII. Apparently numerous sensors built into the chip. Similarly for the RX5700. [URL]https://www.tomshardware.com/news/amd-rx-5700-graphics-card-thermal-management,40144.html[/URL]
It throttles by design at 110C on the hottest of the many sensors.

GPU-Z temperature displays for a Radeon VII (this one is cut back considerably on clock rates for reliability, so running cooler than most):
GPU 62C
GPU hot spot 72C
memory 64C
GPU VRM 63C
SOC VRM 58C
Mem1 64C
Mem2 66C
fan speed 33%
CPU temp 63C

ewmayer 2020-03-14 21:45

By way of p.s. to my above post - I finally got around to firing up a 2nd gpuOwl run on my Radeon VII ... when I first installed it Preda told me "based on current data, 1-job running is about the same as 2-job in terms of total throughput per watt", but more recently George suggested that in fact 2-job remains better in those terms. I was unable to use Matt's radeon_setup.sh script to do the manual clock/mem tunings he implements there - even as root I get "Permission denied" whenever I try to touch any of the entries under /sys/class/drm/card0/device. So I just stuck with my current manual sclk = 5 downclock setting, created a 2nd run-subdir under the gpuowl one, and fired up the 2nd job. Both running @5632K FFT, here are the before and after timings:

Before:
Job 1: 753 us/iter

After:
Job 1: 1407 us/iter
Job 2: 1407 us/iter

Wattage barely budged - up less than 5%, so total throughput up 7%, slightly less than that on a per-watt basis. (But definitely better in per-watt terms than I get from cranking the sclk setting up to 6 for a single run).
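The throughput figure above can be checked directly from the per-iteration timings:

```python
# One job at 753 us/iter vs. two concurrent jobs at 1407 us/iter each.
one_job = 1 / 753            # iterations per microsecond, single job
two_jobs = 2 / 1407          # combined iterations per microsecond, two jobs
gain = two_jobs / one_job - 1
print(f"{gain:.1%}")         # 7.0%
```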

preda 2020-03-15 10:39

A heads up for people working with GpuOwl's source code, about some build changes in recent commits.

In gpuowl.cl there is a lot of duplication between similar kernels with a few small changes between them; an example being carryFused and carryFusedMul, which do almost the same thing with the difference that carryFusedMul also does a multiplication-by-3. Unfortunately there is no good mechanism in OpenCL proper to share the common code between the two kernels without a potential performance hit.

Rather than having the code duplicated between the two kernels, I chose to add a simple form of macro expansion, implemented by the python script tools/expand.py. This script interprets a few special-form comments in gpuowl.cl:
'//{{' : start a named block
'//}}' : end a named block
'//==' : instantiate a block

This scheme allows defining the body of a kernel once and then instantiating it multiple times in different contexts.
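A toy sketch of the scheme (illustrative only - the real logic lives in tools/expand.py, and the exact comment syntax there may differ):

```python
# Toy named-block expander in the spirit of tools/expand.py (not the real
# script): //{{ NAME starts a block, //}} ends it, //== NAME instantiates it.
def expand(src: str) -> str:
    blocks, out, current, body = {}, [], None, []
    for line in src.splitlines():
        if line.startswith("//{{"):           # start capturing a named block
            current, body = line[4:].strip(), []
        elif line.startswith("//}}"):         # end of the block definition
            blocks[current], current = body, None
        elif line.startswith("//=="):         # instantiate a captured block
            out.extend(blocks[line[4:].strip()])
        elif current is not None:
            body.append(line)                 # inside a definition
        else:
            out.append(line)                  # ordinary source line
    return "\n".join(out)

src = """\
//{{ CARRY_BODY
x = carryStep(x);
//}}
kernel void carryFused() {
//== CARRY_BODY
}
kernel void carryFusedMul() {
//== CARRY_BODY
x *= 3;
}"""
print(expand(src))
```

The kernel body is written once and stamped out into both carryFused and carryFusedMul, matching the duplication-avoidance goal described above.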

From a build perspective, the path is now:
gpuowl.cl -> gpuowl-expanded.cl -> gpuowl-wrap.cpp

The file gpuowl-expanded.cl is generated; there is no point in editing it, the source is still gpuowl.cl.

gpuowl-expanded.cl, being generated, does not need to be under source control, but for now I added it as a convenience for people who don't have python installed or have difficulty executing tools/expand.py for some reason. (In the future, if everybody is fine building with expand.py, I can remove gpuowl-expanded.cl from source control.)

LaurV 2020-03-16 02:45

[QUOTE=kriesel;539701]Whatever the performance run parameters are, all relevant parameters should be stated along with the timing, so that the timing is not meaningless.[/QUOTE]
+1

preda 2020-03-16 15:06

ROCm 3.1, sclk 3, mem 1180, FFT 5M: 708us/it. (150W)

