mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

PhilF 2020-05-07 21:05

[QUOTE=xx005fs;544813]It seems that for factored PM1 results out of GPUOWL, primenet won't be able to understand it.

[CODE]{"status":"F", "exponent":"98141611", "worktype":"PM1", "B1":"750000", "B2":"15000000", "fft-length":"5767168", "factors":"["****"]", "program":{"name":"gpuowl", "version":"v6.11-258-gb92cdfd"}, "computer":"TITAN V-0", "aid":"******", "timestamp":"2020-05-06 07:29:29 UTC"}[/CODE][/QUOTE]

Hmmm. I have reported P-1 factors from gpuOwL before with no problem, by copying/pasting the result into the manual submission form.

ewmayer 2020-05-07 21:41

I figured us = microseconds was clear from the context.

@PaulU: Just noticed per-iter timings of the 2 jobs on my R7 also went askew early this a.m. ... as of midnight both were PRPing expos ~103.9M @5.5M FFT, each run ~1470 us/iter. Around 1am one job's PRP finished and that task started a PRP of an expo ~104.9M, still at 5.5M FFT, but the per-iter time of that job dropped to 1265 us/iter right from the beginning, while at the same time the per-iter times of the other ongoing job with p ~ 103.9M jumped to 1664 us/iter. I killed and restarted both jobs first thing this morning by way of daily kworker-task CPU-cycle parasitism control; the timing disparity continued after both were restarted.

Looking closely at the two OpenCL args lists for the 2 jobs, FFT params are the same; the main diffs are the expected ones in the various DWT-weights-associated consts of the 2 expos ... the only salient-appearing diff I see is that the p ~ 104.9M job sports an extra "-DMM2_CHAIN=1u" arg which the other one lacks. Whatever that means code-branch and memory-map-wise, it apparently caused the ROCm priority management engine to give that job higher priority. Total throughput for 2 jobs running ~1470 us/iter each was ~1360 iter/sec; with the timing disparity it is ~1390 iter/sec, so I've actually gained a few % total throughput.

preda 2020-05-07 22:22

[QUOTE=xx005fs;544813]It seems that for factored PM1 results out of GPUOWL, primenet won't be able to understand it.

[CODE]{"status":"F", "exponent":"98141611", "worktype":"PM1", "B1":"750000", "B2":"15000000", "fft-length":"5767168", "factors":"["****"]", "program":{"name":"gpuowl", "version":"v6.11-258-gb92cdfd"}, "computer":"TITAN V-0", "aid":"******", "timestamp":"2020-05-06 07:29:29 UTC"}[/CODE][/QUOTE]

There was a bug, an extra set of quotes around the factors array: "["****"]". The bug has been fixed (you can upgrade), and this result can probably be submitted by manually dropping the extra quotes, to: "factors":["****"]
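For anyone hand-fixing an affected result line, the difference is visible to any JSON parser. A minimal sketch (with "****" standing in for the redacted factor, as in the quoted result; the extra fields are trimmed for brevity):

```python
import json

# Buggy form from the affected gpuowl build: an extra pair of quotes
# around the factors array makes the whole line malformed JSON.
buggy = '{"status":"F", "exponent":"98141611", "factors":"["****"]"}'
# Fixed form, as preda describes: drop the extra quotes.
fixed = '{"status":"F", "exponent":"98141611", "factors":["****"]}'

try:
    json.loads(buggy)
except json.JSONDecodeError as e:
    print("buggy line rejected:", e.msg)

print(json.loads(fixed)["factors"])  # ['****']
```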

LaurV 2020-05-08 06:24

[QUOTE=S485122;544825]"us" ? Usually it is capitalised as "US", but it is not a unit (AFAIK.) Or do you (and preceding posters) mean µs ?
Jacob[/QUOTE]
I assume that was a nitpicking/joke. If it was not, then you should learn that "u" is the right/standard/accepted abbreviation for "micro" in all domains I ever touched, wherever typing µ or \(\mu\) would be tedious. Including computer science and software manufacturing (see the famous uVision from Keil, or uTorrent, etc.). In my daily work I measure electric potential in uV (microvolts), current in uA (microamperes), and thickness of bonding wires in um (micrometers, or microns).

kriesel 2020-05-08 16:12

gpuowl-win v6.11-278-ga39cc1a build
 
The usual shower of compile warnings, tested only as far as included help output, etc.

kriesel 2020-05-08 16:46

Third RX550 (a 2GB model) showed a transient EE issue on a different system. This is an open-frame setup with no known temperature issues.[CODE]2020-05-07 23:10:36 gpuowl v6.11-272-g07718b9
2020-05-07 23:10:36 config: -user kriesel -cpu asr2/rx550 -d 1 -use NO_ASM
2020-05-07 23:10:36 device 1, unique id ''
2020-05-07 23:10:36 asr2/rx550 worktodo.txt line ignored: ""
2020-05-07 23:10:36 asr2/rx550 107000389 FFT: 6M 1K:12:256 (17.01 bpw)
2020-05-07 23:10:36 asr2/rx550 Expected maximum carry32: 25260000
2020-05-07 23:10:37 asr2/rx550 OpenCL args "-DEXP=107000389u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=12u -DWEIGHT_STEP=0xf.eb7509fc7be48p-3 -DIWEIGHT_STEP=0x8.0a52bc152d0dp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DPM1=0 -DAMDGPU=1 -DNO_ASM=1 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-05-07 23:10:39 asr2/rx550 OpenCL compilation in 2.29 s
2020-05-07 23:10:48 asr2/rx550 107000389 OK 0 loaded: blockSize 400, 0000000000000003
2020-05-07 23:11:09 asr2/rx550 107000389 OK 800 0.00%; 17645 us/it; ETA 21d 20:27; 4f39fc137c27de54 (check 7.30s)
2020-05-08 00:13:46 asr2/rx550 107000389 EE 200000 0.19%; 18837 us/it; ETA 23d 06:49; 65fe4f6dd6c92d4e (check 8.97s)
2020-05-08 00:13:55 asr2/rx550 107000389 EE 800 loaded: blockSize 400, 79b18fd6bfda22f9 (expected 4f39fc137c27de54)
2020-05-08 00:13:55 asr2/rx550 Exiting because "error on load"
2020-05-08 00:13:55 asr2/rx550 Bye

C:\Users\ken\Documents\gpuowl-v6.11-272>gpuowl-win
2020-05-08 01:03:03 gpuowl v6.11-272-g07718b9
2020-05-08 01:03:03 config: -user kriesel -cpu asr2/rx550 -d 1 -use NO_ASM
2020-05-08 01:03:03 device 1, unique id ''
2020-05-08 01:03:03 asr2/rx550 worktodo.txt line ignored: ""
2020-05-08 01:03:03 asr2/rx550 107000389 FFT: 6M 1K:12:256 (17.01 bpw)
2020-05-08 01:03:03 asr2/rx550 Expected maximum carry32: 25260000
2020-05-08 01:03:04 asr2/rx550 OpenCL args "-DEXP=107000389u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=12u -DWEIGHT_STEP=0xf.eb7509fc7be48p-3 -DIWEIGHT_STEP=0x8.0a52bc152d0dp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DPM1=0 -DAMDGPU=1 -DNO_ASM=1 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-05-08 01:03:11 asr2/rx550 OpenCL compilation in 7.20 s
2020-05-08 01:03:20 asr2/rx550 107000389 OK 800 loaded: blockSize 400, 4f39fc137c27de54
2020-05-08 01:03:41 asr2/rx550 107000389 OK 1600 0.00%; 17649 us/it; ETA 21d 20:34; 00cff77f1e4010a4 (check 7.32s)
2020-05-08 02:02:08 asr2/rx550 107000389 OK 200000 0.19%; 17639 us/it; ETA 21d 19:18; 65fe4f6dd6c92d4e (check 7.32s)
2020-05-08 03:01:04 asr2/rx550 107000389 OK 400000 0.37%; 17645 us/it; ETA 21d 18:29; bbdb6f6d3790a362 (check 7.32s)
2020-05-08 04:00:00 asr2/rx550 107000389 OK 600000 0.56%; 17639 us/it; ETA 21d 17:20; 902382f6237a1979 (check 7.32s)
2020-05-08 04:58:56 asr2/rx550 107000389 OK 800000 0.75%; 17645 us/it; ETA 21d 16:31; 8087f982145cff93 (check 7.32s)
2020-05-08 05:57:51 asr2/rx550 107000389 OK 1000000 0.93%; 17639 us/it; ETA 21d 15:22; 6d75bf2bfb36a594 (check 7.32s)
2020-05-08 06:56:47 asr2/rx550 107000389 OK 1200000 1.12%; 17645 us/it; ETA 21d 14:34; 48c73046fff69459 (check 7.32s)
2020-05-08 07:55:43 asr2/rx550 107000389 OK 1400000 1.31%; 17639 us/it; ETA 21d 13:25; e84d6918ae180382 (check 7.32s)
2020-05-08 08:54:39 asr2/rx550 107000389 OK 1600000 1.50%; 17645 us/it; ETA 21d 12:37; 0e04b83a1aa1b2f2 (check 7.32s)
2020-05-08 09:53:34 asr2/rx550 107000389 OK 1800000 1.68%; 17639 us/it; ETA 21d 11:28; 8d7a21d8a97a586f (check 7.32s)
2020-05-08 10:52:31 asr2/rx550 107000389 OK 2000000 1.87%; 17645 us/it; ETA 21d 10:39; c213cfc1386c1fca (check 7.32s)
[/CODE]

ewmayer 2020-05-09 02:50

[QUOTE=ewmayer;544843]@PaulU: Just noticed per-iter timings of the 2 jobs on my R7 also went askew early this a.m. ... as of midnight both were PRPing expos ~103.9M @5.5M FFT, each run ~1470 us/iter. Around 1am one job's PRP finished and that task started a PRP of an expo ~104.9M, still at 5.5M FFT, but the per-iter time of that job dropped to 1265 us/iter right from the beginning, while at the same time the per-iter times of the other ongoing job with p ~ 103.9M jumped to 1664 us/iter. I killed and restarted both jobs first thing this morning by way of daily kworker-task CPU-cycle parasitism control; the timing disparity continued after both were restarted.

Looking closely at the two OpenCL args lists for the 2 jobs, FFT params are the same; the main diffs are the expected ones in the various DWT-weights-associated consts of the 2 expos ... the only salient-appearing diff I see is that the p ~ 104.9M job sports an extra "-DMM2_CHAIN=1u" arg which the other one lacks. Whatever that means code-branch and memory-map-wise, it apparently caused the ROCm priority management engine to give that job higher priority. Total throughput for 2 jobs running ~1470 us/iter each was ~1360 iter/sec; with the timing disparity it is ~1390 iter/sec, so I've actually gained a few % total throughput.[/QUOTE]

Let's call the aforementioned 2 instances run0 and run1, using the subdir-names in which I run them. Earlier today run0 finished its p ~104.9M job and started one with p ~103.9M, at which point the 2 run timings again equalized at 1470 us/iter. Just now run1 finished a p ~103.9M job and started one with p ~104.9M, at which point I expected the timing-skew to resume, this time in favor of run1 ... but no, timings remain unchanged, identical. But I see this latest p ~104.9M job lacks the extra "-DMM2_CHAIN=1u" OpenCL arg of the earlier one ... likely because it has p just below 104.9M, whereas the earlier job had p slightly above 104.9M.

Preda, I'm guessing -DMM2_CHAIN is an accuracy-related flag, which kicks in at the higher p-ranges of each FFT length? If so, what is the precise breakover point at 5.5M FFT?

Prime95 2020-05-09 03:48

[QUOTE=ewmayer;544932]
Preda, I'm guessing -DMM2_CHAIN is an accuracy-related flag, which kicks in at the higher p-ranges of each FFT length? If so, what is the precise breakover point at 5.5M FFT?[/QUOTE]

As you get closer and closer to the FFT limit, there are several improved-accuracy-but-slower versions. The flags are MM2_CHAIN=1,2,3 and MM_CHAIN=1,2,3. At a later date (I hope) there will also be an ULTRA_TRIG=1.

From FFTConfig.h: 5.5M FFT supports 18.489 bits-per-FFT-word which gets the slowest code.

From FFTconfig.cpp: {0.06964, 0.14050, 0.03840, 0.02710, 0.01719, 0.00497},
which says 0.00497 bpw from the max we ease up a little bit, at 0.00497+0.01719 bpw from the max we ease up a little more, and so forth.

ewmayer 2020-05-09 19:57

[QUOTE=Prime95;544933]As you get closer and closer to the FFT limit, there are several improved-accuracy-but-slower versions. The flags are MM2_CHAIN=1,2,3 and MM_CHAIN=1,2,3. At a later date (I hope) there will also be an ULTRA_TRIG=1.

From FFTConfig.h: 5.5M FFT supports 18.489 bits-per-FFT-word which gets the slowest code.

From FFTconfig.cpp: {0.06964, 0.14050, 0.03840, 0.02710, 0.01719, 0.00497},
which says 0.00497 bpw from the max we ease up a little bit, at 0.00497+0.01719 bpw from the max we ease up a little more, and so forth.[/QUOTE]

Thanks, but ITYM e.g. "within 0.00497 bpw of max we ease up a lot, within (0.00497+0.01719) we ease up a little less", etc. Because the math only works for me when I add all 6 ease-up fractions to get 0.29780, and (letting n = 5.5*2^20) observe that (18.489 - 0.29780)*n = 104911706.5..., which lies between the 2 exponents (104892731 and 104972429) just-on-either-side of the first (MM2_CHAIN=1) ease-up threshold.
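That arithmetic can be sketched as follows (my own reconstruction, assuming the FFTconfig.cpp deltas are cumulative bpw offsets below the maximum; the variable names are mine, not gpuowl's):

```python
MAX_BPW = 18.489   # max bits-per-FFT-word at 5.5M FFT (from FFTConfig.h)
DELTAS = [0.06964, 0.14050, 0.03840, 0.02710, 0.01719, 0.00497]  # FFTconfig.cpp
N = 5 * 2**20 + 2**19  # 5.5M FFT length = 5767168 words

# The first (mildest) ease-up kicks in once the exponent exceeds
# (MAX_BPW - sum of all six deltas) * N.
first_easeup = (MAX_BPW - sum(DELTAS)) * N
print(first_easeup)  # ~104911706.5, between 104892731 and 104972429
```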

And as I noted, on my system having one run at MM2_CHAIN=1 and the other with no ease-up counterintuitively gave me 2% more total throughput than with both runs using expos below the threshold, so I'd like to try forcing both of my current runs (which are below-threshold) to use MM2_CHAIN=1 to see what the resulting total throughput is. May I presume that forcing MM2_CHAIN=1 for an expo that does not need it is safe to do?

Prime95 2020-05-09 21:12

[QUOTE=ewmayer;544979]
And as I noted, on my system having one run at MM2_CHAIN=1 and the other with no ease-up counterintuitively gave me 2% more total throughput than with both runs using expos below the threshold, so I'd like to try forcing both of my current runs (which are below-threshold) to use MM2_CHAIN=1 to see what the resulting total throughput is. May I presume that forcing MM2_CHAIN=1 for an expo that does not need it is safe to do?[/QUOTE]

Yes, adding -use MM2_CHAIN=1 is perfectly safe.

ewmayer 2020-05-09 21:55

[QUOTE=Prime95;544986]Yes, adding -use MM2_CHAIN=1 is perfectly safe[/QUOTE]
Cool - did this for my run of 104892731; the expected timing-skew between the 2 runs resumed, and total throughput again went from ~1360 iter/sec to ~1390 iter/sec.

Then I also switched the other run (p = 103923257) to using the flag; timings again equalize, but at 1410 us/iter, meaning total throughput ~1420 iter/sec, a gain of 4.5% [!] over both runs using default settings. That's nearly as much gain as I get from upping my sclk setting from 4 to 5, but the latter ups the wattage by a massive 60W, with temps increasing proportionally. Wattage currently is a mere 5-10W higher than before the switch to both runs using MM2_CHAIN=1.
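For anyone checking those figures, the combined throughput of two equal-speed concurrent runs is just 2 divided by the per-iteration time (my own arithmetic, not from gpuowl):

```python
def combined_throughput(us_per_iter, n_runs=2):
    """Combined iterations/second for n_runs concurrent runs at equal speed."""
    return n_runs / (us_per_iter * 1e-6)

baseline = combined_throughput(1470)  # both runs at defaults: ~1360 iter/sec
tuned = combined_throughput(1410)     # both runs with MM2_CHAIN=1: ~1418 iter/sec
print(f"{(tuned / baseline - 1) * 100:.1f}% gain")
```

The raw timings give a gain of ~4.3%; the ~4.5% figure comes from the rounded 1420 vs 1360 iter/sec numbers.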

Say I start making it the default ... if a run hits an expo which needs an even higher extra-accuracy setting, will that automatically kick in, overriding the user's setting of the flag?

