mersenneforum.org  

Old 2020-05-07, 21:05   #2157
PhilF
 
Feb 2005
Colorado

5×131 Posts
Default

Quote:
Originally Posted by xx005fs View Post
It seems that for factored PM1 results out of GPUOWL, primenet won't be able to understand it.

Code:
{"status":"F", "exponent":"98141611", "worktype":"PM1", "B1":"750000", "B2":"15000000", "fft-length":"5767168", "factors":"["****"]", "program":{"name":"gpuowl", "version":"v6.11-258-gb92cdfd"}, "computer":"TITAN V-0", "aid":"******", "timestamp":"2020-05-06 07:29:29 UTC"}
Hmmm. I have reported PM-1 factors from gpuOwL before with no problem, by copying/pasting the result into the manual submission form.
Old 2020-05-07, 21:41   #2158
ewmayer
2ω=0
 
Sep 2002
República de California

2D7E₁₆ Posts
Default

I figured us = microseconds was clear from the context.

@PaulU: Just noticed per-iter timings of the 2 jobs on my R7 also went askew early this a.m. ... as of midnight both were PRPing expos ~103.9M @5.5M FFT, each run ~1470 us/iter. Around 1am one job's PRP finished and that task started a PRP of an expo ~104.9M, still at 5.5M FFT, but the per-iter time of that job dropped to 1265 us/iter right from the beginning, at the same time the per-iter times of the other ongoing job with p ~ 103.9M jumped to 1664 us/iter. I killed and restarted both jobs first thing this morning by way of daily kworker-task CPU-cycle parasitism control, the timing disparity continued after both were restarted.

Looking closely at the two OpenCL args lists for the 2 jobs, FFT params same, main diffs are the expected ones in the various DWT-weights-associated consts of the 2 expos ... the only salient-appearing diff I see is that the p ~ 104.9M job sports an extra "-DMM2_CHAIN=1u" arg which the other one lacks. Whatever that means code-branch and memory-map-wise, it caused the ROCm priority management engine to apparently give a higher priority to that job. Total throughput for 2 jobs running ~1470 us/iter each was ~1360 iter/sec, with the timing disparity it is ~1390 iter/sec, so I've actually gained a few % total throughput.
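
For anyone who wants to check the throughput arithmetic, here is a minimal Python sketch. It is not part of gpuowl, and the function name is made up for illustration. It converts the per-job us/iter timings quoted above into combined iterations per second.

```python
def total_throughput(us_per_iter):
    """Combined iterations per second for jobs running side by side.

    Each job contributes 1e6 / (microseconds per iteration) iterations per second.
    """
    return sum(1e6 / t for t in us_per_iter)


balanced = total_throughput([1470, 1470])  # both jobs at ~1470 us/iter
skewed = total_throughput([1265, 1664])    # timings after the skew appeared

print(f"balanced: {balanced:.0f} iter/sec")
print(f"skewed:   {skewed:.0f} iter/sec")
print(f"gain:     {100 * (skewed / balanced - 1):.1f}%")
```

The two totals come out at roughly 1360 and 1390 iter/sec, so the skewed pair is a couple of percent ahead.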

Last fiddled with by ewmayer on 2020-05-07 at 22:34
Old 2020-05-07, 22:22   #2159
preda
 
"Mihai Preda"
Apr 2015

3×457 Posts
Default

Quote:
Originally Posted by xx005fs View Post
It seems that for factored PM1 results out of GPUOWL, primenet won't be able to understand it.

Code:
{"status":"F", "exponent":"98141611", "worktype":"PM1", "B1":"750000", "B2":"15000000", "fft-length":"5767168", "factors":"["****"]", "program":{"name":"gpuowl", "version":"v6.11-258-gb92cdfd"}, "computer":"TITAN V-0", "aid":"******", "timestamp":"2020-05-06 07:29:29 UTC"}
There was a bug, an extra set of quotes around the factors array: "["****"]". The bug has been fixed (you can upgrade), and this result can probably be submitted by manually dropping the extra quotes, to: "factors":["****"]
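
If several affected result lines need fixing, the manual edit could be scripted. The following is only a sketch in Python, not anything shipped with gpuowl or PrimeNet; `repair_factors_field` is a hypothetical helper name, and the regular expression assumes the stray quotes sit directly around the factors array as in the line quoted above.

```python
import json
import re


def repair_factors_field(line: str) -> str:
    """Remove the stray quotes wrapped around the "factors" array.

    Lines that are already well-formed are returned unchanged.
    """
    return re.sub(r'"factors":"(\[.*?\])"', r'"factors":\1', line, count=1)


# The result line from the post; factor and assignment ID are masked as in the original.
broken = (
    '{"status":"F", "exponent":"98141611", "worktype":"PM1", '
    '"B1":"750000", "B2":"15000000", "fft-length":"5767168", '
    '"factors":"["****"]", '
    '"program":{"name":"gpuowl", "version":"v6.11-258-gb92cdfd"}, '
    '"computer":"TITAN V-0", "aid":"******", '
    '"timestamp":"2020-05-06 07:29:29 UTC"}'
)

fixed = repair_factors_field(broken)
result = json.loads(fixed)  # parses only after the repair
print(result["factors"])
```

Running the repaired line through `json.loads` is a quick way to confirm it is valid JSON before pasting it into the manual submission form.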
Old 2020-05-08, 06:24   #2160
LaurV
Romulan Interpreter
 
Jun 2011
Thailand

22665₈ Posts
Default

Quote:
Originally Posted by S485122 View Post
"us" ? Usually it is capitalised as "US", but it is not a unit (AFAIK.) Or do you (and preceding posters) mean µs ?
Jacob
I assume that was nitpicking or a joke. If it was not, then you should learn that "u" is the right/standard/accepted abbreviation for "micro" in all domains I ever touched, wherever typing µ or \(\mu\) would be tedious. That includes computer science and software manufacturing (see the famous uVision from Keil, or uTorrent, etc). In my daily work I measure the electric potential in uV (microvolts), current in uA (microamperes), and thickness of bonding wires in um (micrometers, or microns).

Last fiddled with by LaurV on 2020-05-08 at 06:29
Old 2020-05-08, 16:12   #2161
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5,419 Posts
Default gpuowl-win v6.11-278-ga39cc1a build

The usual shower of compile warnings, tested only as far as included help output, etc.
Attached Files
File Type: txt build-log.txt (8.0 KB, 90 views)
File Type: 7z gpuowl-v6.11-272-g07718b9.7z (472.2 KB, 102 views)
Old 2020-05-08, 16:46   #2162
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

152B₁₆ Posts
Default

Third RX550 (a 2GB model) showed a transient EE issue on a different system. This is on an open frame setup with no temperature issues known.
Code:
2020-05-07 23:10:36 gpuowl v6.11-272-g07718b9
2020-05-07 23:10:36 config: -user kriesel -cpu asr2/rx550 -d 1 -use NO_ASM
2020-05-07 23:10:36 device 1, unique id ''
2020-05-07 23:10:36 asr2/rx550 worktodo.txt line ignored: ""
2020-05-07 23:10:36 asr2/rx550 107000389 FFT: 6M 1K:12:256 (17.01 bpw)
2020-05-07 23:10:36 asr2/rx550 Expected maximum carry32: 25260000
2020-05-07 23:10:37 asr2/rx550 OpenCL args "-DEXP=107000389u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=12u -DWEIGHT_STEP=0xf.eb7509fc7be48p-3 -DIWEIGHT_STEP=0x8.0a52bc152d0dp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DPM1=0 -DAMDGPU=1 -DNO_ASM=1  -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-05-07 23:10:39 asr2/rx550 OpenCL compilation in 2.29 s
2020-05-07 23:10:48 asr2/rx550 107000389 OK        0 loaded: blockSize 400, 0000000000000003
2020-05-07 23:11:09 asr2/rx550 107000389 OK      800   0.00%; 17645 us/it; ETA 21d 20:27; 4f39fc137c27de54 (check 7.30s)
2020-05-08 00:13:46 asr2/rx550 107000389 EE   200000   0.19%; 18837 us/it; ETA 23d 06:49; 65fe4f6dd6c92d4e (check 8.97s)
2020-05-08 00:13:55 asr2/rx550 107000389 EE      800 loaded: blockSize 400, 79b18fd6bfda22f9 (expected 4f39fc137c27de54)
2020-05-08 00:13:55 asr2/rx550 Exiting because "error on load"
2020-05-08 00:13:55 asr2/rx550 Bye

C:\Users\ken\Documents\gpuowl-v6.11-272>gpuowl-win
2020-05-08 01:03:03 gpuowl v6.11-272-g07718b9
2020-05-08 01:03:03 config: -user kriesel -cpu asr2/rx550 -d 1 -use NO_ASM
2020-05-08 01:03:03 device 1, unique id ''
2020-05-08 01:03:03 asr2/rx550 worktodo.txt line ignored: ""
2020-05-08 01:03:03 asr2/rx550 107000389 FFT: 6M 1K:12:256 (17.01 bpw)
2020-05-08 01:03:03 asr2/rx550 Expected maximum carry32: 25260000
2020-05-08 01:03:04 asr2/rx550 OpenCL args "-DEXP=107000389u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=12u -DWEIGHT_STEP=0xf.eb7509fc7be48p-3 -DIWEIGHT_STEP=0x8.0a52bc152d0dp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DPM1=0 -DAMDGPU=1 -DNO_ASM=1  -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-05-08 01:03:11 asr2/rx550 OpenCL compilation in 7.20 s
2020-05-08 01:03:20 asr2/rx550 107000389 OK      800 loaded: blockSize 400, 4f39fc137c27de54
2020-05-08 01:03:41 asr2/rx550 107000389 OK     1600   0.00%; 17649 us/it; ETA 21d 20:34; 00cff77f1e4010a4 (check 7.32s)
2020-05-08 02:02:08 asr2/rx550 107000389 OK   200000   0.19%; 17639 us/it; ETA 21d 19:18; 65fe4f6dd6c92d4e (check 7.32s)
2020-05-08 03:01:04 asr2/rx550 107000389 OK   400000   0.37%; 17645 us/it; ETA 21d 18:29; bbdb6f6d3790a362 (check 7.32s)
2020-05-08 04:00:00 asr2/rx550 107000389 OK   600000   0.56%; 17639 us/it; ETA 21d 17:20; 902382f6237a1979 (check 7.32s)
2020-05-08 04:58:56 asr2/rx550 107000389 OK   800000   0.75%; 17645 us/it; ETA 21d 16:31; 8087f982145cff93 (check 7.32s)
2020-05-08 05:57:51 asr2/rx550 107000389 OK  1000000   0.93%; 17639 us/it; ETA 21d 15:22; 6d75bf2bfb36a594 (check 7.32s)
2020-05-08 06:56:47 asr2/rx550 107000389 OK  1200000   1.12%; 17645 us/it; ETA 21d 14:34; 48c73046fff69459 (check 7.32s)
2020-05-08 07:55:43 asr2/rx550 107000389 OK  1400000   1.31%; 17639 us/it; ETA 21d 13:25; e84d6918ae180382 (check 7.32s)
2020-05-08 08:54:39 asr2/rx550 107000389 OK  1600000   1.50%; 17645 us/it; ETA 21d 12:37; 0e04b83a1aa1b2f2 (check 7.32s)
2020-05-08 09:53:34 asr2/rx550 107000389 OK  1800000   1.68%; 17639 us/it; ETA 21d 11:28; 8d7a21d8a97a586f (check 7.32s)
2020-05-08 10:52:31 asr2/rx550 107000389 OK  2000000   1.87%; 17645 us/it; ETA 21d 10:39; c213cfc1386c1fca (check 7.32s)
-
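
As a reading aid for the log above, here is a rough Python sketch of how the "bpw" and ETA figures relate to the other numbers on those lines. It rests on two assumptions of mine rather than anything taken from the gpuowl source: that "6M" means 6·2^20 FFT words, and that a PRP test runs for about as many iterations as the exponent. The function names are illustrative only.

```python
def bits_per_word(exponent, fft_words):
    """Average number of exponent bits carried by each FFT word."""
    return exponent / fft_words


def eta_days(exponent, iterations_done, us_per_iter):
    """Remaining run time in days, assuming roughly `exponent` iterations in total."""
    remaining = exponent - iterations_done
    return remaining * us_per_iter / 1e6 / 86400


fft_words = 6 * 2**20  # "6M" FFT, assumed to mean 6 * 2^20 words
bpw = bits_per_word(107000389, fft_words)
eta = eta_days(107000389, 800, 17645)

print(f"{bpw:.2f} bpw")
print(f"{eta:.2f} days remaining")
```

Both values land close to what the log reports at iteration 800 (17.01 bpw and an ETA a little under 22 days).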
Old 2020-05-09, 02:50   #2163
ewmayer
2ω=0
 
Sep 2002
República de California

2·3²·647 Posts
Default

Quote:
Originally Posted by ewmayer View Post
@PaulU: Just noticed per-iter timings of the 2 jobs on my R7 also went askew early this a.m. ... as of midnight both were PRPing expos ~103.9M @5.5M FFT, each run ~1470 us/iter. Around 1am one job's PRP finished and that task started a PRP of an expo ~104.9M, still at 5.5M FFT, but the per-iter time of that job dropped to 1265 us/iter right from the beginning, at the same time the per-iter times of the other ongoing job with p ~ 103.9M jumped to 1664 us/iter. I killed and restarted both jobs first thing this morning by way of daily kworker-task CPU-cycle parasitism control, the timing disparity continued after both were restarted.

Looking closely at the two OpenCL args lists for the 2 jobs, FFT params same, main diffs are the expected ones in the various DWT-weights-associated consts of the 2 expos ... the only salient-appearing diff I see is that the p ~ 104.9M job sports an extra "-DMM2_CHAIN=1u" arg which the other one lacks. Whatever that means code-branch and memory-map-wise, it caused the ROCm priority management engine to apparently give a higher priority to that job. Total throughput for 2 jobs running ~1470 us/iter each was ~1360 iter/sec, with the timing disparity it is ~1390 iter/sec, so I've actually gained a few % total throughput.
Let's call the aforementioned 2 instances run0 and run1, using the subdir-names in which I run them. Earlier today run0 finished its p ~104.9M job and started one with p ~103.9M, at which point the 2 run timings again equalized at 1470 us/iter. Just now run1 finished a p ~103.9M job and started one with p ~104.9M, at which point I expected the timing-skew to resume, this time in favor of run1 ... but no, timings remain unchanged, identical. But I see this latest p ~ 104.9M job lacks the extra "-DMM2_CHAIN=1u" OpenCL arg of the earlier one ... likely because it has p just below 104.9M, while the earlier job had p slightly above 104.9M.

Preda, I'm guessing -DMM2_CHAIN is an accuracy-related flag, which kicks in at the higher p-ranges of each FFT length? If so, what is the precise breakover point at 5.5M FFT?
Attached Files
File Type: zip E7A59v1.1.zip (2.88 MB, 76 views)

Last fiddled with by ewmayer on 2020-05-09 at 20:31
Old 2020-05-09, 03:48   #2164
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

5×11×137 Posts
Default

Quote:
Originally Posted by ewmayer View Post
Preda, I'm guessing -DMM2_CHAIN is an accuracy-related flag, which kicks in at the higher p-ranges of each FFT length? If so, what is the precise breakover point at 5.5M FFT?
As you get closer and closer to the FFT limit, there are several improved-accuracy-but-slower versions. The flags are MM2_CHAIN=1,2,3 and MM_CHAIN=1,2,3. At a later date (I hope) there will also be an ULTRA_TRIG=1.

From FFTConfig.h: 5.5M FFT supports 18.489 bits-per-FFT-word which gets the slowest code.

From FFTConfig.cpp: {0.06964, 0.14050, 0.03840, 0.02710, 0.01719, 0.00497},
which says 0.00497 bpw from the max we ease up a little bit, at 0.00497+0.01719 bpw from the max we ease up a little more, and so forth.

Last fiddled with by Prime95 on 2020-05-09 at 03:49
Old 2020-05-09, 19:57   #2165
ewmayer
2ω=0
 
Sep 2002
República de California

11646₁₀ Posts
Default

Quote:
Originally Posted by Prime95 View Post
As you get closer and closer to the FFT limit, there are several improved-accuracy-but-slower versions. The flags are MM2_CHAIN=1,2,3 and MM_CHAIN=1,2,3. At a later date (I hope) there will also be an ULTRA_TRIG=1.

From FFTConfig.h: 5.5M FFT supports 18.489 bits-per-FFT-word which gets the slowest code.

From FFTConfig.cpp: {0.06964, 0.14050, 0.03840, 0.02710, 0.01719, 0.00497},
which says 0.00497 bpw from the max we ease up a little bit, at 0.00497+0.01719 bpw from the max we ease up a little more, and so forth.
Thanks, but ITYM e.g. "within 0.00497 bpw of max we ease up a lot, within (0.00497+0.01719) we ease up a little less", etc. Because the math only works for me when I add all 6 ease-up fractions to get 0.29780, and (letting n = 5.5*2^20) observe that (18.489 - 0.29780)*n = 104911706.5..., which lies between the 2 exponents (104892731 and 104972429) just-on-either-side of the first (MM2_CHAIN=1) ease-up threshold.
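
The arithmetic in the paragraph above can be reproduced with a few lines of Python. This is just a check of the numbers quoted in this thread (the six ease-up fractions, the 18.489 bpw maximum, and the two exponents), not a reimplementation of gpuowl's selection logic; the variable names are mine.

```python
# Ease-up fractions and max bits-per-word for the 5.5M FFT, as quoted in the thread.
steps = [0.06964, 0.14050, 0.03840, 0.02710, 0.01719, 0.00497]
max_bpw = 18.489
fft_words = int(5.5 * 2**20)  # n = 5.5 * 2^20 = 5767168

total_ease = sum(steps)  # distance below max bpw at which the first ease-up begins
threshold = (max_bpw - total_ease) * fft_words  # exponent at that boundary

print(f"sum of fractions:   {total_ease:.5f}")
print(f"threshold exponent: {threshold:.1f}")
for p in (104892731, 104972429):
    side = "above" if p > threshold else "below"
    print(f"{p} is {side} the threshold")
```

The smaller exponent falls below the computed boundary and the larger one above it, which matches which of the two jobs picked up the extra MM2_CHAIN=1 argument.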

And as I noted, on my system having one run at MM2_CHAIN=1 and the other with no ease-up counterintuitively gave me 2% more total throughput than with both runs using expos below the threshold, so I'd like to try forcing both of my current runs (which are below-threshold) to use MM2_CHAIN=1 to see what the resulting total throughput is. May I presume that forcing MM2_CHAIN=1 for an expo that does not need it is safe to do?

Last fiddled with by ewmayer on 2020-05-09 at 19:57
Old 2020-05-09, 21:12   #2166
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

1110101101111₂ Posts
Default

Quote:
Originally Posted by ewmayer View Post
And as I noted, on my system having one run at MM2_CHAIN=1 and the other with no ease-up counterintuitively gave me 2% more total throughput than with both runs using expos below the threshold, so I'd like to try forcing both of my current runs (which are below-threshold) to use MM2_CHAIN=1 to see what the resulting total throughput is. May I presume that forcing MM2_CHAIN=1 for an expo that does not need it is safe to do?
Yes, adding -use MM2_CHAIN=1 is perfectly safe
Old 2020-05-09, 21:55   #2167
ewmayer
2ω=0
 
Sep 2002
República de California

10110101111110₂ Posts
Default

Quote:
Originally Posted by Prime95 View Post
Yes, adding -use MM2_CHAIN=1 is perfectly safe
Cool - did this for my run of 104892731, expected timing-skew between the 2 runs resumed, total throughput again went from ~1360 iter/sec to ~1390 iter/sec.

Then also switched the other run (p = 103923257) to using the flag, timings again equalize, but at 1410 us/iter, meaning total throughput ~1420 iter/sec, a gain of 4.5% [!] over both runs using default settings. That's nearly as much gain as I get from upping my sclk setting from 4 to 5, but the latter ups the wattage by a massive 60W, temps increase proportionally. Wattage currently is a mere 5-10W higher than before the switch to both runs using MM2_CHAIN=1.

Say I start making it the default ... if a run hits an expo which needs an even-higher extra-accuracy setting, will that automatically kick in, thus overriding the user's setting of the flag?

Last fiddled with by ewmayer on 2020-05-09 at 21:56

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.