mersenneforum.org  

Great Internet Mersenne Prime Search > Hardware > GPU Computing > GpuOwl
Old 2020-04-20, 21:20   #2113
ewmayer

Quote:
Originally Posted by Prime95 View Post
Over the last few weeks we've managed to increase the maximum exponent that can be tested with a 5M FFT by over a million.

I had to do this because I'm oh so close to being assigned exponents that would have pushed me into the 5.5M FFT. I know, very selfish :)
Nice! So what are the default maxp limits for 5 and 5.5M in the latest commit? And how conservative are those, in your estimation?

[p.s.: It's only selfish if you have said improvements in place in your local dev-branch, and refuse to share. :]

Old 2020-04-20, 22:04   #2114
kriesel
 

LaurV:
You have choices. We all do. Three workarounds, shown before:
a) Backups. The user picks how often. Restart runs from the point in time of the last backup with matched res64s.
b) A tie-breaker third run. If two or all three match, great; if none match, more than one run erred.
c) CUDALucas as one of the runs. It has save files every n steps, but requires NVIDIA; it has long been the standard for LL on gpu. It can be rerun from the last save file before the res64 mismatch.
The block of text on how CUDALucas does it was mostly meant for Preda, whose time as a great coder is precious. (That fits a few more people in GIMPS too.) I don't know whether Preda has run CUDALucas. I know you have. Others who read this thread may not have. Letting the gpuowl user set the save interval would be good.

And:
d) code the change you want and give it to Preda, as George, SELROC, chengsun, kracker etc. have done for gpuowl, and others have done for other GIMPS software.
e) do single tests and wait for others to double-check them with other software and a different shift, as most users do.
f) wait until the feature set you want appears.

ALL on topic, as was https://www.mersenneforum.org/showpo...postcount=2111
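The tie-breaker in (b) is just a majority vote over the final res64 values from independent runs; a minimal sketch (hypothetical helper, not code from any GIMPS program):

```python
from collections import Counter

def consensus_res64(res64s):
    """Majority-vote over final res64 residues from independent runs.

    Returns the agreed res64, or None if no two runs match
    (meaning more than one run erred, or another run is needed).
    """
    value, n = Counter(res64s).most_common(1)[0]
    return value if n >= 2 else None

# Two of three runs agree, so their residue wins the tie-break:
print(consensus_res64(["24439ce356cbcd12", "6dead1fc3993bd7b",
                       "24439ce356cbcd12"]))   # -> 24439ce356cbcd12
```

With three runs this identifies the single erring run; if all three differ, nothing can be salvaged and a fourth run is needed.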

Old 2020-04-20, 22:28   #2115
Prime95

Quote:
Originally Posted by ewmayer View Post
Nice! So what are the default maxp limits for 5 and 5.5M in the latest commit? And how conservative are those, in your estimation?
From gpuowl -h

Code:
FFT    5M [  7.86M -   97.42M]  1K:10:256 1K:5:512 256:10:1K 512:10:512 512:5:1K
FFT 5.50M [  8.65M -  106.63M]  1K:11:256 256:11:1K 512:11:512
FFT    6M [  9.44M -  115.86M]  1K:12:256 1K:6:512 1K:3:1K 256:12:1K 512:12:512 512:6:1K 4K:3:256
I'd say the limits are aggressive.
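Those maxp limits track the average bits-per-word the FFT must carry: the "(18.07 bpw)" figure gpuowl logs at startup is just exponent / FFT-words. A quick sanity check (the ~18.58 bpw implied ceiling is my reading of the table above, not a gpuowl constant):

```python
def bits_per_word(exponent, fft_m):
    """Average bits each FFT word carries for mod 2^p - 1 arithmetic,
    given the exponent and the FFT size in M (units of 2^20 words)."""
    return exponent / (fft_m * 2**20)

# kriesel's exponent on a 5M FFT, matching gpuowl's logged "18.07 bpw":
print(round(bits_per_word(94_741_139, 5), 2))   # 18.07

# The table's 5M upper limit of 97.42M corresponds to roughly:
print(round(bits_per_word(97_420_000, 5), 2))   # 18.58
```

Raising the limit by a million exponents on the same FFT length means squeezing in about 0.19 more bits per word, which is why roundoff behavior near the limit matters.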
Old 2020-04-20, 22:54   #2116
ewmayer

Quote:
Originally Posted by Prime95 View Post
From gpuowl -h

Code:
FFT    5M [  7.86M -   97.42M]  1K:10:256 1K:5:512 256:10:1K 512:10:512 512:5:1K
FFT 5.50M [  8.65M -  106.63M]  1K:11:256 256:11:1K 512:11:512
FFT    6M [  9.44M -  115.86M]  1K:12:256 1K:6:512 1K:3:1K 256:12:1K 512:12:512 512:6:1K 4K:3:256
I'd say the limits are aggressive.
Indeed - from the version I'm currently on, v6.11-238-g62a3025-dirty:
Code:
FFT    5M [  7.86M -   95.71M]  1K-256-10 256-1K-10 512-512-10
FFT 5632K [  8.65M -  105.06M]  1K-256-11 256-1K-11 512-512-11
FFT    6M [  9.44M -  114.40M]  1K-256-12 1K-512-6 256-1K-12 256-2K-6 512-512-12 512-1K-6 2K-256-6
Just updated to current... wait, there's an issue related to a small change I made in my local primenet.py: I re-added a couple of lines (to match the way the Mlucas primenet.py does things) so that '-t 0' means 'run the py-script just once and quit'. Renamed my custom version; now we're good.
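For context, the '-t 0' convention amounts to something like this (a sketch of the loop shape only, not the actual primenet.py source):

```python
import argparse
import time

def main_loop(do_work, timeout):
    """Run the fetch/report cycle forever, or exactly once when timeout == 0."""
    while True:
        do_work()
        if timeout <= 0:       # '-t 0': one pass, then quit
            break
        time.sleep(timeout)

parser = argparse.ArgumentParser()
parser.add_argument("-t", "--timeout", type=int, default=3600,
                    help="seconds between server contacts; 0 = run once and exit")
args = parser.parse_args(["-t", "0"])
main_loop(lambda: print("contacted server"), args.timeout)   # prints once, then exits
```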
Old 2020-04-20, 23:42   #2117
kriesel
 

Quote:
Originally Posted by kriesel View Post
I will swap out the RX550 for a different unit after a trial of v6.11-268 if it also produces such EE occurrences.
Time to swap it out. V6.11-268 had EE #16.
Code:
2020-04-20 13:25:12 condorella/rx550 94741139 OK 41850000  44.17%; 14679 us/it; ETA 8d 23:40; 24439ce356cbcd12 (check 6.02s) 15 errors
2020-04-20 13:37:26 condorella/rx550 Roundoff: N=50525, mean 0.202943, SD 0.012035, CV 0.059305, max 0.507728, pErr 0.000001
2020-04-20 13:37:26 condorella/rx550 Carry: N=50524, max 3ba0c0a4, avg 2b56dd02; CarryM: N=1, max 7ac075bf, avg 7ac075bf
2020-04-20 13:37:32 condorella/rx550 94741139 EE 41900000  44.23%; 14680 us/it; ETA 8d 23:29; 6dead1fc3993bd7b (check 6.01s) 15 errors
2020-04-20 13:37:39 condorella/rx550 94741139 OK 41850000 loaded: blockSize 400, 24439ce356cbcd12
2020-04-20 13:49:52 condorella/rx550 Roundoff: N=50953, mean 0.202905, SD 0.012028, CV 0.059281, max 0.299187, pErr 0.000001
2020-04-20 13:49:52 condorella/rx550 Carry: N=50951, max 3ba0c0a4, avg 2b54ff6f; CarryM: N=2, max 825305ba, avg 6b44b674
2020-04-20 13:49:58 condorella/rx550 94741139 OK 41900000  44.23%; 14670 us/it; ETA 8d 23:20; 6dead1fc3993bd7b (check 6.03s) 16 errors
2020-04-20 14:02:12 condorella/rx550 Roundoff: N=50525, mean 0.203002, SD 0.012050, CV 0.059358, max 0.305012, pErr 0.000001
2020-04-20 14:02:12 condorella/rx550 Carry: N=50524, max 3b45831d, avg 2b4ff588; CarryM: N=1, max 814cae6d, avg 814cae6d
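For readers new to the log format: each OK/EE line is a Gerbicz-check result. OK validates the last block of iterations and saves a checkpoint; EE discards the block and reloads the last validated state, as the "loaded: blockSize 400, ..." line shows. The control flow, roughly (hypothetical names, not gpuowl's actual code):

```python
def run_with_rollback(step_block, check, state, blocks):
    """Advance blockwise; on a failed Gerbicz check, retry the block from
    the last good checkpoint instead of aborting the multi-day run."""
    checkpoint = state
    errors = 0
    done = 0
    while done < blocks:
        candidate = step_block(checkpoint)
        if check(candidate):
            checkpoint = candidate   # the "OK ..." case: save and continue
            done += 1
        else:
            errors += 1              # the "EE ... loaded:" case: re-do the block
    return checkpoint, errors
```

So a transient hardware glitch costs one redone block (and an incremented error count), not the whole test, which is exactly the pattern in the log above.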
Old 2020-04-21, 00:26   #2118
preda
 
"Mihai Preda"

Quote:
Originally Posted by ewmayer View Post
Indeed - from the version I'm currently on, v6.11-238-g62a3025-dirty:
Code:
FFT    5M [  7.86M -   95.71M]  1K-256-10 256-1K-10 512-512-10
FFT 5632K [  8.65M -  105.06M]  1K-256-11 256-1K-11 512-512-11
FFT    6M [  9.44M -  114.40M]  1K-256-12 1K-512-6 256-1K-12 256-2K-6 512-512-12 512-1K-6 2K-256-6
Just updated to current... wait, there's an issue related to a small change I made in my local primenet.py: I re-added a couple of lines (to match the way the Mlucas primenet.py does things) so that '-t 0' means 'run the py-script just once and quit'. Renamed my custom version; now we're good.
Feel free to submit a pull request with the "-t 0" change.

The current upper bound for 5M (97.4M) looks fine to me.
Old 2020-04-21, 01:28   #2119
kriesel
 

Quote:
Originally Posted by kriesel View Post
Time to swap it out. V6.11-268 had EE #16.
The issue appears at the moment to be a bad memory fan in this HP Z600, which leaves half the system RAM running hotter than operating spec. If that fan died or is spinning too slowly, the air in the memory fan duct would be left pretty stagnant and warm. I don't know why that would create issues in one gpu's gpuowl run but not in the prime95 runs saturating the cpus. There were no GEC errors on that system's prime95 GUI display, or in its log files, going back months. Nor has it affected that system's RX480 gpuowl runs.

Symptoms:
- "514 Memory fan not detected" message from the BIOS on startup, which re-seating did not cure.
- HWMonitor showed the bank of 3 system RAM modules nearer the closer Xeon at 90C+, the other bank in the 70s.
- Other Z600s in the same large room, running similar workloads on cpus and NVIDIA gpus, had memory temps in the 50s.
- Experimenting with the prime95 instance, turning off half the workers to reduce power at the nearer Xeon lowered the hotter RAM into the 70s.

Replacement fan on the way.

Old 2020-04-21, 07:20   #2120
kriesel
 

Quote:
Originally Posted by kriesel View Post
The issue appears at the moment to be a bad memory fan ...
Experimenting with the prime95 instance, turning off half the workers to reduce power at the nearer Xeon lowered the hotter RAM into the 70s.

Replacement fan on the way.
Even after dropping cpu heat, and swapping the gpu for another, it's still getting EEs.
Code:
2020-04-20 19:22:40 gpuowl v6.11-268-g0d07d21
2020-04-20 19:22:40 config: -device 1 -user kriesel -cpu condorella/rx550 -yield -maxAlloc 3600 -use NO_ASM
2020-04-20 19:22:40 device 1, unique id ''
2020-04-20 19:22:40 condorella/rx550 94741139 FFT: 5M 1K:10:256 (18.07 bpw)
2020-04-20 19:22:40 condorella/rx550 Expected maximum carry32: 461E0000
2020-04-20 19:22:41 condorella/rx550 OpenCL args "-DEXP=94741139u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xf.3cd1fc0411148p-3 -DIWEIGHT_STEP=0x8.66790bf53aca8p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DPM1=0 -DAMDGPU=1 -DNO_ASM=1  -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-04-20 19:22:47 condorella/rx550 OpenCL compilation in 5.58 s
2020-04-20 19:22:53 condorella/rx550 94741139 OK 43154000 loaded: blockSize 400, 850d5d673cf6ad49
2020-04-20 19:23:10 condorella/rx550 94741139 OK 43154800  45.55%; 13701 us/it; ETA 8d 04:20; e0021e93eddece6a (check 5.65s) 16 errors
2020-04-20 19:33:38 condorella/rx550 94741139 OK 43200000  45.60%; 13772 us/it; ETA 8d 05:10; 18847855ef4addd5 (check 5.70s) 16 errors
2020-04-20 19:45:13 condorella/rx550 94741139 OK 43250000  45.65%; 13775 us/it; ETA 8d 05:01; c8b93071fb167821 (check 5.64s) 16 errors
2020-04-20 19:56:47 condorella/rx550 94741139 OK 43300000  45.70%; 13770 us/it; ETA 8d 04:45; e36f93de9f65e252 (check 5.64s) 16 errors
2020-04-20 20:08:21 condorella/rx550 94741139 OK 43350000  45.76%; 13771 us/it; ETA 8d 04:35; db7548eeff7fd82d (check 5.64s) 16 errors
2020-04-20 20:19:55 condorella/rx550 94741139 OK 43400000  45.81%; 13766 us/it; ETA 8d 04:19; d5890f6f7bc3bb62 (check 5.64s) 16 errors
2020-04-20 20:31:29 condorella/rx550 94741139 OK 43450000  45.86%; 13763 us/it; ETA 8d 04:05; a47eafb785a71fa4 (check 5.64s) 16 errors
2020-04-20 20:43:02 condorella/rx550 94741139 EE 43500000  45.91%; 13759 us/it; ETA 8d 03:50; 9c0cad0c6879242b (check 5.73s) 16 errors
2020-04-20 20:43:09 condorella/rx550 94741139 OK 43450000 loaded: blockSize 400, a47eafb785a71fa4
2020-04-20 20:54:42 condorella/rx550 94741139 OK 43500000  45.91%; 13759 us/it; ETA 8d 03:50; 9c0cad0c6879242b (check 5.64s) 17 errors
2020-04-20 21:06:16 condorella/rx550 94741139 OK 43550000  45.97%; 13765 us/it; ETA 8d 03:44; 80f24faafac9b03a (check 5.64s) 17 errors
2020-04-20 21:17:50 condorella/rx550 94741139 OK 43600000  46.02%; 13762 us/it; ETA 8d 03:30; 45d1a03b9cb91819 (check 5.64s) 17 errors
2020-04-20 21:29:24 condorella/rx550 94741139 OK 43650000  46.07%; 13766 us/it; ETA 8d 03:22; fac79b7ec0105d01 (check 5.64s) 17 errors
2020-04-20 21:40:57 condorella/rx550 94741139 OK 43700000  46.13%; 13759 us/it; ETA 8d 03:04; a66ca92be5e6dbb6 (check 5.64s) 17 errors
2020-04-20 21:52:31 condorella/rx550 94741139 OK 43750000  46.18%; 13764 us/it; ETA 8d 02:58; 3740bb97fee487d0 (check 5.64s) 17 errors
2020-04-20 22:04:05 condorella/rx550 94741139 OK 43800000  46.23%; 13764 us/it; ETA 8d 02:46; db25fa854c5db484 (check 5.65s) 17 errors
2020-04-20 22:15:39 condorella/rx550 94741139 OK 43850000  46.28%; 13764 us/it; ETA 8d 02:35; e69e2dbf65d78b2a (check 5.64s) 17 errors
2020-04-20 22:27:13 condorella/rx550 94741139 EE 43900000  46.34%; 13762 us/it; ETA 8d 02:21; 1f68378b7c6fc404 (check 5.63s) 17 errors
2020-04-20 22:27:19 condorella/rx550 94741139 OK 43850000 loaded: blockSize 400, e69e2dbf65d78b2a
2020-04-20 22:38:53 condorella/rx550 94741139 OK 43900000  46.34%; 13761 us/it; ETA 8d 02:20; 1f68378b7c6fc404 (check 5.68s) 18 errors
2020-04-20 22:50:26 condorella/rx550 94741139 OK 43950000  46.39%; 13759 us/it; ETA 8d 02:08; 31bdbf61721379f5 (check 5.68s) 18 errors
2020-04-20 23:02:00 condorella/rx550 94741139 OK 44000000  46.44%; 13762 us/it; ETA 8d 01:58; ab5f29aa5e0616d4 (check 5.64s) 18 errors
2020-04-20 23:13:34 condorella/rx550 94741139 OK 44050000  46.50%; 13764 us/it; ETA 8d 01:49; d15a6b5993812fc4 (check 5.64s) 18 errors
2020-04-20 23:25:08 condorella/rx550 94741139 OK 44100000  46.55%; 13761 us/it; ETA 8d 01:35; 72acbd04b3d43f04 (check 5.64s) 18 errors
2020-04-20 23:36:41 condorella/rx550 94741139 OK 44150000  46.60%; 13761 us/it; ETA 8d 01:23; 2894cbff475de263 (check 5.64s) 18 errors
2020-04-20 23:48:15 condorella/rx550 94741139 OK 44200000  46.65%; 13764 us/it; ETA 8d 01:15; d3091a2a24f15d8b (check 5.64s) 18 errors
2020-04-20 23:59:49 condorella/rx550 94741139 OK 44250000  46.71%; 13761 us/it; ETA 8d 01:00; d35597a77e451f9b (check 5.64s) 18 errors
2020-04-21 00:11:23 condorella/rx550 94741139 OK 44300000  46.76%; 13762 us/it; ETA 8d 00:50; 092708b97dc11cf0 (check 5.64s) 18 errors
2020-04-21 00:22:56 condorella/rx550 94741139 OK 44350000  46.81%; 13757 us/it; ETA 8d 00:34; a55be7644c8914ff (check 5.64s) 18 errors
2020-04-21 00:34:30 condorella/rx550 94741139 OK 44400000  46.86%; 13761 us/it; ETA 8d 00:26; 6c9cb184d9ae9fb9 (check 5.67s) 18 errors
2020-04-21 00:46:03 condorella/rx550 94741139 OK 44450000  46.92%; 13757 us/it; ETA 8d 00:11; 440bf81e51efd1b8 (check 5.64s) 18 errors
2020-04-21 00:57:37 condorella/rx550 94741139 OK 44500000  46.97%; 13760 us/it; ETA 8d 00:02; 4e2721d94c80f9a9 (check 5.67s) 18 errors
2020-04-21 01:09:11 condorella/rx550 94741139 OK 44550000  47.02%; 13758 us/it; ETA 7d 23:49; acc59d938a878840 (check 5.67s) 18 errors
2020-04-21 01:20:44 condorella/rx550 94741139 OK 44600000  47.08%; 13760 us/it; ETA 7d 23:39; e8ae6b2e1342173a (check 5.64s) 18 errors
2020-04-21 01:32:18 condorella/rx550 94741139 OK 44650000  47.13%; 13758 us/it; ETA 7d 23:26; 7738e5de79a41988 (check 5.64s) 18 errors
2020-04-21 01:43:51 condorella/rx550 94741139 OK 44700000  47.18%; 13754 us/it; ETA 7d 23:11; 0325e62041e2ef93 (check 5.66s) 18 errors
2020-04-21 01:55:25 condorella/rx550 94741139 OK 44750000  47.23%; 13757 us/it; ETA 7d 23:03; ac90cc4d821b536d (check 5.67s) 18 errors
2020-04-21 02:06:58 condorella/rx550 94741139 OK 44800000  47.29%; 13758 us/it; ETA 7d 22:52; 96fdda068a85c0ec (check 5.64s) 18 errors
The next level of intervention is in effect now: stop and close prime95.
Old 2020-04-21, 20:53   #2121
ewmayer

Quote:
Originally Posted by preda View Post
Feel free to submit a pull request with the "-t 0" change.
That's what I did, but without intending to commit my local change - got this error:
Code:
git pull https://github.com/preda/gpuowl && make
remote: Enumerating objects: 119, done.
remote: Counting objects: 100% (119/119), done.
remote: Compressing objects: 100% (46/46), done.
remote: Total 136 (delta 96), reused 89 (delta 73), pack-reused 17
Receiving objects: 100% (136/136), 83.73 KiB | 2.20 MiB/s, done.
Resolving deltas: 100% (96/96), completed with 22 local objects.
From https://github.com/preda/gpuowl
 * branch            HEAD       -> FETCH_HEAD
Updating 62a3025..f1fd1f7
error: Your local changes to the following files would be overwritten by merge:
	tools/primenet.py
Please commit your changes or stash them before you merge.
Aborting
So this seems a good baby-step introduction to the rev-control setup ... what is the procedure for checking out a file, then testing and submitting a modified version? And what is the code review process you and George have in place?

Oh, another Q re. the latest primenet.py - just tried to use it with same flags I'd always used, -w 150 --tasks 10, to queue up new PRPs, but with the latest got

primenet.py: error: argument -w: invalid choice: '150' (choose from 'PRP', 'PM1', 'LL_DC', 'PRP_DC', 'PRP_WORLD_RECORD', 'PRP_100M')

That "numeric value no longer works" behavior appears to be due to a change in the choices=list(...) argument - did you deliberately mean to disable support for the numeric server worktype codes?

Old 2020-04-22, 09:32   #2122
preda
 

Quote:
Originally Posted by ewmayer View Post
That's what I did, but without intending to commit my local change - got this error:
Code:
git pull https://github.com/preda/gpuowl && make
remote: Enumerating objects: 119, done.
remote: Counting objects: 100% (119/119), done.
remote: Compressing objects: 100% (46/46), done.
remote: Total 136 (delta 96), reused 89 (delta 73), pack-reused 17
Receiving objects: 100% (136/136), 83.73 KiB | 2.20 MiB/s, done.
Resolving deltas: 100% (96/96), completed with 22 local objects.
From https://github.com/preda/gpuowl
 * branch            HEAD       -> FETCH_HEAD
Updating 62a3025..f1fd1f7
error: Your local changes to the following files would be overwritten by merge:
	tools/primenet.py
Please commit your changes or stash them before you merge.
Aborting
So this seems a good baby-step introduction to the rev-control setup ... what is the procedure for checking out a file, then testing and submitting a modified version?
I wouldn't dare to write a git/github how-to here -- it's too large a subject, and there already are good tutorials out there. But the basic step sequence is:

1. create a github account
2. fork the project to your account (using github interface)
3. "git clone": check out locally *your* clone of the project (because you have write rights on your clone)
4. make local changes
5. "git commit": commit local changes
6. "git push": publish your local commits to your fork
7. using the github interface, create a pull request from your fork to the main project
8. I see the pull request, and I can merge it

Quote:
And what is the code review process you and George have in place?
It's extremely light right now:
- I commit without any reviews. Sometimes George detects errors I make, and notifies me (so, that's a form of post-commit review :).
- George sends me pull requests. I usually verify them before merging (by compiling and running an exponent for a bit). (the goal of my testing is mainly to detect performance differences between our respective setups)

Quote:
Oh, another Q re. the latest primenet.py - just tried to use it with same flags I'd always used, -w 150 --tasks 10, to queue up new PRPs, but with the latest got

primenet.py: error: argument -w: invalid choice: '150' (choose from 'PRP', 'PM1', 'LL_DC', 'PRP_DC', 'PRP_WORLD_RECORD', 'PRP_100M')

That "numeric value no longer works" behavior appears to be due to a change in the choices=list(...) argument - did you deliberately mean to disable support for the numeric server worktype codes?
No, disabling the numeric values was unintentional (the goal of the change was to make the help less confusing by not displaying the numeric values there). But, why do you prefer using the numeric value (150) vs. the symbolic name "PRP"? Anyway, I'm fine with adding the numeric ids back if they're useful.
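Accepting both forms while keeping the help text symbolic could be done with a small lookup run as the argument's type, before argparse validates it. A sketch (150 → PRP is confirmed by this thread; the other numeric codes shown are my assumption about the old mapping, not taken from primenet.py):

```python
import argparse

NAMES = ['PRP', 'PM1', 'LL_DC', 'PRP_DC', 'PRP_WORLD_RECORD', 'PRP_100M']

# Numeric PrimeNet worktype codes -> symbolic names (assumed mapping).
NUMERIC = {'150': 'PRP', '151': 'PRP_DC', '152': 'PRP_WORLD_RECORD',
           '153': 'PRP_100M', '101': 'LL_DC', '4': 'PM1'}

def worktype(value):
    """Map a numeric code to its name, then validate; either form is accepted."""
    value = NUMERIC.get(value, value)
    if value not in NAMES:
        raise argparse.ArgumentTypeError(f"invalid worktype {value!r}")
    return value

parser = argparse.ArgumentParser()
# Using type= instead of choices= keeps the numeric ids out of the help
# text while still accepting them on the command line.
parser.add_argument('-w', type=worktype, default='PRP',
                    help="work type: " + ", ".join(NAMES))

print(parser.parse_args(['-w', '150']).w)   # PRP
```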
Old 2020-04-22, 13:13   #2123
garo
 

Ernst, do "git stash" before doing the pull, and then "git stash pop". If you haven't pulled any conflicting changes, the pop will replay your own changes on top of the latest from GitHub.