[QUOTE=Prime95;543226]Over the last few weeks we've managed to increase the maximum exponent that can be tested with a 5M FFT by over a million.
I had to do this because I'm oh so close to being assigned exponents that would have pushed me into the 5.5M FFT. I know, very selfish :)[/QUOTE] Nice! So what are the default maxp limits for 5 and 5.5M in the latest commit? And how conservative are those, in your estimation? [p.s.: It's only selfish if you have said improvements in place in your local dev-branch, and refuse to share. :] |
LaurV:
You have choices. We all do. Shown before, three workarounds:
a) Backups. The user picks how often. A restart runs from the point in time of the last backup with matched res64s.
b) A tie-breaker third run. If two or three match, great; if none match, some erred.
c) CUDALucas as a run. It writes save files every n steps, but requires NVIDIA. It has long been the standard for LL on gpu. It can be rerun from the last save file before the res64 mismatch.
The block of text on how CUDALucas does it was mostly meant for Preda, whose time as a great coder is precious. (That fits a few more people in GIMPS too.) I don't know if Preda has run CUDALucas. I know you have. Others who read this thread may not have. Letting the gpuowl user set the save step would be good.
And:
d) Code the change you want, and give it to Preda, as George, SELROC, chengsun, kracker etc. have done for gpuowl, and others have done for other GIMPS software.
e) Do single tests and wait for others to double check them, like most users do, with other software and shift.
f) Wait until the feature set you want appears.
ALL on topic, as was [URL]https://www.mersenneforum.org/showpost.php?p=543260&postcount=2111[/URL] |
[QUOTE=ewmayer;543305]Nice! So what are the default maxp limits for 5 and 5.5M in the latest commit? And how conservative are those, in your estimation?[/QUOTE]
From gpuowl -h
[CODE]FFT 5M [ 7.86M - 97.42M] 1K:10:256 1K:5:512 256:10:1K 512:10:512 512:5:1K
FFT 5.50M [ 8.65M - 106.63M] 1K:11:256 256:11:1K 512:11:512
FFT 6M [ 9.44M - 115.86M] 1K:12:256 1K:6:512 1K:3:1K 256:12:1K 512:12:512 512:6:1K 4K:3:256
[/CODE]
I'd say the limits are aggressive. |
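One way to read "aggressive" is in bits per word (bpw): the exponent divided by the number of FFT words. The sketch below is only illustrative arithmetic, not gpuowl code. It assumes that "5M" means 5 * 2**20 words, which reproduces the "18.07 bpw" that gpuowl logs elsewhere in this thread for exponent 94741139 at FFT 5M. The helper name `bits_per_word` is made up for this example.

```python
# Hedged sketch: bits per word (bpw) at the upper exponent limit of each FFT size.
# Assumption: the "M" in an FFT size is 2**20 words (not confirmed by the posts,
# but consistent with the 18.07 bpw gpuowl logs for 94741139 at FFT 5M).

M = 2**20


def bits_per_word(exponent: float, fft_words: int) -> float:
    """Average number of exponent bits carried by each FFT word."""
    return exponent / fft_words


# Upper limits copied from the `gpuowl -h` output above (printed there to 0.01M).
limits = {
    "5M": (5 * M, 97.42e6),
    "5.50M": (int(5.5 * M), 106.63e6),
    "6M": (6 * M, 115.86e6),
}

if __name__ == "__main__":
    for name, (words, max_exp) in limits.items():
        bpw = bits_per_word(max_exp, words)
        print(f"FFT {name}: max exponent {max_exp / 1e6:.2f}M -> {bpw:.2f} bpw")
```

Comparing the bpw at the limit with the bpw of an assigned exponent gives a rough idea of how much headroom a given FFT size has left.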
[QUOTE=Prime95;543316]From gpuowl -h
[CODE]FFT 5M [ 7.86M - 97.42M] 1K:10:256 1K:5:512 256:10:1K 512:10:512 512:5:1K
FFT 5.50M [ 8.65M - 106.63M] 1K:11:256 256:11:1K 512:11:512
FFT 6M [ 9.44M - 115.86M] 1K:12:256 1K:6:512 1K:3:1K 256:12:1K 512:12:512 512:6:1K 4K:3:256
[/CODE]
I'd say the limits are aggressive.[/QUOTE]
Indeed - from the version I'm currently on, v6.11-238-g62a3025-dirty:
[code]FFT 5M [ 7.86M - 95.71M] 1K-256-10 256-1K-10 512-512-10
FFT 5632K [ 8.65M - 105.06M] 1K-256-11 256-1K-11 512-512-11
FFT 6M [ 9.44M - 114.40M] 1K-256-12 1K-512-6 256-1K-12 256-2K-6 512-512-12 512-1K-6 2K-256-6[/code]
Just updated to current... wait. there's an issue related to a small change I made in my local primenet.py, which is to re-add a couple lines (i.e. to match the way the Mlucas primenet.py does things) so that '-t 0' means 'run py-script just once and quit'. Renamed my custom version, now we're good. |
[QUOTE=kriesel;543240]I will swap out the RX550 for a different unit after a trial of v6.11-268 if it also produces such EE occurrences.[/QUOTE]Time to swap it out. V6.11-268 had EE #16.
[CODE]2020-04-20 13:25:12 condorella/rx550 94741139 OK 41850000 44.17%; 14679 us/it; ETA 8d 23:40; 24439ce356cbcd12 (check 6.02s) 15 errors
2020-04-20 13:37:26 condorella/rx550 Roundoff: N=50525, mean 0.202943, SD 0.012035, CV 0.059305, [B]max 0.507728[/B], pErr 0.000001
2020-04-20 13:37:26 condorella/rx550 Carry: N=50524, max 3ba0c0a4, avg 2b56dd02; CarryM: N=1, max 7ac075bf, avg 7ac075bf
2020-04-20 13:37:32 condorella/rx550 94741139 [B]EE[/B] 41900000 44.23%; 14680 us/it; ETA 8d 23:29; [B]6dead1fc3993bd7b[/B] (check 6.01s) 15 errors
2020-04-20 13:37:39 condorella/rx550 94741139 OK 41850000 loaded: blockSize 400, 24439ce356cbcd12
2020-04-20 13:49:52 condorella/rx550 Roundoff: N=50953, mean 0.202905, SD 0.012028, CV 0.059281, max 0.299187, pErr 0.000001
2020-04-20 13:49:52 condorella/rx550 Carry: N=50951, max 3ba0c0a4, avg 2b54ff6f; CarryM: N=2, max 825305ba, avg 6b44b674
2020-04-20 13:49:58 condorella/rx550 94741139 [B]OK[/B] 41900000 44.23%; 14670 us/it; ETA 8d 23:20; [B]6dead1fc3993bd7b[/B] (check 6.03s) 16 errors
2020-04-20 14:02:12 condorella/rx550 Roundoff: N=50525, mean 0.203002, SD 0.012050, CV 0.059358, max 0.305012, pErr 0.000001
2020-04-20 14:02:12 condorella/rx550 Carry: N=50524, max 3b45831d, avg 2b4ff588; CarryM: N=1, max 814cae6d, avg 814cae6d[/CODE] |
[QUOTE=ewmayer;543318]Indeed - from the version I'm currently on, v6.11-238-g62a3025-dirty:
[code]FFT 5M [ 7.86M - 95.71M] 1K-256-10 256-1K-10 512-512-10
FFT 5632K [ 8.65M - 105.06M] 1K-256-11 256-1K-11 512-512-11
FFT 6M [ 9.44M - 114.40M] 1K-256-12 1K-512-6 256-1K-12 256-2K-6 512-512-12 512-1K-6 2K-256-6[/code]
Just updated to current... wait. there's an issue related to a small change I made in my local primenet.py, which is to re-add a couple lines (i.e. to match the way the Mlucas primenet.py does things) so that '-t 0' means 'run py-script just once and quit'. Renamed my custom version, now we're good.[/QUOTE]
Feel free to submit a pull request with the "-t 0" change.
The current upper bound for 5M (97.4M) looks fine to me. |
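For readers who have not seen the '-t 0' convention, here is a minimal sketch of the idea: a polling loop in which a timeout of 0 means "do one pass and quit". It is not the code from either primenet.py; `run_loop`, `do_pass` and `max_passes` are hypothetical names, and `max_passes` exists only so the loop can be exercised without running forever.

```python
import time
from typing import Callable, Optional


def run_loop(do_pass: Callable[[], None],
             timeout_s: float,
             sleep: Callable[[float], None] = time.sleep,
             max_passes: Optional[int] = None) -> int:
    """Run do_pass repeatedly, sleeping timeout_s seconds between passes.

    timeout_s == 0 means: run exactly one pass, then return.
    max_passes is a testing aid (hypothetical), not part of any real script.
    Returns the number of passes performed.
    """
    passes = 0
    while True:
        do_pass()
        passes += 1
        if timeout_s == 0:
            return passes  # the '-t 0' case: once and quit
        if max_passes is not None and passes >= max_passes:
            return passes
        sleep(timeout_s)


if __name__ == "__main__":
    # With a timeout of 0 the work function is called a single time.
    run_loop(lambda: print("fetching assignments / reporting results"), 0)
```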
[QUOTE=kriesel;543322]Time to swap it out. V6.11-268 had EE #16.[/QUOTE]The issue appears at the moment to be a bad memory fan in this HP Z600, resulting in hotter than operating spec for half the system ram. That fan if it died or is spinning too slowly would leave the air in the memory fan duct pretty stagnant and warm. I don't know why that would create issues in one gpu's gpuowl run but not the prime95 runs saturating the cpus. There were no GEC errors on that system's prime95's GUI display, or in its log files, going back months. Nor has it affected that system's RX480 gpuowl runs.
Symptoms: a "514 Memory fan not detected" message from the BIOS on startup, which re-seating did not cure. HWMonitor showed the system ram bank of 3 nearer the closer Xeon at 90C+, and the other bank in the 70s. Other Z600s in the same large room, running similar workloads on cpus and NVIDIA gpus, had memory temps in the 50s. Experimenting with the prime95 instance, turning off half the workers to reduce power at the nearer Xeon, lowered the hotter ram into the 70s. Replacement fan on the way. |
[QUOTE=kriesel;543333]The issue appears at the moment to be a bad memory fan ...
Experimenting with the prime95 instance, turning off half the workers to reduce power at the nearer Xeon, lowered the hotter ram into the 70s. Replacement fan on the way.[/QUOTE]Even after dropping cpu heat, and swapping the gpu for another, it's still getting EEs.
[CODE]2020-04-20 19:22:40 gpuowl v6.11-268-g0d07d21
2020-04-20 19:22:40 config: -device 1 -user kriesel -cpu condorella/rx550 -yield -maxAlloc 3600 -use NO_ASM
2020-04-20 19:22:40 device 1, unique id ''
2020-04-20 19:22:40 condorella/rx550 94741139 FFT: 5M 1K:10:256 (18.07 bpw)
2020-04-20 19:22:40 condorella/rx550 Expected maximum carry32: 461E0000
2020-04-20 19:22:41 condorella/rx550 OpenCL args "-DEXP=94741139u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xf.3cd1fc0411148p-3 -DIWEIGHT_STEP=0x8.66790bf53aca8p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DPM1=0 -DAMDGPU=1 -DNO_ASM=1 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-04-20 19:22:47 condorella/rx550 OpenCL compilation in 5.58 s
2020-04-20 19:22:53 condorella/rx550 94741139 OK 43154000 loaded: blockSize 400, 850d5d673cf6ad49
2020-04-20 19:23:10 condorella/rx550 94741139 OK 43154800 45.55%; 13701 us/it; ETA 8d 04:20; e0021e93eddece6a (check 5.65s) 16 errors
2020-04-20 19:33:38 condorella/rx550 94741139 OK 43200000 45.60%; 13772 us/it; ETA 8d 05:10; 18847855ef4addd5 (check 5.70s) 16 errors
2020-04-20 19:45:13 condorella/rx550 94741139 OK 43250000 45.65%; 13775 us/it; ETA 8d 05:01; c8b93071fb167821 (check 5.64s) 16 errors
2020-04-20 19:56:47 condorella/rx550 94741139 OK 43300000 45.70%; 13770 us/it; ETA 8d 04:45; e36f93de9f65e252 (check 5.64s) 16 errors
2020-04-20 20:08:21 condorella/rx550 94741139 OK 43350000 45.76%; 13771 us/it; ETA 8d 04:35; db7548eeff7fd82d (check 5.64s) 16 errors
2020-04-20 20:19:55 condorella/rx550 94741139 OK 43400000 45.81%; 13766 us/it; ETA 8d 04:19; d5890f6f7bc3bb62 (check 5.64s) 16 errors
2020-04-20 20:31:29 condorella/rx550 94741139 OK 43450000 45.86%; 13763 us/it; ETA 8d 04:05; a47eafb785a71fa4 (check 5.64s) 16 errors
2020-04-20 20:43:02 condorella/rx550 94741139 EE 43500000 45.91%; 13759 us/it; ETA 8d 03:50; 9c0cad0c6879242b (check 5.73s) 16 errors
2020-04-20 20:43:09 condorella/rx550 94741139 OK 43450000 loaded: blockSize 400, a47eafb785a71fa4
2020-04-20 20:54:42 condorella/rx550 94741139 OK 43500000 45.91%; 13759 us/it; ETA 8d 03:50; 9c0cad0c6879242b (check 5.64s) 17 errors
2020-04-20 21:06:16 condorella/rx550 94741139 OK 43550000 45.97%; 13765 us/it; ETA 8d 03:44; 80f24faafac9b03a (check 5.64s) 17 errors
2020-04-20 21:17:50 condorella/rx550 94741139 OK 43600000 46.02%; 13762 us/it; ETA 8d 03:30; 45d1a03b9cb91819 (check 5.64s) 17 errors
2020-04-20 21:29:24 condorella/rx550 94741139 OK 43650000 46.07%; 13766 us/it; ETA 8d 03:22; fac79b7ec0105d01 (check 5.64s) 17 errors
2020-04-20 21:40:57 condorella/rx550 94741139 OK 43700000 46.13%; 13759 us/it; ETA 8d 03:04; a66ca92be5e6dbb6 (check 5.64s) 17 errors
2020-04-20 21:52:31 condorella/rx550 94741139 OK 43750000 46.18%; 13764 us/it; ETA 8d 02:58; 3740bb97fee487d0 (check 5.64s) 17 errors
2020-04-20 22:04:05 condorella/rx550 94741139 OK 43800000 46.23%; 13764 us/it; ETA 8d 02:46; db25fa854c5db484 (check 5.65s) 17 errors
2020-04-20 22:15:39 condorella/rx550 94741139 OK 43850000 46.28%; 13764 us/it; ETA 8d 02:35; e69e2dbf65d78b2a (check 5.64s) 17 errors
2020-04-20 22:27:13 condorella/rx550 94741139 EE 43900000 46.34%; 13762 us/it; ETA 8d 02:21; 1f68378b7c6fc404 (check 5.63s) 17 errors
2020-04-20 22:27:19 condorella/rx550 94741139 OK 43850000 loaded: blockSize 400, e69e2dbf65d78b2a
2020-04-20 22:38:53 condorella/rx550 94741139 OK 43900000 46.34%; 13761 us/it; ETA 8d 02:20; 1f68378b7c6fc404 (check 5.68s) 18 errors
2020-04-20 22:50:26 condorella/rx550 94741139 OK 43950000 46.39%; 13759 us/it; ETA 8d 02:08; 31bdbf61721379f5 (check 5.68s) 18 errors
2020-04-20 23:02:00 condorella/rx550 94741139 OK 44000000 46.44%; 13762 us/it; ETA 8d 01:58; ab5f29aa5e0616d4 (check 5.64s) 18 errors
2020-04-20 23:13:34 condorella/rx550 94741139 OK 44050000 46.50%; 13764 us/it; ETA 8d 01:49; d15a6b5993812fc4 (check 5.64s) 18 errors
2020-04-20 23:25:08 condorella/rx550 94741139 OK 44100000 46.55%; 13761 us/it; ETA 8d 01:35; 72acbd04b3d43f04 (check 5.64s) 18 errors
2020-04-20 23:36:41 condorella/rx550 94741139 OK 44150000 46.60%; 13761 us/it; ETA 8d 01:23; 2894cbff475de263 (check 5.64s) 18 errors
2020-04-20 23:48:15 condorella/rx550 94741139 OK 44200000 46.65%; 13764 us/it; ETA 8d 01:15; d3091a2a24f15d8b (check 5.64s) 18 errors
2020-04-20 23:59:49 condorella/rx550 94741139 OK 44250000 46.71%; 13761 us/it; ETA 8d 01:00; d35597a77e451f9b (check 5.64s) 18 errors
2020-04-21 00:11:23 condorella/rx550 94741139 OK 44300000 46.76%; 13762 us/it; ETA 8d 00:50; 092708b97dc11cf0 (check 5.64s) 18 errors
2020-04-21 00:22:56 condorella/rx550 94741139 OK 44350000 46.81%; 13757 us/it; ETA 8d 00:34; a55be7644c8914ff (check 5.64s) 18 errors
2020-04-21 00:34:30 condorella/rx550 94741139 OK 44400000 46.86%; 13761 us/it; ETA 8d 00:26; 6c9cb184d9ae9fb9 (check 5.67s) 18 errors
2020-04-21 00:46:03 condorella/rx550 94741139 OK 44450000 46.92%; 13757 us/it; ETA 8d 00:11; 440bf81e51efd1b8 (check 5.64s) 18 errors
2020-04-21 00:57:37 condorella/rx550 94741139 OK 44500000 46.97%; 13760 us/it; ETA 8d 00:02; 4e2721d94c80f9a9 (check 5.67s) 18 errors
2020-04-21 01:09:11 condorella/rx550 94741139 OK 44550000 47.02%; 13758 us/it; ETA 7d 23:49; acc59d938a878840 (check 5.67s) 18 errors
2020-04-21 01:20:44 condorella/rx550 94741139 OK 44600000 47.08%; 13760 us/it; ETA 7d 23:39; e8ae6b2e1342173a (check 5.64s) 18 errors
2020-04-21 01:32:18 condorella/rx550 94741139 OK 44650000 47.13%; 13758 us/it; ETA 7d 23:26; 7738e5de79a41988 (check 5.64s) 18 errors
2020-04-21 01:43:51 condorella/rx550 94741139 OK 44700000 47.18%; 13754 us/it; ETA 7d 23:11; 0325e62041e2ef93 (check 5.66s) 18 errors
2020-04-21 01:55:25 condorella/rx550 94741139 OK 44750000 47.23%; 13757 us/it; ETA 7d 23:03; ac90cc4d821b536d (check 5.67s) 18 errors
2020-04-21 02:06:58 condorella/rx550 94741139 OK 44800000 47.29%; 13758 us/it; ETA 7d 22:52; 96fdda068a85c0ec (check 5.64s) 18 errors[/CODE]
Next level is in effect now, stop and close prime95. |
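Scanning a long log like this by eye is tedious, so here is a hedged sketch of a small parser that lists the EE blocks and pairs each with the res64 reported when the same iteration is later retried. The regular expression is inferred from the progress lines posted above and is not an official description of the gpuowl log format; `find_ee_events` is a made-up helper name. In the log posted here, each retried block reports the same res64 as the block that was flagged EE.

```python
import re

# Pattern inferred from the progress lines in this thread (an assumption, not a spec):
# date time host exponent OK|EE iteration percent; ...; res64 (check ...) N errors
PROGRESS = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<host>\S+) (?P<exp>\d+) "
    r"(?P<status>OK|EE)\s+(?P<iter>\d+)\s+[\d.]+%;.*?; "
    r"(?P<res64>[0-9a-f]{16}) \(check"
)


def find_ee_events(lines):
    """Return one dict per EE line, with the res64 of the later OK retry if present."""
    events = []
    pending = {}  # iteration -> EE event still waiting for its retry
    for line in lines:
        m = PROGRESS.match(line.strip())
        if not m:
            continue  # e.g. "loaded:", Roundoff and Carry lines
        iteration, res64 = int(m["iter"]), m["res64"]
        if m["status"] == "EE":
            event = {"iteration": iteration, "res64": res64, "retry_res64": None}
            events.append(event)
            pending[iteration] = event
        elif iteration in pending:
            pending.pop(iteration)["retry_res64"] = res64
    return events


# Four lines copied from the log above.
SAMPLE = [
    "2020-04-20 20:31:29 condorella/rx550 94741139 OK 43450000 45.86%; 13763 us/it; ETA 8d 04:05; a47eafb785a71fa4 (check 5.64s) 16 errors",
    "2020-04-20 20:43:02 condorella/rx550 94741139 EE 43500000 45.91%; 13759 us/it; ETA 8d 03:50; 9c0cad0c6879242b (check 5.73s) 16 errors",
    "2020-04-20 20:43:09 condorella/rx550 94741139 OK 43450000 loaded: blockSize 400, a47eafb785a71fa4",
    "2020-04-20 20:54:42 condorella/rx550 94741139 OK 43500000 45.91%; 13759 us/it; ETA 8d 03:50; 9c0cad0c6879242b (check 5.64s) 17 errors",
]

if __name__ == "__main__":
    for event in find_ee_events(SAMPLE):
        same = event["res64"] == event["retry_res64"]
        print(f"EE at {event['iteration']}: res64 {event['res64']}, retry matches: {same}")
```

To use it on a real file, pass an open file object (for example `open("gpuowl.log")`, a file name assumed here) instead of `SAMPLE`.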
[QUOTE=preda;543327]Feel free to submit a pull request with the "-t 0" change.[/QUOTE]
That's what I did, but without intending to commit my local change - got this error:
[code]git pull https://github.com/preda/gpuowl && make
remote: Enumerating objects: 119, done.
remote: Counting objects: 100% (119/119), done.
remote: Compressing objects: 100% (46/46), done.
remote: Total 136 (delta 96), reused 89 (delta 73), pack-reused 17
Receiving objects: 100% (136/136), 83.73 KiB | 2.20 MiB/s, done.
Resolving deltas: 100% (96/96), completed with 22 local objects.
From https://github.com/preda/gpuowl
 * branch HEAD -> FETCH_HEAD
Updating 62a3025..f1fd1f7
error: Your local changes to the following files would be overwritten by merge:
    tools/primenet.py
Please commit your changes or stash them before you merge.
Aborting[/code]
So this seems a good baby-step introduction to the rev-control setup ... what is the procedure for checking out a file, then testing and submitting a modified version? And what is the code review process you and George have in place?

Oh, another Q re. the latest primenet.py - just tried to use it with same flags I'd always used, -w 150 --tasks 10, to queue up new PRPs, but with the latest got
[i]primenet.py: error: argument -w: invalid choice: '150' (choose from 'PRP', 'PM1', 'LL_DC', 'PRP_DC', 'PRP_WORLD_RECORD', 'PRP_100M')[/i]
That "numeric value no longer works" appears to be due to a change in the choice=list(..) command - did you deliberately mean to disable numeric-server-worktype code support? |
[QUOTE=ewmayer;543392]That's what I did, but without intending to commit my local change - got this error:
[code]git pull https://github.com/preda/gpuowl && make
remote: Enumerating objects: 119, done.
remote: Counting objects: 100% (119/119), done.
remote: Compressing objects: 100% (46/46), done.
remote: Total 136 (delta 96), reused 89 (delta 73), pack-reused 17
Receiving objects: 100% (136/136), 83.73 KiB | 2.20 MiB/s, done.
Resolving deltas: 100% (96/96), completed with 22 local objects.
From https://github.com/preda/gpuowl
 * branch HEAD -> FETCH_HEAD
Updating 62a3025..f1fd1f7
error: Your local changes to the following files would be overwritten by merge:
    tools/primenet.py
Please commit your changes or stash them before you merge.
Aborting[/code]
So this seems a good baby-step introduction to the rev-control setup ... what is the procedure for checking out a file, then testing and submitting a modified version? [/QUOTE]
I wouldn't dare to write a git/github how-to here -- it's too large a subject, and there already are good tutorials out there. But the basic step sequence is:
1. create a github account
2. fork the project to your account (using github interface)
3. "git clone": check out locally *your* clone of the project (because you have write rights on your clone)
4. make local changes
5. "git commit": commit local changes
6. "git push": publish your local commits to your fork
7. using the github interface, create a pull request from your fork to the main project
8. I see the pull request, and I can merge it
[QUOTE] And what is the code review process you and George have in place? [/QUOTE]
It's extremely light right now:
- I commit without any reviews. Sometimes George detects errors I make, and notifies me (so, that's a form of post-commit review :).
- George sends me pull requests. I usually verify them before merging (by compiling and running an exponent for a bit). (the goal of my testing is mainly to detect performance differences between our respective setups)
[QUOTE] Oh, another Q re. the latest primenet.py - just tried to use it with same flags I'd always used, -w 150 --tasks 10, to queue up new PRPs, but with the latest got
[i]primenet.py: error: argument -w: invalid choice: '150' (choose from 'PRP', 'PM1', 'LL_DC', 'PRP_DC', 'PRP_WORLD_RECORD', 'PRP_100M')[/i]
That "numeric value no longer works" appears to be due to a change in the choice=list(..) command - did you deliberately mean to disable numeric-server-worktype code support?[/QUOTE]
No, disabling the numeric values was unintentional (the goal of the change was to make the help less confusing by not displaying the numeric values there). But, why do you prefer using the numeric value (150) vs. the symbolic name "PRP"? Anyway, I'm fine with adding the numeric ids back if they're useful. |
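One possible way to accept both forms while keeping the help text free of numbers is an argparse `type=` converter instead of `choices=`. This is a sketch under stated assumptions, not the actual primenet.py code: `WORK_TYPES`, `parse_work_type` and `build_parser` are made-up names, and only the pairing PRP = 150 comes from this thread. The other numeric ids are my recollection of the PrimeNet work-preference codes and should be checked before relying on them.

```python
import argparse

# Only PRP=150 is confirmed by the posts above; the remaining ids are assumptions.
WORK_TYPES = {
    "PRP": 150,
    "PM1": 4,                  # assumption
    "LL_DC": 101,              # assumption
    "PRP_DC": 151,             # assumption
    "PRP_WORLD_RECORD": 152,   # assumption
    "PRP_100M": 153,           # assumption
}


def parse_work_type(text: str) -> int:
    """Accept a symbolic name ('PRP') or a known numeric id ('150'); return the id."""
    if text in WORK_TYPES:
        return WORK_TYPES[text]
    if text.isdigit() and int(text) in WORK_TYPES.values():
        return int(text)
    names = ", ".join(WORK_TYPES)
    raise argparse.ArgumentTypeError(f"invalid work type {text!r} (choose from {names})")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="worktype_demo")  # hypothetical program name
    parser.add_argument("-w", dest="work", type=parse_work_type,
                        default=WORK_TYPES["PRP"], metavar="WORKTYPE",
                        help="work type, one of: " + ", ".join(WORK_TYPES))
    parser.add_argument("--tasks", type=int, default=1,
                        help="number of tasks to keep queued")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-w", "150", "--tasks", "10"])
    print(args.work, args.tasks)
```

The help output lists only the symbolic names, while '-w 150' and '-w PRP' both resolve to the same numeric id internally.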
Ernst, do git stash before doing the pull, and then do git stash pop. If you haven't pulled any conflicting changes, the pop will replay your own changes on top of the latest from GitHub.