[QUOTE=kriesel;534721]Make gpuowl-win again generated the usual shower of warnings; see build-log.txt attached.[/QUOTE]
Ken, I'm aware of your complaint against those warnings, and I did look into them. IMO those warnings are invalid, a compiler problem. They could be silenced with some effort, but again IMO that effort is not worth expending because the [invalid] warnings are an inconvenience only for the person building the program (Ken) but not for the users. |
-yield effect varies; Windows version?
[QUOTE=wfgarnett3;534757]See attached screenshots -- gpuOwL-only runs are basically using all of the 27% CPU usage[/QUOTE]I don't see any of the recent performance-enhancing -USE options on your runs. See [URL]https://www.mersenneforum.org/showpost.php?p=533378&postcount=1654[/URL] for tuning data on a GTX1050Ti
-CARRY32 may also help. The cpu overhead of gpuowl-win on a couple of my systems is shown in the screen captures. Roa (Windows 10) is running just over a full HT "core" of overhead (of a total of 24 cores plus HT, one "core" = 1/48 = 2.08%) while condorella (Windows 7) is running a tiny fraction of that. I don't know why there's such a difference. wfgarnett3's 27% utilization is also ~one HT "core" on a dual-core HT Windows 10 system IIRC. Does -yield work on Windows 7 and not on Windows 10? Any Windows 8.x users out there? |
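As a quick sanity check on the percentages in the posts above (nothing new here, just restating the arithmetic): one hardware thread out of 48 is ~2.08% of total CPU, and one out of 4 on a dual-core HT system is 25%, close to the 27% observed:

```shell
# Share of total CPU represented by a single HT "core" (hardware thread).
awk 'BEGIN { printf "24c/48t: %.2f%%  2c/4t: %.2f%%\n", 100/48, 100/4 }'
```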
[QUOTE=preda;534752]- should I put the AID (of the PRP) on the P-1 factor-found result?[/QUOTE]It might be the easiest way to tell the PrimeNet server what to clean up.
It's sort of analogous to the case of a multibitlevel factoring assignment returning a factor in an early bit level; no point in soldiering on needlessly once the exponent has a factor discovered and reported. |
[QUOTE=preda;534763]Ken, I'm aware of your complaint against those warnings, and I did look into them. IMO those warnings are invalid, a compiler problem. They could be silenced with some effort, but again IMO that effort is not worth expending because the [invalid] warnings are an inconvenience only for the person building the program (Ken) but not for the users.[/QUOTE]Better you than me to look at whether the warnings may indicate improper program execution that could cause problems. And they will recur for anyone else doing a similar build in a similar way, perhaps on a commit that I skip.
|
[QUOTE=Prime95;534723]If P-1 finds a factor are both the P-1 and PRP lines deleted?[/QUOTE]
Related to the "implicit P-1 before PRP" feature, would it be possible to request manual assignments "first-time PRP without P-1 done", for somebody who's willing to do both P-1 and PRP? Right now for my own testing, I request P-1 and fudge the type to PRP. |
[QUOTE=preda;534788]Related to the "implicit P-1 before PRP" feature, would it be possible to request manual assignments "first-time PRP without P-1 done", for somebody who's willing to do both P-1 and PRP.
Right now for my own testing, I request P-1 and fudge the type to PRP.[/QUOTE] Try requesting a P-1 assignment, then unreserve it, and then request a PRP assignment with a range of one exponent (the exponent that P-1 returned). |
[QUOTE=preda;534788]Right now for my own testing, I request P-1 and fudge the type to PRP.[/QUOTE]
I've been doing just the opposite. I request a first-time PRP, but I fudge the type to P-1 to report the P-1 results. Seems to work, but that may be only because I am getting CAT3 and CAT4 exponents. None of them seems to have had any P-1 factoring done on them at all. |
[QUOTE=preda;534752]if I simply drop the PRP assignment from worktodo.txt on P-1 factor found, it would still be assigned on the server even if the factor is reported?[/QUOTE]
I had an exponent (M103464293) that I had manually checked out as a PRP test. When a P-1 factor was found and the results from gpuowl were submitted, the result was accepted and the PRP test was removed. |
[QUOTE=PhilF;534800]I had an exponent (M103464293) that I had manually checked out as a PRP test. When a P-1 factor was found and the results from gpuowl were submitted, the result was accepted and the PRP test was removed.[/QUOTE]
Good, this is the behavior we want from the server (i.e. to drop the PRP assignment when a factor-found is submitted by the same user). Do you remember the AID behavior -- did you submit the P-1 result with the AID of the PRP, or with no AID? (AID == Assignment ID) |
[QUOTE=preda;534831]Good, this is the behavior we want from the server (i.e. to drop the PRP assignment when a factor-found is submitted by the same user). Do you remember the AID behavior -- did you submit the P-1 result with the AID of the PRP, or with no AID? (AID == Assignment ID)[/QUOTE]
I'm pretty sure in that case I submitted it with the AID. But I have also submitted a few P-1 no factor results without the AID, which were also accepted and properly credited. While I have your attention, I have a question. If a factor is found during the stage 1 GCD, does that stop the already-running stage 2 so it can move on to the next exponent? |
[QUOTE=PhilF;534833]If a factor is found during the stage 1 GCD, does that stop the already-running stage 2 so it can move on to the next exponent?[/QUOTE]
Yes, that's how it's supposed to work. If it doesn't, it's a bug. |
[QUOTE=PhilF;534833]If a factor is found during the stage 1 GCD, does that stop the already-running stage 2 so it can move on to the next exponent?[/QUOTE]
Yes. And I've observed it to correctly report B1 and not B2 in that case. (IIRC, an earlier version still reporting B2 for a stage 1 factor was a bug, subsequently fixed.)
[CODE]2019-11-23 07:54:23 414000127 P1 4500000 99.97%; 23332 us/sq; ETA 0d 00:00; 2fe7a97d66c4de7a
2019-11-23 07:54:53 414000127 P1 4501162 100.00%; 24325 us/sq; ETA 0d 00:00; ca35d9148b84d827
2019-11-23 07:54:53 414000127 P2 using blocks [104 - 2494] to cover 3507310 primes
2019-11-23 07:54:55 414000127 P2 using 25 buffers of 192.0 MB each
2019-11-23 08:09:29 414000127 P2 25/2880: setup 2950 ms; 28503 us/prime, 30578 primes
2019-11-23 08:09:29 414000127 P1 GCD: 17000407212943276068260591201
2019-11-23 08:09:30 {"exponent":"414000127", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.11-9-g9ae3189"}, "timestamp":"2019-11-23 14:09:30 UTC", "user":"kriesel", "computer":"emu/gtx1080", "aid":"0", "fft-length":25165824, "B1":3120000, "factors":["17000407212943276068260591201"]}
2019-11-23 08:09:30 419000017 FFT 24576K: Width 256x4, Height 256x4, Middle 12; 16.65 bits/word
2019-11-23 08:09:31 OpenCL args "-DEXP=419000017u -DWIDTH=1024u -DSMALL_HEIGHT=1024u -DMIDDLE=12u -DWEIGHT_STEP=0xa.33167595f77ap-3 -DIWEIGHT_STEP=0xc.8cafe8fb59668p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-11-23 08:09:34
2019-11-23 08:09:34 OpenCL compilation in 3547 ms
2019-11-23 08:09:41 419000017 P1 B1=3140000, B2=75360000; 4529903 bits; starting at 0
2019-11-23 08:13:36 419000017 P1 10000 0.22%; 23475 us/sq; ETA 1d 05:28; 52356fa6dbf7ef75[/CODE] |
I have an interesting observation concerning gpuowl and thermal throttling with the Radeon VII.
It is so cold here that I decided today to crank my Radeon VII to my "high power" setting, which pulls 185W and has the fan set at 95%. The junction temperature was hovering around 92 degrees. This gave me a consistent 893us iteration timing on a 102M exponent, but the "check" timing was varying between 0.48s and 0.54s. That was strange, because on my cooler "medium power" setting of 165W and 85 degrees the check timing was always quite consistent. On a whim I decided to crank the fan up to 99%, giving me only an extra 100 RPM, but that dropped the junction temperature 2 degrees, down to 90. Now the check timing has returned to a consistent 0.53s, but the iteration timing is unchanged at 893us.

I can only assume that 893us is too short a sample time to detect the onset of thermal throttling, meaning the "check" timing code can detect it first. I submit that if your check time is varying but your iteration time is consistent, then your card is at the onset of thermal throttling. Thoughts? |
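A small shell sketch of the heuristic described in that post (a hypothetical helper, not part of gpuowl): compute the relative spread of the logged timings; a near-zero spread in iteration times combined with a ~10% spread in check times matches the onset-of-throttling signature.

```shell
# rel_spread: print (max - min) / max for a list of timing samples,
# e.g. the "(check N.NNs)" values scraped from gpuowl.log.
rel_spread() {
  awk 'BEGIN {
    for (i = 1; i < ARGC; i++) {
      x = ARGV[i] + 0                      # force numeric comparison
      if (i == 1 || x < lo) lo = x
      if (i == 1 || x > hi) hi = x
    }
    printf "%.3f\n", (hi - lo) / hi
  }' "$@"
}

rel_spread 0.48 0.54 0.50   # check times from the 185W run: ~11% swing
rel_spread 893 893 893      # iteration times: flat
```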
Ok, some benchmarks on the RTX 2080 again, clock locked to 1920 MHz. Testing 94000013 with -fft +2, which still seems to be the fastest combination on this card (width 512, height 512, middle 10).

On the 2020 Jan 02 version (commits up to f1b00d1) the big difference was between CARRY32 and CARRY64. Oddly enough, on this card, the performance was the opposite of what was expected:
CARRY64: 3.618 ms/iter
CARRY32: 3.792 ms/iter
But even with CARRY64 it was a couple percent faster than when I last tested it (2019 Dec 11), so there are definitely improvements here and there.

On the freshest 2020 Jan 11 version (commits up to 61f00d9), with the default options and with either CARRY64 or CARRY32, the program exits with errors:
[CODE]2020-01-11 20:10:25 GeForce RTX 2080-0 94000013 OK 0 loaded: blockSize 400, 0000000000000003
2020-01-11 20:10:29 GeForce RTX 2080-0 94000013 EE 800 0.00%; 3474 us/it; ETA 3d 18:42; 599534af704d7e17 (check 1.43s)
2020-01-11 20:10:30 GeForce RTX 2080-0 94000013 OK 0 loaded: blockSize 400, 0000000000000003
2020-01-11 20:10:35 GeForce RTX 2080-0 94000013 EE 800 0.00%; 3473 us/it; ETA 3d 18:42; 599534af704d7e17 (check 1.43s) 1 errors
2020-01-11 20:10:36 GeForce RTX 2080-0 94000013 OK 0 loaded: blockSize 400, 0000000000000003
2020-01-11 20:10:40 GeForce RTX 2080-0 94000013 EE 800 0.00%; 3473 us/it; ETA 3d 18:42; 599534af704d7e17 (check 1.43s) 2 errors
2020-01-11 20:10:40 GeForce RTX 2080-0 3 sequential errors, will stop.
2020-01-11 20:10:40 GeForce RTX 2080-0 Exiting because "too many errors"
2020-01-11 20:10:40 GeForce RTX 2080-0 Bye[/CODE]
This seems to be connected to the MiddleMul1 implementations somehow. I first thought it came from the trig routines, but changing them only reduced the errors and didn't remove them completely. So now, one at a time, everything still with CARRY64...

- trig options, with ORIGINAL_METHOD:
ORIG_SLOWTRIG: 3.618 ms (which matches the Jan 02 performance, as it should)
NEW_SLOWTRIG,MORE_ACCURATE: 3.527 ms
NEW_SLOWTRIG,LESS_ACCURATE: 3.513 ms

- MiddleMul1 options, with ORIG_SLOWTRIG:
ORIGINAL_TWEAKED: 3.618 ms (same as ORIGINAL_METHOD, weird?)
FANCY_MIDDLEMUL1: 3.605 ms
MORE_SQUARES_MIDDLEMUL1: 3.620 ms
CHEBYSHEV_METHOD: 3.563 ms (wow!)
CHEBYSHEV_METHOD_FMA: fails checks, but gives the same 3.563 ms timing, so never mind.

So, everything combined:
[C]./gpuowl -use NO_ASM,CARRY64,NEW_SLOWTRIG,LESS_ACCURATE,CHEBYSHEV_METHOD -yield -log 10000 -prp 94000013 -iters 100000 -fft +2[/C]
Hmm... 3.463 ms/iter, but it fails checks after 20k iterations, then gets stuck there and exits after three errors. Try again:
[C]./gpuowl -use NO_ASM,CARRY64,NEW_SLOWTRIG,MORE_ACCURATE,CHEBYSHEV_METHOD -yield -log 10000 -prp 94000013 -iters 100000 -fft +2[/C]
Now it's 3.471 ms/iter, but it still fails checks after 30k iterations. Crap. Third time, perhaps?
[C]./gpuowl -use NO_ASM,CARRY64,NEW_SLOWTRIG,MORE_ACCURATE,FANCY_MIDDLEMUL1 -yield -log 10000 -prp 94000013 -iters 100000 -fft +2[/C]
Now we're back to 3.514 ms... and it runs smoothly for 100k iters at least. Let's see if the trig options still help a bit?
[C]./gpuowl -use NO_ASM,CARRY64,NEW_SLOWTRIG,LESS_ACCURATE,FANCY_MIDDLEMUL1 -yield -log 10000 -prp 94000013 -iters 100000 -fft +2[/C]
Works, 3.500 ms/iter pretty much spot on (over 100k iters: lowest 3.496, highest 3.500). So that's still more than 3% off the starting point, neat. |
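Sweeps like the one above can be scripted so every combination is tried the same way; a minimal sketch (the flag names and exponent are simply the ones used in this post, substitute your own):

```shell
# Emit one gpuowl command line per -use combination under test.
# Pipe the output to sh (or replace printf with an actual invocation)
# once the path and exponent are adjusted for your setup.
sweep() {
  for opts in "$@"; do
    printf './gpuowl -use NO_ASM,CARRY64,%s -yield -log 10000 -prp 94000013 -iters 100000 -fft +2\n' "$opts"
  done
}

sweep 'NEW_SLOWTRIG,LESS_ACCURATE,CHEBYSHEV_METHOD' \
      'NEW_SLOWTRIG,MORE_ACCURATE,FANCY_MIDDLEMUL1'
```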
@nomead:
You may want to try the four combinations of CARRY32/CARRY64 with and without OLD_CARRY_LAYOUT.

@everyone:
Treat the new sin/cos and middlemul1 implementations (and now the new middlemul2 implementation) as test code. Clearly we need to do more analysis on the accuracy of these functions. For me, these new options yield a 25us (3.5%) improvement on Radeon VII. No errors in the last 12 hours, but I am not operating near the upper limit of the FFT size. |
It seems the Chebyshev method has accuracy issues. Until preda checks in the code that selects new defaults, "-use ORIGINAL_TWEAKED,ORIG_MIDDLEMUL2" is recommended.
|
[QUOTE=Prime95;534886]It seems the Chebyshev method has accuracy issues. Until preda checks in the code that selects new defaults, "-use ORIGINAL_TWEAKED,ORIG_MIDDLEMUL2" is recommended.[/QUOTE]
Do "accuracy issues" reveal themselves as Gerbicz errors? |
[QUOTE=PhilF;534888]Do "accuracy issues" reveal themselves as Gerbicz errors?[/QUOTE]
Yes. |
[QUOTE=Prime95;534882]@nomead:
You may want to try the four combinations of CARRY32/CARRY64 with and without OLD_CARRY_LAYOUT.[/QUOTE] Ok, I tested that with several different FFT sizes, and at least on this card OLD_CARRY_LAYOUT didn't make any difference either way (the performance is exactly the same down to the microsecond, with both CARRY64 and CARRY32).

[QUOTE=Prime95;534882]@everyone: Treat the new sin/cos and middlemul1 implementations (and now the new middlemul2 implementation) as test code. Clearly we need to do more analysis on the accuracy of these functions. For me, these new options yield a 25us (3.5%) improvement on Radeon VII. No errors in the last 12 hours, but I am not operating near the upper limit of the FFT size.[/QUOTE] Good hint there about the accuracy; I backed the exponent off a bit. Going from 94000013 to 93000067 wasn't enough, errors still occurred, though less often than before. But 92000059 was enough (still at FFT 5120K); now with NO_ASM,CARRY64,LESS_ACCURATE I reliably get that 3.463 ms/iter. So in total that's over 4% faster than where I started. I didn't build a version with the MiddleMul2 options yet. |
Going from ORIG_MIDDLEMUL2 to CHEBYSHEV_MIDDLEMUL2 again improved the timing, from 3.463 to 3.411 ms. Accuracy perhaps degraded a little bit more: now 93000067 exits immediately due to check errors (the first check fails three times in a row), while with ORIG_MIDDLEMUL2 it got up to 30k iterations before failing. But 92000059 still works fine for at least 100k iterations.
|
[QUOTE=preda;534762]I added some untested code that is supposed to:
1. when a P-1 factor is found, all PRP entries from worktodo.txt for the same exponent are removed. No result is written (to results.txt) for these deleted tasks.
2. when a P-1 factor is found in the background (GCD) while a PRP test for the same exponent is ongoing, the PRP test is aborted early and the point 1. above is applied.

I think this solution [in addition to bugs] has the problem of leaving PRP assignments "hanging" on primenet. Maybe the server could implement auto-release of the PRP assignments of a user when that user submits a factor for the same exponent (because, after a factor found, it does not make sense for the user that found the factor to pursue the PRP tests)[/QUOTE]

A possible bug: I updated to the latest commit (267cc60). I was in the middle of a PRP test. I had previously completed a P-1 test on this exponent, so the p1.owl and p2.owl save files were already in the exponent's save folder, which may or may not be relevant. But what happened is the new gpuowl wiped out the save files and started the PRP test over. In the log file is the line:
[code] 'worktodo.txt': Could not find the line 'PRP=<AID>,1,2,101949599,-1,76,2' to delete[/code]
So it looks like it thought there was a factor? Regardless, it appears to have wiped out the save files and started from scratch without checking whether or not the test already had some progress made. |
[QUOTE=PhilF;534924]A possible bug:
I updated to the latest commit (267cc60). I was in the middle of a PRP test. I had previously completed a P-1 test on this exponent, so the p1.owl and p2.owl save files were already in the exponent's save folder, which may or may not be relevant. But what happened is the new gpuowl wiped out the save files and started the PRP test over. In the log file is the line:
[code] 'worktodo.txt': Could not find the line 'PRP=<AID>,1,2,101949599,-1,76,2' to delete[/code]
So it looks like it thought there was a factor? Regardless, it appears to have wiped out the save files and started from scratch without checking whether or not the test already had some progress made.[/QUOTE]

Was there a factor? (This would help diagnose the situation: if there was a factor it's mostly fine, if there wasn't it's more serious.) You could look towards the end of your results.txt and see whether the P-1 reported a factor found ("F"). Also, were you running with -cleanup?

Until the problem is fixed (investigating), I'd recommend running without -cleanup; also make sure you have a newline on the last line of your worktodo.txt. What does your worktodo.txt look like now? |
[QUOTE=preda;534929]Was there a factor? (this would help diagnose the situation. If there was a factor it's mostly fine, if there wasn't it's more serious) -- you could look towards the end of your results.txt and see whether the P-1 reported a factor found ("F"). Also, you were running with -cleanup?[/QUOTE]
No, there was no factor. The PRP test was about halfway finished. I was not running with -cleanup. The command line I am using is:
[C]gpuowl -device 0 -user pfrakes -cpu i7-4790 -B1 1000000 -B2 32000000[/C]
Worktodo.txt contained:
[C]PRP=<aid redacted>,1,2,101949599,-1,76,0[/C]
I just realized that worktodo.txt contains a PFactor= line for this exponent, which may have already been there when I updated gpuowl (or gpuowl added it, I'm not sure). If it was already there, maybe it confused the program.

EDIT: There is not a newline at the end of the worktodo.txt file. Could that be the problem? |
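Regarding the missing final newline: a quick way to repair it from the shell, as a hypothetical helper (gpuowl itself is not involved here), is to append a newline only when the last byte isn't one already:

```shell
# Append a final '\n' to a file only if it doesn't already end with one.
# $(...) strips a trailing newline, so the substitution is empty exactly
# when the file already ends correctly; the helper is therefore idempotent.
fix_trailing_newline() {
  f="$1"
  if [ -s "$f" ] && [ -n "$(tail -c 1 "$f")" ]; then
    printf '\n' >> "$f"
  fi
}

fix_trailing_newline worktodo.txt
```

Running it before hand-editing worktodo.txt avoids the merged-lines symptom; running it twice adds nothing.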
Could you double check whether you actually lost the PRP savefiles? That's highly surprising, because gpuOwl never deletes the content of past exponents, except when using -cleanup (which you aren't using).
So, please track down the exponent on which you were PRP half-way (from gpuowl.log). Next look in the folder for that exponent, you should have the savefiles safely there -- not deleted and not lost. What I think happened is this: you simply started a new exponent (a different one) from worktodo.txt. The order of worktodo entries changed, and the exponent you were 50% through is still there. Maybe it even has an entry in the worktodo.txt. An extended excerpt of gpuowl.log would help with understanding what happened. [QUOTE=PhilF;534930]No, there was no factor. The PRP test was about halfway finished. I was not running with -cleanup. The command line I am using is: gpuowl -device 0 -user pfrakes -cpu i7-4790 -B1 1000000 -B2 32000000 Worktodo.txt contained: PRP=<aid redacted>,1,2,101949599,-1,76,0 I just realized that worktodo.txt contains a PFactor= line for this exponent which may have already been there when I updated gpuowl (or gpuowl added it, I'm not sure). If it was already there maybe it confused the program. EDIT: There is not a newline at the end of the worktodo.txt file. Could that be the problem?[/QUOTE] |
[QUOTE=paulunderwood;534226]I don't understand it. I git cloned gpuowl and compiled, and it runs slower than before 1240 us. vs. 750 us. What am I doing wrong?[/QUOTE]
The latest cloned version compiles just fine. I was running at 760 us and now it is 709 us -- another amazing speed-up. Plus I can do the P-1 pre-factoring :tu: |
[QUOTE=preda;534937]Could you double check whether you actually lost the PRP savefiles? That's highly surprising, because gpuOwl never deletes the content of past exponents, except when using -cleanup (which you aren't using).
So, please track down the exponent on which you were PRP half-way (from gpuowl.log). Next look in the folder for that exponent; you should have the savefiles safely there -- not deleted and not lost. What I think happened is this: you simply started a new exponent (a different one) from worktodo.txt. The order of worktodo entries changed, and the exponent you were 50% through is still there. Maybe it even has an entry in the worktodo.txt. An extended excerpt of gpuowl.log would help with understanding what happened.[/QUOTE]

I wish that were the case, but I believe there was only 1 line in worktodo.txt at the time. There is a possibility that I messed up and kept the wrong save files, because I had a number of folders in that same range with very similar names. That might be a better explanation, since it does not appear to be the code. I think rather than trying to track this down, I should try it again. If I can reproduce this I'll let you know.

The worktodo.txt-bak is a mess, partially because I didn't know about the needed newline (AIDs are not real). Gpuowl must have added the duplicate lines:
[code]PRP=764DD1319D71BA1AE73B4D7C415C22EF,1,2,101949599,-1,76,2PFactor=764DD1319D71BA1AE73B4D7C415C22EF,1,2,101949599,-1,2,0
PRP=764DD1319D71BA1AE73B4D7C415C22EF,1,2,101949599,-1,76,0
PFactor=764DD1319D71BA1AE73B4D7C415C22EF,1,2,101949599,-1,2,0
PRP=764DD1319D71BA1AE73B4D7C415C22EF,1,2,101949599,-1,76,0[/code]

Here are the last 20 lines of gpuowl.log:
[code]2020-01-11 18:58:15 i7-4790 101949599 OK 48200000 47.28%; 893 us/it; ETA 0d 13:20; 52a583a2a885b208 (check 0.53s)
2020-01-11 19:01:15 i7-4790 101949599 OK 48400000 47.47%; 893 us/it; ETA 0d 13:17; 88403b125b19d22a (check 0.53s)
2020-01-11 19:04:14 i7-4790 101949599 OK 48600000 47.67%; 893 us/it; ETA 0d 13:14; 8eb6c84a2f34b07b (check 0.53s)
2020-01-11 19:06:59 i7-4790 Stopping, please wait..
2020-01-11 19:06:59 i7-4790 101949599 OK 48784800 47.85%; 893 us/it; ETA 0d 13:11; e0868a0077e6cd96 (check 0.50s)
2020-01-11 19:06:59 i7-4790 Exiting because "stop requested"
2020-01-11 19:06:59 i7-4790 Bye
2020-01-11 19:33:20 Note: not found 'config.txt'
2020-01-11 19:33:20 config: -device 0 -user pfrakes -cpu i7-4790 -B1 1000000 -B2 32000000
2020-01-11 19:33:20 device 0, unique id ''
2020-01-11 19:33:20 i7-4790 'worktodo.txt': could not find the line 'PRP=764DD1319D71BA1AE73B4D7C415C22EF,1,2,101949599,-1,76,2' to delete
2020-01-11 19:33:20 i7-4790 101949599 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 17.68 bits/word
2020-01-11 19:33:21 i7-4790 OpenCL args "-DEXP=101949599u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0x1.401bafea92a09p+0 -DIWEIGHT_STEP=0x1.99762c21e62cp-1 -DWEIGHT_BIGSTEP=0x1.306fe0a31b715p+0 -DIWEIGHT_BIGSTEP=0x1.ae89f995ad3adp-1 -DAMDGPU=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-01-11 19:33:22 i7-4790 OpenCL compilation in 1.70 s
2020-01-11 19:33:23 i7-4790 101949599 OK 11200 loaded: blockSize 400, 55afd3a9f362e204
2020-01-11 19:33:24 i7-4790 101949599 OK 12000 0.01%; 882 us/it; ETA 1d 00:58; 4dcb47cf0ec6fab2 (check 0.48s)
2020-01-11 19:33:36 i7-4790 Stopping, please wait..
2020-01-11 19:33:37 i7-4790 101949599 OK 26000 0.03%; 880 us/it; ETA 1d 00:55; 26a557d2e852e785 (check 0.48s)
2020-01-11 19:33:37 i7-4790 Exiting because "stop requested"
2020-01-11 19:33:37 i7-4790 Bye[/code] |
Thank you for the info. In the meantime I added detection of the missing newline at the end of worktodo.txt, presumably the merged lines you saw should not happen anymore.
[QUOTE=PhilF;534957]The worktodo.txt-bak is a mess, partially because I didn't know about the needed newline (AIDs are not real). Gpuowl must have added the duplicate lines:
[code]PRP=764DD1319D71BA1AE73B4D7C415C22EF,1,2,101949599,-1,76,2PFactor=764DD1319D71BA1AE73B4D7C415C22EF,1,2,101949599,-1,2,0
PRP=764DD1319D71BA1AE73B4D7C415C22EF,1,2,101949599,-1,76,0
PFactor=764DD1319D71BA1AE73B4D7C415C22EF,1,2,101949599,-1,2,0
PRP=764DD1319D71BA1AE73B4D7C415C22EF,1,2,101949599,-1,76,0[/code][/QUOTE] |
nVidia change coming (pending preda's approval of my last commit).
I've gone through all the nVidia timings posted in the last 2 months in an attempt to come up with reasonable default settings for nVidia GPUs. The new defaults will be:
WORKINGIN4 (was WORKINGIN5)
WORKINGOUT4 (was WORKINGOUT3)
T2_SHUFFLE (was T2_SHUFFLE_REVERSELINE)
CARRY64 (was CARRY32)
FANCY_MIDDLEMUL1 (was ORIGINAL_TWEAKED)
LESS_ACCURATE (was MORE_ACCURATE)
The UNROLL_ALL default was not changed.

Note FANCY_MIDDLEMUL1 is only implemented for MIDDLE=10,11. Otherwise, the default is ORIGINAL_TWEAKED. |
gpuowl v6.11-104 hang observed on RX550
Gpuowl waiting for gpu, and gpu waiting for something to do? Note it went to almost no gpu ram committed. Spinner motion stopped.
[CODE]2020-01-11 19:27:14 condorella/rx550 90709987 OK 62600000 69.01%; 15528 us/it; ETA 5d 01:15; 3a0d2997b51f9d09 (check 6.36s)
2020-01-11 20:19:05 condorella/rx550 90709987 OK 62800000 69.23%; 15527 us/it; ETA 5d 00:23; 8952aef2e247dec3 (check 6.35s)
2020-01-11 21:10:57 condorella/rx550 90709987 OK 63000000 69.45%; 15527 us/it; ETA 4d 23:31; 5da17c923a0ce57b (check 6.69s)
2020-01-11 22:02:49 condorella/rx550 90709987 OK 63200000 69.67%; 15527 us/it; ETA 4d 22:39; bb3eec63b136a9c6 (check 6.34s)
2020-01-13 07:52:11 config.txt: -device 1 -user kriesel -cpu condorella/rx550 -use NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT2,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_REVERSELINE
2020-01-13 07:52:11 device 1, unique id ''
2020-01-13 07:52:11 condorella/rx550 90709987 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.30 bits/word
2020-01-13 07:52:12 condorella/rx550 OpenCL args "-DEXP=90709987u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xc.fb65b19625858p-3 -DIWEIGHT_STEP=0x9.dc1b382f1df1p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DAMDGPU=1 -DNO_ASM=1 -DMERGED_MIDDLE=1 -DWORKINGIN5=1 -DWORKINGOUT2=1 -DT2_SHUFFLE_HEIGHT=1 -DT2_SHUFFLE_MIDDLE=1 -DT2_SHUFFLE_REVERSELINE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-01-13 07:52:16 condorella/rx550 OpenCL compilation in 3.31 s
2020-01-13 07:52:23 condorella/rx550 90709987 OK 63200000 loaded: blockSize 400, bb3eec63b136a9c6
2020-01-13 07:52:41 condorella/rx550 90709987 OK 63200800 69.67%; 15355 us/it; ETA 4d 21:20; 8aac70bbc7dd7ca0 (check 6.31s)[/CODE]
Note only 3 and 4 MB used indicated in GPU-Z, not consistent with usual GPU app activity. The 0 clocks indicated are a known issue with the Win7, GPU-Z, and Windows remote desktop combination in use here. The console was easily terminated and the work restarted in a new console instance. |
I don't know why this happens, but most likely it is something outside of the app itself. On Linux I would look into dmesg (syslog) to see if there is anything logged there by the GPU driver. George reported a similar freeze on Linux.
[QUOTE=kriesel;535030]Gpuowl waiting for gpu, and gpu waiting for something to do? Note it went to almost no gpu ram committed. Spinner motion stopped.
[CODE]2020-01-11 19:27:14 condorella/rx550 90709987 OK 62600000 69.01%; 15528 us/it; ETA 5d 01:15; 3a0d2997b51f9d09 (check 6.36s)
2020-01-11 20:19:05 condorella/rx550 90709987 OK 62800000 69.23%; 15527 us/it; ETA 5d 00:23; 8952aef2e247dec3 (check 6.35s)
2020-01-11 21:10:57 condorella/rx550 90709987 OK 63000000 69.45%; 15527 us/it; ETA 4d 23:31; 5da17c923a0ce57b (check 6.69s)
2020-01-11 22:02:49 condorella/rx550 90709987 OK 63200000 69.67%; 15527 us/it; ETA 4d 22:39; bb3eec63b136a9c6 (check 6.34s)
2020-01-13 07:52:11 config.txt: -device 1 -user kriesel -cpu condorella/rx550 -use NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT2,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_REVERSELINE
2020-01-13 07:52:11 device 1, unique id ''
2020-01-13 07:52:11 condorella/rx550 90709987 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.30 bits/word
2020-01-13 07:52:12 condorella/rx550 OpenCL args "-DEXP=90709987u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xc.fb65b19625858p-3 -DIWEIGHT_STEP=0x9.dc1b382f1df1p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DAMDGPU=1 -DNO_ASM=1 -DMERGED_MIDDLE=1 -DWORKINGIN5=1 -DWORKINGOUT2=1 -DT2_SHUFFLE_HEIGHT=1 -DT2_SHUFFLE_MIDDLE=1 -DT2_SHUFFLE_REVERSELINE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-01-13 07:52:16 condorella/rx550 OpenCL compilation in 3.31 s
2020-01-13 07:52:23 condorella/rx550 90709987 OK 63200000 loaded: blockSize 400, bb3eec63b136a9c6
2020-01-13 07:52:41 condorella/rx550 90709987 OK 63200800 69.67%; 15355 us/it; ETA 4d 21:20; 8aac70bbc7dd7ca0 (check 6.31s)[/CODE]
Note only 3 and 4 MB used indicated in GPU-Z, not consistent with usual GPU app activity. The 0 clocks indicated are a known issue with the Win7, GPU-Z, and Windows remote desktop combination in use here. The console was easily terminated and the work restarted in a new console instance.[/QUOTE] |
[QUOTE=preda;535053]I don't know why this happens, but most likely is something outside of the app itself. On Linux I would look into dmesg (syslog) to see if there is anything logged there by the GPU driver. George reported a similar freeze on Linux.[/QUOTE]
Event Viewer, Windows Logs, System: Event 4101, Display, 1/11/2020 10:26:10 PM, "Display driver amdkmdap stopped responding and has successfully recovered."

This may have been the notorious Windows TDR behavior. Something took too long and Windows thought the driver had stopped responding. Or the driver actually did stop responding and needed restarting. Apparently if the gpu is reset by Windows, gpuowl will wait indefinitely; it has no code for detecting that situation or dealing with it. (It waited more than 31 hours, until I intervened.) A timeout on gpuowl's cpu side, followed by resubmitting the lost work to the gpu, might address this.

This is a known issue on the CUDA side too, not just AMD: Warning, Source Display, Event ID 4101, "Display driver nvlddmkm stopped responding and has successfully recovered." CUDALucas detects an error condition and exits; batch wrappers are used to continue on. |
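For what it's worth, the TDR grace period is adjustable. My understanding (please verify against Microsoft's TDR registry-key documentation before relying on it) is that the TdrDelay value under the GraphicsDrivers key sets how many seconds Windows waits before declaring the driver hung (default 2); something like the following, plus a reboot, lengthens it:

```
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 60 /f
```

This only papers over kernels that legitimately run long; it does not help when the driver has genuinely hung.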
[Posted similar in the CuLu/nVidia how-to thread]
In looking at the GPU subforum through a n00b user's eyes, it strikes me what a mess it is. I want to be able to get to the best practices for my target GPU/OS in post 1 of a "GPU how-to" thread. This thread has a problem in that regard: whatever was initially posted in Post #1, as a new user I expect to see either a list of, or link to, a Best Practices guide right there, and to have the same updated on a running basis to reflect changes in Best Practices and/or new editions of hardware and software of the particular family covered by the thread. Here, I see that in post #195 George added some new info, and noted "The "gold standard" instructions in post #76 should be updated" ... well, they never were, and why would they be warehoused in post #76 to begin with? Why not at least edit Post #1 in the thread to reflect that? E.g. "We encourage new users to peruse the whole thread, but for a quick best-practices guide, visit Post #76[link]". |
new users
"Best practice" like beauty is in the eye of the beholder.
"Information and answers" does seem a likely place for the new user to look. [URL]https://www.mersenneforum.org/forumdisplay.php?f=38[/URL] There has long been a thread (first sticky thread there) [URL]https://www.mersenneforum.org/showthread.php?t=1534[/URL] specifically for new users. I admit to not finding it when I was a new user, and for considerably after, too. Uncwilly added a pointer there (post 21) to the book-size collection of reference posts I've assembled. [URL]https://www.mersenneforum.org/showthread.php?t=24607[/URL] Second sticky thread there is one created to be a pointer to the reference info. (Ernst, #195 and #76 do not check out.) |
#76 is my short outdated checklist to getting an Ubuntu environment setup: [url]https://www.mersenneforum.org/showpost.php?p=511655&postcount=76[/url]
Someone should make an updated version; it might be me, but I'll have to recreate my setup to do that, as it's been dismantled for a while. There is another quickstart option that might be fun: take a fresh install with ROCm etc. installed and turn it into a live CD. |
[QUOTE=M344587487;535088]#76 is my short outdated checklist to getting an Ubuntu environment setup: [URL]https://www.mersenneforum.org/showpost.php?p=511655&postcount=76[/URL]
Someone should make an updated version, it might be me but I'll have to create my setup again to do that as it's been dismantled for a while. There is another quickstart option that might be fun which is to take a fresh install with ROCm etc installed and turn it into a live CD.[/QUOTE]Oh, thanks for clarifying that; entirely different thread. I took Ernst's post to mean #76 and 195 in this thread. |
[QUOTE=kriesel;535092]Oh, thanks for clarifying that; entirely different thread. I took Ernst's post to mean #76 and 195 in this thread.[/QUOTE]
My bad - I was referring to the Radeon 7 thread, which is also GpuOwl-centric, for obvious reasons. |
[QUOTE=preda;534937]What I think happened is this: you simply started a new exponent (a different one) from worktodo.txt. The order of worktodo entries changed, and the exponent you were 50% through is still there. Maybe it even has an entry in the worktodo.txt.[/QUOTE]
I think I have determined what happened, and it had nothing to do with gpuowl. It is a Linux gotcha that I didn't know existed. I have gpuowl running on one tty, and I check temperatures, voltages, maintain files, etc from a second tty. In this case I had stopped gpuowl, and the shell was sitting at the prompt with the working directory being gpuowl. In the other tty, I renamed the gpuowl directory, created a new one, and built a new gpuowl. I put all the relevant files and folders in that new gpuowl folder, went back to the other tty, and started gpuowl. The problem is, that shell's working directory didn't exist anymore. It had gotten renamed, but the shell didn't throw any errors. The prompt remained the same too, so I really thought I was working in the new gpuowl directory. The result was data loss. Anyway, the moral of the story is to make sure you leave the gpuowl working directory and re-enter it if you are fooling around with it inside two different tty sessions at the same time. :rakes: |
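The trap described above can be reproduced in a few lines. A small sketch of the Linux semantics involved: a process's working directory follows the renamed inode, so relative-path writes land in the old, renamed directory rather than a freshly created one of the same name (filenames here are illustrative):

```python
import os
import tempfile

# Reproduce the gotcha: a process whose cwd is a directory that gets
# renamed keeps following the renamed inode, so relative-path writes
# land in the old (renamed) directory, not a newly created one.
base = tempfile.mkdtemp()
old = os.path.join(base, "gpuowl")
os.mkdir(old)
os.chdir(old)                                     # like the idle shell sitting in gpuowl/
os.rename(old, os.path.join(base, "gpuowl.bak"))  # the rename from "the other tty"
os.mkdir(old)                                     # fresh gpuowl/ directory

with open("worktodo.txt", "w") as f:              # relative path: which dir does it hit?
    f.write("test entry\n")

in_new = os.path.exists(os.path.join(base, "gpuowl", "worktodo.txt"))
in_old = os.path.exists(os.path.join(base, "gpuowl.bak", "worktodo.txt"))
print(f"in new dir: {in_new}, in renamed dir: {in_old}")
```

The file ends up in the renamed directory, exactly the data-confusion described in the post; re-entering the directory (cd out and back in) re-resolves the path and avoids it.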
preda, kriesel, Prime95,
I am only running gpuOwl (no Prime95 on the CPU), so, since I am new to this, what flags should I type at the command line besides gpuowl-win.exe to see if my per-iteration time improves? [CODE]2020-01-16 01:17:59 Note: no config.txt file found
2020-01-16 01:17:59 device 0, unique id ''
2020-01-16 01:18:00 GeForce GTX 1050-0 81943843 FFT 4608K: Width 256x4, Height 64x4, Middle 9; 17.37 bits/word
2020-01-16 01:18:01 GeForce GTX 1050-0 OpenCL args "-DEXP=81943843u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=9u -DWEIGHT_STEP=0xc.69d9ee158d5b8p-3 -DIWEIGHT_STEP=0xa.4fb5ef629afb8p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-01-16 01:18:02 GeForce GTX 1050-0
2020-01-16 01:18:02 GeForce GTX 1050-0 OpenCL compilation in 0.38 s
2020-01-16 01:18:10 GeForce GTX 1050-0 81943843 OK 17130000 loaded: blockSize 400, de145902b2059f4b
2020-01-16 01:18:31 GeForce GTX 1050-0 81943843 OK 17130800 20.91%; 17605 us/it; ETA 13d 04:57; ebd9d81bce345290 (check 7.17s)
2020-01-16 01:39:20 GeForce GTX 1050-0 81943843 OK 17200000 20.99%; 17942 us/it; ETA 13d 10:40; 3d45c4478e50aeb6 (check 7.33s)
2020-01-16 02:39:38 GeForce GTX 1050-0 81943843 OK 17400000 21.23%; 18050 us/it; ETA 13d 11:37; f639fefb9039b2ab (check 7.34s)
2020-01-16 03:39:55 GeForce GTX 1050-0 81943843 OK 17600000 21.48%; 18051 us/it; ETA 13d 10:38; a78776fdd7f2ede3 (check 7.32s)
2020-01-16 04:40:13 GeForce GTX 1050-0 81943843 OK 17800000 21.72%; 18051 us/it; ETA 13d 09:37; 9fc9b0886bf2dc88 (check 7.33s)
2020-01-16 05:40:30 GeForce GTX 1050-0 81943843 OK 18000000 21.97%; 18051 us/it; ETA 13d 08:37; 7a4566d01385c94e (check 7.32s)
2020-01-16 06:40:47 GeForce GTX 1050-0 81943843 OK 18200000 22.21%; 18050 us/it; ETA 13d 07:36; 7f5c47985833c542 (check 7.33s)
2020-01-16 07:41:05 GeForce GTX 1050-0 81943843 OK 18400000 22.45%; 18050 us/it; ETA 13d 06:36; 24bf061871068b89 (check 7.34s)
2020-01-16 08:41:22 GeForce GTX 1050-0 81943843 OK 18600000 22.70%; 18050 us/it; ETA 13d 05:36; 5ffa6f774116574f (check 7.32s)
2020-01-16 09:41:40 GeForce GTX 1050-0 81943843 OK 18800000 22.94%; 18051 us/it; ETA 13d 04:37; 9c909adec676d76d (check 7.32s)
2020-01-16 10:41:57 GeForce GTX 1050-0 81943843 OK 19000000 23.19%; 18050 us/it; ETA 13d 03:35; bedb43a9ebaa0317 (check 7.33s)
2020-01-16 11:42:14 GeForce GTX 1050-0 81943843 OK 19200000 23.43%; 18049 us/it; ETA 13d 02:34; 869f10128493c2a3 (check 7.32s)
2020-01-16 12:31:20 GeForce GTX 1050-0 Stopping, please wait..
2020-01-16 12:31:35 GeForce GTX 1050-0 81943843 OK 19363600 23.63%; 18052 us/it; ETA 13d 01:48; dec7c8f5d6498df8 (check 7.33s)
2020-01-16 12:31:35 GeForce GTX 1050-0 Exiting because "stop requested"
2020-01-16 12:31:35 GeForce GTX 1050-0 Bye[/CODE] |
I typically run something like the following to test different options, then put the best in config.txt or a .bat file. What is best varies by GPU, and maybe by fft length also. Note, this is somewhat old and does not address all the latest -use options CARRY32 vs CARRY64 etc and following, which seem to me not well documented yet. A read of the source code is suggested for the full list. And take any recommendations from Preda or Prime95 very seriously.[CODE]:gwtime.bat for Windows in a command prompt box. Assumes cd to the gpuowl directory is already done.
:iter count is required to be a multiple of 10000; 10000 is enough for repeatable results up to gtx1080 or so
set iters=10000
:get gpu warmed up and stable, get baseline
:first one is there just to ensure the gpu is warmed up and clock-stable somewhat; ignore its timing, use the second
gpuowl-win -time -iters %iters% -use NO_ASM
gpuowl-win -time -iters %iters% -use NO_ASM
:uncomment as needed below to run the pass you want
:goto passtwo
:goto passthree
:[B]passone[/B]
:get the workingin and workingout optimals in pass one
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN
:repeated, let's see reproducibility once; then onward through the list
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN1
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN1A
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN2
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN3
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN5
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT0
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT1
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT1A
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT2
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT3
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT4
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT5
goto chain
:[B]passtwo[/B]
:edit the following before running pass two, to the best workingin and workingout choices determined in pass one
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_WIDTH
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_MIDDLE
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_REVERSELINE
goto chain
:[B]passthree[/B]
:edit the following if needed
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_MIDDLE
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_WIDTH,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE
start wordpad gpuowl.log
goto chain
:add passes if needed for CARRY32, CARRY64, etc here?
:[B]chain[/B] on to continuing production work; edit as needed for your environment. (In my case mf.bat runs mfaktc.)
cd C:\Users\Ken\Documents\tf-gtx1050ti
mf[/CODE]It could be improved by substituting environment variables for the workingin and workingout choices in passes two and beyond. Gains I've seen from tuning various GTX10xx have been pretty modest. 
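Rather than eyeballing the log in WordPad after each pass, the us/it figures can be scraped programmatically. A small sketch against the log format shown in this thread (the regex and sample text are based on the log lines quoted above; reading from a real gpuowl.log is left to the reader):

```python
import re
from statistics import mean

# Pull the per-iteration timings out of gpuowl log text, as printed in
# lines like "... OK 17200000 20.99%; 17942 us/it; ETA 13d 10:40; ...".
TIMING_RE = re.compile(r"(\d+) us/it")

def us_per_it(log_text):
    """Return all us/it values found in the given log text, in order."""
    return [int(m.group(1)) for m in TIMING_RE.finditer(log_text)]

sample = (
    "2020-01-16 01:39:20 GeForce GTX 1050-0 81943843 OK 17200000 20.99%; "
    "17942 us/it; ETA 13d 10:40; 3d45c4478e50aeb6 (check 7.33s)\n"
    "2020-01-16 02:39:38 GeForce GTX 1050-0 81943843 OK 17400000 21.23%; "
    "18050 us/it; ETA 13d 11:37; f639fefb9039b2ab (check 7.34s)\n"
)
timings = us_per_it(sample)
print(timings, mean(timings))
```

Feeding each pass's log section through this and comparing means would automate the per-option comparison the batch script produces.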
From gpuowl-wrap.cpp, gpuowl-v6.11-132-gfd01ee5, it's a considerable list: [CODE]/* List of user-serviceable -use flags and their effects
FMA : use OpenCL fma(x, y, z) instead of x * y + z in MAD(x, y, z)
NO_ASM : request to not use any inline __asm()
NO_OMOD: do not use GCN output modifiers in __asm()
NO_MERGED_MIDDLE
WORKINGOUTs <AMD default is WORKINGOUT3> <nVidia default is WORKINGOUT4>
WORKINGINs <AMD default is WORKINGIN5> <nVidia default is WORKINGIN4>
PREFER_LESS_FMA
ORIG_X2
INLINE_X2
FMA_X2
UNROLL_ALL <nVidia default>
UNROLL_NONE
UNROLL_WIDTH
UNROLL_HEIGHT <AMD default>
UNROLL_MIDDLEMUL1 <AMD default>
UNROLL_MIDDLEMUL2 <AMD default>
T2_SHUFFLE <nVidia default>
NO_T2_SHUFFLE
T2_SHUFFLE_WIDTH
T2_SHUFFLE_MIDDLE
T2_SHUFFLE_HEIGHT
T2_SHUFFLE_REVERSELINE <AMD default>
OLD_FFT8 <default>
NEWEST_FFT8
NEW_FFT8
OLD_FFT5
NEW_FFT5 <default>
NEWEST_FFT5
NEW_FFT10 <default>
OLD_FFT10
CARRY32 <AMD default> // This is potentially dangerous option for large FFTs. Carry may not fit in 31 bits.
CARRY64 <nVidia default>
FANCY_MIDDLEMUL1 <nVidia default> // Only implemented for MIDDLE=10 and MIDDLE=11
MORE_SQUARES_MIDDLEMUL1 // Replaces some complex muls with complex squares but uses more registers
CHEBYSHEV_METHOD // Uses fewer floating point ops than original MiddleMul1 implementation (worse accuracy?)
CHEBYSHEV_METHOD_FMA // Uses fewest floating point ops of any of the MiddleMul1 implementations (worse accuracy?)
ORIGINAL_METHOD // The original straightforward MiddleMul1 implementation
ORIGINAL_TWEAKED <AMD default> // The original MiddleMul1 implementation tweaked to save two multiplies
ORIG_MIDDLEMUL2 <default> // The original straightforward MiddleMul2 implementation
CHEBYSHEV_MIDDLEMUL2 // Uses fewer floating point ops than original MiddleMul2 implementation (worse accuracy?)
ORIG_SLOWTRIG // Use the compliler's implementation of sin/cos functions
NEW_SLOWTRIG <default> // Our own sin/cos implementation
MORE_ACCURATE <AMD default> // Our own sin/cos implementation with extra accuracy (should be needlessly slower, but isn't)
LESS_ACCURATE <nVidia default> // Opposite of MORE_ACCURATE
*/ [/CODE]It's not clear to me which combinations make sense and which don't. |
[QUOTE=wfgarnett3;535242]what flags should I type at the command line other than gpuowl-win.exe to see if my per iteration time improves?[/QUOTE]
With a 1050 you shouldn't expect a significant speedup, since it is bottlenecked by the GPU's double-precision capability rather than memory bandwidth, which is what most of the recent code optimization addresses. All the necessary flags should already be enabled if you are using the newest version. What I recommend is to use MSI Afterburner and push the core clock as high as possible (and maybe a bit of memory too, though I don't think that will be significant). |
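A rough roofline-style sanity check of the "DP-bound, not bandwidth-bound" claim. The GTX 1050 figures here (~1.8 TFLOPS FP32, 1/32 FP64 ratio for consumer Pascal, ~112 GB/s memory bandwidth) are approximate public-spec assumptions, not measurements:

```python
# Back-of-envelope check of whether a GPU is FP64-bound or bandwidth-bound
# for an FFT-based workload. GTX 1050 figures are rough spec assumptions.
fp32_tflops = 1.8        # approximate FP32 throughput
fp64_ratio = 1 / 32      # consumer Pascal FP64:FP32 ratio
bandwidth_gbs = 112.0    # approximate memory bandwidth

fp64_gflops = fp32_tflops * 1000 * fp64_ratio   # usable FP64 GFLOPS
# Machine balance: bytes of bandwidth available per FP64 flop. A high
# value means the memory system can easily keep the FP64 units fed,
# i.e. the card is compute (FP64) bound for this workload.
bytes_per_flop = bandwidth_gbs / fp64_gflops

print(f"FP64: {fp64_gflops:.0f} GFLOPS, {bytes_per_flop:.1f} B/flop")
```

Around 2 bytes of bandwidth per FP64 flop is generous for an FFT, which is consistent with the post: memory-bandwidth-oriented optimizations won't move the needle much on this card.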
gpuowl-v6.11-132-gfd01ee5 Windows build
2 Attachment(s)
Here it is. The usual shower of warnings reappeared during the build. Untested so far except for help output.
|
[QUOTE=kriesel;535254]Here it is. The usual shower of warnings reappeared during the build. Untested so far except for help output.[/QUOTE]
Warnings such as these are pretty benign: [QUOTE]File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)': File.h:33:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=] log("Can't open '%s' (mode '%s')\n", name.c_str(), mode); [/QUOTE] |
[QUOTE=Prime95;535003]nVidia change coming (pending preda's approval of my last commit).
I've gone through all the nVidia timings posted the last 2 months in an attempt to come up with reasonable default settings for nVidia GPUs. The new defaults will be:
WORKINGIN4 (was WORKINGIN5)
WORKINGOUT4 (was WORKINGOUT3)
T2_SHUFFLE (was T2_SHUFFLE_REVERSELINE)
CARRY64 (was CARRY32)
FANCY_MIDDLEMUL1 (was ORIGINAL_TWEAKED)
LESS_ACCURATE (was MORE_ACCURATE)
The UNROLL_ALL default was not changed.
Note FANCY_MIDDLEMUL1 is only implemented for MIDDLE=10,11. Otherwise, the default is ORIGINAL_TWEAKED.[/QUOTE] What happens if a -use option is specified that does not apply to the fft length, such as specifying FANCY_MIDDLEMUL1 for MIDDLE other than 10 or 11? Do we know that the performance effects of the numerous options are independent, e.g. that the optimal workingin and workingout don't change as a result of the other options being changed? [QUOTE]// Use the [B]compliler[/B]'s implementation of sin/cos functions[/QUOTE]Is that a compiler that lies about errors and warnings?:smile: |
[QUOTE=kriesel;535266]What happens if a -use option is specified that does not apply for the fft length, such as specifying FANCY_MIDDLEMUL1 for MIDDLE other than 10 or 11?[/QUOTE]
I think you'll get an error message. Try it. You could also do "-use FANCY_MIDDLEMUL1,ORIGINAL_TWEAKED" to get fancy middlemul1 for middle=10,11 and original tweaked middle mul1 otherwise. |
900M P-1
Same Colab-style run; 3.42 days of computing time logged for the two stages combined. Stage 2 was 88.8% the length of stage 1. Fft length 57344K, 19 buffers.
[URL]https://www.mersenne.org/report_exponent/?exp_lo=900000107&exp_hi=&full=1[/URL] [QUOTE=kriesel;534386]Fan Ming build of gpuowl, 800M P-1 on Tesla P100, 2.35 days running time for both stages, [URL]https://www.mersenne.org/report_exponent/?exp_lo=800000027&full=1[/URL][/QUOTE] |
P-1 stage2 speed-up
I just committed a tiny change that should significantly speed up the second stage of P-1. (I tested with ROCm 2.10.)
[url]https://github.com/preda/gpuowl/commit/1e0ce1d8abf9f8b189373085a6cbdc2e2d814d33[/url] The ROCm optimizer bug is described here [url]https://github.com/RadeonOpenCompute/ROCm/issues/1002[/url] |
[QUOTE=preda;535445]I just committed a tiny change that should significantly speed up the second stage of P-1. (I tested with ROCm 2.10.)
[URL]https://github.com/preda/gpuowl/commit/1e0ce1d8abf9f8b189373085a6cbdc2e2d814d33[/URL][/QUOTE]The commit's changes seem to be ROCm-specific. |
[QUOTE=kriesel;535448]Commit changes seem to be rocm-specific.[/QUOTE]
Yes as I said, I tested (i.e. measured) with ROCm 2.10. Should not regress on other platforms, but I'm looking for feedback on this. If a regression is detected (e.g. on Nvidia) I'll switch the change on/off as appropriate. |
[QUOTE=preda;535450]Yes as I said, I tested (i.e. measured) with ROCm 2.10. Should not regress on other platforms, but I'm looking for feedback on this. If a regression is detected (e.g. on Nvidia) I'll switch the change on/off as appropriate.[/QUOTE]Coded for and tested on are two different things. Thanks for all you do.
This is timely, as I was just considering rolling through a slew of GPU models with the minor gpuowl updates and updated -use option timing scripts on PRP. |
gpuowl-v6.11-134-g1e0ce1d Windows x64 build
2 Attachment(s)
Only ran -h so far, but here it is. This is the latest commit at the moment, that has Preda's P-1 stage 2 tweak.
|
RX550 -use testing
[CODE]gpuowl v6.11-134
RX550 4GB Win7 x64 exponent 92400689 PRP 5M fft -iters 10000 -time
NO_ASM 14491
NO_ASM,UNROLL_ALL 14492
NO_ASM,UNROLL_NONE 14364
NO_ASM,UNROLL_WIDTH 14363
NO_ASM,UNROLL_HEIGHT 14360 *
NO_ASM,UNROLL_MIDDLEMUL1 14412
NO_ASM,UNROLL_MIDDLEMUL2 14363
NO_ASM,UNROLL_WIDTH,UNROLL_HEIGHT 14369
NO_ASM,UNROLL_WIDTH,UNROLL_MIDDLEMUL2 14363
NO_ASM,UNROLL_HEIGHT,UNROLL_MIDDLEMUL2 14361
NO_ASM,UNROLL_WIDTH,UNROLL_HEIGHT,UNROLL_MIDDLEMUL2 14362
NO_ASM,MERGED_MIDDLE,WORKINGIN 19729
NO_ASM,MERGED_MIDDLE,WORKINGIN 19730
NO_ASM,MERGED_MIDDLE,WORKINGIN1 14683
NO_ASM,MERGED_MIDDLE,WORKINGIN1A 14573
NO_ASM,MERGED_MIDDLE,WORKINGIN2 14849
NO_ASM,MERGED_MIDDLE,WORKINGIN3 15175
NO_ASM,MERGED_MIDDLE,WORKINGIN4 19404
NO_ASM,MERGED_MIDDLE,WORKINGIN5 14487 *
NO_ASM,MERGED_MIDDLE,WORKINGOUT 32143
NO_ASM,MERGED_MIDDLE,WORKINGOUT0 17920
NO_ASM,MERGED_MIDDLE,WORKINGOUT1 14866
NO_ASM,MERGED_MIDDLE,WORKINGOUT1A 14825
NO_ASM,MERGED_MIDDLE,WORKINGOUT2 14395 *
NO_ASM,MERGED_MIDDLE,WORKINGOUT3 14496
NO_ASM,MERGED_MIDDLE,WORKINGOUT4 15450
NO_ASM,MERGED_MIDDLE,WORKINGOUT5 15736
NO_ASM,MERGED_MIDDLE,%wkgin%,%wkgout%,T2_SHUFFLE_WIDTH 14554
NO_ASM,MERGED_MIDDLE,%wgkin%,%wkgout%,T2_SHUFFLE_MIDDLE 14319 *
NO_ASM,MERGED_MIDDLE,%wkgin%,%wkgout%,T2_SHUFFLE_HEIGHT 14364
NO_ASM,MERGED_MIDDLE,%wkgin%,%wkgout%,T2_SHUFFLE_REVERSELINE 14394
NO_ASM,MERGED_MIDDLE,%wkgin%,%wkgout%,T2_SHUFFLE 14483
NO_ASM,MERGED_MIDDLE,%wgkin%,%wkgout%,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_MIDDLE 14309 *
NO_ASM,MERGED_MIDDLE,%wgkin%,%wkgout%,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_MIDDLE 18362
NO_ASM,MERGED_MIDDLE,%wgkin%,%wkgout%,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_MIDDLE,CARRY32 14326 *
NO_ASM,MERGED_MIDDLE,%wgkin%,%wkgout%,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_MIDDLE,CARRY64 14965
%allotheroptions%,FANCY_MIDDLEMUL1 14320
%allotheroptions%,MORE_SQUARES_MIDDLEMUL1 14398
%allotheroptions%,CHEBYSHEV_METHOD EE on load
%allotheroptions%,CHEBYSHEV_METHOD_FMA EE on load
%allotheroptions%,ORIGINAL_METHOD 14318 *
%allotheroptions%,ORIGINAL_TWEAKED 14321
%allotheroptions%,ORIG_MIDDLEMUL2 14315
%allotheroptions%,CHEBYSHEV_MIDDLEMUL2 14309 *
%allotheroptions%,ORIG_SLOWTRIG 14772
%allotheroptions%,NEW_SLOWTRIG 14306
%allotheroptions%,MORE_ACCURATE 14309
%allotheroptions%,LESS_ACCURATE 14184 *
NO_ASM,UNROLL_HEIGHT,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT2,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_MIDDLE,CARRY32,ORIGINAL_METHOD,LESS_ACCURATE 14152 *[/CODE]14491/14152 = 1.024 |
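The closing ratio (14491/14152 = 1.024) is the whole-table summary: untuned baseline over best tuned time. A throwaway helper for computing it from any of these tables:

```python
# Tiny helper: speedup of the best tuned timing over the untuned baseline.
# Inputs are us/it values like the ones in the RX550 table above.
def tuning_gain(baseline_us, best_us):
    """Return the speedup ratio and the percent improvement."""
    ratio = baseline_us / best_us
    return ratio, 100 * (ratio - 1)

ratio, pct = tuning_gain(14491, 14152)
print(f"ratio {ratio:.3f}, gain {pct:.1f}%")
```

Running it on the RX550 numbers reproduces the ~2.4% gain noted above.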
ending in 2
I have just downloaded a bunch of world record PRPs and some end in 0 and others end in 2. Will "program":{"name":"gpuowl", "version":"v6.11-124-g267cc60"} perform P-1 automatically on those ending with "2" or do I need to upgrade gpuOwl?
|
gpuowl v6.11-134-g1e0ce1d RX480 -use option timings
[CODE]gpuowl v6.11-134-g1e0ce1d
RX480 8GB Win7 x64 exponent 92162731 PRP 5M fft -iters 10000 -time
NO_ASM 3372, 3374
NO_ASM,UNROLL_ALL 3375
NO_ASM,UNROLL_NONE 3349
NO_ASM,UNROLL_WIDTH 3351
NO_ASM,UNROLL_HEIGHT 3344 *
NO_ASM,UNROLL_MIDDLEMUL1 3352
NO_ASM,UNROLL_MIDDLEMUL2 3373
NO_ASM,UNROLL_WIDTH,UNROLL_HEIGHT 3337 *
NO_ASM,UNROLL_WIDTH,UNROLL_MIDDLEMUL2 3374
NO_ASM,UNROLL_HEIGHT,UNROLL_MIDDLEMUL2 3365
NO_ASM,UNROLL_WIDTH,UNROLL_HEIGHT,UNROLL_MIDDLEMUL2 3370
NO_ASM,MERGED_MIDDLE,WORKINGIN 5991
NO_ASM,MERGED_MIDDLE,WORKINGIN 6011
NO_ASM,MERGED_MIDDLE,WORKINGIN1 3397 *
NO_ASM,MERGED_MIDDLE,WORKINGIN1A 3426
NO_ASM,MERGED_MIDDLE,WORKINGIN2 3478
NO_ASM,MERGED_MIDDLE,WORKINGIN3 3473
NO_ASM,MERGED_MIDDLE,WORKINGIN4 3821
NO_ASM,MERGED_MIDDLE,WORKINGIN5 3365
NO_ASM,MERGED_MIDDLE,WORKINGOUT 5835
NO_ASM,MERGED_MIDDLE,WORKINGOUT0 4543
NO_ASM,MERGED_MIDDLE,WORKINGOUT1 3352 *
NO_ASM,MERGED_MIDDLE,WORKINGOUT1A 3384
NO_ASM,MERGED_MIDDLE,WORKINGOUT2 3739
NO_ASM,MERGED_MIDDLE,WORKINGOUT3 3365
NO_ASM,MERGED_MIDDLE,WORKINGOUT4 3468
NO_ASM,MERGED_MIDDLE,WORKINGOUT5 3427
NO_ASM,MERGED_MIDDLE,%wkgin%,%wkgout%,T2_SHUFFLE_WIDTH 3383 *
NO_ASM,MERGED_MIDDLE,%wgkin%,%wkgout%,T2_SHUFFLE_MIDDLE 3394
NO_ASM,MERGED_MIDDLE,%wkgin%,%wkgout%,T2_SHUFFLE_HEIGHT 3390
NO_ASM,MERGED_MIDDLE,%wkgin%,%wkgout%,T2_SHUFFLE_REVERSELINE 3395
NO_ASM,MERGED_MIDDLE,%wkgin%,%wkgout%,T2_SHUFFLE 3436
set allotheroptions=NO_ASM,UNROLL_HEIGHT,UNROLL_WIDTH,WORKINGIN1,WORKINGOUT1
%allotheroptions%,T2_SHUFFLE_WIDTH,T2_SHUFFLE_HEIGHT 3341 *
%allotheroptions%,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_WIDTH 3353
%allotheroptions%,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_WIDTH,T2_REVERSELINE 3351
set allotheroptions=NO_ASM,MERGED_MIDDLE,UNROLL_HEIGHT,UNROLL_WIDTH,WORKINGIN1,WORKINGOUT1,T2_SHUFFLE_WIDTH,T2_SHUFFLE_HEIGHT
%allotheroptions%,CARRY32 3356 *
%allotheroptions%,CARRY64 3479
set allotheroptions=NO_ASM,MERGED_MIDDLE,UNROLL_HEIGHT,UNROLL_WIDTH,WORKINGIN1,WORKINGOUT1,T2_SHUFFLE_WIDTH,T2_SHUFFLE_HEIGHT,CARRY32
%allotheroptions%,FANCY_MIDDLEMUL1 3349
%allotheroptions%,MORE_SQUARES_MIDDLEMUL1 3341 *
%allotheroptions%,CHEBYSHEV_METHOD EE on load
%allotheroptions%,CHEBYSHEV_METHOD_FMA 3350
%allotheroptions%,ORIGINAL_METHOD 3356
%allotheroptions%,ORIGINAL_TWEAKED 3348
%allotheroptions%,ORIG_MIDDLEMUL2 3434
%allotheroptions%,CHEBYSHEV_MIDDLEMUL2 3357 *
%allotheroptions%,ORIG_SLOWTRIG EE
%allotheroptions%,NEW_SLOWTRIG 3360 *
%allotheroptions%,MORE_ACCURATE 3362
%allotheroptions%,LESS_ACCURATE EE
NO_ASM,UNROLL_HEIGHT,UNROLL_WIDTH,MERGED_MIDDLE,WORKINGIN1,WORKINGOUT1,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_WIDTH,CARRY32,MORE_SQUARES_MIDDLEMUL1,CHEBYSHEV_MIDDLEMUL2,NEW_SLOWTRIG[/CODE] |
[QUOTE=paulunderwood;535614]I have just downloaded a bunch of world record PRPs and some end in 0 and others end in 2. Will "program":{"name":"gpuowl", "version":"v6.11-124-g267cc60"} perform P-1 automatically on those ending with "2" or do I need to upgrade gpuOwl?[/QUOTE]
Yes, that version is after the automatic P-1 commit, so you should be fine. However, the latest commit from 2 days ago implements a change that results in a 33% speed improvement in P-1 stage 2. |
[QUOTE=PhilF;535616]Yes, that version is after the automatic P-1 commit, so you should be fine. However, the latest commit from 2 days ago implements a change that results in a 33% speed improvement in P-1 stage 2.[/QUOTE]
Got it! :tu: |
[QUOTE=PhilF;535616]Yes, that version is after the automatic P-1 commit, so you should be fine. However, the latest commit from 2 days ago implements a change that results in a 33% speed improvement in P-1 stage 2.[/QUOTE]
Correction, it's a 33% speed-up of one kernel (tailFusedMulDelta) that was taking up 45% of stage-2 time before. So it's more like a 12% speed-up of stage 2 (I hope). |
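That estimate is just Amdahl's law applied to the kernel's share of stage-2 time. A quick check using the figures from the post (45% share, kernel itself 1.33x faster):

```python
# Amdahl's-law sanity check for the stage-2 speedup estimate above.
kernel_share = 0.45    # tailFusedMulDelta's fraction of stage-2 time
kernel_speedup = 1.33  # that kernel alone runs ~33% faster

# New total time relative to old: untouched part plus sped-up kernel part.
new_time = (1 - kernel_share) + kernel_share / kernel_speedup
overall_speedup = 1 / new_time

print(f"overall stage-2 speedup ~ {100 * (overall_speedup - 1):.1f}%")
```

The result lands around 12-13%, matching the corrected figure.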
gpuowl v6.11-134-g1e0ce1d -use options on GTX1080
[CODE]gpuowl v6.11-134-g1e0ce1d
GTX1080 8GB Win7 x64 exponent 91996859 PRP 5M fft -iters 10000 -time
NO_ASM 4541, 4560
NO_ASM,UNROLL_ALL 4542
NO_ASM,UNROLL_NONE BUILD_PROGRAM_FAILURE
NO_ASM,UNROLL_WIDTH BUILD_PROGRAM_FAILURE
NO_ASM,UNROLL_HEIGHT BUILD_PROGRAM_FAILURE
NO_ASM,UNROLL_MIDDLEMUL1 BUILD_PROGRAM_FAILURE
NO_ASM,UNROLL_MIDDLEMUL2 BUILD_PROGRAM_FAILURE
NO_ASM,UNROLL_WIDTH,UNROLL_HEIGHT BUILD_PROGRAM_FAILURE
NO_ASM,UNROLL_WIDTH,UNROLL_MIDDLEMUL2 BUILD_PROGRAM_FAILURE
NO_ASM,UNROLL_HEIGHT,UNROLL_MIDDLEMUL2 BUILD_PROGRAM_FAILURE
NO_ASM,UNROLL_WIDTH,UNROLL_HEIGHT,UNROLL_MIDDLEMUL2 4554
NO_ASM,MERGED_MIDDLE,WORKINGIN 4590
NO_ASM,MERGED_MIDDLE,WORKINGIN 5006
NO_ASM,MERGED_MIDDLE,WORKINGIN1 4574
NO_ASM,MERGED_MIDDLE,WORKINGIN1A 4666
NO_ASM,MERGED_MIDDLE,WORKINGIN2 4541
NO_ASM,MERGED_MIDDLE,WORKINGIN3 4548
NO_ASM,MERGED_MIDDLE,WORKINGIN4 4539 *
NO_ASM,MERGED_MIDDLE,WORKINGIN5 4594
NO_ASM,MERGED_MIDDLE,WORKINGOUT 4615
NO_ASM,MERGED_MIDDLE,WORKINGOUT0 4622
NO_ASM,MERGED_MIDDLE,WORKINGOUT1 4587
NO_ASM,MERGED_MIDDLE,WORKINGOUT1A 4654
NO_ASM,MERGED_MIDDLE,WORKINGOUT2 4614
NO_ASM,MERGED_MIDDLE,WORKINGOUT3 4587
NO_ASM,MERGED_MIDDLE,WORKINGOUT4 4555 *
NO_ASM,MERGED_MIDDLE,WORKINGOUT5 4602
NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4 4533 *
NO_ASM,MERGED_MIDDLE,%wkgin%,%wkgout%,T2_SHUFFLE_WIDTH 4599
NO_ASM,MERGED_MIDDLE,%wgkin%,%wkgout%,T2_SHUFFLE_MIDDLE 4646
NO_ASM,MERGED_MIDDLE,%wkgin%,%wkgout%,T2_SHUFFLE_HEIGHT 4591
NO_ASM,MERGED_MIDDLE,%wkgin%,%wkgout%,T2_SHUFFLE_REVERSELINE 4605
NO_ASM,MERGED_MIDDLE,%wkgin%,%wkgout%,T2_SHUFFLE 4548
set allotheroptions=NO_ASM,UNROLL_ALL,WORKINGIN4,WORKINGOUT4
%allotheroptions%,T2_SHUFFLE_WIDTH,T2_SHUFFLE_HEIGHT 4537
%allotheroptions%,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_WIDTH 4558
%allotheroptions%,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_WIDTH,T2_REVERSELINE 4517 *
set allotheroptions=NO_ASM,MERGED_MIDDLE,UNROLL_ALL,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_WIDTH,T2_REVERSELINE
%allotheroptions%,CARRY32 4698
%allotheroptions%,CARRY64 4559 *
set allotheroptions=NO_ASM,MERGED_MIDDLE,UNROLL_HEIGHT,UNROLL_WIDTH,WORKINGIN1,WORKINGOUT1,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_WIDTH,T2_REVERSELINE,CARRY32
%allotheroptions%,FANCY_MIDDLEMUL1 4531
%allotheroptions%,MORE_SQUARES_MIDDLEMUL1 4542
%allotheroptions%,CHEBYSHEV_METHOD 4492
%allotheroptions%,CHEBYSHEV_METHOD_FMA 4483 *
%allotheroptions%,ORIGINAL_METHOD 4546
%allotheroptions%,ORIGINAL_TWEAKED 4579
%allotheroptions%,ORIG_MIDDLEMUL2 4518
%allotheroptions%,CHEBYSHEV_MIDDLEMUL2 4445 *
%allotheroptions%,ORIG_SLOWTRIG 4552
%allotheroptions%,NEW_SLOWTRIG 4447
%allotheroptions%,MORE_ACCURATE 4438
%allotheroptions%,LESS_ACCURATE 4428 *
-use NO_ASM,UNROLL_ALL,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_WIDTH,T2_REVERSELINE,CARRY64,CHEBYSHEV_METHOD_FMA,CHEBYSHEV_MIDDLEMUL2,LESS_ACCURATE
4550/4428 =~ 1.0276[/CODE] |
gpuowl v6.11-134 -use options testing on Radeon VII
Substantial tuning gained about 3% above program defaults.[CODE]gpuowl v6.11-134-g1e0ce1d
Radeon VII 16GB at 1244MHz gpu clock, 880MHz memory clock, Win10 x64, exponent 92561231 PRP 5M fft -iters 10000 -time
NO_ASM 1017
NO_ASM 1014
NO_ASM,UNROLL_ALL 1015
NO_ASM,UNROLL_NONE 1001
NO_ASM,UNROLL_WIDTH 1002
NO_ASM,UNROLL_HEIGHT 1002
NO_ASM,UNROLL_MIDDLEMUL1 1013
NO_ASM,UNROLL_MIDDLEMUL2 989 *
NO_ASM,MERGED_MIDDLE,WORKINGIN 1393
NO_ASM,MERGED_MIDDLE,WORKINGIN 1391
NO_ASM,MERGED_MIDDLE,WORKINGIN1 1035
NO_ASM,MERGED_MIDDLE,WORKINGIN1A 1032
NO_ASM,MERGED_MIDDLE,WORKINGIN2 1038
NO_ASM,MERGED_MIDDLE,WORKINGIN3 1023
NO_ASM,MERGED_MIDDLE,WORKINGIN4 1081
NO_ASM,MERGED_MIDDLE,WORKINGIN5 1010 *
NO_ASM,MERGED_MIDDLE,WORKINGOUT 1177
NO_ASM,MERGED_MIDDLE,WORKINGOUT0 1133
NO_ASM,MERGED_MIDDLE,WORKINGOUT1 1028
NO_ASM,MERGED_MIDDLE,WORKINGOUT1A 1058
NO_ASM,MERGED_MIDDLE,WORKINGOUT2 1117
NO_ASM,MERGED_MIDDLE,WORKINGOUT3 1011 *
NO_ASM,MERGED_MIDDLE,WORKINGOUT4 1042
NO_ASM,MERGED_MIDDLE,WORKINGOUT5 1026
set wkgin=WORKINGIN5
set wkgout=WORKINGOUT3
NO_ASM,UNROLL_WIDTH,UNROLL_HEIGHT 1000
NO_ASM,UNROLL_WIDTH,UNROLL_MIDDLEMUL2 987
NO_ASM,UNROLL_HEIGHT,UNROLL_MIDDLEMUL2 989
NO_ASM,UNROLL_WIDTH,UNROLL_HEIGHT,UNROLL_MIDDLEMUL2 986 *
NO_ASM,MERGED_MIDDLE,%wkgin%,%wkgout%,T2_SHUFFLE_WIDTH 1017
NO_ASM,MERGED_MIDDLE,%wgkin%,%wkgout%,T2_SHUFFLE_MIDDLE 1022
NO_ASM,MERGED_MIDDLE,%wkgin%,%wkgout%,T2_SHUFFLE_HEIGHT 1012
NO_ASM,MERGED_MIDDLE,%wkgin%,%wkgout%,T2_SHUFFLE_REVERSELINE 1011 *
NO_ASM,MERGED_MIDDLE,%wkgin%,%wkgout%,T2_SHUFFLE 1029
set allotheroptions=NO_ASM,MERGED_MIDDLE,UNROLL_HEIGHT,UNROLL_WIDTH,WORKINGIN5,WORKINGOUT3
%allotheroptions%,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_HEIGHT 1014 *
%allotheroptions%,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_WIDTH 1016
%allotheroptions%,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_WIDTH,T2_SHUFFLE_REVERSELINE 1018
%allotheroptions%,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_REVERSELINE 1020
%allotheroptions%,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_WIDTH,T2_SHUFFLE_REVERSELINE 1028
set allotheroptions=NO_ASM,MERGED_MIDDLE,UNROLL_HEIGHT,UNROLL_WIDTH,WORKINGIN1,WORKINGOUT1,T2_SHUFFLE_WIDTH,T2_SHUFFLE_HEIGHT
%allotheroptions%,CARRY32 989 *
%allotheroptions%,CARRY64 1022
set allotheroptions=NO_ASM,MERGED_MIDDLE,UNROLL_HEIGHT,UNROLL_WIDTH,WORKINGIN1,WORKINGOUT1,T2_SHUFFLE_WIDTH,T2_SHUFFLE_HEIGHT,CARRY32
%allotheroptions%,FANCY_MIDDLEMUL1 1011
%allotheroptions%,MORE_SQUARES_MIDDLEMUL1 991
%allotheroptions%,CHEBYSHEV_METHOD 989 *
%allotheroptions%,CHEBYSHEV_METHOD_FMA 989 *
%allotheroptions%,ORIGINAL_METHOD 991
%allotheroptions%,ORIGINAL_TWEAKED 990
%allotheroptions%,ORIG_MIDDLEMUL2 987 *
%allotheroptions%,CHEBYSHEV_MIDDLEMUL2 988
%allotheroptions%,ORIG_SLOWTRIG 1022
%allotheroptions%,NEW_SLOWTRIG 988
%allotheroptions%,MORE_ACCURATE 988
%allotheroptions%,LESS_ACCURATE 986 *
NO_ASM,UNROLL_MIDDLEMUL2,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_HEIGHT,CARRY32,CHEBYSHEV_METHOD,ORIG_MIDDLEMUL2,LESS_ACCURATE
repeatability +-1.5/1015.5 = 0.148%
base 1015.5
final 986
ratio 1015.5/986 = 1.030
timing overhead ~986/974-1 =~ .0123[/CODE] |
[QUOTE=Prime95;535288]I think you'll get an error message. Try it.
You could also do "-use FANCY_MIDDLEMUL1,ORIGINAL_TWEAKED" to get fancy middlemul1 for middle=10,11 and original tweaked middle mul1 otherwise.[/QUOTE] Yup. Instant trouble, a real showstopper.[CODE]2020-01-23 16:54:19 condorella/rx480 82053239 FFT 4608K: Width 256x4, Height 64x4, Middle 9; 17.39 bits/word
2020-01-23 16:54:19 condorella/rx480 OpenCL args "-DEXP=82053239u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=9u -DWEIGHT_STEP=0xc.373107b1f3e78p-3 -DIWEIGHT_STEP=0xa.7a792f1683b7p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DAMDGPU=1 -DCARRY32=1 -DCHEBYSHEV_MIDDLEMUL2=1 -DMERGED_MIDDLE=1 -DMORE_SQUARES_MIDDLEMUL1=1 -DNEW_SLOWTRIG=1 -DNO_ASM=1 -DT2_SHUFFLE_HEIGHT=1 -DT2_SHUFFLE_WIDTH=1 -DUNROLL_HEIGHT=1 -DUNROLL_WIDTH=1 -DWORKINGIN1=1 -DWORKINGOUT1=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-01-23 16:54:22 condorella/rx480 OpenCL compilation in 2.65 s
2020-01-23 16:54:24 condorella/rx480 82053239 EE 0 loaded: blockSize 400, a8c3b11429b46cbf (expected 0000000000000003)
2020-01-23 16:54:24 condorella/rx480 Exiting because "error on load"
2020-01-23 16:54:24 condorella/rx480 Bye[/CODE]Next step: coding in gpuowl to apply -use options only when they are legal and nonfatal for the applicable fft length? Support for fft-length conditionals in config.txt? This is a PRP-DC at 4608K fft length, to which the 5M-fft-length-optimal -use options were applied from config.txt, with fatal result. I'm not looking forward to tuning a long list of -use options fft length by fft length for numerous gpu models and swapping them out manually when exponents change, or to having such crashes discard 18 hours of gpu time instead of making progress. At 3-5% speedup on many models, it takes a long time to pay that back. |
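Until gpuowl itself supports per-FFT-length conditionals, a launcher wrapper is one conceivable workaround. A hypothetical sketch (the FFT-to-flags table and the fallback default here are illustrative placeholders, not tuned recommendations or gpuowl features):

```python
# Hypothetical workaround: pick tuned -use flags per FFT length before
# launching gpuowl, instead of relying on a single global config.txt.
# The table entries are illustrative placeholders only.
USE_BY_FFT = {
    "4608K": "NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3",
    "5M":    "NO_ASM,MERGED_MIDDLE,WORKINGIN1,WORKINGOUT1,CARRY32",
}

def gpuowl_args(fft, default_use="NO_ASM"):
    """Build a gpuowl command line with flags chosen for the FFT length.

    Unknown FFT lengths fall back to a safe default instead of carrying
    over another length's (possibly fatal) tuned options.
    """
    use = USE_BY_FFT.get(fft, default_use)
    return ["gpuowl-win", "-use", use]

cmd = gpuowl_args("4608K")
print(" ".join(cmd))
# A real wrapper would launch the command, e.g. via subprocess.run(cmd).
```

The key design point is the fallback: an FFT length with no tuned entry gets conservative flags rather than inheriting a mismatched set, avoiding the "error on load" crash shown above.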
Gpuowl -use options tune on RX480 for 4.5M fft length
It likes a somewhat different combination than for 5M.[QUOTE]gpuowl v6.11-134-g1e0ce1d
RX480 8GB Win7 x64 exponent 82053239 PRP 4.5M fft -iters 10000 -time
all timings below are in microsec/iteration
NO_ASM 3021
NO_ASM 3022
NO_ASM,UNROLL_ALL 3010 *
NO_ASM,UNROLL_NONE 3039
NO_ASM,UNROLL_WIDTH 3035
NO_ASM,UNROLL_HEIGHT 3038
NO_ASM,UNROLL_MIDDLEMUL1 3036
NO_ASM,UNROLL_MIDDLEMUL2 3027
NO_ASM,UNROLL_WIDTH,UNROLL_MIDDLEMUL1 3028
NO_ASM,UNROLL_WIDTH,UNROLL_MIDDLEMUL2 3019, 3028
NO_ASM,NO_ASM,UNROLL_MIDDLEMUL2,UNROLL_MIDDLEMUL1 2989 *
NO_ASM,UNROLL_WIDTH,UNROLL_MIDDLEMUL1,UNROLL_MIDDLEMUL2 2996
NO_ASM,MERGED_MIDDLE,WORKINGIN 5309
NO_ASM,MERGED_MIDDLE,WORKINGIN 5306
NO_ASM,MERGED_MIDDLE,WORKINGIN1 3032
NO_ASM,MERGED_MIDDLE,WORKINGIN1A 3052
NO_ASM,MERGED_MIDDLE,WORKINGIN2 3111
NO_ASM,MERGED_MIDDLE,WORKINGIN3 3133
NO_ASM,MERGED_MIDDLE,WORKINGIN4 3454
NO_ASM,MERGED_MIDDLE,WORKINGIN5 2995 *
NO_ASM,MERGED_MIDDLE,WORKINGOUT 5224
NO_ASM,MERGED_MIDDLE,WORKINGOUT0 4036
NO_ASM,MERGED_MIDDLE,WORKINGOUT1 2984 *
NO_ASM,MERGED_MIDDLE,WORKINGOUT1A 3012/2982
NO_ASM,MERGED_MIDDLE,WORKINGOUT2 3353
NO_ASM,MERGED_MIDDLE,WORKINGOUT3 2986
NO_ASM,MERGED_MIDDLE,WORKINGOUT4 3137
NO_ASM,MERGED_MIDDLE,WORKINGOUT5 2995
NO_ASM,MERGED_MIDDLE,%wkgin%,%wkgout% 2973
NO_ASM,MERGED_MIDDLE,%wkgin%,%wkgout%,T2_SHUFFLE_WIDTH 2957 *
NO_ASM,MERGED_MIDDLE,%wgkin%,%wkgout%,T2_SHUFFLE_MIDDLE 3026
NO_ASM,MERGED_MIDDLE,%wkgin%,%wkgout%,T2_SHUFFLE_HEIGHT 2966
NO_ASM,MERGED_MIDDLE,%wkgin%,%wkgout%,T2_SHUFFLE_REVERSELINE 2972
NO_ASM,MERGED_MIDDLE,%wkgin%,%wkgout%,T2_SHUFFLE 2992
set allotheroptions=NO_ASM,WORKINGIN5,WORKINGOUT1,UNROLL_MIDDLEMUL2,UNROLL_MIDDLEMUL1
%allotheroptions%,T2_SHUFFLE_WIDTH,T2_SHUFFLE_HEIGHT 2938 *
%allotheroptions%,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_WIDTH 2989
%allotheroptions%,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_WIDTH,T2_SHUFFLE_REVERSELINE 2987
set allotheroptions=NO_ASM,MERGED_MIDDLE,UNROLL_HEIGHT,UNROLL_WIDTH,WORKINGIN1,WORKINGOUT1,T2_SHUFFLE_WIDTH,T2_SHUFFLE_HEIGHT
%allotheroptions%,CARRY32 2940 *
%allotheroptions%,CARRY64 3054
set allotheroptions=NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT1,T2_SHUFFLE_WIDTH,T2_SHUFFLE_HEIGHT,UNROLL_MIDDLEMUL2,UNROLL_MIDDLEMUL1,CARRY32
%allotheroptions%,FANCY_MIDDLEMUL1 "error: Clang front-end compilation failed!"
%allotheroptions%,MORE_SQUARES_MIDDLEMUL1 2985
%allotheroptions%,CHEBYSHEV_METHOD 2919
%allotheroptions%,CHEBYSHEV_METHOD_FMA 2911 *
%allotheroptions%,ORIGINAL_METHOD 2942
%allotheroptions%,ORIGINAL_TWEAKED 2937
set allotheroptions=NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT1,T2_SHUFFLE_WIDTH,T2_SHUFFLE_HEIGHT,UNROLL_MIDDLEMUL2,UNROLL_MIDDLEMUL1,CARRY32,CHEBYSHEV_METHOD_FMA
%allotheroptions%,ORIG_MIDDLEMUL2 2926
%allotheroptions%,CHEBYSHEV_MIDDLEMUL2 2916 *
%allotheroptions%,ORIG_SLOWTRIG 3058
%allotheroptions%,NEW_SLOWTRIG 2910
%allotheroptions%,MORE_ACCURATE 2921
%allotheroptions%,LESS_ACCURATE 2909 *
NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT1,T2_SHUFFLE_WIDTH,T2_SHUFFLE_HEIGHT,UNROLL_MIDDLEMUL2,UNROLL_MIDDLEMUL1,CARRY32,CHEBYSHEV_METHOD_FMA,CHEBYSHEV_MIDDLEMUL2,LESS_ACCURATE
base 3021.5
repeatability +-1.5/5307.5 =~ +-0.028%
best 2909
ratio 3021.5/2909 = 1.039[/QUOTE] |
GTX1660 -use options
Hi!
Can someone help me? I have an Nvidia GTX1660 running gpuowl at around 8250 us/it (FFT 5632K). With some overclock I can get less than 8000 us/it, but I'm not sure how to test GPUs for errors or tune with -use options. More questions: 1. I'm considering buying 2x Radeon VII, or should I wait for Big Navi? 2. Anyone with AMD 5700 XT benchmarks to compare with the Radeon VII? 3. CudaLUCAS seems to run slower than gpuowl. Are there any other options? Thanks! |
[QUOTE=JCoveiro;536407]
1. I'm considering to buy 2x Radeon VII or should I wait for Big Navi? [/QUOTE] My expectation is that the Radeon VII will still be better than "big navi" because it has such good DP (FP64) throughput. Also, the memory is both large and fast. In addition, prices for the Radeon VII have moved down a bit. |
[QUOTE=JCoveiro;536407]Hi!
Can someone help me? I've a Nvidia GTX1660 running gpuowl at around 8250 us/it (FFT 5632K). With some overclock I can get less then 8000 us/it, but I'm not sure how to test gpus better for errors or tuning it with -use options. Can someone help me out? More questions: 1. I'm considering to buy 2x Radeon VII or should I wait for Big Navi? 2. Anyone with AMD 5700 XT benchmarks to compare with Radeon VII? 3. CudaLUCAS seems to run slower then gpuowl. Are there any other options? Thanks![/QUOTE] You have an Nvidia Turing GPU, which is amazing for trial factoring; the 1660 is particularly efficient in that workload due to its 1080 Ti-like performance at significantly lower power draw. Overclocking the core will help trial factoring, but overclocking the memory won't change anything except waste more power. Go ahead and try that out if you want to factor some numbers. A1: Definitely buy 2 Radeon VII over Big Navi. I seriously doubt AMD will put FP64 performance on Big Navi, since the norm right now for gaming GPUs is to cut down FP64 as much as possible to save die space for ray tracing or shaders. A2: I think OpenCL is still broken on Navi GPUs; it runs much more stably on GCN GPUs. Even if it's not broken, I am assuming the 5700 XT should perform slightly better than a stock Vega 56 in PRP, so around 3000us/it for 5632K FFT. A Radeon VII should get close to 1000us/it (I don't own one personally, but if I remember correctly from other owners' benchmarks). A3: gpuowl is already the fastest option for primality tests. Maybe future optimizations will make it even faster, but for now it's way faster than CUDALucas on memory-bound GPUs such as the Titan V or Radeon VII (the latter doesn't run CUDALucas at all, and gpuowl is 2x faster on the Titan V).
Either way, whether you own a modern Nvidia (supporting OpenCL 2.0 and above) or an AMD GPU, you should always run gpuowl over CUDALucas or CLLucas due to its superior error-checking algorithm, which could potentially eliminate the need for double checking. |
[QUOTE=preda;536412]My expectation is that Radeon VII will still be better than "big navi" because it has such a good DP (FP64) throughput. Also the memory is both large and fast. In addition to that, the prices for Radeon VII moved down a bit.[/QUOTE]
The prices of the Radeon VII are still high here. They're around 800€ each. But I think they're a good investment anyway (for this kind of project). I hope they come down a bit more, since AMD has discontinued them. |
[QUOTE=xx005fs;536419]You have an Nvidia Turing GPU which is amazing for trial factoring, and the 1660 is particularly efficient in that workload due to its 1080ti like performance but significantly lower power draw. In that case, overclocking the core will help trial factoring but memory won't change anything but waste more power. Go ahead and try that out if you want to factor some numbers.
A1: Definitely buy 2 radeon VII over big navi, I seriously doubt amd will put FP64 performance on big navi since the norm right now for gaming GPU is to cut down FP64 as much as possible to save die space for Ray Tracing or Shaders. A2: I think the OpenCL is still broken on Navi GPUs and run much more stably on GCN GPUs. Even if it's not broken I am assuming that the 5700xt should perform slightly better than a stock Vega 56 in PRP, so around 3000us/it for 5632K FFT. But Radeon VII should get it close to 1000us/it (I personally don't own one but if i remembered correctly from other owner's benchmarks). A3: gpuowl is already the fastest option for primality tests. Maybe future optimizations will make it even faster but for now it's going to be way faster than CUDALucas on memory bound GPUs such as Titan V or Radeon VII (in which the latter doesn't run on CUDALucas but gpuowl is 2x faster on Titan V). Though it doesn't matter if you own a modern Nvidia (supporting OpenCL 2.0 and above) or AMD GPU and you should always run gpuowl over CUDALucas or CLLucas due to its superior error checking algorithm that could potentially eliminate the need for double checking.[/QUOTE] Thanks for the answers! Well... the AMD 5700 XT is a lot cheaper than the Radeon VII, almost half the price. Also, 3000us/it for the 5700 XT is still good, but 1000us/it for the Radeon VII is awesome! |
1 Attachment(s)
[QUOTE=JCoveiro;536407]Hi!
Can someone help me? I've a Nvidia GTX1660 running gpuowl at around 8250 us/it (FFT 5632K). With some overclock I can get less then 8000 us/it, but I'm not sure how to test gpus better for errors or tuning it with -use options. Can someone help me out? More questions: 1. I'm considering to buy 2x Radeon VII or should I wait for Big Navi? 2. Anyone with AMD 5700 XT benchmarks to compare with Radeon VII? 3. CudaLUCAS seems to run slower then gpuowl. Are there any other options? Thanks![/QUOTE]A GTX1660 is so much better at TF, relatively speaking, that it's probably a waste to run it on gpuowl instead, even though gpuowl is excellent. But your kit, your choice. CUDALucas has not had any significant development in years, so it has naturally fallen behind. Preda, Prime95, and others have done a great job on gpuowl speed and other improvements. "More questions" has been covered pretty well already by others. For gpuowl -use option timing and tuning, I use the Windows batch file attached. Pass zero and one run together; other passes individually. Edit the gotos and sets from one pass to the next, to change the control flow and -use options in effect, respectively. That is what I did to produce my previous posts of tuning results. See the comments at both ends of the file, for more info. (Had to zip it, the forum won't accept a .bat file.) Please post your tuning results. |
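If it helps anyone scripting the same kind of sweep, the per-pass winners can be pulled out of the collected logs with a few lines of Python. This is only a sketch: it assumes each run logs a "config: ... -use X,Y" line followed by "<number> us/it" timing lines, which may not match every gpuowl version's log format exactly.

```python
import re

def best_config(log_text):
    """Return (use_options, us_per_it) for the fastest tuning run found in
    log_text. Assumes each run logs its 'config: ... -use X,Y' line before
    its '... 1234.5 us/it' timing lines (log format is an assumption)."""
    results = {}
    config = None
    for line in log_text.splitlines():
        m = re.search(r"config:.*-use\s+(\S+)", line)
        if m:
            config = m.group(1)
            continue
        t = re.search(r"([\d.]+) us/it", line)
        if t and config is not None:
            us = float(t.group(1))
            # keep the best (lowest) timing seen for each -use combination
            results[config] = min(results.get(config, us), us)
    return min(results.items(), key=lambda kv: kv[1])
```

Feed it the concatenated per-pass logs and it reports the winner, the same comparison done by eye in the tuning posts above.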
Bug!!!
[QUOTE=kriesel;536423]A GTX1660 is so much better at TF, relatively speaking, that it's probably a waste to run it on gpuowl instead, even though gpuowl is excellent. But your kit your choice. CUDALucas has not had any significant development in years, so naturally has fallen behind. Preda, Prime95, and others have done a great job on gpuowl speed and other improvements.
"More questions" has been covered pretty well already by others. For gpuowl -use option timing and tuning, I use the Windows batch file attached. Pass zero and one run together; other passes individually. Edit the gotos and sets from one pass to the next, to change the control flow and -use options in effect, respectively. That is what I did to produce my previous posts of tuning results. See the comments at both ends of the file, for more info. (Had to zip it, the forum won't accept a .bat file.) Please post your tuning results.[/QUOTE] Thanks! But first, just want to say that there is a bug on the program. [B]I'm using gpuowl v6.11-134-g1e0ce1d.[/B] ##################################### Running the batch outputs the following errors: [B]Error#1[/B] Running the Windows batch file at: 2020-02-01 23:55:14 config: -time -iters 10000 -use NO_ASM,UNROLL_NONE outputs some errors and after the following: 2020-02-01 23:55:14 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build [B]Error#2[/B] Running the Windows batch file at: 2020-02-01 23:55:14 config: -time -iters 10000 -use NO_ASM,UNROLL_WIDTH outputs some errors and after the following: 2020-02-01 23:55:15 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build [B]Error#3[/B] Running the Windows batch file at: 2020-02-01 23:55:15 config: -time -iters 10000 -use NO_ASM,UNROLL_HEIGHT outputs some errors and after the following: 2020-02-01 23:55:15 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build [B] Error#4[/B] Running the Windows batch file at: 2020-02-01 23:55:15 config: -time -iters 10000 -use NO_ASM,UNROLL_MIDDLEMUL1 outputs some errors and after the following: 2020-02-01 23:55:16 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build [B]Error#5[/B] Running the Windows batch file at: 2020-02-01 23:55:16 config: -time -iters 
10000 -use NO_ASM,UNROLL_MIDDLEMUL2 outputs some errors and after the following: 2020-02-01 23:55:16 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build ##################################### [B]Here are some more details on Error#1:[/B] [CODE]2020-02-01 23:55:14 config: -time -iters 10000 -use NO_ASM,UNROLL_NONE 2020-02-01 23:55:14 device 0, unique id '' 2020-02-01 23:55:14 GeForce GTX 1660-0 99753809 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 17.30 bits/word 2020-02-01 23:55:14 GeForce GTX 1660-0 OpenCL args "-DEXP=99753809u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0xd.064531a6f6b48p-3 -DIWEIGHT_STEP=0x9.d3e00e7c301p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DNO_ASM=1 -DUNROLL_NONE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2020-02-01 23:55:14 GeForce GTX 1660-0 OpenCL compilation error -11 (args -DEXP=99753809u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0xd.064531a6f6b48p-3 -DIWEIGHT_STEP=0x9.d3e00e7c301p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DNO_ASM=1 -DUNROLL_NONE=1 -I. 
-cl-fast-relaxed-math -cl-std=CL2.0 -DNO_ASM=1) 2020-02-01 23:55:14 GeForce GTX 1660-0 <kernel>:1386:3: error: expected identifier or '(' for (i32 s = 4; s >= 0; s -= 2) { ^ <kernel>:1394:3: error: expected identifier or '(' for (i32 s = 4; s >= 0; s -= 2) { ^ <kernel>:1404:3: error: expected identifier or '(' for (i32 s = 3; s >= 0; s -= 3) { ^ <kernel>:1412:3: error: expected identifier or '(' for (i32 s = 3; s >= 0; s -= 3) { ^ <kernel>:1422:3: error: expected identifier or '(' for (i32 s = 6; s >= 0; s -= 2) { ^ <kernel>:1430:3: error: expected identifier or '(' for (i32 s = 6; s >= 0; s -= 2) { ^ <kernel>:1440:3: error: expected identifier or '(' for (i32 s = 6; s >= 0; s -= 3) { ^ <kernel>:1448:3: error: expected identifier or '(' for (i32 s = 6; s >= 0; s -= 3) { ^ <kernel>:1458:3: error: expected identifier or '(' for (i32 s = 5; s >= 2; s -= 3) { ^ <kernel>:1502:3: error: expected identifier or '(' for (i32 s = 5; s >= 2; s -= 3) { ^ <kernel>:2478:3: error: expected identifier or '(' for (i32 i = 0; i < MIDDLE; ++i) { ^ 2020-02-01 23:55:14 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build 2020-02-01 23:55:14 GeForce GTX 1660-0 Bye [/CODE] |
[QUOTE=kriesel;536423]Please post your tuning results.[/QUOTE]
I ran this file on my Titan V to try out the most recent update, but I got consistently slower results (657us/it vs 632us/it) compared to version 6.11-113-g6ecd9a2 that I am running. It seems that the default Nvidia optimization settings don't play well with the Titan V. |
Bug#2
I have found another bug while trying to test M47 (a lower exponent).
[CODE]2020-02-02 01:36:38 gpuowl v6.11-134-g1e0ce1d
2020-02-02 01:36:38 Note: not found 'config.txt'
2020-02-02 01:36:38 config: -use UNROLL_ALL,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE,CARRY64,FANCYMIDDLEMUL1,LESS_ACCURATE
2020-02-02 01:36:38 device 0, unique id ''
2020-02-02 01:36:38 GeForce GTX 1660-0 43112609 FFT 2304K: Width 8x8, Height 256x8, Middle 9; 18.27 bits/word
2020-02-02 01:36:39 GeForce GTX 1660-0 OpenCL args "-DEXP=43112609u -DWIDTH=64u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xd.3ca600d8f455p-3 -DIWEIGHT_STEP=0x9.ab80a96f8aeap-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DCARRY64=1 -DFANCYMIDDLEMUL1=1 -DLESS_ACCURATE=1 -DT2_SHUFFLE=1 -DUNROLL_ALL=1 -DWORKINGIN4=1 -DWORKINGOUT4=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-02-02 01:36:39 GeForce GTX 1660-0 OpenCL compilation error -11 (args -DEXP=43112609u -DWIDTH=64u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xd.3ca600d8f455p-3 -DIWEIGHT_STEP=0x9.ab80a96f8aeap-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DCARRY64=1 -DFANCYMIDDLEMUL1=1 -DLESS_ACCURATE=1 -DT2_SHUFFLE=1 -DUNROLL_ALL=1 -DWORKINGIN4=1 -DWORKINGOUT4=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0 -DNO_ASM=1)
2020-02-02 01:36:39 GeForce GTX 1660-0 <kernel>:2009:2: error: WORKINGOUT4 not compatible with this FFT size
 #error WORKINGOUT4 not compatible with this FFT size
 ^
2020-02-02 01:36:39 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build
2020-02-02 01:36:39 GeForce GTX 1660-0 Bye[/CODE] |
[QUOTE=kriesel;536423]CUDALucas has not had any significant development in years, so naturally has fallen behind.[/QUOTE]
This is a bit misleading. First, CudaLucas was never intended to run on AMD cards. For native/CUDA/Nvidia cards it is still faster than anything else. At least, for everything I run in my rigs, old cards (like 580 and classic/black Titans) and new cards (like 1080Ti and 2080Ti) included. Second, there is "almost nothing" to improve in CudaLucas (well, there are some minor things, that's why the quotes, but the big picture won't change much); this toy is just a "square, subtract 2, repeat" tool, which uses the Nvidia CUDA FFT libraries (cuFFT) to do the squaring. These libraries, [U]indeed, fell behind[/U], as you said. They were not updated by Nvidia for ages, and if we can convince them to make (or make by ourselves:chappy:) some cuFFT library a hundred times faster than the actual one, all CL would need is a recompilation. :razz:. For the owl, Preda made the libraries from scratch, and they are well tuned for OpenCL, but Nvidia cards are not so good at OpenCL; they are faster when native CUDA is used. |
[QUOTE=LaurV;536442]First, CudaLucas was never intended to run on AMD cards. For native/cuda/nvidia cards is still faster than anything else.[/QUOTE]
This statement is a bit misleading, since with the new gpuowl updates it has become significantly more efficient in its memory bandwidth usage. I am seeing significant speedups on GPUs with a high DP ratio like the K80, P100, V100, and Titan V. There is indeed not much difference for the GTX and RTX cards, due to most of them being DP-bound instead of memory-bound. |
[QUOTE=LaurV;536442]First, CudaLucas was never intended to run on AMD cards. For native/cuda/nvidia cards is still faster than anything else. At least, for everything I run in my rigs, old cards (like 580 and clasic/black Titans) and new cards (like 1080Ti and 2080Ti) included. [/QUOTE]
Nope, on my RTX 2080 at least, the current version of gpuowl is about 20-30% faster than cudalucas, varying a bit from FFT size to another. The big improvement came in the beginning of December 2019, and smaller optimizations have accumulated since then, so if you've tested gpuowl before that, please test again. |
:rakes::surrender
You may be totally right... We didn't move to such new fancy things yet.. :sad: Edit @nomead, crosspost, I was replying to xx, but what you say is really tempting, BRB soon :smile: |
[QUOTE=nomead;536452]Nope, on my RTX 2080 at least, the current version of gpuowl is about 20-30% faster than cudalucas, varying a bit from FFT size to another. The big improvement came in the beginning of December 2019, and smaller optimizations have accumulated since then, so if you've tested gpuowl before that, please test again.[/QUOTE]
Interesting. I saw only around a 5% improvement going from CUDALucas to gpuowl on my 1070. Did the RTX series get a DP ratio higher than 1/32? |
It seems that your OpenCL compiler does not like __attribute__((opencl_unroll_hint(1))). To work around that, simply pass "-use UNROLL_ALL" (and none of the other UNROLL_ options), or, if running on an Nvidia card, don't pass any UNROLL option at all.
[QUOTE=JCoveiro;536429]Thanks! But first, just want to say that there is a bug on the program. [B]I'm using gpuowl v6.11-134-g1e0ce1d.[/B] ##################################### Running the batch outputs the following errors: [B]Error#1[/B] Running the Windows batch file at: 2020-02-01 23:55:14 config: -time -iters 10000 -use NO_ASM,UNROLL_NONE outputs some errors and after the following: 2020-02-01 23:55:14 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build [B]Error#2[/B] Running the Windows batch file at: 2020-02-01 23:55:14 config: -time -iters 10000 -use NO_ASM,UNROLL_WIDTH outputs some errors and after the following: 2020-02-01 23:55:15 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build [B]Error#3[/B] Running the Windows batch file at: 2020-02-01 23:55:15 config: -time -iters 10000 -use NO_ASM,UNROLL_HEIGHT outputs some errors and after the following: 2020-02-01 23:55:15 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build [B] Error#4[/B] Running the Windows batch file at: 2020-02-01 23:55:15 config: -time -iters 10000 -use NO_ASM,UNROLL_MIDDLEMUL1 outputs some errors and after the following: 2020-02-01 23:55:16 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build [B]Error#5[/B] Running the Windows batch file at: 2020-02-01 23:55:16 config: -time -iters 10000 -use NO_ASM,UNROLL_MIDDLEMUL2 outputs some errors and after the following: 2020-02-01 23:55:16 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build ##################################### [B]Here are some more details on Error#1:[/B] [CODE]2020-02-01 23:55:14 config: -time -iters 10000 -use NO_ASM,UNROLL_NONE 2020-02-01 23:55:14 device 0, unique id '' 2020-02-01 23:55:14 GeForce GTX 1660-0 99753809 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 17.30 bits/word 
2020-02-01 23:55:14 GeForce GTX 1660-0 OpenCL args "-DEXP=99753809u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0xd.064531a6f6b48p-3 -DIWEIGHT_STEP=0x9.d3e00e7c301p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DNO_ASM=1 -DUNROLL_NONE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2020-02-01 23:55:14 GeForce GTX 1660-0 OpenCL compilation error -11 (args -DEXP=99753809u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0xd.064531a6f6b48p-3 -DIWEIGHT_STEP=0x9.d3e00e7c301p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DNO_ASM=1 -DUNROLL_NONE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0 -DNO_ASM=1) 2020-02-01 23:55:14 GeForce GTX 1660-0 <kernel>:1386:3: error: expected identifier or '(' for (i32 s = 4; s >= 0; s -= 2) { ^ <kernel>:1394:3: error: expected identifier or '(' for (i32 s = 4; s >= 0; s -= 2) { ^ <kernel>:1404:3: error: expected identifier or '(' for (i32 s = 3; s >= 0; s -= 3) { ^ <kernel>:1412:3: error: expected identifier or '(' for (i32 s = 3; s >= 0; s -= 3) { ^ <kernel>:1422:3: error: expected identifier or '(' for (i32 s = 6; s >= 0; s -= 2) { ^ <kernel>:1430:3: error: expected identifier or '(' for (i32 s = 6; s >= 0; s -= 2) { ^ <kernel>:1440:3: error: expected identifier or '(' for (i32 s = 6; s >= 0; s -= 3) { ^ <kernel>:1448:3: error: expected identifier or '(' for (i32 s = 6; s >= 0; s -= 3) { ^ <kernel>:1458:3: error: expected identifier or '(' for (i32 s = 5; s >= 2; s -= 3) { ^ <kernel>:1502:3: error: expected identifier or '(' for (i32 s = 5; s >= 2; s -= 3) { ^ <kernel>:2478:3: error: expected identifier or '(' for (i32 i = 0; i < MIDDLE; ++i) { ^ 2020-02-01 23:55:14 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build 2020-02-01 23:55:14 GeForce GTX 1660-0 Bye [/CODE][/QUOTE] |
As the error says, you can't use "WORKINGOUT4" with that FFT size.
Did you try running the program without any -use options? does that work? [QUOTE=JCoveiro;536435]I have found another bug, while trying to test M47 (a lower exponent). [CODE]2020-02-02 01:36:38 gpuowl v6.11-134-g1e0ce1d 2020-02-02 01:36:38 Note: not found 'config.txt' 2020-02-02 01:36:38 config: -use UNROLL_ALL,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE,CARRY64,FANCYMIDDLEMUL1,LESS_ACCURATE 2020-02-02 01:36:38 device 0, unique id '' 2020-02-02 01:36:38 GeForce GTX 1660-0 43112609 FFT 2304K: Width 8x8, Height 256x8, Middle 9; 18.27 bits/word 2020-02-02 01:36:39 GeForce GTX 1660-0 OpenCL args "-DEXP=43112609u -DWIDTH=64u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xd.3ca600d8f455p-3 -DIWEIGHT_STEP=0x9.ab80a96f8aeap-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DCARRY64=1 -DFANCYMIDDLEMUL1=1 -DLESS_ACCURATE=1 -DT2_SHUFFLE=1 -DUNROLL_ALL=1 -DWORKINGIN4=1 -DWORKINGOUT4=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2020-02-02 01:36:39 GeForce GTX 1660-0 OpenCL compilation error -11 (args -DEXP=43112609u -DWIDTH=64u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xd.3ca600d8f455p-3 -DIWEIGHT_STEP=0x9.ab80a96f8aeap-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DCARRY64=1 -DFANCYMIDDLEMUL1=1 -DLESS_ACCURATE=1 -DT2_SHUFFLE=1 -DUNROLL_ALL=1 -DWORKINGIN4=1 -DWORKINGOUT4=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0 -DNO_ASM=1) 2020-02-02 01:36:39 GeForce GTX 1660-0 <kernel>:2009:2: error: WORKINGOUT4 not compatible with this FFT size #error WORKINGOUT4 not compatible with this FFT size ^ 2020-02-02 01:36:39 GeForce GTX 1660-0 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:247 build 2020-02-02 01:36:39 GeForce GTX 1660-0 Bye[/CODE][/QUOTE] |
[QUOTE=preda;536513]As the error says, you can't use "WORKINGOUT4" with that FFT size.
Did you try running the program without any -use options? does that work?[/QUOTE] Yes. It runs without -use options. I was just testing the "optimized settings" for Nvidia cards, but it seems that I can't use WORKINGOUT4. Going to test again and publish the results for the GTX1660. |
[QUOTE=LaurV;536442]
First, CudaLucas was never intended to run on AMD cards. For native/cuda/nvidia cards is still faster than anything else.[/QUOTE] Same here -- CUDALucas is faster than gpuOwL on my EVGA Geforce 1050 2GB and also doesn't slow down Prime95 running on the CPU concurrently. However, even with the iteration times being a couple of milliseconds slower on gpuOwL versus CUDALucas (plus a couple of milliseconds' slowdown to Prime95 if it is running too), since gpuOwL eliminates the need for a double-check, it is the overall time-saving winner over CUDALucas for me. I only did one PRP double-check with gpuOwL and I occasionally do LL double-checks with CUDALucas. Since the 1/32 double-precision ratio is terrible I mostly stick with Trial Factoring using mfaktc. |
[QUOTE=wfgarnett3;536864]since gpuOwL eliminates the need for a double-check[/QUOTE]But it doesn't. There is a PRP DC work type for good reasons:
1) errors may occur outside the code that the GEC covers, both in the software and in the manual reporting process, and some have already been confirmed to occur;
2) PRP DC guards against someone forging PRP first-test submissions;
3) the PRP GEC itself has a very low error rate, but not zero. Gerbicz himself has given error rate estimates.
[QUOTE]Since the 1/32 double-precision ratio is terrible I mostly stick with Trial Factoring using mfaktc.[/QUOTE]Good choice. |
CUDALucas still has its place:
faster on a few gpu models than gpuowl;
will run on older NVIDIA gpus that are entirely incapable of running gpuowl because they don't support the required OpenCL level;
relatively current gpuowl versions don't do LL, so can't do LL DC (although v0.5 and v0.6 gpuowl can, with 4M fft).
It would be great if CUDALucas had the Jacobi check. |
[QUOTE=kriesel;536893]CUDALucas still has its place;
faster on a few gpu models than gpuowl; will run on older NVIDIA gpus that are entirely incapable of running gpuowl because they don't support the required OpenCL level for gpuowl; relatively current gpuowl versions don't do LL so can't do LLDC (although v0.5 and v0.6 gpuowl can with 4M fft) It would be great if CUDALucas had the Jacobi check.[/QUOTE] Ha, ha, re. your note about running on older nVidia GPUs -- in preparation for my recent upgrade of my deskside Haswell system to put a Radeon 7 in the PCI3 slot, I first removed an ancient gtx430 card from the PCI2 slot. Mike/Xyzzy had gifted me that ~5 years ago to use to play with CUDA development work - I actually got as far as working TF code, but never did get the sieving stuff optimized for the nVidia architecture, so it was spending way more time there than it should; overall speed was about half that of mfaktc. Anyhow, I still have the card, could re-install it in PCI2; that would leave a mere 1" gap between it and the underbelly fan array of the R7, so I would need to make sure it wasn't significantly impeding airflow to the latter. Just curious - do you have any sense how fast - and I use the term very loosely :) - this card would be at LL and TF? Probably not worth it on a work-per-watt basis, but the experiment might be useful in terms of seeing whether *some* kind of GPU - e.g. a newer used model of one of the ones known to be good choices for GIMPS work - could go into PCI2 without hurting the R7 throughput. (A second R7 is not an option, even if it could be adapted to go into PCI2; my PS has only 2 8-pin power connects, and the current R7 uses them both.) |
[QUOTE=ewmayer;536909]I first removed an ancient gtx430 ... do you have any sense how fast - and I use the term very loosely :) - this card would be at LL and TF? Probably not worth it on a work-per-watt basis[/QUOTE]Probably slower than any of these:
[URL]https://www.mersenne.ca/mfaktc.php[/URL] (enter gtx 4 in the model search box)
[URL]https://www.mersenne.ca/cudalucas.php[/URL] (ditto)
If you want a surer answer, find the NVIDIA spec pages for the GTX430 and one or more of the models listed on Heinrich's pages, and compute an estimate by proportion. Since my test GTX480 could not run gpuowl, the GTX430's chances are slim. It would look pretty weak compared to a Radeon VII. Or in mfaktc, compared to a GTX 1xxx or RTX. Try it out. Send James some benchmarks. Understandable if you don't regard its space-heater value as sufficient to run for long, in California. |
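For what it's worth, the "estimate by proportion" suggested above is just a ratio of spec-sheet throughputs; a minimal sketch (the GFLOPS values below are placeholders, not real spec numbers):

```python
def estimate_us_per_iter(known_us, known_gflops, target_gflops):
    """Scale a known card's us/iteration by the throughput ratio read off
    the spec pages. Ignores memory bandwidth, so it's only a rough guide."""
    return known_us * known_gflops / target_gflops

# Hypothetical numbers: a card with half the FP64 throughput of the
# reference should take roughly twice as long per iteration.
print(estimate_us_per_iter(3000.0, 200.0, 100.0))  # 6000.0
```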
gpuowl-win build 6.11-142-gf54af2e
2 Attachment(s)
Stumbled across the new commit while bringing up a new Colab account. Haven't run it other than -h. I defer to Mihai as to what this offers beyond v6.11-134.
It still eats a whole cpu core on Windows 10 during P-1, even with the -yield option included. At least for a while after startup. |
10M test p-1 reliably missed known factor
The first run of the 10M exponent was with optimized -use flags on gpuowl-win v6.11-134. After that it was just -use NO_ASM. Different -use, different bounds, never found the known factor although all the bounds tried should have been adequate. The 50M test p-1 would not run with the numerous -use options previously in place for a different fft length, so those options were temporarily removed.
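For the record, whether a known factor q should fall within given P-1 bounds is easy to check: writing q-1 = 2*k*p, P-1 should find q when k is B1-smooth apart from at most one prime in (B1, B2] (the exponent p is conventionally included in the stage-1 powering for free; this sketch also ignores prime-power subtleties). A quick sanity check in Python, using the 24000577 stage-1 hit from the results below (trial division is fine at these factor sizes):

```python
def pminus1_would_find(q, p, B1, B2):
    """True if P-1 on 2^p-1 with bounds B1/B2 should recover the known
    prime factor q: q-1 = 2*k*p with k B1-smooth apart from at most one
    prime factor <= B2 (simplified; ignores prime-power edge cases)."""
    m = q - 1
    while m % 2 == 0:
        m //= 2
    while m % p == 0:
        m //= p          # the exponent itself comes along for free
    factors = []
    d = 3
    while d * d <= m:    # trial-divide the remaining cofactor k
        while m % d == 0:
            factors.append(d)
            m //= d
        d += 2
    if m > 1:
        factors.append(m)
    big = [f for f in factors if f > B1]
    return not big or (len(big) == 1 and big[0] <= B2)

# 13504596665207 = 2*281339*24000577 + 1, and 281339 < B1 = 300000,
# so the stage-1 find on 24000577 is expected:
print(pminus1_would_find(13504596665207, 24000577, 300000, 300000))  # True
```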
The same radeon vii gpu has run error-free in PRP GEC since a clock reduction Dec 15 2019. (1250Mhz gpu, 880 Mhz memory; runs P-1 at 70-100W; hot spot currently 71C) It also missed the known factor on 4444091. And missed the known factors of 2000081. And that of 15000031. Everything I tried below 20M failed.
[CODE]{"exponent":"10000831", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-09 23:17:56 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":524288, "B1":30000, "B2":500000}
{"exponent":"50001781", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-09 23:31:40 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":2883584, "B1":100000, "B2":5000000, "factors":["4392938042637898431087689"]}
{"exponent":"10000831", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-09 23:33:03 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":524288, "B1":30000, "B2":500000}
{"exponent":"24000577", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-09 23:43:41 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":1310720, "B1":300000, "factors":["13504596665207"]}
{"exponent":"10000831", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-09 23:46:42 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":524288, "B1":40000, "B2":600000}
{"exponent":"10000831", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-09 23:48:38 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":524288, "B1":120000, "B2":2200000}
{"exponent":"4444091", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-10 00:02:00 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":229376, "B1":15015, "B2":90000}
{"exponent":"61012769", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-10 00:05:19 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":3670016, "B1":20000, "B2":2000000, "factors":["2018028590362685212673"]}
{"exponent":"2000081", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-10 00:19:58 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":131072, "B1":15015, "B2":300300}
{"exponent":"2000081", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-10 00:24:03 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":131072, "B1":15015, "B2":30030}
{"exponent":"15000031", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-10 00:55:47 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":786432, "B1":180000, "B2":3780000}
{"exponent":"20000023", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-10 01:05:52 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":1179648, "B1":240000, "factors":["60100040564410724460091241"]}
[/CODE] |
[QUOTE=preda;533571]ROCm exposes a per-GPU unique_id, e.g.:
[CODE]cat /sys/class/drm/card0/device/unique_id
3044212172dc768c
[/CODE]This id is a property of the GPU itself, and does not depend on the system or PCIe slot. So moving a GPU to a different slot, or to a different system, preserves the UID. I added a way to specify the GPU to run on by using this unique id:
./gpuowl -uid 3044212172dc768c
This can be used instead of -device (-d), which specifies the device by position in the list of devices. The advantage is that the identity of the GPU is preserved when swapping PCIe slots. Combining -uid with -cpu allows associating a stable symbolic name with an actual GPU. [/QUOTE] The Windows driver does not support this, yielding a null id:
[CODE]2020-02-09 19:00:02 roa/radeonvii Bye
2020-02-09 19:00:06 config: -device 1 -user kriesel -cpu roa/radeonvii -yield -maxAlloc 16000 -use NO_ASM
2020-02-09 19:00:06 device 1, unique id ''[/CODE] |
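On Linux with ROCm, the per-card ids can be enumerated straight from sysfs; a small sketch (the sysfs root is parameterized so it can be exercised off-box; where the driver doesn't expose the file, as on Windows above, the card is simply skipped):

```python
import glob
import os

def list_gpu_unique_ids(sysfs_root="/sys/class/drm"):
    """Map each cardN under sysfs_root to its ROCm-exposed unique_id,
    skipping devices where the file is missing or empty (e.g. non-ROCm
    drivers)."""
    ids = {}
    pattern = os.path.join(sysfs_root, "card*", "device", "unique_id")
    for path in sorted(glob.glob(pattern)):
        # .../cardN/device/unique_id -> "cardN"
        card = os.path.basename(os.path.dirname(os.path.dirname(path)))
        try:
            with open(path) as f:
                uid = f.read().strip()
        except OSError:
            continue
        if uid:
            ids[card] = uid
    return ids
```

The resulting map gives stable names to pass to gpuowl's -uid option regardless of PCIe slot order.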
Thank you for the bug report! Investigating...
The first failed case, 10000831, is fixed by using "-use ORIG_SLOWTRIG". Could you please check whether there are any failures with -use ORIG_SLOWTRIG? In parallel we'll be looking for a better fix. A faster way to repro the problem is e.g. "gpuowl -prp 10000831", which fails GEC. Note: if re-running the P-1s, be sure to delete the savefiles from the previous runs (or run in a new location). [QUOTE=kriesel;537179]The first run of the 10M exponent was with optimized -use flags on gpuowl-win v6.11-134. After that it was just -use NO_ASM. Different -use, different bounds, never found the known factor, although all the bounds tried should have been adequate. The 50M test P-1 would not run with the numerous -use options previously in place for a different fft length, so those options were temporarily removed. The same Radeon VII gpu has run error-free in PRP GEC since a clock reduction Dec 15 2019 (1250 MHz gpu, 880 MHz memory; runs P-1 at 70-100W; hot spot currently 71C). It also missed the known factor of 4444091, and the known factors of 2000081 and 15000031. Everything I tried below 20M failed. 
[CODE]{"exponent":"10000831", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-09 23:17:56 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":524288, "B1":30000, "B2":500000} {"exponent":"50001781", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-09 23:31:40 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":2883584, "B1":100000, "B2":5000000, "factors":["4392938042637898431087689"]} {"exponent":"10000831", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-09 23:33:03 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":524288, "B1":30000, "B2":500000} {"exponent":"24000577", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-09 23:43:41 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":1310720, "B1":300000, "factors":["13504596665207"]} {"exponent":"10000831", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-09 23:46:42 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":524288, "B1":40000, "B2":600000} {"exponent":"10000831", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-09 23:48:38 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":524288, "B1":120000, "B2":2200000} {"exponent":"4444091", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-10 00:02:00 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":229376, "B1":15015, "B2":90000} {"exponent":"61012769", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-10 00:05:19 UTC", "user":"kriesel", 
"computer":"roa/radeonvii", "fft-length":3670016, "B1":20000, "B2":2000000, "factors":["2018028590362685212673"]} {"exponent":"2000081", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-10 00:19:58 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":131072, "B1":15015, "B2":300300} {"exponent":"2000081", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-10 00:24:03 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":131072, "B1":15015, "B2":30030} {"exponent":"15000031", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-10 00:55:47 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":786432, "B1":180000, "B2":3780000} {"exponent":"20000023", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-10 01:05:52 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":1179648, "B1":240000, "factors":["60100040564410724460091241"]} [/CODE][/QUOTE] |
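Since each result above is a standalone JSON object, cross-checking a batch of results against a list of known factors is mechanical. A sketch, assuming one JSON object per line as in the samples above; the helper name and the known-factors dict are illustrative, not part of gpuowl:

```python
# Sketch: flag gpuowl P-1 result lines that reported "NF" (no factor)
# for exponents with a known factor -- the failure mode reported above.
import json

def missed_factor_runs(lines, known_factors):
    """known_factors: dict mapping exponent (str) -> known factor (str)."""
    missed = []
    for line in lines:
        r = json.loads(line)
        if r.get("status") == "NF" and r["exponent"] in known_factors:
            missed.append((r["exponent"], r["B1"], r.get("B2")))
    return missed
```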
@Ken An attempted fix is in the most recent commit [url]https://github.com/preda/gpuowl/commit/6146b6d49716011d0340f5a670653c12ef4f417c[/url]
, which is supposed to fix the issues without requiring -use ORIG_SLOWTRIG [QUOTE=preda;537204]Thank you for the bug report! investigating.. The first failed case, 10000831, is fixed by using "-use ORIG_SLOWTRIG". Could you please check if there are any failures with -use ORIG_SLOWTRIG In parallel we'll be looking for a better fix. A faster way to repro the problem is e.g. "gpuowl -prp 10000831" which fails GEC. Note: if re-running the P-1s be sure to delete the savefiles from the previous runs (or run in a new location)[/QUOTE] |
[QUOTE=preda;537214]@Ken An attempted fix is in the most recent commit [URL]https://github.com/preda/gpuowl/commit/6146b6d49716011d0340f5a670653c12ef4f417c[/URL]
, which is supposed to fix the issues without requiring -use ORIG_SLOWTRIG[/QUOTE] Thanks for the quick actions on this. The issue reached at least 19M. [CODE]{"exponent":"18000137", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-10 17:58:35 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":1048576, "B1":220000, "B2":5060000} {"exponent":"19000013", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-10 19:03:35 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":1048576, "B1":230000, "B2":9520000} [/CODE]Will try a new commit later. |
Hi Ken, running with -use ORIG_SLOWTRIG I find factors (in stage1 already) for both 18000137 and 19000013. Do you still see a problem with -use ORIG_SLOWTRIG?
[QUOTE=kriesel;537248]Thanks for the quick actions on this. The issue reached to at least 19M. [CODE]{"exponent":"18000137", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-10 17:58:35 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":1048576, "B1":220000, "B2":5060000} {"exponent":"19000013", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-10 19:03:35 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":1048576, "B1":230000, "B2":9520000} [/CODE]Will try a new commit later.[/QUOTE] |
[QUOTE=preda;537256]Hi Ken, running with -use ORIG_SLOWTRIG I find factors (in stage1 already) for both 18000137 and 19000013. Do you still see a problem with -use ORIG_SLOWTRIG?[/QUOTE]Win10 x64, gpuowl v6.11-134
In config.txt:[CODE]-device 1 -user kriesel -cpu roa/radeonvii -yield -maxAlloc 16000 -use NO_ASM,ORIG_SLOWTRIG[/CODE]In worktodo:[CODE]B1=15015,B2=300300;PFactor=0,1,2,2000081,-1,61,2 B1=3000;PFactor=0,1,2,4444091,-1,64,2 B1=30000,B2=500000;PFactor=0,1,2,10000831,-1,70,2 B1=180000,B2=3780000;PFactor=0,1,2,15000031,-1,66,2 B1=220000,B2=5060000;PFactor=0,1,2,18000137,-1,35,2 B1=230000,B2=9520000;PFactor=0,1,2,19000013,-1,53,2[/CODE]Results (15M missed factor): [CODE]{"exponent":"2000081", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-10 21:25:34 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":131072, "B1":15015, "factors":["2700109974025273"]} {"exponent":"4444091", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-10 21:25:47 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":229376, "B1":15015, "factors":["1809798096458971047321927127"]} {"exponent":"10000831", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-10 21:26:15 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":524288, "B1":30000, "B2":500000, "factors":["646560662529991467527"]} {"exponent":"15000031", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-10 21:30:07 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":786432, "B1":180000, "B2":3780000} {"exponent":"18000137", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-10 21:32:35 UTC", "user":"kriesel", "computer":"roa/radeonvii", "fft-length":1048576, "B1":220000, "factors":["2479169845866581244380961527"]} {"exponent":"19000013", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-02-10 21:35:57 UTC", 
"user":"kriesel", "computer":"roa/radeonvii", "fft-length":1048576, "B1":230000, "factors":["4674003199"]}[/CODE]I don't know why, but the 768k fft for 15M took 444us/it in P1 while the 1024k for 18M, 19M took 310us/it in P1. I reran the 15M with a 3980000 B2 but it still missed. |
I ran
[CODE]B1=180000,B2=3780000;PFactor=0,1,2,15000031,-1,66,2 [/CODE] with a freshly checked-out master, with and without ORIG_SLOWTRIG, as well as an old version (pre the new sin/cos code), and none of those found a factor. So there may be some other issue. I'll keep checking. |
P-1 needs more error checks built in
All-zero residues in P-1 stage 1, and it marches blindly on.
Worktodo line: [CODE]B1=1040000,B2=28080000;PFactor=0,1,2,99998441,-1,77,2[/CODE]Config.txt: [CODE]-device 0 -user kriesel -cpu condorella/rx480 -use NO_ASM,UNROLL_HEIGHT,UNROLL_WIDTH,MERGED_MIDDLE,WORKINGIN1,WORKINGOUT1,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_WIDTH,CARRY32,MORE_SQUARES_MIDDLEMUL1,CHEBYSHEV_MIDDLEMUL2,NEW_SLOWTRIG [/CODE][CODE]C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>gpuowl-win 2020-02-11 16:35:15 gpuowl v6.11-134-g1e0ce1d 2020-02-11 16:35:15 config: -device 0 -user kriesel -cpu condorella/rx480 -use NO_ASM,UNROLL_HEIGHT,UNROLL_WIDTH,MERGED_MIDDLE,WORKINGIN1,WORKINGOUT1,T2_SHUFFLE _HEIGHT,T2_SHUFFLE_WIDTH,CARRY32,MORE_SQUARES_MIDDLEMUL1,CHEBYSHEV_MIDDLEMUL2,NEW_SLOWTRIG [COLOR=Gray]2020-02-11 16:35:15 config: 2020-02-11 16:35:15 config: 4.5m fft NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT1,T2_SHUFFLE_WIDTH,T2_SHUFFLE_HEIGHT,UNROLL_MIDDLEMUL2,UNROLL_MIDDLEMUL1,CARRY32, CHEBYSHEV_METHOD_FMA,CHEBYSHEV_MIDDLEMUL2,LESS_ACCURATE 2020-02-11 16:35:15 config: :5m fft NO_ASM,UNROLL_HEIGHT,UNROLL_WIDTH,MERGED_MIDDLE,WORKINGIN1,WORKINGOUT1,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_WIDTH,CARRY32,MORE_SQUA RES_MIDDLEMUL1,CHEBYSHEV_MIDDLEMUL2,NEW_SLOWTRIG[/COLOR] 2020-02-11 16:35:15 device 0, unique id '' 2020-02-11 16:35:15 condorella/rx480 99998441 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 17.34 bits/word 2020-02-11 16:35:17 condorella/rx480 OpenCL args "-DEXP=99998441u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0xc.a5a9d5baf7a18p-3 -DIWEIGHT_ST EP=0xa.1ef1eeb123f08p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DCARRY32=1 -DCHEBYSHEV_MIDDLEMUL2=1 -DMERGED_MI DDLE=1 -DMORE_SQUARES_MIDDLEMUL1=1 -DNEW_SLOWTRIG=1 -DNO_ASM=1 -DT2_SHUFFLE_HEIGHT=1 -DT2_SHUFFLE_WIDTH=1 -DUNROLL_HEIGHT=1 -DUNROLL_WIDTH=1 -DWORKINGIN1=1 -DWO RKINGOUT1=1 -I. 
-cl-fast-relaxed-math -cl-std=CL2.0" 2020-02-11 16:35:20 condorella/rx480 OpenCL compilation in 3.11 s 2020-02-11 16:35:20 condorella/rx480 99998441 P1 B1=1040000, B2=28080000; 1500153 bits; starting at 0 2020-02-11 16:35:58 condorella/rx480 99998441 P1 10000 0.67%; 3761 us/it; ETA 0d 01:33; [COLOR=Red]0000000000000000[/COLOR] 2020-02-11 16:36:36 condorella/rx480 99998441 P1 20000 1.33%; 3768 us/it; ETA 0d 01:33; 0000000000000000 2020-02-11 16:37:13 condorella/rx480 99998441 P1 30000 2.00%; 3767 us/it; ETA 0d 01:32; 0000000000000000 2020-02-11 16:37:51 condorella/rx480 99998441 P1 40000 2.67%; 3769 us/it; ETA 0d 01:32; 0000000000000000 2020-02-11 16:38:29 condorella/rx480 99998441 P1 50000 3.33%; 3767 us/it; ETA 0d 01:31; 0000000000000000 2020-02-11 16:39:06 condorella/rx480 99998441 P1 60000 4.00%; 3771 us/it; ETA 0d 01:31; 0000000000000000 2020-02-11 16:39:44 condorella/rx480 99998441 P1 70000 4.67%; 3764 us/it; ETA 0d 01:30; 0000000000000000 2020-02-11 16:40:21 condorella/rx480 saved 2020-02-11 16:40:22 condorella/rx480 99998441 P1 80000 5.33%; 3781 us/it; ETA 0d 01:30; 0000000000000000 2020-02-11 16:41:00 condorella/rx480 99998441 P1 90000 6.00%; 3768 us/it; ETA 0d 01:29; 0000000000000000 2020-02-11 16:41:37 condorella/rx480 99998441 P1 100000 6.67%; 3768 us/it; ETA 0d 01:28; 0000000000000000 2020-02-11 16:42:15 condorella/rx480 99998441 P1 110000 7.33%; 3765 us/it; ETA 0d 01:27; 0000000000000000 2020-02-11 16:42:53 condorella/rx480 99998441 P1 120000 8.00%; 3768 us/it; ETA 0d 01:27; 0000000000000000 2020-02-11 16:43:30 condorella/rx480 99998441 P1 130000 8.67%; 3763 us/it; ETA 0d 01:26; 0000000000000000 2020-02-11 16:44:08 condorella/rx480 99998441 P1 140000 9.33%; 3770 us/it; ETA 0d 01:25; 0000000000000000 2020-02-11 16:44:46 condorella/rx480 99998441 P1 150000 10.00%; 3766 us/it; ETA 0d 01:25; 0000000000000000 2020-02-11 16:45:21 condorella/rx480 saved 2020-02-11 16:45:23 condorella/rx480 99998441 P1 160000 10.67%; 3784 us/it; ETA 0d 01:25; 0000000000000000 
2020-02-11 16:46:01 condorella/rx480 99998441 P1 170000 11.33%; 3769 us/it; ETA 0d 01:24; 0000000000000000 2020-02-11 16:46:39 condorella/rx480 99998441 P1 180000 12.00%; 3765 us/it; ETA 0d 01:23; 0000000000000000 2020-02-11 16:47:16 condorella/rx480 99998441 P1 190000 12.67%; 3771 us/it; ETA 0d 01:22; 0000000000000000 2020-02-11 16:47:54 condorella/rx480 99998441 P1 200000 13.33%; 3767 us/it; ETA 0d 01:22; 0000000000000000 2020-02-11 16:48:32 condorella/rx480 99998441 P1 210000 14.00%; 3769 us/it; ETA 0d 01:21; 0000000000000000 2020-02-11 16:49:10 condorella/rx480 99998441 P1 220000 14.67%; 3768 us/it; ETA 0d 01:20; 0000000000000000 2020-02-11 16:49:47 condorella/rx480 99998441 P1 230000 15.33%; 3769 us/it; ETA 0d 01:20; 0000000000000000 2020-02-11 16:50:22 condorella/rx480 saved 2020-02-11 16:50:25 condorella/rx480 99998441 P1 240000 16.00%; 3779 us/it; ETA 0d 01:19; 0000000000000000 2020-02-11 16:51:03 condorella/rx480 99998441 P1 250000 16.66%; 3764 us/it; ETA 0d 01:18; 0000000000000000 2020-02-11 16:51:40 condorella/rx480 99998441 P1 260000 17.33%; 3767 us/it; ETA 0d 01:18; 0000000000000000 2020-02-11 16:52:18 condorella/rx480 99998441 P1 270000 18.00%; 3766 us/it; ETA 0d 01:17; 0000000000000000 2020-02-11 16:52:56 condorella/rx480 99998441 P1 280000 18.66%; 3770 us/it; ETA 0d 01:17; 0000000000000000 2020-02-11 16:53:33 condorella/rx480 99998441 P1 290000 19.33%; 3769 us/it; ETA 0d 01:16; 0000000000000000 2020-02-11 16:54:11 condorella/rx480 99998441 P1 300000 20.00%; 3769 us/it; ETA 0d 01:15; 0000000000000000 2020-02-11 16:54:49 condorella/rx480 99998441 P1 310000 20.66%; 3764 us/it; ETA 0d 01:15; 0000000000000000 2020-02-11 16:55:22 condorella/rx480 saved 2020-02-11 16:55:27 condorella/rx480 99998441 P1 320000 21.33%; 3780 us/it; ETA 0d 01:14; 0000000000000000 2020-02-11 16:56:04 condorella/rx480 99998441 P1 330000 22.00%; 3765 us/it; ETA 0d 01:13; 0000000000000000 2020-02-11 16:56:42 condorella/rx480 99998441 P1 340000 22.66%; 3759 us/it; ETA 0d 01:13; 
0000000000000000 2020-02-11 16:57:19 condorella/rx480 99998441 P1 350000 23.33%; 3770 us/it; ETA 0d 01:12; 0000000000000000 2020-02-11 16:57:57 condorella/rx480 99998441 P1 360000 24.00%; 3767 us/it; ETA 0d 01:12; 0000000000000000 2020-02-11 16:58:35 condorella/rx480 99998441 P1 370000 24.66%; 3767 us/it; ETA 0d 01:11; 0000000000000000[/CODE]Stopped the run, renamed the intermediate files out of the way, reduced -use option to just NO_ASM, and it seems to be running ok. [CODE]C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>gpuowl-win 2020-02-11 17:07:30 gpuowl v6.11-134-g1e0ce1d 2020-02-11 17:07:30 config: -device 0 -user kriesel -cpu condorella/rx480 -use NO_ASM [COLOR=Gray]2020-02-11 17:07:30 config: 2020-02-11 17:07:30 config: :,UNROLL_HEIGHT,UNROLL_WIDTH,MERGED_MIDDLE,WORKINGIN1,WORKINGOUT1,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_WIDTH,CARRY32,MORE_SQUARES_MIDDLEMUL1 ,CHEBYSHEV_MIDDLEMUL2,NEW_SLOWTRIG 2020-02-11 17:07:30 config: 2020-02-11 17:07:30 config: :4.5m fft NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT1,T2_SHUFFLE_WIDTH,T2_SHUFFLE_HEIGHT,UNROLL_MIDDLEMUL2,UNROLL_MIDDLEMUL1,CARRY32 ,CHEBYSHEV_METHOD_FMA,CHEBYSHEV_MIDDLEMUL2,LESS_ACCURATE 2020-02-11 17:07:30 config: :5m fft NO_ASM,UNROLL_HEIGHT,UNROLL_WIDTH,MERGED_MIDDLE,WORKINGIN1,WORKINGOUT1,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_WIDTH,CARRY32,MORE_SQUA RES_MIDDLEMUL1,CHEBYSHEV_MIDDLEMUL2,NEW_SLOWTRIG[/COLOR] 2020-02-11 17:07:30 device 0, unique id '' 2020-02-11 17:07:30 condorella/rx480 99998441 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 17.34 bits/word 2020-02-11 17:07:32 condorella/rx480 OpenCL args "-DEXP=99998441u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0xc.a5a9d5baf7a18p-3 -DIWEIGHT_ST EP=0xa.1ef1eeb123f08p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DNO_ASM=1 -I. 
-cl-fast-relaxed-math -cl-std=CL 2.0" 2020-02-11 17:07:35 condorella/rx480 OpenCL compilation in 3.20 s 2020-02-11 17:07:36 condorella/rx480 99998441 P1 B1=1040000, B2=28080000; 1500153 bits; starting at 0 2020-02-11 17:08:14 condorella/rx480 99998441 P1 10000 0.67%; 3887 us/it; ETA 0d 01:37; 5412ff3dd7337b62 2020-02-11 17:08:53 condorella/rx480 99998441 P1 20000 1.33%; 3897 us/it; ETA 0d 01:36; 67401cf04590fe9e [/CODE] |
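The kind of host-side check being asked for could be as simple as watching the rolling res64 history: a healthy run prints effectively random 64-bit residues, so a stretch of all-zero (or all-identical) values signals breakage. A minimal sketch, with illustrative names only (this is not gpuowl code):

```python
# Sketch of a sanity check on the 64-bit residues printed each progress
# line: repeated zero or constant residues, as in the log above, almost
# certainly mean the computation has gone wrong and should be aborted.
def residue_looks_broken(res64_history, window=3):
    """True if the last `window` residues are all zero or all identical."""
    recent = res64_history[-window:]
    if len(recent) < window:
        return False  # not enough data yet
    return all(r == 0 for r in recent) or len(set(recent)) == 1
```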
Latest gpuowl commit is still missing the 15M. This was on Google Colaboratory, for a hopefully very reliable gpu and underlying system.[CODE]{"exponent":"2000081", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.11-145-g6146b6d-dirty"}, "timestamp":"2020-02-12 20:35:23 UTC", "user":"kriesel", "computer":"colab3/TeslaP4", "fft-length":131072, "B1":15015, "factors":["2700109974025273"]}
{"exponent":"4444091", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.11-145-g6146b6d-dirty"}, "timestamp":"2020-02-12 20:35:47 UTC", "user":"kriesel", "computer":"colab3/TeslaP4", "fft-length":229376, "B1":15015, "factors":["1809798096458971047321927127"]} {"exponent":"10000831", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.11-145-g6146b6d-dirty"}, "timestamp":"2020-02-12 20:40:05 UTC", "user":"kriesel", "computer":"colab3/TeslaP4", "fft-length":524288, "B1":120000, "B2":2200000, "factors":["646560662529991467527"]} {"exponent":"15000031", "worktype":"PM1", "status":"[COLOR=Red][B]NF[/B][/COLOR]", "program":{"name":"gpuowl", "version":"v6.11-145-g6146b6d-dirty"}, "timestamp":"2020-02-12 20:54:51 UTC", "user":"kriesel", "computer":"colab3/TeslaP4", "fft-length":786432, "B1":180000, "B2":3780000} {"exponent":"18000137", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.11-145-g6146b6d-dirty"}, "timestamp":"2020-02-12 20:55:29 UTC", "user":"kriesel", "computer":"colab3/TeslaP4", "fft-length":1048576, "B1":15015, "factors":["2479169845866581244380961527"]} {"exponent":"19000013", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.11-145-g6146b6d-dirty"}, "timestamp":"2020-02-12 20:56:14 UTC", "user":"kriesel", "computer":"colab3/TeslaP4", "fft-length":1048576, "B1":15015, "factors":["4674003199"]}[/CODE]CUDAPm1 v0.20 finds it. [CODE]CUDAPm1 v0.20 ------- DEVICE 0 ------- name GeForce GTX 1080 Compatibility 6.1 clockRate (MHz) 1797 memClockRate (MHz) 5005 totalGlobalMem zu totalConstMem zu l2CacheSize 2097152 sharedMemPerBlock zu regsPerBlock 65536 warpSize 32 memPitch zu maxThreadsPerBlock 1024 maxThreadsPerMP 2048 multiProcessorCount 20 maxThreadsDim[3] 1024,1024,64 maxGridSize[3] 2147483647,65535,65535 textureAlignment zu deviceOverlap 1 CUDA reports 7968M of 8192M GPU memory free. Using threads: norm1 256, mult 256, norm2 1024. No stage 2 checkpoint. 
Using up to 4992M GPU memory. Selected B1=2360000, B2=59590000, 3.83% chance of finding a factor Using B1 = 2360000 from savefile. Continuing stage 2 from a partial result of M282000073 fft length = 16384K batch wrapper reports (re)launch at Wed 02/12/2020 15:13:40.06 reset count 0 of max 3 CUDAPm1 v0.20 ------- DEVICE 0 ------- name GeForce GTX 1080 Compatibility 6.1 clockRate (MHz) 1797 memClockRate (MHz) 5005 totalGlobalMem zu totalConstMem zu l2CacheSize 2097152 sharedMemPerBlock zu regsPerBlock 65536 warpSize 32 memPitch zu maxThreadsPerBlock 1024 maxThreadsPerMP 2048 multiProcessorCount 20 maxThreadsDim[3] 1024,1024,64 maxGridSize[3] 2147483647,65535,65535 textureAlignment zu deviceOverlap 1 CUDA reports 7968M of 8192M GPU memory free. Index 25 Using threads: norm1 256, mult 32, norm2 64. Using up to 4137M GPU memory. Selected B1=275000, B2=8112500, 5.53% chance of finding a factor Starting stage 1 P-1, M15000031, B1 = 275000, B2 = 8112500, fft length = 800K Doing 396818 iterations Iteration 100000 M15000031, 0x7a8e085ca931e223, n = 800K, CUDAPm1 v0.20 err = 0.14453 (1:19 real, 0.7873 ms/iter, ETA 3:53) Iteration 200000 M15000031, 0xad072f2e5fc4eb76, n = 800K, CUDAPm1 v0.20 err = 0.14844 (1:19 real, 0.7901 ms/iter, ETA 2:35) Iteration 300000 M15000031, 0x82162c462572c64d, n = 800K, CUDAPm1 v0.20 err = 0.14063 (1:19 real, 0.7920 ms/iter, ETA 1:16) M15000031, 0xe80933bd37a9f9a9, n = 800K, CUDAPm1 v0.20 Stage 1 complete, estimated total time = 5:14 Starting stage 1 gcd. M15000031 Stage 1 found no factor (P-1, B1=275000, B2=8112500, e=0, n=800K CUDAPm1 v0.20) Starting stage 2. Using b1 = 275000, b2 = 8112500, d = 2310, e = 12, nrp = 480 Zeros: 348644, Ones: 429436, Pairs: 93347 Processing 1 - 480 of 480 relative primes. Inititalizing pass... done. 
transforms: 31221, err = 0.14453, (13.64 real, 0.4369 ms/tran, ETA NA) Transforms: 229552 M15000031, 0x6760f107920d3922, n = 800K, CUDAPm1 v0.20 err = 0.14844 (1:35 real, 0.4140 ms/tran, ETA 4:36) Transforms: 218518 M15000031, 0x45ab02ca00c98138, n = 800K, CUDAPm1 v0.20 err = 0.14063 (1:31 real, 0.4142 ms/tran, ETA 3:06) Transforms: 214190 M15000031, 0x1f7fed9dfd61de18, n = 800K, CUDAPm1 v0.20 err = 0.14844 (1:28 real, 0.4145 ms/tran, ETA 1:37) Transforms: 235492 M15000031, 0xbff7fea3340e621f, n = 800K, CUDAPm1 v0.20 err = 0.15625 (1:38 real, 0.4160 ms/tran, ETA 0:00) Stage 2 complete, 928973 transforms, estimated total time = 6:25 Starting stage 2 gcd. M15000031 has a factor: 1178543237739460982839 (P-1, B1=275000, B2=8112500, e=12, n=800K CUDAPm1 v0.20) [/CODE]And prime95 v29.8b does too[CODE][Wed Feb 12 15:47:41 2020] P-1 found a factor in stage #2, B1=255000, B2=5737500, E=12. UID: Kriesel/peregrine, M15000031 has a factor: 1178543237739460982839 (P-1, B1=255000, B2=5737500, E=12) [/CODE] |
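The factor reported above by CUDAPm1 and prime95 can be sanity-checked independently in a couple of lines: any factor f of M_p = 2^p - 1 satisfies 2^p ≡ 1 (mod f), and every prime factor of a Mersenne number with prime exponent has the form 2kp+1, so f ≡ 1 (mod 2p):

```python
# Independent check of the factor of M15000031 reported above.
p = 15000031
f = 1178543237739460982839

assert pow(2, p, f) == 1       # f divides 2^p - 1
assert (f - 1) % (2 * p) == 0  # f = 2*k*p + 1 for some integer k
```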
gpuowl-win build 6.11-145-g6146b6d
2 Attachment(s)
Here it is, tested only as far as the help output.
|
Latest commit gpuowl is crashing on Colab Tesla P4
It ran P-1 below 20M with -maxAlloc 7500, but crashes on 20M.[CODE]2020-02-12 22:15:08 gpuowl v6.11-145-g6146b6d-dirty
2020-02-12 22:15:09 config: -user kriesel -cpu colab3/TeslaP4 -yield -maxAlloc 7000 -use NO_ASM 2020-02-12 22:15:09 device 0, unique id '' 2020-02-12 22:15:09 colab3/TeslaP4 20000023 FFT 1152K: Width 8x8, Height 256x4, Middle 9; 16.95 bits/word 2020-02-12 22:15:09 colab3/TeslaP4 OpenCL args "-DEXP=20000023u -DWIDTH=64u -DSMALL_HEIGHT=1024u -DMIDDLE=9u -DWEIGHT_STEP=0x1.0840814dcafb8p+0 -DIWEIGHT_STEP=0x1.f002ed51e880ap-1 -DWEIGHT_BIGSTEP=0x1.172b83c7d517bp+0 -DIWEIGHT_BIGSTEP=0x1.d5818dcfba487p-1 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2020-02-12 22:15:09 colab3/TeslaP4 2020-02-12 22:15:09 colab3/TeslaP4 OpenCL compilation in 0.01 s 2020-02-12 22:15:09 colab3/TeslaP4 20000023 P1 B1=240000, B2=5760000; 346123 bits; starting at 346122 2020-02-12 22:15:09 colab3/TeslaP4 20000023 P1 346123 100.00%; 43626 us/it; ETA 0d 00:00; 6a2e08e14df5900e 2020-02-12 22:15:09 colab3/TeslaP4 P-1 (B1=240000, B2=5760000, D=30030): primes 376241, expanded 390008, doubles 69210 (left 241927), singles 237821, total 307031 (82%) 2020-02-12 22:15:09 colab3/TeslaP4 20000023 P2 using blocks [8 - 192] to cover 307031 primes 2020-02-12 22:15:09 colab3/TeslaP4 20000023 P2 using 759 buffers of 9.0 MB each 2020-02-12 22:15:21 colab3/TeslaP4 Exception gpu_error: MEM_OBJECT_ALLOCATION_FAILURE clEnqueueCopyBuffer(queue, src, dst, 0, 0, size, 0, NULL, NULL) at clwrap.cpp:344 copyBuf 2020-02-12 22:15:21 colab3/TeslaP4 Bye[/CODE]That's an 8GB gpu. [url]https://www.mersenneforum.org/showpost.php?p=533245&postcount=15[/url] |
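Back-of-envelope arithmetic on the log above suggests why the allocation fails: the stage 2 buffer pool alone nearly exhausts the -maxAlloc 7000 (MB) budget, before counting the FFT working buffers or whatever the driver reserves on the card (the exact overheads are unknown here):

```python
# The log reports "759 buffers of 9.0 MB each" for P2.
buffers = 759
buf_mb = 9.0
p2_total_mb = buffers * buf_mb    # 6831.0 MB for the P2 pool alone
headroom_mb = 7000 - p2_total_mb  # only 169.0 MB left under -maxAlloc 7000
```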
Try with a smaller -maxAlloc. Can you check the free memory on the GPU -- how much is free before and during the gpuowl run?
[QUOTE=kriesel;537447]It ran below 20M P-1 with -maxAlloc 7500, and crashes on 20M.[CODE]2020-02-12 22:15:08 gpuowl v6.11-145-g6146b6d-dirty 2020-02-12 22:15:09 config: -user kriesel -cpu colab3/TeslaP4 -yield -maxAlloc 7000 -use NO_ASM 2020-02-12 22:15:09 device 0, unique id '' 2020-02-12 22:15:09 colab3/TeslaP4 20000023 FFT 1152K: Width 8x8, Height 256x4, Middle 9; 16.95 bits/word 2020-02-12 22:15:09 colab3/TeslaP4 OpenCL args "-DEXP=20000023u -DWIDTH=64u -DSMALL_HEIGHT=1024u -DMIDDLE=9u -DWEIGHT_STEP=0x1.0840814dcafb8p+0 -DIWEIGHT_STEP=0x1.f002ed51e880ap-1 -DWEIGHT_BIGSTEP=0x1.172b83c7d517bp+0 -DIWEIGHT_BIGSTEP=0x1.d5818dcfba487p-1 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2020-02-12 22:15:09 colab3/TeslaP4 2020-02-12 22:15:09 colab3/TeslaP4 OpenCL compilation in 0.01 s 2020-02-12 22:15:09 colab3/TeslaP4 20000023 P1 B1=240000, B2=5760000; 346123 bits; starting at 346122 2020-02-12 22:15:09 colab3/TeslaP4 20000023 P1 346123 100.00%; 43626 us/it; ETA 0d 00:00; 6a2e08e14df5900e 2020-02-12 22:15:09 colab3/TeslaP4 P-1 (B1=240000, B2=5760000, D=30030): primes 376241, expanded 390008, doubles 69210 (left 241927), singles 237821, total 307031 (82%) 2020-02-12 22:15:09 colab3/TeslaP4 20000023 P2 using blocks [8 - 192] to cover 307031 primes 2020-02-12 22:15:09 colab3/TeslaP4 20000023 P2 using 759 buffers of 9.0 MB each 2020-02-12 22:15:21 colab3/TeslaP4 Exception gpu_error: MEM_OBJECT_ALLOCATION_FAILURE clEnqueueCopyBuffer(queue, src, dst, 0, 0, size, 0, NULL, NULL) at clwrap.cpp:344 copyBuf 2020-02-12 22:15:21 colab3/TeslaP4 Bye[/CODE]That's an 8GB gpu. [url]https://www.mersenneforum.org/showpost.php?p=533245&postcount=15[/url][/QUOTE] |
Did you try with -use ORIG_SLOWTRIG?
[QUOTE=kriesel;537444]Latest gpuowl commit is still missing the 15M[/QUOTE] |