[QUOTE=kriesel;546206]Update to 6.11-288, tried larger fft, ULTRA_TRIG=1, still trouble:[/QUOTE]
ULTRA_TRIG=1 and MIDDLE=11 are a problem -- needs some research on my part. That's why ULTRA_TRIG is not used anywhere by default. We are looking at changing the FFT crossovers to be more conservative. This is made a bit more difficult by a ROCm optimizer bug that sometimes corrupts the output of our sin/cos routines. FYI, there are 6 crossover points within each FFT length, as gpuowl cycles through several different and more accurate implementations of some code (the MM_CHAIN and MM2_CHAIN variables). We'll make it better soon.
[QUOTE=ATH;546219]Of course the manual submissions do not have enough information about whether the results are reliable or not, thanks.[/QUOTE]
Current versions of gpuowl report the number of GEC error detections; some earlier versions did not.
[QUOTE=Prime95;546145]Judging from these flags -DMM_CHAIN=2u -DMM2_CHAIN=3u the exponent is near the upper limits of that FFT length. Our gpuowl defaults may be too aggressive.
Assuming you saved the problematic files: Can you try getting past the bad iteration with "ULTRA_TRIG=1"? Can you get past with "MM_CHAIN=3"?[/QUOTE]

George & Mihai, would it be possible to reduce the odds of the program simply quitting in the presence of such errors (whether caused by the ROCm optimizer, overaggressive breakpoints, or hardware glitches) by using the nearness of the expo to the default max-p for the given FFT length, via something like the following, which is more or less the way I do things in Mlucas? Note that even in the absence of per-iteration, per-convolution-output ROE checking, we can use the bits-per-word to make a reasonable inference as to whether excess ROE might be the cause of a J-check or G-check failure.

[1] If the given expo is not running at the max-accuracy-needed settings for the FFT length in question but is near the default crossover to the next-more-accurate MM_CHAIN/MM2_CHAIN settings, switch to those more-accurate settings;

[2] If the given expo is already running at the max-accuracy-needed settings for the FFT length in question, reset MM_CHAIN/MM2_CHAIN to least-accurate mode and switch the run to the next-larger FFT length;

[3] If the run is already at a larger-than-default FFT length as a result of [2], assume either the user's hardware is faulty or some kind of transient data glitches are occurring, and switch *back* to the default FFT length and MM_CHAIN/MM2_CHAIN settings.

The strategy in [3] might strike some as bizarre, but: [a] I've actually seen such stuff occur not-infrequently on my favorite flaky-hardware test system, my old Haswell; [b] if all the above workarounds fail because the user's hardware or software install is borked, the run flailing around endlessly leaves the user no worse off than if the run had simply aborted, and - this is important - we may get some valuable diagnostic data from the logfile capturing the flailing-around.
We need to operate under the assumption that all manner of errors might bork an ongoing run, and code our software to be robust in the face of them - users get a whole lot less upset at the program continuing to run, perhaps in a slightly suboptimal mode for the given expo, than at waking up or checking in to find it simply quit hours or days earlier.
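The three-step escalation ladder described above can be sketched as a tiny state machine. This is an illustrative sketch only, not gpuowl or Mlucas code; the names `DEFAULT_CHAIN`, `MAX_CHAIN`, `FFT_SIZES`, and the chain values are assumptions made for the example.

```python
# Hypothetical sketch of the [1]/[2]/[3] fallback ladder; the constants
# below are illustrative stand-ins, not gpuowl's actual settings.
DEFAULT_CHAIN = 0                      # least-accurate MM_CHAIN/MM2_CHAIN mode
MAX_CHAIN = 4                          # assumed max-accuracy-needed mode
FFT_SIZES = [4608, 5120, 5632, 6144]   # assumed available FFT lengths (K words)

def next_settings(fft, chain, default_fft):
    """Return (fft, chain) to retry with after a J-check/G-check failure."""
    if chain < MAX_CHAIN:
        # [1] not yet at max accuracy: bump the chain settings one step
        return fft, chain + 1
    if fft == default_fft:
        # [2] at max accuracy: reset chain, move to the next-larger FFT length
        return FFT_SIZES[FFT_SIZES.index(fft) + 1], DEFAULT_CHAIN
    # [3] already above the default FFT length: suspect hardware or transient
    # glitches and fall back to the default FFT length and settings
    return default_fft, DEFAULT_CHAIN
```

Note that [3] deliberately makes the ladder cycle rather than terminate, matching the "flailing around beats aborting" argument above.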
[QUOTE=ATH;546176]I can find 19 "PRP Bad" which is a lot more than I thought, but it is only 4 different users?[/QUOTE]
Sir Rutherford's were from prime95's suspect Gerbicz check. My 3 were from when I switched to PRP but forgot to upgrade prime95 to a Gerbicz-capable version. The other 2 users were using pre-Gerbicz prime95 versions. In summary, nothing to see here.
[QUOTE=ewmayer;546228]George & Mihai, would it be possible to reduce the odds of the program simply quitting in the presence of such errors.[/QUOTE]
Current plan is to recompute the default crossovers to be more conservative. Your idea to restart PRP with the next MM_CHAIN setting is worthwhile. Preda can comment on how difficult that would be to implement. Other ideas include: 1) More conservative default settings for LL and P-1 as opposed to PRP. 2) A command line parameter for those that want to be more aggressive. Thanks for everyone's patience. It is a work in progress.
[QUOTE=ewmayer;546228]We need to operate under the assumption that all manner of errors might bork an ongoing run, and code our software to be robust in the face of them - users get a whole lot less upset at the program continuing to run, perhaps in a slightly suboptimal mode for the given expo, than at waking up or checking in to find it simply quit hours or days earlier.[/QUOTE]Amen to that. Even trying the next-longer fft is worthwhile. After running through whatever list is settled on for "if the default or user-entered settings generate errors too frequently, try successive alternate approaches on the first worktodo entry", comment the problematic entry out and give the next entry a try. You'll probably want an option to lock out that substitution process, such as for testing.[QUOTE=Prime95;546241]Current plan is to recompute the default crossovers to be more conservative.
Your idea to restart PRP with the next MM_CHAIN setting is worthwhile. Preda can comment on how difficult that would be to implement. Other ideas include: 1) More conservative default settings for LL and P-1 as opposed to PRP. 2) A command line parameter for those that want to be more aggressive. Thanks for everyone's patience. It is a work in progress.[/QUOTE]Great progress for 3 years, especially considering the low license cost. And hardware manufacturers and the march up the exponent scale virtually guarantee it will remain a work in progress for quite some time.
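The worktodo-substitution idea above (try a ladder of alternate settings on the first entry; if all fail, comment it out and move to the next, with a lock-out option for testing) could look roughly like the following. All names here are hypothetical, not gpuowl's actual code or worktodo format.

```python
# Hypothetical sketch of the "substitute settings, then skip entry" flow.
# FALLBACKS is an assumed ladder of alternate approaches to try in order.
FALLBACKS = ["default", "MM_CHAIN+1", "next-larger FFT", "default again"]

def process(worktodo, try_entry, allow_substitution=True):
    """try_entry(entry, settings) -> True on success, False on failure.
    Returns the resulting worktodo list, with hopeless entries commented out."""
    result = []
    for entry in worktodo:
        attempts = FALLBACKS if allow_substitution else FALLBACKS[:1]
        for settings in attempts:
            if try_entry(entry, settings):
                result.append(entry)
                break
        else:
            # every alternate approach failed: comment the entry out
            # and carry on with the next one
            result.append("# " + entry)
    return result
```

With `allow_substitution=False`, only the default settings are attempted, giving the lock-out behavior suggested for testing.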
[QUOTE=ewmayer;546228]George & Mihai, would it be possible to reduce the odds of the program simply quitting in the presence of such errors (whether caused by the ROCm optimizer, overaggressive breakpoints, or hardware glitches) by using the nearness of the expo to the default max-p for the given FFT length, via something like the following, which is more or less the way I do things in Mlucas? Note that even in the absence of per-iteration, per-convolution-output ROE checking, we can use the bits-per-word to make a reasonable inference as to whether excess ROE might be the cause of a J-check or G-check failure.
[/QUOTE] There is no need to push the FFT limits into the danger zone; the goal is to have them set safely such that it's extremely unlikely to hit an error because of FFT size. In this situation, adapting the size dynamically would just "hide the error" in the FFT limits. I'd rather hear the error loud and clear and fix it. I do have plenty of errors on my poor GPUs, and (in my case) they've never been because of FFT limits yet (unfortunately, they're more serious HW errors). Anyway, adding adaptive-FFT on top of that would be very confusing. There is one way to reliably detect FFT errors though: if the same error is hit in the same place *with the same residue*, i.e. the error is deterministic, then it's unlikely to be HW.
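The deterministic-repeat test described above - restart from the last checkpoint and see whether the failure reproduces at the same iteration with the same residue - reduces to a simple comparison. A minimal sketch, where the `(failed_iteration, res64)` record format is an assumption for illustration:

```python
def classify_failure(first, second):
    """first/second: (failed_iteration, res64) pairs observed on two runs
    restarted from the same checkpoint."""
    if first == second:
        # same place, same residue: deterministic, so unlikely to be HW;
        # suspect FFT limits or a software bug instead
        return "deterministic"
    return "likely-hardware"
```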
In my experience one can never be 100% safe with exponent limits ... one wants to set them as aggressively as reasonably possible to save wasted work, but we simply cannot eliminate the possibility of one iteration of a particular test hitting an 'unlucky' combination of FFT inputs which causes one or more convolution outputs to have an anomalously high fractional part. With per-iteration, per-output ROE checking those will result in a repeatable-on-retry ROE, which the code is set up to handle and is no big deal; we just need the breakpoints to be set so as to make this a rare event.
People have done rigorous ROE bounding for these kinds of computations, and the result is invariably that to be 100% safe, one must use exponent bounds *drastically* lower than we actually do. But, not my call outside of my own code.
[QUOTE=ewmayer;546250]In my experience one can never be 100% safe with exponent limits ... one wants to set them as aggressively as reasonably possible to save wasted work, but we simply cannot eliminate the possibility of one iteration of a particular test hitting an 'unlucky' combination of FFT inputs which causes one or more convolution outputs to have an anomalously high fractional part. With per-iteration, per-output ROE checking those will result in a repeatable-on-retry ROE, which the code is set up to handle and is no big deal; we just need the breakpoints to be set so as to make this a rare event.
People have done rigorous ROE bounding for these kinds of computations, and the result is invariably that to be 100% safe, one must use exponent bounds *drastically* lower than we actually do. But, not my call outside of my own code.[/QUOTE] It can be modelled statistically with a distribution. Then you don't talk about being 100% safe, but about the probability of having at least one roundoff overflow over the number of iterations of the whole test. E.g. the FFT bounds could be set such that the probability of "at least one" roundoff overflow over the 100M iterations of a test is under 0.001, which would mean that I expect one roundoff problem in 1000 tests, and that's it.
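Working that target backwards: if we want the probability of at least one overflow over roughly 100M iterations to stay under 0.001, the allowed per-iteration probability follows from 1 - (1 - p)^n, assuming iterations fail independently. A quick sketch:

```python
def max_per_iter_prob(target_total, iterations=100_000_000):
    """Largest per-iteration roundoff-overflow probability p such that
    1 - (1 - p)**iterations <= target_total (independence assumed)."""
    return 1.0 - (1.0 - target_total) ** (1.0 / iterations)

# For small targets this is approximately target_total / iterations,
# i.e. about 1e-11 per iteration for the 0.001-per-test goal.
p = max_per_iter_prob(0.001)
```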
[QUOTE=preda;546262]E.g. the FFT bounds could be set such that the probability of "at least one" roundoff overflow over the 100M iterations of a test is under 0.001, which would mean that I expect one roundoff problem in 1000 tests, and that's it.[/QUOTE]
That is what he was calling "*drastically* lower". This way, lots of tests that wouldn't cause an error in the "aggressive" version would run a lot slower with the larger FFT. Then you would have to find a compromise between aggressiveness (total speed or output, including re-doing a few failed cases two or more times) and "safety" (your version, where there are no errors and no re-dos, but many of the tests run slower). In fact, [U][B]that[/B][/U] will impose the limit, and not the 0.001% or whatever. And guess what: to find that compromise, all the discussion has to start again from scratch. This is what we have been doing for about 20 years now... moving the limits a bit up, then a bit down, when tweaking the software and new ideas pop up, or when somebody finds boundary-related errors, or when somebody sees that some tests are too slow and would run faster (and still correct) with a slightly lower FFT.
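The compromise described above can be framed as an expected-cost comparison: aggressive limits are faster per test but pay a redo penalty, while conservative limits trade that for a uniformly slower FFT. A toy model, with made-up numbers purely for illustration:

```python
def expected_time(test_time, p_fail):
    """Expected wall time per completed test if a failure forces a full
    redo: the expected number of attempts is 1 / (1 - p_fail)
    (geometric distribution, failures assumed independent)."""
    return test_time / (1.0 - p_fail)

# Made-up numbers: suppose the aggressive FFT choice is 5% faster per test
# but fails 1% of tests, while the conservative choice fails only 0.1%.
aggressive = expected_time(100.0, 0.01)     # hours per completed test
conservative = expected_time(105.0, 0.001)  # hours per completed test
```

Under these (invented) numbers the aggressive setting still wins on expected throughput, which is exactly the sense in which the throughput compromise, rather than the bare error target, ends up imposing the limit.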
[QUOTE=kriesel;546224]Current versions of gpuowl report the number of GEC error detections; some earlier versions did not.[/QUOTE]
I just checked: even though the gpuowl version I use on Colab has [B]"errors":{"gerbicz":0}[/B] in the JSON result line, PrimeNet still says only "PRP Unverified" without the "(Reliable)", so I guess it is just a manual-results turn-in issue. [M]91385527[/M]