mersenneforum.org  

Old 2017-08-09, 18:42   #56
kriesel
 

Quote:
Originally Posted by preda
Yes it makes sense to [also] do the check at the end, before submitting the final residue.

It's not clear to me what the right behavior of the software is when the check fails -- should it roll back to the most recent "good" point and re-attempt from there, or should it report the hardware as hopelessly broken and give up?

The situation is that, when the Jacobi check detects the first "sure" error, there is a high probability that there were also undetected errors before it. Thus rolling back to a recent point only fixes the visible error while preserving the hidden ones -- not good.
Here's what I would suggest. If an error is detected, roll back to the last believed-good save state and retry, up to a small limited number of retries (say, three). If one of the retries succeeds, it might be fine, or multiple errors might still be hiding. If none of the retries succeed, save the save files, log a warning message, and go on to the next work. (A user with multiple systems might want to take the last believed-good save state and run it to completion on another, more reliable system to salvage the work done, or run it on new hardware after replacing the failing hardware.)
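To make that policy concrete, here is a minimal sketch in Python; every helper name in it (load_last_good_checkpoint, advance_iterations, jacobi_check_passes, and so on) is hypothetical and not taken from Prime95, gpuOwL, or any other GIMPS program.
Code:
# Hypothetical sketch of the retry policy suggested above; all helper names are made up.
MAX_RETRIES = 3

def run_with_jacobi_retries(exponent):
    state = load_last_good_checkpoint(exponent)      # believed-good starting point
    retries = 0
    while not state.finished():
        state = advance_iterations(state)            # run one batch of LL iterations
        if jacobi_check_passes(state):
            save_checkpoint(state)                   # this becomes the new believed-good point
            retries = 0
        else:
            retries += 1
            if retries > MAX_RETRIES:
                archive_save_files(exponent)         # keep the files for salvage on better hardware
                log_warning("Jacobi check failed repeatedly; moving on to the next assignment")
                return None
            state = load_last_good_checkpoint(exponent)  # roll back and try again
    return state.final_residue()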

PrimeNet's track record overall, absent flaky failing hardware and misconfiguration or pilot-error issues such as an incompatible thread count on a GPU or too small an FFT length for an exponent, is pretty good for these lengthy computations, mostly done without ECC memory. Worst case, a bad LL test gets submitted and is probably identified years later with a double check and triple check.

If I recall correctly, the percentage of bad residues at completion is increasing as we progress to higher exponents, and is currently around 4%?

Thank you, Mihai, for taking the initiative and pursuing a new avenue to detect and reduce the impact of computational errors along the way!

Old 2017-08-09, 19:04   #57
GP2
 

Quote:
Originally Posted by kriesel
If I recall correctly, the percentage of bad residues at completion is increasing as we progress to higher exponents, and is currently around 4%?
There is a selection effect because strategic double checks are being done systematically on likely-bad exponents well ahead of the wavefront. So in the interim there is an artificially high error rate for higher exponents, which will settle back down to normal as the advancing wavefront finishes double-checking all the routine correct first-time checks.
Old 2017-08-09, 19:22   #58
kriesel
 

Quote:
Originally Posted by GP2
In addition to any interim checking, the Jacobi check should be done at the very end, on the final residue, before it is truncated to 64 bits and sent to GIMPS.

The interim checks benefit only the user in question, by saving some wasted computing cycles; the check at the end, on the other hand, gives GIMPS a 50% chance of identifying a bad result with certainty.

If this final check fails, and no recovery from an interim savefile is possible or desired, then the 64-bit residue is garbage and rather than reporting a meaningless wrong value it should be set to some special marker value like 00000000DEADBEEF.
How hard would it be to support, in the application's result-report line and in the PrimeNet server's processing and reports, a separate field for expected result reliability: last error status, error-detection type and count, or error-status history? Recovered Jacobi errors could be represented as j3, excessive roundoff error recovered on retry as e2, for example (a perfect run being e0j0). This shows which tests are implemented and the frequency of retries. To compensate for shorter save intervals causing the Jacobi check to run more often, add a second value giving how many checks were made:
e3,170j2,17
representing roundoff error checked 170 times in the exponent's run with excessive roundoff recovered from by retry 3 times, and the Jacobi symbol checked 17 times with 2 recoveries. One could also make a case for continuing despite a Jacobi retry failure; that's the same error rate we've had until now without the test being implemented. Those failures could also be counted, and would mark the result as somewhat suspect, an early candidate for double-check, the more so as the count of separate error occurrences per exponent or per piece of hardware increases:
e3,170j2,17J3
For systems of equal reliability, more frequent error checking will detect errors more often; counting and reporting the total number of Jacobi checks may allow compensating for that confounding effect.
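To illustrate, here is a tiny Python formatter for the proposed field. The format is purely the suggestion above; neither the client programs nor PrimeNet define anything like it today.
Code:
# Hypothetical formatter for the proposed error-status field, e.g. "e3,170j2,17J3".
def format_error_status(re_recovered, re_checks, j_recovered, j_checks, j_failed=0):
    s = "e%d,%d" % (re_recovered, re_checks)   # roundoff errors recovered, roundoff checks made
    s += "j%d,%d" % (j_recovered, j_checks)    # Jacobi errors recovered, Jacobi checks made
    if j_failed:
        s += "J%d" % j_failed                  # unrecovered Jacobi failures where the run continued
    return s

print(format_error_status(3, 170, 2, 17, 3))   # e3,170j2,17J3
print(format_error_status(0, 170, 0, 17))      # e0,170j0,17 -- a clean run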

Other possibilities, which I feel are inferior, include negative numbers representing error types or counts in place of the residue, or non-hex characters in the residue. Who knows, "00000000DEADBEEF" might actually be a legitimate final residue sometime. And as an error flag it is vague; did it fail the Jacobi check, the zero-residue check, the repeating-twos check, or something else?

Given enough residues (monkeys on typewriters), all sorts of things show up. Checking one GPU's logs, I found both "dead" and "beef", though not together; 0x887b4dead500ba11, for example. (Is that flagging a large zombie formal dance?)
Old 2017-08-09, 19:34   #59
kriesel
 

Quote:
Originally Posted by error
This would probably depend on the frequency of the errors caught.

If single errors are observed only during a few complete tests and there are many (at least apparently) good tests in between, it is probably safe to start from the last (apparently) good iteration.

If most tests are interrupted with errors, one should probably start from scratch, and then proceed as long as the permanent checkpoints agree. The last such checkpoint can then be marked as a good continuation point. The test would be carried on until another error is spotted (or until the end). If the run completes, it might be a good idea to make another run from the last good continuation point (perhaps depending on the number of errors seen during the run). (That would result in a self-verified double-check.)
Maybe I'm misunderstanding you, but it seems you're saying a single run on a single set of hardware can be repeated from perhaps midway and count as a double-check as well as a first-time test. That is not as effective at confirming that matching residues are correct as a separate set of runs on separate hardware, with separate offsets and, ideally, different software with different ancestry. The first half may have had an undetected error that invalidates both completions. The hardware or software may have repeatable issues. The Jacobi test detects some errors and misses others; passing the test indicates a higher probability that the run is correct to that point, but not certainty. Reused code may carry undetected bugs from one program to another.
Old 2017-08-10, 07:18   #60
error
 

Quote:
Originally Posted by kriesel
Maybe I'm misunderstanding you, but it seems you're saying a single run on a single set of hardware can be repeated from perhaps midway and count as a double-check as well as a first-time test. That is not as effective at confirming that matching residues are correct as a separate set of runs on separate hardware, with separate offsets and, ideally, different software with different ancestry. The first half may have had an undetected error that invalidates both completions. The hardware or software may have repeatable issues. The Jacobi test detects some errors and misses others; passing the test indicates a higher probability that the run is correct to that point, but not certainty. Reused code may carry undetected bugs from one program to another.
Not exactly. The scenario applies to the case where, after an indication of failure, you decide to start all over from the beginning but already have interim residues saved at some intervals. You go on and compare the corresponding residues (preferably with a different offset) along the way. If you get a match, you mark them good. If they disagree, you can roll back to the previous one (marked good) and check once more whether you get a match with either of the two you already produced. If the first attempt was the bad one, you discard the rest of its checkpoints and continue to the end, as long as no further visible errors pop up. The end result still may or may not be good. Then you can go back to the last residue marked good (already double-checked), repeat the computation from there (with a different offset), and compare the second set of interim residues with the new ones being produced. If you get a match, you can be fairly sure the hardware did not act up any more. The other issues you mention (software bugs etc.) may still cause the result to be wrong anyway, but that is a different story.
Old 2017-08-10, 11:06   #61
preda
 
"Mihai Preda"

The Jacobi symbol computation for the 75M range on a Ryzen 1700X (3.4 GHz) takes only 30 s single-core (hyperthreading disabled), better than the 50 s on a 2.4 GHz Xeon (likely RAM plays a role too).

For gpuOwL this works nicely because the Jacobi check is done on the CPU "in the shadow" of the GPU kernels -- i.e. while the CPU would normally just be waiting for the enqueued GPU kernels to complete. As a batch of 20K iterations takes about 45 s on the GPU, there is no additional delay introduced by the Jacobi check. Of course, the Jacobi check is not free, because it does take some CPU away from mprime. Now, by default, the check is done every 200K iterations (10 * logstep).

I dropped the -offset option from gpuOwL 0.6 (the only, implicit offset is now 0). Before doing this I created a branch "offset" on github https://github.com/preda/gpuowl/tree/offset in case anybody wants to look. The reason for dropping the offset is that I felt the resulting simplification outweighs the offset's benefit.

Today I added a new option to gpuOwL, -supersafe. This runs every iteration *twice*, using independent memory buffers (so that a memory corruption affects the two computations differently), and checks after each batch (20K iterations) for identical results -- otherwise it retries. The two parallel runs use the same offset (0).

The drawback of -supersafe is that it's twice as slow. The benefit is that it's very strongly protected against hardware errors such as memory corruption (at any level: global/cache/register) or non-systematic arithmetic corruption (if there is such a kind of hardware error).

I have a GPU that went bad; I could not use it for LL anymore. Now, with -supersafe, it's twice as slow but I trust the results again. (Also, I can drop the underclock that I had been using in the hope of improving reliability.)
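For illustration, the duplicate-and-compare loop looks roughly like this (in plain Python rather than gpuOwL's actual OpenCL, where the two copies live in separate GPU buffers; a pure-Python version would of course be hopelessly slow for real exponents):
Code:
# Illustration only: advance two independent copies of the LL state and
# compare after each batch, retrying the batch on a mismatch.
BATCH = 20000

def ll_batch(s, mp, iters):
    for _ in range(iters):
        s = (s * s - 2) % mp
    return s

def supersafe_ll(p):
    mp = (1 << p) - 1
    a = b = 4                              # two independent copies of the state
    done, total = 0, p - 2                 # the LL test runs p - 2 iterations
    while done < total:
        n = min(BATCH, total - done)
        na = ll_batch(a, mp, n)
        nb = ll_batch(b, mp, n)
        if na != nb:
            continue                       # mismatch: redo the batch from the last agreed state
        a, b, done = na, nb, done + n
    return a                               # zero here means 2^p - 1 is prime

print(supersafe_ll(13) == 0)               # M13 = 8191 is prime -> True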

Old 2017-08-10, 11:14   #62
preda
 

A log excerpt; the GPU is a 390X, running with -supersafe and the Jacobi check. (The Jacobi check, I would say, is not really needed when using -supersafe.) The ms/iter is the time for the "double" iteration.
Quote:
00600000 / 75345409 [0.80%], ms/iter: 4.535, ETA: 3d 22:10; 8ea5efdef4456983 error 0.1875 (max 0.1875)
Jacobi-symbol check OK (31628 ms)
00620000 / 75345409 [0.82%], ms/iter: 4.533, ETA: 3d 22:05; 43a2f0925105e63c error 0.15625 (max 0.1875)

Old 2017-08-10, 17:51   #63
GP2
 

I started a thread about this in the Number Theory Discussion Group subforum.

I hope that number theory experts can not only confirm the theoretical soundness of this idea — seems simple, so how did this elude us for so long? — but maybe even suggest additional, independent checks on interim or final Lucas-Lehmer residues, if any exist.
Old 2017-08-10, 18:07   #64
kriesel
 

Quote:
Originally Posted by preda
The Jacobi symbol computation for the 75M range on a Ryzen 1700X (3.4 GHz) takes only 30 s single-core (hyperthreading disabled), better than the 50 s on a 2.4 GHz Xeon (likely RAM plays a role too).

For gpuOwL this works nicely because the Jacobi check is done on the CPU "in the shadow" of the GPU kernels -- i.e. while the CPU would normally just be waiting for the enqueued GPU kernels to complete. As a batch of 20K iterations takes about 45 s on the GPU, there is no additional delay introduced by the Jacobi check. Of course, the Jacobi check is not free, because it does take some CPU away from mprime. Now, by default, the check is done every 200K iterations (10 * logstep).

I dropped the -offset option from gpuOwL 0.6 (the only, implicit offset is now 0). Before doing this I created a branch "offset" on github https://github.com/preda/gpuowl/tree/offset in case anybody wants to look. The reason for dropping the offset is that I felt the resulting simplification outweighs the offset's benefit.

Today I added a new option to gpuOwL, -supersafe. This runs every iteration *twice*, using independent memory buffers (so that a memory corruption affects the two computations differently), and checks after each batch (20K iterations) for identical results -- otherwise it retries. The two parallel runs use the same offset (0).

The drawback of -supersafe is that it's twice as slow. The benefit is that it's very strongly protected against hardware errors such as memory corruption (at any level: global/cache/register) or non-systematic arithmetic corruption (if there is such a kind of hardware error).

I have a GPU that went bad; I could not use it for LL anymore. Now, with -supersafe, it's twice as slow but I trust the results again. (Also, I can drop the underclock that I had been using in the hope of improving reliability.)
That's an interesting approach to memory errors. I have a GPU that shows errors broadly on roughly the middle third of its VRAM, independent of clocking (575 to 1000 MB of a 1.5 GB card). Do you pretest the memory and place the buffers in known-reliable areas? If bad memory could be allocated to do nothing and left that way, and LL testing occurred in memory allocated elsewhere, it might not be necessary to make double runs to get reliable results. (Something like the Linux BadRAM driver, although the required memory block size may interfere here if the allocations need to be contiguous and the bad sections fragment memory too much.) http://rick.vanrein.org/linux/badram/results.html

I'm curious about your choice to run the Jacobi check on the CPU rather than the GPU. Was this a design decision to do the check on different, probably more reliable, hardware than the problematic GPU, or on different hardware in general, or for ease of programming, or something else?

Removing the nonzero offset seems unfortunate. The two -supersafe runs may need to use the same offset, but the same nonzero offset should work. A nonzero offset is a feature that adds value for double checks.
Old 2017-08-10, 21:17   #65
Madpoo

Quote:
Originally Posted by preda
Yes, I agree that the point is not to fight the user.

But assuming the user is well-intentioned and wants to produce high-quality results, upon a Jacobi error should we revert to "the most recent point with no visible error" or revert to the start?
I don't understand all the technical aspects of the Jacobi stuff you're all chatting about, but it does make me think it could be useful to save an interim residue every xx% of the way along a test. Then, if there's something you could run to catch certain errors, you could roll back through those previous temp files until you find the one with the best odds of being good and start from there.

At the end of the test those temp files could be deleted.

Probably the only reason Prime95 doesn't do that now has everything to do with the good old days when it started, back in '96. Drive space (and speed) were factors, and saving a bunch of temp files along the way could have caused issues.

A 79M exponent has a temp file of about 10 MB... keeping one every 10% and maxing out around 100 MB at that exponent size wouldn't be terrible (especially if it were optional, presented as "this could potentially save a lot of time if it finds an error").
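The arithmetic behind those numbers, roughly (a residue is about one bit per exponent, so about p/8 bytes; back-of-the-envelope only, not Prime95's exact file format):
Code:
# Rough sizing for interim save files; ignores headers and any extra data in the real format.
p = 79000000                      # exponent
mb_per_save = p / 8 / 1e6         # residue is about p bits, i.e. ~p/8 bytes
print(mb_per_save)                # ~9.9 MB per save file
print(10 * mb_per_save)           # ~99 MB if one is kept at every 10% mark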

Something to think about anyway.

EDIT: Oh, and that reminds me of another idea that's been brought up before: to have PrimeNet save partial residues at fixed percentages along the way as well, so you'd know sooner during a double-check whether a mismatch happened, or it might be interesting to know at what point along the way the mismatches began.

Old 2017-08-10, 21:29   #66
science_man_88
 

Quote:
Originally Posted by Madpoo
I don't understand all the technical aspects of the Jacobi stuff you're all chatting about, but it does make me think it could be useful to save an interim residue every xx% of the way along a test. Then, if there's something you could run to catch certain errors, you could roll back through those previous temp files until you find the one with the best odds of being good and start from there.

At the end of the test those temp files could be deleted.

Probably the only reason Prime95 doesn't do that now has everything to do with the good old days when it started, back in '96. Drive space (and speed) were factors, and saving a bunch of temp files along the way could have caused issues.

A 79M exponent has a temp file of about 10 MB... keeping one every 10% and maxing out around 100 MB at that exponent size wouldn't be terrible (especially if it were optional, presented as "this could potentially save a lot of time if it finds an error").

Something to think about anyway.

EDIT: Oh, and that reminds me of another idea that's been brought up before: to have PrimeNet save partial residues at fixed percentages along the way as well, so you'd know sooner during a double-check whether a mismatch happened, or it might be interesting to know at what point along the way the mismatches began.
https://en.wikipedia.org/wiki/Jacobi_symbol
https://en.wikipedia.org/wiki/Legendre_symbol
etc.
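For concreteness, a small sketch of how the symbol gets used for the check being discussed (my own illustration, not code from gpuOwL or Prime95). If I have the details right, for the LL sequence s_0 = 4, s_{i+1} = s_i^2 - 2 mod M_p, every interim residue satisfies jacobi(s_i - 2, M_p) = -1 for i >= 1, so a residue whose symbol comes out +1 (or 0) is definitely wrong, and a randomly corrupted residue is caught about half the time.
Code:
# Illustration of the Jacobi check on LL interim residues (not production code).
def jacobi(a, n):                          # Jacobi symbol (a | n), n odd and positive
    a %= n
    t = 1
    while a:
        while a % 2 == 0:                  # factor out 2s, using the value of (2 | n)
            a //= 2
            if n % 8 in (3, 5):
                t = -t
        a, n = n, a                        # quadratic reciprocity
        if a % 4 == 3 and n % 4 == 3:
            t = -t
        a %= n
    return t if n == 1 else 0

def ll_residue_plausible(residue, p):
    mp = (1 << p) - 1
    return jacobi((residue - 2) % mp, mp) == -1

# Tiny demo with p = 13 (M13 = 8191): every genuine interim residue passes the check.
p, mp, s = 13, (1 << 13) - 1, 4
for i in range(p - 2):
    s = (s * s - 2) % mp
    assert ll_residue_plausible(s, p)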