mersenneforum.org  

2021-10-28, 19:50   #1
techn1ciaN
Oct 2021
U.S. / Maine
97 Posts

Very strange proof problem

First off: GPUOwl 7.2.63 on up-to-date Windows 10.

I decided today to try a modest undervolt for GPUOwl on my Radeon 5700 XT. I set the voltage I wanted in Radeon Software and -log 10000 in config.txt, then began stepping the clock by 25 MHz, starting GPUOwl, and watching for GEC (Gerbicz error check) failures. (I just did this with the exponent I was already working on rather than loading a test exponent, because I figured the GEC failure rollback would save me, especially with a 10,000-iteration check interval.) I kept doing this until I saw some failures, then backed off one step and started GPUOwl again for a longer burn-in test.

Upon starting up this time, I got a failed proof residue validation. But the problematic residue was stated to be from much earlier in the test, well before I started screwing with anything (I was at iteration 33,xxx,xxx and the mismatch was stated to be in a residue from around 15,xxx,xxx). Also, GPUOwl automatically tried validating my residues for the next proof power down (I use 10, so it tried 9), and that passed and the test resumed.

What happened here? I tried deleting every save file and temporary proof file from after I started my undervolting process, and also reverting to my 5700 XT's base voltage and clock, but neither resolved anything. At that point I didn't want to risk turning in a bad proof, so I cut my losses and unreserved the exponent, but I'm still very curious about exactly how this problem arose. All input appreciated.
2021-10-28, 20:21   #2
techn1ciaN
Oct 2021
U.S. / Maine
97₁₀ Posts

I realize I was being a bit unspecific. Here is the actual GPUOwl printout, copied from my log file:
Code:
114482779 OK  38410000 on-load: blockSize 400, 42fed0e4b7671ebf
114482779 validating proof residues for power 10
114482779 checksum 13cd8e0d (expected 3bb9aafa) in '.\114482779\proof\15540145'
114482779 validating proof residues for power 9
114482779 Proof using power 9 (vs 10) for 114482779
(I slightly misremembered what my progress in the test was.)

This is from after I took the step of deleting the noted files, but the error appeared exactly the same before that, right down to the expected and actual checksums.
2021-10-28, 20:57   #3
techn1ciaN
Oct 2021
U.S. / Maine
1100001₂ Posts

I may have an insight. I reviewed further up in my log to see if I could notice anything about the run just before the problem started appearing. It turns out I had killed the process while it was in the middle of validating proof residues (I caught an incorrectly set Radeon Software parameter and needed to fix it). Is it possible that the problematic residue is the one the program was reading at that moment, and that aborting the operation corrupted it?

I will feel stupid if this turns out to have nothing to do with undervolting (although I'm now even more confused about how the bad residue still validated for proof power 9).
2021-10-28, 21:42   #4
frmky
Jul 2003
So Cal
3²·13·19 Posts

A stab in the dark... Perhaps the residues were fine, but the voltage was still too low and an error happened during the validation.
2021-10-28, 21:58   #5
techn1ciaN
Oct 2021
U.S. / Maine
1100001₂ Posts

Quote:
Originally Posted by frmky
A stab in the dark... Perhaps the residues were fine, but the voltage was still too low and an error happened during the validation.

Dubious. I applied my 5700 XT's default clock and voltage and restarted GPUOwl, but got the same error with the same bad checksum.
2021-11-19, 14:49   #6
preda
"Mihai Preda"
Apr 2015
2537₈ Posts

Quote:
Originally Posted by techn1ciaN
What happened here?

I don't understand what happened. The proof residues are written only once; afterwards they are only read. The check of residues at startup is done CPU-side only (it's a very simple checksum over the file). But if the check itself is suspect, a simple restart would redo the check of the proof files, and if the outcome is reproducible, then it's reliable.
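
To illustrate why a reproducible mismatch points at the bytes on disk rather than at the GPU, here is a minimal Python sketch of a CPU-side file checksum. The layout (a 4-byte little-endian CRC32 followed by the residue payload) and the function names are assumptions made for illustration, not GpuOwl's documented file format. Because the check is a pure function of the file contents, rerunning it on an unchanged file has to give the same answer every time.

```python
import struct
import zlib
from pathlib import Path


def pack_residue(payload: bytes) -> bytes:
    """Build a file image: CRC32 header + payload (hypothetical layout)."""
    checksum = zlib.crc32(payload) & 0xFFFFFFFF
    return struct.pack("<I", checksum) + payload


def check_residue_bytes(raw: bytes) -> tuple[bool, int, int]:
    """Return (ok, actual, expected) for a file image in the assumed layout."""
    if len(raw) < 4:
        raise ValueError("file too short to contain a checksum header")
    (expected,) = struct.unpack("<I", raw[:4])
    actual = zlib.crc32(raw[4:]) & 0xFFFFFFFF
    return actual == expected, actual, expected


def check_residue_file(path: Path) -> tuple[bool, int, int]:
    """Same check, reading the bytes from disk (CPU only, no GPU involved)."""
    return check_residue_bytes(path.read_bytes())


if __name__ == "__main__":
    image = bytearray(pack_residue(bytes(range(64))))
    print(check_residue_bytes(bytes(image))[0])   # intact image
    image[10] ^= 0x01                             # flip one payload bit
    ok, actual, expected = check_residue_bytes(bytes(image))
    print(ok, f"checksum {actual:08x} (expected {expected:08x})")
```

In this model, seeing the same actual and expected values on every restart means the stored bytes themselves differ from what was originally written; restoring the GPU's default clock and voltage cannot change the outcome.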

I understand that you did restart the process a few times, and that it did check the 15540145 proof file as correct, only for it to turn bad at some later point. That is strange, because I don't expect that file to mutate.

Anyway, you're lucky because you can still generate a power-9 proof, which is perfectly fine. If you still have the data around, I'd suggest you finish the exponent, the proof should be good.
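
As for why power 9 can still pass: a power-p proof needs 2^p checkpoint residues, and a lower power uses only a subset of them. The sketch below assumes the checkpoints are placed by repeatedly halving the exponent with rounding up. That placement rule is an assumption about GpuOwl's internals, although it does reproduce the file name 15540145 from the log earlier in the thread. Under that assumption, 15540145 is a checkpoint that only power 10 uses, so a power-9 proof never reads it.

```python
def proof_points(exponent: int, power: int) -> list[int]:
    """Checkpoint iterations for a proof of the given power (assumed rule)."""
    points = [0]
    span = (exponent + 1) // 2
    for _ in range(power):
        # Each level adds the current span to every point found so far,
        # doubling the number of points.
        points += [p + span for p in points]
        span = (span + 1) // 2
    # Assumption: the point at iteration 0 is replaced by the final iteration.
    points[0] = exponent
    return sorted(points)


EXPONENT = 114482779      # exponent from the log
BAD_RESIDUE = 15540145    # file name from the checksum error

power10 = set(proof_points(EXPONENT, 10))
power9 = set(proof_points(EXPONENT, 9))

# Is the bad residue needed at power 10? At power 9?
print(BAD_RESIDUE in power10, BAD_RESIDUE in power9)  # True False
# Every power-9 checkpoint is also a power-10 checkpoint.
print(power9 <= power10)
```

Half of the power-10 checkpoints are unused at power 9, so a single bad file has roughly even odds of leaving the power-9 proof untouched.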

OTOH, it's true that it's a problem that I can't see how that file error appeared.

Last fiddled with by preda on 2021-11-19 at 14:50

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.