mersenneforum.org SkylakeX teasers (aka prime95 29.5)

2019-02-09, 14:00   #331
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

4,903 Posts

Quote:
 Originally Posted by Prime95 Build 10 for Simon and Ken. 1) Fixed some possible bugs in Gerbicz PRP running on flaky hardware. 2) Fixed (I hope) the hang during benchmarking. Windows 64-bit: ftp://mersenne.org/gimps/p95v295b10.win64.zip
Well that took a while; one benchmark, 1024k-32768k, all types, hyperthreaded and not, launched 10:22 pm Feb 7, finished 4:08 am Feb 9, on the same i7-8750H system on which it previously would hang in minutes. I won't be trying that again soon, too time consuming at ~29.5 hours.

2019-02-09, 16:31   #332
simon389

Aug 2013

3×29 Posts

Double checking PRP with build 10 on my “bad PRP” machine. Had to turn off Gerbicz verbosity 3 because progress had advanced just 0.74% overnight. Turned it off and speeds returned to normal.

2019-02-09, 16:48   #333
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

1CB6₁₆ Posts

Quote:
 Originally Posted by simon389 Double checking PRP with build 10 on my “bad PRP” machine. Had to turn off Gerbicz verbosity 3 because progress had advanced just 0.74% overnight. Turned it off and speeds returned to normal.
That makes no sense. Gerbicz verbosity only controls how much info is output to the screen and results.txt file.

Please send the results.txt file to me.

2019-02-09, 18:25   #334
GP2

Sep 2003

2583₁₀ Posts

Quote:
 Originally Posted by Prime95 That makes no sense. Gerbicz verbosity only controls how much info is output to the screen and results.txt file.
Maybe the additional printf calls themselves are somehow causing problems.

A string buffer overflow somewhere in all those additional verbose strings, or a change in the patterns of register usage, if registers aren't being saved and restored correctly.

That might even tie in with errors happening in the end-of-PRP-test final processing, because new strings get printed at that point, other than the usual iteration count lines.

2019-02-09, 18:28   #335
simon389

Aug 2013

57₁₆ Posts

Quote:
 Originally Posted by Prime95 That makes no sense. Gerbicz verbosity only controls how much info is output to the screen and results.txt file. Please send the results.txt file to me.

I also removed the "Gerbicz offset" line. Maybe that affected things? Every half second it was showing outputs of "Gerbicz checking iteration X" "Test passed" or whatever. After 7 hours it had advanced less than 1%. I'll forward the results.txt file to you.

I will keep that individual machine "broken" for the time being, but I finally "fixed" the other three, meaning I found CPU/RAM settings that have stable AVX512 tests in AIDA64.

Originally it was 4.1GHz, and that was failing. Tried 3.9GHz - failed. Tried 3.8GHz - failed. Tried 3.7GHz - that failed too. Finally, Mysticial suggested undoing the XMP settings for RAM (3600MHz @ 1.35V and 19-20-20-40), and I discovered (to my surprise) that a 3.8GHz CPU and stock RAM settings (2000MHz) had stable AVX512 in AIDA64 for 37 hours on all 3 machines. SUCCESS!

So now I'm leaving the "broken" PRP doublecheck system the way it is for testing purposes (it's currently doing a PRP doublecheck on build 10), but I'm using the other three machines to try different RAM speeds to see which works. Right now I have one still at stock RAM (2000MHz), another one at 3000MHz RAM (default CAS latency, which I think is 15), and another one with RAM at 3400MHz (with 1.3V and 19-20-20-40 like XMP suggests). We'll see which is stable after 24 hours.

2019-02-09, 20:54   #336
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

2·3·5²·7² Posts

This line from Simon's results.txt file is ominous:

Code:
[Fri Feb 08 21:13:48 2019] Start Gerbicz block of size 0 at iteration 0.
As noted in another thread, if block size is zero there is no Gerbicz error checking (although the result is reported as passing Gerbicz with no error recoveries during the test). Simon, do you know if this was build 9 or build 10?

Also, early last month we see:

Code:
[Fri Jan 04 03:13:35 2019] Iteration: 1/79048733, Possible error: round off (0.193936384) > 0
The only way I can see this happening is if a stack variable is corrupt.

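For readers unfamiliar with the check being discussed, here is a toy sketch of the Gerbicz identity on a chain of modular squarings. It is not prime95's code: the function name, the `fault_at` parameter, and verifying after every single block are assumptions made for illustration (a real implementation verifies far less often to keep the overhead low). It also shows why a block size of zero is a problem: no block boundary is ever reached, so nothing is ever compared.

```python
def squarings_with_gerbicz(n, iters, block, base=3, fault_at=None):
    """Compute base^(2^iters) mod n, checking each block with the Gerbicz identity.

    Returns (x, ok). All names here are hypothetical, for illustration only.
    """
    if block <= 0:
        # With no block boundaries, no comparison would ever run.
        raise ValueError("block size must be positive")
    x = base % n
    d = x                      # running product of x taken at block boundaries
    for i in range(1, iters + 1):
        x = x * x % n
        if i == fault_at:
            x ^= 1             # simulate a hardware bit flip
        if i % block == 0:
            d_prev, d = d, d * x % n
            # Gerbicz identity: d_new == base * d_prev^(2^block)  (mod n)
            if d != base * pow(d_prev, 1 << block, n) % n:
                return x, False
    return x, True


n = 2**31 - 1                  # a small Mersenne prime, just for the demo
print(squarings_with_gerbicz(n, 16, 4))              # clean run
print(squarings_with_gerbicz(n, 16, 4, fault_at=8))  # injected fault
```

The clean run reports `ok` as true and the run with the injected fault reports it as false, because the corrupted value enters the running product but not the independently recomputed right-hand side.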
2019-02-09, 21:35   #337
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

1CB6₁₆ Posts

Quote:
 Originally Posted by Prime95 As noted in another thread, if block size is zero there is no Gerbicz error checking (although the result is reported as passing Gerbicz with no error recoveries during the test). Simon, do you know if this was build 9 or build 10?
Ah, found the zeroed block size problem in prime.txt

Code:
PRPGerbiczCompareIntervalAdj=    -206254
This value is supposed to be between 0.001 and 1.0. This value is automatically set smaller when a Gerbicz compare fails. It is raised when a Gerbicz compare succeeds. How it got to this value is a mystery. Anyway, the block size was set to sqrt ( -206254 * 1000000) and I guess the C runtime library returns 0.

Build 10 would have caught this and set block size to 16 -- which also explains why build 10 was running so slowly for Simon.
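A quick sketch of the arithmetic, assuming the block size is simply the truncated value of sqrt(adjustment * 1000000) as the post describes. The function name and the truncation to an integer are assumptions, not prime95's code. For adjustments in the intended 0.001 to 1.0 range this gives block sizes from about 31 to 1000. In Python the negative value fails loudly with a `ValueError`; an unchecked C program gets no such exception, which fits the nonsense block size seen in results.txt.

```python
import math


def raw_block_size(adj):
    """Hypothetical helper: sqrt(adj * 1000000) with no sanity checks at all."""
    return int(math.sqrt(adj * 1_000_000))


for adj in (0.001, 0.25, 1.0):
    print(adj, raw_block_size(adj))

try:
    raw_block_size(-206254)
except ValueError as exc:
    # Python refuses to take the square root of a negative number.
    print("negative adjustment rejected:", exc)
```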

2019-02-09, 21:45   #338
Mysticial

Sep 2016

331 Posts

Nice! Just like that, it sounds like both software and hardware problems are now resolved?

2019-02-09, 23:22   #339
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

4,903 Posts

Quote:
Sanity checks for anything a user could get his hands on are probably a good idea, even if the issue is created by some unforeseen combination of code behavior plus unexpected hardware error, not user error. Asking what characters a creative user could possibly enter or store here, and what effects those might have, removes any filtering expectations about what the program will, won't, or couldn't put there.

2019-02-10, 01:57   #340
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

1110010110110₂ Posts

There are now 4 separate sanity checks that would have caught Simon's problem.
1) The adjustment value is forced to be between 0.001 and 1.0.
2) The Gerbicz block size is forced to be between 25 and the number of iterations remaining.
3) If the iteration counter somehow gets past the end of the Gerbicz error a rollback occurs.
4) If the PRP test completes and the internal PRP state is "in the middle of a Gerbicz block" then a rollback occurs.
More sanity checks are on the way as well as protection against copying errors discussed in another thread.

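The first two checks in the list above amount to range clamps. A minimal sketch, assuming nothing about how prime95 actually implements them (both function names are hypothetical; only the bounds come from the post):

```python
def sanitize_adjustment(adj):
    """Check 1: force the adjustment value into [0.001, 1.0]."""
    return min(max(adj, 0.001), 1.0)


def sanitize_block_size(size, iterations_remaining):
    """Check 2: force the block size into [25, iterations_remaining].

    If fewer than 25 iterations remain, the upper bound wins.
    """
    return min(max(size, 25), iterations_remaining)


# The corrupt prime.txt value and the zero block size from this thread:
print(sanitize_adjustment(-206254))
print(sanitize_block_size(0, 79048733))
```

With these in place, the corrupt adjustment of -206254 becomes 0.001 and a block size of 0 becomes 25, so the test keeps running with checking enabled instead of silently skipping it.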
2019-02-10, 10:54   #341
ET_
Banned

"Luigi"
Aug 2002
Team Italia

2⁶·3·5² Posts

Jacobi errors on v29.5.9

I was doing a double-check on my 9800X (4 threads used by Prime95, 4 threads used by another sieving program), when I had the following message:

Code:
[Work thread Feb 10 11:49] Running Jacobi error check. Failed. Time: 11.230 sec.
[Work thread Feb 10 11:50] Iteration: 17409895/47905967, ERROR: Jacobi error check failed!
[Work thread Feb 10 11:50] Continuing from last save file.
[Work thread Feb 10 11:50] Setting affinity to run helper thread 1 on CPU core #3
[Work thread Feb 10 11:50] Setting affinity to run helper thread 2 on CPU core #4
[Work thread Feb 10 11:50] Setting affinity to run helper thread 3 on CPU core #5
[Work thread Feb 10 11:50] Running Jacobi error check. Failed. Time: 11.132 sec.
[Work thread Feb 10 11:50] Error reading intermediate file: p9M05967
[Work thread Feb 10 11:50] Renaming p9M05967 to p9M05967.bad1
[Work thread Feb 10 11:50] Trying backup intermediate file: p9M05967.bu
[Work thread Feb 10 11:50] Running Jacobi error check. Failed. Time: 11.175 sec.
[Work thread Feb 10 11:50] Error reading intermediate file: p9M05967.bu
[Work thread Feb 10 11:50] Renaming p9M05967.bu to p9M05967.bad2
[Work thread Feb 10 11:50] Trying backup intermediate file: p9M05967.bu2
[Work thread Feb 10 11:50] Running Jacobi error check. Failed. Time: 11.209 sec.
[Work thread Feb 10 11:50] Error reading intermediate file: p9M05967.bu2
[Work thread Feb 10 11:50] Renaming p9M05967.bu2 to p9M05967.bad3
[Work thread Feb 10 11:50] All intermediate files bad. Temporarily abandoning work unit.
Is version 29.5.10 able to recover from such a situation? What should I do with the *.bad savefiles?
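The log above shows a fallback chain: the main save file, then `.bu`, then `.bu2`, with each rejected file renamed to `.bad1`, `.bad2`, and so on. A rough sketch of that pattern, inferred only from the log lines (the function name and the `is_valid` callback, which stands in for the Jacobi check, are hypothetical and not taken from prime95):

```python
from pathlib import Path


def load_first_good(base, is_valid, suffixes=("", ".bu", ".bu2")):
    """Return the first save file that passes is_valid, or None if all fail.

    Each rejected file is renamed to <base>.bad1, <base>.bad2, ... in order.
    """
    base = Path(base)
    bad_count = 0
    for suffix in suffixes:
        candidate = base.with_name(base.name + suffix)
        if not candidate.exists():
            continue
        if is_valid(candidate):
            return candidate
        bad_count += 1
        candidate.rename(base.with_name(f"{base.name}.bad{bad_count}"))
    return None  # corresponds to "All intermediate files bad" in the log
```

When every candidate fails, as in the log, the function returns `None` and leaves three `.bad` files behind, so nothing is deleted and the rejected files remain available for later inspection.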