mersenneforum.org  

2019-02-09, 14:00   #331
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

11447₈ Posts

Quote:
Originally Posted by Prime95
Build 10 for Simon and Ken.

1) Fixed some possible bugs in Gerbicz PRP running on flaky hardware.
2) Fixed (I hope) the hang during benchmarking.

Windows 64-bit: ftp://mersenne.org/gimps/p95v295b10.win64.zip

Well, that took a while: one benchmark, 1024k-32768k, all types, hyperthreaded and not, launched at 10:22 pm on Feb 7 and finished at 4:08 am on Feb 9, on the same i7-8750H system on which it previously would hang within minutes. I won't be trying that again soon; at ~29.5 hours it is too time-consuming.
Attached image: long-benchmark-finished.png
2019-02-09, 16:31   #332
simon389
Aug 2013

57₁₆ Posts

Double checking PRP with build 10 on my “bad PRP” machine. Had to turn off Gerbicz verbosity 3 because progress had advanced just 0.74% overnight. Turned it off and speeds returned to normal.
2019-02-09, 16:48   #333
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL

1CB6₁₆ Posts

Quote:
Originally Posted by simon389
Double checking PRP with build 10 on my “bad PRP” machine. Had to turn off Gerbicz verbosity 3 because progress had advanced just 0.74% overnight. Turned it off and speeds returned to normal.

That makes no sense. Gerbicz verbosity only controls how much info is output to the screen and results.txt file.

Please send the results.txt file to me.
2019-02-09, 18:25   #334
GP2
Sep 2003

3²×7×41 Posts

Quote:
Originally Posted by Prime95
That makes no sense. Gerbicz verbosity only controls how much info is output to the screen and results.txt file.

Maybe the additional printf calls themselves are somehow causing problems.

A string buffer overflow somewhere in all those additional verbose strings, or a change in the patterns of register usage, if registers aren't being saved and restored correctly.

That might even tie in with errors happening in the end-of-PRP-test final processing, because new strings get printed at that point, other than the usual iteration count lines.
2019-02-09, 18:28   #335
simon389
Aug 2013

3×29 Posts

Quote:
Originally Posted by Prime95
That makes no sense. Gerbicz verbosity only controls how much info is output to the screen and results.txt file.

Please send the results.txt file to me.

I also removed the "Gerbicz offset" line. Maybe that affected things? Every half second it was showing outputs of "Gerbicz checking iteration X" "Test passed" or whatever. After 7 hours it had advanced less than 1%. I'll forward the results.txt file to you.


I will keep that individual machine "broken" for the time being, but I finally "fixed" the other three, meaning I found CPU/RAM settings that have stable AVX512 tests in AIDA64.


Originally the CPU was at 4.1 GHz, and that was failing. Tried 3.9 GHz: failed. Tried 3.8 GHz: failed. Tried 3.7 GHz: that failed too. Finally, Mysticial suggested undoing the XMP settings for RAM (3600 MHz @ 1.35 V and 19-20-20-40), and I discovered (to my surprise) that a 3.8 GHz CPU with stock RAM settings (2000 MHz) gave stable AVX512 in AIDA64 for 37 hours on all 3 machines. SUCCESS!



So now I'm leaving the "broken" PRP double-check system the way it is for testing purposes (it's currently doing a PRP double-check on build 10), but I'm using the other three machines to try different RAM speeds to see which work. Right now I have one still at stock RAM (2000 MHz), another at 3000 MHz RAM (default CAS latency, which I think is 15), and another with RAM at 3400 MHz (with 1.3 V and 19-20-20-40 like XMP suggests). We'll see which are stable after 24 hours.
2019-02-09, 20:54   #336
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL

1CB6₁₆ Posts

This line from Simon's results.txt file is ominous:

Code:
[Fri Feb 08 21:13:48 2019]
Start Gerbicz block of size 0 at iteration 0.

As noted in another thread, if block size is zero there is no Gerbicz error checking (although the result is reported as passing Gerbicz with no error recoveries during the test).

Simon, do you know if this was build 9 or build 10?

Also, early last month we see:
Code:
[Fri Jan 04 03:13:35 2019]
Iteration: 1/79048733, Possible error: round off (0.193936384) > 0

The only way I can see this happening is if a stack variable is corrupt.
2019-02-09, 21:35   #337
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL

2×3×5²×7² Posts

Quote:
Originally Posted by Prime95
As noted in another thread, if block size is zero there is no Gerbicz error checking (although the result is reported as passing Gerbicz with no error recoveries during the test).

Simon, do you know if this was build 9 or build 10?

Ah, found the zeroed block size problem in prime.txt

Code:
PRPGerbiczCompareIntervalAdj=    -206254

This value is supposed to be between 0.001 and 1.0. It is automatically set smaller when a Gerbicz compare fails and raised when a Gerbicz compare succeeds. How it got to this value is a mystery. Anyway, the block size was set to sqrt(-206254 * 1000000) and I guess the C runtime library returns 0.

Build 10 would have caught this and set block size to 16 -- which also explains why build 10 was running so slowly for Simon.

I'm adding checks for bogus adjustment values.
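
A minimal Python sketch of the arithmetic described above, for illustration only. The function names are hypothetical; the formula sqrt(adjustment * 1000000), the bad value -206254, and the build 10 fallback of 16 are taken from the post. Python's math.sqrt raises ValueError for a negative argument, so the sketch maps that case to 0 to mimic the block size seen in results.txt; what the C runtime actually does may differ.

```python
import math

def naive_block_size(adj):
    """Block size as sqrt(adj * 1000000), with no validation of adj."""
    try:
        return int(math.sqrt(adj * 1_000_000))
    except ValueError:
        # Assumption: stands in for the 0 that showed up in results.txt.
        return 0

def block_size_with_fallback(adj, fallback=16):
    """Hypothetical build-10-style guard: replace a non-positive block size."""
    size = naive_block_size(adj)
    return size if size > 0 else fallback

# Compare an in-range maximum, an in-range minimum, and the corrupt value.
for adj in (1.0, 0.001, -206254):
    print(adj, naive_block_size(adj), block_size_with_fallback(adj))
```

With an in-range adjustment the two functions agree. With the corrupt value the unchecked version yields 0 (no checking at all), while the guarded version yields 16, which is smaller than any block size an in-range adjustment produces under this formula.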
2019-02-09, 21:45   #338
Mysticial
Sep 2016

331 Posts

Nice! Just like that, it sounds like both software and hardware problems are now resolved?

2019-02-09, 23:22   #339
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

4,903 Posts

Quote:
Originally Posted by Prime95
I'm adding checks for bogus adjustment values.

Sanity checks for anything a user could get his hands on are probably a good idea, even if the issue was created by some unforeseen combination of code behavior and unexpected hardware error rather than by user error. Asking what characters a creative user could possibly enter or store here, and what effects those might have, strips away assumptions about what the program will, won't, or couldn't put there.
2019-02-10, 01:57   #340
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL

2·3·5²·7² Posts

There are now 4 separate sanity checks that would have caught Simon's problem.

1) The adjustment value is forced to be between 0.001 and 1.0.
2) The Gerbicz block size is forced to be between 25 and the number of iterations remaining.
3) If the iteration counter somehow gets past the end of the Gerbicz block, a rollback occurs.
4) If the PRP test completes and the internal PRP state is "in the middle of a Gerbicz block", then a rollback occurs.

More sanity checks are on the way as well as protection against copying errors discussed in another thread.
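
The four checks listed above can be sketched roughly as follows. This is an illustrative Python sketch, not the Prime95 source: every function and parameter name is hypothetical, check 3 is read here as "past the end of the current Gerbicz block", and letting the upper bound win when fewer than 25 iterations remain is an assumption.

```python
def sanitize_adjustment(adj):
    """Check 1: force the adjustment into the range 0.001 to 1.0."""
    return min(max(adj, 0.001), 1.0)

def sanitize_block_size(block_size, iterations_remaining):
    """Check 2: force the block size into 25..iterations_remaining.

    Assumption: if fewer than 25 iterations remain, the upper bound wins.
    """
    return min(max(block_size, 25), iterations_remaining)

def needs_rollback(iteration, block_end, test_complete, in_middle_of_block):
    """Checks 3 and 4: decide whether to roll back to the last good state."""
    if iteration > block_end:
        # Check 3: the iteration counter ran past the end of the block.
        return True
    if test_complete and in_middle_of_block:
        # Check 4: the test finished while a block was still open.
        return True
    return False

print(sanitize_adjustment(-206254))        # bogus value from prime.txt
print(sanitize_block_size(0, 79048733))    # zero block size is raised to 25
print(needs_rollback(1001, 1000, False, False))
```

Any one of these guards on its own would have stopped a block size of 0 from silently disabling the error check.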
2019-02-10, 10:54   #341
ET_
Banned
"Luigi"
Aug 2002
Team Italia

2⁶·3·5² Posts
Jacobi errors on v29.5.9

I was doing a double-check on my 9800X (4 threads used by Prime95, 4 threads used by another sieving program) when I got the following messages:

Code:
[Work thread Feb 10 11:49] Running Jacobi error check.  Failed.  Time: 11.230 sec.
[Work thread Feb 10 11:50] Iteration: 17409895/47905967, ERROR: Jacobi error check failed!
[Work thread Feb 10 11:50] Continuing from last save file.
[Work thread Feb 10 11:50] Setting affinity to run helper thread 1 on CPU core #3
[Work thread Feb 10 11:50] Setting affinity to run helper thread 2 on CPU core #4
[Work thread Feb 10 11:50] Setting affinity to run helper thread 3 on CPU core #5
[Work thread Feb 10 11:50] Running Jacobi error check.  Failed.  Time: 11.132 sec.
[Work thread Feb 10 11:50] Error reading intermediate file: p9M05967
[Work thread Feb 10 11:50] Renaming p9M05967 to p9M05967.bad1
[Work thread Feb 10 11:50] Trying backup intermediate file: p9M05967.bu
[Work thread Feb 10 11:50] Running Jacobi error check.  Failed.  Time: 11.175 sec.
[Work thread Feb 10 11:50] Error reading intermediate file: p9M05967.bu
[Work thread Feb 10 11:50] Renaming p9M05967.bu to p9M05967.bad2
[Work thread Feb 10 11:50] Trying backup intermediate file: p9M05967.bu2
[Work thread Feb 10 11:50] Running Jacobi error check.  Failed.  Time: 11.209 sec.
[Work thread Feb 10 11:50] Error reading intermediate file: p9M05967.bu2
[Work thread Feb 10 11:50] Renaming p9M05967.bu2 to p9M05967.bad3
[Work thread Feb 10 11:50] All intermediate files bad.  Temporarily abandoning work unit.

Is version 29.5.10 able to recover from such a situation?
What should I do with the *.bad savefiles?
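
For readers unfamiliar with the check that fails in the log above, here is a toy Python sketch of the idea, assuming the double-check is a Lucas-Lehmer test (s_0 = 4, s_{i+1} = s_i^2 - 2 mod M_p). For every i >= 1, the Jacobi symbol of s_i - 2 modulo M_p equals -1, because s_{i+1} - 2 = (s_i - 2)(s_i + 2) and s_i + 2 is a perfect square. A residue with any other symbol must therefore be corrupt. The helper names and the small exponent p = 13 are illustrative choices, not Prime95's implementation.

```python
def jacobi(a, n):
    """Jacobi symbol (a/n) for odd n > 0, via the standard binary algorithm."""
    assert n > 0 and n % 2 == 1
    a %= n
    result = 1
    while a != 0:
        while a % 2 == 0:
            a //= 2
            if n % 8 in (3, 5):   # (2/n) = -1 when n is 3 or 5 mod 8
                result = -result
        a, n = n, a               # quadratic reciprocity
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0

def ll_residues(p):
    """Yield (i, s_i) for i = 1 .. p-2 of the Lucas-Lehmer sequence mod 2^p - 1."""
    m = (1 << p) - 1
    s = 4
    for i in range(1, p - 1):
        s = (s * s - 2) % m
        yield i, s

p = 13
m = (1 << p) - 1
for i, s in ll_residues(p):
    # Every uncorrupted residue should report -1 here.
    print(i, s, jacobi(s - 2, m))
```

In the log, the same failure repeats for the current state and for every save file, which suggests the bad residue had already been written to all of them before the check ran.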
All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.