mersenneforum.org  

Old 2019-03-30, 17:58   #12
AG5BPilot
 
 
Dec 2011
New York, U.S.A.

97 Posts

Quote:
Originally Posted by Prime95
I suggest getting prime95 and running an AVX-512 torture test on the machine.

I also suggest building LLR with gwnum 29.7.
I've passed your suggestion along to the owner of that computer.

He noticed something that I didn't. It may not be CPU model that's important.

All the computers that have run the AVX-512 transform correctly on SGS tasks have been running Linux.

The single computer that was running AVX-512 SGS tasks on Windows (Win 10 to be exact) is the one computer that produced the bad results.
Old 2019-03-30, 21:33   #13
ATH
Einyen
 
 
Dec 2003
Denmark

3⁵·13 Posts

I tried running it on EC2 Windows instance with AVX512, but LLR uses FMA3 FFT and the residue matches.

Is there a way to force LLR to use AVX512 FFT?

Even Prime95 29.7b1 uses FMA3 FFT when running this test on the EC2 machine.



Since it is a single computer with the errors, it sounds like a hardware error? Have him run the test with error checking and sum input/output checking:
cllr64.exe -d -oVerify=1 -oTestdiff=1 -q"4344392810277*2^1290000-1"

Have him check whether it uses the FMA3 or the AVX512 FFT. With the -d option it writes this at the start of the test:
Code:
Starting Lucas Lehmer Riesel prime test of 4344392810277*2^1290000-1
Using zero-padded FMA3 FFT length 128K, Pass1=512, Pass2=256, clm=2
V1 = 5 ; Computing U0...done.
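
A rough way to check which transform a run used, assuming the -d output has the same shape as the sample above: the sketch below pulls the transform name and FFT length out of the "Using ... FFT length ..." line. The function name and the regular expression are illustrative, and they only cover the "zero-padded" wording seen in this thread.

```python
import re

# Matches the "Using ... FFT length ..." line as it appears in this thread.
# Other wordings that LLR may print are not handled (assumption).
FFT_LINE = re.compile(
    r"Using (?:zero-padded )?(?P<kind>\S+) FFT length (?P<length>\w+)"
)


def fft_kind(log_text):
    """Return (transform, fft_length) from LLR -d output, or None if absent."""
    match = FFT_LINE.search(log_text)
    if match is None:
        return None
    return match.group("kind"), match.group("length")


sample = """Starting Lucas Lehmer Riesel prime test of 4344392810277*2^1290000-1
Using zero-padded FMA3 FFT length 128K, Pass1=512, Pass2=256, clm=2
V1 = 5 ; Computing U0...done."""

print(fft_kind(sample))
```

If the first element is FMA3, the run never exercised the AVX-512 code path, so a matching residue says nothing about the AVX-512 transform.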


Can we get LLR recompiled with gwnum 29.7?
ftp://mersenne.org/gimps/p95v297b1.source.zip

It failed to compile for me on Linux, and I do not know how to compile it on Windows.

Last fiddled with by ATH on 2019-03-30 at 21:40
Old 2019-03-30, 22:59   #14
AG5BPilot
 
 
Dec 2011
New York, U.S.A.

97 Posts

Quote:
Originally Posted by ATH
I tried running it on EC2 Windows instance with AVX512, but LLR uses FMA3 FFT and the residue matches.

Is there a way to force LLR to use AVX512 FFT?
The Amazon VM hypervisor and the guest OS both need to support AVX-512. The OS -- both host and guest -- has to know there are more registers to save during a context switch.

If they don't support AVX-512, then the virtual CPU tells the software there's no support for AVX-512. That's why both prime95 and LLR only use the FMA3 transform.

Even if you could force the software to use AVX-512, you would destroy the calculations at each context switch, which happens thousands of times per second. It simply won't work.
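
As an illustration of the check described above: on x86, an OS that saves and restores a given register set across context switches advertises it by setting the matching bits in the XCR0 register, and software is expected to test those bits before using AVX-512. The sketch below only decodes an XCR0 value that has been obtained some other way; the function name and the example values are assumptions for illustration, not code from prime95 or LLR.

```python
# XCR0 state-component bits as documented in the Intel SDM. The OS sets a bit
# once it saves and restores that register state across context switches.
XCR0_SSE = 1 << 1        # XMM registers
XCR0_AVX = 1 << 2        # upper halves of YMM0-15
XCR0_OPMASK = 1 << 5     # k0-k7 mask registers
XCR0_ZMM_HI256 = 1 << 6  # upper halves of ZMM0-15
XCR0_HI16_ZMM = 1 << 7   # ZMM16-31

AVX512_STATE = (
    XCR0_SSE | XCR0_AVX | XCR0_OPMASK | XCR0_ZMM_HI256 | XCR0_HI16_ZMM
)


def os_saves_avx512_state(xcr0):
    """True if every state component that AVX-512 needs is enabled in XCR0."""
    return (xcr0 & AVX512_STATE) == AVX512_STATE


# Example bit patterns (hypothetical machines):
print(os_saves_avx512_state(0x07))  # x87 + SSE + AVX only
print(os_saves_avx512_state(0xE7))  # the above plus the three AVX-512 components
```

On real hardware the value comes from the XGETBV instruction, which is only valid once CPUID reports OSXSAVE. Python cannot execute that instruction directly, so the two values passed here are just example bit patterns.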

Quote:
Originally Posted by ATH
Since it is a single computer with the errors it sounds like hardware error?
It doesn't feel like hardware. It's 100% stable on Proth numbers and 100% failure on Riesel. It is, however, a Windows build while the computers that correctly do the Riesel test are Linux. It's either a build problem, or a problem with some Windows code.

Last fiddled with by AG5BPilot on 2019-03-30 at 23:03
Old 2019-03-30, 23:39   #15
ATH
Einyen
 
 
Dec 2003
Denmark

3⁵×13 Posts

You are right, I forgot about that; I was using Windows Server 2012.

I tried again in Windows Server 2016 and I got another residue, so it does seem like a Windows AVX512 issue and not a hardware issue with that specific computer. It is either an issue with the specific Windows compiled version, or an issue with AVX512.

Code:
C:\Users\Administrator\Desktop\cllr38mwin64>cllr64.exe -d -q"4344392810277*2^1290000-1"
Starting Lucas Lehmer Riesel prime test of 4344392810277*2^1290000-1
Using zero-padded AVX-512 FFT length 128K, Pass1=128, Pass2=1K, clm=1
V1 = 5 ; Computing U0...done.
4344392810277*2^1290000-1 is not prime.  LLR Res64: 8176D89360699CE1  Time : 479.648 sec.

Edit: Adding error check and sumin sumout check did not help, got a new residue and no error warning during the run:

Code:
C:\Users\Administrator\Desktop\cllr38mwin64>cllr64.exe -d -oVerify=1 -oTestdiff=1 -q"4344392810277*2^1290000-1"
Starting Lucas Lehmer Riesel prime test of 4344392810277*2^1290000-1
Using zero-padded AVX-512 FFT length 128K, Pass1=128, Pass2=1K, clm=1
V1 = 5 ; Computing U0...done.
4344392810277*2^1290000-1 is not prime.  LLR Res64: 8134AB10D1E64456  Time : 398.769 sec.


Edit2: I was using Verify=1 by mistake instead of ErrorCheck=1, but the result is the same, no error and a new residue:

Code:
C:\Users\Administrator\Desktop\cllr38mwin64>cllr64.exe -d -oErrorCheck=1 -oTestdiff=1 -q"4344392810277*2^1290000-1"
Starting Lucas Lehmer Riesel prime test of 4344392810277*2^1290000-1
Using zero-padded AVX-512 FFT length 128K, Pass1=128, Pass2=1K, clm=1
V1 = 5 ; Computing U0...done.
4344392810277*2^1290000-1 is not prime.  LLR Res64: 7AA803A9D23D883F  Time : 399.738 sec.

Last fiddled with by ATH on 2019-03-31 at 00:01
Old 2019-03-31, 03:28   #16
tshinozk
 
Nov 2012

23 Posts

PRP also failed on my machine with p95v297b1.win64 (Windows 10, 7980X).

Code:
[Worker #1 Mar 31 08:16] Iteration: 250000 / 1290041 [19.37%], ms/iter:  0.379, ETA: 00:06:34
[Worker #1 Mar 31 08:16] ERROR: Comparing Gerbicz checksum values failed.  Rolling back to iteration 41.
[Worker #1 Mar 31 08:16] Continuing from last save file.
[Worker #1 Mar 31 08:16] Resuming Gerbicz error-checking PRP test of 4344392810277*2^1290000-1 using zero-padded AVX-512 FFT length 128K, Pass1=128, Pass2=1K, clm=1
[Worker #1 Mar 31 08:16] Iteration: 42 / 1290041 [0.00%].
[Worker #1 Mar 31 08:16] Hardware errors have occurred during the test!
Old 2019-03-31, 04:04   #17
ATH
Einyen
 
 
Dec 2003
Denmark

C57₁₆ Posts

Ok, PRP test works in mprime 29.7b1, so it is only a Windows AVX512 issue, but both in LLR and Prime95.

Maybe it is because of the special k*2^n-1 form with a large k that is not working 100% in AVX512 ?

Last fiddled with by ATH on 2019-03-31 at 04:05
Old 2019-03-31, 04:45   #18
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

7526₁₀ Posts

Quote:
Originally Posted by ATH
Ok, PRP test works in mprime 29.7b1, so it is only a Windows AVX512 issue, but both in LLR and Prime95.

Maybe it is because of the special k*2^n-1 form with a large k that is not working 100% in AVX512 ?
Nuts. I cannot debug or test Windows AVX-512.

The relevant C code and assembly code are the same for both prime95 and mprime. The most likely culprit is an assembly language routine that is not saving registers that Windows expects to be saved. I'll eyeball the code for any such instance.
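
For readers unfamiliar with why identical assembly can behave differently on the two systems: the Windows x64 and System V (Linux) calling conventions disagree on which registers a called routine must preserve. The sketch below just lists the two documented callee-saved sets and prints the difference. It illustrates the kind of mismatch being suspected here, and is not a claim about where the actual bug is.

```python
# Callee-saved (nonvolatile) registers under the two 64-bit x86 calling
# conventions, per the Microsoft x64 and System V AMD64 ABI documents.
WINDOWS_X64 = {"rbx", "rbp", "rdi", "rsi", "rsp", "r12", "r13", "r14", "r15"}
WINDOWS_X64 |= {f"xmm{i}" for i in range(6, 16)}  # xmm6 through xmm15

SYSTEM_V = {"rbx", "rbp", "rsp", "r12", "r13", "r14", "r15"}

# Registers a routine may freely clobber on Linux but must restore on Windows.
only_windows = sorted(WINDOWS_X64 - SYSTEM_V)
print(only_windows)
```

On Windows only the low 128 bits of xmm6 through xmm15 are preserved across calls; the upper ymm/zmm portions and zmm16 through zmm31 are volatile.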
Old 2019-03-31, 07:43   #19
rebirther
 
 
Sep 2011
Germany

5434₈ Posts

I have recompiled the 64-bit build only, with gwnum 29.7.


download
Old 2019-03-31, 15:19   #20
tshinozk
 
Nov 2012

23 Posts

Quote:
Maybe it is because of the special k*2^n-1 form with a large k that is not working 100% in AVX512 ?
For the large k in 4344392810277*2^1290000-1, there seems to be a boundary.
If I repeatedly halve k (dropping the remainder):
PRP=4344392810277,2,1290000,-1
PRP=2172196405138,2,1290000,-1
PRP=1086098202569,2,1290000,-1
.....
PRP=1035783,2,1290000,-1 fail
PRP=517891,2,1290000,-1 success 

The boundary for k is at a bit length of about 20.
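
The halving sequence above can be regenerated with a few lines of Python, which also prints the bit length of each k. This is only a sketch for reproducing the worktodo lines; the pass/fail outcome still has to come from actually running each test.

```python
K = 4344392810277  # the k from the failing test in this thread

for shift in range(24):
    k = K >> shift  # halve k repeatedly, dropping the remainder
    print(f"PRP={k},2,1290000,-1   k has {k.bit_length()} bits")
```

The last two lines printed are for k=1035783 and k=517891, the failing and passing values reported above.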
Old 2019-03-31, 18:00   #21
ATH
Einyen
 
 
Dec 2003
Denmark

C57₁₆ Posts

Interesting, I did some further tests to narrow down the point of failure for different exponents. In 2 of the 3 tests it happens just at the boundary between 2 FFT sizes.

Prime95 29.7b1 running AVX512:

Code:
PRP=1050000,2,500000,-1 pass
PRP=1150001,2,500000,-1 pass
PRP=1500000,2,500000,-1 pass
PRP=1700000,2,500000,-1 pass 40K FFT
PRP=1800000,2,500000,-1 pass 40K FFT___
PRP=1850000,2,500000,-1 pass 48K FFT
PRP=1875000,2,500000,-1 pass 48K FFT
PRP=1890000,2,500000,-1 pass 48K FFT
PRP=1895000,2,500000,-1 pass 48K FFT
PRP=1899000,2,500000,-1 pass
PRP=1899900,2,500000,-1 pass
PRP=1899990,2,500000,-1 pass
PRP=1899998,2,500000,-1 pass
PRP=1899999,2,500000,-1 fail
PRP=1900001,2,500000,-1 fail
PRP=2000001,2,500000,-1 fail

PRP=1800000,2,1000000,-1 pass
PRP=1890000,2,1000000,-1 pass
PRP=1899000,2,1000000,-1 pass
PRP=1899900,2,1000000,-1 pass
PRP=1899984,2,1000000,-1 pass 84K FFT
PRP=1899986,2,1000000,-1 fail 120K FFT
PRP=1899987,2,1000000,-1 fail
PRP=1899989,2,1000000,-1 fail
PRP=1899990,2,1000000,-1 fail
PRP=1899998,2,1000000,-1 fail
PRP=1899999,2,1000000,-1 fail
PRP=1900001,2,1000000,-1 fail

PRP=900000,2,1290000,-1 pass
PRP=930000,2,1290000,-1 pass
PRP=933000,2,1290000,-1 pass
PRP=933500,2,1290000,-1 pass
PRP=933800,2,1290000,-1 pass
PRP=933900,2,1290000,-1 pass
PRP=933990,2,1290000,-1 pass
PRP=933998,2,1290000,-1 pass 120K FFT
PRP=933999,2,1290000,-1 fail 128K FFT
PRP=934001,2,1290000,-1 fail
PRP=935001,2,1290000,-1 fail
PRP=940001,2,1290000,-1 fail
PRP=950001,2,1290000,-1 fail
PRP=980001,2,1290000,-1 fail
PRP=1000001,2,1290000,-1 fail
PRP=1035783,2,1290000,-1 fail
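
To read the boundary off a list like the one above without scanning it by eye, a small sketch along these lines could be used. It takes lines in the same "PRP=k,b,n,c pass/fail" shape and reports, for each exponent n, the largest passing k and the smallest failing k. The function name is made up for illustration, and only a few of the lines above are included to keep it short.

```python
RESULTS = """\
PRP=1899998,2,500000,-1 pass
PRP=1899999,2,500000,-1 fail
PRP=1899984,2,1000000,-1 pass 84K FFT
PRP=1899986,2,1000000,-1 fail 120K FFT
PRP=933998,2,1290000,-1 pass 120K FFT
PRP=933999,2,1290000,-1 fail 128K FFT
PRP=1035783,2,1290000,-1 fail
"""


def brackets(text):
    """Map each exponent n to (largest passing k, smallest failing k)."""
    passed, failed = {}, {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].startswith("PRP="):
            continue  # skip anything that is not a result line
        k, _base, n, _c = (int(x) for x in fields[0][4:].split(","))
        if fields[1] == "pass":
            passed[n] = max(k, passed.get(n, k))
        elif fields[1] == "fail":
            failed[n] = min(k, failed.get(n, k))
    exponents = sorted(set(passed) | set(failed))
    return {n: (passed.get(n), failed.get(n)) for n in exponents}


for n, (last_pass, first_fail) in brackets(RESULTS).items():
    print(f"n={n}: last pass k={last_pass}, first fail k={first_fail}")
```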

Last fiddled with by ATH on 2019-03-31 at 18:01
Old 2019-03-31, 18:19   #22
AG5BPilot
 
 
Dec 2011
New York, U.S.A.

97 Posts

I asked our user with the Windows Skylake-X that had the original problem to manually repeat the test to see if the bad residue was repeatable. The results were, to say the least, confusing.

His original bad residue was B878873BD88188FB and the correct residue is 7D325B0469A1226E.

Code:
LLR Program - Version 3.8.22, using Gwnum Library Version 29.6
4344392810277*2^1290000-1 is not prime. LLR Res64: 8176D89360699CE1 Time : 311.089 sec.
4344392810277*2^1290000-1 is not prime. LLR Res64: 8176D89360699CE1 Time : 312.274 sec.
4344392810277*2^1290000-1 is not prime. LLR Res64: 8176D89360699CE1 Time : 318.074 sec.
4344392810277*2^1290000-1 is not prime. LLR Res64: 01384154D0C02195 Time : 317.227 sec.
4344392810277*2^1290000-1 is not prime. LLR Res64: BCF991F298BDC4C3 Time : 315.665 sec.

LLR Program - Version 3.8.22, using Gwnum Library Version 29.7
4344392810277*2^1290000-1 is not prime. LLR Res64: F422DC7BABEBA15D Time : 310.649 sec.
4344392810277*2^1290000-1 is not prime. LLR Res64: 6C5120CF5D674C89 Time : 311.306 sec.
4344392810277*2^1290000-1 is not prime. LLR Res64: 7AA803A9D23D883F Time : 313.710 sec.
4344392810277*2^1290000-1 is not prime. LLR Res64: 8176D89360699CE1 Time : 317.978 sec.
4344392810277*2^1290000-1 is not prime. LLR Res64: CBB863A741F7095B Time : 317.296 sec.
4344392810277*2^1290000-1 is not prime. LLR Res64: 732C6FF13B401C41 Time : 316.425 sec.
Sometimes it repeats one of the bad residues, sometimes it doesn't. Maybe that will help George figure out where the problem is, but that just seems confusing to me.
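
To make the repetition easier to see, the residues from the two runs above can be tallied. This is just a counting sketch over the values already listed; the grouping by gwnum version is taken from the two log headers.

```python
from collections import Counter

CORRECT = "7D325B0469A1226E"

# Res64 values copied from the two logs above, keyed by gwnum version.
RUNS = {
    "29.6": ["8176D89360699CE1", "8176D89360699CE1", "8176D89360699CE1",
             "01384154D0C02195", "BCF991F298BDC4C3"],
    "29.7": ["F422DC7BABEBA15D", "6C5120CF5D674C89", "7AA803A9D23D883F",
             "8176D89360699CE1", "CBB863A741F7095B", "732C6FF13B401C41"],
}

counts = Counter(res for residues in RUNS.values() for res in residues)
for residue, times in counts.most_common():
    print(residue, times)
print("distinct residues:", len(counts))
print("correct residue seen:", CORRECT in counts)
```

Two of these values, 8176D89360699CE1 and 7AA803A9D23D883F, are the same bad residues ATH reported from the EC2 Windows Server 2016 instance in post #15.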
All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.