mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Msieve (https://www.mersenneforum.org/forumdisplay.php?f=83)
-   -   Msieve v1.46 feedback (https://www.mersenneforum.org/showthread.php?t=13676)

jasonp 2010-08-15 01:24

[QUOTE=Phil MjX;225458]
Block size 65536 for a cache of 6MB (core 2 quad 9550 : shouldn't it be 3MB by core = 12 MB ?)
[/QUOTE]
The code does not multiply the detected on-die cache size by the number of cores; if the die has an L3 cache, it uses that size instead. But modern processors have so much cache that it's pretty much inevitable you'll get block size 65536 (or 4x65536 with LARGEBLOCKS).

Phil MjX 2010-08-15 10:53

Thanks Jason,

Another point, with SVN 376, I have just obtained the same error as Jeff :

error: corrupt state, please restart from checkpoint (at 22.8%, iteration 21981, dim 1389987)

Restarting with -ncr also crashes immediately.

edit : Restarting from the backup checkpoint did work (iteration 20149, dim 1274159)

Kind Regards.
Philippe.

Jeff Gilchrist 2010-08-15 20:25

[QUOTE=Phil MjX;225499]Another point, with SVN 376, I have just obtained the same error as Jeff :

error: corrupt state, please restart from checkpoint (at 22.8%, iteration 21981, dim 1389987)[/QUOTE]

I just got back from a camping trip. (At least I know that our new tent is waterproof: torrential downpours for 3 hours put about 2 inches of rain under the tent and we were literally floating. It felt like a water bed, but no water got inside, and the kids loved it.) So I wasn't running the MPI version, but I did leave the non-MPI version going and I also got the same error on that:

[CODE]matrix is 7416958 x 7417185 (2127.1 MB) with weight 525245035 (70.81/col)
sparse part has weight 483423157 (65.18/col)
using block size 262144 for processor cache size 6144 kB
commencing Lanczos iteration (8 threads)
memory use: 2842.5 MB
restarting at iteration 47023 (dim = 2973738)
checkpointing every 75120 dimensions
linear algebra at 40.1%, ETA 56h 7m

error: corrupt state, please restart from checkpoint
[/CODE]

But with the non-MPI version I was able to restart from the last .chk file. In this case I had started with -nc2, let it run for a while, stopped it, then restarted with -ncr; it crashed with the corrupt state error some time while I was gone.

So something fishy is going on in the code base that affects both the MPI and non-MPI versions.

Jeff.

Batalov 2010-08-15 20:34

[quote=Phil MjX;225499]error: corrupt state, please restart from checkpoint (at 22.8%, iteration 21981, dim 138[U]9987[/U])

Restarting with -ncr also crashes immediately.
[/quote]
3rd strike at 10000*n-[FONT=Times New Roman]ε[/FONT]. Time to valgrind!

Batalov 2010-08-15 20:55

[quote=Jeff Gilchrist;225569]...the non-MPI version going and I also got the same error on that:

[code]matrix is 7416958 x 7417185 (2127.1 MB) with weight 525245035 (70.81/col)
sparse part has weight 483423157 (65.18/col)
using block size 262144 for processor cache size 6144 kB
commencing Lanczos iteration (8 threads)
memory use: 2842.5 MB
restarting at iteration 47023 (dim = 2973738) [COLOR=green]<-- not 10000*n-[FONT=Times New Roman]ε[/FONT][/COLOR]
checkpointing every 75120 dimensions
linear algebra at 40.1%, ETA 56h 7m

error: corrupt state, please restart from checkpoint
[/code]

But with the non-MPI version I was able to restart from the last .chk file. In this case I had started with -nc2, let it run for a while, stopped it, then restarted with -ncr; it crashed with the corrupt state error some time while I was gone.

So something fishy is going on in the code base that affects both the MPI and non-MPI versions.

Jeff.[/quote]
Jeff, I see that Jasonp assumes that your memory is ECC and cannot fail.
However, clusters of cross-my-heart-and-hope-not-to-die non-ECC i7's exist. Is your cluster ECC? And what about the computer where the non-MPI version ran?

The orthogonality check is the greatest thing since sliced bread. Memory errors happen. You may remember the idea of making msieve into a harder memory test than Prime95. Because it is!

jasonp 2010-08-15 21:47

Serge, you said that you don't see the failure on restart if you force the checkpoint dump interval to be a multiple of the ortho check interval?

When the ortho check happens, if it succeeds then the check vector is replaced with the result from the current iteration. If a dump happens on the [i]next[/i] iteration, the ortho check vector is probably not orthogonal. The Lanczos recurrence only guarantees that a check vector three iterations back or more would pass the ortho check. So we have to make sure that the ortho check does not run fewer than four iterations after the last ortho check.

Phil MjX 2010-08-15 22:53

I should have made a copy of the faulty checkpoint, so if I can help to test a patch, just tell me where in the code !...

Regards.

jasonp 2010-08-16 02:18

Okay, I think I've committed a fix for this problem. Could folks here test it out? If it behaves better I'll get the process of releasing v1.47 started, since the bugs in the linear algebra make the current version pretty unsuitable for big jobs.

Thanks to everyone for putting up with this churn.

tgrdy 2010-08-16 02:40

I also hit the "corrupt state" checkpoint bug with v1.46 on a C157, in the linear algebra (Linux x64).

I am now testing v1.47 (SVN 377) and still get "corrupt state".
[COLOR=SeaGreen]8043804767 2010-08-12 17:12 r521b.dat
1462452324 2010-08-12 18:39 r521b.dat.mat
122372972 2010-08-12 18:37 r521b.dat.cyc
201607496 2010-08-14 05:52 r521b.dat.chk [/COLOR]

[COLOR=Blue]Msieve v. 1.47
Mon Aug 16 10:57:05 2010
random seeds: c29fb1b4 9e33d1a6
factoring ......(157 digits)
no P-1/P+1/ECM available, skipping
commencing number field sieve (157-digit input)
R0: -718286775230264074412218462160
R1: 330882384079102889
A0: -574785673446953991103337093662087263
A1: -2849845913901309456445653108969
A2: 19403845861276275944787296
A3: 40609767955771924744
A4: 185982982727102
A5: 33078600
skew 468676.75, size 2.585e-15, alpha -7.130, combined = 1.913e-12 rroots = 3

commencing linear algebra
matrix starts at (0, 0)
matrix is 5039914 x 5040091 (1529.3 MB) with weight 485006686 (96.23/col)
sparse part has weight 340412623 (67.54/col)
saving the first 48 matrix rows for later
matrix includes 64 packed rows
matrix is 5039866 x 5040091 (1482.4 MB) with weight 387000662 (76.78/col)
sparse part has weight 338202876 (67.10/col)
using block size 65536 for processor cache size 6144 kB
commencing Lanczos iteration (2 threads)
memory use: 1225.2 MB
restarting at iteration 47281 (dim = 2989936)
error: corrupt state, please restart from checkpoint

error: corrupt state, please restart from checkpoint


[/COLOR]
[COLOR=Red]Msieve v. 1.46
Mon Aug 16 09:54:11 2010
random seeds: 07768b59 621b9f95
factoring ......(157 digits)
no P-1/P+1/ECM available, skipping
commencing number field sieve (157-digit input)
R0: -718286775230264074412218462160
R1: 330882384079102889
A0: -574785673446953991103337093662087263
A1: -2849845913901309456445653108969
A2: 19403845861276275944787296
A3: 40609767955771924744
A4: 185982982727102
A5: 33078600
skew 468676.75, size 2.585e-15, alpha -7.130, combined = 1.913e-12 rroots = 3

commencing linear algebra
matrix starts at (0, 0)
matrix is 5039914 x 5040091 (1529.3 MB) with weight 485006686 (96.23/col)
sparse part has weight 340412623 (67.54/col)
saving the first 48 matrix rows for later
matrix includes 64 packed rows
matrix is 5039866 x 5040091 (1482.4 MB) with weight 387000662 (76.78/col)
sparse part has weight 338202876 (67.10/col)
using block size 65536 for processor cache size 6144 kB
commencing Lanczos iteration (2 threads)
memory use: 1225.2 MB
restarting at iteration 47281 (dim = 2989936)

error: corrupt state, please restart from checkpoint[/COLOR]

Batalov 2010-08-16 03:25

Hold on, all is not lost. You will be able to restart after another patch. Do not start from zero.

Another patch is needed to remember the protected iteration number (e.g. the one read from a file, or the one from the last test). Then, as the simplest fix, do not perform the ortho test when it is due but the iteration is still within 4 of the protected iteration.
It could be more sophisticated: instead of simply skipping the test, delay it by 20 iterations or whatever.
(Extra point to consider: the user may have stopped the run by signal; then the iteration number will be arbitrary and the bug will reappear.)
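
For illustration only, the "simplest thing" described above might look like the following Python sketch. It is not msieve code: the class name `OrthoCheckGuard`, its fields, and the `MIN_GAP` constant are all hypothetical, and the gap of four iterations is taken from the explanation earlier in this thread.

```python
from dataclasses import dataclass

# Minimum age (in iterations) of the check vector before an ortho check is
# trusted. The value 4 comes from the discussion in this thread.
MIN_GAP = 4


@dataclass
class OrthoCheckGuard:
    """Hypothetical bookkeeping; none of these names come from msieve."""

    # Iteration at which the check vector was last (re)set, e.g. after a
    # passing ortho check or after loading a checkpoint.
    protected_iter: int

    def should_check(self, iteration: int, check_due: bool) -> bool:
        """Run the ortho check only if it is due and the vector is old enough."""
        if not check_due:
            return False
        return iteration - self.protected_iter >= MIN_GAP

    def record_reset(self, iteration: int) -> None:
        """Remember the iteration at which the check vector was replaced."""
        self.protected_iter = iteration


# A restart at iteration 47281 protects the next few iterations.
guard = OrthoCheckGuard(protected_iter=47281)
print(guard.should_check(47282, check_due=True))  # too close to the restart
print(guard.should_check(47285, check_due=True))  # far enough away
```

In this skip-only form a suppressed check is simply lost until the next time one is due; the delayed variant mentioned above would need an extra "pending" flag.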
_____

[COLOR=green]@Jasonp: yes. Errh, I mean, with the disclaimer - that there were no more restarts, so it's hard to tell. My memory (my [I]computer's[/I] memory) is flaky, but not too much. One flaw per week in summer, none in winter. I can live with that. :-)[/COLOR]

jasonp 2010-08-16 03:32

If you already have a checkpoint that was written at an inopportune time, there isn't much you can do. Try changing the check interval to 11000 on line 837 of common/lanczos/lanczos.c and the restart should work then.
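
A rough back-of-the-envelope check of why a different interval can help, using the checkpoint numbers posted in this thread. The sketch assumes, without confirmation from the source code, that the ortho check fires when the dimension count crosses a multiple of the check interval, and that each iteration adds roughly `dim / iteration` dimensions. The function name is made up for this example.

```python
import math


def iterations_until_check(dim: int, iteration: int, check_interval: int) -> int:
    """Estimate iterations from a checkpoint until the next ortho check.

    Assumptions (not confirmed by the msieve source): the check fires when
    the dimension count crosses a multiple of check_interval, and the
    dimensions gained per iteration stay close to dim / iteration.
    """
    dims_per_iter = dim / iteration
    remaining = check_interval - dim % check_interval
    return math.ceil(remaining / dims_per_iter)


# (dim, iteration) pairs copied from the error reports in this thread.
checkpoints = {
    "Phil MjX": (1389987, 21981),
    "tgrdy": (2989936, 47281),
}

for name, (dim, iteration) in checkpoints.items():
    for interval in (10000, 11000):
        n = iterations_until_check(dim, iteration, interval)
        print(f"{name}: interval {interval} -> next check in about {n} iteration(s)")
```

Under these assumptions both checkpoints sit within one or two iterations of a multiple of 10000, which is inside the four-iteration window, while an interval of 11000 pushes the next check dozens of iterations away.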

Serge: you beat me to it. This is an LA bug that has always been there but has been hidden until now because the dump interval has always been a multiple of 10000. Even older versions are vulnerable if the user interrupts the LA at a bad time.

Another possibility to get around this problem is to update the check vector after the ortho check not with the current solution, but with the solution four iterations back (which is still available but gets deleted at the end of the current iteration). That would allow the check to pass even if it happened right on the next iteration. The downside is that an error that occurs less than four iterations back will not be noticed for an entire check interval.
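
As a toy sketch of that alternative (not msieve code; the class `LaggedCheckVector` and its methods are invented for this example), one could keep a short history of recent solutions and, after a passing check, store the oldest one instead of the newest. Plain integers stand in for the real vectors here.

```python
from collections import deque

LAG = 4  # "four iterations back", as described above


class LaggedCheckVector:
    """Hypothetical helper; names and structure are not taken from msieve."""

    def __init__(self, lag: int = LAG):
        # Holds the current solution plus the `lag` solutions before it;
        # older entries fall off automatically.
        self.history = deque(maxlen=lag + 1)
        self.check_vector = None

    def end_iteration(self, solution) -> None:
        """Record the solution produced by the iteration that just finished."""
        self.history.append(solution)

    def update_after_check(self):
        """After a passing ortho check, keep the oldest retained solution."""
        if len(self.history) == self.history.maxlen:
            self.check_vector = self.history[0]
        return self.check_vector


tracker = LaggedCheckVector()
for it in range(11):  # stand-in "solutions" are just iteration numbers 0..10
    tracker.end_iteration(it)
print(tracker.update_after_check())  # the solution from iteration 6
```

The trade-off is the one stated above: anything that goes wrong in the iterations newer than the stored vector is not covered until the following check.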


All times are UTC.