[QUOTE=Phil MjX;225458]
Block size 65536 for a cache of 6MB (core 2 quad 9550 : shouldn't it be 3MB by core = 12 MB ?) [/QUOTE]
The code does not multiply the detected on-die cache size by the number of cores; if the die has an L3 cache then it will use that size instead. But modern processors have so much cache that it's pretty much inevitable you'll get block size 65536 (or 4x65536 with LARGEBLOCKS). |
Thanks Jason,
Another point: with SVN 376, I have just obtained the same error as Jeff:
error: corrupt state, please restart from checkpoint (at 22.8%, iteration 21981, dim 1389987)
Restarting with -ncr also crashes immediately.
Edit: Restarting from the backup checkpoint did work (iteration 20149, dim 1274159).
Kind regards,
Philippe. |
[QUOTE=Phil MjX;225499]Another point, with SVN 376, I have just obtained the same error as Jeff :
error: corrupt state, please restart from checkpoint (at 22.8%, iteration 21981, dim 1389987)[/QUOTE]
I just got back from a camping trip (at least I know that our new tent is waterproof: torrential downpours for 3 hours put about 2 inches of rain under the tent and we were literally floating. It felt like a water bed, but no water got inside, and the kids loved it), so I wasn't running the MPI version. I did leave the non-MPI version going, though, and I got the same error on that:
[CODE]
matrix is 7416958 x 7417185 (2127.1 MB) with weight 525245035 (70.81/col)
sparse part has weight 483423157 (65.18/col)
using block size 262144 for processor cache size 6144 kB
commencing Lanczos iteration (8 threads)
memory use: 2842.5 MB
restarting at iteration 47023 (dim = 2973738)
checkpointing every 75120 dimensions
linear algebra at 40.1%, ETA 56h 7m
error: corrupt state, please restart from checkpoint
[/CODE]
But with the non-MPI version I was able to restart from the last .chk file. In this case I had started with -nc2, ran for a while, stopped, then restarted with -ncr, and it crashed with the corrupt state some time while I was gone. So something fishy is going on in the code base that affects both the MPI and non-MPI versions.
Jeff. |
[quote=Phil MjX;225499]error: corrupt state, please restart from checkpoint (at 22.8%, iteration 21981, dim 138[U]9987[/U])
Restarting with -ncr also crashes immediately. [/quote] 3rd strike at 10000*n-[FONT=Times New Roman]ε[/FONT]. Time to valgrind! |
[quote=Jeff Gilchrist;225569]...the non-MPI version going and I also got the same error on that:
[code]
matrix is 7416958 x 7417185 (2127.1 MB) with weight 525245035 (70.81/col)
sparse part has weight 483423157 (65.18/col)
using block size 262144 for processor cache size 6144 kB
commencing Lanczos iteration (8 threads)
memory use: 2842.5 MB
restarting at iteration 47023 (dim = 2973738) [COLOR=green]<-- not 10000*n-[FONT=Times New Roman]ε[/FONT][/COLOR]
checkpointing every 75120 dimensions
linear algebra at 40.1%, ETA 56h 7m
error: corrupt state, please restart from checkpoint
[/code]
But with the non-MPI version I was able to restart from the last .chk file. In this case I had started with -nc2, ran for a while, stopped, then restarted with -ncr, and it crashed with the corrupt state some time while I was gone. So something fishy is going on in the code base that affects both the MPI and non-MPI versions.
Jeff.[/quote]
Jeff, I see that Jasonp assumes that your memory is ECC and cannot fail. However, clusters of cross-my-heart-and-hope-not-to-die non-ECC i7's exist. Is your cluster ECC? And the computer where the non-MPI version ran?
The orthogonality check is the greatest thing since sliced bread. Memory errors happen. You may remember the idea to make msieve into a harder memory test than Prime95. Because it is! |
Serge, you said that you don't see the failure on restart if you force the checkpoint dump interval to be a multiple of the ortho check interval?
When the ortho check happens, if it succeeds then the check vector is replaced with the result from the current iteration. If a dump happens on the [i]next[/i] iteration, the ortho check vector is probably not orthogonal. The Lanczos recurrence only guarantees that a check vector three iterations back or more would pass the ortho check. So we have to make sure that the ortho check does not run less than four iterations after the last ortho check. |
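The spacing rule described above can be sketched in C as a simple predicate. This is only an illustration under stated assumptions: `MIN_CHECK_GAP` and `ortho_check_is_safe` are hypothetical names, not identifiers from the msieve source, and the value 4 is taken from the post rather than from the code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constant: minimum number of iterations that must pass
   after the check vector was replaced before it can be tested again.
   The value 4 comes from the forum post, not from the msieve source. */
#define MIN_CHECK_GAP 4

/* Returns true if an orthogonality check at iteration `iter` is far
   enough from `last_check_iter`, the iteration at which the check
   vector was last replaced. */
bool ortho_check_is_safe(uint32_t iter, uint32_t last_check_iter)
{
    assert(iter >= last_check_iter);
    return iter - last_check_iter >= MIN_CHECK_GAP;
}
```

With this predicate, a checkpoint dumped one iteration after an ortho check would be flagged as unsafe to check immediately on restart, which matches the failure described in the thread.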
I should have made a copy of the faulty checkpoint, so if I can help to test a patch, just tell me where in the code !...
Regards. |
Okay, I think I've committed a fix for this problem. Could folks here test it out? If it behaves better I'll get the process of releasing v1.47 started; the bugs in the linear algebra make the current version pretty unsuitable for big jobs.
Thanks to everyone for putting up with this churn. |
I also hit the "corrupt state" checkpoint bug with v1.46 on a C157, in linear algebra (Linux x64).
I am now testing v1.47 (SVN 377) and still get "corrupt state".
[COLOR=SeaGreen]8043804767 2010-08-12 17:12 r521b.dat
1462452324 2010-08-12 18:39 r521b.dat.mat
122372972 2010-08-12 18:37 r521b.dat.cyc
201607496 2010-08-14 05:52 r521b.dat.chk[/COLOR]
[COLOR=Blue]Msieve v. 1.47
Mon Aug 16 10:57:05 2010
random seeds: c29fb1b4 9e33d1a6
factoring ......(157 digits)
no P-1/P+1/ECM available, skipping
commencing number field sieve (157-digit input)
R0: -718286775230264074412218462160
R1: 330882384079102889
A0: -574785673446953991103337093662087263
A1: -2849845913901309456445653108969
A2: 19403845861276275944787296
A3: 40609767955771924744
A4: 185982982727102
A5: 33078600
skew 468676.75, size 2.585e-15, alpha -7.130, combined = 1.913e-12 rroots = 3
commencing linear algebra
matrix starts at (0, 0)
matrix is 5039914 x 5040091 (1529.3 MB) with weight 485006686 (96.23/col)
sparse part has weight 340412623 (67.54/col)
saving the first 48 matrix rows for later
matrix includes 64 packed rows
matrix is 5039866 x 5040091 (1482.4 MB) with weight 387000662 (76.78/col)
sparse part has weight 338202876 (67.10/col)
using block size 65536 for processor cache size 6144 kB
commencing Lanczos iteration (2 threads)
memory use: 1225.2 MB
restarting at iteration 47281 (dim = 2989936)
error: corrupt state, please restart from checkpoint
error: corrupt state, please restart from checkpoint[/COLOR]
[COLOR=Red]Msieve v. 1.46
Mon Aug 16 09:54:11 2010
random seeds: 07768b59 621b9f95
factoring ......(157 digits)
no P-1/P+1/ECM available, skipping
commencing number field sieve (157-digit input)
R0: -718286775230264074412218462160
R1: 330882384079102889
A0: -574785673446953991103337093662087263
A1: -2849845913901309456445653108969
A2: 19403845861276275944787296
A3: 40609767955771924744
A4: 185982982727102
A5: 33078600
skew 468676.75, size 2.585e-15, alpha -7.130, combined = 1.913e-12 rroots = 3
commencing linear algebra
matrix starts at (0, 0)
matrix is 5039914 x 5040091 (1529.3 MB) with weight 485006686 (96.23/col)
sparse part has weight 340412623 (67.54/col)
saving the first 48 matrix rows for later
matrix includes 64 packed rows
matrix is 5039866 x 5040091 (1482.4 MB) with weight 387000662 (76.78/col)
sparse part has weight 338202876 (67.10/col)
using block size 65536 for processor cache size 6144 kB
commencing Lanczos iteration (2 threads)
memory use: 1225.2 MB
restarting at iteration 47281 (dim = 2989936)
error: corrupt state, please restart from checkpoint[/COLOR] |
Hold on, all is not lost. You will be able to restart after another patch. Do not start from zero.
Another patch is needed to remember the protected iteration # (e.g. after reading from a file, or from the last test), and then, as the simplest thing, do not perform the test if it is time for the ortho test but the iteration is still within 4 of the protected iteration. It could be more sophisticated: not simply skip but delay by 20 iterations or whatever. (Extra point to consider: the user may have stopped by signal; then the iteration number will be random and the bug will reappear.)
_____
[COLOR=green]@Jasonp: yes. Errh, I mean, with the disclaimer that there were no more restarts, so it's hard to tell. My memory (my [I]computer's[/I] memory) is flaky, but not too much. One flaw per week in summer, none in winter. I can live with that. :-)[/COLOR] |
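The "delay rather than skip" variant suggested above could look roughly like the following C sketch. Everything here is hypothetical: `PROTECT_WINDOW`, `CHECK_DELAY` and `schedule_ortho_check` are illustrative names, the values 4 and 20 are the ones mentioned in the post, and none of this is the patch that was actually committed to msieve.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants, using the numbers mentioned in the post. */
#define PROTECT_WINDOW 4   /* iterations after the protected one to avoid */
#define CHECK_DELAY    20  /* how far to push a check that lands too early */

/* Given the iteration at which an ortho check is due and the protected
   iteration (e.g. the one a checkpoint was read from, or the previous
   check), return the iteration at which the check should really run. */
uint32_t schedule_ortho_check(uint32_t scheduled_iter,
                              uint32_t protected_iter)
{
    assert(scheduled_iter >= protected_iter);
    if (scheduled_iter - protected_iter < PROTECT_WINDOW)
        return scheduled_iter + CHECK_DELAY;  /* too close: delay it */
    return scheduled_iter;                    /* far enough: run as planned */
}
```

Delaying instead of skipping means the check still happens soon after a restart, so a real memory error would not go unnoticed for a whole check interval.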
If you already have a checkpoint that was written at an inopportune time, there isn't much you can do. Try changing the check interval to 11000 on line 837 of common/lanczos/lanczos.c and the restart should work then.
Serge: you beat me to it. This is an LA bug that has always been there but has been hidden until now because the dump interval has always been a multiple of 10000. Even older versions are vulnerable if the user interrupts the LA at a bad time. Another possibility to get around this problem is to update the check vector after the ortho check not with the current solution but with the solution from four iterations back (which is still available but gets deleted at the end of the current iteration). That would allow the check to pass even if it happened right on the next iteration. The downside is that an error that occurs less than four iterations back will not be noticed for an entire check interval. |
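The alternative of using a lagged check vector can be illustrated with a small ring buffer in C. This is a sketch under assumptions: it tracks only iteration numbers rather than real Lanczos vectors, and `iter_history_t`, `history_push` and `history_oldest` are hypothetical names that do not exist in msieve.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: how many past iterates are still held in memory.
   The value 4 follows the post, not the msieve source. */
#define HISTORY 4

typedef struct {
    uint32_t iters[HISTORY]; /* iteration number of each retained iterate */
    uint32_t count;          /* total number of iterates pushed so far */
} iter_history_t;

/* Record the iterate produced at `iter`, overwriting the oldest entry
   once the buffer is full. */
void history_push(iter_history_t *h, uint32_t iter)
{
    h->iters[h->count % HISTORY] = iter;
    h->count++;
}

/* Iteration number of the oldest iterate still retained. After a
   successful ortho check, this is the one that would become the new
   check vector instead of the current solution. */
uint32_t history_oldest(const iter_history_t *h)
{
    assert(h->count > 0);
    if (h->count < HISTORY)
        return h->iters[0];
    return h->iters[h->count % HISTORY];
}
```

If the check runs at iteration n before n itself is pushed, `history_oldest` returns n-4, so a check on the very next iteration would still compare against a vector that is old enough to pass.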