Msieve v1.48 feedback
Running msieve 1.48 on 64-bit Linux, I'm running into a problem restarting a linear algebra run, and I'm not sure whether it is msieve or something wrong with the cluster.
Using -nc2 5,5 with 25 CPUs, it started the linear algebra and completed about 89%. Now when I try to restart it with -ncr 5,5, some of the nodes are reporting:
[CODE]
Sat Feb 19 06:39:51 2011 commencing Lanczos iteration
Sat Feb 19 06:39:51 2011 memory use: 90.7 MB
Sat Feb 19 06:39:55 2011 restarting at iteration 82229 (dim = 5200062)
[/CODE]
While others are stuck at this point, without the "restarting" line:
[CODE]
Sat Feb 19 06:39:47 2011 commencing Lanczos iteration
Sat Feb 19 06:39:47 2011 memory use: 91.0 MB
[/CODE]
There is never any further output in any of the log files or on stdout; it just sits there, even though there are 25 processes running, each using 100% of its core. Any idea what is going on or how to correct this? If I go back to the previous checkpoint file, the same thing happens.
Jeff. |
After reverting to version 1.47, it is now working. Something seems to be broken in 1.48 (official release, not using latest SVN).
Jeff |
Uh-oh, it didn't like that:
[CODE]
linear algebra completed 5860421 of 5860705 dimensions (100.0%, ETA 0h 0m)
lanczos halted after 92675 iterations (dim = 5860478)
lanczos halted after 92675 iterations (dim = 5860478)
[... several deleted...]
lanczos error: only trivial dependencies found
lanczos error: only trivial dependencies found
BLanczosTime: 8967
lanczos error: only trivial dependencies found
BLanczosTime: 8968
lanczos error: only trivial dependencies found
[/CODE]
Is that because I tried to mix 1.47 and 1.48 or is this another issue? |
Full log, please? It could be the gross oversieving symptom (I am shooting in the dark).
|
1 Attachment(s)
[QUOTE=Batalov;253062]Full log, please? It could be the gross oversieving symptom (I am shooting in the dark).[/QUOTE]
Darn, somehow my linear algebra log got flushed when I tried to restart, so I only have the initial filtering part, the start of the LA, and the error from the square root.
Filtering (with relation error reads removed): [I]See attached[/I]
Linear algebra (from my restart):
[CODE]
Sat Feb 19 19:22:33 2011
Sat Feb 19 19:22:33 2011
Sat Feb 19 19:22:33 2011 Msieve v. 1.48
Sat Feb 19 19:22:33 2011 random seeds: 7b2cfd83 3a840734
Sat Feb 19 19:22:33 2011 MPI process 0 of 25
Sat Feb 19 19:22:33 2011 factoring 2514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883 (226 digits)
Sat Feb 19 19:22:35 2011 no P-1/P+1/ECM available, skipping
Sat Feb 19 19:22:35 2011 commencing number field sieve (226-digit input)
Sat Feb 19 19:22:35 2011 R0: -10000000000000000000000000000000000000
Sat Feb 19 19:22:35 2011 R1: 1
Sat Feb 19 19:22:35 2011 A0: -7
Sat Feb 19 19:22:35 2011 A1: 0
Sat Feb 19 19:22:35 2011 A2: 0
Sat Feb 19 19:22:35 2011 A3: 0
Sat Feb 19 19:22:35 2011 A4: 0
Sat Feb 19 19:22:35 2011 A5: 0
Sat Feb 19 19:22:35 2011 A6: 430000
Sat Feb 19 19:22:35 2011 skew 0.16, size 2.179e-11, alpha -1.900, combined = 1.138e-12 rroots = 2
Sat Feb 19 19:22:35 2011
Sat Feb 19 19:22:35 2011 commencing linear algebra
Sat Feb 19 19:22:35 2011 initialized process (0,0) of 5 x 5 grid
Sat Feb 19 19:22:38 2011 read 5860705 cycles
Sat Feb 19 19:22:54 2011 cycles contain 16443107 unique relations
Sat Feb 19 19:27:05 2011 read 16443107 relations
Sat Feb 19 19:27:48 2011 using 20 quadratic characters above 1073737242
Sat Feb 19 19:29:09 2011 building initial matrix
Sat Feb 19 19:34:22 2011 memory use: 2172.4 MB
Sat Feb 19 19:34:33 2011 read 5860705 cycles
Sat Feb 19 19:34:34 2011 matrix is 5860526 x 5860705 (1952.9 MB) with weight 573681918 (97.89/col)
Sat Feb 19 19:34:34 2011 sparse part has weight 447476324 (76.35/col)
Sat Feb 19 19:35:42 2011 filtering completed in 1 passes
Sat Feb 19 19:35:43 2011 matrix is 5860526 x 5860705 (1952.9 MB) with weight 573681918 (97.89/col)
Sat Feb 19 19:35:43 2011 sparse part has weight 447476324 (76.35/col)
Sat Feb 19 19:36:43 2011 matrix starts at (0, 0)
Sat Feb 19 19:36:43 2011 matrix is 1172198 x 1039795 (122.5 MB) with weight 42759959 (41.12/col)
Sat Feb 19 19:36:43 2011 sparse part has weight 20671663 (19.88/col)
Sat Feb 19 19:36:43 2011 saving the first 48 matrix rows for later
Sat Feb 19 19:36:43 2011 matrix includes 64 packed rows
Sat Feb 19 19:37:36 2011 matrix is 1172150 x 1039795 (106.3 MB) with weight 24404657 (23.47/col)
Sat Feb 19 19:37:36 2011 sparse part has weight 17480889 (16.81/col)
Sat Feb 19 19:37:36 2011 using block size 262144 for processor cache size 6144 kB
Sat Feb 19 19:37:37 2011 commencing Lanczos iteration
Sat Feb 19 19:37:37 2011 memory use: 130.9 MB
Sat Feb 19 19:38:23 2011 linear algebra at 0.0%, ETA 44h56m
Sat Feb 19 19:38:38 2011 checkpointing every 130000 dimensions
.
.
.
<end of log from previous>
.
.
.
linear algebra completed 5860421 of 5860705 dimensions (100.0%, ETA 0h 0m)
lanczos halted after 92675 iterations (dim = 5860478)
lanczos halted after 92675 iterations (dim = 5860478)
[... several deleted...]
lanczos error: only trivial dependencies found
lanczos error: only trivial dependencies found
BLanczosTime: 8967
lanczos error: only trivial dependencies found
BLanczosTime: 8968
lanczos error: only trivial dependencies found
[/CODE]
Attempted square root:
[CODE]
Sat Feb 19 16:47:36 2011
Sat Feb 19 16:47:36 2011
Sat Feb 19 16:47:36 2011 Msieve v. 1.48
Sat Feb 19 16:47:36 2011 random seeds: ab68941b 6b786b14
Sat Feb 19 16:47:36 2011 factoring 2514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883 (226 digits)
Sat Feb 19 16:47:39 2011 no P-1/P+1/ECM available, skipping
Sat Feb 19 16:47:39 2011 commencing number field sieve (226-digit input)
Sat Feb 19 16:47:39 2011 R0: -10000000000000000000000000000000000000
Sat Feb 19 16:47:39 2011 R1: 1
Sat Feb 19 16:47:39 2011 A0: -7
Sat Feb 19 16:47:39 2011 A1: 0
Sat Feb 19 16:47:39 2011 A2: 0
Sat Feb 19 16:47:39 2011 A3: 0
Sat Feb 19 16:47:39 2011 A4: 0
Sat Feb 19 16:47:39 2011 A5: 0
Sat Feb 19 16:47:39 2011 A6: 430000
Sat Feb 19 16:47:39 2011 skew 0.16, size 2.179e-11, alpha -1.900, combined = 1.138e-12 rroots = 2
Sat Feb 19 16:47:39 2011
Sat Feb 19 16:47:39 2011 commencing square root phase
Sat Feb 19 16:47:39 2011 error: read_cycles can't open dependency file
[/CODE] |
The 'only trivial dependencies found' message will appear for all the machines except for process 0, which writes the dependency file. Though in this case process 0 seems to have messed up too :)
Regarding the deadlock, could you build with debug symbols, then build the matrix and restart from a checkpoint, then use 'pstack' on some of the nodes to see where they are hanging? I think the hang point is after the matrix gets read in, so it may be a race in the checkpoint reading code. AFAIK there aren't any disk format changes between 1.47 and 1.48, although there were a few changes to the MPI lanczos initialization, including the checkpoint reading code. |
Well, there's one typo that should be fixed...does applying the following patch make an LA restart behave better?
[code]
===================================================================
--- common/lanczos/lanczos.c (revision 541)
+++ common/lanczos/lanczos.c (working copy)
@@ -628,7 +628,7 @@
 #ifdef HAVE_MPI
 	/* push the full-size vectors to the top grid row */
-	if (obj->mpi_ncols > 1) {
+	if (obj->mpi_ncols > 1 && obj->mpi_la_row_rank == 0) {
 		MPI_TRY(MPI_Scatterv(x, packed_matrix->col_counts,
 				packed_matrix->col_offsets,
 				MPI_LONG_LONG, x, n,
 				MPI_LONG_LONG, 0,
[/code] |
[QUOTE=jasonp;253151]Well, there's one typo that should be fixed...does applying the following patch make an LA restart behave better?
[/QUOTE]
Yes, that fixes the restart. Yesterday I started again with -nc2. With the old binary I can reproduce the -ncr failure every time; with the fix it restarts every time.
Unfortunately, I forgot to save the .cyc and .mat files before I restarted with -nc2 (the -nc1 was not redone). Can I still use an old checkpoint file (at 89%), or would everything be all messed up? I will let you know when it finishes the second time whether the last step actually works.
Jeff. |
I'm not sure you can restart from an old checkpoint; if the restart code has a race then it's possible you previously restarted from a checkpoint and corrupted the result. But 11% of your run sounds like less than overnight, so it's worth a shot.
Sorry for the churn... PS: If you don't have the .cyc file that built the matrix then you'll have to start over. The square root will need it even if the LA did not, and I can't guarantee that you'll get the same matrix even from the same initial dataset. |
[QUOTE=jasonp;253219]I'm not sure you can restart from an old checkpoint; if the restart code has a race then it's possible you previously restarted from a checkpoint and corrupted the result. But 11% of your run sounds like less than overnight, so it's worth a shot.
Sorry for the churn... PS: If you don't have the .cyc file that built the matrix then you'll have to start over. The square root will need it even if the LA did not, and I can't guarantee that you'll get the same matrix even from the same initial dataset.[/QUOTE]
I tried it just to see, assuming that it wouldn't work, and it failed with a corruption message. That is OK; it was my fault for not saving the file. I'm down to a 3.5h ETA now, so no big deal. It should be interesting to see what happens after that. At least I will have the full log this time.
Jeff. |
Ok, while restarting works, it failed again at the LA stage. Here is what is in the log. You already have the filtering stage log posted previously.
Rank 0 log:
[CODE]
Sat Feb 19 19:22:33 2011
Sat Feb 19 19:22:33 2011
Sat Feb 19 19:22:33 2011 Msieve v. 1.48
Sat Feb 19 19:22:33 2011 random seeds: 7b2cfd83 3a840734
Sat Feb 19 19:22:33 2011 MPI process 0 of 25
Sat Feb 19 19:22:33 2011 factoring 2514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883 (226 digits)
Sat Feb 19 19:22:35 2011 no P-1/P+1/ECM available, skipping
Sat Feb 19 19:22:35 2011 commencing number field sieve (226-digit input)
Sat Feb 19 19:22:35 2011 R0: -10000000000000000000000000000000000000
Sat Feb 19 19:22:35 2011 R1: 1
Sat Feb 19 19:22:35 2011 A0: -7
Sat Feb 19 19:22:35 2011 A1: 0
Sat Feb 19 19:22:35 2011 A2: 0
Sat Feb 19 19:22:35 2011 A3: 0
Sat Feb 19 19:22:35 2011 A4: 0
Sat Feb 19 19:22:35 2011 A5: 0
Sat Feb 19 19:22:35 2011 A6: 430000
Sat Feb 19 19:22:35 2011 skew 0.16, size 2.179e-11, alpha -1.900, combined = 1.138e-12 rroots = 2
Sat Feb 19 19:22:35 2011
Sat Feb 19 19:22:35 2011 commencing linear algebra
Sat Feb 19 19:22:35 2011 initialized process (0,0) of 5 x 5 grid
Sat Feb 19 19:22:38 2011 read 5860705 cycles
Sat Feb 19 19:22:54 2011 cycles contain 16443107 unique relations
Sat Feb 19 19:27:05 2011 read 16443107 relations
Sat Feb 19 19:27:48 2011 using 20 quadratic characters above 1073737242
Sat Feb 19 19:29:09 2011 building initial matrix
Sat Feb 19 19:34:22 2011 memory use: 2172.4 MB
Sat Feb 19 19:34:33 2011 read 5860705 cycles
Sat Feb 19 19:34:34 2011 matrix is 5860526 x 5860705 (1952.9 MB) with weight 573681918 (97.89/col)
Sat Feb 19 19:34:34 2011 sparse part has weight 447476324 (76.35/col)
Sat Feb 19 19:35:42 2011 filtering completed in 1 passes
Sat Feb 19 19:35:43 2011 matrix is 5860526 x 5860705 (1952.9 MB) with weight 573681918 (97.89/col)
Sat Feb 19 19:35:43 2011 sparse part has weight 447476324 (76.35/col)
Sat Feb 19 19:36:43 2011 matrix starts at (0, 0)
Sat Feb 19 19:36:43 2011 matrix is 1172198 x 1039795 (122.5 MB) with weight 42759959 (41.12/col)
Sat Feb 19 19:36:43 2011 sparse part has weight 20671663 (19.88/col)
Sat Feb 19 19:36:43 2011 saving the first 48 matrix rows for later
Sat Feb 19 19:36:43 2011 matrix includes 64 packed rows
Sat Feb 19 19:37:36 2011 matrix is 1172150 x 1039795 (106.3 MB) with weight 24404657 (23.47/col)
Sat Feb 19 19:37:36 2011 sparse part has weight 17480889 (16.81/col)
Sat Feb 19 19:37:36 2011 using block size 262144 for processor cache size 6144 kB
Sat Feb 19 19:37:37 2011 commencing Lanczos iteration
Sat Feb 19 19:37:37 2011 memory use: 130.9 MB
Sat Feb 19 19:38:23 2011 linear algebra at 0.0%, ETA 44h56m
Sat Feb 19 19:38:38 2011 checkpointing every 130000 dimensions
Mon Feb 21 09:04:07 2011 lanczos halted after 92679 iterations (dim = 5860478)
[/CODE]
Output file:
[CODE]
linear algebra completed 5860370 of 5860705 dimensions (100.0%, ETA 0h 0m)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos error: only trivial dependencies found
lanczos error: only trivial dependencies found
BLanczosTime: 135693
lanczos error: only trivial dependencies found
lanczos error: only trivial dependencies found
BLanczosTime: 135693
commencing square root phase
BLanczosTime: 135693
lanczos error: only trivial dependencies found
BLanczosTime: 135693
commencing square root phase
commencing square root phase
reading relations for dependency 5
BLanczosTime: 135693
lanczos error: only trivial dependencies found
reading relations for dependency 5
commencing square root phase
reading relations for dependency 5
commencing square root phase
lanczos error: only trivial dependencies found
BLanczosTime: 135693
error: read_cycles can't open dependency file
error: read_cycles can't open dependency file
lanczos error: only trivial dependencies found
reading relations for dependency 5
reading relations for dependency 5
BLanczosTime: 135693
--------------------------------------------------------------------------
mpirun.exe has exited due to process rank 8 with PID 20755 on
node bro64 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun.exe (as reported here).
--------------------------------------------------------------------------
error: read_cycles can't open dependency file
commencing square root phase
received signal 15; shutting down
received signal 15; shutting down
received signal 15; shutting down
BLanczosTime: 135693
received signal 15; shutting down
received signal 15; shutting down
received signal 15; shutting down
received signal 15; shutting down
received signal 15; shutting down
received signal 15; shutting down
received signal 15; shutting down
received signal 15; shutting down
received signal 15; shutting down
received signal 15; shutting down
received signal 15; shutting down
received signal 15; shutting down
[/CODE]
I have a .cyc, .mat, .mat.idx, but no .dep file.
Jeff. |