#1
Jun 2003
Ottawa, Canada
3×17×23 Posts
Running msieve 1.48 on 64-bit Linux, I'm running into a problem restarting a linear algebra run, and I'm not sure whether the cause is msieve or something wrong with the cluster.
Using -nc2 5,5 with 25 CPUs, it started the linear algebra and completed about 89%. Now when I try to restart it with -ncr 5,5, some of the nodes report:
Code:
Sat Feb 19 06:39:51 2011 commencing Lanczos iteration
Sat Feb 19 06:39:51 2011 memory use: 90.7 MB
Sat Feb 19 06:39:55 2011 restarting at iteration 82229 (dim = 5200062)
while others show only:
Code:
Sat Feb 19 06:39:47 2011 commencing Lanczos iteration
Sat Feb 19 06:39:47 2011 memory use: 91.0 MB
If I go back to the previous checkpoint file, the same thing happens.
Jeff.
#2
Jun 2003
Ottawa, Canada
3×17×23 Posts
After reverting to version 1.47, it is now working. Something seems to be broken in 1.48 (the official release, not the latest SVN).
Jeff
Last fiddled with by Jeff Gilchrist on 2011-02-19 at 19:12
#3
Jun 2003
Ottawa, Canada
3·17·23 Posts
Uh-oh, it didn't like that:
Code:
linear algebra completed 5860421 of 5860705 dimensions (100.0%, ETA 0h 0m)
lanczos halted after 92675 iterations (dim = 5860478)
lanczos halted after 92675 iterations (dim = 5860478)
[... several deleted...]
lanczos error: only trivial dependencies found
lanczos error: only trivial dependencies found
BLanczosTime: 8967
lanczos error: only trivial dependencies found
BLanczosTime: 8968
lanczos error: only trivial dependencies found
#4
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2
3⁶×13 Posts
Full log, please? It could be a symptom of gross oversieving (I am shooting in the dark).
#5
Jun 2003
Ottawa, Canada
2225₈ Posts
Quote:
Filtering (with relation error reads removed): See attached
Linear algebra (from my restart):
Code:
Sat Feb 19 19:22:33 2011
Sat Feb 19 19:22:33 2011
Sat Feb 19 19:22:33 2011 Msieve v. 1.48
Sat Feb 19 19:22:33 2011 random seeds: 7b2cfd83 3a840734
Sat Feb 19 19:22:33 2011 MPI process 0 of 25
Sat Feb 19 19:22:33 2011 factoring 2514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883 (226 digits)
Sat Feb 19 19:22:35 2011 no P-1/P+1/ECM available, skipping
Sat Feb 19 19:22:35 2011 commencing number field sieve (226-digit input)
Sat Feb 19 19:22:35 2011 R0: -10000000000000000000000000000000000000
Sat Feb 19 19:22:35 2011 R1: 1
Sat Feb 19 19:22:35 2011 A0: -7
Sat Feb 19 19:22:35 2011 A1: 0
Sat Feb 19 19:22:35 2011 A2: 0
Sat Feb 19 19:22:35 2011 A3: 0
Sat Feb 19 19:22:35 2011 A4: 0
Sat Feb 19 19:22:35 2011 A5: 0
Sat Feb 19 19:22:35 2011 A6: 430000
Sat Feb 19 19:22:35 2011 skew 0.16, size 2.179e-11, alpha -1.900, combined = 1.138e-12 rroots = 2
Sat Feb 19 19:22:35 2011
Sat Feb 19 19:22:35 2011 commencing linear algebra
Sat Feb 19 19:22:35 2011 initialized process (0,0) of 5 x 5 grid
Sat Feb 19 19:22:38 2011 read 5860705 cycles
Sat Feb 19 19:22:54 2011 cycles contain 16443107 unique relations
Sat Feb 19 19:27:05 2011 read 16443107 relations
Sat Feb 19 19:27:48 2011 using 20 quadratic characters above 1073737242
Sat Feb 19 19:29:09 2011 building initial matrix
Sat Feb 19 19:34:22 2011 memory use: 2172.4 MB
Sat Feb 19 19:34:33 2011 read 5860705 cycles
Sat Feb 19 19:34:34 2011 matrix is 5860526 x 5860705 (1952.9 MB) with weight 573681918 (97.89/col)
Sat Feb 19 19:34:34 2011 sparse part has weight 447476324 (76.35/col)
Sat Feb 19 19:35:42 2011 filtering completed in 1 passes
Sat Feb 19 19:35:43 2011 matrix is 5860526 x 5860705 (1952.9 MB) with weight 573681918 (97.89/col)
Sat Feb 19 19:35:43 2011 sparse part has weight 447476324 (76.35/col)
Sat Feb 19 19:36:43 2011 matrix starts at (0, 0)
Sat Feb 19 19:36:43 2011 matrix is 1172198 x 1039795 (122.5 MB) with weight 42759959 (41.12/col)
Sat Feb 19 19:36:43 2011 sparse part has weight 20671663 (19.88/col)
Sat Feb 19 19:36:43 2011 saving the first 48 matrix rows for later
Sat Feb 19 19:36:43 2011 matrix includes 64 packed rows
Sat Feb 19 19:37:36 2011 matrix is 1172150 x 1039795 (106.3 MB) with weight 24404657 (23.47/col)
Sat Feb 19 19:37:36 2011 sparse part has weight 17480889 (16.81/col)
Sat Feb 19 19:37:36 2011 using block size 262144 for processor cache size 6144 kB
Sat Feb 19 19:37:37 2011 commencing Lanczos iteration
Sat Feb 19 19:37:37 2011 memory use: 130.9 MB
Sat Feb 19 19:38:23 2011 linear algebra at 0.0%, ETA 44h56m
Sat Feb 19 19:38:38 2011 checkpointing every 130000 dimensions
. . . <end of log from previous> . . .
linear algebra completed 5860421 of 5860705 dimensions (100.0%, ETA 0h 0m)
lanczos halted after 92675 iterations (dim = 5860478)
lanczos halted after 92675 iterations (dim = 5860478)
[... several deleted...]
lanczos error: only trivial dependencies found
lanczos error: only trivial dependencies found
BLanczosTime: 8967
lanczos error: only trivial dependencies found
BLanczosTime: 8968
lanczos error: only trivial dependencies found
Code:
Sat Feb 19 16:47:36 2011
Sat Feb 19 16:47:36 2011
Sat Feb 19 16:47:36 2011 Msieve v. 1.48
Sat Feb 19 16:47:36 2011 random seeds: ab68941b 6b786b14
Sat Feb 19 16:47:36 2011 factoring 2514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883 (226 digits)
Sat Feb 19 16:47:39 2011 no P-1/P+1/ECM available, skipping
Sat Feb 19 16:47:39 2011 commencing number field sieve (226-digit input)
Sat Feb 19 16:47:39 2011 R0: -10000000000000000000000000000000000000
Sat Feb 19 16:47:39 2011 R1: 1
Sat Feb 19 16:47:39 2011 A0: -7
Sat Feb 19 16:47:39 2011 A1: 0
Sat Feb 19 16:47:39 2011 A2: 0
Sat Feb 19 16:47:39 2011 A3: 0
Sat Feb 19 16:47:39 2011 A4: 0
Sat Feb 19 16:47:39 2011 A5: 0
Sat Feb 19 16:47:39 2011 A6: 430000
Sat Feb 19 16:47:39 2011 skew 0.16, size 2.179e-11, alpha -1.900, combined = 1.138e-12 rroots = 2
Sat Feb 19 16:47:39 2011
Sat Feb 19 16:47:39 2011 commencing square root phase
Sat Feb 19 16:47:39 2011 error: read_cycles can't open dependency file
#6
Tribal Bullet
Oct 2004
3,541 Posts
The 'only trivial dependencies found' message will appear for all the machines except for process 0, which writes the dependency file. Though in this case process 0 seems to have messed up too :)
Regarding the deadlock, could you build with debug symbols, then build the matrix and restart from a checkpoint, then use 'pstack' on some of the nodes to see where they are hanging? I think the hang point is after the matrix gets read in, so it may be a race in the checkpoint reading code.
AFAIK there aren't any disk format changes between 1.47 and 1.48, although there were a few changes to the MPI lanczos initialization, including the checkpoint reading code.
Last fiddled with by jasonp on 2011-02-20 at 15:52
#7
Tribal Bullet
Oct 2004
110111010101₂ Posts
Well, there's one typo that should be fixed. Does applying the following patch make an LA restart behave better?
Code:
===================================================================
--- common/lanczos/lanczos.c (revision 541)
+++ common/lanczos/lanczos.c (working copy)
@@ -628,7 +628,7 @@
#ifdef HAVE_MPI
/* push the full-size vectors to the top grid row */
- if (obj->mpi_ncols > 1) {
+ if (obj->mpi_ncols > 1 && obj->mpi_la_row_rank == 0) {
MPI_TRY(MPI_Scatterv(x, packed_matrix->col_counts,
packed_matrix->col_offsets,
MPI_LONG_LONG, x, n, MPI_LONG_LONG, 0,
#8
Jun 2003
Ottawa, Canada
2225₈ Posts
Quote:
Unfortunately, I forgot to save the .cyc and .mat files before I restarted with -nc2 (the -nc1 step was not redone). Can I still use an old checkpoint file (at 89%), or would everything be all messed up? I will let you know when it finishes the second time whether the last step actually works.
Jeff.
#9
Tribal Bullet
Oct 2004
6725₈ Posts
I'm not sure you can restart from an old checkpoint; if the restart code has a race then it's possible you previously restarted from a checkpoint and corrupted the result. But 11% of your run sounds like less than overnight, so it's worth a shot.
Sorry for the churn...
PS: If you don't have the .cyc file that built the matrix then you'll have to start over. The square root will need it even if the LA did not, and I can't guarantee that you'll get the same matrix even from the same initial dataset.
Last fiddled with by jasonp on 2011-02-21 at 02:19
#10
Jun 2003
Ottawa, Canada
3×17×23 Posts
Quote:
Jeff.
#11
Jun 2003
Ottawa, Canada
1173₁₀ Posts
OK, while restarting works, it failed again at the LA stage. Here is what is in the log. You already have the filtering stage log posted previously.
Rank 0 log:
Code:
Sat Feb 19 19:22:33 2011
Sat Feb 19 19:22:33 2011
Sat Feb 19 19:22:33 2011 Msieve v. 1.48
Sat Feb 19 19:22:33 2011 random seeds: 7b2cfd83 3a840734
Sat Feb 19 19:22:33 2011 MPI process 0 of 25
Sat Feb 19 19:22:33 2011 factoring 2514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883 (226 digits)
Sat Feb 19 19:22:35 2011 no P-1/P+1/ECM available, skipping
Sat Feb 19 19:22:35 2011 commencing number field sieve (226-digit input)
Sat Feb 19 19:22:35 2011 R0: -10000000000000000000000000000000000000
Sat Feb 19 19:22:35 2011 R1: 1
Sat Feb 19 19:22:35 2011 A0: -7
Sat Feb 19 19:22:35 2011 A1: 0
Sat Feb 19 19:22:35 2011 A2: 0
Sat Feb 19 19:22:35 2011 A3: 0
Sat Feb 19 19:22:35 2011 A4: 0
Sat Feb 19 19:22:35 2011 A5: 0
Sat Feb 19 19:22:35 2011 A6: 430000
Sat Feb 19 19:22:35 2011 skew 0.16, size 2.179e-11, alpha -1.900, combined = 1.138e-12 rroots = 2
Sat Feb 19 19:22:35 2011
Sat Feb 19 19:22:35 2011 commencing linear algebra
Sat Feb 19 19:22:35 2011 initialized process (0,0) of 5 x 5 grid
Sat Feb 19 19:22:38 2011 read 5860705 cycles
Sat Feb 19 19:22:54 2011 cycles contain 16443107 unique relations
Sat Feb 19 19:27:05 2011 read 16443107 relations
Sat Feb 19 19:27:48 2011 using 20 quadratic characters above 1073737242
Sat Feb 19 19:29:09 2011 building initial matrix
Sat Feb 19 19:34:22 2011 memory use: 2172.4 MB
Sat Feb 19 19:34:33 2011 read 5860705 cycles
Sat Feb 19 19:34:34 2011 matrix is 5860526 x 5860705 (1952.9 MB) with weight 573681918 (97.89/col)
Sat Feb 19 19:34:34 2011 sparse part has weight 447476324 (76.35/col)
Sat Feb 19 19:35:42 2011 filtering completed in 1 passes
Sat Feb 19 19:35:43 2011 matrix is 5860526 x 5860705 (1952.9 MB) with weight 573681918 (97.89/col)
Sat Feb 19 19:35:43 2011 sparse part has weight 447476324 (76.35/col)
Sat Feb 19 19:36:43 2011 matrix starts at (0, 0)
Sat Feb 19 19:36:43 2011 matrix is 1172198 x 1039795 (122.5 MB) with weight 42759959 (41.12/col)
Sat Feb 19 19:36:43 2011 sparse part has weight 20671663 (19.88/col)
Sat Feb 19 19:36:43 2011 saving the first 48 matrix rows for later
Sat Feb 19 19:36:43 2011 matrix includes 64 packed rows
Sat Feb 19 19:37:36 2011 matrix is 1172150 x 1039795 (106.3 MB) with weight 24404657 (23.47/col)
Sat Feb 19 19:37:36 2011 sparse part has weight 17480889 (16.81/col)
Sat Feb 19 19:37:36 2011 using block size 262144 for processor cache size 6144 kB
Sat Feb 19 19:37:37 2011 commencing Lanczos iteration
Sat Feb 19 19:37:37 2011 memory use: 130.9 MB
Sat Feb 19 19:38:23 2011 linear algebra at 0.0%, ETA 44h56m
Sat Feb 19 19:38:38 2011 checkpointing every 130000 dimensions
Mon Feb 21 09:04:07 2011 lanczos halted after 92679 iterations (dim = 5860478)
Code:
linear algebra completed 5860370 of 5860705 dimensions (100.0%, ETA 0h 0m) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos halted after 92679 iterations (dim = 5860478) lanczos error: only trivial dependencies found lanczos error: only trivial dependencies found BLanczosTime: 135693 lanczos error: only trivial dependencies found lanczos error: only trivial dependencies found BLanczosTime: 135693 commencing square root phase BLanczosTime: 135693 lanczos error: only trivial dependencies found BLanczosTime: 135693 commencing square root phase commencing square root phase reading relations for dependency 5 BLanczosTime: 135693 lanczos error: only trivial dependencies found reading relations for dependency 5 commencing square root phase 
reading relations for dependency 5 commencing square root phase lanczos error: only trivial dependencies found BLanczosTime: 135693 error: read_cycles can't open dependency file error: read_cycles can't open dependency file lanczos error: only trivial dependencies found reading relations for dependency 5 reading relations for dependency 5 BLanczosTime: 135693 -------------------------------------------------------------------------- mpirun.exe has exited due to process rank 8 with PID 20755 on node bro64 exiting without calling "finalize". This may have caused other processes in the application to be terminated by signals sent by mpirun.exe (as reported here). -------------------------------------------------------------------------- error: read_cycles can't open dependency file commencing square root phase received signal 15; shutting down received signal 15; shutting down received signal 15; shutting down BLanczosTime: 135693 received signal 15; shutting down received signal 15; shutting down received signal 15; shutting down received signal 15; shutting down received signal 15; shutting down received signal 15; shutting down received signal 15; shutting down received signal 15; shutting down received signal 15; shutting down received signal 15; shutting down received signal 15; shutting down received signal 15; shutting down Jeff. |