mersenneforum.org

Jeff Gilchrist 2011-02-19 18:41

Msieve v1.48 feedback
 
Running msieve 1.48 on 64-bit Linux, I'm running into a problem restarting the linear algebra, and I'm not sure whether it is msieve or something wrong with the cluster.

Using -nc2 5,5 with 25 CPUs, it started the linear algebra and completed about 89%. Now when I try to restart it with -ncr 5,5, some of the nodes report:
[CODE]Sat Feb 19 06:39:51 2011 commencing Lanczos iteration
Sat Feb 19 06:39:51 2011 memory use: 90.7 MB
Sat Feb 19 06:39:55 2011 restarting at iteration 82229 (dim = 5200062)
[/CODE]

While others are stuck at this without the "restarting" line:
[CODE]Sat Feb 19 06:39:47 2011 commencing Lanczos iteration
Sat Feb 19 06:39:47 2011 memory use: 91.0 MB
[/CODE]

So there is never any further output in any of the log files or on stdout; it just "sits" there, even though there are 25 processes running, each using 100% of its core. Any idea what is going on or how to correct this?

If I go back to the previous checkpoint file, the same thing happens.

Jeff.

Jeff Gilchrist 2011-02-19 19:11

After reverting to version 1.47, it is now working. Something seems to be broken in 1.48 (official release, not the latest SVN).

Jeff

Jeff Gilchrist 2011-02-19 22:06

Uh-oh, it didn't like that:

[CODE]linear algebra completed 5860421 of 5860705 dimensions (100.0%, ETA 0h 0m)
lanczos halted after 92675 iterations (dim = 5860478)
lanczos halted after 92675 iterations (dim = 5860478)
[... several deleted...]
lanczos error: only trivial dependencies found
lanczos error: only trivial dependencies found
BLanczosTime: 8967
lanczos error: only trivial dependencies found
BLanczosTime: 8968
lanczos error: only trivial dependencies found
[/CODE]

Is that because I tried to mix 1.47 and 1.48 or is this another issue?

Batalov 2011-02-19 22:25

Full log, please? It could be the gross oversieving symptom (I am shooting in the dark).

Jeff Gilchrist 2011-02-20 12:15

1 Attachment(s)
[QUOTE=Batalov;253062]Full log, please? It could be the gross oversieving symptom (I am shooting in the dark).[/QUOTE]

Darn, somehow my linear algebra log got flushed when I tried to restart, so I only have the initial filtering part, the start of the LA, and the error from the square root.

Filtering (with relation error reads removed):
[I]See attached[/I]

Linear algebra (from my restart):
[CODE]
Sat Feb 19 19:22:33 2011
Sat Feb 19 19:22:33 2011
Sat Feb 19 19:22:33 2011 Msieve v. 1.48
Sat Feb 19 19:22:33 2011 random seeds: 7b2cfd83 3a840734
Sat Feb 19 19:22:33 2011 MPI process 0 of 25
Sat Feb 19 19:22:33 2011 factoring 2514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883 (226 digits)
Sat Feb 19 19:22:35 2011 no P-1/P+1/ECM available, skipping
Sat Feb 19 19:22:35 2011 commencing number field sieve (226-digit input)
Sat Feb 19 19:22:35 2011 R0: -10000000000000000000000000000000000000
Sat Feb 19 19:22:35 2011 R1: 1
Sat Feb 19 19:22:35 2011 A0: -7
Sat Feb 19 19:22:35 2011 A1: 0
Sat Feb 19 19:22:35 2011 A2: 0
Sat Feb 19 19:22:35 2011 A3: 0
Sat Feb 19 19:22:35 2011 A4: 0
Sat Feb 19 19:22:35 2011 A5: 0
Sat Feb 19 19:22:35 2011 A6: 430000
Sat Feb 19 19:22:35 2011 skew 0.16, size 2.179e-11, alpha -1.900, combined = 1.138e-12 rroots = 2
Sat Feb 19 19:22:35 2011
Sat Feb 19 19:22:35 2011 commencing linear algebra
Sat Feb 19 19:22:35 2011 initialized process (0,0) of 5 x 5 grid
Sat Feb 19 19:22:38 2011 read 5860705 cycles
Sat Feb 19 19:22:54 2011 cycles contain 16443107 unique relations
Sat Feb 19 19:27:05 2011 read 16443107 relations
Sat Feb 19 19:27:48 2011 using 20 quadratic characters above 1073737242
Sat Feb 19 19:29:09 2011 building initial matrix
Sat Feb 19 19:34:22 2011 memory use: 2172.4 MB
Sat Feb 19 19:34:33 2011 read 5860705 cycles
Sat Feb 19 19:34:34 2011 matrix is 5860526 x 5860705 (1952.9 MB) with weight 573681918 (97.89/col)
Sat Feb 19 19:34:34 2011 sparse part has weight 447476324 (76.35/col)
Sat Feb 19 19:35:42 2011 filtering completed in 1 passes
Sat Feb 19 19:35:43 2011 matrix is 5860526 x 5860705 (1952.9 MB) with weight 573681918 (97.89/col)
Sat Feb 19 19:35:43 2011 sparse part has weight 447476324 (76.35/col)
Sat Feb 19 19:36:43 2011 matrix starts at (0, 0)
Sat Feb 19 19:36:43 2011 matrix is 1172198 x 1039795 (122.5 MB) with weight 42759959 (41.12/col)
Sat Feb 19 19:36:43 2011 sparse part has weight 20671663 (19.88/col)
Sat Feb 19 19:36:43 2011 saving the first 48 matrix rows for later
Sat Feb 19 19:36:43 2011 matrix includes 64 packed rows
Sat Feb 19 19:37:36 2011 matrix is 1172150 x 1039795 (106.3 MB) with weight 24404657 (23.47/col)
Sat Feb 19 19:37:36 2011 sparse part has weight 17480889 (16.81/col)
Sat Feb 19 19:37:36 2011 using block size 262144 for processor cache size 6144 kB
Sat Feb 19 19:37:37 2011 commencing Lanczos iteration
Sat Feb 19 19:37:37 2011 memory use: 130.9 MB
Sat Feb 19 19:38:23 2011 linear algebra at 0.0%, ETA 44h56m
Sat Feb 19 19:38:38 2011 checkpointing every 130000 dimensions
.
.
.
<end of log from previous>
.
.
.
linear algebra completed 5860421 of 5860705 dimensions (100.0%, ETA 0h 0m)
lanczos halted after 92675 iterations (dim = 5860478)
lanczos halted after 92675 iterations (dim = 5860478)
[... several deleted...]
lanczos error: only trivial dependencies found
lanczos error: only trivial dependencies found
BLanczosTime: 8967
lanczos error: only trivial dependencies found
BLanczosTime: 8968
lanczos error: only trivial dependencies found
[/CODE]

Attempted square root:
[CODE]
Sat Feb 19 16:47:36 2011
Sat Feb 19 16:47:36 2011
Sat Feb 19 16:47:36 2011 Msieve v. 1.48
Sat Feb 19 16:47:36 2011 random seeds: ab68941b 6b786b14
Sat Feb 19 16:47:36 2011 factoring 2514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883 (226 digits)
Sat Feb 19 16:47:39 2011 no P-1/P+1/ECM available, skipping
Sat Feb 19 16:47:39 2011 commencing number field sieve (226-digit input)
Sat Feb 19 16:47:39 2011 R0: -10000000000000000000000000000000000000
Sat Feb 19 16:47:39 2011 R1: 1
Sat Feb 19 16:47:39 2011 A0: -7
Sat Feb 19 16:47:39 2011 A1: 0
Sat Feb 19 16:47:39 2011 A2: 0
Sat Feb 19 16:47:39 2011 A3: 0
Sat Feb 19 16:47:39 2011 A4: 0
Sat Feb 19 16:47:39 2011 A5: 0
Sat Feb 19 16:47:39 2011 A6: 430000
Sat Feb 19 16:47:39 2011 skew 0.16, size 2.179e-11, alpha -1.900, combined = 1.138e-12 rroots = 2
Sat Feb 19 16:47:39 2011
Sat Feb 19 16:47:39 2011 commencing square root phase
Sat Feb 19 16:47:39 2011 error: read_cycles can't open dependency file
[/CODE]

jasonp 2011-02-20 15:37

The 'only trivial dependencies found' message will appear for all the machines except for process 0, which writes the dependency file. Though in this case process 0 seems to have messed up too :)

Regarding the deadlock, could you build with debug symbols, then build the matrix and restart from a checkpoint, then use 'pstack' on some of the nodes to see where they are hanging? I think the hang point is after the matrix gets read in, so it may be a race in the checkpoint reading code.

AFAIK there aren't any disk format changes between 1.47 and 1.48, although there were a few changes to the MPI lanczos initialization, including the checkpoint reading code.

jasonp 2011-02-20 16:18

Well, there's one typo that should be fixed...does applying the following patch make an LA restart behave better?

[code]
===================================================================
--- common/lanczos/lanczos.c (revision 541)
+++ common/lanczos/lanczos.c (working copy)
@@ -628,7 +628,7 @@
 #ifdef HAVE_MPI
     /* push the full-size vectors to the top grid row */
 
-    if (obj->mpi_ncols > 1) {
+    if (obj->mpi_ncols > 1 && obj->mpi_la_row_rank == 0) {
         MPI_TRY(MPI_Scatterv(x, packed_matrix->col_counts,
                 packed_matrix->col_offsets,
                 MPI_LONG_LONG, x, n, MPI_LONG_LONG, 0,
[/code]
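
To make the effect of the one-line change concrete, here is a minimal sketch in plain C (no MPI) that counts how many processes of a grid pass the guard before and after the patch. It assumes that `mpi_la_row_rank` is the process's row index in the grid, with 0 meaning the top row, and that ranks are laid out row-major; neither is confirmed in this thread. The names `grid_row`, `passes_old_guard`, `passes_new_guard`, and `count_passing` are hypothetical and are not msieve functions.

```c
/* Row index of a rank in a grid with `ncols` columns.
   Row-major layout is an assumption made for this sketch. */
int grid_row(int rank, int ncols)
{
    return rank / ncols;
}

/* Guard before the patch: depends only on the grid width. */
int passes_old_guard(int ncols)
{
    return ncols > 1;
}

/* Guard after the patch: also requires the top grid row (row 0). */
int passes_new_guard(int ncols, int row)
{
    return ncols > 1 && row == 0;
}

/* Count the ranks of an nrows x ncols grid that pass the chosen guard.
   patched == 0 selects the old guard, anything else the new one. */
int count_passing(int nrows, int ncols, int patched)
{
    int rank, count = 0;

    for (rank = 0; rank < nrows * ncols; rank++) {
        int row = grid_row(rank, ncols);
        if (patched ? passes_new_guard(ncols, row)
                    : passes_old_guard(ncols))
            count++;
    }
    return count;
}
```

For the 5 x 5 grid used in this thread, `count_passing(5, 5, 0)` is 25 and `count_passing(5, 5, 1)` is 5. Under these assumptions the unpatched guard lets every process reach the scatter call, while the patched guard admits only the top grid row, which matches the comment in the patched code.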

Jeff Gilchrist 2011-02-20 20:27

[QUOTE=jasonp;253151]Well, there's one typo that should be fixed...does applying the following patch make an LA restart behave better?
[/QUOTE]

Yes, that fixes the restart. Yesterday I started again with -nc2. With the old binary I can reproduce the -ncr failure every time; with the new fix it restarts every time.

Unfortunately for me, I forgot to save the .cyc and .mat files before I restarted with -nc2 (the -nc1 step was not redone). Can I still use an old checkpoint file (at 89%), or would everything be all messed up?

When it finishes the second time, I will let you know whether the last step actually works.

Jeff.

jasonp 2011-02-21 02:15

I'm not sure you can restart from an old checkpoint; if the restart code has a race then it's possible you previously restarted from a checkpoint and corrupted the result. But 11% of your run sounds like less than overnight, so it's worth a shot.

Sorry for the churn...

PS: If you don't have the .cyc file that built the matrix then you'll have to start over. The square root will need it even if the LA did not, and I can't guarantee that you'll get the same matrix even from the same initial dataset.

Jeff Gilchrist 2011-02-21 11:09

[QUOTE=jasonp;253219]I'm not sure you can restart from an old checkpoint; if the restart code has a race then it's possible you previously restarted from a checkpoint and corrupted the result. But 11% of your run sounds like less than overnight, so it's worth a shot.

Sorry for the churn...

PS: If you don't have the .cyc file that built the matrix then you'll have to start over. The square root will need it even if the LA did not, and I can't guarantee that you'll get the same matrix even from the same initial dataset.[/QUOTE]

I tried it just to see, assuming that it wouldn't work, and it failed with a corruption message. That is OK, my fault for not saving the file. I'm down to a 3.5h ETA to finish now, so no big deal. It should be interesting to see what happens after that. At least I will have the full log this time.

Jeff.

Jeff Gilchrist 2011-02-21 21:01

Ok, while restarting works, it failed again at the LA stage. Here is what is in the log. You already have the filtering stage log posted previously.

Rank 0 log:
[CODE]Sat Feb 19 19:22:33 2011
Sat Feb 19 19:22:33 2011
Sat Feb 19 19:22:33 2011 Msieve v. 1.48
Sat Feb 19 19:22:33 2011 random seeds: 7b2cfd83 3a840734
Sat Feb 19 19:22:33 2011 MPI process 0 of 25
Sat Feb 19 19:22:33 2011 factoring 2514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883 (226 digits)
Sat Feb 19 19:22:35 2011 no P-1/P+1/ECM available, skipping
Sat Feb 19 19:22:35 2011 commencing number field sieve (226-digit input)
Sat Feb 19 19:22:35 2011 R0: -10000000000000000000000000000000000000
Sat Feb 19 19:22:35 2011 R1: 1
Sat Feb 19 19:22:35 2011 A0: -7
Sat Feb 19 19:22:35 2011 A1: 0
Sat Feb 19 19:22:35 2011 A2: 0
Sat Feb 19 19:22:35 2011 A3: 0
Sat Feb 19 19:22:35 2011 A4: 0
Sat Feb 19 19:22:35 2011 A5: 0
Sat Feb 19 19:22:35 2011 A6: 430000
Sat Feb 19 19:22:35 2011 skew 0.16, size 2.179e-11, alpha -1.900, combined = 1.138e-12 rroots = 2
Sat Feb 19 19:22:35 2011
Sat Feb 19 19:22:35 2011 commencing linear algebra
Sat Feb 19 19:22:35 2011 initialized process (0,0) of 5 x 5 grid
Sat Feb 19 19:22:38 2011 read 5860705 cycles
Sat Feb 19 19:22:54 2011 cycles contain 16443107 unique relations
Sat Feb 19 19:27:05 2011 read 16443107 relations
Sat Feb 19 19:27:48 2011 using 20 quadratic characters above 1073737242
Sat Feb 19 19:29:09 2011 building initial matrix
Sat Feb 19 19:34:22 2011 memory use: 2172.4 MB
Sat Feb 19 19:34:33 2011 read 5860705 cycles
Sat Feb 19 19:34:34 2011 matrix is 5860526 x 5860705 (1952.9 MB) with weight 573681918 (97.89/col)
Sat Feb 19 19:34:34 2011 sparse part has weight 447476324 (76.35/col)
Sat Feb 19 19:35:42 2011 filtering completed in 1 passes
Sat Feb 19 19:35:43 2011 matrix is 5860526 x 5860705 (1952.9 MB) with weight 573681918 (97.89/col)
Sat Feb 19 19:35:43 2011 sparse part has weight 447476324 (76.35/col)
Sat Feb 19 19:36:43 2011 matrix starts at (0, 0)
Sat Feb 19 19:36:43 2011 matrix is 1172198 x 1039795 (122.5 MB) with weight 42759959 (41.12/col)
Sat Feb 19 19:36:43 2011 sparse part has weight 20671663 (19.88/col)
Sat Feb 19 19:36:43 2011 saving the first 48 matrix rows for later
Sat Feb 19 19:36:43 2011 matrix includes 64 packed rows
Sat Feb 19 19:37:36 2011 matrix is 1172150 x 1039795 (106.3 MB) with weight 24404657 (23.47/col)
Sat Feb 19 19:37:36 2011 sparse part has weight 17480889 (16.81/col)
Sat Feb 19 19:37:36 2011 using block size 262144 for processor cache size 6144 kB
Sat Feb 19 19:37:37 2011 commencing Lanczos iteration
Sat Feb 19 19:37:37 2011 memory use: 130.9 MB
Sat Feb 19 19:38:23 2011 linear algebra at 0.0%, ETA 44h56m
Sat Feb 19 19:38:38 2011 checkpointing every 130000 dimensions
Mon Feb 21 09:04:07 2011 lanczos halted after 92679 iterations (dim = 5860478)
[/CODE]

Output file:
[CODE]linear algebra completed 5860370 of 5860705 dimensions (100.0%, ETA 0h 0m)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos error: only trivial dependencies found
lanczos error: only trivial dependencies found
BLanczosTime: 135693
lanczos error: only trivial dependencies found

lanczos error: only trivial dependencies found
BLanczosTime: 135693
commencing square root phase
BLanczosTime: 135693


lanczos error: only trivial dependencies found
BLanczosTime: 135693
commencing square root phase
commencing square root phase
reading relations for dependency 5

BLanczosTime: 135693
lanczos error: only trivial dependencies found
reading relations for dependency 5
commencing square root phase

reading relations for dependency 5
commencing square root phase
lanczos error: only trivial dependencies found
BLanczosTime: 135693
error: read_cycles can't open dependency file
error: read_cycles can't open dependency file
lanczos error: only trivial dependencies found

reading relations for dependency 5
reading relations for dependency 5
BLanczosTime: 135693
--------------------------------------------------------------------------
mpirun.exe has exited due to process rank 8 with PID 20755 on
node bro64 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun.exe (as reported here).
--------------------------------------------------------------------------
error: read_cycles can't open dependency file
commencing square root phase

received signal 15; shutting down

received signal 15; shutting down

received signal 15; shutting down
BLanczosTime: 135693

received signal 15; shutting down

received signal 15; shutting down

received signal 15; shutting down

received signal 15; shutting down

received signal 15; shutting down

received signal 15; shutting down

received signal 15; shutting down

received signal 15; shutting down

received signal 15; shutting down

received signal 15; shutting down

received signal 15; shutting down

received signal 15; shutting down
[/CODE]

I have a .cyc, .mat, .mat.idx, but no .dep file.

Jeff.


All times are UTC.