mersenneforum.org  

Old 2011-02-19, 18:41   #1
Jeff Gilchrist
 
Jun 2003
Ottawa, Canada

3×17×23 Posts
Msieve v1.48 feedback

Running msieve 1.48 on 64-bit Linux, I'm running into a problem restarting the linear algebra, and I'm not sure whether it is msieve or something wrong with the cluster.

Using -nc2 5,5 with 25 CPUs, it started the linear algebra and completed about 89%. Now when I try to restart it with -ncr 5,5, some of the nodes report:
Code:
Sat Feb 19 06:39:51 2011  commencing Lanczos iteration
Sat Feb 19 06:39:51 2011  memory use: 90.7 MB
Sat Feb 19 06:39:55 2011  restarting at iteration 82229 (dim = 5200062)
While others are stuck at this without the "restarting" line:
Code:
Sat Feb 19 06:39:47 2011  commencing Lanczos iteration
Sat Feb 19 06:39:47 2011  memory use: 91.0 MB
So there is never any further output in any of the log files or on stdout; it just "sits" there even though there are 25 processes running, each using 100% of a CPU core. Any idea what is going on or how to correct this?

If I go back to the previous checkpoint file, same thing.

Jeff.
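
The two log excerpts above differ only in whether the "restarting at iteration" line appears. As a minimal sketch of checking each rank's log for that difference, the C helpers below scan a log held in a string. The type and function names are hypothetical and not part of msieve; the only thing taken from the thread is the wording of the two log messages. A rank flagged here may simply not have written its line yet, so treat the result as a hint rather than proof of a hang.

```c
#include <stdio.h>
#include <string.h>

/* Result of scanning one rank's log text (hypothetical helper type). */
typedef struct {
    int started;     /* saw "commencing Lanczos iteration" */
    int restarted;   /* saw "restarting at iteration N (dim = D)" */
    long iteration;  /* N, valid only if restarted is set */
    long dim;        /* D, valid only if restarted is set */
} restart_info;

/* Scan a rank's log, passed as one string, for the two messages quoted above. */
restart_info scan_restart_log(const char *log_text)
{
    restart_info info = {0, 0, 0, 0};
    const char *p;

    if (strstr(log_text, "commencing Lanczos iteration") != NULL)
        info.started = 1;

    p = strstr(log_text, "restarting at iteration ");
    if (p != NULL &&
        sscanf(p, "restarting at iteration %ld (dim = %ld)",
               &info.iteration, &info.dim) == 2)
        info.restarted = 1;

    return info;
}

/* A rank looks stuck if it began the Lanczos phase but never logged a restart. */
int looks_stuck(const char *log_text)
{
    restart_info info = scan_restart_log(log_text);
    return info.started && !info.restarted;
}
```

Run over the per-rank log files after a restart attempt, this would separate the nodes that reached iteration 82229 from the ones that stopped after the "memory use" line.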
Old 2011-02-19, 19:11   #2
Jeff Gilchrist
 
Jun 2003
Ottawa, Canada

3×17×23 Posts

After reverting to version 1.47, it is now working. Something seems to be broken in 1.48 (official release, not the latest SVN).

Jeff

Last fiddled with by Jeff Gilchrist on 2011-02-19 at 19:12
Old 2011-02-19, 22:06   #3
Jeff Gilchrist
 
Jun 2003
Ottawa, Canada

3·17·23 Posts

Uh-oh, it didn't like that:

Code:
linear algebra completed 5860421 of 5860705 dimensions (100.0%, ETA 0h 0m)    
lanczos halted after 92675 iterations (dim = 5860478)
lanczos halted after 92675 iterations (dim = 5860478)
[... several deleted...]
lanczos error: only trivial dependencies found
lanczos error: only trivial dependencies found
BLanczosTime: 8967
lanczos error: only trivial dependencies found
BLanczosTime: 8968
lanczos error: only trivial dependencies found
Is that because I tried to mix 1.47 and 1.48 or is this another issue?
Old 2011-02-19, 22:25   #4
Batalov
 
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

3^6×13 Posts

Full log, please? It could be the gross oversieving symptom (I am shooting in the dark).
Old 2011-02-20, 12:15   #5
Jeff Gilchrist
 
Jun 2003
Ottawa, Canada

2225₈ Posts

Quote:
Originally Posted by Batalov
Full log, please? It could be the gross oversieving symptom (I am shooting in the dark).
Darn, somehow my linear algebra log got flushed when I tried to restart, so I only have the initial filtering part, the start of the LA, and the error from the square root.

Filtering (with relation error reads removed):
See attached

Linear algebra (from my restart):
Code:
Sat Feb 19 19:22:33 2011  
Sat Feb 19 19:22:33 2011  
Sat Feb 19 19:22:33 2011  Msieve v. 1.48
Sat Feb 19 19:22:33 2011  random seeds: 7b2cfd83 3a840734
Sat Feb 19 19:22:33 2011  MPI process 0 of 25
Sat Feb 19 19:22:33 2011  factoring 2514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883 (226 digits)
Sat Feb 19 19:22:35 2011  no P-1/P+1/ECM available, skipping
Sat Feb 19 19:22:35 2011  commencing number field sieve (226-digit input)
Sat Feb 19 19:22:35 2011  R0: -10000000000000000000000000000000000000
Sat Feb 19 19:22:35 2011  R1:  1
Sat Feb 19 19:22:35 2011  A0: -7
Sat Feb 19 19:22:35 2011  A1:  0
Sat Feb 19 19:22:35 2011  A2:  0
Sat Feb 19 19:22:35 2011  A3:  0
Sat Feb 19 19:22:35 2011  A4:  0
Sat Feb 19 19:22:35 2011  A5:  0
Sat Feb 19 19:22:35 2011  A6:  430000
Sat Feb 19 19:22:35 2011  skew 0.16, size 2.179e-11, alpha -1.900, combined = 1.138e-12 rroots = 2
Sat Feb 19 19:22:35 2011  
Sat Feb 19 19:22:35 2011  commencing linear algebra
Sat Feb 19 19:22:35 2011  initialized process (0,0) of 5 x 5 grid
Sat Feb 19 19:22:38 2011  read 5860705 cycles
Sat Feb 19 19:22:54 2011  cycles contain 16443107 unique relations
Sat Feb 19 19:27:05 2011  read 16443107 relations
Sat Feb 19 19:27:48 2011  using 20 quadratic characters above 1073737242
Sat Feb 19 19:29:09 2011  building initial matrix
Sat Feb 19 19:34:22 2011  memory use: 2172.4 MB
Sat Feb 19 19:34:33 2011  read 5860705 cycles
Sat Feb 19 19:34:34 2011  matrix is 5860526 x 5860705 (1952.9 MB) with weight 573681918 (97.89/col)
Sat Feb 19 19:34:34 2011  sparse part has weight 447476324 (76.35/col)
Sat Feb 19 19:35:42 2011  filtering completed in 1 passes
Sat Feb 19 19:35:43 2011  matrix is 5860526 x 5860705 (1952.9 MB) with weight 573681918 (97.89/col)
Sat Feb 19 19:35:43 2011  sparse part has weight 447476324 (76.35/col)
Sat Feb 19 19:36:43 2011  matrix starts at (0, 0)
Sat Feb 19 19:36:43 2011  matrix is 1172198 x 1039795 (122.5 MB) with weight 42759959 (41.12/col)
Sat Feb 19 19:36:43 2011  sparse part has weight 20671663 (19.88/col)
Sat Feb 19 19:36:43 2011  saving the first 48 matrix rows for later
Sat Feb 19 19:36:43 2011  matrix includes 64 packed rows
Sat Feb 19 19:37:36 2011  matrix is 1172150 x 1039795 (106.3 MB) with weight 24404657 (23.47/col)
Sat Feb 19 19:37:36 2011  sparse part has weight 17480889 (16.81/col)
Sat Feb 19 19:37:36 2011  using block size 262144 for processor cache size 6144 kB
Sat Feb 19 19:37:37 2011  commencing Lanczos iteration
Sat Feb 19 19:37:37 2011  memory use: 130.9 MB
Sat Feb 19 19:38:23 2011  linear algebra at 0.0%, ETA 44h56m
Sat Feb 19 19:38:38 2011  checkpointing every 130000 dimensions
.
.
.
<end of log from previous>
.
.
.
linear algebra completed 5860421 of 5860705 dimensions (100.0%, ETA 0h 0m)    
lanczos halted after 92675 iterations (dim = 5860478)
lanczos halted after 92675 iterations (dim = 5860478)
[... several deleted...]
lanczos error: only trivial dependencies found
lanczos error: only trivial dependencies found
BLanczosTime: 8967
lanczos error: only trivial dependencies found
BLanczosTime: 8968
lanczos error: only trivial dependencies found
Attempted square root:
Code:
Sat Feb 19 16:47:36 2011  
Sat Feb 19 16:47:36 2011  
Sat Feb 19 16:47:36 2011  Msieve v. 1.48
Sat Feb 19 16:47:36 2011  random seeds: ab68941b 6b786b14
Sat Feb 19 16:47:36 2011  factoring 2514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883 (226 digits)
Sat Feb 19 16:47:39 2011  no P-1/P+1/ECM available, skipping
Sat Feb 19 16:47:39 2011  commencing number field sieve (226-digit input)
Sat Feb 19 16:47:39 2011  R0: -10000000000000000000000000000000000000
Sat Feb 19 16:47:39 2011  R1:  1
Sat Feb 19 16:47:39 2011  A0: -7
Sat Feb 19 16:47:39 2011  A1:  0
Sat Feb 19 16:47:39 2011  A2:  0
Sat Feb 19 16:47:39 2011  A3:  0
Sat Feb 19 16:47:39 2011  A4:  0
Sat Feb 19 16:47:39 2011  A5:  0
Sat Feb 19 16:47:39 2011  A6:  430000
Sat Feb 19 16:47:39 2011  skew 0.16, size 2.179e-11, alpha -1.900, combined = 1.138e-12 rroots = 2
Sat Feb 19 16:47:39 2011  
Sat Feb 19 16:47:39 2011  commencing square root phase
Sat Feb 19 16:47:39 2011  error: read_cycles can't open dependency file
Attached Files
File Type: zip log.zip (1.8 KB, 170 views)
Old 2011-02-20, 15:37   #6
jasonp
Tribal Bullet
 
Oct 2004

3,541 Posts

The 'only trivial dependencies found' message will appear for all the machines except for process 0, which writes the dependency file. Though in this case process 0 seems to have messed up too :)

Regarding the deadlock, could you build with debug symbols, then build the matrix and restart from a checkpoint, then use 'pstack' on some of the nodes to see where they are hanging? I think the hang point is after the matrix gets read in, so it may be a race in the checkpoint reading code.

AFAIK there aren't any disk format changes between 1.47 and 1.48, although there were a few changes to the MPI lanczos initialization, including the checkpoint reading code.

Last fiddled with by jasonp on 2011-02-20 at 15:52
Old 2011-02-20, 16:18   #7
jasonp
Tribal Bullet
 
Oct 2004

110111010101₂ Posts

Well, there's one typo that should be fixed...does applying the following patch make an LA restart behave better?

Code:
===================================================================
--- common/lanczos/lanczos.c    (revision 541)
+++ common/lanczos/lanczos.c    (working copy)
@@ -628,7 +628,7 @@
 #ifdef HAVE_MPI
        /* push the full-size vectors to the top grid row */

-       if (obj->mpi_ncols > 1) {
+       if (obj->mpi_ncols > 1 && obj->mpi_la_row_rank == 0) {
                MPI_TRY(MPI_Scatterv(x, packed_matrix->col_counts,
                                packed_matrix->col_offsets,
                                MPI_LONG_LONG, x, n, MPI_LONG_LONG, 0,
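
To see what the extra condition changes, here is a minimal sketch that only counts which processes of a grid would enter the scatter under each version of the test. It is not msieve code and makes no MPI calls. The two field names come from the patch; the struct, the function names, the row-major rank layout, and the reading of mpi_la_row_rank as "grid row of this process" are assumptions made for illustration.

```c
/* Illustration only: not msieve source. Field names follow the patch;
   everything else here is hypothetical. */
typedef struct {
    int mpi_ncols;        /* number of columns in the process grid */
    int mpi_la_row_rank;  /* grid row of this process, 0 = top row (assumed) */
} grid_pos;

/* Assumption: MPI ranks are laid out row-major across the grid. */
grid_pos grid_pos_of(int rank, int ncols)
{
    grid_pos p;
    p.mpi_ncols = ncols;
    p.mpi_la_row_rank = rank / ncols;
    return p;
}

/* Condition before the patch: every process on a multi-column grid. */
int scatters_unpatched(grid_pos p)
{
    return p.mpi_ncols > 1;
}

/* Condition after the patch: only processes in the top grid row. */
int scatters_patched(grid_pos p)
{
    return p.mpi_ncols > 1 && p.mpi_la_row_rank == 0;
}

/* Count the processes that would enter the scatter on an nrows x ncols grid. */
int count_scattering(int nrows, int ncols, int patched)
{
    int rank, count = 0;

    for (rank = 0; rank < nrows * ncols; rank++) {
        grid_pos p = grid_pos_of(rank, ncols);
        count += patched ? scatters_patched(p) : scatters_unpatched(p);
    }
    return count;
}
```

For the 5 x 5 grid used in this thread, count_scattering(5, 5, 0) gives 25 and count_scattering(5, 5, 1) gives 5, in line with the patch comment about pushing the full-size vectors to the top grid row only.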
Old 2011-02-20, 20:27   #8
Jeff Gilchrist
 
Jun 2003
Ottawa, Canada

2225₈ Posts

Quote:
Originally Posted by jasonp
Well, there's one typo that should be fixed...does applying the following patch make an LA restart behave better?
Yes, that fixes the restart. Yesterday I started again with -nc2; with the old binary I can reproduce -ncr failing every time, and with the fix it restarts every time.

Unfortunately for me, I forgot to save the .cyc and .mat files before I restarted with -nc2 (the -nc1 was not redone). Can I still use an old checkpoint file (at 89%), or would everything be all messed up?

I will let you know when it finishes the second time if the last step actually works.

Jeff.
Old 2011-02-21, 02:15   #9
jasonp
Tribal Bullet
 
Oct 2004

6725₈ Posts

I'm not sure you can restart from an old checkpoint; if the restart code has a race then it's possible you previously restarted from a checkpoint and corrupted the result. But 11% of your run sounds like less than overnight, so it's worth a shot.

Sorry for the churn...

PS: If you don't have the .cyc file that built the matrix then you'll have to start over. The square root will need it even if the LA did not, and I can't guarantee that you'll get the same matrix even from the same initial dataset.
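
The "less than overnight" estimate can be checked with rough arithmetic. The sketch below takes the 44h56m ETA printed at the start of the linear algebra log earlier in the thread and the 89% checkpoint mentioned in the first post. Both helper names are hypothetical, and the initial ETA is only the program's early estimate, so the result is a ballpark figure.

```c
/* Convert an ETA printed as hours and minutes (e.g. 44h56m) to minutes. */
long eta_to_minutes(long hours, long minutes)
{
    return hours * 60 + minutes;
}

/* Minutes of work left, given a whole-run estimate and the percent completed.
   Integer arithmetic; the result is truncated toward zero. */
long remaining_minutes(long total_minutes, int percent_done)
{
    return total_minutes * (100 - percent_done) / 100;
}
```

With these inputs, eta_to_minutes(44, 56) is 2696 and remaining_minutes(2696, 89) is 296, i.e. a little under five hours for the last 11%.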

Last fiddled with by jasonp on 2011-02-21 at 02:19
Old 2011-02-21, 11:09   #10
Jeff Gilchrist
 
Jun 2003
Ottawa, Canada

3×17×23 Posts

Quote:
Originally Posted by jasonp
I'm not sure you can restart from an old checkpoint; if the restart code has a race then it's possible you previously restarted from a checkpoint and corrupted the result. But 11% of your run sounds like less than overnight, so it's worth a shot.

Sorry for the churn...

PS: If you don't have the .cyc file that built the matrix then you'll have to start over. The square root will need it even if the LA did not, and I can't guarantee that you'll get the same matrix even from the same initial dataset.
I tried it just to see, assuming that it wouldn't work, and it failed with a corruption message. That is OK; my fault for not saving the file. I'm down to a 3.5h ETA to finish now, so no big deal. Should be interesting to see what happens after that. At least I will have the full log this time.

Jeff.
Old 2011-02-21, 21:01   #11
Jeff Gilchrist
 
Jun 2003
Ottawa, Canada

1173₁₀ Posts

Ok, while restarting works, it failed again at the LA stage. Here is what is in the log. You already have the filtering stage log posted previously.

Rank 0 log:
Code:
Sat Feb 19 19:22:33 2011  
Sat Feb 19 19:22:33 2011  
Sat Feb 19 19:22:33 2011  Msieve v. 1.48
Sat Feb 19 19:22:33 2011  random seeds: 7b2cfd83 3a840734
Sat Feb 19 19:22:33 2011  MPI process 0 of 25
Sat Feb 19 19:22:33 2011  factoring 2514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883040935672514619883 (226 digits)
Sat Feb 19 19:22:35 2011  no P-1/P+1/ECM available, skipping
Sat Feb 19 19:22:35 2011  commencing number field sieve (226-digit input)
Sat Feb 19 19:22:35 2011  R0: -10000000000000000000000000000000000000
Sat Feb 19 19:22:35 2011  R1:  1
Sat Feb 19 19:22:35 2011  A0: -7
Sat Feb 19 19:22:35 2011  A1:  0
Sat Feb 19 19:22:35 2011  A2:  0
Sat Feb 19 19:22:35 2011  A3:  0
Sat Feb 19 19:22:35 2011  A4:  0
Sat Feb 19 19:22:35 2011  A5:  0
Sat Feb 19 19:22:35 2011  A6:  430000
Sat Feb 19 19:22:35 2011  skew 0.16, size 2.179e-11, alpha -1.900, combined = 1.138e-12 rroots = 2
Sat Feb 19 19:22:35 2011  
Sat Feb 19 19:22:35 2011  commencing linear algebra
Sat Feb 19 19:22:35 2011  initialized process (0,0) of 5 x 5 grid
Sat Feb 19 19:22:38 2011  read 5860705 cycles
Sat Feb 19 19:22:54 2011  cycles contain 16443107 unique relations
Sat Feb 19 19:27:05 2011  read 16443107 relations
Sat Feb 19 19:27:48 2011  using 20 quadratic characters above 1073737242
Sat Feb 19 19:29:09 2011  building initial matrix
Sat Feb 19 19:34:22 2011  memory use: 2172.4 MB
Sat Feb 19 19:34:33 2011  read 5860705 cycles
Sat Feb 19 19:34:34 2011  matrix is 5860526 x 5860705 (1952.9 MB) with weight 573681918 (97.89/col)
Sat Feb 19 19:34:34 2011  sparse part has weight 447476324 (76.35/col)
Sat Feb 19 19:35:42 2011  filtering completed in 1 passes
Sat Feb 19 19:35:43 2011  matrix is 5860526 x 5860705 (1952.9 MB) with weight 573681918 (97.89/col)
Sat Feb 19 19:35:43 2011  sparse part has weight 447476324 (76.35/col)
Sat Feb 19 19:36:43 2011  matrix starts at (0, 0)
Sat Feb 19 19:36:43 2011  matrix is 1172198 x 1039795 (122.5 MB) with weight 42759959 (41.12/col)
Sat Feb 19 19:36:43 2011  sparse part has weight 20671663 (19.88/col)
Sat Feb 19 19:36:43 2011  saving the first 48 matrix rows for later
Sat Feb 19 19:36:43 2011  matrix includes 64 packed rows
Sat Feb 19 19:37:36 2011  matrix is 1172150 x 1039795 (106.3 MB) with weight 24404657 (23.47/col)
Sat Feb 19 19:37:36 2011  sparse part has weight 17480889 (16.81/col)
Sat Feb 19 19:37:36 2011  using block size 262144 for processor cache size 6144 kB
Sat Feb 19 19:37:37 2011  commencing Lanczos iteration
Sat Feb 19 19:37:37 2011  memory use: 130.9 MB
Sat Feb 19 19:38:23 2011  linear algebra at 0.0%, ETA 44h56m
Sat Feb 19 19:38:38 2011  checkpointing every 130000 dimensions
Mon Feb 21 09:04:07 2011  lanczos halted after 92679 iterations (dim = 5860478)
Output file:
Code:
linear algebra completed 5860370 of 5860705 dimensions (100.0%, ETA 0h 0m)    
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos halted after 92679 iterations (dim = 5860478)
lanczos error: only trivial dependencies found
lanczos error: only trivial dependencies found
BLanczosTime: 135693
lanczos error: only trivial dependencies found

lanczos error: only trivial dependencies found
BLanczosTime: 135693
commencing square root phase
BLanczosTime: 135693


lanczos error: only trivial dependencies found
BLanczosTime: 135693
commencing square root phase
commencing square root phase
reading relations for dependency 5

BLanczosTime: 135693
lanczos error: only trivial dependencies found
reading relations for dependency 5
commencing square root phase

reading relations for dependency 5
commencing square root phase
lanczos error: only trivial dependencies found
BLanczosTime: 135693
error: read_cycles can't open dependency file
error: read_cycles can't open dependency file
lanczos error: only trivial dependencies found

reading relations for dependency 5
reading relations for dependency 5
BLanczosTime: 135693
--------------------------------------------------------------------------
mpirun.exe has exited due to process rank 8 with PID 20755 on
node bro64 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun.exe (as reported here).
--------------------------------------------------------------------------
error: read_cycles can't open dependency file
commencing square root phase

received signal 15; shutting down

received signal 15; shutting down

received signal 15; shutting down
BLanczosTime: 135693

received signal 15; shutting down

received signal 15; shutting down

received signal 15; shutting down

received signal 15; shutting down

received signal 15; shutting down

received signal 15; shutting down

received signal 15; shutting down

received signal 15; shutting down

received signal 15; shutting down

received signal 15; shutting down

received signal 15; shutting down

received signal 15; shutting down
I have a .cyc, .mat, .mat.idx, but no .dep file.

Jeff.
All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.