2021-06-23, 15:25   #1
EdH ("Ed Hall")

Msieve mpi Difficulty

I'm running Msieve LA (-nc2) via mpi on one machine with (seemingly) no issues so far. It's a rather large number (201 digits), so I'm keeping an eye on it.

I tried to set up a second machine in the same manner, but this one doesn't want to run LA (-nc2) via mpi. I'm going to try with non-mpi Msieve, but wondered if anyone has an idea why I can't get the mpi run to work. Is it possible this candidate (156 digits) is too small to work well with mpi?
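
For reference, the invocation I'm using is roughly the following (a sketch; the initial "-nc2 2,1" form is inferred from the "-ncr 2,1" restart command in the log below):
Code:
mpirun --bind-to none -np 2 ./msieve -i comp.n -s comp.dat -l comp.log -nf comp.fb -nc2 2,1 -t 12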

The failure message:
Code:
double free or corruption (out) of 3816044 dimensions (21.2%, ETA 3h22m)     
[math99:03317] *** Process received signal *** 
[math99:03317] Signal: Aborted (6) 
[math99:03317] Signal code:  (-6) 
[math99:03317] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f02646623c0] 
[math99:03317] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f026437a18b] 
[math99:03317] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f0264359859] 
[math99:03317] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x903ee)[0x7f02643c43ee] 
[math99:03317] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x9847c)[0x7f02643cc47c] 
[math99:03317] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x9a120)[0x7f02643ce120] 
[math99:03317] [ 6] ./msieve(+0x4a026)[0x55db1c2a0026] 
[math99:03317] [ 7] ./msieve(+0x4b80d)[0x55db1c2a180d] 
[math99:03317] [ 8] ./msieve(+0x3dc58)[0x55db1c293c58] 
[math99:03317] [ 9] ./msieve(+0x18055)[0x55db1c26e055] 
[math99:03317] [10] ./msieve(+0x7dde)[0x55db1c25ddde] 
[math99:03317] [11] ./msieve(+0x670c)[0x55db1c25c70c] 
[math99:03317] [12] ./msieve(+0x5575)[0x55db1c25b575] 
[math99:03317] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f026435b0b3] 
[math99:03317] [14] ./msieve(+0x633e)[0x55db1c25c33e] 
[math99:03317] *** End of error message *** 
-------------------------------------------------------------------------- 
Primary job  terminated normally, but 1 process returned 
a non-zero exit code. Per user-direction, the job has been aborted. 
-------------------------------------------------------------------------- 
 
received signal 15; shutting down 
-------------------------------------------------------------------------- 
mpirun noticed that process rank 0 with PID 0 on node math99 exited on signal 6 (Aborted). 
-------------------------------------------------------------------------- 
 
$ mpirun --bind-to none -np 2 ./msieve -i comp.n -s comp.dat -l comp.log -nf comp.fb -ncr 2,1 -t 12 
error: cannot open matrix checkpoint file 
-------------------------------------------------------------------------- 
Primary job  terminated normally, but 1 process returned 
a non-zero exit code. Per user-direction, the job has been aborted. 
-------------------------------------------------------------------------- 
error: cannot open matrix checkpoint file 
-------------------------------------------------------------------------- 
mpirun detected that one or more processes exited with non-zero status, thus causing 
the job to be terminated. The first process to do so was: 
 
  Process name: [[41651,1],1] 
  Exit code:    255 
--------------------------------------------------------------------------
All comments welcome. . .
2021-06-23, 19:25   #2
frmky

It looks like it happened about when the first checkpoint would be expected to be written. Does a msieve.dat.chk or msieve.dat.bak.chk file exist in the directory?
Is space available on the drive?
Is the error reproducible?
Does it work without MPI?
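
The first two are quick to check from the working directory (a sketch; checkpoint names follow the -s savefile, so here they would likely be comp.dat.chk and comp.dat.bak.chk):
Code:
ls -l *.chk   # any matrix checkpoint files present?
df -h .       # free space on the drive holding the .dat files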

The size of the number isn't an issue. I'm working on optimizing an MPI+CUDA implementation right now, and I'm using RSA-120 as my test case. It has a 602082 x 602260 matrix.
2021-06-23, 20:20   #3
EdH

Quote: Originally Posted by frmky
It looks like it happened about when the first checkpoint would be expected to be written. Does a msieve.dat.chk or msieve.dat.bak.chk file exist in the directory?
Is space available on the drive?
Is the error reproducible?
Does it work without MPI?

The size of the number isn't an issue. I'm working on optimizing an MPI+CUDA implementation right now, and I'm using RSA-120 as my test case. It has a 602082 x 602260 matrix.
The referenced files are not there.
There is over 400GB of space available on the drive.
I had the same thing happen a day or two ago with a third machine and a different composite, with the same setup, but I haven't tried this one again.
LA with standard Msieve just finished and the SR (square root) phase is underway, so it looks like yes, it works without MPI. However, I'm using an Msieve binary that was built without "MPI=1"; I still need to check the "MPI=1"-built binary run without MPI.
2021-06-23, 22:17   #4
EdH

An update:
Yes, it's reproducible, even without running within mpi. I did a make clean, svn up, and rebuild, and am testing the new binary now. The update did bring in a few changed files. Let's hope they're "the fix."
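
In shell form, that rebuild sequence is roughly (a sketch; "make all MPI=1" is an assumption based on the MPI-enabled binary discussed above):
Code:
make clean
svn up            # pulled in a few updated source files
make all MPI=1    # rebuild the MPI-enabled binary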

2021-06-24, 02:31   #5
EdH

The mpi invocation seems to be running fine now*, but I'm going to let it complete just to make sure. I guess the issue was a slightly older Msieve. Sorry for not checking that first.

* I am slightly disappointed. The standard Msieve run turned in an estimated time of 4h53m for one process of 12 threads, while the mpi version turned in an estimated time of 4h11m for two processes of 12 threads each. The machine has 12 cores and 24 threads total. I expected better than a savings of only about 40 minutes.
2021-06-26, 06:06   #6
VBCurtis

Is the machine two 6-core sockets, or a single 12-core socket? If it's a single chip, I suspect a single 24-threaded run will get you a time pretty close to the MPI run.
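
Something like this, using Ed's filenames (a sketch):
Code:
./msieve -i comp.n -s comp.dat -l comp.log -nf comp.fb -nc2 -t 24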

If it's two sockets, then I'm surprised the timings are close; perhaps some taskset-ing gadgetry is needed to keep each MPI instance on a single socket?
2021-06-26, 12:21   #7
EdH

It's a dual-Xeon 6c/12t machine, and through testing with charybdis's help, we determined that "--bind-to none" gave the best performance for a similar machine with a different composite.
2021-06-26, 13:17   #8
charybdis

For some reason, when there are only 2 processes, MPI unhelpfully defaults to binding each one to a single core. The solution ought to be "--bind-to socket", but when Ed tried this it appeared to assign both processes to the same socket rather than giving one to each. "--bind-to none" performed best, but it was only slightly faster than without MPI.

Are there any resident MPI experts who know how to make "--bind-to socket" work properly? Can it perhaps be combined with some taskset wizardry?
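
For concreteness, two untested sketches (the ppr mapping is OpenMPI syntax; the core lists assume a dual 6c/12t layout with hyperthread siblings offset by 12):
Code:
# map exactly one rank to each socket, then bind it there
mpirun -np 2 --map-by ppr:1:socket --bind-to socket ./msieve -i comp.n -s comp.dat -l comp.log -nf comp.fb -nc2 2,1 -t 12

# or hand-pin each rank with taskset via a wrapper, pin.sh (hypothetical):
#   [ "$OMPI_COMM_WORLD_RANK" = 0 ] && CORES=0-5,12-17 || CORES=6-11,18-23
#   exec taskset -c $CORES ./msieve -i comp.n -s comp.dat -l comp.log -nf comp.fb -nc2 2,1 -t 12
mpirun -np 2 --bind-to none ./pin.sh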
2021-07-05, 21:36   #9
EdH

After quite a while of just letting everything sit, I thought about this again. Everything I was reading pointed to the brand Mellanox. Of course, the brand of cards I had acquired was HP. Well, I bought two Mellanox cards and three cables (because they came that way). These actually gave me connectivity between the two machines without much trouble, kind of. I have Infiniband connected between the two machines, but now my Ethernet cluster for ecmpi doesn't work when I have the Infiniband enabled, even though the hostfile still uses the Ethernet addresses. It actually fails trying to use the Infiniband node for the localhost machine.
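
Next I plan to try restricting OpenMPI's TCP traffic to the Ethernet device (a hedged sketch; "eth0" and the hostfile/binary names are placeholders for my actual setup):
Code:
mpirun --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 \
       -np 2 -hostfile hosts ./ecmpi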

So, I have made some progress and am looking forward to making more as I spend more time "playing."