mersenneforum.org Msieve mpi Difficulty

 2021-06-23, 15:25 #1 EdH     "Ed Hall" Dec 2009 Adirondack Mtns 2×29×71 Posts Msieve mpi Difficulty

I'm running Msieve LA (-nc2) via mpi on one machine with (seemingly) no issue so far; it's a rather large number (201 digits), so I'm keeping an eye on it. I tried to set up a second machine in the same manner, but this one doesn't want to run LA (-nc2) via mpi. I'm going to try with non-mpi Msieve, but wondered if there might be some idea why I can't get the mpi to work. Is it possible this candidate (156 digits) is too small to work well with mpi? The failure message:

Code:
double free or corruption (out)
 of 3816044 dimensions (21.2%, ETA 3h22m)
[math99:03317] *** Process received signal ***
[math99:03317] Signal: Aborted (6)
[math99:03317] Signal code: (-6)
[math99:03317] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f02646623c0]
[math99:03317] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f026437a18b]
[math99:03317] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f0264359859]
[math99:03317] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x903ee)[0x7f02643c43ee]
[math99:03317] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x9847c)[0x7f02643cc47c]
[math99:03317] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x9a120)[0x7f02643ce120]
[math99:03317] [ 6] ./msieve(+0x4a026)[0x55db1c2a0026]
[math99:03317] [ 7] ./msieve(+0x4b80d)[0x55db1c2a180d]
[math99:03317] [ 8] ./msieve(+0x3dc58)[0x55db1c293c58]
[math99:03317] [ 9] ./msieve(+0x18055)[0x55db1c26e055]
[math99:03317] [10] ./msieve(+0x7dde)[0x55db1c25ddde]
[math99:03317] [11] ./msieve(+0x670c)[0x55db1c25c70c]
[math99:03317] [12] ./msieve(+0x5575)[0x55db1c25b575]
[math99:03317] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f026435b0b3]
[math99:03317] [14] ./msieve(+0x633e)[0x55db1c25c33e]
[math99:03317] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code.
Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
received signal 15; shutting down
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node math99 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
$ mpirun --bind-to none -np 2 ./msieve -i comp.n -s comp.dat -l comp.log -nf comp.fb -ncr 2,1 -t 12
error: cannot open matrix checkpoint file
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code.
Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
error: cannot open matrix checkpoint file
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
  Process name: [[41651,1],1]
  Exit code: 255
--------------------------------------------------------------------------

All comments welcome. . .
 2021-06-23, 19:25 #2 frmky     Jul 2003 So Cal 2×3×7×53 Posts It looks like it happened about when the first checkpoint would be expected to be written. Does a msieve.dat.chk or msieve.dat.bak.chk file exist in the directory? Is space available on the drive? Is the error reproducible? Does it work without MPI? The size of the number isn't an issue. I'm working on optimizing an MPI+CUDA implementation right now, and I'm using RSA-120 as my test case. It has a 602082 x 602260 matrix.
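The checklist above can be run as two quick shell commands. This assumes msieve's usual "<savefile>.chk" / "<savefile>.bak.chk" checkpoint naming, with the savefile name comp.dat taken from the -s option in Ed's mpirun line; adjust to the actual savefile in use:

```shell
# Check for LA checkpoint files next to the savefile (comp.dat per "-s comp.dat")
ls -l comp.dat.chk comp.dat.bak.chk 2>/dev/null || echo "no checkpoint files found"

# Confirm free space on the drive holding the matrix
df -h .
```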
 2021-06-23, 20:20 #3 EdH     "Ed Hall" Dec 2009 Adirondack Mtns 10026₈ Posts

Quote:
 Originally Posted by frmky It looks like it happened about when the first checkpoint would be expected to be written. Does a msieve.dat.chk or msieve.dat.bak.chk file exist in the directory? Is space available on the drive? Is the error reproducible? Does it work without MPI? The size of the number isn't an issue. I'm working on optimizing an MPI+CUDA implementation right now, and I'm using RSA-120 as my test case. It has a 602082 x 602260 matrix.

The referenced files are not there.
There is over 400GB of space available on the drive.
The same thing happened a day or two ago with a third machine and a different composite, with the same setup, but I haven't retried that one.
LA with standard Msieve just finished and the SR (square root) phase is underway, so it looks like yes, it works without MPI. However, that was an Msieve binary built without "MPI=1"; I still need to try the "MPI=1" binary without mpirun.

 2021-06-23, 22:17 #4 EdH     "Ed Hall" Dec 2009 Adirondack Mtns 2×29×71 Posts An update: Yes, it's reproducible without running within mpi. I have done a make clean - svn up - remake and am testing the binary now. It did bring in a few updated files. Let's hope they're "the fix." Last fiddled with by EdH on 2021-06-23 at 22:17
 2021-06-24, 02:31 #5 EdH     "Ed Hall" Dec 2009 Adirondack Mtns 2·29·71 Posts The mpi invocation seems to be running fine now*, but I'm going to let it complete just to make sure. I guess the issue was a slightly older Msieve. Sorry for not checking that first. * I am slightly disappointed. The standard Msieve run turned in an estimated time of 4h53m for one process of 12 threads, while the mpi version turned in an estimated time of 4h11m for two processes of 12 threads each. The machine has 12 cores and 24 threads total. I expected better than a savings of only 40 minutes.
 2021-06-26, 06:06 #6 VBCurtis     "Curtis" Feb 2005 Riverside, CA 5075₁₀ Posts Is the machine two 6-core sockets, or a single 12-core socket? If it's a single chip, I suspect a single 24-threaded run will get you a time pretty close to the MPI run. If it's two sockets, then I'm surprised the timings are close; perhaps some taskset-ing gadgetry is needed to keep each MPI instance on a single socket?
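One way to do that taskset-style pinning, sketched here as an untested wrapper script (the name bind_rank.sh is made up for this example; OMPI_COMM_WORLD_RANK is the rank number Open MPI exports to each launched process, and numactl stands in for taskset so memory gets pinned along with the cores):

```shell
# Untested sketch: pin each of two MPI ranks to its own socket.
cat > bind_rank.sh <<'EOF'
#!/bin/sh
rank=${OMPI_COMM_WORLD_RANK:-0}
socket=$(( rank % 2 ))   # two sockets: even ranks -> socket 0, odd -> socket 1
exec numactl --cpunodebind="$socket" --membind="$socket" "$@"
EOF
chmod +x bind_rank.sh

# Launch with mpirun's own binding disabled so the wrapper decides, e.g.:
#   mpirun --bind-to none -np 2 ./bind_rank.sh ./msieve -nc2 -t 12 ...
```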
 2021-06-26, 12:21 #7 EdH     "Ed Hall" Dec 2009 Adirondack Mtns 2·29·71 Posts It's dual Xeon 6c/12t and, through testing with charybdis's help, we determined "--bind-to none" gave the best performance for a similar machine with a different composite.
 2021-06-26, 13:17 #8 charybdis     Apr 2020 224₁₆ Posts For some reason when there are only 2 processes MPI unhelpfully defaults to binding each one to a core. The solution ought to be "--bind-to socket", but when Ed tried this it appeared to assign each process to the same socket rather than giving one to each. "--bind-to none" performed best, but it was only slightly faster than without MPI. Are there any resident MPI experts who know how to make "--bind-to socket" work properly? Can it perhaps be combined with some taskset wizardry?
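A likely explanation for the behavior described above: binding and mapping are separate steps in Open MPI, and for a 2-process job the default *mapping* is by core, so both ranks are placed on socket 0 before "--bind-to socket" ever applies. Requesting socket mapping as well should spread the ranks, and a rankfile pins placement completely. An untested sketch (rankfile syntax follows Open MPI's slot=socket:core-range convention, with the hostname math99 taken from the log above):

```shell
# Map one rank per socket, then bind each rank to its socket:
#   mpirun --map-by socket --bind-to socket -np 2 ./msieve -nc2 -t 12 ...

# Or spell the placement out with a rankfile (6 cores per socket here):
cat > rankfile <<'EOF'
rank 0=math99 slot=0:0-5
rank 1=math99 slot=1:0-5
EOF
#   mpirun -np 2 --rankfile rankfile ./msieve -nc2 -t 12 ...
```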
 2021-07-05, 21:36 #9 EdH     "Ed Hall" Dec 2009 Adirondack Mtns 1016₁₆ Posts After quite a while of just letting everything sit, I thought about this again. Everything I was reading was pointing to the brand Mellanox. Of course, the brand of cards I had acquired was HP. Well, I bought two Mellanox cards and three cables (because they came that way). These actually gave me connectivity between the two machines without much trouble, kind of. I have Infiniband connected between the two machines, but now my Ethernet cluster for ecmpi doesn't work when I have the Infiniband enabled, even though the hostfile still uses the Ethernet addresses. It actually fails trying to use the Infiniband node for the localhost machine. So, I have made some progress and am looking forward to making more as I spend more time "playing."
