#1
"Ed Hall"
Dec 2009
Adirondack Mtns
2³×653 Posts
I'm running Msieve LA (-nc2) via mpi on a machine with (seemingly) no issues so far. It's a rather large number (201 digits), though, so I'm keeping a concerned eye on it.
I tried to set up a second machine in the same manner, but this one doesn't want to run LA (-nc2) via mpi. I'm going to try with non-mpi Msieve, but I wondered whether anyone might have an idea why I can't get mpi to work. Is it possible this candidate (156 digits) is too small to work well with mpi? The failure message:
Code:
double free or corruption (out) of 3816044 dimensions (21.2%, ETA 3h22m)
[math99:03317] *** Process received signal ***
[math99:03317] Signal: Aborted (6)
[math99:03317] Signal code: (-6)
[math99:03317] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f02646623c0]
[math99:03317] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f026437a18b]
[math99:03317] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f0264359859]
[math99:03317] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x903ee)[0x7f02643c43ee]
[math99:03317] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x9847c)[0x7f02643cc47c]
[math99:03317] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x9a120)[0x7f02643ce120]
[math99:03317] [ 6] ./msieve(+0x4a026)[0x55db1c2a0026]
[math99:03317] [ 7] ./msieve(+0x4b80d)[0x55db1c2a180d]
[math99:03317] [ 8] ./msieve(+0x3dc58)[0x55db1c293c58]
[math99:03317] [ 9] ./msieve(+0x18055)[0x55db1c26e055]
[math99:03317] [10] ./msieve(+0x7dde)[0x55db1c25ddde]
[math99:03317] [11] ./msieve(+0x670c)[0x55db1c25c70c]
[math99:03317] [12] ./msieve(+0x5575)[0x55db1c25b575]
[math99:03317] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f026435b0b3]
[math99:03317] [14] ./msieve(+0x633e)[0x55db1c25c33e]
[math99:03317] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code.
Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
received signal 15; shutting down
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node math99 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

$ mpirun --bind-to none -np 2 ./msieve -i comp.n -s comp.dat -l comp.log -nf comp.fb -ncr 2,1 -t 12
error: cannot open matrix checkpoint file
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code.
Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
error: cannot open matrix checkpoint file
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[41651,1],1]
  Exit code:    255
--------------------------------------------------------------------------
#2
Jul 2003
So Cal
3×5×173 Posts
It looks like it happened about when the first checkpoint would be expected to be written. Does an msieve.dat.chk or msieve.dat.bak.chk file exist in the directory?
Is space available on the drive? Is the error reproducible? Does it work without MPI? The size of the number isn't an issue. I'm working on optimizing an MPI+CUDA implementation right now, and I'm using RSA-120 as my test case. It has a 602082 x 602260 matrix.
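For reference, a quick way to check both of those from the job's working directory might be something like the commands below. The comp.dat prefix is only an assumption taken from the log above (the default prefix is msieve.dat):
Code:
# look for linear algebra checkpoint files (names assumed from the -s comp.dat prefix above)
ls -l comp.dat.chk comp.dat.bak.chk
# confirm free space on the filesystem holding the .dat files
df -h .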
#3
"Ed Hall"
Dec 2009
Adirondack Mtns
12150₈ Posts
There is over 400GB of space available on the drive. The same thing happened a day or two ago with a third machine and a different composite, with the same setup, but I haven't tried that one again. LA with standard Msieve just finished and the SR phase is underway, so it looks like, yes, it works without MPI. However, I'm using an Msieve binary that was built without "MPI=1." I still need to check the "MPI=1"-built binary without mpi.
#4
"Ed Hall"
Dec 2009
Adirondack Mtns
12150₈ Posts
An update:
Yes, it's reproducible even without running under mpi (the "MPI=1" binary fails when run directly, too). I have done a make clean, svn up, and remake, and am testing the new binary now. The update did bring in a few changed files. Let's hope they're "the fix."
Last fiddled with by EdH on 2021-06-23 at 22:17
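For anyone following along, the update-and-rebuild sequence described above is roughly the sketch below; the make target and the MPI=1 flag are taken from elsewhere in this thread and may need adjusting for a particular setup:
Code:
# refresh the svn working copy and rebuild (sketch only)
make clean
svn up
make all MPI=1    # MPI-enabled build, as mentioned earlier in the thread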
#5
"Ed Hall"
Dec 2009
Adirondack Mtns
2³×653 Posts
The mpi invocation seems to be running fine now*, but I'm going to let it complete just to make sure. I guess the issue was a slightly older Msieve. Sorry for not checking that first.
* I am slightly disappointed. The standard Msieve run turned in an estimated time of 4h53m for one process of 12 threads, while the mpi version turned in an estimated time of 4h11m for two processes of 12 threads each. The machine has 12 cores and 24 threads total. I expected better than a savings of only about 40 minutes.
#6
"Curtis"
Feb 2005
Riverside, CA
1010111111100₂ Posts
Is the machine two 6-core sockets, or a single 12-core socket? If it's a single chip, I suspect a single 24-threaded run will get you a time pretty close to the MPI run.
If it's two sockets, then I'm surprised the timings are close; perhaps some taskset-ing gadgetry is needed to keep each MPI instance on a single socket?
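One sketch of that gadgetry, assuming Open MPI (which exports OMPI_COMM_WORLD_RANK to each process) and assuming cores 0-5 sit on socket 0 and cores 6-11 on socket 1 (the numbering is an assumption; check it with lscpu first), is a small wrapper script:
Code:
#!/bin/bash
# pin_rank.sh (hypothetical name): pin each MPI rank to one socket, then run the real command.
# ASSUMPTIONS: Open MPI sets OMPI_COMM_WORLD_RANK; cores 0-5 are on socket 0, cores 6-11 on socket 1.
# With hyperthreading, the sibling logical CPUs (e.g. 12-17 and 18-23) could be appended to each list.
if [ "${OMPI_COMM_WORLD_RANK:-0}" -eq 0 ]; then
    exec taskset -c 0-5 "$@"
else
    exec taskset -c 6-11 "$@"
fi
It would be launched as, e.g., mpirun --bind-to none -np 2 ./pin_rank.sh ./msieve -nc2 2,1 -t 12 so that MPI's own binding doesn't fight the taskset masks.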
#7
"Ed Hall"
Dec 2009
Adirondack Mtns
2³×653 Posts
It's a dual Xeon 6c/12t machine, and through testing with charybdis' help we determined that "--bind-to none" gave the best performance for a similar machine with a different composite.
#8
Apr 2020
13·71 Posts
For some reason, when there are only 2 processes, MPI unhelpfully defaults to binding each one to a single core. The solution ought to be "--bind-to socket", but when Ed tried this it appeared to assign both processes to the same socket rather than giving one to each. "--bind-to none" performed best, but it was only slightly faster than running without MPI.
Are there any resident MPI experts who know how to make "--bind-to socket" work properly? Can it perhaps be combined with some taskset wizardry?
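For anyone experimenting: in Open MPI the mapping policy and the binding policy are separate options, so one thing worth trying is mapping one rank per socket explicitly and having mpirun print where each rank actually lands. The flags below are standard Open MPI options; the msieve arguments are just the ones used earlier in the thread:
Code:
# map one rank per socket, bind each rank to its socket, and report the resulting bindings
mpirun --map-by socket --bind-to socket --report-bindings -np 2 ./msieve -nc2 2,1 -t 12
# a more explicit variant: exactly one process per socket
mpirun --map-by ppr:1:socket --bind-to socket --report-bindings -np 2 ./msieve -nc2 2,1 -t 12
At minimum, --report-bindings would show whether both ranks really are landing on the same socket.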
#9
"Ed Hall"
Dec 2009
Adirondack Mtns
2³·653 Posts
I upgraded my dual 6c/12t machine to dual 10c/20t processors and am revisiting mpi with it. Per earlier testing, I found the following to be the optimum command at the time:
Code:
mpirun --bind-to none -np 2 ../../msieve -nc2 2,1 -t 12
On the upgraded machine, the timings so far are:
Code:
$ mpirun --bind-to none -np 2 ../../msieve -nc2 2,1 -t 20
ETA 7h 3m
Code:
$ mpirun --bind-to socket -np 2 ../../msieve -nc2 2,1 -t 20
ETA 9h34m
Code:
$ ../../msieve -nc2 -t 40
ETA 6h16m
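For interpreting which logical CPUs share a socket on the new box, the layout can be confirmed with standard tools (nothing msieve-specific):
Code:
# sockets, cores per socket, and threads per core
lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core'
# which logical CPUs belong to which NUMA node (typically one node per socket here)
numactl --hardware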
#10
"Curtis"
Feb 2005
Riverside, CA
5628₁₀ Posts
I thought msieve had a 32-thread max for each instance... are you sure the last run was actually 40-threaded? msieve may have quietly reverted to 32 threads.
Maybe hyperthreading helps less on these 10-core chips than it did on the previous 6-cores? What happens if you change that -t 20 to -t 10 in the first invocation?
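A quick way to check that cap against a local source checkout, assuming one is handy (the defining header isn't named here, so just search the tree):
Code:
# search the msieve source tree for the thread-count cap
grep -rn "MAX_THREADS" .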
#11
Apr 2020
1633₈ Posts
Code:
#define MAX_THREADS 32