mersenneforum.org Running msieve LA with openmpi - do all machines need to be same/similar

 2013-10-16, 20:57 #1 EdH     "Ed Hall" Dec 2009 Adirondack Mtns 2·3·619 Posts Running msieve LA with openmpi - do all machines need to be same/similar If I try to set up openmpi with msieve, will it work across varying hardware, or does it need all the machines to be similar? I have machines that range from a single-core 1.3 GHz P4 through a 2.66 GHz Core(TM)2 Quad. Would the slow machines get the same amount of work as the faster ones, or is there a way to match the matrix portions to the individual machines?
 2013-10-16, 23:59 #2 jasonp Tribal Bullet     Oct 2004 2·29·61 Posts MPI doesn't care about whether the machines are the same or not, but the communication pattern in Msieve's LA does; it will work, but the slowest machine will hold back all the others.
 2013-10-17, 04:15 #3 EdH     "Ed Hall" Dec 2009 Adirondack Mtns 2·3·619 Posts
Quote:
 Originally Posted by jasonp MPI doesn't care about whether the machines are the same or not, but the communication pattern in Msieve's LA does; it will work, but the slowest machine will hold back all the others.
Thanks! Can I overcome this by making the grid resolution small enough? Or, does msieve just make that many threads all at once rather than issuing out segments and waiting for returns?

Maybe I'm seeing this wrong, but won't openmpi allow for issuing tasks such that all "slots" run one process to completion before they accept another? Maybe this would allow for the slower machines to process fewer portions than the faster and balance out in the end? Of course, I suppose the added communication overhead may offset any potential gain.

I have openmpi running on several machines now and will try some experiments as soon as I can make some more time. And, msieve compiled with MPI=1 with no issues.
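For reference, the MPI build is just a Makefile switch; a minimal sketch, assuming the stock msieve Makefile (which is expected to select the mpicc compiler wrapper when MPI=1 is set) and OpenMPI on the PATH:
Code:
# minimal sketch: rebuild msieve with MPI support
# assumes the stock msieve Makefile and an OpenMPI mpicc on the PATH
make clean
make all MPI=1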

Thanks for all...

 2013-10-17, 11:27 #4 jasonp Tribal Bullet     Oct 2004 6722₈ Posts There is no dynamic parallelism here; all the MPI processes are assigned to physical machines at the outset. You can assign more MPI processes to the faster machines and they will do comparatively more work, but if your P4 is 5x slower than the other machines then unbalancing the workload will not correct that. The LA is a tightly-coupled job; all the machines have to frequently synchronize with each other. The LA attempts to give all MPI processes the same amount of work to do, and forcing the faster machines to do 5x as much as the slower ones will make the total time worse, to the point that it would be better to run multithreaded on a single machine. That's the comparison to beat, not making MPI keep your 'cluster' as busy as possible. Also, you get a single binary for all the machines to use, so beware that new CPU instructions are not given to old machines (this has bitten you before). Last fiddled with by jasonp on 2013-10-17 at 11:29
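As a sketch, that kind of uneven assignment would normally be expressed in an OpenMPI hostfile; the hostnames and slot counts below are invented:
Code:
# hypothetical OpenMPI hostfile (hosts.txt):
# more MPI slots on the fast machines, one on the slow P4
quad1 slots=2
quad2 slots=2
p4box slots=1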
 2013-10-17, 12:53 #5 EdH     "Ed Hall" Dec 2009 Adirondack Mtns 7202₈ Posts
Thanks again. I've got two Core(TM)2 Quads that aren't too far apart in speed. Maybe I'll restrict the LA portion to them for now and see how that works out. Actually, due to memory restrictions, they run LA faster on only two cores, so maybe I can set up mpi with two slots each and see how it compares; a sketch of that layout is below.
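A sketch of that two-machine layout, again with invented hostnames (the -nc2 grid later has to match the total process count):
Code:
# hypothetical hostfile: two Core(TM)2 Quads, two LA processes each
quad1 slots=2
quad2 slots=2
# 4 MPI processes total, which pairs with a 2x2 grid (-nc2 2,2)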

Quote:
 Also, you get a single binary for all the machines to use, so beware that new CPU instructions are not given to old machines (this has bitten you before)
Yeah, I've got an AMD machine sitting here, waiting for other work because it doesn't have SSE2.
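A quick sanity check on any Linux box before copying one binary everywhere (standard /proc/cpuinfo, nothing msieve-specific):
Code:
# report whether this CPU advertises SSE2
grep -q sse2 /proc/cpuinfo && echo "SSE2 present" || echo "no SSE2"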

Thanks for all...

 2013-10-17, 22:34 #6 EdH     "Ed Hall" Dec 2009 Adirondack Mtns 2·3·619 Posts Apparently, I'm not in the region to need mpi yet.
Code:
error: MPI size 1 incompatible with 2 x 1 grid
(or 1 x 2, etc.) The only thing that would work was 1 x 1. I did try a rather small set of data from a recent c116. I'll play more later when something larger comes along...
 2013-10-18, 01:12 #7 jasonp Tribal Bullet     Oct 2004 2×29×61 Posts If you are using mpirun, you need to add '-np 2' and also pass '-nc2 1,2' on the Msieve command line. There is no lower bound on the problem size below which MPI is turned off, but you will get silent failures for matrices below 50k in size. If you don't want multithreaded runs, don't pass any '-t' option. Last fiddled with by jasonp on 2013-10-18 at 01:13
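Spelled out as a command sketch (the hostfile and filenames are placeholders; the -np count must equal the product of the -nc2 grid dimensions):
Code:
# 2 MPI processes driving a 1x2 LA grid (2 = 1 x 2)
mpirun -np 2 --hostfile hosts.txt ./msieve -s input.dat -l input.log -nc2 1,2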
 2013-10-19, 15:03 #8 EdH     "Ed Hall" Dec 2009 Adirondack Mtns 7202₈ Posts Well, I appear to be missing something. I thought perhaps I needed to start from scratch with msieve, so I just manually stepped through a c127. I'm having the same results: msieve appears to be locked into a single MPI setting. The log entry for each step says:
Code:
MPI process 0 of 1
Do I need to start earlier in the process to invoke more MPI processes? I performed the following:
Code:
./msieve -i number.ini -np
(used gnfs-lasieve4I14e across several machines to collect and combine relations)
cat number.dat | ./remdups4 200 -v > numberd.dat
./msieve -i number.ini -s numberd.dat -l number.log -nf msieve.fb -t 4 -nc1
./msieve -i number.ini -s numberd.dat -l number.log -nf msieve.fb -nc2 2,2
All I get is:
Code:
error: MPI size 1 incompatible with 2 x 2 grid
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 11.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
I get the same for -nc2 2,1 and -nc2 1,2. The log shows:
Code:
Sat Oct 19 10:11:48 2013
Sat Oct 19 10:11:48 2013
Sat Oct 19 10:11:48 2013  Msieve v. 1.52 (SVN 886M)
Sat Oct 19 10:11:48 2013  random seeds: f2bb119f 236e6081
Sat Oct 19 10:11:48 2013  MPI process 0 of 1
Sat Oct 19 10:11:48 2013  factoring 6784444815212871648987113498975198812582601017150300513612649329477342951718412056224904684878476341466112558732473153528742691 (127 digits)
Sat Oct 19 10:11:49 2013  searching for 15-digit factors
Sat Oct 19 10:11:49 2013  commencing number field sieve (127-digit input)
Sat Oct 19 10:11:49 2013  R0: -6466522433420943924999750
Sat Oct 19 10:11:49 2013  R1: 21859154389253
Sat Oct 19 10:11:49 2013  A0: 141396535680280515764228165941687
Sat Oct 19 10:11:49 2013  A1: -198220819293965173286828926
Sat Oct 19 10:11:49 2013  A2: -2508680094167911717131
Sat Oct 19 10:11:49 2013  A3: -2449278542247386
Sat Oct 19 10:11:49 2013  A4: 3727847252
Sat Oct 19 10:11:49 2013  A5: 600
Sat Oct 19 10:11:49 2013  skew 934828.37, size 3.065e-12, alpha -7.032, combined = 1.147e-10 rroots = 3
Sat Oct 19 10:11:49 2013
Sat Oct 19 10:11:49 2013  commencing linear algebra
And, then the error message above. Sorry that I always seem to have these troubles... Thanks for all.
 2013-10-19, 15:16 #9 fivemack (loop (#_fork))     Feb 2006 Cambridge, England 2⁴·3·7·19 Posts You need to start msieve with 'mpirun -n {number of machines} msieve ...'
 2013-10-19, 15:26 #10 EdH     "Ed Hall" Dec 2009 Adirondack Mtns 2·3·619 Posts
Quote:
 Originally Posted by fivemack You need to start msieve with 'mpirun -n {number of machines} msieve ...'
I think that clears it up. Thanks. I thought msieve would give me a command line for mpi, but I see I need to run the msieve command in the mpirun. Sorry I'm so dense...
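Applied to the c127 steps from post #8, that would look something like this (a sketch; the process count and hostfile depend on the machines actually used):
Code:
# filtering (-nc1) stays an ordinary single-machine run
./msieve -i number.ini -s numberd.dat -l number.log -nf msieve.fb -t 4 -nc1
# only the LA step (-nc2) gets wrapped in mpirun:
# 4 MPI processes to match a 2x2 grid
mpirun -np 4 --hostfile hosts.txt ./msieve -i number.ini -s numberd.dat -l number.log -nf msieve.fb -nc2 2,2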

 2013-10-19, 15:47 #11 EdH     "Ed Hall" Dec 2009 Adirondack Mtns 2·3·619 Posts Thanks! That seems to have gotten it running...
