2017-01-03, 18:51  #1
"Ed Hall"
Dec 2009
Adirondack Mtns
2×11×167 Posts 
More openmpi questions...
I am having trouble getting msieve threads to pass on into the openmpi processes, especially -t 4. All of the following use the same instance of MPI-aware msieve. All these observations were made after waiting for the LA to get a bit past the ETA messages.
If I run msieve with -t 4, without calling it via mpirun, using top, I can see one instance of msieve using ~350% of my quad core CPU:
Code:
../msieve/msieve -i number.ini -s number.dat -l number.log -nf number.fb -t 4 -nc2
Code:
mpirun -np 1 --hostfile ./mpi_hosts111 ../msieve/msieve -i number.ini -s number.dat -l number.log -nf number.fb -t 4 -nc2 1,1
Code:
mpirun -np 2 --hostfile ./mpi_hosts221 ../msieve/msieve -i number.ini -s number.dat -l number.log -nf number.fb -t 2 -nc2 2,1
Code:
mpirun -np 4 --hostfile ./mpi_hosts221 ../msieve/msieve -i number.ini -s number.dat -l number.log -nf number.fb -t 2 -nc2 4,1
Code:
mpirun -np 2 --hostfile ./mpi_hosts111 ../msieve/msieve -i number.ini -s number.dat -l number.log -nf number.fb -t 4 -nc2 2,1
It appears that -t 2 works with multiple machines, but -t 4 will not work at all via mpi. Any thoughts on the above observations are welcome...
2017-01-05, 00:14  #2
Tribal Bullet
Oct 2004
3536_{10} Posts 
How big a matrix is this? I would expect to need a matrix size above maybe 2M to expect a speedup from MPI, especially with multiple threads. If your machines are still connected with gigabit then they still will spend a lot of time waiting for data transfers.

2017-01-05, 03:12  #3
"Ed Hall"
Dec 2009
Adirondack Mtns
2·11·167 Posts 
I think my matrix is just over 4M, if I'm reading it right. I was thinking if I could increase threads and reduce mpi processes, I could decrease data transfers, but I might be thinking this backwards. Practice appears to show that -t 2 and three machines is optimum for my setup. Yes, I'm still on Gigabit. Adding a third machine does reduce time, so I was thinking that meant the Gigabit wasn't saturated by the first two. But memory transfer might be my issue. Even though I'm not filling it, there may not be enough bandwidth, perhaps?
If this is helpful:
Code:
mpirun -np 2 --hostfile ./mpi_hosts111 ../msieve/msieve -i number.ini -s number.dat -l number.log -nf number.fb -t 4 -nc2 2,1
Here are the two logs:
Code:
Wed Jan 4 21:26:34 2017 Msieve v. 1.53 (SVN 993)
Wed Jan 4 21:26:34 2017 random seeds: 99435217 8a405357
Wed Jan 4 21:26:34 2017 MPI process 0 of 2
Wed Jan 4 21:26:34 2017 factoring 820542702287058139583300542461757119495935711084069870517652403589147165539358552360109600961345804476958004926044416408854122278694458926677 (141 digits)
Wed Jan 4 21:26:35 2017 searching for 15-digit factors
Wed Jan 4 21:26:36 2017 commencing number field sieve (141-digit input)
Wed Jan 4 21:26:36 2017 R0: 8068224187348260061731767540
Wed Jan 4 21:26:36 2017 R1: 7392072149387
Wed Jan 4 21:26:36 2017 A0: 20423341607513397403579630437539211
Wed Jan 4 21:26:36 2017 A1: 529896443757128435451449388665
Wed Jan 4 21:26:36 2017 A2: 137820022314972661814868
Wed Jan 4 21:26:36 2017 A3: 11802326769047736
Wed Jan 4 21:26:36 2017 A4: 9623375962
Wed Jan 4 21:26:36 2017 A5: 24
Wed Jan 4 21:26:36 2017 skew 6219102.56, size 7.941e-14, alpha 5.813, combined = 1.417e-11 rroots = 3
Wed Jan 4 21:26:36 2017
Wed Jan 4 21:26:36 2017 commencing linear algebra
Wed Jan 4 21:26:36 2017 initialized process (0,0) of 2 x 1 grid
Wed Jan 4 21:26:36 2017 read 2124888 cycles
Wed Jan 4 21:26:40 2017 cycles contain 6333818 unique relations
Wed Jan 4 21:27:50 2017 read 6333818 relations
Wed Jan 4 21:27:58 2017 using 20 quadratic characters above 4294917295
Wed Jan 4 21:28:33 2017 building initial matrix
Wed Jan 4 21:30:01 2017 memory use: 851.5 MB
Wed Jan 4 21:30:03 2017 read 2124888 cycles
Wed Jan 4 21:30:04 2017 matrix is 2124709 x 2124888 (638.7 MB) with weight 201591455 (94.87/col)
Wed Jan 4 21:30:04 2017 sparse part has weight 144046076 (67.79/col)
Wed Jan 4 21:30:19 2017 filtering completed in 1 passes
Wed Jan 4 21:30:20 2017 matrix is 2124709 x 2124888 (638.7 MB) with weight 201591455 (94.87/col)
Wed Jan 4 21:30:20 2017 sparse part has weight 144046076 (67.79/col)
Wed Jan 4 21:30:38 2017 matrix starts at (0, 0)
Wed Jan 4 21:30:38 2017 matrix is 1062411 x 2124888 (370.2 MB) with weight 131215183 (61.75/col)
Wed Jan 4 21:30:38 2017 sparse part has weight 73669804 (34.67/col)
Wed Jan 4 21:30:38 2017 saving the first 48 matrix rows for later
Wed Jan 4 21:30:39 2017 matrix includes 64 packed rows
Wed Jan 4 21:30:39 2017 matrix is 1062363 x 2124888 (350.4 MB) with weight 90149514 (42.43/col)
Wed Jan 4 21:30:39 2017 sparse part has weight 70607640 (33.23/col)
Wed Jan 4 21:30:39 2017 using block size 8192 and superblock size 196608 for processor cache size 2048 kB
Wed Jan 4 21:30:44 2017 commencing Lanczos iteration (4 threads)
Wed Jan 4 21:30:44 2017 memory use: 261.8 MB
Wed Jan 4 21:31:07 2017 linear algebra at 0.1%, ETA 8h35m
Wed Jan 4 21:31:15 2017 checkpointing every 250000 dimensions
Code:
Wed Jan 4 21:26:36 2017 commencing linear algebra
Wed Jan 4 21:26:36 2017 initialized process (1,0) of 2 x 1 grid
Wed Jan 4 21:30:38 2017 matrix starts at (1062411, 0)
Wed Jan 4 21:30:38 2017 matrix is 1062298 x 2124888 (333.3 MB) with weight 70376272 (33.12/col)
Wed Jan 4 21:30:38 2017 sparse part has weight 70376272 (33.12/col)
Wed Jan 4 21:30:39 2017 matrix is 1062298 x 2124888 (333.3 MB) with weight 70376272 (33.12/col)
Wed Jan 4 21:30:39 2017 sparse part has weight 70376272 (33.12/col)
Wed Jan 4 21:30:39 2017 using block size 8192 and superblock size 196608 for processor cache size 2048 kB
Wed Jan 4 21:30:44 2017 commencing Lanczos iteration (4 threads)
Wed Jan 4 21:30:44 2017 memory use: 244.7 MB
Code:
mpirun -np 4 --hostfile ./mpi_hosts221 ../msieve/msieve -i number.ini -s number.dat -l number.log -nf number.fb -t 2 -nc2 4,1
Code:
Wed Jan 4 21:42:40 2017 commencing linear algebra
Wed Jan 4 21:42:40 2017 initialized process (0,0) of 4 x 1 grid
Wed Jan 4 21:42:41 2017 read 2124888 cycles
Wed Jan 4 21:42:45 2017 cycles contain 6333818 unique relations
Wed Jan 4 21:43:56 2017 read 6333818 relations
Wed Jan 4 21:44:04 2017 using 20 quadratic characters above 4294917295
Wed Jan 4 21:44:38 2017 building initial matrix
Wed Jan 4 21:46:04 2017 memory use: 851.5 MB
Wed Jan 4 21:46:06 2017 read 2124888 cycles
Wed Jan 4 21:46:07 2017 matrix is 2124709 x 2124888 (638.7 MB) with weight 201591455 (94.87/col)
Wed Jan 4 21:46:07 2017 sparse part has weight 144046076 (67.79/col)
Wed Jan 4 21:46:22 2017 filtering completed in 1 passes
Wed Jan 4 21:46:23 2017 matrix is 2124709 x 2124888 (638.7 MB) with weight 201591455 (94.87/col)
Wed Jan 4 21:46:23 2017 sparse part has weight 144046076 (67.79/col)
Wed Jan 4 21:46:43 2017 matrix starts at (0, 0)
Wed Jan 4 21:46:44 2017 matrix is 531262 x 2124888 (236.0 MB) with weight 96032795 (45.19/col)
Wed Jan 4 21:46:44 2017 sparse part has weight 38487416 (18.11/col)
Wed Jan 4 21:46:44 2017 saving the first 48 matrix rows for later
Wed Jan 4 21:46:44 2017 matrix includes 64 packed rows
Wed Jan 4 21:46:47 2017 matrix is 531214 x 2124888 (216.2 MB) with weight 54967126 (25.87/col)
Wed Jan 4 21:46:47 2017 sparse part has weight 35425252 (16.67/col)
Wed Jan 4 21:46:47 2017 using block size 8192 and superblock size 196608 for processor cache size 2048 kB
Wed Jan 4 21:46:50 2017 commencing Lanczos iteration (2 threads)
Wed Jan 4 21:46:50 2017 memory use: 146.6 MB
Wed Jan 4 21:47:06 2017 linear algebra at 0.1%, ETA 5h49m
Wed Jan 4 21:47:11 2017 checkpointing every 370000 dimensions
The logs do show the appropriate thread counts, but the CPU use just doesn't seem to match. Thanks for the reply. I'll go back to my studies...
Last fiddled with by EdH on 2017-01-05 at 03:13
2017-01-05, 13:57  #4
(loop (#_fork))
Feb 2006
Cambridge, England
13·491 Posts 
Could you post the hosts files?
My suspicion is that mpirun has decided that it should bind processes to CPUs, and that you've somehow not told it that some of the hosts have more than one CPU ... what does 'taskset -p {process ID}' tell you when a process is running with insufficient CPU usage? Aha, in a document at the Oxford supercomputer centre website, I found
Code:
Finally, versions higher than 1.8.0 in OpenMPI bind automatically processes to threads. Thus,
export OMPI_MCA_hwloc_base_binding_policy=none
Supercomputer centres almost always use something like Slurm or Torque for job submission, so I'm having a little trouble tying down how to get one-job-per-machine in the case without an extra layer.
Last fiddled with by fivemack on 2017-01-05 at 14:03
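For anyone trying the suggestion above, here is a minimal sketch of how the setting could be applied to the commands from post #1. The wrapper name run_la and its argument order are hypothetical, made up for this illustration; the msieve arguments are the ones already shown in this thread. The --report-bindings option asks mpirun to print where each process was bound, which may be a quicker check than taskset. As far as I know, Open MPI 1.8 and later also accept --bind-to none on the mpirun command line as an alternative to the environment variable.

```shell
# Turn off Open MPI's automatic process binding for this shell session,
# as suggested in the quoted document.
export OMPI_MCA_hwloc_base_binding_policy=none

# Hypothetical wrapper (the name and argument order are illustrative only):
#   $1 = number of MPI processes, $2 = hostfile,
#   $3 = msieve threads per process, $4 = MPI grid for -nc2 (e.g. 2,1)
run_la() {
    mpirun -np "$1" --hostfile "$2" --report-bindings \
        ../msieve/msieve -i number.ini -s number.dat -l number.log \
        -nf number.fb -t "$3" -nc2 "$4"
}

# Example call, not executed here:
#   run_la 2 ./mpi_hosts111 4 2,1
```

The function is only defined, not called, so the snippet can be pasted into a shell without starting a run.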
2017-01-05, 18:40  #5
"Ed Hall"
Dec 2009
Adirondack Mtns
2×11×167 Posts 
Thanks, fivemack! This does make a difference. After exporting the policy value on the host and first slave, the msieve threads have increased their CPU usage. The host machine is up to just under 200% and the slave machine is just over 200%. During this run, your taskset query returns:
Code:
pid 8003's current affinity mask: f
OK, taskset now returns:
Code:
pid 8850's current affinity mask: 1
In answer to your other request, my mpi_host files follow this theme:
mpi_hosts111:
Code:
localhost slots=1
math59@192.168.0.58 slots=1
math59@192.168.0.60 slots=1
mpi_hosts221:
Code:
localhost slots=2
math59@192.168.0.58 slots=2
math59@192.168.0.60 slots=1
The host and the first slave are quad core and the second slave is dual core. Also, the host and first slave are maxed out at 4G, while the second slave has only 3G of RAM. I am probably going to change the second slave for a quad core with more RAM, but the current second slave has the identical architecture as the other two. I thought that might be an advantage at this point.
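To read the two taskset results above: the affinity mask is a hexadecimal bit field with one bit per logical CPU, so f (binary 1111) allows four CPUs while 1 pins the process to a single CPU. A small sketch for counting the allowed CPUs follows; the helper name mask_cpu_count is hypothetical and not part of taskset, and it assumes a plain hex mask without the comma separators that appear on machines with many CPUs.

```shell
# Count the CPUs enabled in a hexadecimal affinity mask as printed by
# "taskset -p". Hypothetical helper; assumes a plain hex string such as "f".
mask_cpu_count() {
    mask=$(( 0x$1 ))          # interpret the argument as a hex number
    count=0
    while [ "$mask" -gt 0 ]; do
        count=$(( count + (mask & 1) ))   # add the lowest bit
        mask=$(( mask >> 1 ))             # move on to the next CPU bit
    done
    echo "$count"
}

mask_cpu_count f   # mask reported for pid 8003, prints 4
mask_cpu_count 1   # mask reported for pid 8850, prints 1
```

A process started with -t 4 but holding mask 1 would have all four threads sharing one core, which would fit the low CPU figures seen in top.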
2017-01-16, 17:22  #6
"Ed Hall"
Dec 2009
Adirondack Mtns
3674_{10} Posts 
Well, I guess it's time to give up on this for a while again. Too much frustration!
No matter what combination I try, I can't get a savings in time over the bare initial machine running four cores. The only advantage the additional machines do give me is the capability to handle larger matrices since the additional machines do add their memory to the mix and all are maxed at 4GB. Since my current play area only entails working with composites that are less than 150 digits and only take two to four days to factor, I'll let this slide into the background for a bit. 