More openmpi questions...
I am having trouble getting msieve threads to carry over into the Open MPI processes, especially with -t 4. All of the following use the same MPI-aware build of msieve. All of these observations were made after waiting for the LA to get a bit past the ETA messages.
If I run msieve with -t 4, without calling it via mpirun, top shows one instance of msieve using ~350% of my quad-core CPU:
[code]
../msieve/msieve -i number.ini -s number.dat -l number.log -nf number.fb -t 4 -nc2
[/code]
If I then run the same command via MPI, top shows one instance of msieve running at <=100% and the ETA is correspondingly longer.
[code]
mpirun -np 1 --hostfile ./mpi_hosts111 ../msieve/msieve -i number.ini -s number.dat -l number.log -nf number.fb -t 4 -nc2 1,1
[/code]
If, still using one machine, I change to 2 MPI processes and -t 2, top shows 2 instances, both at <=100%.
[code]
mpirun -np 2 --hostfile ./mpi_hosts221 ../msieve/msieve -i number.ini -s number.dat -l number.log -nf number.fb -t 2 -nc2 2,1
[/code]
If I add in a second machine, top shows 2 instances on each machine, with all instances at <=150%.
[code]
mpirun -np 4 --hostfile ./mpi_hosts221 ../msieve/msieve -i number.ini -s number.dat -l number.log -nf number.fb -t 2 -nc2 4,1
[/code]
If I then adjust to 2 MPI processes with -t 4, top changes to one instance on each machine, but both at <=100%.
[code]
mpirun -np 2 --hostfile ./mpi_hosts111 ../msieve/msieve -i number.ini -s number.dat -l number.log -nf number.fb -t 4 -nc2 2,1
[/code]
Note that the suffix of each mpi_hosts file gives the slots to use on each of the three machines. It appears that -t 2 works with multiple machines, but -t 4 will not work at all via MPI. Any thoughts on the above observations are welcome...
How big a matrix is this? I would expect to need a matrix size above maybe 2M to see a speedup from MPI, especially with multiple threads. If your machines are still connected with gigabit then they will still spend a lot of time waiting for data transfers.

I think my matrix is just over 4M, if I'm reading it right. I was thinking that if I could increase threads and reduce MPI processes, I could decrease data transfers, but I might have this backwards. In practice, -t 2 with three machines appears to be optimal for my setup. Yes, I'm still on Gigabit. Adding a third machine does reduce the time, so I took that to mean the Gigabit link wasn't saturated by the first two. But memory transfer might be my issue. Even though I'm not filling the memory, there may not be enough bandwidth, perhaps?
If this is helpful:
[code]
mpirun -np 2 --hostfile ./mpi_hosts111 ../msieve/msieve -i number.ini -s number.dat -l number.log -nf number.fb -t 4 -nc2 2,1
[/code]
gives me one process on each machine with top showing <=100%. Here are the two logs:
[code]
Wed Jan 4 21:26:34 2017 Msieve v. 1.53 (SVN 993)
Wed Jan 4 21:26:34 2017 random seeds: 99435217 8a405357
Wed Jan 4 21:26:34 2017 MPI process 0 of 2
Wed Jan 4 21:26:34 2017 factoring 820542702287058139583300542461757119495935711084069870517652403589147165539358552360109600961345804476958004926044416408854122278694458926677 (141 digits)
Wed Jan 4 21:26:35 2017 searching for 15-digit factors
Wed Jan 4 21:26:36 2017 commencing number field sieve (141-digit input)
Wed Jan 4 21:26:36 2017 R0: 8068224187348260061731767540
Wed Jan 4 21:26:36 2017 R1: 7392072149387
Wed Jan 4 21:26:36 2017 A0: 20423341607513397403579630437539211
Wed Jan 4 21:26:36 2017 A1: 529896443757128435451449388665
Wed Jan 4 21:26:36 2017 A2: 137820022314972661814868
Wed Jan 4 21:26:36 2017 A3: 11802326769047736
Wed Jan 4 21:26:36 2017 A4: 9623375962
Wed Jan 4 21:26:36 2017 A5: 24
Wed Jan 4 21:26:36 2017 skew 6219102.56, size 7.941e14, alpha 5.813, combined = 1.417e11 rroots = 3
Wed Jan 4 21:26:36 2017
Wed Jan 4 21:26:36 2017 commencing linear algebra
Wed Jan 4 21:26:36 2017 initialized process (0,0) of 2 x 1 grid
Wed Jan 4 21:26:36 2017 read 2124888 cycles
Wed Jan 4 21:26:40 2017 cycles contain 6333818 unique relations
Wed Jan 4 21:27:50 2017 read 6333818 relations
Wed Jan 4 21:27:58 2017 using 20 quadratic characters above 4294917295
Wed Jan 4 21:28:33 2017 building initial matrix
Wed Jan 4 21:30:01 2017 memory use: 851.5 MB
Wed Jan 4 21:30:03 2017 read 2124888 cycles
Wed Jan 4 21:30:04 2017 matrix is 2124709 x 2124888 (638.7 MB) with weight 201591455 (94.87/col)
Wed Jan 4 21:30:04 2017 sparse part has weight 144046076 (67.79/col)
Wed Jan 4 21:30:19 2017 filtering completed in 1 passes
Wed Jan 4 21:30:20 2017 matrix is 2124709 x 2124888 (638.7 MB) with weight 201591455 (94.87/col)
Wed Jan 4 21:30:20 2017 sparse part has weight 144046076 (67.79/col)
Wed Jan 4 21:30:38 2017 matrix starts at (0, 0)
Wed Jan 4 21:30:38 2017 matrix is 1062411 x 2124888 (370.2 MB) with weight 131215183 (61.75/col)
Wed Jan 4 21:30:38 2017 sparse part has weight 73669804 (34.67/col)
Wed Jan 4 21:30:38 2017 saving the first 48 matrix rows for later
Wed Jan 4 21:30:39 2017 matrix includes 64 packed rows
Wed Jan 4 21:30:39 2017 matrix is 1062363 x 2124888 (350.4 MB) with weight 90149514 (42.43/col)
Wed Jan 4 21:30:39 2017 sparse part has weight 70607640 (33.23/col)
Wed Jan 4 21:30:39 2017 using block size 8192 and superblock size 196608 for processor cache size 2048 kB
Wed Jan 4 21:30:44 2017 commencing Lanczos iteration (4 threads)
Wed Jan 4 21:30:44 2017 memory use: 261.8 MB
Wed Jan 4 21:31:07 2017 linear algebra at 0.1%, ETA 8h35m
Wed Jan 4 21:31:15 2017 checkpointing every 250000 dimensions
[/code]
[code]
Wed Jan 4 21:26:36 2017 commencing linear algebra
Wed Jan 4 21:26:36 2017 initialized process (1,0) of 2 x 1 grid
Wed Jan 4 21:30:38 2017 matrix starts at (1062411, 0)
Wed Jan 4 21:30:38 2017 matrix is 1062298 x 2124888 (333.3 MB) with weight 70376272 (33.12/col)
Wed Jan 4 21:30:38 2017 sparse part has weight 70376272 (33.12/col)
Wed Jan 4 21:30:39 2017 matrix is 1062298 x 2124888 (333.3 MB) with weight 70376272 (33.12/col)
Wed Jan 4 21:30:39 2017 sparse part has weight 70376272 (33.12/col)
Wed Jan 4 21:30:39 2017 using block size 8192 and superblock size 196608 for processor cache size 2048 kB
Wed Jan 4 21:30:44 2017 commencing Lanczos iteration (4 threads)
Wed Jan 4 21:30:44 2017 memory use: 244.7 MB
[/code]
Here is the change to 4 processes with -t 2:
[code]
mpirun -np 4 --hostfile ./mpi_hosts221 ../msieve/msieve -i number.ini -s number.dat -l number.log -nf number.fb -t 2 -nc2 4,1
[/code]
And the first log:
[code]
Wed Jan 4 21:42:40 2017 commencing linear algebra
Wed Jan 4 21:42:40 2017 initialized process (0,0) of 4 x 1 grid
Wed Jan 4 21:42:41 2017 read 2124888 cycles
Wed Jan 4 21:42:45 2017 cycles contain 6333818 unique relations
Wed Jan 4 21:43:56 2017 read 6333818 relations
Wed Jan 4 21:44:04 2017 using 20 quadratic characters above 4294917295
Wed Jan 4 21:44:38 2017 building initial matrix
Wed Jan 4 21:46:04 2017 memory use: 851.5 MB
Wed Jan 4 21:46:06 2017 read 2124888 cycles
Wed Jan 4 21:46:07 2017 matrix is 2124709 x 2124888 (638.7 MB) with weight 201591455 (94.87/col)
Wed Jan 4 21:46:07 2017 sparse part has weight 144046076 (67.79/col)
Wed Jan 4 21:46:22 2017 filtering completed in 1 passes
Wed Jan 4 21:46:23 2017 matrix is 2124709 x 2124888 (638.7 MB) with weight 201591455 (94.87/col)
Wed Jan 4 21:46:23 2017 sparse part has weight 144046076 (67.79/col)
Wed Jan 4 21:46:43 2017 matrix starts at (0, 0)
Wed Jan 4 21:46:44 2017 matrix is 531262 x 2124888 (236.0 MB) with weight 96032795 (45.19/col)
Wed Jan 4 21:46:44 2017 sparse part has weight 38487416 (18.11/col)
Wed Jan 4 21:46:44 2017 saving the first 48 matrix rows for later
Wed Jan 4 21:46:44 2017 matrix includes 64 packed rows
Wed Jan 4 21:46:47 2017 matrix is 531214 x 2124888 (216.2 MB) with weight 54967126 (25.87/col)
Wed Jan 4 21:46:47 2017 sparse part has weight 35425252 (16.67/col)
Wed Jan 4 21:46:47 2017 using block size 8192 and superblock size 196608 for processor cache size 2048 kB
Wed Jan 4 21:46:50 2017 commencing Lanczos iteration (2 threads)
Wed Jan 4 21:46:50 2017 memory use: 146.6 MB
Wed Jan 4 21:47:06 2017 linear algebra at 0.1%, ETA 5h49m
Wed Jan 4 21:47:11 2017 checkpointing every 370000 dimensions
[/code]
Both machines show two processes at <150% each in top. The logs do show the appropriate thread counts, but the CPU usage doesn't seem to match. Thanks for the reply. I'll go back to my studies...
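To put the two ETA figures from the logs side by side, here is a small Python sketch. The helper name `eta_to_minutes` is made up for illustration, and the parser assumes the ETA always looks like the `8h35m` form seen in these logs; other formats msieve may print are not handled.

```python
import re

def eta_to_minutes(eta: str) -> int:
    """Convert an ETA string such as '8h35m' into whole minutes.

    Assumption: only the 'NhMm' shape seen in the logs above is supported.
    """
    match = re.fullmatch(r"(?:(\d+)h)?\s*(?:(\d+)m)?", eta.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised ETA: {eta!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes

# ETA values copied from the two process-0 logs.
two_procs_four_threads = eta_to_minutes("8h35m")   # -np 2, -t 4
four_procs_two_threads = eta_to_minutes("5h49m")   # -np 4, -t 2

ratio = four_procs_two_threads / two_procs_four_threads
print(f"-np 4 / -t 2 ETA is {ratio:.0%} of the -np 2 / -t 4 ETA")
```

These are early estimates taken at 0.1% of the linear algebra, so the ratio is only a rough indication of relative speed.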
Could you post the hosts files?
My suspicion is that mpirun has decided that it should bind processes to CPUs, and that you've somehow not told it that some of the hosts have more than one CPU ... what does 'taskset -p {process ID}' tell you when a process is running with insufficient CPU usage? Aha, in a document at the Oxford supercomputer centre website, I found
[code]
Finally, versions higher than 1.8.0 in OpenMPI bind automatically processes to threads. Thus,
export OMPI_MCA_hwloc_base_binding_policy=none
[/code]
so maybe see if doing that changes what you see happening? Supercomputer centres almost always use something like Slurm or Torque for job submission, so I'm having a little trouble tying down how to get one-job-per-machine in the case without an extra layer.
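As a rough sketch of the suggestion above, the following Python builds the same mpirun command line used earlier in the thread together with an environment that has the binding policy set to none. The function name `build_launch` and its parameters are hypothetical. It only sets the variable for the local mpirun; whether remote nodes also need it exported is an assumption to check on your own setup.

```python
import os

def build_launch(np_count, hostfile, threads, grid, base_env=None):
    """Return (argv, env) for an mpirun launch of msieve with binding disabled.

    Hypothetical helper: file names and the msieve path mirror the thread.
    """
    env = dict(os.environ if base_env is None else base_env)
    # The setting quoted from the Oxford document above.
    env["OMPI_MCA_hwloc_base_binding_policy"] = "none"
    argv = [
        "mpirun", "-np", str(np_count), "--hostfile", hostfile,
        "../msieve/msieve",
        "-i", "number.ini", "-s", "number.dat",
        "-l", "number.log", "-nf", "number.fb",
        "-t", str(threads), "-nc2", grid,
    ]
    return argv, env

argv, env = build_launch(2, "./mpi_hosts111", 4, "2,1", base_env={})
print(" ".join(argv))
# To actually launch (needs Open MPI and msieve installed):
#   import subprocess; subprocess.run(argv, env=env, check=True)
```

Keeping the command as a list and the environment as a separate dict makes it easy to try different `-np` / `-t` combinations without retyping the long command.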
Thanks, fivemack! This does make a difference. After exporting the policy value on the host and first slave, the msieve threads have increased their CPU usage. The host machine is up to just under 200% and the slave machine is just over 200%. During this run, your taskset query returns:
[code]
pid 8003's current affinity mask: f
[/code]
I'll clear the policy and see what I get with the machine in the earlier state... OK, taskset now returns:
[code]
pid 8850's current affinity mask: 1
[/code]
and top is back to showing <=100% for both msieve processes. In answer to your other request, my mpi_hosts files follow this theme:
mpi_hosts111:
[code]
localhost slots=1
math59@192.168.0.58 slots=1
math59@192.168.0.60 slots=1
[/code]
mpi_hosts221:
[code]
localhost slots=2
math59@192.168.0.58 slots=2
math59@192.168.0.60 slots=1
[/code]
They appear to track directly with my grid values. The host and the first slave are quad core and the second slave is dual core. Also, the host and first slave are maxed out at 4G, while the second slave has only 3G of RAM. I am probably going to change the second slave for a quad core with more RAM, but the current second slave has the identical architecture to the other two. I thought that might be an advantage at this point.
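For anyone reading the two taskset results above: the affinity mask is a hexadecimal bitmask in which bit n set means the process may run on CPU n. A minimal Python sketch that decodes the two masks reported in this thread (the function name `cpus_in_mask` is made up for illustration, and it assumes a plain hex string with no separators):

```python
def cpus_in_mask(mask_hex):
    """Return the list of CPU indices allowed by a hex affinity mask."""
    mask = int(mask_hex, 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]

for label, mask in (("binding policy none", "f"), ("default policy", "1")):
    cpus = cpus_in_mask(mask)
    print(f"{label}: mask {mask} -> CPUs {cpus} ({len(cpus)} usable)")
```

A mask of f allows all four cores of a quad-core machine, while a mask of 1 pins the process, and every thread it starts, to CPU 0. That matches top showing <=100% however many threads msieve reports.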
Well, I guess it's time to give up on this for a while again. Too much frustration!
No matter what combination I try, I can't get a savings in time over the bare initial machine running four cores. The only advantage the additional machines do give me is the capability to handle larger matrices since the additional machines do add their memory to the mix and all are maxed at 4GB. Since my current play area only entails working with composites that are less than 150 digits and only take two to four days to factor, I'll let this slide into the background for a bit. 