mersenneforum.org  

2010-08-05, 21:22   #45
frmky

After you remove the AMD-related -march flags in the Makefile, compiling using icc works fine.

Let's take the example that you have 16 nodes, each with a quad-core processor. In this case, there are 64 total cores. In the simplest (but probably non-optimal) case, you would just run

mpirun -np 64 ./msieve -nc2 8,8 -v

which launches 64 MPI processes using an 8x8 grid. Use a grid size m x n where m*n equals the total number of processes, m <= n, and it is as close to square as possible.
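The most-square-grid rule of thumb can be sketched in a few lines (illustrative Python, not msieve code; `best_grid` is a hypothetical helper name):

```python
# Sketch: pick the most-square m x n MPI grid for a given process
# count, with m <= n and m * n == nprocs.
def best_grid(nprocs):
    best = (1, nprocs)
    for m in range(1, int(nprocs ** 0.5) + 1):
        if nprocs % m == 0:
            # larger valid m is closer to sqrt(nprocs), hence more square
            best = (m, nprocs // m)
    return best

print(best_grid(64))  # (8, 8)
print(best_grid(32))  # (4, 8)
```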

For Infiniband-connected Core 2-based nodes with DDR2 memory, my tests indicate that the best speed is when I use two threads per process and run only two processes per node. On a quad-core node this is perfect, but on dual-quad nodes, this leaves 4 cores idle. Using GigE interconnect and/or nodes with DDR3 memory probably changes this. Anyway, these arrangements require telling MPI to launch fewer processes per node than the number of cores in the node. How to do this depends on the particular MPI implementation you are using. This is complicated further by the fact that larger clusters require batch job submission, and the batch software has its own method of specifying the number of cores per process.

Let me know the details of the cluster, if you're using batch submission or running interactively, and the exact MPI version you are using, and I can help you further.
2010-08-05, 22:06   #46
CRGreathouse

Quote:
Originally Posted by frmky
How fast you ask?
Code:
Sat Jun 26 22:24:21 2010  matrix is 9140582 x 9140759 (3918.8 MB) with weight 1121375182 (122.68/col)
Wed Jul 14 10:21:15 2010  initialized process (0,0) of 4 x 8 grid
Wed Jul 14 10:23:01 2010  linear algebra at 0.0%, ETA 39h58m
Granted this is using 8 nodes of an Infiniband-connected cluster, not your average PC. Eight nodes of a Gigabit Ethernet connected cluster takes more like 90 hours. But still...
I'm amazed that there's so much communication between nodes that there can be a 60% speedup just from increasing the link speed. Why so much? Any layman explanations?
2010-08-05, 22:32   #47
bsquared

Quote:
Originally Posted by CRGreathouse
I'm amazed that there's so much communication between nodes that there can be a 60% speedup just from increasing the link speed. Why so much? Any layman explanations?
Infiniband uses a switch fabric that is much more efficient than ethernet switching - much better latency in addition to the higher throughput from a faster serial data rate. This might not be the only reason, but I'm sure it's part of it.
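The latency point can be illustrated with a toy transfer-time model, time = latency + size/bandwidth (all numbers below are hypothetical round figures, not measurements of any particular hardware):

```python
# Toy model: small messages are dominated by per-message latency,
# so a low-latency interconnect wins big on them.
def xfer_s(size_bytes, latency_s, bw_bytes_per_s):
    return latency_s + size_bytes / bw_bytes_per_s

# 512-byte message, assumed ~50 us software/kernel latency for GigE
# vs ~2 us for Infiniband (illustrative numbers only)
gige_s = xfer_s(512, 50e-6, 125e6)   # GigE: ~1 Gb/s = 125 MB/s
ib_s = xfer_s(512, 2e-6, 1e9)        # Infiniband-class link

print(gige_s / ib_s)  # latency makes the small transfer ~20x slower
```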
2010-08-06, 02:30   #48
frmky

Quote:
Originally Posted by CRGreathouse
I'm amazed that there's so much communication between nodes that there can be a 60% speedup just from increasing the link speed. Why so much? Any layman explanations?
Right now, I'm running the matrix for 5,448+ using 64 nodes of an Infiniband cluster. The matrix is about 16.1M x 16.1M in size, and I'm using an 8x16 grid. The total runtime is about 37 hours. This matrix will take a bit over 254,000 iterations to solve, which means that each iteration is taking about 0.5 seconds.

The size-N vectors are 16.1M * 8 bytes = 128.8 MB in size. These are split across the row grid, so for this calculation each node holds a vector slice about 16.1 MB in size. In each iteration, these vectors must be updated across the entire grid twice, once across the rows and once across the columns. For a row update, each node must send and receive this data log2(8) = 3 times, and likewise log2(16) = 4 times for a column update, for a total of 7 transfers of 16.1 MB of data. That's 7 * 16.1 MB = 112.7 MB of data for each iteration. At 1 Gb/s, that will take 112.7 MB * 8 b/B / (1000 Mb/s) = 0.9 seconds at full GigE speed. In practice we see only about 70% of full speed, so budget 1.3 seconds just for the transfer of the large vectors. Now, there are also two transfers of 512 bytes of data across the grid. With Infiniband these also run at full speed, but GigE runs slower for small transfers thanks to the latency of the kernel interrupts, so budget say 0.1 second or so for these. Add in 0.4 seconds or so for the actual calculation, and you're nearing 1.8 seconds/iteration, over 3x the Infiniband iteration time, and you've spent over 75% of your time simply transferring data around.
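That back-of-envelope arithmetic can be checked mechanically (illustrative Python; the numbers are taken from the 5,448+ example above):

```python
from math import log2

# Per-iteration GigE transfer cost for the 16.1M matrix on an 8x16 grid
n = 16.1e6                            # matrix dimension
rows, cols = 8, 16                    # MPI grid
slice_mb = n * 8 / 1e6 / rows         # per-node vector slice: ~16.1 MB
transfers = log2(rows) + log2(cols)   # 3 row + 4 column exchanges = 7
data_mb = transfers * slice_mb        # ~112.7 MB moved per iteration
ideal_s = data_mb * 8 / 1000          # at a full 1 Gb/s: ~0.9 s
real_s = ideal_s / 0.7                # at ~70% of line rate: ~1.3 s

print(round(data_mb, 1), round(ideal_s, 2), round(real_s, 2))
```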

With Infiniband, 1.4 seconds of data transfer becomes 0.1 seconds of data transfer, with 0.4 seconds of calculation, so you're only spending about 20% of your time in the data transfer.
2010-08-06, 10:03   #49
fivemack

Ah, and that explains the faster-than-square-root speedup, since the model that gives you sqrt(N) assumes that you're spending all your time on data transfer ...
2010-08-06, 15:10   #50
Jeff Gilchrist

Quote:
Originally Posted by frmky
After you remove the AMD-related -march flags in the Makefile, compiling using icc works fine.
Yes, that is how I got it to compile.

Quote:
Originally Posted by frmky
Let's take the example that you have 16 nodes, each with a quad-core processor. In this case, there are 64 total cores. In the simplest (but probably non-optimal) case, you would just run

mpirun -np 64 ./msieve -nc2 8,8 -v

which launches 64 MPI processes using an 8x8 grid. Use a grid size m x n where m*n equals the total number of processes, m <= n, and it is as close to square as possible.
I have to run it through a batch job submitted into a queue, and they have their own custom scripts for that. For MPI jobs the best I can do is specify how many processes I want, so if I said -np 64, it would figure out which nodes in the cluster had free processors and assign them using its own scheduling system. Most likely it would schedule 8 processes on an 8-core system if it was empty. If the entire cluster was free, it would use 8 dual-quad-core nodes, so all 64 processes would run on 8 physical systems. Or it could spread processes across more systems, depending on cluster usage.

Either way, is it still best to use an 8x8 grid, or a grid as square as possible, even though I have no way of knowing beforehand which nodes it might be scheduled to run on?

To re-start from a checkpoint I guess I can just use -ncr 8,8 or whatever MPI grid that I used before? Is it possible to restart with a different grid size in case say more or less nodes are available when I go to restart?

Thanks.
Jeff.

2010-08-06, 16:03   #51
jasonp

Running with MPI has some caveats. First, only one process builds the matrix, so if you use '-nc2 8,8' then 63 other processes will be idle until the iteration actually starts. This sucks if you have a quota of compute time; Greg builds the matrix on one machine and then restarts the iteration from scratch with a hacked copy of msieve that skips the matrix build.

When the iteration starts, each process pulls in a portion of the complete matrix and then performs a row permutation that makes each submatrix look as similar as possible to all the others. That permutation must be identical if you restart from a checkpoint, so you can have a different number of columns on a restart but *not* a different number of rows.
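The restart constraint can be captured in a one-line check (illustrative Python, not msieve code; the `(rows, cols)` tuple convention and the `restart_ok` name are assumptions for the sketch):

```python
# A checkpoint taken on an m x n grid can be resumed on an m x n' grid
# only if the row count m is unchanged, because the row permutation
# applied at startup must be identical on restart.
def restart_ok(old_grid, new_grid):
    return old_grid[0] == new_grid[0]

print(restart_ok((8, 8), (8, 4)))   # True: same rows, fewer columns
print(restart_ok((8, 8), (4, 8)))   # False: row count changed
```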
2010-08-06, 18:03   #52
frmky

Many batch systems give you more flexibility than that. On the Teragrid systems, I can specify that I want a full node, and specify how many MPI processes I want on that node. This was designed to accommodate hybrid MPI/OpenMP programs, but it works with normal threaded programs as well. If you don't have this flexibility, then the best you can do is to not use threads.

Stay with a square grid; it's the most efficient. Use -ncr to restart, just as you said. But on restart you have to keep the number of rows the same: if you start with -nc2 8,8, you have to restart with -ncr 8,x, where x can be any number.
2010-08-06, 18:25   #53
Jeff Gilchrist

I just tried running -nc2 with a 3x3 grid to use 9 processors to see if it works; it aborted with this error:

"commencing linear algebra
error: MPI size 1 incompatible with 3 x 3 grid"

Trying now with a 2 x 2 grid to see if that works. Do you need to use even numbers or is this a bug?
2010-08-06, 18:28   #54
frmky

Quote:
Originally Posted by Jeff Gilchrist
error: MPI size 1 incompatible with 3 x 3 grid"
MPI thought you were only requesting 1 process, not 9. You may have to include -np 9 on your MPI command line.
2010-08-06, 19:45   #55
Jeff Gilchrist

Quote:
Originally Posted by frmky
MPI thought you were only requesting 1 process, not 9. You may have to include -np 9 on your MPI command line.
Found my problem: the system automatically calls mpirun with -np 9, but it turns out there are 3 different MPI products/systems in use on the clusters, so I had to re-compile my binary for the specific one I was using. It is running now, at least.