#45
Jul 2003
So Cal
2·3⁴·13 Posts
After you remove the AMD-related -march flags in the Makefile, compiling using icc works fine.
Let's take the example that you have 16 nodes, each with a quad-core processor, for 64 total cores. In the simplest (but probably non-optimal) case, you would just run mpirun -np 64 ./msieve -nc2 8,8 -v, which launches 64 MPI processes in an 8x8 grid. Use a grid size m x n where m*n equals the total number of processes, m <= n, and the grid is as close to square as possible. For Infiniband-connected Core 2-based nodes with DDR2 memory, my tests indicate the best speed comes from two threads per process and only two processes per node. On a quad-core node this is perfect, but on dual-quad nodes it leaves 4 cores idle. Using a GigE interconnect and/or nodes with DDR3 memory probably changes this. Anyway, these arrangements require telling MPI to launch fewer processes per node than the number of cores in the node. How to do this depends on the particular MPI implementation you are using. It is complicated further by the fact that larger clusters require batch job submission, and the batch software has its own method of specifying the number of cores per process. Let me know the details of the cluster, whether you're using batch submission or running interactively, and the exact MPI version you are using, and I can help you further.
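As a sketch of the grid-sizing rule above (this is just the m*n-equals-process-count, m <= n, near-square constraint; the helper name is mine, not anything in msieve):

```python
import math

def near_square_grid(nprocs):
    """Pick m x n with m * n == nprocs, m <= n, as close to square as possible."""
    m = math.isqrt(nprocs)
    while nprocs % m != 0:  # walk down to the nearest divisor of nprocs
        m -= 1
    return m, nprocs // m

# 64 cores -> the 8x8 grid used in the mpirun example above
print(near_square_grid(64))
```

Prime process counts degenerate to a 1 x n grid, which is one reason to pick the total core count with the grid shape in mind.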
#46
Aug 2006
3·1,993 Posts
#47
"Ben"
Feb 2007
3513₁₀ Posts
Infiniband uses a switch fabric that is much more efficient than Ethernet switching: much lower latency in addition to the higher throughput from a faster serial data rate. This might not be the only reason, but I'm sure it's part of it.
#48
Jul 2003
So Cal
2·3⁴·13 Posts
The size-N vectors are 16.1M * 8 bytes = 128.8 MB in size. These are split across the row grid, so for this calculation each computer has a vector that's about 16.1 MB in size. In each iteration, these vectors must be updated across the entire grid twice, once across the rows and once across the columns. For a row update, each computer must send and receive this data log₂(8) = 3 times, and 4 times for a column update, for a total of 7 transfers of 16.1 MB of data. That's 7 * 16.1 MB = 112.7 MB of data for each iteration. At 1 Gb/s, that will take 112.7 MB * 8 b/B / (1000 Mb/s) = 0.9 seconds at full GigE speed. In practice we see only about 70% of full speed, so budget 1.3 seconds just for the transfer of the large vectors.

There are also two transfers of 512 bytes of data across the grid each iteration. With Infiniband these also run at full speed, but GigE runs slower for small transfers due to the latency of the kernel interrupts, so budget, say, 0.1 seconds for these. Add in 0.4 seconds or so for the actual calculation and you're nearing 1.8 seconds/iteration, over 3x the Infiniband iteration time, with over 75% of your time spent simply transferring data around. With Infiniband, 1.4 seconds of data transfer becomes 0.1 seconds, plus the same 0.4 seconds of calculation, so you're only spending about 20% of your time in the data transfer.
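The arithmetic in this estimate is easy to check in a few lines (the constants are the figures quoted in the post, not independent measurements):

```python
import math

elems = 16.1e6        # vector length N
word = 8              # bytes per element
grid = 8              # 8x8 process grid, vector split across the row grid
per_node_mb = elems * word / grid / 1e6      # ~16.1 MB slice per computer
transfers = int(math.log2(grid)) + 4         # 3 row + 4 column = 7 per iteration
total_mb = transfers * per_node_mb           # ~112.7 MB moved per iteration
t_wire = total_mb * 8 / 1000                 # seconds at a full 1 Gb/s
t_real = t_wire / 0.70                       # ~70% of line rate in practice
print(f"{total_mb:.1f} MB moved, ~{t_real:.1f} s per iteration on GigE")
```

The same bookkeeping with an Infiniband-class link makes the vector transfer term nearly vanish, which is where the 0.5 s vs 1.8 s comparison comes from.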
#49
(loop (#_fork))
Feb 2006
Cambridge, England
7²·131 Posts
Ah, and that explains the faster-than-square-root speedup, since the model that gives you sqrt(N) assumes that you're spending all your time on data transfer ...
#50
Jun 2003
Ottawa, Canada
3·17·23 Posts
Either way, is it still best to use an 8x8 grid, or a grid as square as possible, even though I have no way of knowing beforehand which nodes it might be scheduled to run on? To restart from a checkpoint I guess I can just use -ncr 8,8 or whatever MPI grid I used before? Is it possible to restart with a different grid size in case, say, more or fewer nodes are available when I go to restart? Thanks. Jeff.

Last fiddled with by Jeff Gilchrist on 2010-08-06 at 15:13
#51
Tribal Bullet
Oct 2004
3,541 Posts
Running with MPI has some caveats. First, only one process builds the matrix, so if you use '-nc2 8,8' then 63 other processes will be idle until the iteration actually starts. This sucks if you have a quota of compute time; Greg builds the matrix on one machine and then restarts the iteration from scratch with a hacked copy of msieve that skips the matrix build.
When the iteration starts, each process pulls in a portion of the complete matrix and then performs a row permutation that makes each submatrix look as similar as possible to all the others. That permutation must be identical if you restart from a checkpoint, so you can have a different number of columns on a restart but *not* a different number of rows.
#52
Jul 2003
So Cal
83A₁₆ Posts
Many batch systems give you more flexibility than that. On the Teragrid systems, I can specify that I want a full node and specify how many MPI processes I want on that node. This was designed to accommodate hybrid MPI/OpenMP programs, but it works perfectly with normal threaded programs as well. If you don't have this flexibility, then the best you can do is to not use threads.

Stay with a square grid; it's the most efficient. Use -ncr to restart just as you said, but on restart you have to keep the number of rows the same. If you start with -nc2 8,8 you have to restart with -ncr 8,x, where x can be any number.
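The restart rule boils down to a one-line check: the row count of the grid is baked into the checkpoint (via the row permutation described above), the column count is not. A hypothetical validator (the function name is mine, not msieve's):

```python
def can_restart(saved_grid, new_grid):
    """A checkpoint from an m x n run can resume on m x n' for any n',
    because the saved row permutation depends only on the row count m."""
    return saved_grid[0] == new_grid[0]

print(can_restart((8, 8), (8, 4)))   # columns may change
print(can_restart((8, 8), (4, 16)))  # rows may not
```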
#53
Jun 2003
Ottawa, Canada
10010010101₂ Posts
I just tried running -nc2 with a 3x3 grid to use 9 processors to see if it works, but it aborted with this error:

"commencing linear algebra error: MPI size 1 incompatible with 3 x 3 grid"

Trying now with a 2x2 grid to see if that works. Do you need to use even numbers, or is this a bug?
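The message itself is just a consistency check: the number of MPI ranks the binary actually sees must equal m*n, so odd grid sizes are fine. "MPI size 1" means the job launched as a single rank. A sketch of such a check (my reconstruction from the error text, not msieve's actual code):

```python
def check_grid(mpi_size, m, n):
    """Reject a run where the visible MPI rank count doesn't match the grid."""
    if mpi_size != m * n:
        raise ValueError(f"MPI size {mpi_size} incompatible with {m} x {n} grid")

check_grid(9, 3, 3)  # a 3x3 grid is fine when all 9 ranks are visible
```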
#54
Jul 2003
So Cal
2×3⁴×13 Posts
#55
Jun 2003
Ottawa, Canada
3×17×23 Posts
Found my problem: the system automatically calls mpirun with -np 9, but it turns out there are 3 different MPI products/systems in use on the clusters, so I had to recompile my binary against the specific one I was using. It is running now at least.
Similar Threads

| Thread | Thread Starter | Forum | Replies | Last Post |
| block wiedemann and block lanczos | ravlyuchenko | Msieve | 5 | 2011-05-09 13:16 |
| Why is lanczos hard to distribute? | Christenson | Factoring | 39 | 2011-04-08 09:44 |
| Block Lanczos with a reordering pass | jasonp | Msieve | 18 | 2010-02-07 08:33 |
| Lanczos error | Andi47 | Msieve | 7 | 2009-01-11 19:33 |
| Msieve Lanczos scalability | Jeff Gilchrist | Msieve | 1 | 2009-01-02 09:32 |