#34

Jul 2003
So Cal
2×3⁴×13 Posts
#35

Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
2³·3·5·7² Posts
What size matrix would frmky be able to solve in a couple of months? What size number would that apply to?
#36

Jul 2003
So Cal
83A₁₆ Posts
As another data point, an 18.8M matrix would take under two weeks. I estimate a 30-35M matrix would take about two months. M941, a 284-digit SNFS, resulted in a 24.1M matrix, so an SNFS number in the high 280s would probably give a matrix of 30M or so. The bad news is that this would consume about 100,000 hours of CPU time on the big iron, which I will only have if the grant proposal I'm writing is funded. The good news is that Jason may still have tricks up his sleeve for improving performance on the local cluster that I can use for free.
#37

Jun 2010
316 Posts
Ok, I read a book about MPI and I just finished my first MPI program. The goal was to divide a search interval into slices and distribute the work more evenly among processors. The next step for me is to apply this sliced approach to the polynomial selection.
Please take a look and let me know if I'm missing anything here: http://mancoast.chickenkiller.com/primempi.tgz
#38

Tribal Bullet
Oct 2004
3,541 Posts
If you run a complex program where the only difference between nodes is what gets passed in on the command line, and you want to statically allocate the search space over the nodes you have, then just put the bounds in an array and use MPI_Bcast to send them to the other nodes.
There isn't a lot of difference between that and just executing a compiled binary via RPC, though, i.e. 'ssh user@node my_binary x,y'. If you want load balancing too, then just install one of the many free batch schedulers and let your big pile of jobs queue up waiting for a free CPU.
#39

Tribal Bullet
Oct 2004
DD5₁₆ Posts
I've now modified the msieve-mpi branch to use a 2-D grid of MPI processes when running the linear algebra. This should hopefully allow the speedup on a cluster with many nodes to keep increasing as more machines and more bus wires are added to a Lanczos run. The previous code was rather limited in the total speedup achievable by adding machines, and I suspect the new code will only be faster for very large problems, perhaps 10M and up.
Running with -nc2 behaves as before, using a 1×N grid given N processes by mpirun. For a 2-D grid of M×N MPI processes, run with '-nc2 M,N' or '-ncr M,N'. The code now performs a row permutation on the matrix as it is read from disk, to better balance the load across many machines; a side effect is that one can only restart from a checkpoint if the new M matches the old one.
#40

Tribal Bullet
Oct 2004
DD5₁₆ Posts
Now changed to be *much* faster (30-40% with many nodes)
#41

Jul 2003
So Cal
4072₈ Posts
How fast, you ask?
Code:
Sat Jun 26 22:24:21 2010  matrix is 9140582 x 9140759 (3918.8 MB) with weight 1121375182 (122.68/col)
Wed Jul 14 10:21:15 2010  initialized process (0,0) of 4 x 8 grid
Wed Jul 14 10:23:01 2010  linear algebra at 0.0%, ETA 39h58m
#42

Tribal Bullet
Oct 2004
DD5₁₆ Posts
The scalability with the latest code on infiniband connected nodes is also much better than previous computational experience would suggest. For N nodes we're seeing a speedup of O(N^0.75), instead of the predicted O(N^0.5)!
#43

Jul 2003
So Cal
83A₁₆ Posts
Actually, compiling all of the data, it appears to be a bit better than that. On the newer Abe cluster we're seeing close to N^0.86, where N is the number of compute nodes used, out to 48 nodes, and perhaps a bit better than that out to 32 nodes. On the older IB-connected Lonestar cluster, it's still N^0.81 out to 32 nodes. Our local GigE cluster scales as perhaps N^0.6, but there aren't enough nodes to pin the exponent down well.
Last fiddled with by frmky on 2010-07-18 at 03:42
#44

Jun 2003
Ottawa, Canada
3×17×23 Posts
I've finally built msieve with MPI=1 after fighting through some issues. Our mpicc uses Intel's ICC compiler, so I'm not even sure whether this will work; the Opteron systems use pathcc, which was less difficult to get to compile.
Are there any special command-line options or flags for the MPI version? There isn't anything in the readme or in -h that I could see. Do I just use the command line as I normally would and tell my MPI launcher to use, say, 16 nodes, with msieve automatically figuring out how many ranks there are, or do I need to use -t 16 as well to tell it there will be 16 "threads"? Greg, can you post the command line you use for post-processing so I can see an example? Thanks.
Similar Threads

| Thread | Thread Starter | Forum | Replies | Last Post |
| block wiedemann and block lanczos | ravlyuchenko | Msieve | 5 | 2011-05-09 13:16 |
| Why is lanczos hard to distribute? | Christenson | Factoring | 39 | 2011-04-08 09:44 |
| Block Lanczos with a reordering pass | jasonp | Msieve | 18 | 2010-02-07 08:33 |
| Lanczos error | Andi47 | Msieve | 7 | 2009-01-11 19:33 |
| Msieve Lanczos scalability | Jeff Gilchrist | Msieve | 1 | 2009-01-02 09:32 |