#23
Tribal Bullet
Oct 2004
3,541 Posts
Quote:
If the CWI code is like the GGNFS code, then about half the speedup comes from rearranging the matrix entries into a block structure that allows somewhat better cache efficiency. The other half of the speedup boils down to using a little assembly code that uses a single MMX register. The reduction in the number of memory operations lets the processor buffer many more of them and leads to a big improvement.
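A minimal sketch of the idea behind both halves of that speedup (hypothetical layout, not the actual CWI or GGNFS data structures): in block Lanczos over GF(2), each vector element packs 64 bit-vectors into one 64-bit word, so a single XOR applies a matrix entry to 64 vectors at once (this is what the MMX register buys on a 32-bit CPU), and pre-sorting the entries into blocks of nearby rows keeps the accumulators resident in cache.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Hypothetical sketch of a GF(2) sparse matrix-vector product.
 * Each x[col] / y[row] element is a 64-bit word holding one bit from
 * each of 64 GF(2) vectors, so one XOR processes all 64 at once.
 * In a cache-blocked layout the entry list is pre-sorted so that
 * consecutive entries touch a small range of rows, keeping the y[]
 * accumulators in cache instead of thrashing main memory. */
typedef struct { uint32_t row, col; } entry_t;

void spmv_gf2(const entry_t *e, size_t nnz,
              const uint64_t *x, uint64_t *y, size_t nrows)
{
    for (size_t i = 0; i < nrows; i++)
        y[i] = 0;
    /* one XOR per nonzero entry: y[row] += x[col] over GF(2) */
    for (size_t i = 0; i < nnz; i++)
        y[e[i].row] ^= x[e[i].col];
}
```

The memory-traffic win comes entirely from the access pattern: the inner loop is the same either way, but with blocked entries the stores to `y[]` hit cache lines that are already resident.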
#24
Tribal Bullet
Oct 2004
3,541 Posts
Just fixed the multithreading problems, if anyone has a spare cluster to test with :)
#25
Jun 2010
3 Posts |
I am new to this subject, but I have solid knowledge of C++, Linux, and cryptography. I have been doing some reading on cluster computing, and I noticed that MPI seems to be the favored clustering solution here.

What do you think about the information on this page: http://trac.nchc.org.tw/grid/wiki/krg_DRBL ? It uses PXE to remote-boot the machines and combine them into a virtual SMP machine. I am curious whether this information is useful enough to piece together another possible approach to this problem. I have about 20 machines networked together and am trying to find a good starting point for my cluster.

[Merged in moderation] rob pancoast ECENG-BS
#26
Tribal Bullet
Oct 2004
3,541 Posts
I have neither the patience nor the hardware and power budget to actually build a cluster, SSI or otherwise. You can see how such a system would perform by just running a single msieve instance with a huge number of threads, far more than would comfortably fit on one of the nodes. But I'm not confident that would do better than using MPI, since the MPI version is designed to segregate the working set appropriately.
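For reference, that experiment is just the stock binary with an oversized thread count; a sketch of the invocation (binary path and thread count are placeholders):

```shell
# Run only the linear algebra stage (-nc2) of an existing job with far
# more threads than one machine's cores, to mimic how an SSI "virtual
# SMP" cluster would behave on a single oversubscribed box.
./msieve -nc2 -t 64 -v
```

Oversubscription like this mostly measures how badly the working set thrashes when it no longer fits one node's caches and memory, which is exactly the cost an SSI cluster would pay over its interconnect.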
#28
Jul 2003
So Cal
2,106 Posts
Here's a run on a 9.1M matrix using 8 nodes, 8 cores on each node (64 cores total), over an Infiniband interconnect:
Code:
Mon Jun 28 19:50:24 2010  Msieve v. 1.46
Mon Jun 28 19:50:24 2010  random seeds: a4a5c0c7 82022f64
Mon Jun 28 19:50:24 2010  MPI process 0 of 8
Mon Jun 28 19:50:24 2010  factoring 294802397078927227288585541386814044901913163659871493128870987573325727083092714349137568636824376635906641811553077082126879354347764678937922257093212975292094986773568248621375431629 (186 digits)
Mon Jun 28 19:50:26 2010  no P-1/P+1/ECM available, skipping
Mon Jun 28 19:50:26 2010  commencing number field sieve (186-digit input)
Mon Jun 28 19:50:26 2010  R0: -4978518112499354698647829163838661251242411
Mon Jun 28 19:50:26 2010  R1: 1
Mon Jun 28 19:50:26 2010  A0: 1
Mon Jun 28 19:50:26 2010  A1: 1
Mon Jun 28 19:50:26 2010  A2: 1
Mon Jun 28 19:50:26 2010  A3: 1
Mon Jun 28 19:50:26 2010  A4: 1
Mon Jun 28 19:50:26 2010  A5: 1
Mon Jun 28 19:50:26 2010  A6: 1
Mon Jun 28 19:50:26 2010  skew 1.00, size 1.447e-12, alpha 2.428, combined = 1.525e-13 rroots = 0
Mon Jun 28 19:50:26 2010
Mon Jun 28 19:50:26 2010  commencing linear algebra
Mon Jun 28 19:50:47 2010  matrix is 9140582 x 1045213 (485.8 MB) with weight 137297278 (131.36/col)
Mon Jun 28 19:50:47 2010  sparse part has weight 115841864 (110.83/col)
Mon Jun 28 19:50:47 2010  saving the first 48 matrix rows for later
Mon Jun 28 19:50:48 2010  matrix is 9140534 x 1045213 (466.0 MB) with weight 119000049 (113.85/col)
Mon Jun 28 19:50:48 2010  sparse part has weight 111698868 (106.87/col)
Mon Jun 28 19:50:48 2010  matrix includes 64 packed rows
Mon Jun 28 19:50:52 2010  using block size 65536 for processor cache size 4096 kB
Mon Jun 28 19:51:20 2010  commencing Lanczos iteration (8 threads)
Mon Jun 28 19:51:20 2010  memory use: 939.9 MB
Mon Jun 28 19:51:21 2010  restarting at iteration 633 (dim = 40040)
Mon Jun 28 19:51:44 2010  linear algebra at 0.4%, ETA 76h29m
#29
Jun 2010
3 Posts
OK, so I have come to the conclusion that MPI is more powerful. The only problem I face is that I have a few quad-core P4s, some dual-core P4s and Athlons, and some old P4s with hyperthreading. It appears that MPICH2 allows for more control but requires that the cluster be composed of homogeneous platforms. I think I am going to have to make the best of things by using MPICH1. What do you think?
#30
Jul 2003
So Cal
210610 Posts |
OpenMPI works in heterogeneous environments and supports MPI 2.1.
Last fiddled with by frmky on 2010-07-01 at 08:26
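For a mixed set of machines, a minimal Open MPI setup is just a hostfile plus `mpirun`; a sketch, with hostnames and slot counts as placeholders for the machines above:

```shell
# Hostfile: one line per machine, slots = how many MPI processes
# (roughly, cores) to run on it.
cat > hosts <<EOF
quad1  slots=4
dual1  slots=2
p4ht1  slots=1
EOF

# Launch one MPI process per slot. Open MPI copes with the
# heterogeneous mix as long as each node has a compatible build
# of both Open MPI and the application installed.
mpirun --hostfile hosts -np 7 ./msieve -nc2 -t 1 -v
```

Giving the faster machines more slots is a crude form of load balancing; the Lanczos iteration is synchronous, so the whole cluster runs at the pace of its slowest process.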
#31
Jul 2003
So Cal
2,106 Posts
An update... After further benchmarking, I discovered that using only 6 nodes and only 4 cores on each node actually gave the same runtime. So the LA was run on 6 nodes of the Infiniband-connected Abe cluster at NCSA, U. of Illinois, and completing a 9.1 million, nearly-square matrix with a weight of 1.1 billion took only 70 hours!
#32
Oct 2004
Austria
2,482 Posts
And kudos to JasonP!

BTW: Did I get this right - this was an SNFS-257 from 11^287-1?
#33
Tribal Bullet
Oct 2004
3,541 Posts
Thanks. Note that the cluster nodes here are fairly serious big iron; on Greg's local cluster, which uses gigabit ethernet, the current code spends half its time broadcasting vectors over the network. I'm working on that too.
Similar Threads
| Thread | Thread Starter | Forum | Replies | Last Post |
| block wiedemann and block lanczos | ravlyuchenko | Msieve | 5 | 2011-05-09 13:16 |
| Why is lanczos hard to distribute? | Christenson | Factoring | 39 | 2011-04-08 09:44 |
| Block Lanczos with a reordering pass | jasonp | Msieve | 18 | 2010-02-07 08:33 |
| Lanczos error | Andi47 | Msieve | 7 | 2009-01-11 19:33 |
| Msieve Lanczos scalability | Jeff Gilchrist | Msieve | 1 | 2009-01-02 09:32 |