mersenneforum.org > Factoring Projects > Msieve
Old 2010-06-23, 01:28   #23
jasonp
Tribal Bullet

Oct 2004

3,541 Posts

Quote:
Originally Posted by R.D. Silverman View Post
I was not aware that the msieve code was 4x faster.
Do you know the cause of the difference?
I've never seen the CWI code, and AFAIK only Paul has run the two packages side-by-side on the same initial dataset (I've lost the email with his results).

If the CWI code is like the GGNFS code, then about half the speedup comes from rearranging the matrix entries into a block structure that allows somewhat better cache efficiency. The other half of the speedup boils down to using a little assembly code that uses a single MMX register. The reduction in the number of memory operations lets the processor buffer many more of them and leads to a big improvement.
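To make the cache-blocking idea concrete, here is a minimal sketch (illustrative only, not msieve's actual code; all names and sizes are mine) of the core operation in block Lanczos over GF(2): multiplying a sparse 0/1 matrix by a block of 64 vectors stored as one 64-bit word per position, where each nonzero contributes an XOR. Bucketing the nonzeros by column block means each pass touches only a small, cache-resident slice of x:

```python
# Hypothetical sketch of cache-blocked GF(2) sparse matrix * 64-vector block.
# Each nonzero (r, c) XORs word x[c] into y[r]; Python ints stand in for
# 64-bit machine words.
import random

def spmv_naive(entries, x, nrows):
    # Straight traversal: accesses of x jump all over memory.
    y = [0] * nrows
    for r, c in entries:
        y[r] ^= x[c]
    return y

def spmv_blocked(entries, x, nrows, block=1024):
    # Bucket nonzeros by column block, then process one block at a time,
    # so each pass reads only `block` consecutive words of x.
    buckets = {}
    for r, c in entries:
        buckets.setdefault(c // block, []).append((r, c))
    y = [0] * nrows
    for b in sorted(buckets):
        for r, c in buckets[b]:
            y[r] ^= x[c]
    return y

random.seed(1)
nrows = ncols = 5000
entries = [(random.randrange(nrows), random.randrange(ncols))
           for _ in range(50000)]
x = [random.getrandbits(64) for _ in range(ncols)]
# Same result either way; only the memory access pattern differs.
assert spmv_naive(entries, x, nrows) == spmv_blocked(entries, x, nrows)
```

XOR is associative and commutative, so reordering the nonzeros cannot change the result; the blocked order just trades one pass over the matrix for far fewer cache misses on x.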
Old 2010-06-24, 01:06   #24
jasonp
Tribal Bullet

Oct 2004

3,541 Posts

Just fixed the multithreading problems, if anyone has a spare cluster to test with :)
Old 2010-06-28, 19:01   #25
pancoast.3

Jun 2010

3 Posts

alternate approach

I am new to this subject, but I have solid knowledge of C++, Linux, and cryptography. I have been doing some reading on cluster computing, and I noticed that MPI seems to be the favored clustering solution here.

What do you think about the information on this page:
http://trac.nchc.org.tw/grid/wiki/krg_DRBL

It uses PXE to remote-boot the machines and combine them into a virtual SMP machine. I am curious whether this information is useful enough to piece together another possible approach to this problem.


I have about 20 machines networked together, and I am trying to find a good starting point for my cluster.
[Merged in moderation]

rob pancoast
ECENG-BS
Old 2010-06-29, 00:48   #26
jasonp
Tribal Bullet

Oct 2004

3,541 Posts

I have neither the patience nor the hardware and power budget to actually build a cluster, SSI or otherwise. You can see how such a system would perform by just running a single msieve instance with a huge number of threads, far more than would comfortably fit on one of the nodes. But I'm not confident it would do better than MPI, since the MPI version is designed to segregate the working set appropriately.
Old 2010-06-29, 01:19   #27
frmky

Jul 2003
So Cal

2×3⁴×13 Posts

Quote:
Originally Posted by pancoast.3 View Post
It uses PXE to remote boot and combine the machines together into a virtual SMP machine. I am curious if this information is useful enough to try to piece together another possible approach to this problem.
This may work well for small clusters. Larger clusters all use MPI for communication between nodes, threads or OpenMP for multithreaded applications within a single node, and a job queue. I set up a similar arrangement (without the virtual SMP) for our small cluster here using CAOS Linux. Our compute nodes are diskless and use PXE for booting, and CAOS makes that easy to set up.
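As a toy illustration of that two-level split (hypothetical, not frmky's actual setup): the work, say the rows of a matrix, is first partitioned into contiguous slices, one per MPI rank, and each node then subdivides its slice among its local threads:

```python
# Hypothetical sketch of a two-level decomposition: MPI ranks own contiguous
# row slices; each rank splits its slice again among local threads.
def partition(n, parts):
    # Near-even contiguous ranges, a typical block row distribution.
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

nrows = 9_140_582                       # size taken from the log above
node_slices = partition(nrows, 8)       # one slice per MPI rank
thread_slices = [partition(hi - lo, 8)  # 8 threads within each node
                 for lo, hi in node_slices]

# The slices tile the whole row range with no gaps or overlaps.
assert node_slices[0][0] == 0 and node_slices[-1][1] == nrows
assert all(a[1] == b[0] for a, b in zip(node_slices, node_slices[1:]))
```

The point of the hierarchy is that threads within a node share memory and need no message passing, while the only cross-node traffic is for data that crosses slice boundaries.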
Old 2010-06-29, 01:24   #28
frmky

Jul 2003
So Cal

2·3⁴·13 Posts

Here's a run using 8 nodes, 8 cores on each node (64 total cores), and Infiniband interconnect on a 9.1M matrix:

Code:
Mon Jun 28 19:50:24 2010  Msieve v. 1.46
Mon Jun 28 19:50:24 2010  random seeds: a4a5c0c7 82022f64
Mon Jun 28 19:50:24 2010  MPI process 0 of 8
Mon Jun 28 19:50:24 2010  factoring 294802397078927227288585541386814044901913163659871493128870987573325727083092714349137568636824376635906641811553077082126879354347764678937922257093212975292094986773568248621375431629 (186 digits)
Mon Jun 28 19:50:26 2010  no P-1/P+1/ECM available, skipping
Mon Jun 28 19:50:26 2010  commencing number field sieve (186-digit input)
Mon Jun 28 19:50:26 2010  R0: -4978518112499354698647829163838661251242411
Mon Jun 28 19:50:26 2010  R1:  1
Mon Jun 28 19:50:26 2010  A0:  1
Mon Jun 28 19:50:26 2010  A1:  1
Mon Jun 28 19:50:26 2010  A2:  1
Mon Jun 28 19:50:26 2010  A3:  1
Mon Jun 28 19:50:26 2010  A4:  1
Mon Jun 28 19:50:26 2010  A5:  1
Mon Jun 28 19:50:26 2010  A6:  1
Mon Jun 28 19:50:26 2010  skew 1.00, size 1.447e-12, alpha 2.428, combined = 1.525e-13 rroots = 0
Mon Jun 28 19:50:26 2010  
Mon Jun 28 19:50:26 2010  commencing linear algebra
Mon Jun 28 19:50:47 2010  matrix is 9140582 x 1045213 (485.8 MB) with weight 137297278 (131.36/col)
Mon Jun 28 19:50:47 2010  sparse part has weight 115841864 (110.83/col)
Mon Jun 28 19:50:47 2010  saving the first 48 matrix rows for later
Mon Jun 28 19:50:48 2010  matrix is 9140534 x 1045213 (466.0 MB) with weight 119000049 (113.85/col)
Mon Jun 28 19:50:48 2010  sparse part has weight 111698868 (106.87/col)
Mon Jun 28 19:50:48 2010  matrix includes 64 packed rows
Mon Jun 28 19:50:52 2010  using block size 65536 for processor cache size 4096 kB
Mon Jun 28 19:51:20 2010  commencing Lanczos iteration (8 threads)
Mon Jun 28 19:51:20 2010  memory use: 939.9 MB
Mon Jun 28 19:51:21 2010  restarting at iteration 633 (dim = 40040)
Mon Jun 28 19:51:44 2010  linear algebra at 0.4%, ETA 76h29m
Old 2010-07-01, 02:12   #29
pancoast.3

Jun 2010

3₈ Posts

decisions

OK, so I have come to the conclusion that MPI is more powerful. The only problem I face is that I have a few quad-core P4s, some dual-core P4s and Athlons, and some old P4s with hyperthreading. It appears that MPICH2 allows for more control but requires that the cluster be composed of homogeneous platforms. I think I am going to have to make the best of things by using MPICH1. What do you think?
Old 2010-07-01, 08:25   #30
frmky

Jul 2003
So Cal

2,106 Posts

OpenMPI works in heterogeneous environments and supports MPI 2.1.

Last fiddled with by frmky on 2010-07-01 at 08:26
Old 2010-07-02, 18:46   #31
frmky

Jul 2003
So Cal

2×3⁴×13 Posts

Quote:
Originally Posted by frmky View Post
Here's a run using 8 nodes, 8 cores on each node (64 total cores), and Infiniband interconnect on a 9.1M matrix:
An update: after further benchmarking, I discovered that using only 6 nodes and only 4 cores on each node actually gave the same runtime. So the LA was run on 6 nodes of the Infiniband-connected Abe cluster at NCSA, U. of Illinois, and completing a 9.1 million, nearly-square matrix with a weight of 1.1 billion took only 70 hours!
Old 2010-07-02, 20:50   #32
Andi47

Oct 2004
Austria

9B2₁₆ Posts

Quote:
Originally Posted by frmky View Post
Here's a run using 8 nodes, 8 cores on each node (64 total cores), and Infiniband interconnect on a 9.1M matrix:

Code:
Mon Jun 28 19:50:24 2010  Msieve v. 1.46
[...]
Such a matrix in 70 hours? WOW! And kudos to jasonp!

BTW: Did I get this right - this was an SNFS-257 from 11^287-1?
Old 2010-07-02, 21:00   #33
jasonp
Tribal Bullet

Oct 2004

3,541 Posts

Thanks. Note that the cluster nodes here are fairly serious big iron; on Greg's local cluster using gigabit ethernet, the current code spends half its time broadcasting vectors over the network. I'm working on that too.
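A back-of-envelope sketch of why broadcasts can dominate on a slower interconnect (every number below is invented for illustration; real per-iteration costs depend on the MPI grid, the matrix layout, and the separate A and A^T passes):

```python
# Toy model: per iteration each node receives the full vector block
# (8 bytes per matrix row) and does one XOR-accumulate per nonzero.
# All constants are made up for illustration.
def comm_fraction(nrows, row_weight, net_bytes_per_s, xor_per_s):
    comm = nrows * 8 / net_bytes_per_s        # broadcast time per node
    compute = nrows * row_weight / xor_per_s  # local SpMV time
    return comm / (comm + compute)

# 9.1M rows, ~120 nonzeros per row, ~1e9 XOR-accumulates/s per node.
gige = comm_fraction(9_100_000, 120, 100e6, 1e9)  # ~100 MB/s ethernet
ib   = comm_fraction(9_100_000, 120, 1e9, 1e9)    # ~1 GB/s Infiniband

# A 10x slower network pushes the communication share from well under
# 10% to a large fraction of each iteration.
assert ib < 0.1 < gige < 1.0
```

Under these made-up constants the gigabit case spends roughly 40% of each iteration waiting on the broadcast, which is at least the right order of magnitude for the "half its time" observation above; with Infiniband the same traffic is a small fraction of the iteration.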