mersenneforum.org big job planning

2010-08-06, 08:45   #1
henryzz
Just call me Henry

"David"
Sep 2007
Cambridge (GMT/BST)

1011100010111₂ Posts
big job planning

Quote:
 Originally Posted by frmky
 NFS@Home has completed 5,448+ by SNFS. A 16.1M matrix was solved using 64 computers (256 cores) in a bit under 41 hours. The log is attached.
 Code:
 prp65 factor: 23371863775658144623538828456854573496104607906605333794952273409
 prp141 factor: 698965867837568984299398033395117263633121275562035049049008890847786442548362928940407820059804253774285034574979465129314235059775249213313
Have you any clue how long that would have taken on four cores of one PC? What is the efficiency of running on 64 computers?

 2010-08-06, 09:24 #2 fivemack (loop (#_fork))     Feb 2006 Cambridge, England 14441₈ Posts I think that would have taken something like 600 hours on my four-core i7 machine, interpolating from matrices of that sort of size that I have run, so about 2,500 core-hours, versus the 10,500 core-hours that it took on the cluster.
 2010-08-06, 10:16 #3 Batalov     "Serge" Mar 2008 Phi(4,2^7658614+1)/2 3²×1,061 Posts Sounds about right to me. Exactly the same-sized matrix (16.1M) is running on a 1055T Phenom right now for an estimated 520 hours (6 threads @ 3640 MHz).
 2010-08-06, 17:08 #4 henryzz Just call me Henry     "David" Sep 2007 Cambridge (GMT/BST) 1011100010111₂ Posts So MPI at that scale is only 25% efficient. Is that the Infiniband or the Gigabit cluster? I am guessing it must be the Infiniband. Would this method be usable to help extend the forum-record GNFS? @fivemack Would your machines support MPI? Would it be useful to run a forum-sized factorization's LA over all your PCs, or would it tie them all up for too long at once given your number of machines?
 2010-08-06, 17:24 #5 fivemack (loop (#_fork))     Feb 2006 Cambridge, England 7·919 Posts My machines are connected via gigabit using a slow switch; that, and the fact that they are all of different spec (Core2 quad, Phenom quad, i7 quad, dual-Shanghai quad), makes me rather unkeen on using MPI. If I want to run very large jobs, I would get a quad-MagnyCours machine; it's expensive, but it costs rather less than eight quad-cores connected with Infiniband would, and I would expect it to work rather faster and use significantly less electricity, so it would not require awkward cooling measures. I rang Tyan this morning and they say the S8812 motherboards will start being reasonably available in September. Probably I would have to run MPI internally rather than a single job with -t 32.
2010-08-06, 17:35   #6
frmky

Jul 2003
So Cal

3²×241 Posts

Quote:
 Originally Posted by henryzz So MPI at that scale is only 25% efficient. Is that the Infiniband or the Gigabit cluster? I am guessing it must be the Infiniband. Would this method be usable to help extend the forum-record GNFS? @fivemack Would your machines support MPI? Would it be useful to run a forum-sized factorization's LA over all your PCs, or would it tie them all up for too long at once given your number of machines?
These are Core 2 based computers, so a more appropriate 4-core estimate would be about 850 hours, or 3,400 CPU-hours. So the efficiency is closer to 1/3. I previously found scaling on this cluster up to 16 nodes to be about N^0.81, but at 64 nodes it's a bit worse than that: 64^(0.81-1) ≈ 45%.
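The efficiency arithmetic above can be double-checked with a few lines of Python (a minimal sketch using only the figures quoted in this thread: an estimated 850 hours on one 4-core Core 2 node, versus 41 hours of wall time on 64 four-core nodes):

```python
# Parallel-efficiency check for the figures quoted in this thread.
single_node_hours = 850      # estimated wall time on one 4-core Core 2
cores_per_node = 4
cluster_nodes = 64
cluster_wall_hours = 41      # measured wall time on the 64-node cluster

single_core_hours = single_node_hours * cores_per_node                    # 3,400
cluster_core_hours = cluster_wall_hours * cluster_nodes * cores_per_node  # 10,496

print(f"measured efficiency: {single_core_hours / cluster_core_hours:.0%}")  # 32%

# Empirical scaling model: speedup grows like N^0.81 with node count N,
# so parallel efficiency falls like N^(0.81 - 1).
print(f"model efficiency at 64 nodes: {cluster_nodes ** (0.81 - 1):.0%}")    # 45%
```

The gap between the measured ~32% and the ~45% the 16-node fit predicts is the "a bit worse than that" in the post above.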

This is going to be used for record factorizations at NFS@Home. Once 12,254+ finishes in the next couple of weeks we will start 5,409-, an SNFS286. As far as I am aware, this will be a record factorization using open source software. Presuming that's successful, we will move up to an SNFS290 for the next one.

2010-08-06, 17:47   #7
frmky

Jul 2003
So Cal

3²×241 Posts

Quote:
 Originally Posted by fivemack If I want to run very large jobs, I would get a quad-MagnyCours machine; it's expensive, but it costs rather less than eight quad-cores connected with infiniband would, and I would expect it to work rather faster
I'm not so sure. msieve bottlenecks on accesses to main memory during the matrix multiplies. On dual-quad Core 2 computers with DDR2 memory, I get better speeds if I leave four of the cores idle to relieve the contention. Our K10 Barcelona computer, even though it's NUMA, does the same thing. The speed tops out at 16 MPI processes, even though the machine has 32 cores, and it is still considerably slower than a GigE-connected cluster of eight quad-cores. DDR3 will help, but not much. Eight separate computers have 8x the main memory bandwidth of one.
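The contention effect described above can be illustrated with a toy model (the per-core demand and bus bandwidth below are invented numbers for illustration, not measurements of any machine in this thread): once the cores' combined memory demand exceeds the bus, adding cores no longer adds throughput.

```python
# Toy model of a memory-bandwidth-bound matrix step. The per-core demand
# and bus bandwidth are made-up illustrative figures, NOT measurements
# of any machine discussed in this thread.
def throughput(n_cores, per_core_demand_gbs=4.0, bus_bandwidth_gbs=12.0):
    """Effective work rate (in 'core-equivalents') with n_cores on one bus."""
    demanded = n_cores * per_core_demand_gbs
    if demanded <= bus_bandwidth_gbs:
        return n_cores                                  # compute-bound
    return bus_bandwidth_gbs / per_core_demand_gbs      # bus-bound: capped

for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} cores -> {throughput(n):.1f} core-equivalents")
```

With these numbers the curve flattens at three core-equivalents, which is the shape of the behaviour reported above: past the saturation point, leaving the remaining cores idle costs almost no time.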

 2010-08-06, 18:11 #8 fivemack (loop (#_fork))     Feb 2006 Cambridge, England 7×919 Posts Eight separate Phenom machines have exactly the same main memory bandwidth as one quad-MagnyCours - each of the eight quad-core dice in the machine has its own pair of memory controllers connected to its own bank of memory. It's admittedly a bit awkward to have to upgrade memory in sixteens.
2010-08-06, 18:22   #9
frmky

Jul 2003
So Cal

3²·241 Posts

Quote:
 Originally Posted by fivemack each of the eight quad-core dice in the machine has its own pair of memory controllers connected to its own bank of memory.
The same is true in the Barcelona, but unless they've significantly improved it, it doesn't work as well as you'd hope. Because it's still a shared memory architecture, there's a lot of chatter.

2010-08-06, 18:24   #10
R.D. Silverman

Nov 2003

2²·5·373 Posts

Quote:
 Originally Posted by frmky These are Core 2 based computers, so a more appropriate 4-core estimate would be about 850 hours, or 3,400 CPU-hours. So the efficiency is closer to 1/3. I previously found scaling on this cluster up to 16 nodes to be about N^0.81, but at 64 nodes it's a bit worse than that: 64^(0.81-1) ≈ 45%. This is going to be used for record factorizations at NFS@Home. Once 12,254+ finishes in the next couple of weeks we will start 5,409-, an SNFS286. As far as I am aware, this will be a record factorization using open source software. Presuming that's successful, we will move up to an SNFS290 for the next one.
2,964+, 2,961+, or 2,961- would be nice candidates...

However, I think that claiming a record simply based on being "open source" is a little presumptuous. (No offense intended.)

How about going after M1061? This would be a real record.

 2010-08-06, 18:48 #11 frmky     Jul 2003 So Cal 3²×241 Posts It's not presumptuous, it's simply a fact. While excellent lattice sieving code has been released, for which we are very appreciative, the postprocessing code has not. Jason has spent an enormous amount of time creating and releasing wonderful postprocessing code, and I believe his achievement should be highlighted. In addition, there is a huge difference, and therefore a new "benchmark," when an average member of the public can go out, spend $50K on computers, download the software from the internet, and do these factorizations on his own in reasonable time. I do plan to work up to a kilobit, but in small steps. Start with 286, then 290, then 295...
