#1

"William"
May 2003
New Haven
2×7×13² Posts
Someone is considering running a large NFS processing job on a university supercomputer whose nodes are quad-core processors running CentOS, a Red Hat Linux derivative. He has been advised that msieve cannot distribute itself across multiple processors, so he cannot use more than one node and should not set the "-t" flag to more than 4.

Is that right? When I use programs on Windows SMP machines that have multiple physical processors with multiple cores per processor, the operating system appears to spread the work among physical processors without "the program distributing itself." I don't know much about CentOS, but I naively expect that "using two or three nodes" would be configured to look like SMP to the user, and "-t 8" or "-t 12" should work. Am I just too naive?

William
#2

Tribal Bullet
Oct 2004
3,541 Posts
Others here have a lot more experience with highly parallel LA runs than I do, but the last time anyone tried large numbers of threads on a highly multi-core machine, it didn't work out too well. I suspect that the scaling of the LA currently depends first on the bisection bandwidth of the machine, and then on the amount of memory-bus bandwidth that can be brought to bear.

With sparse matrix-vector multiplies the ratio of computation to memory access is quite low, so scaling to lots of machines is going to be problematic.
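To make that ratio concrete, here is a minimal sketch of a sparse matrix-vector multiply in CSR format. This is a generic illustration over the integers, not msieve's actual block-Lanczos code (which works over GF(2)); all names here are invented:

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """y = A @ x for a sparse matrix A stored in CSR format.

    Each stored nonzero costs one multiply-add but at least two
    memory reads (the value, the column index, and an indexed load
    from x), which is why SpMV tends to be bound by memory
    bandwidth rather than by arithmetic.
    """
    n_rows = len(row_ptr) - 1
    y = [0] * n_rows
    for i in range(n_rows):
        acc = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]  # 1 multiply-add, ~3 loads
        y[i] = acc
    return y

# A 3x3 example:
#   [1 0 2]
#   [0 3 0]
#   [4 0 5]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1, 2, 3, 4, 5]
print(spmv_csr(row_ptr, col_idx, values, [1, 1, 1]))  # [3, 3, 9]
```

Splitting this loop across machines means every iteration still touches memory that may live on another node, which is where the interconnect bandwidth becomes the limiting factor.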
#3

Banned
"Luigi"
Aug 2002
Team Italia
3²×5×107 Posts
On the other hand, if

1 - the cluster is configured like a Beowulf one, and
2 - it appears to the application program as a bunch of different CPUs, or
3 - the application program is MP-aware,

then the sieving phase should take advantage of the cluster configuration.

Luigi
#4

Tribal Bullet
Oct 2004
3,541 Posts
There are parallel processing libraries that attempt to make multiple computers connected by a 'slow' interconnect appear like a single giant computer; I'm thinking primarily of PVM. But PVM does best on serious big iron, where the abstraction of a single machine isn't so far from the reality of memory latencies to remote nodes.

MPI is better for this type of problem; it forces the programmer to be explicit about where the data begins and ends, and about how it gets there. In the case of NFS linear algebra there's no benefit to be gained from dynamically changing where data resides; it's a regular series of operations whose data placement can be determined statically. Someday I'll be able to spend the time to make it happen.
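As a rough illustration of what "statically determined data placement" means, here is a sketch (names and layout are my own invention, not msieve's) of dividing matrix rows into contiguous blocks, one per MPI rank, computed once up front:

```python
def static_row_partition(n_rows, n_ranks):
    """Assign each rank a contiguous block of rows, balanced to
    within one row. Returns a list of (start, end) half-open ranges.

    Because the assignment depends only on n_rows and n_ranks,
    every rank can compute the full partition locally with no
    communication, and it never changes during the run.
    """
    base, extra = divmod(n_rows, n_ranks)
    ranges, start = [], 0
    for r in range(n_ranks):
        size = base + (1 if r < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

print(static_row_partition(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

In a real MPI code each rank would then own its block of the matrix and exchange only the vector entries its nonzeros touch; the point is simply that the communication pattern is fixed before the iteration begins.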
#5

Jul 2003
So Cal
2×3⁴×13 Posts
I've got a hacked-up MPI version of gnfs-lasieve that I've occasionally used. It's really the minimum set of changes necessary, and it's quite non-optimal. It simply divides the range of q up evenly among the CPUs and sets them to work. This means that some CPUs finish earlier than others and sit idle waiting for the rest to finish, even on homogeneous clusters. Somewhat better would be to dispatch an equal number of special-q's to each CPU, but this too only works on homogeneous clusters. Best would be to hand out small ranges for each computer to process, giving out new ones as each finishes. I'll hunt down and post this file later today.

If you can rsh or ssh into the compute nodes from the head node without a password, then it's perhaps easier to modify the for loop in the script to use ssh to launch the standard gnfs-lasieve binary on the remote host. Just modify it to run ssh remotehost <standard command>, where remotehost changes appropriately for each loop iteration. You'll need to modify the output file for each, then recombine them in the next step. This definitely only works for homogeneous clusters, and it also suffers from the fact that some nodes finish earlier than others, but it's what I most often use.
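The "hand out small ranges as each client finishes" scheme could look something like the toy simulation below. This is not the actual lasieve driver; the function, worker names, and q ranges are all invented for illustration:

```python
from collections import deque


def dispatch_ranges(q_start, q_end, chunk, workers, speed):
    """Simulate handing out small special-q ranges on demand.

    `speed[w]` is how many chunks worker w processes per round.
    Faster workers simply come back for more work, so no one sits
    idle while ranges remain -- unlike a static even split, where
    fast nodes finish early and wait for the slow ones.
    """
    queue = deque(range(q_start, q_end, chunk))
    done = {w: [] for w in workers}
    while queue:
        for w in workers:
            for _ in range(speed[w]):
                if not queue:
                    break
                lo = queue.popleft()
                done[w].append((lo, min(lo + chunk, q_end)))
    return done

# One worker twice as fast as the other: it ends up with twice
# as many ranges, and both stay busy until the queue is empty.
work = dispatch_ranges(1000, 1600, 100, ["fast", "slow"],
                       {"fast": 2, "slow": 1})
print(work)
```

A real implementation would replace the inner loop with a small server on the head node answering "give me the next range" requests from the compute nodes.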
#6

Bamboozled!
"Paul"
May 2003
Down not across
10,753 Posts
Quote:
Is anyone interested in it? The source really is almost trivial, and anyone who wants to improve it is welcome to try. Parallelizing the sieving is very, very easy, as is finding polynomials and extracting square roots. It's the linear algebra stages that are distinctly non-trivial to parallelize efficiently.

Paul
#7

Jul 2003
So Cal
2·3⁴·13 Posts
Attached is the diff for adding MPI to the lattice sieve. This is against a really old version of the source, so it probably won't apply cleanly, but there are few enough changes, and enough context, to track down and apply them manually if necessary. Don't use the -k switch with this, as I didn't modify the name of the fb file (I didn't bother at the time since I don't use the -k switch), and the processes will end up stomping on each other.
#8

"William"
May 2003
New Haven
2·7·13² Posts
The OP's job has already been sieved; only the post-processing is being planned for the university's supercomputer. Must he stick to a single quad-core processor? Should he?
#9

Nov 2008
2·3³·43 Posts
#10

Sep 2009
977 Posts
wblipp: if the matrix is larger than, say, 10M×10M and you want the result ASAP, you'll probably want to run the linear algebra on a 6-core or 8-core computer instead.
#11

Tribal Bullet
Oct 2004
3,541 Posts
There isn't any harm in trying msieve with lots of threads; the on-disk storage format is independent of how the matrix is partitioned in a multithreaded run, so you can restart with different values of -t and not lose any work while figuring out the best number of threads. In a batch environment, you can add '-d <num>' so that the binary halts itself after a preset time.
Last fiddled with by jasonp on 2010-03-14 at 13:28 |
Similar Threads

| Thread | Thread Starter | Forum | Replies | Last Post |
|--------|----------------|-------|---------|-----------|
| msieve on KNL | frmky | Msieve | 3 | 2016-11-06 11:45 |
| Msieve on a Mac (Help) | pxp | Msieve | 1 | 2013-02-28 14:56 |
| Using msieve with c | burrobert | Msieve | 9 | 2012-10-26 22:46 |
| msieve help | em99010pepe | Msieve | 23 | 2009-09-27 16:13 |
| fun with msieve | masser | Sierpinski/Riesel Base 5 | 83 | 2007-11-17 19:39 |