mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Msieve (https://www.mersenneforum.org/forumdisplay.php?f=83)
-   -   Msieve and multiprocessors? (https://www.mersenneforum.org/showthread.php?t=13168)

wblipp 2010-03-13 07:59

Msieve and multiprocessors?
 
Someone is considering running a large NFS processing job on a university supercomputer that has nodes of quad-core processors running CentOS, a Red Hat Linux derivative. He has been advised that msieve cannot distribute itself across multiple processors, so he cannot use more than one node and should not set the "-t" flag to more than 4.

Is that right? When I use programs on Windows SMP machines that have multiple physical processors with multiple cores per processor, the operating system appears to spread processing among physical processors without "the program distributing itself." I don't know much about CentOS, but I naively expect that "using two or three nodes" would be configured to look like SMP for the user, and "-t 8" or "-t 12" should work.

Am I just too naive?

William

jasonp 2010-03-13 11:16

Others here have a lot more experience with highly parallel LA runs than I do, but the last time anyone tried large numbers of threads on a highly multi-core machine it didn't work out too well. I suspect that the scaling of the LA currently depends first on the bisection bandwidth of the machine, and then on the number of bus wires that can be brought to bear.

With sparse matrix-vector multiplies the ratio of computation to memory access is somewhat low, so the scaling to lots of machines is going to be problematic.
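As a back-of-the-envelope illustration of that ratio (this is not msieve code; it's a generic CSR multiply over floats, whereas NFS matrices are actually over GF(2), making the arithmetic-per-byte ratio even lower):

```python
# Sketch only: each stored nonzero costs ~2 flops (one multiply-add) but
# moves an index (4 bytes) plus a value (8 bytes), plus a scattered read
# of x -- so memory traffic, not arithmetic, dominates.
def csr_matvec(indptr, indices, data, x):
    """Plain CSR sparse matrix-vector multiply y = A*x."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]  # 2 flops, ~12+ bytes moved
        y[row] = acc
    return y

# Tiny example: [[1, 0, 2], [0, 3, 0]] times [1, 1, 1]
indptr, indices, data = [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]
print(csr_matvec(indptr, indices, data, [1.0, 1.0, 1.0]))  # [3.0, 3.0]
```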

ET_ 2010-03-13 11:59

On the other hand,

1 - if the cluster is configured like a Beowulf one, and
2 - appears to the application program as a bunch of different CPUs, or
3 - the application program is MP aware,

the sieving phase should take advantage of the cluster configuration.

Luigi

jasonp 2010-03-13 18:11

There are parallel processing libraries that attempt to make multiple computers connected by a 'slow' interconnect appear like a single giant computer; I'm thinking primarily of PVM. But PVM does best on serious big iron, where the abstraction of a single machine isn't so far from the reality of memory latencies to remote nodes.

MPI is better for this type of problem; it forces the programmer to account for where the data begins and ends, and for how it gets there. In the case of NFS linear algebra there's no benefit to be gained from dynamically changing where data resides; it's a regular series of operations whose data placement can be statically determined.
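A minimal sketch of what "statically determined" placement means here: the matrix rows are split once, up front, into fixed contiguous blocks, one per MPI rank (a hypothetical helper for illustration, not msieve's actual partitioning scheme):

```python
def static_row_ranges(n_rows, n_ranks):
    """Split rows into near-equal contiguous blocks, one per rank.
    Decided once before the run starts; no dynamic migration needed."""
    base, extra = divmod(n_rows, n_ranks)
    ranges, start = [], 0
    for r in range(n_ranks):
        size = base + (1 if r < extra else 0)  # first `extra` ranks get one more row
        ranges.append((start, start + size))
        start += size
    return ranges

print(static_row_ranges(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```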

Someday I'll be able to spend the time to make it happen.

frmky 2010-03-13 18:49

I've got a hacked-up MPI version of gnfs-lasieve that I've occasionally used. It's really the minimum set of changes necessary, and it's quite non-optimal. It simply divides the range of q up evenly among the cpus and sets them to work. This means that some cpus finish earlier than others and sit idle waiting for the rest to finish, even on homogeneous clusters. Somewhat better would be to dispatch an equal number of special-q's to each CPU, but this too only works on homogeneous clusters. Best would be to hand out small ranges for each computer to process, giving out new ones as each finishes. I'll hunt down and post this file later today.
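The "hand out small ranges" scheme can be sketched like this, with threads standing in for compute nodes and a dummy `sieve_range` placeholder in place of the real siever (names here are illustrative, not from the patch):

```python
import queue
import threading

def make_q_ranges(q_start, q_end, chunk):
    """Queue of small special-q ranges; a node pulls a new one as it finishes."""
    work = queue.Queue()
    for lo in range(q_start, q_end, chunk):
        work.put((lo, min(lo + chunk, q_end)))
    return work

def worker(work, results, sieve_range):
    """Pull ranges until the queue is empty (a stand-in for one compute node)."""
    while True:
        try:
            lo, hi = work.get_nowait()
        except queue.Empty:
            return
        results.append(sieve_range(lo, hi))  # list.append is thread-safe in CPython

# Demo with 3 "nodes" and a dummy siever that just records its range:
work, results = make_q_ranges(1_000_000, 1_000_100, 30), []
nodes = [threading.Thread(target=worker,
                          args=(work, results, lambda lo, hi: (lo, hi)))
         for _ in range(3)]
for n in nodes: n.start()
for n in nodes: n.join()
print(sorted(results))
```

Fast nodes simply come back for more work, so nothing sits idle until the queue itself runs dry, which is why this works on heterogeneous clusters too.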

If you can rsh or ssh into the compute nodes from the head node without a password, then it's perhaps easier to modify the for loop in the script to use ssh to launch the standard gnfs-lasieve binary on the remote host. Just modify it to run
ssh remotehost <standard command>
where remotehost changes appropriately for each loop iteration. You'll need to modify the output file name for each, then recombine them in the next step. This definitely only works for homogeneous clusters, and it also suffers from the fact that some nodes finish earlier than others, but it's what I most often use.
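That loop could be sketched in Python as follows, building one ssh command per host. The host names, q-range arithmetic, binary name, and siever flags here are illustrative assumptions, so check them against your siever's usage text before running anything:

```python
import subprocess

def build_commands(hosts, q0, chunk, binary="./gnfs-lasieve4I14e"):
    """One ssh command per host, each with its own q range and output file.
    Flag names (-f first q, -c count, -o output) are assumptions to verify."""
    cmds = []
    for i, host in enumerate(hosts):
        lo = q0 + i * chunk
        remote = f"{binary} -f {lo} -c {chunk} -o out.{host} job.poly"
        cmds.append(["ssh", host, remote])
    return cmds

cmds = build_commands(["node1", "node2"], 10_000_000, 50_000)
print(cmds)
# To actually launch (requires passwordless ssh, as the post says):
# procs = [subprocess.Popen(c) for c in cmds]
# for p in procs: p.wait()
# then recombine the out.* files for the next step
```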

xilman 2010-03-13 19:22

[quote=frmky;208287]I've got a hacked up MPI version of gnfs-lasieve that I've occasionally used. It's really the minimum set of changes necessary, and it's quite non-optimal...[/quote]Many years ago I wrote an almost trivial client/server harness for NFS sievers. You can tell how old it is from its name --- cabald/cabalc --- and it originally ran line sievers. Its most recent application was for sieving RSA-768. I find that it's pretty close to optimal, from a compute-time perspective if not from its intrinsic elegance.

Is anyone interested in it? The source really is almost trivial. Anyone who wants to improve it is welcome to try.

Parallelizing the sieving is very very easy, as is finding polynomials and extracting square roots. It's the linear algebra stages that are distinctly non-trivial to parallelize efficiently.

Paul

frmky 2010-03-14 08:03

1 Attachment(s)
Attached is the diff for adding MPI to the lattice sieve. This is a really old version of the source, so it probably won't apply cleanly, but there are few enough changes, with enough context, to track down and apply them manually if necessary. Don't use the -k switch with this, as I didn't modify the name of the fb file (I didn't bother at the time since I don't use the -k switch), and the processes will end up stomping on each other's file.

wblipp 2010-03-14 08:50

The OP job has already been sieved. Only post-processing is being planned for the university's supercomputer. Must he stick to a single quad-core processor? Should he?

10metreh 2010-03-14 12:59

[QUOTE=wblipp;208346]The OP job has already been sieved.[/QUOTE]

If OP = odd perfect, is this sigma(2801^78)?

debrouxl 2010-03-14 13:15

wblipp: if the matrix is larger than, say, 10Mx10M, and you want the result ASAP, you'll [i]probably[/i] want to run the linear algebra on a 6-core or 8-core computer instead.

jasonp 2010-03-14 13:27

There isn't any harm in trying msieve with lots of threads; the on-disk storage format is independent of how the matrix is partitioned in a multithreaded run, so you can restart with different values of -t and not lose any work figuring out the best number of threads. In a batch environment, you can add '-d <num>' so the binary will halt itself after a preset time.
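A sketch of that experiment: do a deadline-limited run at each thread count, time it, and keep the fastest. The -nc2 (linear algebra), -t, and -d flags come from the posts above; the binary path and the timing numbers in the example are assumptions:

```python
def msieve_cmd(threads, deadline_min):
    """Command line for one timed LA attempt: -t sets the thread count,
    -d makes the binary checkpoint and halt after deadline_min minutes."""
    return ["./msieve", "-nc2", "-t", str(threads), "-d", str(deadline_min)]

def best_thread_count(timings):
    """Given {threads: seconds for a fixed amount of work}, pick the fastest.
    Safe because the on-disk format is independent of -t, so restarts lose nothing."""
    return min(timings, key=timings.get)

print(msieve_cmd(8, 60))
print(best_thread_count({4: 900.0, 8: 610.0, 12: 640.0}))  # 8
```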

