mersenneforum.org


Dubslow 2011-10-11 05:32

Supercomputer P95?
 
Dear Mods: If this belongs in a different forum, please move this.

Dear Programmers/Developers etc. (i.e. Prime95): If one wanted to use a couple of days of time on a supercomputer, of the PFLOPS variety, how much modification of the standard P95 would be required? I know they use LINPACK for the TOP500 list, and I know you can get LINPACK for desktops, but are they the same? Does something about whatever operating system they use make it possible to run standard desktop applications, or is some modification directly to the program required? I know at the very least for our purposes that to complete any LL tests in a couple of days, you'd need to run each test on many, many processors to get it done.

My reasons for asking do not at the moment have any practical purpose, though in a [URL="http://en.wikipedia.org/wiki/Blue_Waters"]year or two's time[/URL] there's a ridiculously small chance it might. For now, I'm just curious. :geek::whistle:

Christenson 2011-10-11 13:13

Well, P95 is actually a handful of programs rolled together --
LL testing
ECM testing
P-1 testing
TF testing
communications to Primenet
a GUI

Drop the GUI, you have mprime.....I'm pretty sure it will at least build and run correctly for a small number of cores on any supercomputer you may want to think of....though I don't think P95 goes off and carefully hand-optimises assembler every time a new instruction comes out the way he does on x86 architectures....

We have both mfaktc and CUDALucas, which run on GPUs...otherwise known as desktop supercomputing cards....

The problem is that the "real" supercomputers (Lomonosov cluster, Teragrid) offer something that is not needed to run P95 effectively -- large amounts of high-bandwidth, low-latency inter-processor communication. LL is a sequential algorithm -- you need Fourier transform after Fourier transform, with just a small subtraction between the steps... TF is well-parallelised on GPUs. P-1 spends much of its time sieving -- and that's been shown, even on factoring problems, to be parallelisable without much interprocessor communication. ECM is also quite easily parallelised... but probably inefficient compared to LL testing for proving compositeness of numbers >~2^20M.
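To make the "sequential" point concrete, here is a minimal sketch of the Lucas-Lehmer iteration in plain Python. It is an illustration only, not how Prime95 is written: the function name `lucas_lehmer` is made up for this example, and Python's built-in big-integer multiply stands in for the FFT-based squaring a real implementation would use.

```python
def lucas_lehmer(p: int) -> bool:
    """Return True if 2**p - 1 is prime. Assumes p itself is a prime exponent."""
    if p == 2:
        return True  # 2**2 - 1 = 3 is prime; the loop below needs p >= 3
    m = (1 << p) - 1  # the Mersenne number being tested
    s = 4
    for _ in range(p - 2):
        # Every step needs the s from the previous step, so the p - 2
        # squarings cannot be run side by side. In real software the
        # squaring is the big FFT multiply; the "- 2" is the small
        # subtraction between transforms.
        s = (s * s - 2) % m
    return s == 0


if __name__ == "__main__":
    for p in (3, 5, 7, 11, 13):
        print(p, lucas_lehmer(p))
```

The only parallelism available inside one test is within each squaring, which is why extra machines do not shorten the chain of p - 2 steps.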

Now, if you want to factor something, like 2^1061-1, you sieve in parallel, and at the end, you have this large pile of relations and you need to do some serious linear algebra...and the guys DO borrow supercomputers for a few hours for those final steps....because the linear algebra requires the interprocessor communications....

But given the link, I'm thinking you might get a small piece of one of these supercomputers on the cheap...one motherboard/blade with a few CPUs on it...which will probably be a very energy-efficient way of running mprime.

axn 2011-10-11 13:39

[QUOTE=Christenson;274109]P-1 spends much of its time sieving[/QUOTE]

No, it doesn't. You must be thinking of something else.

Mr. P-1 2011-10-11 14:05

P-1 is a bit like LL - Fourier transform after Fourier transform.

kjaget 2011-10-11 14:23

[QUOTE=Christenson;274109]Drop the GUI, you have mprime.....I'm pretty sure it will at least build and run correctly for a small number of cores on any supercomputer you may want to think of[/QUOTE]

If it's an x86-based supercomputer. Assembler isn't portable, and you'll likely even have problems because assembler syntax varies between different tool chains.

Check out mlucas for an example of relatively portable source code instead.

fivemack 2011-10-11 16:46

Basically, supercomputers are very bad at doing small Fourier transforms, and what prime95 does is nothing but a series of small Fourier transforms.

They're not great at doing large Fourier transforms - a 3D FFT that just about fits in the machine's memory will run at best with about 5% of the machine's peak flops - but small FFTs are hard to parallelise even on tightly-connected systems and essentially impossible on things as loose as supercomputers.

Running on a six-core i7/970 we have for 2048k FFT length

threads/cores  time
2/1            32.244
4/2            17.390
6/3            13.849
8/4            12.010
10/5           11.648
12/6           12.106

so there's really not much advantage running on more than three cores
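The arithmetic behind that conclusion can be sketched as follows. This assumes the second number in each pair above is the core count and takes the one-core figure as the baseline; the time unit does not matter since only ratios are used. The name `scaling` is just for this example.

```python
# Timings from the benchmark above, keyed by number of cores (assumed reading).
timings = {1: 32.244, 2: 17.390, 3: 13.849, 4: 12.010, 5: 11.648, 6: 12.106}


def scaling(timings):
    """Map cores -> (speedup over one core, efficiency per core)."""
    base = timings[1]
    return {cores: (base / t, base / t / cores) for cores, t in timings.items()}


if __name__ == "__main__":
    for cores, (speedup, efficiency) in scaling(timings).items():
        print(f"{cores} cores: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

On these numbers three cores give roughly a 2.3x speedup, the best case stays under 2.8x, and the sixth core makes things slightly slower than five.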

(perhaps I am being stupid, I can't see how to get mprime to give me the figures for running three threads on three cores, rather than six)

Christenson 2011-10-12 02:46

[QUOTE=kjaget;274118]If it's an x86-based supercomputer. Assembler isn't portable, and you'll likely even have problems because assembler syntax varies between different tool chains.

Check out mlucas for an example of relatively portable source code instead.[/QUOTE]

I'm betting that there's still a non-x86, pure C-language FFT core in mprime -- it's just that the much more optimised versions are what we x86-bigots always see.

Dubslow 2011-10-12 05:12

I'm aware that it wouldn't be the most optimal use of resources ever, and parallelism and yadda yadda...
I'm just asking, taking M/Prime/95 on some (let's go with x86-based) supercomputer, and I mean the actual PFLOPS range supercomputers, would it run? Would it be able to use all the cores, or would one need to start a bunch of separate P95 executions?

bsquared 2011-10-12 05:27

[QUOTE=Dubslow;274197]I'm aware that it wouldn't be the most optimal use of resources ever, and parallelism and yadda yadda...
I'm just asking, taking M/Prime/95 on some (let's go with x86-based) supercomputer, and I mean the actual PFLOPS range supercomputers, would it run? Would it be able to use all the cores, or would one need to start a bunch of separate P95 executions?[/QUOTE]

Sure, if it was an x86 based supercomputer it would probably run. But look at fivemack's post - I would not be surprised if using all N thousand cores made it run *significantly* slower than if you just used 2 or 3 cores. Communication bandwidth (and latency!) can't just be dismissed, no matter how "powerful" the supercomputer is.

debrouxl 2011-10-12 05:58

On supercomputers / grids, one could however run one Prime95 worker thread per core, and run an arbitrary assortment of TF/P-1/ECM/LL work on many numbers in parallel.

Dubslow 2011-10-12 06:13

[QUOTE=bsquared;274202]Sure, if it was an x86 based supercomputer it would probably run. But look at fivemack's post - I would not be surprised if using all N thousand cores made it run *significantly* slower than if you just used 2 or 3 cores. Communication bandwidth (and latency!) can't just be dismissed, no matter how "powerful" the supercomputer is.[/QUOTE]

I know, this is mostly about proof of concept. I'm not talking about using more than a few cores per assignment, but rather could all cores be used by a single instance of P95 or would many be needed?


All times are UTC.