mersenneforum.org  

2006-02-03, 18:50   #1
T.Rex
Feb 2004
France
2·461 Posts
Scalability of Glucas on large machines: A future //ed prime 95 ?

Hi,

Here are some results I've gathered on the scalability of Glucas (multi-threaded with POSIX threads) on large machines.
I plan to produce figures later.

The goal is to provide George with data in order to parallelize prime95 in the most efficient way.

Both machines have 16 processors. The first is Itanium2, the second is PPC.
I will not give clock frequencies for now.

The Itanium2 machine is NUMA, with 2 NUMA factors. Glucas is launched with a numactl command that binds the threads to a block of CPUs/memory and optimizes memory allocation (about a 30% performance improvement compared with running without numactl).

The PPC machine is also NUMA, but with a very small NUMA factor, so it can be considered nearly SMP.

Scalability with N processors is computed as: $\frac{100}{N} \cdot \frac{\text{sec/iter with 1 proc}}{\text{sec/iter with } N \text{ procs}}$ (in percent).

An 83% scalability with 16 CPUs means that Glucas is as fast as 1-thread Glucas would be if it ran in parallel with perfect scaling on 13.3 CPUs (0.83 × 16).
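
As a quick check of the formula above, here is a minimal Python sketch that recomputes the scalability from sec/iter timings and converts it into an equivalent number of perfectly used CPUs. The function names are illustrative and not part of Glucas; the timings are the PPC figures from the table in this post.

```python
def scalability(sec_per_iter_1, sec_per_iter_n, n_procs):
    """Parallel efficiency in percent: (100 / N) * (T1 / TN)."""
    return 100.0 / n_procs * (sec_per_iter_1 / sec_per_iter_n)


def equivalent_cpus(scalability_percent, n_procs):
    """Number of CPUs that would give the same speed with perfect scaling."""
    return scalability_percent / 100.0 * n_procs


# PPC timings from the table in this post (sec/iter with 1 and 16 processors).
ppc_16 = scalability(0.1668, 0.0125, 16)
print(f"PPC, 16 procs: {ppc_16:.0f}% scalability")
print(f"83% on 16 CPUs is like {equivalent_cpus(83, 16):.1f} ideal CPUs")
```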

The 1-thread Glucas figures below come from the multi-threaded build. On Itanium2 it runs about 13% slower than Glucas built without multi-threading.

Code:
#proc             Itanium2                   PPC
          sec/iter   scalability    sec/iter  scalability
-------------------------------------------------------------
1          0.1462      100%          0.1668    100%
8          0.0245       75%          0.0230     91%
16         0.0149       61%          0.0125     83%

The data show that Glucas is not completely parallelized, even on a (nearly) SMP machine.
On the PPC machine, about 9% of performance is lost for every 8 processors added.
(Or the machine is not pure SMP. I'll get more information.)
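
One way to put a number on "not completely parallelized" is the Karp-Flatt metric, which estimates a serial fraction e from a measured speedup S on N processors: e = (1/S - 1/N) / (1 - 1/N). This is an analysis added as a sketch, not something computed for the original measurements, and the estimate lumps true serial code together with memory and synchronization overheads.

```python
def karp_flatt(speedup, n_procs):
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    return (1.0 / speedup - 1.0 / n_procs) / (1.0 - 1.0 / n_procs)


# sec/iter values from the table above: {machine: {n_procs: sec/iter}}
timings = {
    "Itanium2": {1: 0.1462, 8: 0.0245, 16: 0.0149},
    "PPC": {1: 0.1668, 8: 0.0230, 16: 0.0125},
}

for machine, t in timings.items():
    for n in (8, 16):
        speedup = t[1] / t[n]
        e = karp_flatt(speedup, n)
        print(f"{machine:9s} N={n:2d} speedup={speedup:5.2f} serial fraction={e:.3f}")
```

With these timings the estimate comes out at roughly 1.3 to 1.5% on the PPC machine and 4 to 5% on the Itanium2 machine.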

The data also show that NUMA has a big impact on Glucas performance.
This is probably because the memory used by each Glucas thread is allocated before the threads are started, on a different block of CPUs/memory than the one where the threads run. This leads to latency and memory-bandwidth bottlenecks.

Conclusions:

prime95 runs on ia32 and x86_64, but not on ia64 or PPC.
So the machines a future parallelized prime95 could run on are, for now, limited to dual-core. But we may expect Intel and AMD to provide many-core processors in the coming years, like Sun's T1 processor (2 to 8 cores with 4 threads per core: 32 threads on 1 chip!).
8-way IA32 machines made of 2 blocks of 4 CPUs plus memory are already available.

I'm not an expert in processor architecture, but I guess that future multi-core processors and multi-chip machines will not be purely SMP, for cost reasons.
My experiments show that it is worth taking memory-access problems like NUMA into account in the design of a possible future parallelized prime95.

I guess parallelizing prime95 is such a complex task that it is better to build this memory-access constraint into the first version of the design, so that George can build a prime95P program that takes full advantage of future 8- or 16-core processors and lasts 10 more years.

Hope it helps.
Your opinion ?

Tony

Last fiddled with by T.Rex on 2006-02-03 at 18:53

2006-02-03, 20:28   #2
R.D. Silverman
Nov 2003
2²×5×373 Posts

Quote:
Originally Posted by T.Rex

The best way to run LL in parallel is the way it is being done now.
Give separate values to separate processors and let them run independently.

The objective is to maximize throughput and NOT to minimize latency to
test any single number.

2006-02-03, 21:25   #3
TheJudger
"Oliver"
Mar 2005
Germany
1111₁₀ Posts

Quote:
Originally Posted by T.Rex
..., like Sun's T1 processor (2 to 8 cores with 4 threads per core: 32 threads on 1 chip!).

2 to 8 cores / chip, 4 threads per core... and only one weak FPU (each FP instruction takes no less than 40 cycles) ;)

Sun T1 is NOT a number cruncher... but it is very strong for "throughput applications".

TheJudger

2006-02-03, 21:25   #4
T.Rex
Feb 2004
France
2·461 Posts

Testing one Mersenne number with more than 10 million digits takes more than 1 month on a fast CPU running 24 hours a day, 7 days a week.
For PCs that run only 4 hours a day on average, that means it takes 6 months to complete the test of one exponent. So the risks of a Windows reinstall, a disk loss, or the owner getting tired of waiting for a result also grow.
It should get worse in the future, since exponents grow while the computational power of a single core will stay roughly stable, with Intel and AMD focusing on putting more cores on a single chip.
In 2 years, after the 10M-digit prime is found, fewer new people may be interested in GIMPS. Far fewer, I think, if it takes so long to test one exponent.
Tony

2006-02-03, 21:25   #5
xilman
Bamboozled!
"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across
25426₈ Posts

Quote:
Originally Posted by R.D. Silverman
The best way to run LL in parallel is the way it is being done now.
Give separate values to separate processors and let them run independently.

The objective is to maximize throughput and NOT to minimize latency to
test any single number.

However, when verifying a purported new prime it's nice to have the verification as soon as possible rather than as cheaply as possible. That may be worth the cost of reduced computational efficiency.

For making the discovery in the first place, I agree with your optimization strategy.


Paul

2006-02-04, 12:54   #6
T.Rex
Feb 2004
France
2·461 Posts

Quote:
Originally Posted by TheJudger
2 to 8 cores / chip, 4 threads per core... and only one weak FPU (each FP instruction takes no less than 40 cycles) ;) ...

Perfectly right. My comments are valid only if each core has, or can share, an FPU.
Tony

2006-02-04, 15:11   #7
T.Rex
Feb 2004
France
2×461 Posts
Additional information, dealing with Linux

Using data available on the SPEC web site about the multi-threaded Java benchmark SPECjbb2000, I have some comments on the data I provided.
The SPECjbb2000 benchmark, though it does little computation, is interesting since it uses 1 to 2 Java/kernel threads per core.
SPEC provides a series of measurements made on an "IBM eServer p5 570" machine, with either AIX or Linux, on 2, 4, 8 and 16 processors.
I've computed the scalability in all cases. Although the same benchmark is run on the same hardware with nearly the same JVM, the scalability of 16 cores relative to 2 cores is 91.7% for AIX and 82% for Linux (SLES9 = kernel 2.4, I think).
So using AIX rather than Linux on the PPC machine may have increased the scalability measured on PPC compared with Itanium2/Linux.
A better comparison would be between an SMP machine running Linux and a NUMA machine running Linux.
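
SPECjbb2000 reports throughput scores (higher is better) rather than sec/iter, and the baseline here is 2 processors rather than 1, so the scalability formula from the first post needs adapting. Below is a minimal sketch of how such a figure can be computed, assuming ideal scaling means the score grows in proportion to the processor count. The function name is illustrative and the scores are made-up placeholders, not SPEC results.

```python
def relative_scalability(base_procs, base_score, n_procs, n_score):
    """Percent efficiency at n_procs relative to a base_procs baseline.

    Scores are throughput (higher is better), so the ratio is inverted
    compared with the sec/iter formula.
    """
    ideal_score = base_score * n_procs / base_procs
    return 100.0 * n_score / ideal_score


# Placeholder scores for illustration only (NOT published SPEC figures).
score_2_procs = 1000.0
score_16_procs = 7000.0
print(f"{relative_scalability(2, score_2_procs, 16, score_16_procs):.1f}%")
```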

Other SPECjbb2000 results, run by AMD on an S2882 (Opteron 252), show that the benchmark is about 5% faster on Windows than on Linux (SLES9).

(In case you do not know: NPTL arrived with Linux kernel 2.6 and is more efficient than the old "LinuxThreads" available on kernel 2.4. But I have no figures on the performance improvement.)

Conclusion:
The impact of NUMA on a Mersenne cruncher may be smaller than I said, but it is still significant.

Who can provide details about memory access and the FPU in the new and upcoming multi-core processors from Intel and AMD?

I think I read somewhere that Intel dual-core chips have a bigger memory bottleneck than AMD's.

Tony

2006-02-04, 16:48   #8
Ken_g6
Jan 2005
Caught in a sieve
5·79 Posts

Quote:
Originally Posted by T.Rex
Testing one Mersenne number with more than 10 million digits takes more than 1 month on a fast CPU running 24 hours a day, 7 days a week.
For PCs that run only 4 hours a day on average, that means it takes 6 months to complete the test of one exponent. So the risks of a Windows reinstall, a disk loss, or the owner getting tired of waiting for a result also grow.
It should get worse in the future, since exponents grow while the computational power of a single core will stay roughly stable, with Intel and AMD focusing on putting more cores on a single chip.
In 2 years, after the 10M-digit prime is found, fewer new people may be interested in GIMPS. Far fewer, I think, if it takes so long to test one exponent.
Tony

This sounds to me like something that should be user-adjustable. Hard-core users could use one thread on each core, while people just trying it out could get one exponent done faster. It would be nice if the program ran benchmarks on 1, 2, 4, 8,... up to all the processors, then let the user select the number of cores to use with a slider, while displaying both the time to complete one exponent and something like the number of exponents completed per day.
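
The display suggested here boils down to two numbers per thread count: how long one exponent takes, and how many exponents the whole machine finishes per day. Below is a rough Python sketch that reuses the PPC sec/iter figures from the first post purely as sample benchmark data. It rests on several assumptions: a Lucas-Lehmer test of 2^p - 1 takes p - 2 iterations; the exponent is an illustrative value of about 33.2 million, roughly where Mersenne numbers reach 10 million digits; the sec/iter figures are applied to it unchanged, although they really depend on the FFT size; and groups of threads running side by side do not slow each other down.

```python
SECONDS_PER_DAY = 86400


def days_per_exponent(exponent, sec_per_iter, hours_per_day=24.0):
    """Calendar days for one Lucas-Lehmer test (exponent - 2 iterations)."""
    compute_days = (exponent - 2) * sec_per_iter / SECONDS_PER_DAY
    return compute_days * 24.0 / hours_per_day


def report(exponent, bench, total_cores, hours_per_day=24.0):
    """Return (threads, days per exponent, exponents per day) rows."""
    rows = []
    for threads, sec_per_iter in sorted(bench.items()):
        days = days_per_exponent(exponent, sec_per_iter, hours_per_day)
        concurrent_tests = total_cores // threads
        rows.append((threads, days, concurrent_tests / days))
    return rows


# Sample data: PPC sec/iter from the first post; the exponent is illustrative.
bench = {1: 0.1668, 8: 0.0230, 16: 0.0125}
for threads, days, per_day in report(33_219_281, bench, total_cores=16):
    print(f"{threads:2d} threads: {days:6.1f} days/exponent, "
          f"{per_day:.3f} exponents/day")
```

In this sketch the one-thread-per-core setting gives the most exponents per day, while the 16-thread setting gives the shortest wait for a single exponent, which is the throughput-versus-latency trade-off raised earlier in the thread. Passing `hours_per_day=4` multiplies every duration by 6.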

2006-02-05, 22:32   #9
jasong
"Jason Goatcher"
Mar 2005
3×7×167 Posts

Quote:
Originally Posted by T.Rex
Testing one Mersenne number with more than 10 million digits takes more than 1 month on a fast CPU running 24 hours a day, 7 days a week.
For PCs that run only 4 hours a day on average, that means it takes 6 months to complete the test of one exponent. So the risks of a Windows reinstall, a disk loss, or the owner getting tired of waiting for a result also grow.
It should get worse in the future, since exponents grow while the computational power of a single core will stay roughly stable, with Intel and AMD focusing on putting more cores on a single chip.
In 2 years, after the 10M-digit prime is found, fewer new people may be interested in GIMPS. Far fewer, I think, if it takes so long to test one exponent.
Tony

You have a point, but I believe it's only valid if primality testing of these large exponents is the ONLY thing you're doing. I would think that people would be less likely to get bored if they had, say, 8 cores, and only 1 or 2 were running Prime95 and the rest were involved in other pursuits.

I think the best thing for the project would be for the program to have features that let it get along with other programs. I think the best thing for any DC project to do right now is to be open-minded about the fact that people will be running more than one DC project on their computer at once.

2006-02-06, 12:38   #10
T.Rex
Feb 2004
France
922₁₀ Posts

Quote:
Originally Posted by jasong
I would think that people would be less likely to get bored if they had, say, 8 cores, and only 1 or 2 were running Prime95 and the rest were involved in other pursuits.

Yes. It would be nice to be able to choose between no threading for prime95 and N threads for prime95, with N between 2 and the number of available cores.
Tony

2006-02-06, 12:40   #11
T.Rex
Feb 2004
France
2×461 Posts
A figure showing scalability of Glucas on 16x CPUs

Here is the figure I promised.
Note that you may get different scalability results with the same processor depending on the machine's architecture (how the SMP building blocks are connected: frequency, bandwidth, ...).
Also, the closer a NUMA machine is to SMP, the more expensive it is ...
As noted in a previous post, I don't know how large the (positive) impact of AIX compared with Linux is.
Tony
Attached figure: Glucas.gif (4.8 KB)

Last fiddled with by T.Rex on 2006-02-06 at 12:43

All times are UTC.



Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.