mersenneforum.org Scalability of Glucas on large machines: A future //ed prime 95 ?

2006-02-03, 18:50   #1
T.Rex

Feb 2004
France

2·461 Posts

Scalability of Glucas on large machines: A future //ed prime 95 ?

Hi,

Here are some results I've gathered on the scalability of Glucas (multi-threaded with POSIX Threads) on large machines. I plan to build figures later.
The goal is to provide George with data in order to parallelize prime95 in the most efficient way.

The machines have 16 processors. The first one is Itanium2. The second one is PPC. I will not give details about the frequency for now.

The Itanium2 machine is NUMA, with 2 NUMA factors. Glucas is launched with a numactl command that binds threads to a block of CPUs/memory and optimizes memory allocation (about 30% better performance than without numactl).
The PPC machine is also NUMA but with a very small NUMA factor, so it can be considered nearly SMP.

Scalability with N procs is computed as: $\large \frac{100}{N} \ \frac{\text{sec/iter with 1 proc}}{\text{sec/iter with N procs}}$.
An 83% scalability with 16 CPUs means that Glucas is as fast as 1-thread Glucas run in // on 13.3 CPUs.

The 1-thread Glucas figures below come from the multi-threaded build. On Itanium2 it runs about 13% slower than Glucas built without multi-threading.

Code:
#proc         Itanium2                    PPC
         sec/iter  scalability    sec/iter  scalability
-------------------------------------------------------------
   1      0.1462      100%         0.1668      100%
   8      0.0245       75%         0.0230       91%
  16      0.0149       61%         0.0125       83%

The data show that Glucas is not completely parallelized on an SMP machine: about 9% of performance is lost for every 8 processors. (Or the machine is not pure SMP. I'll get more information.)

The data also show that a NUMA machine has a big impact on Glucas performance. This is probably because the memory used by each Glucas thread is allocated before the threads are started, on a different block of CPUs/memory than the block where the threads run. This leads to latency and a memory bandwidth bottleneck.
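As a quick check of the formula, here is a minimal Python sketch that recomputes the scalability column from the sec/iter values in the table. The function and variable names (`scalability`, `equivalent_cpus`, `report`) are made up for this illustration and are not part of Glucas.

```python
# sec/iter values copied from the table in this post.
SEC_PER_ITER = {
    "Itanium2": {1: 0.1462, 8: 0.0245, 16: 0.0149},
    "PPC": {1: 0.1668, 8: 0.0230, 16: 0.0125},
}


def scalability(t1, tn, n):
    """Scalability in percent: (100 / N) * (sec/iter with 1 proc / sec/iter with N procs)."""
    return 100.0 / n * (t1 / tn)


def equivalent_cpus(t1, tn):
    """Speedup t1 / tN: how many 1-thread runs the N-thread run is worth."""
    return t1 / tn


def report(data=SEC_PER_ITER):
    """Return (machine, procs, scalability %, equivalent CPUs) for every table row."""
    rows = []
    for machine, timings in data.items():
        t1 = timings[1]
        for n in sorted(timings):
            tn = timings[n]
            rows.append(
                (machine, n, round(scalability(t1, tn, n)), round(equivalent_cpus(t1, tn), 1))
            )
    return rows


if __name__ == "__main__":
    for machine, n, pct, cpus in report():
        print(f"{machine:9s} N={n:2d}  scalability={pct:3d}%  equivalent CPUs={cpus}")
```

For PPC with 16 procs this gives 83% and about 13.3 equivalent CPUs, which is where the "13.3 CPUs" figure comes from.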
Conclusions:

prime95 runs on ia32 and x86_64, but not on ia64 or PPC. So the machines a future parallelized prime95 could run on today are limited to dual-core. But we may expect Intel and AMD to provide many-core processors in the next few years, like the processor T1 of Sun (2 to 8 cores with 4 threads per core: 32 threads on 1 chip !). 8-way IA32 machines made of 2 blocks of 4 CPUs/memory are already available.

I'm not an expert in processor architecture ... but I guess that future multi-core processors and multi-chip machines will not be purely SMP, for cost reasons.

My experiments show that it is worth taking memory-access problems like NUMA into account in the design of a possible future parallelized prime95.

I guess parallelizing prime95 is such a complex task that it is better to build this memory-access constraint into the first version of the design, so that George can build a prime95P program that takes full advantage of future 8- or 16-core processors and lasts 10 more years.

Hope it helps. Your opinion ?
Tony

Last fiddled with by T.Rex on 2006-02-03 at 18:53

2006-02-03, 20:28   #2
R.D. Silverman

Nov 2003

2²×5×373 Posts

Quote:
 Originally Posted by T.Rex
The best way to run LL in parallel is the way it is being done now.
Give separate values to separate processors and let them run independently.

The objective is to maximize throughput and NOT to minimize latency to
test any single number.

2006-02-03, 21:25   #3
TheJudger

"Oliver"
Mar 2005
Germany

111110 Posts

Quote:
 Originally Posted by T.Rex ..., like the processor T1 of Sun (2 to 8 cores with 4 threads per core: 32 threads on 1 chip !).
2 to 8 cores / chip, 4 threads per core... and only one weak FPU (each FP instruction takes no less than 40 cycles) ;)

Sun T1 is NOT a numbercruncher.... but it is very strong for "throughput-applications"

TheJudger

2006-02-03, 21:25   #4
T.Rex

Feb 2004
France

2·461 Posts

Testing one Mersenne number with more than 10 million digits takes more than 1 month on a fast CPU running 24 hours a day, 7 days a week.
For PCs that run only 4 hours a day on average, that means it takes 6 months to complete the test of one exponent.
So the risk of a Windows reinstall, the loss of the disk, or the owner of the PC getting tired of waiting for a result is also growing.
It should get worse in the future, since exponents grow and the computational power of one core will stay about the same, due to Intel and AMD focusing on more cores on a single chip.
In 2 years, after the 10M-digit prime is found, there may be fewer new people interested in GIMPS. Many fewer, I think, if it takes so long to test one exponent.
Tony

2006-02-03, 21:25   #5
xilman
Bamboozled!

"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

254268 Posts

Quote:
 Originally Posted by R.D. Silverman The best way to run LL in parallel is the way it is being done now. Give separate values to separate processors and let them run independently. The objective is to maximize throughput and NOT to minimize latency to test any single number.
However, when verifying a purported new prime it's nice to have the verification as soon as possible rather than as cheaply as possible. The niceness may be worth paying the cost of reduced computational efficiency.

For making the discovery in the first place, I agree with your optimization strategy.

Paul

2006-02-04, 12:54   #6
T.Rex

Feb 2004
France

2·461 Posts

Quote:
 Originally Posted by TheJudger 2 to 8 cores / chip, 4 threads per core... and only one weak FPU (each FP instruction takes no less than 40 cycles) ;) ...
Perfectly right. My comments are valid only if each core can use or share a FPU.
Tony

2006-02-04, 16:48   #8
Ken_g6

Jan 2005
Caught in a sieve

5·79 Posts

Quote:
 Originally Posted by T.Rex Testing one Mersenne with more than 10 million digits takes more than 1 month for a fast CPU running 24 hours a day and 7 days a week. For PCs that run only 4 hours a day in average, that means that it takes 6 months for completing the test of one exponent. So, the risk of a reinstall of Windows or the loss of the disk or the owner of the PC being tired of waiting for a result are also growing. It should be worst in the future, since exponents grow and computational power of one core will be stable in the future, due to Intel and AMD focusing on more cores on a single chip. In 2 years, after the 10M digits prime is found, there may be less new people being interested by GIMPS. Much less I think if it takes so long to test one exponent. Tony
This sounds to me like something that should be user-adjustable. Hard-core users could use one thread on each core, while people just trying it out could get one exponent done faster. It would be nice if the program did benchmarks on 1, 2, 4, 8,... up to all the processors, then allowed the user to select the number of cores to use with a slider, while displaying both time to complete one exponent and something like the fraction of exponents per day.
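To make the trade-off concrete, here is a rough Python sketch of the two numbers such a slider could display. It reuses the PPC sec/iter figures from T.Rex's table in post #1. Everything else is an assumption for illustration: the iteration count (about one iteration per unit of the exponent, for a hypothetical exponent near 33.2 million), the idea that those timings apply to an exponent of that size, and the idea that independent tests on separate cores do not slow each other down. The function name `tradeoff` is made up.

```python
CORES = 16

# PPC sec/iter by number of threads per test, from the table in post #1.
SEC_PER_ITER = {1: 0.1668, 8: 0.0230, 16: 0.0125}

# Assumption: a Lucas-Lehmer test needs roughly as many iterations as the
# exponent; 33_219_281 is a hypothetical exponent, used only for illustration.
ITERATIONS = 33_219_281

SECONDS_PER_DAY = 86_400


def tradeoff(threads, cores=CORES, timings=SEC_PER_ITER, iterations=ITERATIONS):
    """Return (days to finish one test, tests finished per day on the whole machine)."""
    days_per_test = timings[threads] * iterations / SECONDS_PER_DAY
    # Assumption: the machine runs cores // threads independent tests at once,
    # each at the measured sec/iter (no interference between tests).
    concurrent_tests = cores // threads
    tests_per_day = concurrent_tests / days_per_test
    return days_per_test, tests_per_day


if __name__ == "__main__":
    for threads in sorted(SEC_PER_ITER):
        days, per_day = tradeoff(threads)
        print(f"{threads:2d} threads/test: {days:6.1f} days per test, {per_day:.3f} tests per day")
```

Under these assumptions, more threads per test shortens the wait for a single exponent but lowers the number of exponents finished per day, which is the choice the slider would expose.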

2006-02-05, 22:32   #9
jasong

"Jason Goatcher"
Mar 2005

3×7×167 Posts

Quote:
 Originally Posted by T.Rex Testing one Mersenne with more than 10 million digits takes more than 1 month for a fast CPU running 24 hours a day and 7 days a week. For PCs that run only 4 hours a day in average, that means that it takes 6 months for completing the test of one exponent. So, the risk of a reinstall of Windows or the loss of the disk or the owner of the PC being tired of waiting for a result are also growing. It should be worst in the future, since exponents grow and computational power of one core will be stable in the future, due to Intel and AMD focusing on more cores on a single chip. In 2 years, after the 10M digits prime is found, there may be less new people being interested by GIMPS. Much less I think if it takes so long to test one exponent. Tony
You have a point, but I believe it's only valid if the testing of the primality of these large exponents is the ONLY thing you're doing. I would think that people would be less likely to get bored if they had, say, 8 cores, and only 1 or 2 were running Prime95 and the rest were involved in other pursuits.

I think the best thing for the project would be for the program to have features that let it get along with other programs. I think the best thing for any DC project to do right now is to be open-minded about the fact that people will be running more than one DC project on their computer at once.

2006-02-06, 12:38   #10
T.Rex

Feb 2004
France

922₁₀ Posts

Quote:
 Originally Posted by jasong I would think that people would be less likely to get bored if they had, say, 8 cores, and only 1 or 2 were running Prime95 and the rest were involved in other pursuits.
Yes. It would be nice to be able to choose between: no threads for prime95, or N threads for prime95, with N between 2 and the available number of cores.
Tony

2006-02-06, 12:40   #11
T.Rex

Feb 2004
France

2×461 Posts

A figure showing scalability of Glucas on 16x CPUs

Here is the figure I promised.

Notice that the same processor may give different scalability results depending on how the machine is built (how the SMP building blocks are connected: frequency, bandwidth, ...). Also, the closer the NUMA of a machine is to SMP, the more expensive it is ...

Based on my comments in a previous post, I don't know what the (positive) impact of AIX is compared to Linux.

Tony

Attached Thumbnails

Last fiddled with by T.Rex on 2006-02-06 at 12:43
