mersenneforum.org > Great Internet Mersenne Prime Search > Software
2016-08-25, 05:15   #12
LaurV
Romulan Interpreter

Quote:
Originally Posted by Madpoo
That's where your math always broke down for me, on my systems anyway.
Did you read my post to the end? Maybe not, because it was too long
You are saying the same thing I say.
2016-08-25, 20:59   #13
Madpoo
Serpentine Vermin Jar

Quote:
Originally Posted by LaurV
Did you read my post to the end? Maybe not, because it was too long
You are saying the same thing I say.
Sorry, I did in fact skim it, you caught me. LOL

I ran some actual tests on a dual 10-core system (Xeon E5-2690 v2) with DDR3 @ 1866, and I did some even quicker tests yesterday on an older dual 4-core system (Xeon X5550 @ 2.67 GHz with DDR3 @ 800).

For purposes of my test, I was only working with one of the two CPUs in the system... I did some testing with both CPUs active, but for the most part it didn't make much difference (below 1% difference in per-iteration speed).

The slower/older 4-core system surprisingly didn't have the same penalty when running multiple 4M FFT workers. I hadn't tested it before, and I'm guessing it's because it's memory-limited enough that it's slow to start with.

With a 4M FFT, a single worker on a single core is 68.25ms per iteration (I could do the inverse to get iterations-per-second, but I was just focused on per-iteration times for my comparison).

When I had 4 workers going, each on a single core, the time only crept up to 73ms / iteration, an increase (penalty) of ~ 7%.

On the other hand, if I had a single worker using all 4 cores, it was doing 19.5ms / iteration. In terms of total throughput that's about 14% slower than ideal scaling from a single worker/single core (68.25/4 = 17.06ms), but only 6.8% slower than 4 workers/1 core each.

6.8% isn't too bad a penalty to just get a result from a single worker back quicker...

I also tried it out with 2 workers/2 cores each, and I got 38.4ms / iteration with both workers going, or 34.5ms / iter with only one running, so each worker was 11.4% slower with that second worker spun up. That's pretty close to the 12.5% loss in overall throughput (an effective 19.2ms / iter for the pair versus the ideal 17.06ms / iter).

Once more, it's not really fair to say it's x% slower when talking about ms/iter... I'm just using an idealized "if a single-core worker runs at X, then a 4-core worker would run at X/4" yardstick.
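For anyone who wants to check my arithmetic, the idealized yardstick can be sketched in a few lines of Python (my own recomputation of the 4-core Xeon X5550 numbers above, not anything from Prime95 itself):

```python
# Penalty arithmetic for the 4-core tests (numbers quoted above).
single_1core = 68.25   # ms/iter, 1 worker on 1 core, running alone
four_1core   = 73.0    # ms/iter per worker, 4 workers on 1 core each
single_4core = 19.5    # ms/iter, 1 worker using all 4 cores

# Penalty of 4 workers running at once vs. one running alone (~7%)
worker_penalty = four_1core / single_1core - 1

# Idealized 4-core time: a single-core worker at X scales to X/4
ideal_4core = single_1core / 4            # 17.06 ms/iter

# 4-core worker vs. ideal scaling (~14% slower)
vs_ideal = single_4core / ideal_4core - 1

# 4-core worker vs. effective throughput of 4 loaded workers (~6.8%)
effective_4workers = four_1core / 4       # 18.25 ms/iter equivalent
vs_4workers = single_4core / effective_4workers - 1

print(f"{worker_penalty:.1%} {vs_ideal:.1%} {vs_4workers:.1%}")
# → 7.0% 14.3% 6.8%
```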

Anyway, it's when I got to the 10-core Xeon E5 v2 that things got interesting... I ran tests at 3 different FFT sizes: 3584K, 3840K, and 4096K. The first two happened to be the sizes of numbers I'm actually testing, and for the 4M stuff I picked the first 10 exponents > 78M just to run some benchmarks.

With a single 10-core worker going, the ms/iter times were 3.18ms, 3.42ms and 4.08ms respectively. Pretty decent times, and that's how I have the servers set up now, with all cores from each CPU dedicated to a worker.

But what if I had 10 workers using one core each? That's when things got weird.

@ 3584K by the time I had all 10 workers going, each one was 22.3% slower than when only one was running (27.95ms/iter versus 34.18ms/iter with all 10 chugging along). It really started to shoot upwards by the time the 6th and 7th workers started, and when that 8th one began, performance dropped from 7.69% slower to 14.38% slower.

@ 3840K it was the same type of progression with an even worse end result w/ 10 workers: 32.92% slower (28.74ms/iter versus 38.20ms/iter). It devolved between the 7th and 8th workers firing up.

@ 4096K it looked similar to the 3584K tests, with a worst case 21.87% slowdown with all 10 going (33.52ms/iter versus 40.85ms/iter).
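Those three slowdown figures fall straight out of the quoted ms/iter pairs; here's a quick recomputation (my own sketch, just the arithmetic on the numbers above):

```python
# 10-core Xeon E5-2690 v2: per-worker slowdown with all 10 workers
# running vs. one worker running alone, per FFT size (quoted above).
slowdowns = {   # FFT size: (ms/iter alone, ms/iter with all 10 going)
    "3584K": (27.95, 34.18),
    "3840K": (28.74, 38.20),
    "4096K": (33.52, 40.85),
}
for fft, (alone, loaded) in slowdowns.items():
    print(f"{fft}: {loaded / alone - 1:.2%} slower per worker")
# → 3584K: 22.29%, 3840K: 32.92%, 4096K: 21.87%
```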

Getting into some esoteric setups, like 2 workers of 5 cores each, it kind of stunk up the place... with 2 x 5-core workers, each worker's per-iteration time got 58% worse when the 2nd worker started: from a decent 5.86ms/iter to 9.26ms/iter ... kind of getting close to that "twice as slow" threshold.

Anyway, I'll wrap it up here... The comparisons I did could probably be translated into iterations/second and idealized for comparison between # of workers, but anyway, hopefully that gets the gist of the idea across.
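Doing that translation to iterations/second for the 3584K case (my own helper function, using the numbers quoted above) shows that, with the contention included, ten single-core workers actually end up with slightly less aggregate throughput than the one 10-core worker here:

```python
def iters_per_sec(ms_per_iter, n_workers=1):
    """Aggregate iterations/second across n identical workers."""
    return n_workers * 1000.0 / ms_per_iter

# One 10-core worker at 3.18 ms/iter vs. ten 1-core workers at
# 34.18 ms/iter each (the 3584K figures with everything running):
print(f"{iters_per_sec(3.18):.1f} iter/s")        # → 314.5 iter/s
print(f"{iters_per_sec(34.18, 10):.1f} iter/s")   # → 292.6 iter/s
```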
2016-08-26, 05:34   #14
LaurV
Romulan Interpreter

Very nice. The post could be made sticky, so we can refer all the guys who come monthly asking the same questions to it.

Now we are in violent agreement. With the exception that I never had access to a machine with more than 6 cores (and it is great that you give an insight into such machines).