#13

Serpentine Vermin Jar
Jul 2014
110011110001₂ Posts
Quote:
I ran some actual tests on a dual 10-core system (Xeon E5-2690 v2 with DDR3 @ 1866), and some even quicker tests yesterday on an older dual 4-core system (Xeon X5550 @ 2.67 GHz with DDR3 @ 800). For my test I only worked with one of the two CPUs in each system; I did some testing with both CPUs active, but for the most part it made little difference (below 1% difference in per-iteration speed).

The slower/older 4-core system surprisingly didn't show the same penalty when running multiple 4M FFT workers. I hadn't tested it before, and I'm guessing that's because it's memory-limited enough that it's slow to start with. With a 4M FFT, a single worker on a single core runs at 68.25 ms per iteration (I could take the inverse to get iterations per second, but I was focused on per-iteration times for my comparison). With 4 workers going, each on a single core, the time only crept up to 73 ms/iter, a penalty of about 7%. On the other hand, a single worker using all 4 cores did 19.5 ms/iter, so in total throughput terms it's about 14% slower than a single worker on a single core, but only 6.8% slower than 4 workers at 1 core each. 6.8% isn't too bad a penalty to get a result back from a single worker quicker.

I also tried 2 workers with 2 cores each: 38.4 ms/iter with both workers going, or 34.5 ms/iter with only one running, so it was 11.4% slower with the second worker spun up, which is pretty close to 12.5% slower in overall throughput. Once more, it's not really fair to say it's x% slower when talking about ms/iter; I'm just using an idealized "if a single-core worker runs at X, then a 4-core worker would run at X/4" approach.

Anyway, it's when I got to the 10-core Xeon E5 v2 that things got interesting. I ran tests at 3 different FFT sizes: 3584K, 3840K, and 4096K. The first two happened to be the sizes of numbers I'm actually testing, and for the 4M runs I picked the first 10 exponents > 78M just to have something to benchmark. With a single 10-core worker going, the times were 3.18 ms, 3.42 ms, and 4.08 ms per iteration respectively. Pretty decent times, and that's how I have the servers set up now, with all cores of each CPU dedicated to one worker.

But what if I ran 10 workers using one core each? That's when things got weird. @ 3584K, by the time all 10 workers were going, each one was 22.3% slower than when only one was running (27.95 ms/iter versus 34.18 ms/iter with all 10 chugging along). It really started to shoot upwards around the 6th and 7th workers, and when the 8th one began, performance dropped from 7.69% slower to 14.38% slower. @ 3840K it was the same type of progression with an even worse end result: with 10 workers, 32.92% slower (28.74 ms/iter versus 38.20 ms/iter). It devolved between the 7th and 8th workers firing up. @ 4096K it looked similar to the 3584K tests, with a worst case of 21.87% slowdown with all 10 going (33.52 ms/iter versus 40.85 ms/iter).

Getting into some esoteric setups, like 2 workers of 5 cores each, it kind of stunk up the place: with 2 x 5-core workers, performance dropped by 58% when the 2nd worker started, from a decent 5.86 ms/iter to 9.26 ms/iter, getting close to that "twice as slow" threshold.

Anyway, I'll wrap it up here. The comparisons I did could be translated into iterations/second and idealized for comparison between numbers of workers, but hopefully that gets the gist of the idea across.
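The arithmetic behind the quoted percentages can be made explicit. A minimal sketch, assuming the poster's idealized-scaling model (a 1-core worker at X ms/iter should scale to X/n ms/iter on n cores); the helper names are illustrative, not part of Prime95:

```python
def throughput(ms_per_iter, workers):
    """Aggregate iterations per second across identical workers."""
    return workers * 1000.0 / ms_per_iter

def penalty_vs_ideal(baseline_ms, cores, measured_ms):
    """Percent slowdown of an n-core worker versus perfect linear
    scaling of the 1-core baseline (baseline_ms / cores)."""
    ideal_ms = baseline_ms / cores
    return 100.0 * (measured_ms / ideal_ms - 1.0)

# Numbers from the 4-core X5550 test above: 68.25 ms/iter on 1 core,
# 19.5 ms/iter for one worker spanning all 4 cores.
print(round(penalty_vs_ideal(68.25, 4, 19.5), 1))  # → 14.3, the ~14% figure
```

Running the same function with the 4-worker time (73 ms/iter each) against the 68.25 ms baseline reproduces the ~7% penalty quoted for the 4-workers-on-4-cores case.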
#14

Romulan Interpreter
Jun 2011
Thailand
22665₈ Posts
This post could be made sticky, so we can refer all the guys who come monthly asking the same questions to it. Now we are in violent agreement, with the exception that I have never had access to a machine with more than 6 cores (and it is great that you give an insight into such machines).
Similar Threads

| Thread | Thread Starter | Forum | Replies | Last Post |
|---|---|---|---|---|
| Prime 95 will not let me change cores per worker | evanh | Software | 4 | 2017-12-22 22:25 |
| Worker #5 and Worker #7 not running (Error ILLEGAL SUMOUT) | skrupian08 | Information & Answers | 9 | 2016-08-23 16:35 |
| 32 cores limitation | gabrieltt | Software | 12 | 2010-07-15 10:26 |
| CPU cores | Unregistered | Information & Answers | 7 | 2009-11-02 08:27 |
| A program that uses all the CPU-cores | Primix | Hardware | 7 | 2008-09-06 21:09 |