mersenneforum.org

Perpetual benchmark thread... (https://www.mersenneforum.org/showthread.php?t=59)

axn 2016-02-03 06:04

[QUOTE=xtreme2k;425043]There is some magic in there when these CPU is running just 1 worker. This is at least true for 4096K[/QUOTE]

4096K FFT consumes 32MB of memory (plus change). This can run (almost) entirely out of the huge 35MB L3 cache of the Xeon. So despite the loss of efficiency due to multithreading, the 1 worker setup wins out.
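As a rough check of the 32MB figure, here is a minimal sketch of the arithmetic. It assumes each FFT element is stored as one 8-byte double and ignores the "plus change" (sin/cos tables, scratch buffers and so on); the function name is made up for illustration and is not anything from Prime95.

```python
BYTES_PER_DOUBLE = 8  # assumption: one 8-byte double per FFT element


def fft_footprint_mb(fft_size_k: int) -> float:
    """Approximate data size in MB for an FFT of fft_size_k * 1024 elements."""
    return fft_size_k * 1024 * BYTES_PER_DOUBLE / (1024 * 1024)


footprint = fft_footprint_mb(4096)
l3_mb = 35  # L3 size of the Xeon discussed above
print(f"4096K FFT: {footprint:.0f} MB, L3 headroom: {l3_mb - footprint:.0f} MB")
```

So the data alone leaves only about 3MB of the 35MB L3 free, which is why "almost" entirely in cache is the right way to put it.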

Madpoo 2016-02-03 16:20

[QUOTE=axn;425048]4096K FFT consumes 32MB of memory (plus change). This can run (almost) entirely out of the huge 35MB L3 cache of the Xeon. So despite the loss of efficiency due to multithreading, the 1 worker setup wins out.[/QUOTE]

I wondered how much the L2/L3 cache plays a part in it, but since I have no idea what size each "chunk" of work in a multi-threaded worker is like, I couldn't even hazard a guess.

That's another thing where George might get some optimizations, by targeting the chunk of data being worked on to the L3 cache size of that core, be it 1.5 MB, 2 MB, 2.5 MB etc. As in, would it be faster to do several smaller chunks of work that could fit in cache, or one larger chunk that would have to, by necessity, go out to main RAM?

xtreme2k 2016-02-05 05:12

[QUOTE=axn;425048]4096K FFT consumes 32MB of memory (plus change). This can run (almost) entirely out of the huge 35MB L3 cache of the Xeon. So despite the loss of efficiency due to multithreading, the 1 worker setup wins out.[/QUOTE]

That is very interesting. I guess when we move to 5120K FFT I would lose the advantage. In fact, we aren't that far away.

Should've got the 2696v3/2699v3 with 45MB L3 then :bow:
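To put numbers on that, a small sketch under the same assumption as axn's estimate (8 bytes per FFT element, nothing else counted). The list of FFT sizes is only an illustrative subset, not Prime95's real list, and the helper name is hypothetical.

```python
def largest_fft_that_fits(l3_mb, fft_sizes_k):
    """Return the largest FFT size (in K elements) whose data fits in l3_mb of cache."""
    cache_bytes = l3_mb * 1024 * 1024
    fitting = [k for k in fft_sizes_k if k * 1024 * 8 <= cache_bytes]
    return max(fitting) if fitting else None


SIZES_K = [4096, 5120, 6144, 8192]  # illustrative subset only

for l3_mb in (35, 45):  # 2697v3 vs 2696v3/2699v3
    print(f"{l3_mb}MB L3 -> largest fitting FFT: {largest_fft_that_fits(l3_mb, SIZES_K)}K")
```

By this crude measure 5120K (40MB) overflows the 35MB L3 but would still fit in 45MB. Whether the 1 worker advantage really disappears at that point is a separate question that only a benchmark can answer.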

axn 2016-02-05 07:31

[QUOTE=xtreme2k;425285]That is very interesting. I guess when we move to 5120K FFT I would lose the advantage. In fact, we aren't that far away.[/QUOTE]

Possibly. It would be interesting to find out at what FFT size the advantage shifts back.

Madpoo 2016-02-05 22:56

[QUOTE=axn;425302]Possibly. It would be interesting to find out at what FFT size the advantage shifts back.[/QUOTE]

There is strong evidence to suggest there's a "sweet spot" for LL tests on different CPUs. There are so many variables involved that I haven't quite nailed it down... sure would be nice to have the benchmark test look at specific things and then have an option to automatically set Prime95 to whatever gives the peak output.

Things like L2/L3 cache sizes, memory speed, FFT sizes, # of cores (threads) per worker, total # of workers, whether you have single or dual+ socket motherboard, etc.

They all play a part and if you have the time and persistence you can figure out what works best for you, but I think the project would benefit from an automated "plug and play" configuration.

I've said it before and I'll say it again, I've seen obvious cases of horribly misconfigured systems that are doing multiple 4M+ FFT tests on a single CPU, and I just know it would be orders of magnitude more efficient if it did one worker using all cores.

Those are the systems where I can see it has 4 LL tests assigned to it and it's reporting results daily showing minimal progress. Either they only run it an hour a day, or the memory contention is as bad as I think it is, making all of them slow as molasses.

Madpoo 2016-02-05 23:11

[QUOTE=xtreme2k;425285]That is very interesting. I guess when we move to 5120K FFT I would lose the advantage. In fact, we aren't that far away.

Should've got the 2696v3/2699v3 with 45MB L3 then :bow:[/QUOTE]

LOL... more is better. :)

The 16/18 core Haswell chips are clocked slightly slower than the 14 core, so I'd guess the extra cores and cache would make them only slightly faster overall.

AirSquirrels proved this by doing a check on the same exponent as me... him with dual 16-core Xeons, me with dual 14-core.

In my case I could only get to 20-22 cores (14 on one chip, the rest on the other) before adding more stopped improving performance and actually started hindering it.

In his case, he had all 32 cores working together and says he didn't see any drop in speed, but I don't know if he tested other total core counts like 24, 26, 28, 30, etc.

In the end, what took me 34 hours took him around 32 or something (it was the verification for M49).

Looks like Broadwell E5 will have the same 2.5 MB L3 cache per core... I haven't seen any hard info on the L3 size (per core) on Skylake, but probably still 2.5 MB unless they surprise us all with 3-4 MB.

I'm still waiting for Knights Landing and its up-to-16GB of fast memory, which I guess you could call L4 if the system is configured to use that as a fast cache to main RAM.

Again, no hard numbers on how the KNL memory bandwidth would compete with the L3 speed we enjoy now, but it would be faster than 6-channel DDR4, and that's a very good thing.

xtreme2k 2016-03-01 23:08

I ran the benchmark again, focusing on 4096K+ FFT. It takes an incredibly long time to run, and most of that time is spent running different combos of workers and threads. It would be good if George would consider the following.
[LIST]
[*]giving us an indication of FFT size and (optimal/preferred/ok) cache size usage per worker/cores
[*]allowing us to select 1w/n-cores and then n-w/n-cores, skipping the in-between w/c combinations. On a many-core system it just takes forever to run :)
[/LIST]
For my initial results, even at 8192K FFT the 2697v3 definitely prefers 1w/14c over 14w/14c. However, the advantage of 1w/14c over 14w/14c is smaller than at 4096K.

WraithX 2016-06-12 14:01

1 Attachment(s)
Benchmarks for my E5-2687W v4 (12 cores, 3.0GHz) with 128GB = 8x 16GB DDR4-2133 ECC Reg CL15 on Windows 7 Pro SP1 x64.

I'm thinking of rerunning the test with 64GB = 4x 16GB DDR4-2133, to see if a single DIMM per channel makes a difference. In these initial tests, I did see a couple of oddly high timings pop up. I think this may have happened because it hit the second DIMM on the memory channel (maybe?).

The 1st benchmark was run after downloading Prime95 28.9 and selecting Benchmark from the menu.

The 2nd benchmark used the following in prime.txt:
FullBench=1

The 3rd benchmark used the following in prime.txt:
MinBenchFFT=4096
MaxBenchFFT=4096
BenchHyperthreads=0
BenchMultithreads=1

The 4th benchmark used the following in prime.txt:
MinBenchFFT=8192
MaxBenchFFT=8192
BenchHyperthreads=0
BenchMultithreads=1

The 5th benchmark used the following in prime.txt:
FullBench=1
BenchHyperthreads=0
BenchMultithreads=1
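For anyone who wants to script these runs, here is a minimal sketch that builds the prime.txt fragments listed above. It only uses the option names already shown in this post (FullBench, MinBenchFFT, MaxBenchFFT, BenchHyperthreads, BenchMultithreads); the helper function itself is hypothetical and is not part of Prime95.

```python
def bench_settings(min_fft_k=None, max_fft_k=None, full=False,
                   hyperthreads=None, multithreads=None):
    """Build the benchmark lines for prime.txt; options left as None are omitted."""
    lines = []
    if full:
        lines.append("FullBench=1")
    if min_fft_k is not None:
        lines.append(f"MinBenchFFT={min_fft_k}")
    if max_fft_k is not None:
        lines.append(f"MaxBenchFFT={max_fft_k}")
    if hyperthreads is not None:
        lines.append(f"BenchHyperthreads={int(hyperthreads)}")
    if multithreads is not None:
        lines.append(f"BenchMultithreads={int(multithreads)}")
    return "\n".join(lines)


# Settings for the 3rd benchmark (4096K only, no hyperthreads, multithreaded).
print(bench_settings(min_fft_k=4096, max_fft_k=4096,
                     hyperthreads=False, multithreads=True))
```

The printed lines would still have to be pasted into prime.txt by hand before starting the benchmark.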

I know this is a very small part of the program, but I think it would be very helpful if the Benchmark menu option would bring up an options dialog. With this you can quickly choose from the above options and perhaps save the results to different files.

Maybe the dialog can contain:
A radio button to select between "Standard Benchmark", "Full Benchmark", and maybe "Custom Benchmark"
A check box to use MinBenchFFT, which gets its input from a drop down menu. (I know this would be large with 127 options, but better than error checking user inputs)
A check box to use MaxBenchFFT, which gets its input from a drop down menu.
A check box to bench hyperthreads, if that is a) available for the processor and b) turned on in BIOS
A check box to bench multithreads, if the processor has multiple cores.
Perhaps a way to specify the affinity map to see how that affects benchmarks.
Perhaps a way to specify a file name to save these benchmark results to a separate file.
And controls for any other benchmark options I don't know about.

Madpoo 2016-06-13 16:30

[QUOTE=WraithX;436088]Benchmarks for my E5-2687W v4 (12 cores, 3.0GHz) with 128GB = 8x 16GB DDR4-2133 ECC Reg CL15 on Windows 7 Pro SP1 x64.

I'm thinking of rerunning the test with 64GB = 4x 16GB DDR4-2133, to see if a single DIMM per channel makes a difference. In these initial tests, I did see a couple of oddly high timings pop up. I think this may have happened because it hit the second DIMM on the memory channel (maybe?).[/QUOTE]

I expect to get a delivery of a dual Xeon E5-2690 v4 (14 cores @ 2.6 GHz) in the next couple weeks. I'll be interested to put it through the benchmark and see how it compares to an E5-2697 v3 which is also 14 cores @ 2.6 GHz.

The differences will be the architecture itself (Broadwell) plus the v4 will have DDR4-2400 instead of DDR4-2133 for the v3. The mem speed alone would make a big difference.

Any reason you're using 2133 instead of 2400 memory? The E5-2687W v4 supports up to 2400, and since memory is the bottleneck with Prime95, you might take a look at that.
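For a sense of scale, here is a back-of-the-envelope sketch of theoretical peak bandwidth per socket. It assumes 4 memory channels per CPU and 8 bytes per transfer per channel, and uses the nominal DDR4 transfer rates; real-world throughput will be lower, and the function name is just for illustration.

```python
def peak_bandwidth_gbs(mega_transfers_per_s, channels=4, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s (decimal) for one socket."""
    return mega_transfers_per_s * bytes_per_transfer * channels / 1000


bw_2133 = peak_bandwidth_gbs(2133)
bw_2400 = peak_bandwidth_gbs(2400)
print(f"DDR4-2133: {bw_2133:.1f} GB/s, DDR4-2400: {bw_2400:.1f} GB/s")
print(f"Peak gain from 2400 over 2133: {(bw_2400 / bw_2133 - 1) * 100:.1f}%")
```

That works out to roughly 68 vs 77 GB/s, or about 12.5% more peak bandwidth on paper.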

Depending on your motherboard, it may or may not reduce the clock on the memory if you're running 2 DPC. HP servers are "fun" in that if you're using official HP memory, you can run full speed with 2 DPC, but if you cheap out and use 3rd party memory, the BIOS will step down the memory speed from, for example, 1866 to 1600. That might not hold true for the gen9 boxes, but it's definitely the case on their gen8 Proliant servers with Xeon E5 v1/v2 processors.

It can also vary depending on the ranks of each module. I note with curiosity that on the new server I'm getting, it'll run 2DPC @ 2400 MHz if the modules are dual ranked 16GB, but if you use the single-ranked 16GB modules, it runs @ 2133 with 2DPC.

Doesn't matter to me, I'm only doing one DPC, so 8 across both CPUs, but it's curious. It even says I could use load reduced DIMMs (LRDIMMs) and get 3DPC at the full 2400.

I'm imagining a system now with a full 24 x 128GB LRDIMM @ 2400 for a total of 3 TB of RAM. :smile: Pair that with a couple of 22-core E5-2699v4 and you've got yourself one heckuva virtual host machine... Or better yet, 4 of those CPUs once the E5-46xx v4 chips come out. :coffee:

GP2 2016-06-13 17:55

[QUOTE=Madpoo;436145]I expect to get a delivery of a dual Xeon E5-2690 v4 (14 cores @ 2.6 GHz) in the next couple weeks.[/QUOTE]

How many watts does one of these use when it's doing LL tests at full capacity? Not including the additional air-conditioning burden, which may be hard to calculate directly.

Mark Rose 2016-06-13 18:09

The processor has a design power of 135 watts. Throw in some memory and chipset watts and power supply inefficiency and you're probably close to 175 watts per processor.
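Scaling that guess up to the dual-socket box in question, as a minimal sketch: the 175 watt figure is the rough estimate above, not a measurement, and the 24 hours a day of full load is an assumption.

```python
def system_watts(watts_per_processor, sockets):
    """Total estimated draw for a multi-socket machine."""
    return watts_per_processor * sockets


def kwh_per_day(watts, hours=24):
    """Energy used per day in kWh at a constant draw."""
    return watts * hours / 1000


total = system_watts(175, 2)  # 175 W per processor is the rough estimate above
print(f"Dual socket: about {total} W, {kwh_per_day(total):.1f} kWh per day")
```

That comes to about 350 watts and 8.4 kWh per day, before any air-conditioning.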

