mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Hardware (https://www.mersenneforum.org/forumdisplay.php?f=9)
-   -   Preliminary Skylake-X benchmark (https://www.mersenneforum.org/showthread.php?t=23632)

Prime95 2018-09-03 04:55

Preliminary Skylake-X benchmark
 
A 3.6 GHz 8-core Skylake-X with DDR-3600 memory, running the new AVX-512 FFT code:

Timings for 4480K FFT length (1 core, 1 worker): 12.53 ms. Throughput: 79.83 iter/sec.
Timings for 4480K FFT length (2 cores, 1 worker): 6.94 ms. Throughput: 144.15 iter/sec.
Timings for 4480K FFT length (3 cores, 1 worker): 5.21 ms. Throughput: 192.10 iter/sec.
Timings for 4480K FFT length (4 cores, 1 worker): 4.09 ms. Throughput: 244.70 iter/sec.
Timings for 4480K FFT length (5 cores, 1 worker): 3.49 ms. Throughput: 286.31 iter/sec.
Timings for 4480K FFT length (6 cores, 1 worker): 3.15 ms. Throughput: 317.06 iter/sec.
Timings for 4480K FFT length (7 cores, 1 worker): 2.95 ms. Throughput: 339.29 iter/sec.
Timings for 4480K FFT length (8 cores, 1 worker): 2.95 ms. Throughput: 338.73 iter/sec.

Timings for 4480K FFT length (5 cores, 5 workers): 15.56, 15.50, 15.48, 15.39, 15.41 ms. Throughput: 323.30 iter/sec.
Timings for 4480K FFT length (6 cores, 6 workers): 16.90, 16.85, 16.78, 16.70, 16.77, 16.73 ms. Throughput: 357.38 iter/sec.
Timings for 4480K FFT length (7 cores, 7 workers): 18.71, 18.74, 18.63, 18.54, 18.56, 18.56, 18.63 ms. Throughput: 375.84 iter/sec.
Timings for 4480K FFT length (8 cores, 8 workers): 21.20, 21.38, 21.14, 21.12, 20.97, 21.15, 21.17, 21.36 ms. Throughput: 377.63 iter/sec.

The poor CPU is crying out for more memory bandwidth.
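To put a number on the bandwidth limit, here is a minimal Python sketch that computes how far the single-worker results fall short of ideal linear scaling. The throughput figures are copied from the benchmark above; the function name `scaling_efficiency` is made up for illustration and is not part of Prime95.

```python
# Single-worker throughput (iter/sec) by core count, copied from the benchmark above.
throughput = {
    1: 79.83, 2: 144.15, 3: 192.10, 4: 244.70,
    5: 286.31, 6: 317.06, 7: 339.29, 8: 338.73,
}

def scaling_efficiency(results):
    """Return measured throughput divided by ideal (cores x single-core rate)."""
    base = results[1]
    return {cores: rate / (cores * base) for cores, rate in results.items()}

efficiency = scaling_efficiency(throughput)
for cores, eff in sorted(efficiency.items()):
    print(f"{cores} cores: {eff:.0%} of ideal")
```

With these numbers, efficiency drops from about 90% at 2 cores to about 53% at 8 cores, and the 8th core adds nothing over the 7th.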



BTW, the old AVX code:

Timings for 4480K FFT length (8 cores, 8 workers): 24.81, 24.84, 24.70, 24.80, 24.77, 24.84, 24.80, 24.82 ms. Throughput: 322.61 iter/sec.
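The reported throughput for multi-worker runs appears to be the sum of each worker's iteration rate (1000 / ms per iteration). That is an assumption inferred from the numbers, not something confirmed from the Prime95 source. A quick Python check against the two 8-worker runs above:

```python
def combined_throughput(timings_ms):
    """Sum per-worker rates, assuming each worker runs 1000 / t iterations per second."""
    return sum(1000.0 / t for t in timings_ms)

# Per-worker timings (ms) copied from the 8-core, 8-worker lines above.
avx_timings = [24.81, 24.84, 24.70, 24.80, 24.77, 24.84, 24.80, 24.82]
avx512_timings = [21.20, 21.38, 21.14, 21.12, 20.97, 21.15, 21.17, 21.36]

avx_rate = combined_throughput(avx_timings)
avx512_rate = combined_throughput(avx512_timings)
print(f"AVX:     {avx_rate:.2f} iter/sec")
print(f"AVX-512: {avx512_rate:.2f} iter/sec")
```

Both results land within a few hundredths of the reported 322.61 and 377.63 iter/sec; the small difference is consistent with the timings being rounded to two decimals.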

Mysticial 2018-09-03 05:13

Nice!

Are these all at fixed clock speeds regardless of the workload? (i.e. AVX and AVX512 both running at 3.6 GHz?)

17% speedup is more than I expected given the memory bottleneck.
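For reference, the 17% figure follows directly from the two 8-core, 8-worker throughput numbers in the first post. A short Python check:

```python
# Throughput (iter/sec) for 8 cores, 8 workers, copied from the first post.
avx_rate = 322.61
avx512_rate = 377.63

speedup = avx512_rate / avx_rate - 1.0
print(f"AVX-512 over AVX: {speedup:.1%}")
```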

mackerel 2018-09-03 07:20

From my previous observations, the "old" AVX code wouldn't be significantly limited by RAM in that configuration. Any significant increase from AVX-512 would push it there though.

Is my understanding correct that AVX-512 could double throughput if not limited by RAM? What sort of speedup do you see for smaller FFTs that fit in cache? I can see it shaking things up when it eventually makes its way to LLR.

As a side thought, I assume the CPU temperatures when running would be a little warmer than with old code. It might get a lot hotter if it weren't limited...

Prime95 2018-09-03 13:56

@Mysticial: not sure what the AVX-512 offsets are. I just plugged the chip in and let it rip using the motherboard's default settings.


sensors reports:

[CODE]Physical id 0: +59.0°C (high = +95.0°C, crit = +105.0°C)
Core 0: +54.0°C (high = +95.0°C, crit = +105.0°C)
Core 1: +43.0°C (high = +95.0°C, crit = +105.0°C)
Core 2: +57.0°C (high = +95.0°C, crit = +105.0°C)
Core 3: +46.0°C (high = +95.0°C, crit = +105.0°C)
Core 4: +56.0°C (high = +95.0°C, crit = +105.0°C)
Core 5: +57.0°C (high = +95.0°C, crit = +105.0°C)
Core 6: +54.0°C (high = +95.0°C, crit = +105.0°C)
Core 7: +59.0°C (high = +95.0°C, crit = +105.0°C)
[/CODE]

i7z snapshot:

[CODE]Socket [0] - [physical cores=8, logical cores=16, max online cores ever=8]
TURBO DISABLED on 8 Cores, Hyper Threading ON
Max Frequency without considering Turbo 3599.00 MHz (99.97 x [36])
Max TURBO Multiplier (if Enabled) with 1/2/3/4/5/6 Cores is 45x/41x/40x/40x/40x/40x
Real Current Frequency 3603.84 MHz [99.97 x 36.05] (Max of below)
Core [core-id] :Actual Freq (Mult.) C0% Halt(C1)% C3 % C6 % Temp VCore
Core 1 [0]: 3500.11 (35.01x) 1 99.9 0 0 54 0.9535
Core 2 [1]: 3599.86 (36.01x) 1 0.801 0 98.6 43 0.9777
Core 3 [2]: 3491.47 (34.92x) 1 100 0 0 56 0.9749
Core 4 [3]: 3603.84 (36.05x) 1 0.147 0 99.8 46 0.9835
Core 5 [4]: 3503.82 (35.05x) 1 100 0 0 57 0.9595
Core 6 [5]: 3487.92 (34.89x) 1 100 0 0 55 0.9529
Core 7 [6]: 3489.45 (34.90x) 1 100 0 0 54 0.9545
Core 8 [7]: 3500.00 (35.01x) 100 2.78 0 0 59 0.9590
C1 = Processor running with halts (States >C0 are power saver modes with cores idling)
C3 = Cores running with PLL turned off and core cache turned off
C6, C7 = Everything in C3 + core state saved to last level cache, C7 is deeper than C6
[/CODE]

Prime95 2018-09-03 14:00

Interestingly, uptime reports only 6 cores in use:

[CODE]george@SkylakeX:~/mers295/linux64$ uptime
09:59:12 up 78 days, 17:05, 2 users, load average: 6.01, 6.00, 6.00[/CODE]

paulunderwood 2018-09-03 14:07

[QUOTE=Prime95;495246]Interestingly, uptime reports only 6 cores in use:

[CODE]george@SkylakeX:~/mers295/linux64$ uptime
09:59:12 up 78 days, 17:05, 2 users, load average: 6.01, 6.00, 6.00[/CODE][/QUOTE]

[c]uptime[/c] is not the best measure. Better to look in [c]/proc/cpuinfo[/c] to see how many cores there are. Good luck getting the load to 8.0.

Here is my justification: When running my own code written with gwnum, I get less load than JP's LLR but similar timings.
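A minimal Python sketch of that comparison: read the load averages and count physical cores from `/proc/cpuinfo`-style text. The helper name `count_physical_cores` is made up for illustration, and it assumes the x86 Linux layout in which each processor block carries `physical id` and `core id` fields.

```python
import os

def count_physical_cores(cpuinfo_text):
    """Count distinct (physical id, core id) pairs in /proc/cpuinfo-style text."""
    cores = set()
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "core id" in fields:
            cores.add((fields.get("physical id"), fields["core id"]))
    return len(cores)

# Tiny hand-written sample: two hyperthreads on core 0, one thread on core 1.
sample = (
    "processor : 0\nphysical id : 0\ncore id : 0\n\n"
    "processor : 1\nphysical id : 0\ncore id : 1\n\n"
    "processor : 2\nphysical id : 0\ncore id : 0\n"
)
sample_cores = count_physical_cores(sample)

if os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        print("physical cores:", count_physical_cores(f.read()))
    print("logical CPUs:  ", os.cpu_count())
    print("load averages: ", os.getloadavg())
```

Note that `os.cpu_count()` reports logical CPUs, so with Hyper-Threading on it would show 16 on this machine rather than 8.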

Prime95 2018-09-03 19:55

My bad. I've been running the new code doing Gerbicz PRPs on Skylake-X and it finished two work units and will not get any more work. I've got some unexpected debugging to do.

The good news is that I didn't lose much throughput with two cores idle. The temps and i7z data above are inaccurate. The benchmarks are OK, as they were run after "kill -SIGSTOP" on the running mprime.
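For anyone wanting to script the same trick, here is a hedged Python sketch that suspends a process with SIGSTOP for the duration of a benchmark and resumes it with SIGCONT afterwards. The `paused` helper is hypothetical, the child process below is only a stand-in for a running mprime, and the signals are Unix-only.

```python
import os
import signal
import subprocess
import sys
from contextlib import contextmanager

@contextmanager
def paused(pid):
    """Suspend the process with SIGSTOP, then resume it with SIGCONT on exit."""
    os.kill(pid, signal.SIGSTOP)
    try:
        yield
    finally:
        os.kill(pid, signal.SIGCONT)

if __name__ == "__main__":
    # Stand-in for a long-running job such as mprime.
    worker = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    with paused(worker.pid):
        print("worker suspended; run the benchmark here")
    worker.terminate()
    worker.wait(timeout=10)
```

The `finally` clause means the job is resumed even if the benchmark raises an exception.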

Mysticial 2018-09-03 21:07

[QUOTE=Prime95;495245]@Mysticial: not sure what the AVX-512 offsets are. I just plugged the chip in and let it rip using the motherboard's default settings.
[/QUOTE]

Sounds like it's probably -4 for AVX512.

If you want a raw cycle-for-cycle comparison of AVX vs. AVX512, you'll need to force it from the BIOS.

So you'll need to zero the offsets for both AVX and AVX512. But you'll also need to drop all the turbos to no higher than 3.6 GHz. Otherwise, you'll roast the machine when it tries to run AVX512 @ 4.0 GHz on all 8 cores.
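The arithmetic behind that guess, as a small Python sketch. The base clock comes from the i7z snapshot above. The all-core turbo multiplier of 40x is an assumption: i7z only lists turbo multipliers for up to 6 cores, and 40x is extrapolated to 8 cores to match the 4.0 GHz figure. The offset of 4 is the guess from this post, not a measured value.

```python
def effective_mhz(bclk_mhz, multiplier, offset=0):
    """Core clock after subtracting an AVX multiplier offset (simplified model)."""
    return bclk_mhz * (multiplier - offset)

BCLK_MHZ = 99.97       # base clock reported by i7z
ALL_CORE_TURBO = 40    # assumed all-core turbo multiplier
AVX512_OFFSET = 4      # guessed AVX-512 offset

with_offset = effective_mhz(BCLK_MHZ, ALL_CORE_TURBO, AVX512_OFFSET)
without_offset = effective_mhz(BCLK_MHZ, ALL_CORE_TURBO)
print(f"AVX-512 with offset:    {with_offset:.2f} MHz")
print(f"AVX-512 without offset: {without_offset:.2f} MHz")
```

Under those assumptions, a 40x turbo with a 4-bin offset lands at about 3.6 GHz, and zeroing the offset without capping turbo would put AVX-512 at about 4.0 GHz on all cores.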

GP2 2018-09-04 02:12

[QUOTE=Prime95;495218]
Timings for 4480K FFT length (1 core, 1 worker): 12.53 ms. Throughput: 79.83 iter/sec.

Timings for 4480K FFT length (8 cores, 8 workers): 21.20, 21.38, 21.14, 21.12, 20.97, 21.15, 21.17, 21.36 ms. Throughput: 377.63 iter/sec.

The poor CPU is crying out for more memory bandwidth.[/QUOTE]

One of the reasons I remain a fan of running on the cloud rather than a physical box is that using 8 separate one-core virtual machines really does mean 8 times the throughput of one core. Unless you somehow contrive to get them running on the same physical cloud server, which if it ever became an issue could be avoided with staggered starts.

In the above example, that would give you 640 iter/sec combined rather than 378, which is 70% more.

Although the nominal cost advantage is probably still in favor of a barebones server farm setup (unless you live in an area with expensive power), this factor does tilt the balance partly back the other way. On top of that, the upgrade to Skylake hardware was free: you just start using the new instance type, which was a 20% boost even on an AVX-to-AVX basis, and I guess, based on this benchmark, there will be an additional 17% boost when AVX-512 code is available.
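The 640 vs. 378 comparison can be reproduced with a few lines of Python. This sketch assumes, as the post does, that 8 independent one-core machines each reach the full single-core rate from the benchmark:

```python
# Rates (iter/sec) copied from the quoted benchmark.
single_core_rate = 79.83
measured_8_core = 377.63

# Assumption: 8 isolated one-core VMs scale perfectly.
ideal_8_vms = 8 * single_core_rate
gain = ideal_8_vms / measured_8_core - 1.0
print(f"ideal: {ideal_8_vms:.2f} iter/sec, {gain:.0%} more than measured")
```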

Mysticial 2018-09-04 19:51

[QUOTE=GP2;495305]One of the reasons I remain a fan of running on the cloud rather than a physical box is that using 8 separate one-core virtual machines really does mean 8 times the throughput of one core. Unless you somehow contrive to get them running on the same physical cloud server, which if it ever became an issue could be avoided with staggered starts.

In the above example, that would give you 640 iter/sec combined rather than 378, which is 70% more.

Although the nominal cost advantage is probably still in favor of a barebones server farm setup (unless you live in an area with expensive power), this factor does tilt the balance partly back the other way. On top of that, the upgrade to Skylake hardware was free: you just start using the new instance type, which was a 20% boost even on an AVX-to-AVX basis, and I guess, based on this benchmark, there will be an additional 17% boost when AVX-512 code is available.[/QUOTE]

That sounds like a great way to piss off other cloud users! :razz:

Throw tons of single-threaded bandwidth-heavy AVX512 workloads on the cloud. Not only do you eat up all the memory bandwidth, you throttle their clocks as well!

:devil::devil::devil:

Mark Rose 2018-09-04 20:38

[QUOTE=Mysticial;495359]That sounds like a great way to piss off other cloud users! :razz:

Throw tons of single-threaded bandwidth-heavy AVX512 workloads on the cloud. Not only do you eat up all the memory bandwidth, you throttle their clocks as well!

:devil::devil::devil:[/QUOTE]

If EC2 users care enough, they can select dedicated tenancy instances.


All times are UTC.