[QUOTE=Cruelty;421122][LIST][*]sometimes a higher FFT length has better timings than a lower FFT length[*]sometimes HT is faster than non-HT[/LIST][/QUOTE]
On my own Xeons, I've seen Prime95 struggle to match up the physical/HT cores, so it didn't always pair them up correctly... I resorted to manually setting the affinity map. Also, when benchmarking you may as well disable testing the HT cores; it really will hurt performance. You may have seen some gains because the "HT core" was actually a different physical core that wasn't being used as a worker thread yet. George also explained that depending on the exponent size (and I take that to mean the FFT length), it may split the load among multiple threads in a single worker more or less optimally. Small FFT sizes scale horribly across a lot of cores, but the larger the FFT, the better it scales... the work can be broken into more chunks, which makes distributing it among the multiple threads a more balanced affair.
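If you want to try the manual route: in this era of Prime95 the affinity map goes in local.txt. A minimal sketch, assuming the AffinityScramble2 option described in undoc.txt; the string below is purely illustrative for a hypothetical 4-core/8-thread box (physical cores first, their HT siblings last), so check undoc.txt for the exact encoding on your own core count:
[CODE]AffinityScramble2=02461357[/CODE]
The idea is just to pin worker threads to distinct physical cores instead of letting two threads land on one core's HT pair.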
First run on my single Xeon E5-2697 v3, 128GB ECC DDR4-2133 :smile:
The 28 threads result is very much screwed. Apart from that (the HT results) there are improvements as more cores are added. More to come later. [CODE]Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
CPU speed: 2878.03 MHz, 14 hyperthreaded cores
CPU features: Prefetch, SSE, SSE2, SSE4, AVX, AVX2, FMA
L1 cache size: 32 KB
L2 cache size: 256 KB, L3 cache size: 35 MB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
TLBS: 64
Prime95 64-bit version 28.7, RdtscTiming=1
Best time for 1024K FFT length: 4.793 ms., avg: 4.811 ms.
Best time for 1280K FFT length: 6.172 ms., avg: 6.334 ms.
Best time for 1536K FFT length: 7.478 ms., avg: 7.492 ms.
Best time for 1792K FFT length: 9.018 ms., avg: 9.051 ms.
Best time for 2048K FFT length: 10.307 ms., avg: 10.315 ms.
Best time for 2560K FFT length: 12.986 ms., avg: 12.999 ms.
Best time for 3072K FFT length: 15.933 ms., avg: 16.499 ms.
Best time for 3584K FFT length: 18.924 ms., avg: 18.983 ms.
Best time for 4096K FFT length: 21.385 ms., avg: 21.413 ms.
Best time for 5120K FFT length: 27.668 ms., avg: 27.746 ms.
Best time for 6144K FFT length: 34.277 ms., avg: 34.338 ms.
Best time for 7168K FFT length: 41.400 ms., avg: 41.499 ms.
Best time for 8192K FFT length: 47.571 ms., avg: 47.702 ms.
Timing FFTs using 2 threads on 1 physical CPU.
Best time for 1024K FFT length: 4.905 ms., avg: 4.974 ms.
Best time for 1280K FFT length: 6.330 ms., avg: 6.391 ms.
Best time for 1536K FFT length: 7.685 ms., avg: 7.776 ms.
Best time for 1792K FFT length: 9.335 ms., avg: 9.601 ms.
Best time for 2048K FFT length: 10.181 ms., avg: 10.200 ms.
Best time for 2560K FFT length: 12.705 ms., avg: 12.723 ms.
Best time for 3072K FFT length: 16.425 ms., avg: 16.866 ms.
Best time for 3584K FFT length: 18.850 ms., avg: 18.961 ms.
Best time for 4096K FFT length: 22.294 ms., avg: 22.514 ms.
Best time for 5120K FFT length: 29.846 ms., avg: 29.933 ms.
Best time for 6144K FFT length: 38.068 ms., avg: 38.292 ms.
Best time for 7168K FFT length: 45.381 ms., avg: 45.985 ms.
Best time for 8192K FFT length: 51.352 ms., avg: 51.703 ms.
Timing FFTs using 2 threads on 2 physical CPUs.
Best time for 1024K FFT length: 2.692 ms., avg: 2.969 ms.
Best time for 1280K FFT length: 3.612 ms., avg: 3.914 ms.
Best time for 1536K FFT length: 4.326 ms., avg: 4.632 ms.
Best time for 1792K FFT length: 5.180 ms., avg: 5.525 ms.
Best time for 2048K FFT length: 5.733 ms., avg: 6.017 ms.
Best time for 2560K FFT length: 7.122 ms., avg: 7.486 ms.
Best time for 3072K FFT length: 8.819 ms., avg: 8.944 ms.
Best time for 3584K FFT length: 10.321 ms., avg: 10.460 ms.
Best time for 4096K FFT length: 11.996 ms., avg: 12.220 ms.
Best time for 5120K FFT length: 14.406 ms., avg: 14.915 ms.
Best time for 6144K FFT length: 17.746 ms., avg: 18.433 ms.
Best time for 7168K FFT length: 21.377 ms., avg: 22.592 ms.
Best time for 8192K FFT length: 24.572 ms., avg: 25.116 ms.
Timing FFTs using 3 threads on 3 physical CPUs.
Best time for 1024K FFT length: 1.870 ms., avg: 2.142 ms.
Best time for 1280K FFT length: 2.543 ms., avg: 2.827 ms.
Best time for 1536K FFT length: 3.033 ms., avg: 3.405 ms.
Best time for 1792K FFT length: 3.614 ms., avg: 3.951 ms.
Best time for 2048K FFT length: 3.978 ms., avg: 4.337 ms.
Best time for 2560K FFT length: 4.989 ms., avg: 5.296 ms.
Best time for 3072K FFT length: 6.084 ms., avg: 6.584 ms.
Best time for 3584K FFT length: 7.081 ms., avg: 7.577 ms.
Best time for 4096K FFT length: 8.280 ms., avg: 8.682 ms.
Best time for 5120K FFT length: 10.449 ms., avg: 10.751 ms.
Best time for 6144K FFT length: 12.006 ms., avg: 12.684 ms.
Best time for 7168K FFT length: 14.439 ms., avg: 15.151 ms.
Best time for 8192K FFT length: 16.587 ms., avg: 17.212 ms.
Timing FFTs using 4 threads on 4 physical CPUs.
Best time for 1024K FFT length: 1.434 ms., avg: 1.866 ms.
Best time for 1280K FFT length: 1.915 ms., avg: 2.116 ms.
Best time for 1536K FFT length: 2.494 ms., avg: 2.937 ms.
Best time for 1792K FFT length: 2.995 ms., avg: 3.457 ms.
Best time for 2048K FFT length: 3.268 ms., avg: 3.776 ms.
Best time for 2560K FFT length: 4.234 ms., avg: 4.628 ms.
Best time for 3072K FFT length: 4.354 ms., avg: 4.776 ms.
Best time for 3584K FFT length: 5.378 ms., avg: 5.798 ms.
Best time for 4096K FFT length: 6.349 ms., avg: 6.733 ms.
Best time for 5120K FFT length: 7.948 ms., avg: 8.342 ms.
Best time for 6144K FFT length: 9.688 ms., avg: 9.909 ms.
Best time for 7168K FFT length: 10.968 ms., avg: 11.825 ms.
Best time for 8192K FFT length: 12.826 ms., avg: 13.745 ms.
Timing FFTs using 5 threads on 5 physical CPUs.
Best time for 1024K FFT length: 1.177 ms., avg: 1.525 ms.
Best time for 1280K FFT length: 1.529 ms., avg: 2.143 ms.
Best time for 1536K FFT length: 1.878 ms., avg: 2.314 ms.
Best time for 1792K FFT length: 2.239 ms., avg: 2.841 ms.
Best time for 2048K FFT length: 2.514 ms., avg: 2.915 ms.
Best time for 2560K FFT length: 3.099 ms., avg: 3.808 ms.
Best time for 3072K FFT length: 3.738 ms., avg: 4.138 ms.
Best time for 3584K FFT length: 4.346 ms., avg: 4.836 ms.
Best time for 4096K FFT length: 5.078 ms., avg: 5.404 ms.
Best time for 5120K FFT length: 6.424 ms., avg: 6.765 ms.
Best time for 6144K FFT length: 7.334 ms., avg: 7.846 ms.
Best time for 7168K FFT length: 9.388 ms., avg: 9.865 ms.
Best time for 8192K FFT length: 10.973 ms., avg: 11.361 ms.
Timing FFTs using 6 threads on 6 physical CPUs.
Best time for 1024K FFT length: 1.182 ms., avg: 1.461 ms.
Best time for 1280K FFT length: 1.425 ms., avg: 1.853 ms.
Best time for 1536K FFT length: 1.678 ms., avg: 2.094 ms.
Best time for 1792K FFT length: 1.857 ms., avg: 2.409 ms.
Best time for 2048K FFT length: 2.100 ms., avg: 2.598 ms.
Best time for 2560K FFT length: 2.610 ms., avg: 3.135 ms.
Best time for 3072K FFT length: 3.168 ms., avg: 3.591 ms.
Best time for 3584K FFT length: 3.685 ms., avg: 4.050 ms.
Best time for 4096K FFT length: 4.283 ms., avg: 4.721 ms.
Best time for 5120K FFT length: 5.438 ms., avg: 5.776 ms.
Best time for 6144K FFT length: 6.590 ms., avg: 6.846 ms.
Best time for 7168K FFT length: 8.428 ms., avg: 8.624 ms.
Best time for 8192K FFT length: 9.133 ms., avg: 9.448 ms.
Timing FFTs using 7 threads on 7 physical CPUs.
Best time for 1024K FFT length: 0.921 ms., avg: 1.409 ms.
Best time for 1280K FFT length: 1.198 ms., avg: 1.736 ms.
Best time for 1536K FFT length: 1.394 ms., avg: 1.886 ms.
Best time for 1792K FFT length: 1.657 ms., avg: 2.128 ms.
Best time for 2048K FFT length: 1.865 ms., avg: 2.330 ms.
Best time for 2560K FFT length: 2.270 ms., avg: 2.834 ms.
Best time for 3072K FFT length: 2.734 ms., avg: 3.216 ms.
Best time for 3584K FFT length: 3.192 ms., avg: 3.710 ms.
Best time for 4096K FFT length: 4.242 ms., avg: 4.901 ms.
Best time for 5120K FFT length: 4.732 ms., avg: 5.248 ms.
Best time for 6144K FFT length: 5.710 ms., avg: 6.123 ms.
Best time for 7168K FFT length: 6.856 ms., avg: 7.115 ms.
Best time for 8192K FFT length: 7.901 ms., avg: 8.278 ms.
Timing FFTs using 8 threads on 8 physical CPUs.
Best time for 1024K FFT length: 1.007 ms., avg: 1.203 ms.
Best time for 1280K FFT length: 1.138 ms., avg: 1.477 ms.
Best time for 1536K FFT length: 1.238 ms., avg: 1.689 ms.
Best time for 1792K FFT length: 1.563 ms., avg: 1.920 ms.
Best time for 2048K FFT length: 1.680 ms., avg: 2.123 ms.
Best time for 2560K FFT length: 2.040 ms., avg: 2.738 ms.
Best time for 3072K FFT length: 2.400 ms., avg: 3.054 ms.
Best time for 3584K FFT length: 2.835 ms., avg: 3.233 ms.
Best time for 4096K FFT length: 3.276 ms., avg: 3.973 ms.
Best time for 5120K FFT length: 4.076 ms., avg: 4.716 ms.
Best time for 6144K FFT length: 5.037 ms., avg: 5.514 ms.
Best time for 7168K FFT length: 6.083 ms., avg: 6.396 ms.
Best time for 8192K FFT length: 7.017 ms., avg: 7.649 ms.
Timing FFTs using 9 threads on 9 physical CPUs.
Best time for 1024K FFT length: 1.300 ms., avg: 1.346 ms.
Best time for 1280K FFT length: 0.959 ms., avg: 1.739 ms.
Best time for 1536K FFT length: 1.129 ms., avg: 1.580 ms.
Best time for 1792K FFT length: 1.305 ms., avg: 1.750 ms.
Best time for 2048K FFT length: 1.518 ms., avg: 2.216 ms.
Best time for 2560K FFT length: 1.809 ms., avg: 2.373 ms.
Best time for 3072K FFT length: 2.143 ms., avg: 2.668 ms.
Best time for 3584K FFT length: 2.573 ms., avg: 3.073 ms.
Best time for 4096K FFT length: 2.951 ms., avg: 3.561 ms.
Best time for 5120K FFT length: 3.738 ms., avg: 4.276 ms.
Best time for 6144K FFT length: 4.598 ms., avg: 5.062 ms.
Best time for 7168K FFT length: 5.483 ms., avg: 5.940 ms.
Best time for 8192K FFT length: 6.292 ms., avg: 6.690 ms.
Timing FFTs using 10 threads on 10 physical CPUs.
Best time for 1024K FFT length: 0.728 ms., avg: 1.032 ms.
Best time for 1280K FFT length: 0.999 ms., avg: 1.347 ms.
Best time for 1536K FFT length: 1.012 ms., avg: 1.402 ms.
Best time for 1792K FFT length: 1.187 ms., avg: 1.793 ms.
Best time for 2048K FFT length: 1.400 ms., avg: 1.889 ms.
Best time for 2560K FFT length: 1.702 ms., avg: 2.253 ms.
Best time for 3072K FFT length: 1.939 ms., avg: 2.417 ms.
Best time for 3584K FFT length: 2.356 ms., avg: 2.915 ms.
[Sat Jan 30 00:26:03 2016]
Best time for 4096K FFT length: 2.691 ms., avg: 3.513 ms.
Best time for 5120K FFT length: 3.377 ms., avg: 3.906 ms.
Best time for 6144K FFT length: 4.139 ms., avg: 4.465 ms.
Best time for 7168K FFT length: 5.060 ms., avg: 5.456 ms.
Best time for 8192K FFT length: 5.911 ms., avg: 6.458 ms.
Timing FFTs using 11 threads on 11 physical CPUs.
Best time for 1024K FFT length: 0.691 ms., avg: 1.012 ms.
Best time for 1280K FFT length: 1.285 ms., avg: 1.368 ms.
Best time for 1536K FFT length: 0.932 ms., avg: 1.311 ms.
Best time for 1792K FFT length: 1.116 ms., avg: 1.630 ms.
Best time for 2048K FFT length: 1.337 ms., avg: 2.201 ms.
Best time for 2560K FFT length: 1.590 ms., avg: 2.089 ms.
Best time for 3072K FFT length: 1.793 ms., avg: 2.225 ms.
Best time for 3584K FFT length: 2.147 ms., avg: 2.659 ms.
Best time for 4096K FFT length: 2.487 ms., avg: 3.080 ms.
Best time for 5120K FFT length: 3.151 ms., avg: 3.620 ms.
Best time for 6144K FFT length: 3.930 ms., avg: 4.468 ms.
Best time for 7168K FFT length: 4.688 ms., avg: 4.821 ms.
Best time for 8192K FFT length: 5.445 ms., avg: 5.852 ms.
Timing FFTs using 12 threads on 12 physical CPUs.
Best time for 1024K FFT length: 0.773 ms., avg: 0.828 ms.
Best time for 1280K FFT length: 1.369 ms., avg: 1.406 ms.
Best time for 1536K FFT length: 0.932 ms., avg: 1.386 ms.
Best time for 1792K FFT length: 1.084 ms., avg: 1.703 ms.
Best time for 2048K FFT length: 1.270 ms., avg: 1.824 ms.
Best time for 2560K FFT length: 1.484 ms., avg: 2.105 ms.
Best time for 3072K FFT length: 1.644 ms., avg: 2.379 ms.
Best time for 3584K FFT length: 2.053 ms., avg: 2.905 ms.
Best time for 4096K FFT length: 2.329 ms., avg: 2.814 ms.
Best time for 5120K FFT length: 2.924 ms., avg: 3.592 ms.
Best time for 6144K FFT length: 3.762 ms., avg: 3.848 ms.
Best time for 7168K FFT length: 4.520 ms., avg: 4.866 ms.
Best time for 8192K FFT length: 5.338 ms., avg: 5.746 ms.
Timing FFTs using 13 threads on 13 physical CPUs.
Best time for 1024K FFT length: 1.023 ms., avg: 1.083 ms.
Best time for 1280K FFT length: 0.835 ms., avg: 1.145 ms.
Best time for 1536K FFT length: 1.363 ms., avg: 1.469 ms.
Best time for 1792K FFT length: 1.172 ms., avg: 1.490 ms.
Best time for 2048K FFT length: 1.221 ms., avg: 1.812 ms.
Best time for 2560K FFT length: 1.472 ms., avg: 2.020 ms.
Best time for 3072K FFT length: 1.562 ms., avg: 2.188 ms.
Best time for 3584K FFT length: 1.891 ms., avg: 2.571 ms.
Best time for 4096K FFT length: 2.150 ms., avg: 2.738 ms.
Best time for 5120K FFT length: 2.797 ms., avg: 3.612 ms.
Best time for 6144K FFT length: 3.472 ms., avg: 3.843 ms.
Best time for 7168K FFT length: 4.593 ms., avg: 4.946 ms.
Best time for 8192K FFT length: 5.153 ms., avg: 5.493 ms.
Timing FFTs using 14 threads on 14 physical CPUs.
Best time for 1024K FFT length: 0.975 ms., avg: 1.039 ms.
Best time for 1280K FFT length: 0.809 ms., avg: 1.115 ms.
Best time for 1536K FFT length: 0.814 ms., avg: 1.289 ms.
Best time for 1792K FFT length: 0.960 ms., avg: 1.483 ms.
Best time for 2048K FFT length: 1.223 ms., avg: 1.695 ms.
Best time for 2560K FFT length: 1.635 ms., avg: 1.927 ms.
Best time for 3072K FFT length: 1.522 ms., avg: 2.320 ms.
Best time for 3584K FFT length: 1.818 ms., avg: 2.653 ms.
Best time for 4096K FFT length: 2.059 ms., avg: 2.665 ms.
Best time for 5120K FFT length: 2.646 ms., avg: 3.227 ms.
Best time for 6144K FFT length: 3.395 ms., avg: 4.085 ms.
Best time for 7168K FFT length: 4.240 ms., avg: 4.636 ms.
Best time for 8192K FFT length: 4.905 ms., avg: 5.378 ms.
Timing FFTs using 28 threads on 14 physical CPUs.
Best time for 1024K FFT length: 3.022 ms., avg: 3.683 ms.
Best time for 1280K FFT length: 3.287 ms., avg: 4.911 ms.
Best time for 1536K FFT length: 3.430 ms., avg: 4.703 ms.
Best time for 1792K FFT length: 3.735 ms., avg: 4.901 ms.
Best time for 2048K FFT length: 5.461 ms., avg: 7.825 ms.
Best time for 2560K FFT length: 5.530 ms., avg: 8.088 ms.
Best time for 3072K FFT length: 6.958 ms., avg: 8.445 ms.
Best time for 3584K FFT length: 7.098 ms., avg: 10.531 ms.
Best time for 4096K FFT length: 8.573 ms., avg: 11.630 ms.
Best time for 5120K FFT length: 9.140 ms., avg: 12.934 ms.
Best time for 6144K FFT length: 8.914 ms., avg: 12.554 ms.
Best time for 7168K FFT length: 12.851 ms., avg: 16.634 ms.
Best time for 8192K FFT length: 14.782 ms., avg: 24.383 ms.
Timings for 1024K FFT length (1 cpu, 1 worker): 4.26 ms. Throughput: 234.57 iter/sec.
Timings for 1024K FFT length (2 cpus, 2 workers): 4.27, 4.27 ms. Throughput: 468.06 iter/sec.
Timings for 1024K FFT length (3 cpus, 3 workers): 4.54, 4.51, 4.59 ms. Throughput: 659.74 iter/sec.
Timings for 1024K FFT length (4 cpus, 4 workers): 4.80, 4.78, 4.81, 4.83 ms. Throughput: 832.58 iter/sec.
Timings for 1024K FFT length (5 cpus, 5 workers): 5.01, 5.05, 5.04, 5.04, 5.05 ms. Throughput: 992.16 iter/sec.
Timings for 1024K FFT length (6 cpus, 6 workers): 5.19, 5.20, 5.20, 5.19, 5.17, 5.18 ms. Throughput: 1156.39 iter/sec.
Timings for 1024K FFT length (7 cpus, 7 workers): 5.34, 5.37, 5.36, 5.34, 5.34, 5.34, 5.33 ms. Throughput: 1309.17 iter/sec.
Timings for 1024K FFT length (8 cpus, 8 workers): 5.59, 5.60, 5.60, 5.60, 5.60, 5.60, 5.60, 5.58 ms. Throughput: 1430.07 iter/sec.
Timings for 1024K FFT length (9 cpus, 9 workers): 6.04, 6.04, 6.05, 6.05, 6.05, 6.05, 6.05, 6.05, 6.02 ms. Throughput: 1488.82 iter/sec.
Timings for 1024K FFT length (10 cpus, 10 workers): 6.55, 6.54, 6.54, 6.54, 6.55, 6.55, 6.55, 6.54, 6.55, 6.53 ms. Throughput: 1528.25 iter/sec.
Timings for 1024K FFT length (11 cpus, 11 workers): 7.11, 7.07, 7.06, 7.05, 7.04, 7.04, 7.04, 7.04, 7.04, 7.04, 7.04 ms. Throughput: 1560.24 iter/sec.
Timings for 1024K FFT length (12 cpus, 12 workers): 7.60, 7.61, 7.57, 7.57, 7.55, 7.58, 7.55, 7.57, 7.56, 7.55, 7.59, 7.55 ms. Throughput: 1585.16 iter/sec.
Timings for 1024K FFT length (13 cpus, 13 workers): 8.34, 8.29, 8.28, 8.28, 8.22, 8.27, 8.26, 8.24, 8.27, 8.27, 8.30, 8.30, 8.26 ms. Throughput: 1571.02 iter/sec.
Timings for 1024K FFT length (14 cpus, 14 workers): 9.12, 9.05, 9.04, 9.04, 9.04, 9.05, 8.96, 8.96, 8.99, 8.96, 9.04, 9.08, 9.12, 8.99 ms. Throughput: 1550.39 iter/sec.[/CODE]
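A quick sanity check on the single-worker scaling at the 4096K length, pulled from the dump above (my arithmetic, not part of the benchmark output):
[CODE]1 thread  : 21.385 ms
14 threads:  2.059 ms
Speedup   : 21.385 / 2.059 = ~10.4x on 14 cores (~74% parallel efficiency)[/CODE]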
[QUOTE=xtreme2k;424543]First run on my single Xeon E5-2697 v3, 128GB ECC DDR4-2133 :smile:
The 28 threads result is very much screwed. Apart from that (the HT results) there are improvements as more cores are added. More to come later.[/QUOTE] You should run the benchmarks to test multiple cores in single workers as well (like 14 cores, one worker). I forget the options to add to the config files to enable that (and also you may as well disable testing any HT cores... it adds to the noise from the test results and it won't be any faster). These many-cored systems are indeed awesome tools. |
Is there a way to test just the 1-worker 4096K FFT speed, for 1-14 cores, without needing to run the full bench?
The initial results with 12 threads on 12 CPUs (leaving me 2 CPUs for other tasks) seem to work well for me. The benchmark result doesn't deviate much from the actual single-worker 12-thread LL test time, which suggests the work is well multithreaded, with an expected completion time of about 52 hours for a current first-time LL test. Win10 Pro seems to divide the work across the real CPUs well (judging by Task Manager) without my needing to manually pin threads to cores. The Xeon's turbo behavior is complicated: I can only get 2.6GHz, which is the base speed, at 12 threads. I was hoping for more.
[QUOTE=xtreme2k;424793]Is there a way to test just the 1-worker 4096K FFT speed, for 1-14 cores, without needing to run the full bench?
The initial results with 12 threads on 12 CPUs (leaving me 2 CPUs for other tasks) seem to work well for me. The benchmark result doesn't deviate much from the actual single-worker 12-thread LL test time, which suggests the work is well multithreaded, with an expected completion time of about 52 hours for a current first-time LL test. Win10 Pro seems to divide the work across the real CPUs well (judging by Task Manager) without my needing to manually pin threads to cores. The Xeon's turbo behavior is complicated: I can only get 2.6GHz, which is the base speed, at 12 threads. I was hoping for more.[/QUOTE] In the "prime.txt" file you'd add MinBenchFFT=4096, MaxBenchFFT=4096, BenchHyperthreads=0, and BenchMultithreads=1 (collected in the snippet below). That last one tells it to benchmark multiple cores in a single worker. I didn't see a way to force it to benchmark with only a single worker (but with all the possible numbers of cores per worker). At least with those you'll test just the 4096K FFT, skip HT benchmarking, and see how your throughput increases (or possibly doesn't) with multi-core workers. Hope that helps.
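Collected in one place, those prime.txt additions are:
[CODE]MinBenchFFT=4096
MaxBenchFFT=4096
BenchHyperthreads=0
BenchMultithreads=1[/CODE]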
Thanks, I will give that a try tonight.
As I see it, the 12-thread/12-CPU benchmark time (2.329 ms) is very close to the actual 1-worker 12-thread LL time (around 2.3-2.4 ms), so I would expect the benchmark to show similar results. :smile: These Xeons are so powerful.
It's interesting, comparing 1 worker vs. multiple workers, that 1 worker is actually faster throughout. Am I reading this right?
e.g.
14 cpus, 1 worker - 509 it/s
14 cpus, 14 workers - 371 it/s
[CODE]Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
CPU speed: 2594.02 MHz, 14 hyperthreaded cores
CPU features: Prefetch, SSE, SSE2, SSE4, AVX, AVX2, FMA
L1 cache size: 32 KB
L2 cache size: 256 KB, L3 cache size: 35 MB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
TLBS: 64
Prime95 64-bit version 28.7, RdtscTiming=1
Best time for 4096K FFT length: 23.861 ms., avg: 23.901 ms.
Timing FFTs using 2 threads on 2 physical CPUs.
Best time for 4096K FFT length: 12.534 ms., avg: 12.581 ms.
Timing FFTs using 3 threads on 3 physical CPUs.
Best time for 4096K FFT length: 8.445 ms., avg: 8.487 ms.
Timing FFTs using 4 threads on 4 physical CPUs.
Best time for 4096K FFT length: 6.512 ms., avg: 6.543 ms.
Timing FFTs using 5 threads on 5 physical CPUs.
Best time for 4096K FFT length: 4.655 ms., avg: 4.701 ms.
Timing FFTs using 6 threads on 6 physical CPUs.
Best time for 4096K FFT length: 3.915 ms., avg: 3.950 ms.
Timing FFTs using 7 threads on 7 physical CPUs.
Best time for 4096K FFT length: 3.340 ms., avg: 3.409 ms.
Timing FFTs using 8 threads on 8 physical CPUs.
Best time for 4096K FFT length: 2.942 ms., avg: 2.977 ms.
Timing FFTs using 9 threads on 9 physical CPUs.
Best time for 4096K FFT length: 2.626 ms., avg: 2.652 ms.
Timing FFTs using 10 threads on 10 physical CPUs.
Best time for 4096K FFT length: 2.444 ms., avg: 2.893 ms.
Timing FFTs using 11 threads on 11 physical CPUs.
Best time for 4096K FFT length: 2.318 ms., avg: 2.691 ms.
Timing FFTs using 12 threads on 12 physical CPUs.
Best time for 4096K FFT length: 2.167 ms., avg: 2.607 ms.
Timing FFTs using 13 threads on 13 physical CPUs.
Best time for 4096K FFT length: 2.037 ms., avg: 2.733 ms.
Timing FFTs using 14 threads on 14 physical CPUs.
Best time for 4096K FFT length: 1.929 ms., avg: 2.475 ms.
Timings for 4096K FFT length (1 cpu, 1 worker): 19.44 ms. Throughput: 51.43 iter/sec.
Timings for 4096K FFT length (2 cpus, 1 worker): 10.36 ms. Throughput: 96.56 iter/sec.
Timings for 4096K FFT length (2 cpus, 2 workers): 24.33, 24.52 ms. Throughput: 81.87 iter/sec.
Timings for 4096K FFT length (3 cpus, 1 worker): 7.34 ms. Throughput: 136.33 iter/sec.
Timings for 4096K FFT length (3 cpus, 3 workers): 22.74, 22.71, 23.15 ms. Throughput: 131.19 iter/sec.
Timings for 4096K FFT length (4 cpus, 1 worker): 5.75 ms. Throughput: 173.87 iter/sec.
Timings for 4096K FFT length (4 cpus, 2 workers): 13.39, 13.33 ms. Throughput: 149.67 iter/sec.
Timings for 4096K FFT length (4 cpus, 4 workers): 23.80, 23.79, 23.33, 23.35 ms. Throughput: 169.73 iter/sec.
Timings for 4096K FFT length (5 cpus, 1 worker): 4.70 ms. Throughput: 212.96 iter/sec.
Timings for 4096K FFT length (5 cpus, 5 workers): 24.01, 24.00, 23.86, 23.97, 23.68 ms. Throughput: 209.18 iter/sec.
Timings for 4096K FFT length (6 cpus, 1 worker): 3.93 ms. Throughput: 254.21 iter/sec.
Timings for 4096K FFT length (6 cpus, 2 workers): 9.17, 9.03 ms. Throughput: 219.75 iter/sec.
Timings for 4096K FFT length (6 cpus, 3 workers): 12.45, 12.66, 12.46 ms. Throughput: 239.54 iter/sec.
Timings for 4096K FFT length (6 cpus, 6 workers): 24.41, 24.27, 24.50, 24.14, 24.02, 24.03 ms. Throughput: 247.64 iter/sec.
Timings for 4096K FFT length (7 cpus, 1 worker): 3.36 ms. Throughput: 297.56 iter/sec.
Timings for 4096K FFT length (7 cpus, 7 workers): 24.62, 24.70, 24.57, 24.69, 24.66, 24.67, 24.52 ms. Throughput: 284.19 iter/sec.
Timings for 4096K FFT length (8 cpus, 1 worker): 2.98 ms. Throughput: 336.02 iter/sec.
Timings for 4096K FFT length (8 cpus, 2 workers): 7.01, 6.89 ms. Throughput: 287.92 iter/sec.
Timings for 4096K FFT length (8 cpus, 4 workers): 13.00, 13.03, 12.98, 12.85 ms. Throughput: 308.53 iter/sec.
Timings for 4096K FFT length (8 cpus, 8 workers): 25.55, 25.41, 25.48, 25.42, 25.30, 25.54, 25.21, 25.42 ms. Throughput: 314.77 iter/sec.
Timings for 4096K FFT length (9 cpus, 1 worker): 2.64 ms. Throughput: 379.15 iter/sec.
Timings for 4096K FFT length (9 cpus, 3 workers): 9.04, 8.96, 8.91 ms. Throughput: 334.42 iter/sec.
Timings for 4096K FFT length (9 cpus, 9 workers): 26.42, 26.60, 26.72, 26.55, 26.65, 26.51, 26.68, 26.43, 26.75 ms. Throughput: 338.45 iter/sec.
Timings for 4096K FFT length (10 cpus, 1 worker): 2.38 ms. Throughput: 419.87 iter/sec.
Timings for 4096K FFT length (10 cpus, 2 workers): 5.64, 5.57 ms. Throughput: 356.92 iter/sec.
[Mon Feb 01 21:57:10 2016]
Timings for 4096K FFT length (10 cpus, 5 workers): 14.19, 14.15, 14.27, 14.03, 14.10 ms. Throughput: 353.40 iter/sec.
Timings for 4096K FFT length (10 cpus, 10 workers): 28.33, 28.31, 28.23, 28.26, 28.22, 27.95, 28.18, 28.26, 28.09, 28.09 ms. Throughput: 354.71 iter/sec.
Timings for 4096K FFT length (11 cpus, 1 worker): 2.20 ms. Throughput: 453.70 iter/sec.
Timings for 4096K FFT length (11 cpus, 11 workers): 30.87, 31.17, 31.11, 30.83, 30.94, 31.16, 30.93, 30.65, 30.67, 30.95, 31.12 ms. Throughput: 355.48 iter/sec.
Timings for 4096K FFT length (12 cpus, 1 worker): 2.25 ms. Throughput: 443.49 iter/sec.
Timings for 4096K FFT length (12 cpus, 2 workers): 5.21, 5.21 ms. Throughput: 383.78 iter/sec.
Timings for 4096K FFT length (12 cpus, 3 workers): 8.17, 8.17, 8.17 ms. Throughput: 367.36 iter/sec.
Timings for 4096K FFT length (12 cpus, 4 workers): 11.20, 10.94, 10.83, 10.94 ms. Throughput: 364.53 iter/sec.
Timings for 4096K FFT length (12 cpus, 6 workers): 16.54, 16.46, 16.34, 16.41, 16.47, 16.81 ms. Throughput: 363.58 iter/sec.
Timings for 4096K FFT length (12 cpus, 12 workers): 32.97, 33.15, 33.15, 33.10, 32.65, 33.05, 32.88, 32.55, 32.73, 32.49, 33.04, 33.46 ms. Throughput: 364.37 iter/sec.
Timings for 4096K FFT length (13 cpus, 1 worker): 2.08 ms. Throughput: 480.50 iter/sec.
Timings for 4096K FFT length (13 cpus, 13 workers): 35.57, 35.26, 35.35, 34.97, 34.98, 35.27, 35.22, 35.10, 35.00, 35.07, 35.30, 35.47, 35.32 ms. Throughput: 369.11 iter/sec.
Timings for 4096K FFT length (14 cpus, 1 worker): 1.96 ms. Throughput: 509.45 iter/sec.
Timings for 4096K FFT length (14 cpus, 2 workers): 5.02, 5.10 ms. Throughput: 395.34 iter/sec.
Timings for 4096K FFT length (14 cpus, 7 workers): 18.77, 18.78, 18.79, 18.53, 18.77, 19.21, 18.79 ms. Throughput: 372.26 iter/sec.
Timings for 4096K FFT length (14 cpus, 14 workers): 37.86, 38.03, 37.62, 37.28, 37.40, 37.76, 37.46, 37.71, 37.65, 37.35, 37.68, 38.25, 37.94, 38.33 ms. Throughput: 371.01 iter/sec.[/CODE]
[QUOTE=xtreme2k;424846]It's interesting, comparing 1 worker vs. multiple workers, that 1 worker is actually faster throughout. Am I reading this right?
e.g.
14 cpus, 1 worker - 509 it/s
14 cpus, 14 workers - 371 it/s[/QUOTE] That's basically in line with my own observations, at that FFT size anyway. My theory is that with 14 workers, memory contention becomes a big bottleneck, so the cores actually spend more time idling while they wait for memory. Running 1 worker with all 14 cores means memory is no longer the bottleneck, and the run is once again limited by the CPUs. A quick test is to look at the CPU graphs (one graph per core) while the workers are running. Doing this while benchmarking isn't as useful because you need to see performance over a longer period of time, like 10-15 seconds... though you could adjust the benchmark settings to have it run longer. Anyway, when the cores are busy, if the physical cores aren't at 100% usage then they're waiting on something else (probably memory). You can also see the interesting effects of a multi-threaded worker with a sub-optimal threading scheme. For instance, if you take a somewhat small exponent in the 10M range and have a 14-core worker attack it with full force, you'll see the first core at around 100% while the other 13 cores sit at only 50-75%. That's because the smaller FFT sizes don't distribute themselves as well among a lot of cores. For something like that (small FFTs) you'd want to limit the worker to a couple of cores at most; otherwise the program itself is too inefficient to cope. At the larger FFT sizes (2M and above seem decent enough) it's not much of a problem, and the larger the FFT, the more efficient it seems to get at distributing the load between cores.
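As a rough back-of-envelope for the memory-contention theory (my own numbers; the passes-per-iteration figure is a guess, since each iteration streams the FFT data several times across the forward transform, inverse transform, and carry propagation):
[CODE]4096K FFT working set: 4M doubles x 8 bytes = 32 MB
1 worker  : one 32 MB set vs. 35 MB of L3 -> largely cache-resident
14 workers: 14 x 32 MB = 448 MB -> everything streams from DRAM
DRAM traffic at 371 iter/sec, assuming ~4 passes over the data per iteration:
  371 x 32 MB x 4 = ~47 GB/s, uncomfortably close to quad-channel
  DDR4-2133's ~68 GB/s theoretical peak[/CODE]
That would also explain why the single worker pulls ahead so dramatically at this FFT size: its one working set almost fits in the 35 MB L3.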
[QUOTE=xtreme2k;424793]The Xeon's turbo behavior is complicated: I can only get 2.6GHz, which is the base speed, at 12 threads. I was hoping for more.[/QUOTE] Your CPU has a TDP of 145W and it probably reaches that value with 12 cores utilizing Prime95 (you can verify that with some monitoring software). My CPU is 10c/20t and @ 2.5GHz I am hitting 105W with 9 cores running LLR / Prime95. Afterwards the CPU clock occasionally falls to 2.4GHz (on a single core) to stay below the 105W TDP.
[QUOTE=Cruelty;424900]Your CPU has a TDP of 145W and it probably reaches that value with 12 cores utilizing Prime95 (you can verify that with some monitoring software). My CPU is 10c/20t and @ 2.5GHz I am hitting 105W with 9 cores running LLR / Prime95. Afterwards the CPU clock occasionally falls to 2.4GHz (on a single core) to stay below the 105W TDP.[/QUOTE]
I can reach (almost) full turbo speeds on my dual-CPU Xeon E5-2697 v3 server. I say "almost" because there seems to be something common to many of the ProLiants I've tested this on... CPU #1 will reach full turbo speed, but CPU #2 runs one turbo step lower than the max. I don't know if it's the placement of the CPUs relative to the bank of fans, or maybe that CPU #2 is closer to where I have the hard drives in front (incoming air passes over the drives, through the fans, and then into the air-handling baffles). Whatever the case, CPU #2 must run a little hotter, so it steps down a bit. Not enough for me to get too concerned about, but I do keep the larger-FFT tests on CPU #1 just to give them more oomph to get the work done.
It's interesting that with these Xeon E5 v3 / i7-5xxx chips it is best to run 1 worker with multiple threads. It is clearly better than any other combination. I'm not sure of the underlying reason, but on the surface the parallelism and efficiency of 1 worker with multiple threads is far better than 2 or more workers all fighting for cache and memory bandwidth.
There is some magic in there when these CPUs run just 1 worker. This is at least true for 4096K.
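Putting numbers on it from the benchmark above:
[CODE]14 cpus, 1 worker  : 509.45 iter/sec
14 cpus, 14 workers: 371.01 iter/sec
509.45 / 371.01 = ~1.37, i.e. the single worker gets ~37% more throughput[/CODE]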