"Teal Dulcet"
Jun 2018
65_{10} Posts

Quote:
Originally Posted by drkirkby
Why would 3 workers give the most throughput on a dualsocket computer?

I ran the throughput benchmark on an AWS c5.metal instance and got different results. Specifically, two workers were faster at the higher FFT lengths. Here is the fastest worker count for each supported FFT length that is benchmarked by default:
 6 workers: 2048K, 2100K, 2160K, 2240K, 2304K, 2400K
 4 workers: 2520K, 2560K, 2592K, 2688K, 2880K, 2940K, 3000K, 3072K, 3136K, 3200K, 3360K, 3456K, 3600K, 3840K, 3920K, 4200K, 4320K, 4480K, 4800K
 3 workers: 4032K
 2 workers: 4608K, 4704K, 5040K, 5120K, 5184K, 5376K, 5760K, 6048K, 6144K, 6272K, 6400K, 6720K, 7056K, 7168K, 7200K, 7680K, 8064K
Here are the actual results for one of the FFT lengths currently used for wavefront first-time tests:
Code:
Timings for 6144K FFT length (48 cores, 1 worker): 1.35 ms. Throughput: 740.08 iter/sec.
Timings for 6144K FFT length (48 cores, 2 workers): 1.16, 1.19 ms. Throughput: 1697.08 iter/sec.
Timings for 6144K FFT length (48 cores, 3 workers): 3.04, 3.07, 1.23 ms. Throughput: 1470.23 iter/sec.
Timings for 6144K FFT length (48 cores, 4 workers): 3.05, 3.02, 3.02, 3.00 ms. Throughput: 1322.79 iter/sec.
Timings for 6144K FFT length (48 cores, 6 workers): 5.47, 5.47, 5.48, 5.39, 5.37, 5.39 ms. Throughput: 1105.26 iter/sec.
Timings for 6144K FFT length (48 cores, 8 workers): 7.56, 7.54, 7.56, 7.55, 7.41, 7.50, 7.46, 7.44 ms. Throughput: 1066.38 iter/sec.
Timings for 6144K FFT length (48 cores, 12 workers): 11.56, 11.61, 11.62, 12.32, 11.57, 11.51, 11.54, 11.55, 11.29, 11.40, 11.43, 11.25 ms. Throughput: 1039.05 iter/sec.
Timings for 6144K FFT length (48 cores, 16 workers): 20.99, 20.72, 20.95, 20.82, 20.78, 21.04, 20.89, 20.78, 14.67, 13.45, 14.54, 14.61, 13.46, 13.71, 13.94, 14.91 ms. Throughput: 949.13 iter/sec.
Timings for 6144K FFT length (48 cores, 24 workers): 57.30, 56.56, 56.51, 56.29, 56.69, 56.99, 56.67, 56.94, 56.71, 56.65, 56.85, 56.70, 26.03, 30.28, 25.78, 27.11, 29.24, 29.75, 27.03, 29.71, 27.35, 28.07, 30.19, 28.15 ms. Throughput: 637.96 iter/sec.
Timings for 6144K FFT length (48 cores, 48 workers): 130.05, 132.14, 128.50, 128.65, 129.51, 129.92, 128.45, 129.71, 128.78, 129.41, 130.18, 128.95, 130.09, 129.10, 130.14, 129.61, 128.04, 130.51, 129.25, 129.42, 129.92, 130.49, 129.53, 131.25, 86.04, 102.15, 87.65, 103.38, 74.75, 91.32, 91.88, 76.62, 75.89, 103.66, 101.44, 101.42, 95.30, 93.57, 79.79, 102.96, 72.71, 95.29, 98.47, 87.29, 100.54, 87.55, 94.32, 102.50 ms. Throughput: 449.43 iter/sec.
By default, MPrime wanted to use 12 workers, but 2 workers are significantly faster at this FFT length: 1697.08 versus 1039.05 iter/sec, or about 63% more throughput.
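
For anyone who wants to check the comparison, here is a minimal Python sketch that parses a subset of the benchmark lines above and picks the worker count with the highest reported throughput. The regular expression is inferred only from the lines in this post, not from any official description of the output format, and the function names are hypothetical. The estimate also assumes that total throughput is roughly the sum of each worker's 1000 / (ms per iteration); that is my reading of the numbers, not a confirmed description of how MPrime computes the figure.

```python
import re

# Benchmark lines copied from the output above (a subset of worker counts).
OUTPUT = """\
Timings for 6144K FFT length (48 cores, 1 worker): 1.35 ms. Throughput: 740.08 iter/sec.
Timings for 6144K FFT length (48 cores, 2 workers): 1.16, 1.19 ms. Throughput: 1697.08 iter/sec.
Timings for 6144K FFT length (48 cores, 3 workers): 3.04, 3.07, 1.23 ms. Throughput: 1470.23 iter/sec.
Timings for 6144K FFT length (48 cores, 4 workers): 3.05, 3.02, 3.02, 3.00 ms. Throughput: 1322.79 iter/sec.
Timings for 6144K FFT length (48 cores, 6 workers): 5.47, 5.47, 5.48, 5.39, 5.37, 5.39 ms. Throughput: 1105.26 iter/sec.
Timings for 6144K FFT length (48 cores, 12 workers): 11.56, 11.61, 11.62, 12.32, 11.57, 11.51, 11.54, 11.55, 11.29, 11.40, 11.43, 11.25 ms. Throughput: 1039.05 iter/sec.
"""

# Pattern inferred from the lines above; not an official format specification.
LINE_RE = re.compile(
    r"Timings for (\d+)K FFT length \((\d+) cores?, (\d+) workers?\): "
    r"([\d., ]+) ms\. Throughput: ([\d.]+) iter/sec\."
)


def parse_results(text):
    """Map worker count -> (per-worker timings in ms, reported iter/sec)."""
    results = {}
    for match in LINE_RE.finditer(text):
        workers = int(match.group(3))
        timings = [float(t) for t in match.group(4).split(",")]
        results[workers] = (timings, float(match.group(5)))
    return results


def estimated_throughput(timings_ms):
    """Assumed model: each worker contributes 1000 / (ms per iteration)."""
    return sum(1000.0 / t for t in timings_ms)


results = parse_results(OUTPUT)

# Worker count with the highest throughput as reported by the benchmark.
best = max(results, key=lambda w: results[w][1])

for workers, (timings, reported) in sorted(results.items()):
    print(f"{workers:2d} workers: reported {reported:8.2f}, "
          f"estimated {estimated_throughput(timings):8.2f} iter/sec")
print(f"Fastest worker count in this subset: {best}")
```

Under that assumed model, the estimates land within about 1% of the reported figures for these lines, which is what you would expect if the printed timings are rounded averages. Pasting the full benchmark output into OUTPUT should work the same way, as long as the lines keep the format shown above.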
