2017-04-11, 16:19  #144 
"/X\(‘‘)/X\"
Jan 2013
101101101011_{2} Posts 
I see the workers-vs-threads advantage crossover point is around 4096K. That's similar to what I see on my i5-6600 systems, which have only 6 MB of L3 versus the 16 MB the Ryzen has.
What happens if you benchmark using two workers? I wonder how that will affect performance with the split L3 cache in Ryzen. 
2017-04-11, 17:03  #145 
Jan 2003
7×29 Posts 
Ryzen 1700 benchmark results
Here are the results for the various permutations of cores and workers. It takes a long time to run, so I've only tested the 1024K and 8192K FFTs:
1024K FFT results:
Code:
Timings for 1024K FFT length (1 cpu, 1 worker): 7.87 ms. Throughput: 127.02 iter/sec.
Timings for 1024K FFT length (2 cpus, 1 worker): 4.02 ms. Throughput: 249.01 iter/sec.
Timings for 1024K FFT length (2 cpus, 2 workers): 7.88, 7.83 ms. Throughput: 254.57 iter/sec.
Timings for 1024K FFT length (3 cpus, 1 worker): 2.69 ms. Throughput: 371.47 iter/sec.
Timings for 1024K FFT length (3 cpus, 2 workers): 4.04, 7.88 ms. Throughput: 374.67 iter/sec.
Timings for 1024K FFT length (3 cpus, 3 workers): 7.88, 7.90, 7.90 ms. Throughput: 380.02 iter/sec.
Timings for 1024K FFT length (4 cpus, 1 worker): 2.06 ms. Throughput: 484.73 iter/sec.
Timings for 1024K FFT length (4 cpus, 2 workers): 4.12, 4.12 ms. Throughput: 485.85 iter/sec.
Timings for 1024K FFT length (4 cpus, 3 workers): 4.11, 8.03, 8.02 ms. Throughput: 492.51 iter/sec.
Timings for 1024K FFT length (4 cpus, 4 workers): 8.13, 7.93, 8.04, 7.94 ms. Throughput: 499.35 iter/sec.
Timings for 1024K FFT length (5 cpus, 1 worker): 1.72 ms. Throughput: 580.39 iter/sec.
Timings for 1024K FFT length (5 cpus, 2 workers): 2.75, 4.42 ms. Throughput: 589.95 iter/sec.
Timings for 1024K FFT length (5 cpus, 3 workers): 4.13, 4.15, 7.91 ms. Throughput: 609.18 iter/sec.
Timings for 1024K FFT length (5 cpus, 4 workers): 4.22, 8.13, 8.02, 7.80 ms. Throughput: 612.66 iter/sec.
Timings for 1024K FFT length (5 cpus, 5 workers): 8.16, 8.22, 8.19, 8.17, 7.79 ms. Throughput: 616.98 iter/sec.
Timings for 1024K FFT length (6 cpus, 1 worker): 1.47 ms. Throughput: 682.28 iter/sec.
Timings for 1024K FFT length (6 cpus, 2 workers): 2.74, 2.93 ms. Throughput: 705.98 iter/sec.
Timings for 1024K FFT length (6 cpus, 3 workers): 4.18, 4.20, 4.13 ms. Throughput: 719.08 iter/sec.
Timings for 1024K FFT length (6 cpus, 4 workers): 4.50, 4.44, 8.31, 8.31 ms. Throughput: 688.14 iter/sec.
Timings for 1024K FFT length (6 cpus, 5 workers): 4.51, 8.74, 8.78, 8.44, 8.44 ms. Throughput: 687.33 iter/sec.
Timings for 1024K FFT length (6 cpus, 6 workers): 9.02, 9.02, 8.87, 8.88, 8.54, 8.54 ms. Throughput: 681.16 iter/sec.
Timings for 1024K FFT length (7 cpus, 1 worker): 1.28 ms. Throughput: 779.77 iter/sec.
Timings for 1024K FFT length (7 cpus, 2 workers): 2.06, 2.73 ms. Throughput: 853.00 iter/sec.
Timings for 1024K FFT length (7 cpus, 3 workers): 2.90, 4.86, 4.32 ms. Throughput: 782.57 iter/sec.
Timings for 1024K FFT length (7 cpus, 4 workers): 4.80, 4.80, 4.56, 8.96 ms. Throughput: 747.62 iter/sec.
Timings for 1024K FFT length (7 cpus, 5 workers): 4.95, 4.94, 9.19, 9.07, 9.25 ms. Throughput: 731.95 iter/sec.
Timings for 1024K FFT length (7 cpus, 6 workers): 5.06, 9.69, 9.69, 9.31, 9.44, 9.50 ms. Throughput: 722.76 iter/sec.
Timings for 1024K FFT length (7 cpus, 7 workers): 10.13, 10.09, 9.94, 9.97, 9.57, 9.65, 9.63 ms. Throughput: 710.74 iter/sec.
Timings for 1024K FFT length (8 cpus, 1 worker): 1.13 ms. Throughput: 884.05 iter/sec.
Timings for 1024K FFT length (8 cpus, 2 workers): 2.63, 2.62 ms. Throughput: 761.75 iter/sec.
Timings for 1024K FFT length (8 cpus, 3 workers): 2.95, 3.34, 4.52 ms. Throughput: 860.18 iter/sec.
Timings for 1024K FFT length (8 cpus, 4 workers): 5.35, 5.35, 5.35, 5.35 ms. Throughput: 747.81 iter/sec.
Timings for 1024K FFT length (8 cpus, 5 workers): 5.43, 5.34, 5.43, 10.30, 10.17 ms. Throughput: 750.93 iter/sec.
Timings for 1024K FFT length (8 cpus, 6 workers): 5.44, 5.43, 10.74, 10.71, 10.71, 10.60 ms. Throughput: 742.24 iter/sec.
Timings for 1024K FFT length (8 cpus, 7 workers): 5.62, 10.58, 10.95, 11.15, 11.10, 11.10, 11.05 ms. Throughput: 724.08 iter/sec.
Timings for 1024K FFT length (8 cpus, 8 workers): 11.27, 11.30, 11.09, 11.41, 11.35, 11.26, 11.13, 11.13 ms. Throughput: 711.56 iter/sec.
Code:
Timings for 8192K FFT length (1 cpu, 1 worker): 68.57 ms. Throughput: 14.58 iter/sec.
Timings for 8192K FFT length (2 cpus, 1 worker): 35.11 ms. Throughput: 28.48 iter/sec.
Timings for 8192K FFT length (2 cpus, 2 workers): 68.42, 68.64 ms. Throughput: 29.18 iter/sec.
Timings for 8192K FFT length (3 cpus, 1 worker): 23.51 ms. Throughput: 42.54 iter/sec.
Timings for 8192K FFT length (3 cpus, 2 workers): 35.24, 68.68 ms. Throughput: 42.94 iter/sec.
Timings for 8192K FFT length (3 cpus, 3 workers): 69.41, 69.41, 68.65 ms. Throughput: 43.38 iter/sec.
Timings for 8192K FFT length (4 cpus, 1 worker): 18.12 ms. Throughput: 55.18 iter/sec.
Timings for 8192K FFT length (4 cpus, 2 workers): 36.31, 35.78 ms. Throughput: 55.49 iter/sec.
Timings for 8192K FFT length (4 cpus, 3 workers): 36.17, 70.14, 70.68 ms. Throughput: 56.05 iter/sec.
Timings for 8192K FFT length (4 cpus, 4 workers): 71.98, 70.70, 70.39, 71.11 ms. Throughput: 56.31 iter/sec.
Timings for 8192K FFT length (5 cpus, 1 worker): 15.44 ms. Throughput: 64.78 iter/sec.
Timings for 8192K FFT length (5 cpus, 2 workers): 25.09, 39.87 ms. Throughput: 64.94 iter/sec.
Timings for 8192K FFT length (5 cpus, 3 workers): 37.76, 37.42, 72.10 ms. Throughput: 67.08 iter/sec.
Timings for 8192K FFT length (5 cpus, 4 workers): 37.82, 73.26, 73.97, 72.15 ms. Throughput: 67.47 iter/sec.
Timings for 8192K FFT length (5 cpus, 5 workers): 74.51, 74.57, 74.66, 73.65, 72.49 ms. Throughput: 67.60 iter/sec.
Timings for 8192K FFT length (6 cpus, 1 worker): 13.89 ms. Throughput: 72.01 iter/sec.
Timings for 8192K FFT length (6 cpus, 2 workers): 26.75, 27.58 ms. Throughput: 73.64 iter/sec.
Timings for 8192K FFT length (6 cpus, 3 workers): 41.36, 39.80, 38.77 ms. Throughput: 75.10 iter/sec.
Timings for 8192K FFT length (6 cpus, 4 workers): 40.86, 40.11, 76.65, 76.78 ms. Throughput: 75.48 iter/sec.
Timings for 8192K FFT length (6 cpus, 5 workers): 40.51, 79.77, 79.87, 76.74, 77.09 ms. Throughput: 75.74 iter/sec.
Timings for 8192K FFT length (6 cpus, 6 workers): 80.98, 80.93, 80.56, 80.90, 76.80, 76.97 ms. Throughput: 75.49 iter/sec.
Timings for 8192K FFT length (7 cpus, 1 worker): 13.08 ms. Throughput: 76.46 iter/sec.
Timings for 8192K FFT length (7 cpus, 2 workers): 22.33, 28.45 ms. Throughput: 79.94 iter/sec.
Timings for 8192K FFT length (7 cpus, 3 workers): 29.86, 46.68, 42.07 ms. Throughput: 78.68 iter/sec.
Timings for 8192K FFT length (7 cpus, 4 workers): 44.69, 43.87, 42.54, 84.34 ms. Throughput: 80.54 iter/sec.
Timings for 8192K FFT length (7 cpus, 5 workers): 44.72, 44.22, 84.17, 84.02, 84.96 ms. Throughput: 80.53 iter/sec.
Timings for 8192K FFT length (7 cpus, 6 workers): 45.27, 87.30, 87.45, 84.04, 84.19, 85.19 ms. Throughput: 80.50 iter/sec.
Timings for 8192K FFT length (7 cpus, 7 workers): 88.28, 89.04, 88.03, 88.42, 84.05, 85.19, 85.67 ms. Throughput: 80.54 iter/sec.
Timings for 8192K FFT length (8 cpus, 1 worker): 12.65 ms. Throughput: 79.04 iter/sec.
Timings for 8192K FFT length (8 cpus, 2 workers): 24.54, 24.30 ms. Throughput: 81.90 iter/sec.
Timings for 8192K FFT length (8 cpus, 3 workers): 31.89, 34.01, 48.59 ms. Throughput: 81.34 iter/sec.
Timings for 8192K FFT length (8 cpus, 4 workers): 49.14, 49.03, 49.05, 49.04 ms. Throughput: 81.52 iter/sec.
Timings for 8192K FFT length (8 cpus, 5 workers): 49.10, 48.61, 48.66, 95.25, 94.38 ms. Throughput: 82.58 iter/sec.
Timings for 8192K FFT length (8 cpus, 6 workers): 49.30, 49.26, 97.05, 96.12, 96.48, 95.65 ms. Throughput: 82.11 iter/sec.
Timings for 8192K FFT length (8 cpus, 7 workers): 50.30, 94.02, 97.53, 96.59, 96.13, 96.87, 96.09 ms. Throughput: 82.25 iter/sec.
Timings for 8192K FFT length (8 cpus, 8 workers): 96.94, 98.16, 96.04, 98.54, 97.29, 97.34, 96.83, 96.22 ms. Throughput: 82.33 iter/sec.
Last fiddled with by db597 on 2017-04-11 at 17:09 
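If anyone wants to digest output like the above without eyeballing it, a small throwaway parser does the job. This is just a sketch: the `best_worker_counts` helper and its regex are my own invention, matched to the "Timings for ... Throughput: ..." line format shown in the Code blocks.

```python
import re

# Matches the "(N cpus, M workers) ... Throughput: X iter/sec" part of each
# benchmark line; hypothetical helper, not part of P95 itself.
LINE_RE = re.compile(
    r"\((?P<cpus>\d+) cpus?, (?P<workers>\d+) workers?\).*"
    r"Throughput: (?P<tput>[\d.]+) iter/sec"
)

def best_worker_counts(lines):
    """Return {cpus: (workers, throughput)} for the fastest split at each core count."""
    best = {}
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        cpus = int(m.group("cpus"))
        workers = int(m.group("workers"))
        tput = float(m.group("tput"))
        if cpus not in best or tput > best[cpus][1]:
            best[cpus] = (workers, tput)
    return best

sample = [
    "Timings for 8192K FFT length (8 cpus, 1 worker): 12.65 ms. Throughput: 79.04 iter/sec.",
    "Timings for 8192K FFT length (8 cpus, 5 workers): ... Throughput: 82.58 iter/sec.",
]
print(best_worker_counts(sample))  # {8: (5, 82.58)}
```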
2017-04-11, 23:05  #146  
Dec 2016
2·3^{2}·5 Posts 
Quote:
core #1 yields 14.58 iters/sec
core #2 yields 14.60 iters/sec
core #3 yields 14.20 iters/sec
core #4 yields 12.93 iters/sec
core #5 yields 11.29 iters/sec
core #6 yields 8.14 iters/sec
core #7 yields 4.80 iters/sec
core #8 yields 1.79 iters/sec

So the first 4 cores scale almost linearly; going to 6 cores you already lose some performance, and going to 8 cores adds virtually nothing. The 1024K FFT shows strange behavior when going from 6 to 7 cores:

core #1 yields 127.02 iters/sec
core #2 yields 127.55 iters/sec
core #3 yields 125.45 iters/sec
core #4 yields 119.33 iters/sec
core #5 yields 117.63 iters/sec
core #6 yields 102.10 iters/sec
core #7 yields 133.92 iters/sec
core #8 yields 31.05 iters/sec

Generally, it scales better than the 8192K benchmark, but it also hits a (memory bandwidth?) bottleneck eventually. 
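The per-core yields quoted above appear to be successive differences of the best throughput at each core count. A quick sketch reproduces them from db597's 8192K numbers (I'm assuming "best throughput" means the maximum over worker splits at each core count, which is what makes the arithmetic come out right):

```python
# Best 8192K throughput at 1..8 cores, taken from the benchmark post above
# (maximum over the worker splits tried at each core count).
best_tput = [14.58, 29.18, 43.38, 56.31, 67.60, 75.74, 80.54, 82.33]

# Marginal throughput of each added core: difference of successive bests.
marginal = [round(b - a, 2) for a, b in zip([0.0] + best_tput, best_tput)]
print(marginal)  # [14.58, 14.6, 14.2, 12.93, 11.29, 8.14, 4.8, 1.79]
```

These match the "core #N yields" figures, which is why the jump from 6 to 7 cores looks so small here.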

2017-04-12, 00:18  #147 
"/X\(‘‘)/X\"
Jan 2013
37×79 Posts 
It could also be the CCX bandwidth and split L3 causing weirdness with the 1024K FFT. A 1024K FFT will consume about 8 MB, which will almost fit in a CCX's 8 MB of L3 cache.
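The ~8 MB figure is just the transform length times the size of a double:

```python
# Sanity check on the ~8 MB estimate: a 1024K FFT works on 1024*1024 values,
# each a double-precision float of 8 bytes.
fft_length = 1024 * 1024       # "1024K"
bytes_per_double = 8
footprint_mib = fft_length * bytes_per_double / 2**20
print(footprint_mib)  # 8.0
```

(Prime95's real working set is somewhat larger than the raw data, which is why it only "almost" fits in one CCX's 8 MB of L3.)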
Clock-for-clock, the 8-core Ryzen (with half-speed FMA) is faster with one worker for FFT sizes up to 2048K than my i5-6600 @ 3.3 GHz with dual-channel dual-rank DDR4-2133 (i5 throughput first, with the Ryzen figure in parentheses):

1024K FFT, 4 cpu, 1 worker: 840.336 (5% slower than 886.42)
2048K FFT, 4 cpu, 1 worker: 371.747 (11% slower than 418.60)
2560K FFT, 4 cpu, 1 worker: 298.507 (18% faster than 252.38)
4096K FFT, 4 cpu, 1 worker: 186.220 (29% faster than 144.58)
8192K FFT, 4 cpu, 1 worker: 83.822 (6% faster than 78.83)

An FFT up to 2048K is mostly going to fit in Ryzen's 16 MB of L3 cache, which may explain why a single worker is faster there than on the i5. The i5 has only 6 MB of L3, which won't hold even a 1024K FFT.

1024K FFT, 4 cpu, 4 workers: 776.38 (9% faster than 711.26)
2048K FFT, 4 cpu, 4 workers: 367.50 (4% faster than 352.17)
2560K FFT, 4 cpu, 4 workers: 293.34 (25% faster than 234.66)
4096K FFT, 4 cpu, 4 workers: 172.09 (19% faster than 144.14)
8192K FFT, 4 cpu, 4 workers: 87.53 (6% faster than 82.33)

We can also look at how much of a benefit each additional Ryzen core provides, versus the i5, to show the bottleneck:

8192K FFT, 4 cpu, 1 worker: 83.822 (52% faster than 55.18, using 4 cores)
8192K FFT, 4 cpu, max workers: 87.53 (55% faster than 56.31, using 4 cores)
8192K FFT, 4 cpu, 1 worker: 83.822 (29% faster than 64.78, using 5 cores)
8192K FFT, 4 cpu, max workers: 87.53 (29% faster than 67.60, using 5 cores)
8192K FFT, 4 cpu, 1 worker: 83.822 (16% faster than 72.01, using 6 cores)
8192K FFT, 4 cpu, max workers: 87.53 (16% faster than 75.49, using 6 cores)
8192K FFT, 4 cpu, 1 worker: 83.822 (10% faster than 76.46, using 7 cores)
8192K FFT, 4 cpu, max workers: 87.53 (9% faster than 80.54, using 7 cores)
8192K FFT, 4 cpu, 1 worker: 83.822 (6% faster than 78.83, using 8 cores)
8192K FFT, 4 cpu, max workers: 87.53 (6% faster than 82.33, using 8 cores)

So even though Ryzen has half-speed FMA, it still seems to be choking on a lack of memory bandwidth when using more than 6 cores, either from raw bandwidth or from the loss of interleaving that comes from having half the ranks. I wish I had some single-rank DDR4-2133 to test how much of a difference the ranks make. I'll see if I can find some useful numbers in the benchmarks thread to make a single-rank comparison.

So the team-red $/iter/sec sweet spot might be the Ryzen 1600, with 6 cores at 3.2 GHz. It's about the same price as an i5-7500 (dual-rank DDR4-2400) or i5-7400 (single-rank DDR4-2400), which are the current sweet spots on team blue. 
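For reference, the "% faster / % slower" figures in these comparisons are just the ratio of the two throughputs expressed as a percentage difference. A tiny sketch (the `relative` helper is hypothetical, not anything from P95):

```python
# Express one throughput relative to another the way the comparisons above do:
# positive percentage means the first figure is faster.
def relative(a_iters, b_iters):
    pct = (a_iters / b_iters - 1) * 100
    return f"{abs(pct):.0f}% {'faster' if pct >= 0 else 'slower'}"

print(relative(298.507, 252.38))  # 18% faster
print(relative(840.336, 886.42))  # 5% slower
```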
2017-04-12, 04:32  #148 
"/X\(‘‘)/X\"
Jan 2013
37×79 Posts 
Fred posted some numbers for an i5-6500 with what appears to be single-rank DDR4-2133. With four-core turbo, that CPU also runs at 3.3 GHz, so we can compare the single- versus dual-rank timings at 4096K:

4096K FFT, 4 cpu, 1 worker, single rank: 156.97 (9% faster than 144.58 for 8-core Ryzen)
4096K FFT, 4 cpu, 4 workers, single rank: 159.00 (10% faster than 144.14 for 8-core Ryzen)

My 1-worker timing of 186.22 is 19% higher, while my 4-worker timing of 172.09 is 8% higher. This shows dual rank matters. The open question is how much dual rank will help Ryzen. The Ryzen system has more memory bandwidth, but something is still holding it back, especially in light of Ryzen pulling well ahead on the tiny 1024K FFT. 
2017-04-12, 07:42  #149 
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 89<O<88
3×29×83 Posts 
So, to summarize: so far, with non-optimal P95 code, Ryzen appears to be ~10% slower than comparable Intel chips? IOW, there's still a lot of research and optimization to be done before we know the long-term ability of this chip?

2017-04-12, 10:15  #150 
"Carlos Pinho"
Oct 2011
Milton Keynes, UK
3×1,663 Posts 
What's the power consumption of the two processors whilst doing the same type of work at full CPU occupancy? And what's the overall investment for each type of machine?

2017-04-12, 10:29  #151 
Aug 2010
Republic of Belarus
2·89 Posts 
db597, could you please run a benchmark for a 100M-digit exponent? Thank you in advance.

2017-04-12, 12:49  #152  
Jan 2003
7·29 Posts 
BTW, I believe Ryzen processors no longer have 3DNow!; that instruction set has been retired. Yet I see in the results file that P95 still thinks Ryzen supports it: "CPU features: 3DNow! Prefetch, SSE, SSE2, SSE4, AVX, AVX2, FMA".
Quote:


2017-04-12, 13:21  #153  
"/X\(‘‘)/X\"
Jan 2013
37×79 Posts 
Quote:
If there's a way to do more work purely in cache, it would likely benefit both platforms.
Last fiddled with by Mark Rose on 2017-04-12 at 13:44 

2017-04-12, 13:35  #154  
Aug 2010
Republic of Belarus
2×89 Posts 
Quote:
~ thanks a lot 
