#144 |
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/
2⁴×199 Posts
I see the workers vs threads advantage crossover point is around 4096K. That's similar to what I see with my i5-6600 systems, which have only 6M of L3 versus the 16M the Ryzen has.

What happens if you benchmark using two workers? I wonder how that will affect performance with the split L3 cache in Ryzen.
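For anyone who wants to eyeball that crossover from a benchmark dump, here's a rough, untested Python sketch. The throughputs are the 8-core Ryzen figures that show up later in this thread; the dict name and layout are just mine:

Code:
# Compare "one worker using all cores" vs "one worker per core" at each FFT size.
# Throughputs (iter/sec) are the 8-core Ryzen numbers quoted later in this thread.
ryzen_8core = {
    # fft_K: (1 worker on 8 cores, 8 workers on 1 core each)
    1024: (884.05, 711.56),
    2048: (418.60, 352.17),
    2560: (252.38, 234.66),
    4096: (144.58, 144.14),
    8192: (79.04, 82.33),
}

for fft_k, (one_worker, many_workers) in sorted(ryzen_8core.items()):
    winner = "one worker" if one_worker >= many_workers else "separate workers"
    print(f"{fft_k}K: 1 worker {one_worker:.2f} vs 8 workers {many_workers:.2f} -> {winner}")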
#145 |
Jan 2003
2·103 Posts |
Here's the full matrix of results for the various numbers of cores / workers. It takes a long time to run, so I've only tested the 1024K and 8192K FFTs:
1024K FFT results:
Code:
Timings for 1024K FFT length (1 cpu, 1 worker): 7.87 ms. Throughput: 127.02 iter/sec.
Timings for 1024K FFT length (2 cpus, 1 worker): 4.02 ms. Throughput: 249.01 iter/sec.
Timings for 1024K FFT length (2 cpus, 2 workers): 7.88, 7.83 ms. Throughput: 254.57 iter/sec.
Timings for 1024K FFT length (3 cpus, 1 worker): 2.69 ms. Throughput: 371.47 iter/sec.
Timings for 1024K FFT length (3 cpus, 2 workers): 4.04, 7.88 ms. Throughput: 374.67 iter/sec.
Timings for 1024K FFT length (3 cpus, 3 workers): 7.88, 7.90, 7.90 ms. Throughput: 380.02 iter/sec.
Timings for 1024K FFT length (4 cpus, 1 worker): 2.06 ms. Throughput: 484.73 iter/sec.
Timings for 1024K FFT length (4 cpus, 2 workers): 4.12, 4.12 ms. Throughput: 485.85 iter/sec.
Timings for 1024K FFT length (4 cpus, 3 workers): 4.11, 8.03, 8.02 ms. Throughput: 492.51 iter/sec.
Timings for 1024K FFT length (4 cpus, 4 workers): 8.13, 7.93, 8.04, 7.94 ms. Throughput: 499.35 iter/sec.
Timings for 1024K FFT length (5 cpus, 1 worker): 1.72 ms. Throughput: 580.39 iter/sec.
Timings for 1024K FFT length (5 cpus, 2 workers): 2.75, 4.42 ms. Throughput: 589.95 iter/sec.
Timings for 1024K FFT length (5 cpus, 3 workers): 4.13, 4.15, 7.91 ms. Throughput: 609.18 iter/sec.
Timings for 1024K FFT length (5 cpus, 4 workers): 4.22, 8.13, 8.02, 7.80 ms. Throughput: 612.66 iter/sec.
Timings for 1024K FFT length (5 cpus, 5 workers): 8.16, 8.22, 8.19, 8.17, 7.79 ms. Throughput: 616.98 iter/sec.
Timings for 1024K FFT length (6 cpus, 1 worker): 1.47 ms. Throughput: 682.28 iter/sec.
Timings for 1024K FFT length (6 cpus, 2 workers): 2.74, 2.93 ms. Throughput: 705.98 iter/sec.
Timings for 1024K FFT length (6 cpus, 3 workers): 4.18, 4.20, 4.13 ms. Throughput: 719.08 iter/sec.
Timings for 1024K FFT length (6 cpus, 4 workers): 4.50, 4.44, 8.31, 8.31 ms. Throughput: 688.14 iter/sec.
Timings for 1024K FFT length (6 cpus, 5 workers): 4.51, 8.74, 8.78, 8.44, 8.44 ms. Throughput: 687.33 iter/sec.
Timings for 1024K FFT length (6 cpus, 6 workers): 9.02, 9.02, 8.87, 8.88, 8.54, 8.54 ms. Throughput: 681.16 iter/sec.
Timings for 1024K FFT length (7 cpus, 1 worker): 1.28 ms. Throughput: 779.77 iter/sec.
Timings for 1024K FFT length (7 cpus, 2 workers): 2.06, 2.73 ms. Throughput: 853.00 iter/sec.
Timings for 1024K FFT length (7 cpus, 3 workers): 2.90, 4.86, 4.32 ms. Throughput: 782.57 iter/sec.
Timings for 1024K FFT length (7 cpus, 4 workers): 4.80, 4.80, 4.56, 8.96 ms. Throughput: 747.62 iter/sec.
Timings for 1024K FFT length (7 cpus, 5 workers): 4.95, 4.94, 9.19, 9.07, 9.25 ms. Throughput: 731.95 iter/sec.
Timings for 1024K FFT length (7 cpus, 6 workers): 5.06, 9.69, 9.69, 9.31, 9.44, 9.50 ms. Throughput: 722.76 iter/sec.
Timings for 1024K FFT length (7 cpus, 7 workers): 10.13, 10.09, 9.94, 9.97, 9.57, 9.65, 9.63 ms. Throughput: 710.74 iter/sec.
Timings for 1024K FFT length (8 cpus, 1 worker): 1.13 ms. Throughput: 884.05 iter/sec.
Timings for 1024K FFT length (8 cpus, 2 workers): 2.63, 2.62 ms. Throughput: 761.75 iter/sec.
Timings for 1024K FFT length (8 cpus, 3 workers): 2.95, 3.34, 4.52 ms. Throughput: 860.18 iter/sec.
Timings for 1024K FFT length (8 cpus, 4 workers): 5.35, 5.35, 5.35, 5.35 ms. Throughput: 747.81 iter/sec.
Timings for 1024K FFT length (8 cpus, 5 workers): 5.43, 5.34, 5.43, 10.30, 10.17 ms. Throughput: 750.93 iter/sec.
Timings for 1024K FFT length (8 cpus, 6 workers): 5.44, 5.43, 10.74, 10.71, 10.71, 10.60 ms. Throughput: 742.24 iter/sec.
Timings for 1024K FFT length (8 cpus, 7 workers): 5.62, 10.58, 10.95, 11.15, 11.10, 11.10, 11.05 ms. Throughput: 724.08 iter/sec.
Timings for 1024K FFT length (8 cpus, 8 workers): 11.27, 11.30, 11.09, 11.41, 11.35, 11.26, 11.13, 11.13 ms. Throughput: 711.56 iter/sec.

8192K FFT results:
Code:
Timings for 8192K FFT length (1 cpu, 1 worker): 68.57 ms. Throughput: 14.58 iter/sec.
Timings for 8192K FFT length (2 cpus, 1 worker): 35.11 ms. Throughput: 28.48 iter/sec.
Timings for 8192K FFT length (2 cpus, 2 workers): 68.42, 68.64 ms. Throughput: 29.18 iter/sec.
Timings for 8192K FFT length (3 cpus, 1 worker): 23.51 ms. Throughput: 42.54 iter/sec.
Timings for 8192K FFT length (3 cpus, 2 workers): 35.24, 68.68 ms. Throughput: 42.94 iter/sec.
Timings for 8192K FFT length (3 cpus, 3 workers): 69.41, 69.41, 68.65 ms. Throughput: 43.38 iter/sec.
Timings for 8192K FFT length (4 cpus, 1 worker): 18.12 ms. Throughput: 55.18 iter/sec.
Timings for 8192K FFT length (4 cpus, 2 workers): 36.31, 35.78 ms. Throughput: 55.49 iter/sec.
Timings for 8192K FFT length (4 cpus, 3 workers): 36.17, 70.14, 70.68 ms. Throughput: 56.05 iter/sec.
Timings for 8192K FFT length (4 cpus, 4 workers): 71.98, 70.70, 70.39, 71.11 ms. Throughput: 56.31 iter/sec.
Timings for 8192K FFT length (5 cpus, 1 worker): 15.44 ms. Throughput: 64.78 iter/sec.
Timings for 8192K FFT length (5 cpus, 2 workers): 25.09, 39.87 ms. Throughput: 64.94 iter/sec.
Timings for 8192K FFT length (5 cpus, 3 workers): 37.76, 37.42, 72.10 ms. Throughput: 67.08 iter/sec.
Timings for 8192K FFT length (5 cpus, 4 workers): 37.82, 73.26, 73.97, 72.15 ms. Throughput: 67.47 iter/sec.
Timings for 8192K FFT length (5 cpus, 5 workers): 74.51, 74.57, 74.66, 73.65, 72.49 ms. Throughput: 67.60 iter/sec.
Timings for 8192K FFT length (6 cpus, 1 worker): 13.89 ms. Throughput: 72.01 iter/sec.
Timings for 8192K FFT length (6 cpus, 2 workers): 26.75, 27.58 ms. Throughput: 73.64 iter/sec.
Timings for 8192K FFT length (6 cpus, 3 workers): 41.36, 39.80, 38.77 ms. Throughput: 75.10 iter/sec.
Timings for 8192K FFT length (6 cpus, 4 workers): 40.86, 40.11, 76.65, 76.78 ms. Throughput: 75.48 iter/sec.
Timings for 8192K FFT length (6 cpus, 5 workers): 40.51, 79.77, 79.87, 76.74, 77.09 ms. Throughput: 75.74 iter/sec.
Timings for 8192K FFT length (6 cpus, 6 workers): 80.98, 80.93, 80.56, 80.90, 76.80, 76.97 ms. Throughput: 75.49 iter/sec.
Timings for 8192K FFT length (7 cpus, 1 worker): 13.08 ms. Throughput: 76.46 iter/sec.
Timings for 8192K FFT length (7 cpus, 2 workers): 22.33, 28.45 ms. Throughput: 79.94 iter/sec.
Timings for 8192K FFT length (7 cpus, 3 workers): 29.86, 46.68, 42.07 ms. Throughput: 78.68 iter/sec.
Timings for 8192K FFT length (7 cpus, 4 workers): 44.69, 43.87, 42.54, 84.34 ms. Throughput: 80.54 iter/sec.
Timings for 8192K FFT length (7 cpus, 5 workers): 44.72, 44.22, 84.17, 84.02, 84.96 ms. Throughput: 80.53 iter/sec.
Timings for 8192K FFT length (7 cpus, 6 workers): 45.27, 87.30, 87.45, 84.04, 84.19, 85.19 ms. Throughput: 80.50 iter/sec.
Timings for 8192K FFT length (7 cpus, 7 workers): 88.28, 89.04, 88.03, 88.42, 84.05, 85.19, 85.67 ms. Throughput: 80.54 iter/sec.
Timings for 8192K FFT length (8 cpus, 1 worker): 12.65 ms. Throughput: 79.04 iter/sec.
Timings for 8192K FFT length (8 cpus, 2 workers): 24.54, 24.30 ms. Throughput: 81.90 iter/sec.
Timings for 8192K FFT length (8 cpus, 3 workers): 31.89, 34.01, 48.59 ms. Throughput: 81.34 iter/sec.
Timings for 8192K FFT length (8 cpus, 4 workers): 49.14, 49.03, 49.05, 49.04 ms. Throughput: 81.52 iter/sec.
Timings for 8192K FFT length (8 cpus, 5 workers): 49.10, 48.61, 48.66, 95.25, 94.38 ms. Throughput: 82.58 iter/sec.
Timings for 8192K FFT length (8 cpus, 6 workers): 49.30, 49.26, 97.05, 96.12, 96.48, 95.65 ms. Throughput: 82.11 iter/sec.
Timings for 8192K FFT length (8 cpus, 7 workers): 50.30, 94.02, 97.53, 96.59, 96.13, 96.87, 96.09 ms. Throughput: 82.25 iter/sec.
Timings for 8192K FFT length (8 cpus, 8 workers): 96.94, 98.16, 96.04, 98.54, 97.29, 97.34, 96.83, 96.22 ms. Throughput: 82.33 iter/sec.

Last fiddled with by db597 on 2017-04-11 at 17:09
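If anyone wants to slice this table differently, here's a rough, untested Python sketch that pulls the "Timings for ..." lines out of the benchmark output and reports the best worker count at each core count (the results file name is an assumption; point it at wherever your prime95 writes its benchmark output):

Code:
import re
from collections import defaultdict

# Parse lines like:
#   Timings for 1024K FFT length (4 cpus, 3 workers): ... Throughput: 492.51 iter/sec.
pattern = re.compile(
    r"Timings for (\d+)K FFT length \((\d+) cpus?, (\d+) workers?\):"
    r".*Throughput: ([\d.]+) iter/sec"
)

# (fft_K, cpus) -> (best throughput, worker count that achieved it)
best = defaultdict(lambda: (0.0, 0))

with open("results.bench.txt") as f:          # file name is an assumption
    for line in f:
        m = pattern.search(line)
        if not m:
            continue
        fft_k, cpus, workers, thru = int(m[1]), int(m[2]), int(m[3]), float(m[4])
        if thru > best[(fft_k, cpus)][0]:
            best[(fft_k, cpus)] = (thru, workers)

for (fft_k, cpus), (thru, workers) in sorted(best.items()):
    print(f"{fft_k}K FFT, {cpus} cores: best is {workers} worker(s) at {thru:.2f} iter/sec")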
#146 | |
Dec 2016
127₁₀ Posts
Quote:
Looking at the incremental throughput each additional core buys (taking the best worker count at each core count), the 8192K FFT gives:

core #1 yields 14.58 iters/sec
core #2 yields 14.60 iters/sec
core #3 yields 14.20 iters/sec
core #4 yields 12.93 iters/sec
core #5 yields 11.29 iters/sec
core #6 yields 8.14 iters/sec
core #7 yields 4.80 iters/sec
core #8 yields 1.79 iters/sec

So the first 4 cores scale almost linearly, going to 6 cores you already lose some performance, and going to 8 cores adds virtually nothing.

The 1024K FFT behaves strangely when going from 6 to 7 cores:

core #1 yields 127.02 iters/sec
core #2 yields 127.55 iters/sec
core #3 yields 125.45 iters/sec
core #4 yields 119.33 iters/sec
core #5 yields 117.63 iters/sec
core #6 yields 102.10 iters/sec
core #7 yields 133.92 iters/sec
core #8 yields 31.05 iters/sec

Generally, it scales better than the 8192K benchmark, but it also hits a (memory bandwidth?) bottleneck eventually.
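For reference, those per-core figures are just successive differences of the best throughput at each core count. A small sketch of the arithmetic (the list is the best 8192K iter/sec at 1 through 8 cores, taken from the results above):

Code:
# Best 8192K throughput (iter/sec) at 1..8 cores, regardless of worker count.
best_8192k = [14.58, 29.18, 43.38, 56.31, 67.60, 75.74, 80.54, 82.33]

prev = 0.0
for cores, total in enumerate(best_8192k, start=1):
    print(f"core #{cores} yields {total - prev:.2f} iters/sec")
    prev = total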
#147 |
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/
2⁴·199 Posts
It could also be the CCX bandwidth and split L3 causing weirdness with the 1024K FFT. A 1024K FFT will consume about 8 MB, which will almost fit in a CCX's 8 MB of L3 cache.
Clock-for-clock, the 8 core Ryzen (with half speed FMA) is faster with one worker for FFT sizes up to 2048K than my i5-6600 @ 3.3 GHz with dual-channel dual-rank DDR4-2133 (i5 numbers first, Ryzen numbers in parentheses):

1024K FFT, 4 cpu, 1 worker: 840.336 (5% slower than 886.42)
2048K FFT, 4 cpu, 1 worker: 371.747 (11% slower than 418.60)
2560K FFT, 4 cpu, 1 worker: 298.507 (18% faster than 252.38)
4096K FFT, 4 cpu, 1 worker: 186.220 (29% faster than 144.58)
8192K FFT, 4 cpu, 1 worker: 83.822 (6% faster than 78.83)

An FFT up to 2048K is mostly going to fit in Ryzen's 16MB of L3 cache, which may explain why Ryzen running a single worker is faster than the i5. The i5 has only 6MB of L3, which won't hold a 1024K FFT.

1024K FFT, 4 cpu, 4 workers: 776.38 (9% faster than 711.26)
2048K FFT, 4 cpu, 4 workers: 367.50 (4% faster than 352.17)
2560K FFT, 4 cpu, 4 workers: 293.34 (25% faster than 234.66)
4096K FFT, 4 cpu, 4 workers: 172.09 (19% faster than 144.14)
8192K FFT, 4 cpu, 4 workers: 87.53 (6% faster than 82.33)

We can also look at how much of a benefit each additional Ryzen core provides, versus the i5, to show the bottleneck:

8192K FFT, 4 cpu, 1 worker: 83.822 (52% faster than 55.18, using 4 cores)
8192K FFT, 4 cpu, max workers: 87.53 (55% faster than 56.31, using 4 cores)
8192K FFT, 4 cpu, 1 worker: 83.822 (29% faster than 64.78, using 5 cores)
8192K FFT, 4 cpu, max workers: 87.53 (29% faster than 67.60, using 5 cores)
8192K FFT, 4 cpu, 1 worker: 83.822 (16% faster than 72.01, using 6 cores)
8192K FFT, 4 cpu, max workers: 87.53 (16% faster than 75.49, using 6 cores)
8192K FFT, 4 cpu, 1 worker: 83.822 (10% faster than 76.46, using 7 cores)
8192K FFT, 4 cpu, max workers: 87.53 (9% faster than 80.54, using 7 cores)
8192K FFT, 4 cpu, 1 worker: 83.822 (6% faster than 78.83, using 8 cores)
8192K FFT, 4 cpu, max workers: 87.53 (6% faster than 82.33, using 8 cores)

So even though Ryzen has half-speed FMA, it still seems to be choking when using more than 6 cores -- either from a lack of memory bandwidth or from the lack of interleaving that comes with having half the ranks. I wish I had some single-rank DDR4-2133 to test how much of a difference the ranks make. I'll see if I can find some useful numbers in the benchmarks thread to make a single-rank comparison.

So the team red $/iter/sec sweet spot might be a Ryzen 1600, with 6 cores at 3.2 GHz. It's about the same price as an i5-7500 (dual rank DDR4-2400) or i5-7400 (single rank DDR4-2400), which are the current sweet spots with team blue.
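To put rough numbers on the cache argument -- this counts the raw FFT data only; prime95's real working set with sin/cos tables and scratch areas is somewhat larger, so treat it as a back-of-the-envelope sketch:

Code:
# Raw data footprint: fft_length doubles at 8 bytes each (ignores tables/scratch).
L3_PER_CCX_MB = 8     # Ryzen 1800X: 8 MB per CCX, 16 MB total
L3_RYZEN_MB = 16
L3_I5_6600_MB = 6

for fft_k in (1024, 2048, 2560, 4096, 8192):
    data_mb = fft_k * 1024 * 8 / 2**20
    print(f"{fft_k}K FFT: ~{data_mb:.0f} MB data | "
          f"<= one CCX: {data_mb <= L3_PER_CCX_MB} | "
          f"<= Ryzen L3: {data_mb <= L3_RYZEN_MB} | "
          f"<= i5-6600 L3: {data_mb <= L3_I5_6600_MB}")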
#148 |
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/
2⁴·199 Posts
Fred posted some numbers of an i5-6500 with what appears to be single rank DDR4-2133. With four-core turbo, that CPU also runs at 3.3 GHz, and we can compare the single versus dual rank timings for 4096K:
4096K FFT, 4 cpu, 1 worker, single rank: 156.97 (9% faster than 144.58 for 8 core Ryzen)
4096K FFT, 4 cpu, 4 workers, single rank: 159.00 (10% faster than 144.14 for 8 core Ryzen)

My 1 worker timing of 186.22 is 19% higher, while my 4 worker timing of 172.09 is 8% higher. This shows dual rank matters. The open question is how much dual rank will help Ryzen. The Ryzen system has more memory bandwidth, but something is still holding it back, especially in light of Ryzen pulling well ahead with the tiny 1024K FFT.
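The rank comparison boils down to plain arithmetic on those 4096K numbers; a tiny sketch:

Code:
# 4096K FFT throughput (iter/sec) on two Skylake i5s at ~3.3 GHz.
single_rank = {"1 worker": 156.97, "4 workers": 159.00}   # Fred's i5-6500, single-rank DDR4-2133
dual_rank   = {"1 worker": 186.22, "4 workers": 172.09}   # my i5-6600, dual-rank DDR4-2133

for cfg in single_rank:
    gain = (dual_rank[cfg] / single_rank[cfg] - 1) * 100
    print(f"4096K FFT, 4 cpu, {cfg}: dual rank is {gain:.0f}% faster")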
#149 |
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
3×29×83 Posts |
So to summarize: so far, with non-optimal P95 code, Ryzen appears to be ~10% slower than comparable Intel chips? IOW, there's still a lot of research and optimization to be done to figure out the long-term ability of this chip?
#150 |
"Carlos Pinho"
Oct 2011
Milton Keynes, UK
2·5·11·47 Posts |
What's the power consumption on both processors whilst doing the same type of work at full CPU occupancy? What's the overall investment for each type of machine?
#151 |
Aug 2010
Republic of Belarus
2×89 Posts |
db597, could you please run a benchmark for a 100M exponent? Thank you in advance.
#152 | |
Jan 2003
2·103 Posts |
BTW, I believe Ryzen processors no longer have 3DNow! - this instruction set has been retired. Yet I see in the results file that P95 still thinks Ryzen supports it: "CPU features: 3DNow! Prefetch, SSE, SSE2, SSE4, AVX, AVX2, FMA".
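For what it's worth, on Linux you can check what the kernel itself reports. A quick sketch (the flag names are the standard /proc/cpuinfo ones): Ryzen should list 3dnowprefetch -- the PREFETCH/PREFETCHW instructions survived -- but not 3dnow or 3dnowext, which went away with the rest of 3DNow!.

Code:
# Read the CPU feature flags the kernel exposes and check a few of interest.
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break

for flag in ("3dnow", "3dnowext", "3dnowprefetch", "sse4_1", "sse4_2", "avx", "avx2", "fma"):
    print(f"{flag:15s} {'yes' if flag in flags else 'no'}")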
Quote:
#153 | |
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/
2⁴·199 Posts
Quote:
If there's a way to do more work purely in cache, it would likely benefit both platforms.

Last fiddled with by Mark Rose on 2017-04-12 at 13:44
#154 | |
Aug 2010
Republic of Belarus
2·89 Posts |
Quote:
Just go to Advanced/Time (in the main menu), then type 332220523 in the "exponent to time" field and click OK. Then just wait a few minutes for it to finish. Thanks a lot.
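Once you have the ms/iter figure, projecting a full test is simple arithmetic (my own estimate, not prime95 output): an LL test of exponent p needs roughly p squaring iterations, so the total time is about p times the per-iteration time. A sketch, with a placeholder timing:

Code:
p = 332_220_523        # the exponent suggested above
ms_per_iter = 100.0    # placeholder: use whatever Advanced/Time reports

seconds = p * ms_per_iter / 1000.0
print(f"~{seconds / 86400:.0f} days (~{seconds / (86400 * 365):.1f} years) for a full LL test")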