mersenneforum.org  

Old 2017-04-11, 16:19   #144
Mark Rose
 
"/X\(‘-‘)/X\"
Jan 2013

101101101011₂ Posts

I see the crossover point where multiple workers overtake a single multithreaded worker is around 4096K. That's similar to what I see with my i5-6600 systems, which have only 6M of L3 versus the 16M the Ryzen has.

What happens if you benchmark using two workers? I wonder how that will affect performance with the split L3 cache in Ryzen.
Old 2017-04-11, 17:03   #145
db597
 
Jan 2003

7×29 Posts
Ryzen 1700 benchmark results

Here's the full matrix of results for the various core / worker combinations. It takes a long time to run, so I've only tested the 1024K and 8192K FFTs:

1024K FFT results:

Code:
Timings for 1024K FFT length (1 cpu, 1 worker):  7.87 ms.  Throughput: 127.02 iter/sec.
Timings for 1024K FFT length (2 cpus, 1 worker):  4.02 ms.  Throughput: 249.01 iter/sec.
Timings for 1024K FFT length (2 cpus, 2 workers):  7.88,  7.83 ms.  Throughput: 254.57 iter/sec.
Timings for 1024K FFT length (3 cpus, 1 worker):  2.69 ms.  Throughput: 371.47 iter/sec.
Timings for 1024K FFT length (3 cpus, 2 workers):  4.04,  7.88 ms.  Throughput: 374.67 iter/sec.
Timings for 1024K FFT length (3 cpus, 3 workers):  7.88,  7.90,  7.90 ms.  Throughput: 380.02 iter/sec.
Timings for 1024K FFT length (4 cpus, 1 worker):  2.06 ms.  Throughput: 484.73 iter/sec.
Timings for 1024K FFT length (4 cpus, 2 workers):  4.12,  4.12 ms.  Throughput: 485.85 iter/sec.
Timings for 1024K FFT length (4 cpus, 3 workers):  4.11,  8.03,  8.02 ms.  Throughput: 492.51 iter/sec.
Timings for 1024K FFT length (4 cpus, 4 workers):  8.13,  7.93,  8.04,  7.94 ms.  Throughput: 499.35 iter/sec.
Timings for 1024K FFT length (5 cpus, 1 worker):  1.72 ms.  Throughput: 580.39 iter/sec.
Timings for 1024K FFT length (5 cpus, 2 workers):  2.75,  4.42 ms.  Throughput: 589.95 iter/sec.
Timings for 1024K FFT length (5 cpus, 3 workers):  4.13,  4.15,  7.91 ms.  Throughput: 609.18 iter/sec.
Timings for 1024K FFT length (5 cpus, 4 workers):  4.22,  8.13,  8.02,  7.80 ms.  Throughput: 612.66 iter/sec.
Timings for 1024K FFT length (5 cpus, 5 workers):  8.16,  8.22,  8.19,  8.17,  7.79 ms.  Throughput: 616.98 iter/sec.
Timings for 1024K FFT length (6 cpus, 1 worker):  1.47 ms.  Throughput: 682.28 iter/sec.
Timings for 1024K FFT length (6 cpus, 2 workers):  2.74,  2.93 ms.  Throughput: 705.98 iter/sec.
Timings for 1024K FFT length (6 cpus, 3 workers):  4.18,  4.20,  4.13 ms.  Throughput: 719.08 iter/sec.
Timings for 1024K FFT length (6 cpus, 4 workers):  4.50,  4.44,  8.31,  8.31 ms.  Throughput: 688.14 iter/sec.
Timings for 1024K FFT length (6 cpus, 5 workers):  4.51,  8.74,  8.78,  8.44,  8.44 ms.  Throughput: 687.33 iter/sec.
Timings for 1024K FFT length (6 cpus, 6 workers):  9.02,  9.02,  8.87,  8.88,  8.54,  8.54 ms.  Throughput: 681.16 iter/sec.
Timings for 1024K FFT length (7 cpus, 1 worker):  1.28 ms.  Throughput: 779.77 iter/sec.
Timings for 1024K FFT length (7 cpus, 2 workers):  2.06,  2.73 ms.  Throughput: 853.00 iter/sec.
Timings for 1024K FFT length (7 cpus, 3 workers):  2.90,  4.86,  4.32 ms.  Throughput: 782.57 iter/sec.
Timings for 1024K FFT length (7 cpus, 4 workers):  4.80,  4.80,  4.56,  8.96 ms.  Throughput: 747.62 iter/sec.
Timings for 1024K FFT length (7 cpus, 5 workers):  4.95,  4.94,  9.19,  9.07,  9.25 ms.  Throughput: 731.95 iter/sec.
Timings for 1024K FFT length (7 cpus, 6 workers):  5.06,  9.69,  9.69,  9.31,  9.44,  9.50 ms.  Throughput: 722.76 iter/sec.
Timings for 1024K FFT length (7 cpus, 7 workers): 10.13, 10.09,  9.94,  9.97,  9.57,  9.65,  9.63 ms.  Throughput: 710.74 iter/sec.
Timings for 1024K FFT length (8 cpus, 1 worker):  1.13 ms.  Throughput: 884.05 iter/sec.
Timings for 1024K FFT length (8 cpus, 2 workers):  2.63,  2.62 ms.  Throughput: 761.75 iter/sec.
Timings for 1024K FFT length (8 cpus, 3 workers):  2.95,  3.34,  4.52 ms.  Throughput: 860.18 iter/sec.
Timings for 1024K FFT length (8 cpus, 4 workers):  5.35,  5.35,  5.35,  5.35 ms.  Throughput: 747.81 iter/sec.
Timings for 1024K FFT length (8 cpus, 5 workers):  5.43,  5.34,  5.43, 10.30, 10.17 ms.  Throughput: 750.93 iter/sec.
Timings for 1024K FFT length (8 cpus, 6 workers):  5.44,  5.43, 10.74, 10.71, 10.71, 10.60 ms.  Throughput: 742.24 iter/sec.
Timings for 1024K FFT length (8 cpus, 7 workers):  5.62, 10.58, 10.95, 11.15, 11.10, 11.10, 11.05 ms.  Throughput: 724.08 iter/sec.
Timings for 1024K FFT length (8 cpus, 8 workers): 11.27, 11.30, 11.09, 11.41, 11.35, 11.26, 11.13, 11.13 ms.  Throughput: 711.56 iter/sec.

8192K FFT results:

Code:
Timings for 8192K FFT length (1 cpu, 1 worker): 68.57 ms.  Throughput: 14.58 iter/sec.
Timings for 8192K FFT length (2 cpus, 1 worker): 35.11 ms.  Throughput: 28.48 iter/sec.
Timings for 8192K FFT length (2 cpus, 2 workers): 68.42, 68.64 ms.  Throughput: 29.18 iter/sec.
Timings for 8192K FFT length (3 cpus, 1 worker): 23.51 ms.  Throughput: 42.54 iter/sec.
Timings for 8192K FFT length (3 cpus, 2 workers): 35.24, 68.68 ms.  Throughput: 42.94 iter/sec.
Timings for 8192K FFT length (3 cpus, 3 workers): 69.41, 69.41, 68.65 ms.  Throughput: 43.38 iter/sec.
Timings for 8192K FFT length (4 cpus, 1 worker): 18.12 ms.  Throughput: 55.18 iter/sec.
Timings for 8192K FFT length (4 cpus, 2 workers): 36.31, 35.78 ms.  Throughput: 55.49 iter/sec.
Timings for 8192K FFT length (4 cpus, 3 workers): 36.17, 70.14, 70.68 ms.  Throughput: 56.05 iter/sec.
Timings for 8192K FFT length (4 cpus, 4 workers): 71.98, 70.70, 70.39, 71.11 ms.  Throughput: 56.31 iter/sec.
Timings for 8192K FFT length (5 cpus, 1 worker): 15.44 ms.  Throughput: 64.78 iter/sec.
Timings for 8192K FFT length (5 cpus, 2 workers): 25.09, 39.87 ms.  Throughput: 64.94 iter/sec.
Timings for 8192K FFT length (5 cpus, 3 workers): 37.76, 37.42, 72.10 ms.  Throughput: 67.08 iter/sec.
Timings for 8192K FFT length (5 cpus, 4 workers): 37.82, 73.26, 73.97, 72.15 ms.  Throughput: 67.47 iter/sec.
Timings for 8192K FFT length (5 cpus, 5 workers): 74.51, 74.57, 74.66, 73.65, 72.49 ms.  Throughput: 67.60 iter/sec.
Timings for 8192K FFT length (6 cpus, 1 worker): 13.89 ms.  Throughput: 72.01 iter/sec.
Timings for 8192K FFT length (6 cpus, 2 workers): 26.75, 27.58 ms.  Throughput: 73.64 iter/sec.
Timings for 8192K FFT length (6 cpus, 3 workers): 41.36, 39.80, 38.77 ms.  Throughput: 75.10 iter/sec.
Timings for 8192K FFT length (6 cpus, 4 workers): 40.86, 40.11, 76.65, 76.78 ms.  Throughput: 75.48 iter/sec.
Timings for 8192K FFT length (6 cpus, 5 workers): 40.51, 79.77, 79.87, 76.74, 77.09 ms.  Throughput: 75.74 iter/sec.
Timings for 8192K FFT length (6 cpus, 6 workers): 80.98, 80.93, 80.56, 80.90, 76.80, 76.97 ms.  Throughput: 75.49 iter/sec.
Timings for 8192K FFT length (7 cpus, 1 worker): 13.08 ms.  Throughput: 76.46 iter/sec.
Timings for 8192K FFT length (7 cpus, 2 workers): 22.33, 28.45 ms.  Throughput: 79.94 iter/sec.
Timings for 8192K FFT length (7 cpus, 3 workers): 29.86, 46.68, 42.07 ms.  Throughput: 78.68 iter/sec.
Timings for 8192K FFT length (7 cpus, 4 workers): 44.69, 43.87, 42.54, 84.34 ms.  Throughput: 80.54 iter/sec.
Timings for 8192K FFT length (7 cpus, 5 workers): 44.72, 44.22, 84.17, 84.02, 84.96 ms.  Throughput: 80.53 iter/sec.
Timings for 8192K FFT length (7 cpus, 6 workers): 45.27, 87.30, 87.45, 84.04, 84.19, 85.19 ms.  Throughput: 80.50 iter/sec.
Timings for 8192K FFT length (7 cpus, 7 workers): 88.28, 89.04, 88.03, 88.42, 84.05, 85.19, 85.67 ms.  Throughput: 80.54 iter/sec.
Timings for 8192K FFT length (8 cpus, 1 worker): 12.65 ms.  Throughput: 79.04 iter/sec.
Timings for 8192K FFT length (8 cpus, 2 workers): 24.54, 24.30 ms.  Throughput: 81.90 iter/sec.
Timings for 8192K FFT length (8 cpus, 3 workers): 31.89, 34.01, 48.59 ms.  Throughput: 81.34 iter/sec.
Timings for 8192K FFT length (8 cpus, 4 workers): 49.14, 49.03, 49.05, 49.04 ms.  Throughput: 81.52 iter/sec.
Timings for 8192K FFT length (8 cpus, 5 workers): 49.10, 48.61, 48.66, 95.25, 94.38 ms.  Throughput: 82.58 iter/sec.
Timings for 8192K FFT length (8 cpus, 6 workers): 49.30, 49.26, 97.05, 96.12, 96.48, 95.65 ms.  Throughput: 82.11 iter/sec.
Timings for 8192K FFT length (8 cpus, 7 workers): 50.30, 94.02, 97.53, 96.59, 96.13, 96.87, 96.09 ms.  Throughput: 82.25 iter/sec.
Timings for 8192K FFT length (8 cpus, 8 workers): 96.94, 98.16, 96.04, 98.54, 97.29, 97.34, 96.83, 96.22 ms.  Throughput: 82.33 iter/sec.
There seems to be a sudden jump in timings for the first result between [N cpus, N - 1 workers] and [N cpus, N workers]. This seems to happen for all cases of "N", and it's not clear to me why it should happen.
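
One guess: the per-worker timings look consistent with Prime95 splitting the cores as evenly as possible across workers and handing any leftover cores to the earliest workers. At [N cpus, N - 1 workers] the first worker would get 2 cores; at [N cpus, N workers] every worker gets 1, so the first timing roughly doubles. A quick sketch of that model (my assumption about the allocation, not confirmed Prime95 behaviour):

Code:
# Python sketch: assume N cpus are split as evenly as possible among
# W workers, leftover cores going to the earliest workers, and that a
# worker's time is the single-core time divided by its core count.
def predicted_timings(cpus, workers, single_core_ms):
    base, extra = divmod(cpus, workers)
    cores = [base + 1 if i < extra else base for i in range(workers)]
    return [round(single_core_ms / c, 2) for c in cores]

# 1024K FFT, single-core time 7.87 ms (from the table above):
print(predicted_timings(5, 4, 7.87))  # ~[3.94, 7.87, 7.87, 7.87]; measured: 4.22, 8.13, 8.02, 7.80
print(predicted_timings(5, 5, 7.87))  # ~[7.87] x 5; measured: ~8.2 each -- hence the jump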

Old 2017-04-11, 23:05   #146
nordi
 
Dec 2016

2·3²·5 Posts

Quote:
Originally Posted by db597
Here's the full matrix of results for the various core / worker combinations.
Thanks for the data, those numbers are very interesting. I took a closer look at the 8192K FFT results because I'm interested in how well it scales, i.e. how much extra performance you get with each additional core.

core #1 yields 14.58 iters/sec
core #2 yields 14.60 iters/sec
core #3 yields 14.20 iters/sec
core #4 yields 12.93 iters/sec
core #5 yields 11.29 iters/sec
core #6 yields 8.14 iters/sec
core #7 yields 4.80 iters/sec
core #8 yields 1.79 iters/sec

So the first 4 cores scale almost linearly, by 6 cores you're already losing some per-core performance, and going to 8 cores adds virtually nothing.

The 1024K FFT has a strange behavior when going from 6 to 7 cores:

core #1 yields 127.02 iters/sec
core #2 yields 127.55 iters/sec
core #3 yields 125.45 iters/sec
core #4 yields 119.33 iters/sec
core #5 yields 117.63 iters/sec
core #6 yields 102.10 iters/sec
core #7 yields 133.92 iters/sec
core #8 yields 31.05 iters/sec

Generally, it scales better than the 8192K benchmark, but also hits a (memory bandwidth?) bottleneck eventually.
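
For reference, the per-core increments above are just differences between the best throughput at each cpu count in db597's tables. A throwaway script along these lines (bench.txt being a hypothetical paste of one FFT size's benchmark lines) reproduces them:

Code:
# Python sketch: parse Prime95 benchmark lines for one FFT size and
# print the marginal throughput added by each extra core.
import re

pattern = re.compile(r"\((\d+) cpus?, \d+ workers?\).*?Throughput:\s*([\d.]+)")
best = {}  # best throughput seen for each cpu count
with open("bench.txt") as f:  # hypothetical file holding the quoted lines
    for line in f:
        m = pattern.search(line)
        if m:
            cpus, thr = int(m.group(1)), float(m.group(2))
            best[cpus] = max(best.get(cpus, 0.0), thr)

for n in sorted(best):
    print(f"core #{n} yields {best[n] - best.get(n - 1, 0.0):.2f} iters/sec")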
Old 2017-04-12, 00:18   #147
Mark Rose
 
"/X\(‘-‘)/X\"
Jan 2013

37×79 Posts

It could also be the CCX bandwidth and split L3 causing weirdness with the 1024K FFT. A 1024K FFT will consume about 8 MB (1024K double-precision values at 8 bytes each), which will almost fit in a CCX's 8 MB of L3 cache.

Clock-for-clock, the 8 core Ryzen (with half speed FMA) is faster with one worker for FFT sizes up to 2048K than my i5-6600 @ 3.3 GHz with dual-channel dual-rank DDR4-2133 (i5 throughput listed first, the Ryzen figure in parentheses):

1024K FFT, 4 cpu, 1 worker: 840.336 (5% slower than 886.42)
2048K FFT, 4 cpu, 1 worker: 371.747 (11% slower than 418.60)
2560K FFT, 4 cpu, 1 worker: 298.507 (18% faster than 252.38)
4096K FFT, 4 cpu, 1 worker: 186.220 (29% faster than 144.58)
8192K FFT, 4 cpu, 1 worker: 83.822 (6% faster than 78.83)

An FFT up to 2048K is mostly going to fit in Ryzen's 16MB of L3 cache (2048K × 8 bytes = 16 MB), which may explain why a single worker runs faster than on the i5. The i5 has only 6MB of L3, which won't hold even a 1024K FFT.

1024K FFT, 4 cpu, 4 workers: 776.38 (9% faster than 711.26)
2048K FFT, 4 cpu, 4 workers: 367.50 (4% faster than 352.17)
2560K FFT, 4 cpu, 4 workers: 293.34 (25% faster than 234.66)
4096K FFT, 4 cpu, 4 workers: 172.09 (19% faster than 144.14)
8192K FFT, 4 cpu, 4 workers: 87.53 (6% faster than 82.33)
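
The percentages in parentheses are plain throughput ratios; something like this trivial helper (my own sketch) reproduces them:

Code:
# Python sketch: percentage by which throughput a leads (or trails) b,
# matching the parenthesised figures above.
def compare(a, b):
    pct = (a / b - 1.0) * 100.0
    return f"{a} ({abs(pct):.0f}% {'faster' if pct >= 0 else 'slower'} than {b})"

print(compare(840.336, 886.42))  # 1024K, 4 cpu, 1 worker: 5% slower
print(compare(186.220, 144.58))  # 4096K, 4 cpu, 1 worker: 29% faster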

We can also look at how much benefit each additional Ryzen core provides versus the i5, to show where the bottleneck appears:

8192K FFT, 4 cpu, 1 worker: 83.822 (52% faster than 55.18, using 4 cores)
8192K FFT, 4 cpu, max workers: 87.53 (55% faster than 56.31, using 4 cores)

8192K FFT, 4 cpu, 1 worker: 83.822 (29% faster than 64.78, using 5 cores)
8192K FFT, 4 cpu, max workers: 87.53 (29% faster than 67.60, using 5 cores)

8192K FFT, 4 cpu, 1 worker: 83.822 (16% faster than 72.01, using 6 cores)
8192K FFT, 4 cpu, max workers: 87.53 (16% faster than 75.49, using 6 cores)

8192K FFT, 4 cpu, 1 worker: 83.822 (10% faster than 76.46, using 7 cores)
8192K FFT, 4 cpu, max workers: 87.53 (9% faster than 80.54, using 7 cores)

8192K FFT, 4 cpu, 1 worker: 83.822 (6% faster than 78.83, using 8 cores)
8192K FFT, 4 cpu, max workers: 87.53 (6% faster than 82.33, using 8 cores)

So even though Ryzen has half-speed FMA, it still seems to be choking on memory when using more than 6 cores -- either on raw bandwidth or on the lost interleaving from having half the ranks.

I wish I had some single-rank DDR4-2133 to test how much of a difference the ranks make. I'll see if I can find some useful numbers in the benchmarks thread to make a single-rank comparison.

So the team red $/iter/sec sweet spot might be a Ryzen 1600, with 6 cores at 3.2 GHz. It's about the same price as an i5-7500 (dual rank DDR4-2400) or i5-7400 (single rank DDR4-2400), which are the current sweet spots with team blue.
Old 2017-04-12, 04:32   #148
Mark Rose
 
"/X\(‘-‘)/X\"
Jan 2013

37×79 Posts

Fred posted some numbers for an i5-6500 with what appears to be single-rank DDR4-2133. With four-core turbo, that CPU also runs at 3.3 GHz, so we can compare the single- versus dual-rank timings for 4096K:

4096K FFT, 4 cpu, 1 worker, single rank: 156.97 (9% faster than 144.58 for 8 core Ryzen)
4096K FFT, 4 cpu, 4 workers, single rank: 159.00 (10% faster than 144.14 for 8 core Ryzen)

My 1 worker timing of 186.22 is 19% higher, while my 4 worker timing of 172.09 is 8% higher. This shows dual rank matters. The open question is how much dual rank will help Ryzen. The Ryzen system has more memory bandwidth, but something is still holding it back, especially in light of Ryzen pulling well ahead with the tiny 1024K FFT.
Old 2017-04-12, 07:42   #149
Dubslow
Basketry That Evening!
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3×29×83 Posts

So, to summarize, so far with non-optimal P95 code, Ryzen appears to be ~10% slower than comparable Intel chips? IOW, a lot of research and optimization still to be done to figure out the long term ability of this chip?
Old 2017-04-12, 10:15   #150
pinhodecarlos
 
"Carlos Pinho"
Oct 2011
Milton Keynes, UK

3×1,663 Posts

Quote:
Originally Posted by Dubslow
So, to summarize, so far with non-optimal P95 code, Ryzen appears to be ~10% slower than comparable Intel chips? IOW, a lot of research and optimization still to be done to figure out the long term ability of this chip?
What's the power consumption on both processors whilst doing the same type of work at full CPU occupancy? What's the overall investment for each type of machine?
Old 2017-04-12, 10:29   #151
Lorenzo
 
Aug 2010
Republic of Belarus

2·89 Posts

db597, could you please run a benchmark for a 100M exponent? Thank you in advance.
Old 2017-04-12, 12:49   #152
db597
 
Jan 2003

7·29 Posts

BTW, I believe Ryzen processors no longer have 3DNow! - that instruction set has been retired. Yet I see in the results file that P95 still thinks Ryzen supports it: "CPU features: 3DNow! Prefetch, SSE, SSE2, SSE4, AVX, AVX2, FMA".
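
(Though "3DNow! Prefetch" may actually be the separate PREFETCH/PREFETCHW feature bit -- CPUID leaf 0x80000001, ECX bit 8 -- which Ryzen does keep, rather than the retired 3DNow! vector set itself in EDX bit 31. A quick way to check the two bits separately, using the third-party py-cpuinfo package -- an illustrative sketch, not how P95 detects features:)

Code:
# Python sketch using the third-party py-cpuinfo package
# (pip install py-cpuinfo); illustrative only, not P95's detection code.
import cpuinfo

flags = set(cpuinfo.get_cpu_info().get("flags", []))
print("3dnow         :", "3dnow" in flags)          # retired 3DNow! vector ISA
print("3dnowprefetch :", "3dnowprefetch" in flags)  # PREFETCH/PREFETCHW only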

Quote:
Originally Posted by Lorenzo
db597, could you please run a benchmark for a 100M exponent? Thank you in advance.
I'd be happy to run any benchmarks you need, but could you please guide me through how to set up a benchmark for a 100M exponent? I see Throughput / FFT timings / Trial Factoring in the drop-down, and none of them lets me specify the size of the exponent, only the FFT size.
Old 2017-04-12, 13:21   #153
Mark Rose
 
"/X\(‘-‘)/X\"
Jan 2013

37×79 Posts

Quote:
Originally Posted by Dubslow
So, to summarize, so far with non-optimal P95 code, Ryzen appears to be ~10% slower than comparable Intel chips? IOW, a lot of research and optimization still to be done to figure out the long term ability of this chip?
From what I can tell, it's entirely a memory issue. With a small FFT that fits in cache, two Ryzen cores are faster than an Intel core. Intel is also severely memory limited.

If there's a way to do more work purely in cache it would likely benefit both platforms.

Old 2017-04-12, 13:35   #154
Lorenzo
 
Aug 2010
Republic of Belarus

2×89 Posts

Quote:
Originally Posted by db597
BTW, I believe Ryzen processors no longer have 3DNow! - that instruction set has been retired. Yet I see in the results file that P95 still thinks Ryzen supports it: "CPU features: 3DNow! Prefetch, SSE, SSE2, SSE4, AVX, AVX2, FMA".

I'd be happy to run any benchmarks you need, but could you please guide me through how to set up a benchmark for a 100M exponent? I see Throughput / FFT timings / Trial Factoring in the drop-down, and none of them lets me specify the size of the exponent, only the FFT size.
It's very simple. Just go to Advanced/Time (in the main menu), type 332220523 in the "exponent to time" field, and click OK. Then just wait a few minutes for it to finish. Thanks a lot!