mersenneforum.org 29.2 benchmark help

2017-05-02, 19:53   #45
tului

Jan 2013

2²·17 Posts

Redone. I think this is correct
Attached Files
 results-1800X-3800-ddr4-2666-CL14-16-16-34.txt (320.9 KB, 52 views)

2017-05-02, 19:58   #46
Mark Rose

"/X\(‘-‘)/X\"
Jan 2013

B73₁₆ Posts

Quote:
 Originally Posted by tului Redone. I think this is correct
Yes

2017-05-03, 16:53   #47
tului

Jan 2013

1000100₂ Posts

Based off this and any other results you've gotten, how does Ryzen look? How does whatever info was sought in asking for these benchmarks look?

2017-05-03, 19:27   #48
Mark Rose

"/X\(‘-‘)/X\"
Jan 2013

3·977 Posts

Quote:
 Originally Posted by tului Based off this and any other results you've gotten, how does Ryzen look? How does whatever info was sought in asking for these benchmarks look?
Well you can compare the default FFT selection from your first attachment:

Code:
Timings for 4096K FFT length (8 cpus, 1 worker):  4.86 ms.  Throughput: 205.61 iter/sec.
Timings for 4096K FFT length (8 cpus, 8 workers): 37.56, 36.45, 36.65, 36.25, 36.48, 36.65, 36.53, 36.65 ms.  Throughput: 218.29 iter/sec.
With the different FFT timings from your second attachment:

Code:
FFTlen=4096K, Type=3, Arch=4, Pass1=256, Pass2=16384, clm=4 (8 cpus, 1 worker):  4.90 ms.  Throughput: 203.88 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=256, Pass2=16384, clm=4 (8 cpus, 8 workers): 37.95, 36.61, 35.82, 38.02, 36.41, 37.05, 36.27, 36.03 ms.  Throughput: 217.66 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=256, Pass2=16384, clm=2 (8 cpus, 1 worker):  5.06 ms.  Throughput: 197.47 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=256, Pass2=16384, clm=2 (8 cpus, 8 workers): 39.23, 37.22, 36.70, 38.41, 37.80, 37.46, 37.29, 37.21 ms.  Throughput: 212.49 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=256, Pass2=16384, clm=1 (8 cpus, 1 worker):  5.13 ms.  Throughput: 194.91 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=256, Pass2=16384, clm=1 (8 cpus, 8 workers): 40.08, 39.17, 36.26, 39.46, 38.18, 38.19, 37.31, 36.94 ms.  Throughput: 209.65 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=512, Pass2=8192, clm=4 (8 cpus, 1 worker):  5.01 ms.  Throughput: 199.73 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=512, Pass2=8192, clm=4 (8 cpus, 8 workers): 37.44, 36.46, 35.77, 37.41, 36.46, 36.92, 35.97, 35.95 ms.  Throughput: 218.94 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=512, Pass2=8192, clm=2 (8 cpus, 1 worker):  4.90 ms.  Throughput: 204.19 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=512, Pass2=8192, clm=2 (8 cpus, 8 workers): 38.03, 36.80, 36.22, 37.90, 36.93, 36.73, 36.31, 36.27 ms.  Throughput: 216.89 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=512, Pass2=8192, clm=1 (8 cpus, 1 worker):  4.79 ms.  Throughput: 208.80 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=512, Pass2=8192, clm=1 (8 cpus, 8 workers): 36.71, 35.35, 34.96, 37.85, 35.48, 36.31, 35.87, 35.16 ms.  Throughput: 222.61 iter/sec.
[Tue May 02 09:05:18 2017]
FFTlen=4096K, Type=3, Arch=4, Pass1=1024, Pass2=4096, clm=4 (8 cpus, 1 worker):  4.71 ms.  Throughput: 212.12 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=1024, Pass2=4096, clm=4 (8 cpus, 8 workers): 37.63, 35.64, 35.12, 38.28, 35.58, 36.05, 35.87, 35.78 ms.  Throughput: 220.90 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=1024, Pass2=4096, clm=2 (8 cpus, 1 worker):  4.60 ms.  Throughput: 217.50 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=1024, Pass2=4096, clm=2 (8 cpus, 8 workers): 37.70, 36.17, 35.30, 37.66, 35.61, 36.05, 35.85, 36.24 ms.  Throughput: 220.37 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=1024, Pass2=4096, clm=1 (8 cpus, 1 worker):  5.19 ms.  Throughput: 192.52 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=1024, Pass2=4096, clm=1 (8 cpus, 8 workers): 42.05, 39.58, 39.01, 41.68, 40.45, 40.87, 39.89, 39.85 ms.  Throughput: 198.03 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=2048, Pass2=2048, clm=4 (8 cpus, 1 worker):  5.25 ms.  Throughput: 190.41 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=2048, Pass2=2048, clm=4 (8 cpus, 8 workers): 38.63, 36.86, 36.09, 38.24, 37.29, 37.31, 36.58, 36.25 ms.  Throughput: 215.41 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=2048, Pass2=2048, clm=2 (8 cpus, 1 worker):  4.92 ms.  Throughput: 203.31 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=2048, Pass2=2048, clm=2 (8 cpus, 8 workers): 38.41, 36.24, 35.54, 38.67, 35.96, 36.44, 36.39, 36.46 ms.  Throughput: 217.77 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=2048, Pass2=2048, clm=1 (8 cpus, 1 worker):  4.66 ms.  Throughput: 214.54 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=2048, Pass2=2048, clm=1 (8 cpus, 8 workers): 36.66, 34.82, 34.40, 37.50, 34.70, 35.39, 35.02, 35.12 ms.  Throughput: 225.84 iter/sec.
You can see the best throughput for 1 worker is 217.50 iter/sec, a 6% improvement over the default's 205.61 iter/sec for a 4096K FFT, and the best throughput for 8 workers is 225.84 iter/sec, a 3% improvement over the default's 218.29 iter/sec.
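For anyone who wants to repeat this comparison on their own results file, here is a minimal sketch. It assumes the log lines look exactly like the ones quoted above; the regular expression and the function names are mine, not part of prime95.

```python
import re

# Matches the per-implementation benchmark lines quoted above.
# The pattern is an assumption based on those lines, not a prime95 spec.
LINE_RE = re.compile(
    r"Pass1=(\d+), Pass2=(\d+), clm=(\d+) "
    r"\(\d+ cpus?, (\d+) workers?\):.*Throughput: ([\d.]+) iter/sec"
)

def best_by_workers(log_text):
    """Return {workers: (throughput, pass1, pass2, clm)} for the fastest line."""
    best = {}
    for match in LINE_RE.finditer(log_text):
        pass1, pass2, clm, workers = (int(g) for g in match.groups()[:4])
        throughput = float(match.group(5))
        if workers not in best or throughput > best[workers][0]:
            best[workers] = (throughput, pass1, pass2, clm)
    return best

def improvement_pct(new, old):
    """Percentage gain of `new` over `old` throughput."""
    return (new / old - 1.0) * 100.0

# Four of the lines from the second attachment, copied from above.
SAMPLE = """\
FFTlen=4096K, Type=3, Arch=4, Pass1=1024, Pass2=4096, clm=2 (8 cpus, 1 worker):  4.60 ms.  Throughput: 217.50 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=1024, Pass2=4096, clm=2 (8 cpus, 8 workers): 37.70, 36.17, 35.30, 37.66, 35.61, 36.05, 35.85, 36.24 ms.  Throughput: 220.37 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=2048, Pass2=2048, clm=1 (8 cpus, 1 worker):  4.66 ms.  Throughput: 214.54 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=2048, Pass2=2048, clm=1 (8 cpus, 8 workers): 36.66, 34.82, 34.40, 37.50, 34.70, 35.39, 35.02, 35.12 ms.  Throughput: 225.84 iter/sec.
"""

if __name__ == "__main__":
    for workers, (throughput, pass1, pass2, clm) in sorted(best_by_workers(SAMPLE).items()):
        print(workers, throughput, pass1, pass2, clm)
```

Feed it the whole results file instead of `SAMPLE` to get the fastest implementation per worker count, then pass the winner and the default figure to `improvement_pct`.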

There's no guarantee George will pick those particular kernels for the build. The 1 worker configuration that was fastest for you was also fastest for my i5-6600. The n-worker configurations were different though, with your Ryzen system preferring more balance between Pass1 and Pass2, while the i5 preferred a smaller Pass1 and a larger Pass2.

Your best result of 225.84 iter/sec was 33% faster than my i5-6600 at 3.3 GHz, which gives 169.78 iter/sec. Your DDR4-2666 has 25% more memory bandwidth than my DDR4-2133, and the stock 3.6 GHz clock of an 1800X is 9% faster than the 3.3 GHz I run my i5-6600 at, so your Ryzen chip is doing very well considering its FMA3 runs at half the speed of Skylake's.
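The three percentages can be checked with a few lines. This is only the arithmetic on the numbers quoted in this post; the helper name is made up for illustration.

```python
def pct_faster(a, b):
    """How much larger a is than b, in percent."""
    return (a / b - 1.0) * 100.0

# Figures taken from the post: throughput in iter/sec,
# memory speed in MT/s (DDR4-2666 vs DDR4-2133), clock in GHz.
throughput_gain = pct_faster(225.84, 169.78)
memory_gain = pct_faster(2666, 2133)
clock_gain = pct_faster(3.6, 3.3)

if __name__ == "__main__":
    print(round(throughput_gain), round(memory_gain), round(clock_gain))
```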

2017-05-03, 19:29   #49
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

3·5·499 Posts

Quote:
 Originally Posted by tului Based off this and any other results you've gotten, how does Ryzen look? How does whatever info was sought in asking for these benchmarks look?
I've not done an in-depth study of all the results. What I have found is that it is not at all uncommon for machines to have different optimal FFTs than prime95's default.

In most cases the difference is not great, but in some it can be significant. For example, the optimal 4M FFT on my Kaby Lake is the worst 4M FFT on Ryzen (150 iter/s vs 179 iter/s). That is a significant difference - enough for me to consider adding different default selections for Ryzen. I need to study more to see how often there is such a huge difference.

I've also found that if I just export all the FFT implementations that are within 10% of optimal on my Kaby Lake, then I'll still be excluding the best FFT implementation for some machines. Thus, there is no way to avoid serious bloat of the executable size given the present prime95 architecture....

So, I've started investigating how hard it would be to make pass 1 shared. That is, all FFT implementations that do 512 elements in pass 1 would call a common routine to do pass 1. If I do this, then I can include every possible FFT implementation without any bloat in executable size.

A little history: Old architectures would pay a significant penalty if they did a lot of address calculations during the FFT. Thus, prime95 favored using fixed displacements in pass 1, at the expense of making pass 1 unshareable. Also, the FFT code was written to work on both 32-bit and 64-bit machines, limiting me to using 8 registers.

My plan is to use the extra 8 registers available in 64-bit to greatly reduce the cost of the address calculations needed to "common-ize" pass 1 code. That, along with modern architectures not paying much penalty for doing some extra address calculations alongside FPU calculations, means I can make pass 1 share-able at almost zero runtime cost.
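To illustrate the idea of a shared pass 1, here is a rough sketch of a two-pass FFT in which one pass 1 routine serves every FFT length because the stride is a runtime argument rather than a fixed displacement baked into the code. This is not prime95 or gwnum code: the function names are hypothetical, it uses a slow textbook DFT for clarity, and it works on plain complex numbers rather than prime95's real-data FFTs.

```python
import cmath

def dft(values):
    """Textbook O(n^2) DFT, used here for both passes to keep the sketch short."""
    n = len(values)
    return [
        sum(values[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]

def shared_pass1(data, pass1_size, stride, offset):
    """One pass 1 routine for all FFT lengths.

    The addresses (offset + stride * i) are computed at run time, so the same
    code handles any pass 2 size. A fixed-displacement version would need a
    separate copy of this routine for every (pass1, pass2) combination.
    """
    column = [data[offset + stride * i] for i in range(pass1_size)]
    return dft(column)

def two_pass_fft(data, pass1_size):
    """Length-n FFT split as n = pass1_size * pass2_size (Cooley-Tukey)."""
    n = len(data)
    pass2_size = n // pass1_size
    # Pass 1: pass2_size strided DFTs of length pass1_size, then twiddle factors.
    rows = []
    for n2 in range(pass2_size):
        column = shared_pass1(data, pass1_size, pass2_size, n2)
        rows.append([
            column[k1] * cmath.exp(-2j * cmath.pi * n2 * k1 / n)
            for k1 in range(pass1_size)
        ])
    # Pass 2: for each pass 1 output index, a DFT of length pass2_size.
    out = [0j] * n
    for k1 in range(pass1_size):
        spectrum = dft([rows[n2][k1] for n2 in range(pass2_size)])
        for k2 in range(pass2_size):
            out[k1 + pass1_size * k2] = spectrum[k2]
    return out
```

Calling `two_pass_fft(data, 4)` and `two_pass_fft(data, 8)` on the same input goes through the same `shared_pass1` code with a different stride, which is the property that would keep the executable from growing with every added implementation.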

After this is all done, I'll look at the different benchies sent in and use them to change prime95's defaults to be the best one for most CPUs rather than just my Kaby Lake.

In summary:
1) 64-bit prime95 will export all FFT implementations for AVX and AVX2 architectures, making the new find-the-fastest-FFT-implementation feature quite useful.
2) 32-bit prime95 will not have common pass 1 making the new feature pretty useless. But then again, who cares about 32-bit users?
3) There will be new default FFT implementations using a blend of benchmarks, with perhaps a separate selection for Ryzen.

Time frame? Won't be quick.

2017-05-03, 19:40   #50
GP2

Sep 2003

3·863 Posts

Quote:
 Originally Posted by Prime95 Thus, there is no way to avoid serious bloat of the executable size given the present prime95 architecture....
Is executable size bloat truly an issue? We have terabytes of disk space and gigabytes of RAM. Presumably any unneeded code will get paged out of physical memory if memory gets tight. Maybe as an interim solution it would be OK to let the executable get bloated, or at least offer a bloated executable as an optional alternative download?

2017-05-03, 20:46   #51
tului

Jan 2013

68₁₀ Posts

Quote:
 Originally Posted by GP2 Is executable size bloat truly an issue? We have terabytes of disk space and gigabytes of RAM. Presumably any unneeded code will get paged out of physical memory if memory gets tight. Maybe as an interim solution it would be OK to let the executable get bloated, or at least offer a bloated executable as an optional alternative download?
This was my thought. I mean even the test .exe at fifty some megabytes was inconsequential.

If you need any more Ryzen testing let me know.

2017-05-03, 21:04   #52
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

1D3D₁₆ Posts

Quote:
 Originally Posted by GP2 Is executable size bloat truly an issue?
A little. The test executable was 50ish MB, but did not include all implementations for all-complex FFTs, Core/Core2 (AVX, but not AVX2), AVX512. Do all of those and we could be pushing 200MB. Deal breaker? Probably not.

The best reason for doing this is it would allow me to write new pass 1s for niche markets. Such as non-prefetching FFTs for Xeons with huge caches, or maybe a Ryzen-specific version that uses prefetchw (that's what the K8 and K10 specific versions do). Or perhaps add more pass1 sizes.

Besides, it's just "cleaner". No one will look at the code and wonder "what was that guy thinking?"

2017-05-04, 06:11   #53
Dubslow
Basketry That Evening!

"Bunslow the Bold"
Jun 2011

40

2017-05-04, 08:46   #54
henryzz
Just call me Henry

"David"
Sep 2007
Cambridge (GMT/BST)

2×5×587 Posts

Quote:
 Originally Posted by Dubslow 200 MB is no small thing. We may have disks more than an order of magnitude larger than that, but one thing we don't have is the internet bandwidth to match (at least not in the US for the very, very large majority of users). Maybe it would be an acceptable stopgap, but since George has an alternate solution that should be the long term goal
4 minutes for me at 7 mbits/sec
TBH if you have internet that slow you get used to it. I would suggest a lite version as well if you go that big.
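The 4 minute figure is easy to reproduce. A quick sketch, assuming 1 MB = 8 megabits and ignoring protocol overhead (the function name is just for illustration):

```python
def download_minutes(size_megabytes, link_megabits_per_sec):
    """Rough download time in minutes, ignoring protocol overhead."""
    megabits = size_megabytes * 8
    return megabits / link_megabits_per_sec / 60

if __name__ == "__main__":
    # 200 MB executable over a 7 Mbit/s link.
    print(round(download_minutes(200, 7), 1))
```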

2017-05-04, 09:18   #55
mackerel

Feb 2016
UK

419 Posts

Quote:
 Originally Posted by Prime95
2) 32-bit prime95 will not have common pass 1 making the new feature pretty useless. But then again, who cares about 32-bit users?
3) There will be new default FFT implementations using a blend of benchmarks, with perhaps a separate selection for Ryzen.
Random thoughts:

I assume this will eventually affect gwnum, and thus other software that uses it. With that in mind, is there some point where you need to draw a line and say anything older than some point becomes unsupported? Keep a legacy version available for those that need it, and have a more modern version looking forwards?

On Ryzen specifically, is there any significant potential advantage to be gained if its architecture was considered specifically?
