mersenneforum.org  

2017-05-02, 19:53   #45
tului
Jan 2013
44₁₆ Posts

Redone. I think this is correct
Attached Files
File Type: txt results-1800X-3800-ddr4-2666-CL14-16-16-34.txt (320.9 KB, 39 views)

2017-05-02, 19:58   #46
Mark Rose
"/X\(‘-‘)/X\"
Jan 2013
2×31×47 Posts

Quote:
Originally Posted by tului
Redone. I think this is correct

Yes

2017-05-03, 16:53   #47
tului
Jan 2013
2²·17 Posts

Based off this and any other results you've gotten, how does Ryzen look? How does whatever info was sought in asking for these benchmarks look?

2017-05-03, 19:27   #48
Mark Rose
"/X\(‘-‘)/X\"
Jan 2013
2×31×47 Posts

Quote:
Originally Posted by tului
Based off this and any other results you've gotten, how does Ryzen look? How does whatever info was sought in asking for these benchmarks look?

Well you can compare the default FFT selection from your first attachment:

Code:
Timings for 4096K FFT length (8 cpus, 1 worker):  4.86 ms.  Throughput: 205.61 iter/sec.
Timings for 4096K FFT length (8 cpus, 8 workers): 37.56, 36.45, 36.65, 36.25, 36.48, 36.65, 36.53, 36.65 ms.  Throughput: 218.29 iter/sec.
With the different FFT timings from your second attachment:

Code:
FFTlen=4096K, Type=3, Arch=4, Pass1=256, Pass2=16384, clm=4 (8 cpus, 1 worker):  4.90 ms.  Throughput: 203.88 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=256, Pass2=16384, clm=4 (8 cpus, 8 workers): 37.95, 36.61, 35.82, 38.02, 36.41, 37.05, 36.27, 36.03 ms.  Throughput: 217.66 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=256, Pass2=16384, clm=2 (8 cpus, 1 worker):  5.06 ms.  Throughput: 197.47 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=256, Pass2=16384, clm=2 (8 cpus, 8 workers): 39.23, 37.22, 36.70, 38.41, 37.80, 37.46, 37.29, 37.21 ms.  Throughput: 212.49 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=256, Pass2=16384, clm=1 (8 cpus, 1 worker):  5.13 ms.  Throughput: 194.91 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=256, Pass2=16384, clm=1 (8 cpus, 8 workers): 40.08, 39.17, 36.26, 39.46, 38.18, 38.19, 37.31, 36.94 ms.  Throughput: 209.65 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=512, Pass2=8192, clm=4 (8 cpus, 1 worker):  5.01 ms.  Throughput: 199.73 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=512, Pass2=8192, clm=4 (8 cpus, 8 workers): 37.44, 36.46, 35.77, 37.41, 36.46, 36.92, 35.97, 35.95 ms.  Throughput: 218.94 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=512, Pass2=8192, clm=2 (8 cpus, 1 worker):  4.90 ms.  Throughput: 204.19 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=512, Pass2=8192, clm=2 (8 cpus, 8 workers): 38.03, 36.80, 36.22, 37.90, 36.93, 36.73, 36.31, 36.27 ms.  Throughput: 216.89 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=512, Pass2=8192, clm=1 (8 cpus, 1 worker):  4.79 ms.  Throughput: 208.80 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=512, Pass2=8192, clm=1 (8 cpus, 8 workers): 36.71, 35.35, 34.96, 37.85, 35.48, 36.31, 35.87, 35.16 ms.  Throughput: 222.61 iter/sec.
[Tue May 02 09:05:18 2017]
FFTlen=4096K, Type=3, Arch=4, Pass1=1024, Pass2=4096, clm=4 (8 cpus, 1 worker):  4.71 ms.  Throughput: 212.12 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=1024, Pass2=4096, clm=4 (8 cpus, 8 workers): 37.63, 35.64, 35.12, 38.28, 35.58, 36.05, 35.87, 35.78 ms.  Throughput: 220.90 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=1024, Pass2=4096, clm=2 (8 cpus, 1 worker):  4.60 ms.  Throughput: 217.50 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=1024, Pass2=4096, clm=2 (8 cpus, 8 workers): 37.70, 36.17, 35.30, 37.66, 35.61, 36.05, 35.85, 36.24 ms.  Throughput: 220.37 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=1024, Pass2=4096, clm=1 (8 cpus, 1 worker):  5.19 ms.  Throughput: 192.52 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=1024, Pass2=4096, clm=1 (8 cpus, 8 workers): 42.05, 39.58, 39.01, 41.68, 40.45, 40.87, 39.89, 39.85 ms.  Throughput: 198.03 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=2048, Pass2=2048, clm=4 (8 cpus, 1 worker):  5.25 ms.  Throughput: 190.41 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=2048, Pass2=2048, clm=4 (8 cpus, 8 workers): 38.63, 36.86, 36.09, 38.24, 37.29, 37.31, 36.58, 36.25 ms.  Throughput: 215.41 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=2048, Pass2=2048, clm=2 (8 cpus, 1 worker):  4.92 ms.  Throughput: 203.31 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=2048, Pass2=2048, clm=2 (8 cpus, 8 workers): 38.41, 36.24, 35.54, 38.67, 35.96, 36.44, 36.39, 36.46 ms.  Throughput: 217.77 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=2048, Pass2=2048, clm=1 (8 cpus, 1 worker):  4.66 ms.  Throughput: 214.54 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=2048, Pass2=2048, clm=1 (8 cpus, 8 workers): 36.66, 34.82, 34.40, 37.50, 34.70, 35.39, 35.02, 35.12 ms.  Throughput: 225.84 iter/sec.
You can see the best throughput for 1 worker is 217.50 iter/sec (Pass1=1024, Pass2=4096, clm=2), a 6% improvement over the default's 205.61 iter/sec for a 4096K FFT, and the best throughput for 8 workers is 225.84 iter/sec (Pass1=2048, Pass2=2048, clm=1), a 3% improvement over 218.29 iter/sec.
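
For anyone who wants to redo this comparison on their own results file, here is a minimal Python sketch. The regular expression and the helper names (`parse`, `throughput_from_ms`, `improvement_pct`) are my own and are not part of prime95. It assumes the log lines look exactly like the ones quoted above. Judging from the numbers, the multi-worker throughput appears to be the sum of the per-worker iteration rates; that is an observation from this log, not something taken from the prime95 source.

```python
import re

# Four lines copied from the results above: the two defaults and the two fastest variants.
LOG = """\
Timings for 4096K FFT length (8 cpus, 1 worker):  4.86 ms.  Throughput: 205.61 iter/sec.
Timings for 4096K FFT length (8 cpus, 8 workers): 37.56, 36.45, 36.65, 36.25, 36.48, 36.65, 36.53, 36.65 ms.  Throughput: 218.29 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=1024, Pass2=4096, clm=2 (8 cpus, 1 worker):  4.60 ms.  Throughput: 217.50 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=2048, Pass2=2048, clm=1 (8 cpus, 8 workers): 36.66, 34.82, 34.40, 37.50, 34.70, 35.39, 35.02, 35.12 ms.  Throughput: 225.84 iter/sec.
"""

# Hypothetical pattern for the tail of each line: worker count, timings, throughput.
LINE = re.compile(
    r"\((\d+) cpus, (\d+) workers?\):\s*([\d., ]+) ms\.\s+Throughput: ([\d.]+) iter/sec"
)

def parse(log):
    """Return (workers, per_worker_ms, throughput, is_default) for each matching line."""
    rows = []
    for line in log.splitlines():
        m = LINE.search(line)
        if m:
            ms = [float(t) for t in m.group(3).split(",")]
            is_default = line.startswith("Timings")
            rows.append((int(m.group(2)), ms, float(m.group(4)), is_default))
    return rows

def throughput_from_ms(ms):
    """Sum of per-worker rates in iter/sec (each worker does 1000/ms iterations per second)."""
    return sum(1000.0 / t for t in ms)

def improvement_pct(best, default):
    """Percentage gain of `best` over `default`."""
    return (best / default - 1.0) * 100.0

rows = parse(LOG)
for workers in (1, 8):
    default = next(r[2] for r in rows if r[0] == workers and r[3])
    best = max(r[2] for r in rows if r[0] == workers)
    print(f"{workers} worker(s): {improvement_pct(best, default):.1f}% over default")
```

This gives about 5.8% for 1 worker and about 3.5% for 8 workers, which round to the 6% and 3% mentioned above.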

There's no guarantee George will pick those particular kernels for the build. The 1 worker configuration that was fastest for you was also fastest for my i5-6600. The n-worker configurations were different though, with your Ryzen system preferring more balance between Pass1 and Pass2, while the i5 preferred a smaller Pass1 and a larger Pass2.

Your best result of 225.84 iter/sec was 33% faster than the 169.78 iter/sec from my i5-6600 at 3.3 GHz. Your memory bandwidth is 25% higher than my DDR4-2133, and the stock clock of an 1800X at 3.6 GHz is 9% faster than the 3.3 GHz I have my i5-6600 running at, so your Ryzen chip is doing very well considering its FMA3 runs at half the speed of Skylake's.
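
The three percentages can be checked with a few lines of Python. This is only the arithmetic from the paragraph above; using the DDR4 transfer rate (2666 vs 2133) as a stand-in for memory bandwidth assumes both systems run the same number of memory channels, which is my assumption, and the helper name `pct_faster` is made up for the example.

```python
def pct_faster(a, b):
    """Percentage by which a exceeds b."""
    return (a / b - 1.0) * 100.0

throughput = pct_faster(225.84, 169.78)  # iter/sec, 1800X vs i5-6600
memory = pct_faster(2666, 2133)          # DDR4 transfer rate, assumed proxy for bandwidth
clock = pct_faster(3.6, 3.3)             # GHz, 1800X stock vs the i5 as run

print(f"throughput: {throughput:.1f}%")
print(f"memory:     {memory:.1f}%")
print(f"clock:      {clock:.1f}%")
```

The throughput gain (about 33%) is larger than either the memory gain (about 25%) or the clock gain (about 9%) alone, which is the point being made about Ryzen holding up despite the slower FMA3.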

2017-05-03, 19:29   #49
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
7158₁₀ Posts

Quote:
Originally Posted by tului
Based off this and any other results you've gotten, how does Ryzen look? How does whatever info was sought in asking for these benchmarks look?

I've not done an in-depth study of all the results. What I have found is that it is not at all uncommon for machines to have different optimal FFTs than prime95's default.

In most cases the difference is not great, but in some it can be significant. For example, the optimal 4M FFT on my Kaby Lake is the worst 4M FFT on Ryzen (150 iter/s vs 179 iter/s). That is a significant difference - enough for me to consider adding different default selections for Ryzen. I need to study more to see how often there is such a huge difference.

I've also found that if I just export all the FFT implementations that are within 10% of optimal on my Kaby Lake, then I'll still be excluding the best FFT implementation for some machines. Thus, there is no way to avoid serious bloat of the executable size given the present prime95 architecture....
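
To make the "within 10% of optimal" filter concrete, here is a small Python sketch. No Kaby Lake numbers appear in this thread, so it uses the 1800X 4096K figures quoted earlier as stand-ins: the 1-worker column plays the role of the reference machine and the 8-worker column plays the role of some other machine. That pairing is purely an assumption for illustration, and `exported` is a hypothetical helper, not anything in prime95.

```python
# (Pass1, Pass2, clm) -> (1-worker iter/sec, 8-worker iter/sec), from the 1800X log earlier in the thread.
RESULTS = {
    (256, 16384, 4): (203.88, 217.66),
    (256, 16384, 2): (197.47, 212.49),
    (256, 16384, 1): (194.91, 209.65),
    (512, 8192, 4): (199.73, 218.94),
    (512, 8192, 2): (204.19, 216.89),
    (512, 8192, 1): (208.80, 222.61),
    (1024, 4096, 4): (212.12, 220.90),
    (1024, 4096, 2): (217.50, 220.37),
    (1024, 4096, 1): (192.52, 198.03),
    (2048, 2048, 4): (190.41, 215.41),
    (2048, 2048, 2): (203.31, 217.77),
    (2048, 2048, 1): (214.54, 225.84),
}

def exported(reference, tolerance):
    """Implementations whose reference throughput is within `tolerance` of the reference best."""
    best = max(reference.values())
    return {key for key, value in reference.items() if value >= best * (1.0 - tolerance)}

reference = {key: pair[0] for key, pair in RESULTS.items()}  # stand-in for the reference machine
other = {key: pair[1] for key, pair in RESULTS.items()}      # stand-in for a different machine
other_best = max(other, key=other.get)

for tol in (0.10, 0.01):
    kept = exported(reference, tol)
    print(f"tolerance {tol:.0%}: keep {len(kept)} of {len(reference)}; "
          f"other machine's best kept: {other_best in kept}")
```

With this particular data a 10% cut-off happens to keep the other configuration's best implementation and a 1% cut-off drops it. Across genuinely different CPUs the spread is wider, which is how even a 10% cut-off can exclude some machine's best.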

So, I've started investigating how hard it would be to make pass 1 shared. That is, all FFT implementations that do 512 elements in pass 1 would call a common routine to do pass 1. If I do this, then I can include every possible FFT implementation without any bloat in executable size.

A little history: Old architectures would pay a significant penalty if they did a lot of address calculations during the FFT. Thus, prime95 favored using fixed displacements in pass 1, at the expense of making pass 1 unshareable. Also, the FFT code was written to work on both 32-bit and 64-bit machines, limiting me to using 8 registers.

My plan is to use the extra 8 registers available in 64-bit mode to greatly reduce the cost of the address calculations needed to "common-ize" pass 1 code. That, along with modern architectures not paying much penalty for doing some extra address calculations alongside FPU calculations, means I can make pass 1 shareable at almost zero runtime cost.
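
As a rough illustration of the shared pass 1 idea, here is a Python sketch of a generic two-pass complex FFT of length Pass1 × Pass2 (in the log quoted earlier, every Pass1 × Pass2 product equals 4096K). It is not prime95's code, which is hand-written assembly working on weighted real data. All names here (`dft`, `shared_pass1`, `two_pass_fft`) are invented for the example, and a naive DFT stands in for the tuned kernels. The point is only that pass 1 takes its stride as a runtime argument and computes element addresses itself, so every implementation with the same pass 1 size can call the one routine.

```python
import cmath

def dft(values):
    """Naive O(n^2) DFT; a stand-in for a tuned pass kernel."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * j * k / n) for j, v in enumerate(values))
        for k in range(n)
    ]

def shared_pass1(data, pass1, stride):
    """Pass 1 for any implementation with this pass-1 size.

    The stride arrives at runtime, so indices are computed here rather than
    being baked in as fixed displacements for one specific FFT length.
    """
    for col in range(stride):
        idx = [col + stride * r for r in range(pass1)]  # the extra address calculation
        out = dft([data[i] for i in idx])
        for i, v in zip(idx, out):
            data[i] = v

def two_pass_fft(x, pass1, pass2):
    """Length pass1*pass2 FFT: strided pass 1, twiddle multiply, contiguous pass 2."""
    n = pass1 * pass2
    assert len(x) == n
    data = [complex(v) for v in x]
    shared_pass1(data, pass1, stride=pass2)
    for k1 in range(pass1):
        for n2 in range(pass2):
            data[k1 * pass2 + n2] *= cmath.exp(-2j * cmath.pi * k1 * n2 / n)
    result = [0j] * n
    for k1 in range(pass1):
        row = dft(data[k1 * pass2:(k1 + 1) * pass2])
        for k2, v in enumerate(row):
            result[k1 + pass1 * k2] = v
    return result

# Two FFT lengths that both use a pass-1 size of 4 and therefore share one routine.
for length, p1, p2 in ((8, 4, 2), (16, 4, 4)):
    x = [float(i) for i in range(length)]
    err = max(abs(a - b) for a, b in zip(two_pass_fft(x, p1, p2), dft(x)))
    print(f"N={length} (Pass1={p1}, Pass2={p2}): max error {err:.2e}")
```

In a fixed-displacement design each (Pass1, Pass2) pair would need its own copy of pass 1 with the offsets hard-coded, which is where the executable bloat comes from.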

After this is all done, I'll look at the different benchies sent in and use them to change prime95's defaults to be the best one for most CPUs rather than just my Kaby Lake.

In summary:
1) 64-bit prime95 will export all FFT implementations for AVX and AVX2 architectures, making the new find-the-fastest-FFT-implementation feature quite useful.
2) 32-bit prime95 will not have common pass 1 making the new feature pretty useless. But then again, who cares about 32-bit users?
3) There will be new default FFT implementations using a blend of benchmarks, with perhaps a separate selection for Ryzen.

Time frame? Won't be quick.

2017-05-03, 19:40   #50
GP2
Sep 2003
2581₁₀ Posts

Quote:
Originally Posted by Prime95
Thus, there is no way to avoid serious bloat of the executable size given the present prime95 architecture....

Is executable size bloat truly an issue? We have terabytes of disk space and gigabytes of RAM. Presumably any unneeded code will get paged out of physical memory if memory gets tight. Maybe as an interim solution it would be OK to let the executable get bloated, or at least offer a bloated executable as an optional alternative download?

2017-05-03, 20:46   #51
tului
Jan 2013
2²×17 Posts

Quote:
Originally Posted by GP2
Is executable size bloat truly an issue? We have terabytes of disk space and gigabytes of RAM. Presumably any unneeded code will get paged out of physical memory if memory gets tight. Maybe as an interim solution it would be OK to let the executable get bloated, or at least offer a bloated executable as an optional alternative download?

This was my thought. I mean, even the test .exe at fifty-some megabytes was inconsequential.

If you need any more Ryzen testing let me know.

2017-05-03, 21:04   #52
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
15766₈ Posts

Quote:
Originally Posted by GP2
Is executable size bloat truly an issue?

A little. The test executable was 50-ish MB, but did not include all implementations for all-complex FFTs, Core/Core2 (AVX, but not AVX2), or AVX512. Do all of those and we could be pushing 200 MB. Deal breaker? Probably not.

The best reason for doing this is it would allow me to write new pass 1s for niche markets. Such as non-prefetching FFTs for Xeons with huge caches, or maybe a Ryzen-specific version that uses prefetchw (that's what the K8 and K10 specific versions do). Or perhaps add more pass1 sizes.

Besides, it's just "cleaner". No one will look at the code and wonder "what was that guy thinking?"

2017-05-04, 06:11   #53
Dubslow
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
3×2,399 Posts

200 MB is no small thing. We may have disks more than an order of magnitude larger than that, but one thing we don't have is the internet bandwidth to match (at least not in the US for the very, very large majority of users). Maybe it would be an acceptable stopgap, but since George has an alternate solution, that should be the long-term goal.

2017-05-04, 08:46   #54
henryzz
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
3⁴×71 Posts

Quote:
Originally Posted by Dubslow
200 MB is no small thing. We may have disks more than an order of magnitude larger than that, but one thing we don't have is the internet bandwidth to match (at least not in the US for the very, very large majority of users). Maybe it would be an acceptable stopgap, but since George has an alternate solution that should be the long term goal

4 minutes for me at 7 Mbit/s.
TBH if you have internet that slow you get used to it. I would suggest a lite version as well if you go that big.
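
The 4 minute figure is easy to reproduce. A minimal sketch, assuming 1 MB = 8 Mbit and ignoring protocol overhead (the function name is just for illustration):

```python
def download_minutes(size_mb, link_mbit_per_s):
    """Transfer time in minutes, taking 1 MB = 8 Mbit and ignoring overhead."""
    return size_mb * 8 / link_mbit_per_s / 60

print(f"{download_minutes(200, 7):.1f} min")  # the possible 200 MB build at 7 Mbit/s
print(f"{download_minutes(50, 7):.1f} min")   # the roughly 50 MB test build
```

That comes to about 3.8 minutes for 200 MB and about a minute for the 50 MB test executable.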

2017-05-04, 09:18   #55
mackerel
Feb 2016
UK
110001001₂ Posts

Quote:
Originally Posted by Prime95
2) 32-bit prime95 will not have common pass 1 making the new feature pretty useless. But then again, who cares about 32-bit users?
3) There will be new default FFT implementations using a blend of benchmarks, with perhaps a separate selection for Ryzen.

Random thoughts:

I assume this will eventually affect gwnum, and thus other software that uses it. With that in mind, is there some point where you need to draw a line and say anything older than some point becomes unsupported? Keep a legacy version available for those that need it, and have a more modern version looking forwards?

On Ryzen specifically, is there any significant potential advantage to be gained if its architecture was considered specifically?
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.