#45 |
|
Jan 2013
2²×17 Posts |
Redone. I think this is correct.
#46 |
|
"/X\(‘-‘)/X\"
Jan 2013
2×5×293 Posts |
#47 |
|
Jan 2013
2²·17 Posts |
Based on this and any other results you've gotten, how does Ryzen look? Did these benchmarks give you the information you were looking for?
#48 | |
|
"/X\(‘-‘)/X\"
Jan 2013
5562₈ Posts |
Quote:
Code:
Timings for 4096K FFT length (8 cpus, 1 worker): 4.86 ms. Throughput: 205.61 iter/sec.
Timings for 4096K FFT length (8 cpus, 8 workers): 37.56, 36.45, 36.65, 36.25, 36.48, 36.65, 36.53, 36.65 ms. Throughput: 218.29 iter/sec.
Code:
FFTlen=4096K, Type=3, Arch=4, Pass1=256, Pass2=16384, clm=4 (8 cpus, 1 worker): 4.90 ms. Throughput: 203.88 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=256, Pass2=16384, clm=4 (8 cpus, 8 workers): 37.95, 36.61, 35.82, 38.02, 36.41, 37.05, 36.27, 36.03 ms. Throughput: 217.66 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=256, Pass2=16384, clm=2 (8 cpus, 1 worker): 5.06 ms. Throughput: 197.47 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=256, Pass2=16384, clm=2 (8 cpus, 8 workers): 39.23, 37.22, 36.70, 38.41, 37.80, 37.46, 37.29, 37.21 ms. Throughput: 212.49 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=256, Pass2=16384, clm=1 (8 cpus, 1 worker): 5.13 ms. Throughput: 194.91 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=256, Pass2=16384, clm=1 (8 cpus, 8 workers): 40.08, 39.17, 36.26, 39.46, 38.18, 38.19, 37.31, 36.94 ms. Throughput: 209.65 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=512, Pass2=8192, clm=4 (8 cpus, 1 worker): 5.01 ms. Throughput: 199.73 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=512, Pass2=8192, clm=4 (8 cpus, 8 workers): 37.44, 36.46, 35.77, 37.41, 36.46, 36.92, 35.97, 35.95 ms. Throughput: 218.94 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=512, Pass2=8192, clm=2 (8 cpus, 1 worker): 4.90 ms. Throughput: 204.19 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=512, Pass2=8192, clm=2 (8 cpus, 8 workers): 38.03, 36.80, 36.22, 37.90, 36.93, 36.73, 36.31, 36.27 ms. Throughput: 216.89 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=512, Pass2=8192, clm=1 (8 cpus, 1 worker): 4.79 ms. Throughput: 208.80 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=512, Pass2=8192, clm=1 (8 cpus, 8 workers): 36.71, 35.35, 34.96, 37.85, 35.48, 36.31, 35.87, 35.16 ms. Throughput: 222.61 iter/sec.
[Tue May 02 09:05:18 2017]
FFTlen=4096K, Type=3, Arch=4, Pass1=1024, Pass2=4096, clm=4 (8 cpus, 1 worker): 4.71 ms. Throughput: 212.12 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=1024, Pass2=4096, clm=4 (8 cpus, 8 workers): 37.63, 35.64, 35.12, 38.28, 35.58, 36.05, 35.87, 35.78 ms. Throughput: 220.90 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=1024, Pass2=4096, clm=2 (8 cpus, 1 worker): 4.60 ms. Throughput: 217.50 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=1024, Pass2=4096, clm=2 (8 cpus, 8 workers): 37.70, 36.17, 35.30, 37.66, 35.61, 36.05, 35.85, 36.24 ms. Throughput: 220.37 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=1024, Pass2=4096, clm=1 (8 cpus, 1 worker): 5.19 ms. Throughput: 192.52 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=1024, Pass2=4096, clm=1 (8 cpus, 8 workers): 42.05, 39.58, 39.01, 41.68, 40.45, 40.87, 39.89, 39.85 ms. Throughput: 198.03 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=2048, Pass2=2048, clm=4 (8 cpus, 1 worker): 5.25 ms. Throughput: 190.41 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=2048, Pass2=2048, clm=4 (8 cpus, 8 workers): 38.63, 36.86, 36.09, 38.24, 37.29, 37.31, 36.58, 36.25 ms. Throughput: 215.41 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=2048, Pass2=2048, clm=2 (8 cpus, 1 worker): 4.92 ms. Throughput: 203.31 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=2048, Pass2=2048, clm=2 (8 cpus, 8 workers): 38.41, 36.24, 35.54, 38.67, 35.96, 36.44, 36.39, 36.46 ms. Throughput: 217.77 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=2048, Pass2=2048, clm=1 (8 cpus, 1 worker): 4.66 ms. Throughput: 214.54 iter/sec.
FFTlen=4096K, Type=3, Arch=4, Pass1=2048, Pass2=2048, clm=1 (8 cpus, 8 workers): 36.66, 34.82, 34.40, 37.50, 34.70, 35.39, 35.02, 35.12 ms. Throughput: 225.84 iter/sec.

There's no guarantee George will pick those particular kernels for the build. The 1-worker configuration that was fastest for you was also fastest for my i5-6600. The n-worker configurations were different though, with your Ryzen system preferring more balance between Pass1 and Pass2, while the i5 preferred a smaller Pass1 and a larger Pass2.

Your best result of 225.84 iter/sec was 33% faster than my i5-6600 at 3.3 GHz, which gives 169.78 iter/sec. Your memory bandwidth is 25% faster than my DDR4-2133, and the stock clock of an 1800X at 3.6 GHz is 9% faster than the 3.3 GHz I have my i5-6600 running at, so your Ryzen chip is doing very well considering its FMA3 runs at half the speed of Skylake's.
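For anyone who wants to check the arithmetic, here is a minimal Python sketch. It is not part of the original post. It assumes that the reported multi-worker throughput is the sum of each worker's iterations per second, which matches the 8-worker figures to within rounding. The names `parse`, `throughput` and `percent_faster` are made up for illustration.

```python
import re

# Two lines copied from the benchmark output above: the fastest 1-worker
# and the fastest 8-worker configuration.
LINES = [
    "FFTlen=4096K, Type=3, Arch=4, Pass1=1024, Pass2=4096, clm=2 (8 cpus, 1 worker): "
    "4.60 ms. Throughput: 217.50 iter/sec.",
    "FFTlen=4096K, Type=3, Arch=4, Pass1=2048, Pass2=2048, clm=1 (8 cpus, 8 workers): "
    "36.66, 34.82, 34.40, 37.50, 34.70, 35.39, 35.02, 35.12 ms. Throughput: 225.84 iter/sec.",
]

# Matches the "FFTlen=..." result lines shown above.
PATTERN = re.compile(
    r"Pass1=(\d+), Pass2=(\d+), clm=(\d+) \(\d+ cpus, (\d+) workers?\): "
    r"([\d., ]+) ms\. Throughput: ([\d.]+) iter/sec"
)


def parse(line):
    """Extract kernel parameters, per-worker times (ms) and reported throughput."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a benchmark result line")
    return {
        "pass1": int(m.group(1)),
        "pass2": int(m.group(2)),
        "clm": int(m.group(3)),
        "workers": int(m.group(4)),
        "times_ms": [float(t) for t in m.group(5).split(",")],
        "reported": float(m.group(6)),
    }


def throughput(times_ms):
    """Assumed model: total iter/sec is the sum of each worker's 1000 / ms."""
    return sum(1000.0 / t for t in times_ms)


def percent_faster(a, b):
    """How much faster a is than b, in percent."""
    return (a / b - 1.0) * 100.0


if __name__ == "__main__":
    for line in LINES:
        r = parse(line)
        print(r["workers"], "worker(s): computed",
              round(throughput(r["times_ms"]), 2), "reported", r["reported"])
    print("throughput advantage (%):", round(percent_faster(225.84, 169.78)))
    print("clock advantage (%):", round(percent_faster(3.6, 3.3)))
```

The 1-worker figure recomputed this way is slightly off from the reported value, presumably because the printed millisecond timing is rounded to two decimals.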
#49 | |
|
P90 years forever!
Aug 2002
Yeehaw, FL
7533₁₀ Posts |
Quote:
In most cases the difference is not great, but in some it can be significant. For example, the optimal 4M FFT on my Kaby Lake is the worst 4M FFT on Ryzen (150 iter/s vs 179 iter/s). That is a significant difference - enough for me to consider adding different default selections for Ryzen. I need to study more to see how often there is such a huge difference.

I've also found that if I just export all the FFT implementations that are within 10% of optimal on my Kaby Lake, then I'll still be excluding the best FFT implementation for some machines. Thus, there is no way to avoid serious bloat of the executable size given the present prime95 architecture....

So, I've started investigating how hard it would be to make pass 1 shared. That is, all FFT implementations that do 512 elements in pass 1 would call a common routine to do pass 1. If I do this, then I can include every possible FFT implementation without any bloat in executable size.

A little history: Old architectures would pay a significant penalty if they did a lot of address calculations during the FFT. Thus, prime95 favored using fixed displacements in pass 1, at the expense of making pass 1 unshareable. Also, the FFT code was written to work on both 32-bit and 64-bit machines, limiting me to using 8 registers. My plan is to use the extra 8 registers available in 64-bit to greatly ease the amount of address calculations needed to "common-ize" pass 1 code. That, along with modern architectures not paying much penalty for doing some extra address calculations alongside FPU calculations, means I can make pass 1 shareable at almost zero runtime cost.

After this is all done, I'll look at the different benchies sent in and use them to change prime95's defaults to be the best one for most CPUs rather than just my Kaby Lake.

In summary:
1) 64-bit prime95 will export all FFT implementations for AVX and AVX2 architectures, making the new find-the-fastest-FFT-implementation feature quite useful.
2) 32-bit prime95 will not have common pass 1, making the new feature pretty useless. But then again, who cares about 32-bit users?
3) There will be new default FFT implementations using a blend of benchmarks, with perhaps a separate selection for Ryzen.

Time frame? Won't be quick.
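The "within 10% of optimal" problem described above can be illustrated with a small Python sketch. This is a toy model, not prime95 code. The implementation labels and all throughput numbers are hypothetical, except that the 150 and 179 iter/s pair echoes the Ryzen figures quoted in the post. The function names are made up as well.

```python
# Hypothetical throughputs (iter/sec) for one FFT length on two machines.
# Keys "A", "B", "C" stand for different FFT implementations.
BENCH = {
    "kaby_lake": {"A": 200.0, "B": 190.0, "C": 170.0},
    "ryzen": {"A": 150.0, "B": 165.0, "C": 179.0},
}


def within_tolerance(results, tolerance=0.10):
    """Implementations whose throughput is within `tolerance` of the best one."""
    best = max(results.values())
    return {name for name, rate in results.items() if rate >= best * (1.0 - tolerance)}


def best_impl(results):
    """Name of the fastest implementation on one machine."""
    return max(results, key=results.get)


def machines_missing_their_best(bench, reference, tolerance=0.10):
    """Machines whose fastest implementation would not be exported if only
    the reference machine's near-optimal implementations were kept."""
    exported = within_tolerance(bench[reference], tolerance)
    return {
        machine: best_impl(results)
        for machine, results in bench.items()
        if best_impl(results) not in exported
    }


if __name__ == "__main__":
    print("exported:", sorted(within_tolerance(BENCH["kaby_lake"])))
    print("left out:", machines_missing_their_best(BENCH, "kaby_lake"))
```

With these made-up numbers the reference machine would export A and B, while the other machine's fastest implementation, C, is left out. That is the situation that motivates exporting every implementation.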
#50 |
|
Sep 2003
5·11·47 Posts |
Is executable size bloat truly an issue? We have terabytes of disk space and gigabytes of RAM. Presumably any unneeded code will get paged out of physical memory if memory gets tight. Maybe as an interim solution it would be OK to let the executable get bloated, or at least offer a bloated executable as an optional alternative download?
#51 | |
|
Jan 2013
2²×17 Posts |
Quote:
If you need any more Ryzen testing, let me know.
#52 |
|
P90 years forever!
Aug 2002
Yeehaw, FL
1D6D₁₆ Posts |
A little. The test executable was 50-ish MB, but it did not include all implementations for all-complex FFTs, Core/Core2 (AVX, but not AVX2), or AVX512. Do all of those and we could be pushing 200 MB. A deal breaker? Probably not.
The best reason for doing this is that it would allow me to write new pass 1s for niche markets, such as non-prefetching FFTs for Xeons with huge caches, or maybe a Ryzen-specific version that uses prefetchw (that's what the K8 and K10 specific versions do). Or perhaps add more pass 1 sizes. Besides, it's just "cleaner". No one will look at the code and wonder "what was that guy thinking?"
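To get a rough feel for why a shared pass 1 limits code growth, here is a toy Python count. It is not a model of the real prime95 layout. It assumes pass 1 code depends only on the pass 1 size, as the "common routine" description earlier in the thread suggests, and it uses the Pass1/Pass2/clm combinations from the 4096K benchmark output. The function names are hypothetical.

```python
from collections import defaultdict

FFT_LEN = 4096 * 1024  # 4096K, as in the benchmark output earlier in the thread

# (Pass1, Pass2, clm) combinations that appear in that output.
IMPLEMENTATIONS = [
    (pass1, FFT_LEN // pass1, clm)
    for pass1 in (256, 512, 1024, 2048)
    for clm in (4, 2, 1)
]


def group_by_pass1(impls):
    """Map each pass 1 size to the implementations that could share its routine."""
    groups = defaultdict(list)
    for pass1, pass2, clm in impls:
        groups[pass1].append((pass2, clm))
    return dict(groups)


def pass1_routine_count(impls, shared):
    """Number of pass 1 routines needed: one per implementation when unshared,
    one per distinct pass 1 size when shared (assumption, see lead-in)."""
    if shared:
        return len({pass1 for pass1, _, _ in impls})
    return len(impls)


if __name__ == "__main__":
    print("unshared pass 1 routines:", pass1_routine_count(IMPLEMENTATIONS, shared=False))
    print("shared pass 1 routines:", pass1_routine_count(IMPLEMENTATIONS, shared=True))
    for pass1, users in sorted(group_by_pass1(IMPLEMENTATIONS).items()):
        print("Pass1 =", pass1, "used by", len(users), "implementations")
```

Under that assumption, adding a new pass 1 variant for a niche CPU costs one extra routine per pass 1 size rather than one per implementation.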
#53 |
|
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
3·29·83 Posts |
200 MB is no small thing. We may have disks more than an order of magnitude larger than that, but one thing we don't have is the internet bandwidth to match (at least not in the US for the very, very large majority of users). Maybe it would be an acceptable stopgap, but since George has an alternate solution, that should be the long-term goal.
#54 | |
|
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
2×3³×109 Posts |
Quote:
TBH if you have internet that slow you get used to it. I would suggest a lite version as well if you go that big.
#55 | |
|
Feb 2016
UK
3·5·29 Posts |
Quote:
I assume this will eventually affect gwnum, and thus other software that uses it. With that in mind, is there a point where you need to draw a line and declare anything older unsupported? You could keep a legacy version available for those who need it and have a more modern version looking forward.

On Ryzen specifically, is there any significant advantage to be gained by targeting its architecture directly?