|
|
#1 |
|
Jan 2023
Riga, Latvia
2^2·3·5 Posts |
Hi, in the p95 benchmark there is an option to do a full-pass benchmark to determine the optimal setup for my PC.
Can somebody explain what the fields in the output mean? Thanks!
FFTlen=3360K, Type=3, Arch=4, Pass1=448, Pass2=7680, clm=4 (12 cores, 2 workers): 1.95, 1.86 ms. Throughput: 1050.86 iter/sec.
FFTlen=3360K, Type=3, Arch=4, Pass1=448, Pass2=7680, clm=2 (12 cores, 2 workers): 1.88, 1.86 ms. Throughput: 1068.46 iter/sec.
FFTlen=3360K, Type=3, Arch=4, Pass1=448, Pass2=7680, clm=1 (12 cores, 2 workers): 2.00, 1.92 ms. Throughput: 1020.24 iter/sec.
FFTlen=3360K, Type=3, Arch=4, Pass1=896, Pass2=3840, clm=4 (12 cores, 2 workers): 2.07, 1.94 ms. Throughput: 997.99 iter/sec.
FFTlen=3360K, Type=3, Arch=4, Pass1=896, Pass2=3840, clm=2 (12 cores, 2 workers): 1.93, 1.88 ms. Throughput: 1050.31 iter/sec.
FFTlen=3360K, Type=3, Arch=4, Pass1=896, Pass2=3840, clm=1 (12 cores, 2 workers): 1.93, 1.86 ms. Throughput: 1055.57 iter/sec.
FFTlen=3456K, Type=3, Arch=4, Pass1=384, Pass2=9216, clm=4 (12 cores, 2 workers): 1.97, 1.89 ms. Throughput: 1036.81 iter/sec.
FFTlen=3456K, Type=3, Arch=4, Pass1=384, Pass2=9216, clm=2 (12 cores, 2 workers): 1.93, 1.91 ms. Throughput: 1043.26 iter/sec.
FFTlen=3456K, Type=3, Arch=4, Pass1=384, Pass2=9216, clm=1 (12 cores, 2 workers): 1.94, 1.91 ms. Throughput: 1040.41 iter/sec.
FFTlen=3456K, Type=3, Arch=4, Pass1=768, Pass2=4608, clm=4 (12 cores, 2 workers): 2.06, 1.93 ms. Throughput: 1004.03 iter/sec.
FFTlen=3456K, Type=3, Arch=4, Pass1=768, Pass2=4608, clm=2 (12 cores, 2 workers): 1.93, 1.88 ms. Throughput: 1051.41 iter/sec.
FFTlen=3456K, Type=3, Arch=4, Pass1=768, Pass2=4608, clm=1 (12 cores, 2 workers): 1.89, 1.86 ms. Throughput: 1067.57 iter/sec.
FFTlen=3456K, Type=3, Arch=4, Pass1=1536, Pass2=2304, clm=4 (12 cores, 2 workers): 2.38, 2.17 ms. Throughput: 880.39 iter/sec.
FFTlen=3456K, Type=3, Arch=4, Pass1=1536, Pass2=2304, clm=2 (12 cores, 2 workers): 2.14, 1.94 ms. Throughput: 983.53 iter/sec.
FFTlen=3456K, Type=3, Arch=4, Pass1=1536, Pass2=2304, clm=1 (12 cores, 2 workers): 2.04, 1.95 ms. Throughput: 1003.56 iter/sec. |
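For comparing lines like the ones above, a small script can pull out the fields and rank the configurations by throughput. This is only a sketch: the regular expression is written against the output format shown in this post, the helper names (`parse_line`, `estimated_throughput`) are made up for illustration, and the idea that the reported throughput is roughly the sum of the per-worker rates (1000 / ms) is an observation from these numbers, not something taken from the prime95 source.

```python
import re

# Pattern written against the benchmark lines quoted in this thread (an assumption,
# not an official format specification).
LINE_RE = re.compile(
    r"FFTlen=(?P<fftlen>\d+)(?P<k>K?)(?: all-complex)?, Type=(?P<type>\d+), "
    r"Arch=(?P<arch>\d+), Pass1=(?P<pass1>\d+), Pass2=(?P<pass2>\d+), "
    r"clm=(?P<clm>\d+) \((?P<cores>\d+) cores?, (?P<workers>\d+) workers?\): "
    r"(?P<times>[\d., ]+) ms\.\s+Throughput: (?P<throughput>[\d.]+) iter/sec\."
)


def parse_line(line):
    """Return a dict of fields for one benchmark line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    fftlen = int(m.group("fftlen")) * (1024 if m.group("k") else 1)
    return {
        "fftlen": fftlen,
        "pass1": int(m.group("pass1")),
        "pass2": int(m.group("pass2")),
        "clm": int(m.group("clm")),
        "workers": int(m.group("workers")),
        "times_ms": [float(t) for t in m.group("times").split(",")],
        "throughput": float(m.group("throughput")),
    }


def estimated_throughput(times_ms):
    """Sum of per-worker iteration rates; each worker does 1000 / t iterations per second."""
    return sum(1000.0 / t for t in times_ms)


SAMPLE = """\
FFTlen=3360K, Type=3, Arch=4, Pass1=448, Pass2=7680, clm=4 (12 cores, 2 workers): 1.95, 1.86 ms. Throughput: 1050.86 iter/sec.
FFTlen=3360K, Type=3, Arch=4, Pass1=448, Pass2=7680, clm=2 (12 cores, 2 workers): 1.88, 1.86 ms. Throughput: 1068.46 iter/sec.
FFTlen=3360K, Type=3, Arch=4, Pass1=448, Pass2=7680, clm=1 (12 cores, 2 workers): 2.00, 1.92 ms. Throughput: 1020.24 iter/sec.
"""

results = [r for r in (parse_line(l) for l in SAMPLE.splitlines()) if r is not None]
best = max(results, key=lambda r: r["throughput"])
print("best:", best["pass1"], best["pass2"], "clm =", best["clm"], best["throughput"])
```

Higher throughput is better, so among these three lines the clm=2 variant wins, though the differences are within a couple of percent.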
|
|
|
|
|
#2 |
|
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,823 Posts |
From prime95 source file commonb.c (beginning line 9604 in v30.8b15), and similar occurs also elsewhere:
Code:
sprintf (buf, "FFTlen=%lu%s%s, Type=%d, Arch=%d, Pass1=%lu, Pass2=%lu, clm=%lu",
         (fftlen & 0x3FF) ? fftlen : fftlen / 1024,
         (fftlen & 0x3FF) ? "" : "K",
         plus1 ? " all-complex" : "",
         lldata.gwdata.FFT_TYPE,
         lldata.gwdata.ARCH,
         fftlen / (lldata.gwdata.PASS2_SIZE ? lldata.gwdata.PASS2_SIZE : 1),
         lldata.gwdata.PASS2_SIZE,
         lldata.gwdata.PASS1_CACHE_LINES / ((CPU_FLAGS & CPU_AVX512F) ? 8 : ((CPU_FLAGS & CPU_AVX) ? 4 : 2)));

pass1 = size of first FFT pass across the data
pass2 = size of second FFT pass across the data
clm = cache line multiplier

See also at least source files cpuid.c and cpuid.h regarding CPU architecture.
(George is on vacation. If I've botched the above, he may correct it after returning.)
Last fiddled with by kriesel on 2023-02-18 at 15:29 |
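To make the arithmetic in that sprintf easier to follow, here is a rough Python mirror of it. It is a sketch, not prime95 code: the function name `describe_fft` and its boolean `avx512` / `avx` parameters are stand-ins for the `CPU_FLAGS` tests, and the `pass1_cache_lines=16` value in the example is a hypothetical input chosen so that the result matches the first line of the benchmark output.

```python
def describe_fft(fftlen, pass2_size, pass1_cache_lines, fft_type, arch,
                 plus1=False, avx512=False, avx=True):
    """Rebuild the benchmark label from raw values, following the quoted sprintf."""
    # FFT lengths that are a multiple of 1024 are printed with a "K" suffix.
    if fftlen & 0x3FF:
        length = str(fftlen)
    else:
        length = f"{fftlen // 1024}K"
    all_complex = " all-complex" if plus1 else ""
    # Pass1 is not stored separately here: it is derived as fftlen / Pass2.
    pass1 = fftlen // (pass2_size if pass2_size else 1)
    # clm is the pass 1 cache line count divided by 8 (AVX-512), 4 (AVX) or 2 (otherwise).
    divisor = 8 if avx512 else (4 if avx else 2)
    clm = pass1_cache_lines // divisor
    return (f"FFTlen={length}{all_complex}, Type={fft_type}, Arch={arch}, "
            f"Pass1={pass1}, Pass2={pass2_size}, clm={clm}")


# pass1_cache_lines=16 is a made-up input for illustration only.
print(describe_fft(3360 * 1024, 7680, 16, fft_type=3, arch=4))
```

The point to take away is that Pass1 times Pass2 equals the FFT length whenever Pass2 divides it evenly, and that clm is a scaled-down view of the pass 1 cache line count.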
|
|
|
|
|
#3 |
|
Jan 2023
Riga, Latvia
2^2·3·5 Posts |
Thanks!
It has become clearer, but more questions arise: what is a cache line multiplier, and what is a pass across the data? :D |
|
|
|
|
|
#4 | |
|
Jan 2021
California
1000110000_2 Posts |
Quote:
Do you understand what an FFT is? Very large FFTs are used to perform the multiplication/modular arithmetic. The FFT processing in prime95/mprime is done in two passes: effectively, pass 1 is working on X-size pieces of the FFT, and pass 2 is taking Y pieces and combining them together. If you multiply the two pass size numbers together you'll get the full size of the FFT.
Some programs break the FFT processing into 3 passes. It's an implementation detail to make the code more efficient in terms of how it uses the memory and cache.
ETA: You can use the search features on the forum and probably find much better answers than the ones I've given here.
Last fiddled with by slandrum on 2023-02-18 at 18:13 |
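As a toy illustration of the two-pass idea, the sketch below splits a length n1*n2 transform into n2 small transforms of length n1, a twiddle-factor multiply, and then n1 small transforms of length n2, and checks the result against a direct DFT. This is the textbook Cooley-Tukey decomposition written in plain Python, not prime95's actual code; the names `dft` and `two_pass_dft` and the sizes 3 and 4 are chosen only for the example.

```python
import cmath


def dft(x):
    """Direct O(n^2) discrete Fourier transform, used as the small building block."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def two_pass_dft(x, n1, n2):
    """Length n1*n2 DFT computed as two passes of smaller transforms."""
    n = n1 * n2
    assert len(x) == n
    # Pass 1: n2 independent transforms of length n1 over strided slices of the data.
    cols = [dft(x[c::n2]) for c in range(n2)]
    # Twiddle factors link the two passes.
    for c in range(n2):
        for k1 in range(n1):
            cols[c][k1] *= cmath.exp(-2j * cmath.pi * c * k1 / n)
    # Pass 2: n1 transforms of length n2 combine the pieces into the full result.
    out = [0j] * n
    for k1 in range(n1):
        row = dft([cols[c][k1] for c in range(n2)])
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]
    return out


data = [float(v) for v in range(12)]
direct = dft(data)
split = two_pass_dft(data, 3, 4)
max_error = max(abs(a - b) for a, b in zip(direct, split))
print("3 * 4 =", len(split), "max error:", max_error)
```

Each small transform touches far less data than the whole array, which is why the split helps with cache use. It is also why the two pass sizes in the benchmark output multiply to the FFT length, for example 448 * 7680 = 3360 * 1024.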
|
|
|
|
|
|
#5 |
|
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,823 Posts |
|
|
|
|
|
|
#6 |
|
Jan 2023
Riga, Latvia
2^2·3·5 Posts |
Thanks, will look into it!
Last question: how can I calculate the size in MB of an FFT? I want to understand when my 32 MB of cache is exhausted and RAM kicks in. Can I calculate the largest FFT that fits within 32 MB of cache? Thanks! |
|
|
|
|
|
#7 |
|
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,823 Posts |
I suspect pass 1 fitting in L1 or L2 is more likely and significant.
As a ballpark computation, if the FFT data are changed in place, L3cachesize / #workers / (8 bytes/DP word) = 32Mi / 2 / 8 = 2048Ki words, i.e. roughly a 2048K FFT length per worker. That's smaller than the FFT lengths needed for wavefront DC. |
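The same ballpark can be written out as a short calculation. It assumes, as above, 8 bytes per double-precision word and in-place FFT data, and it ignores any extra tables or buffers the program needs, so treat the numbers as rough upper bounds. The function names are invented for the example.

```python
BYTES_PER_DP_WORD = 8  # one double-precision value per FFT element (assumption from the post)
MIB = 1024 * 1024


def fft_data_bytes(fftlen):
    """Approximate size of the FFT data alone, in bytes."""
    return fftlen * BYTES_PER_DP_WORD


def max_fftlen_in_cache(cache_bytes, workers):
    """Largest FFT length whose data fits if the cache is shared evenly by the workers."""
    return cache_bytes // workers // BYTES_PER_DP_WORD


# 32 MiB of L3 shared by 2 workers, as in the ballpark above.
limit = max_fftlen_in_cache(32 * MIB, workers=2)
print("max FFT length:", limit // 1024, "K")

# Data size of the benchmarked FFT lengths.
for k in (3360, 3456):
    print(f"{k}K FFT data:", fft_data_bytes(k * 1024) / MIB, "MiB")
```

By this estimate a single 3360K FFT is about 26 MiB of data, so it would fit in 32 MiB only if one worker had that cache to itself.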
|
|
|
|
|
#8 | |
|
Jan 2023
Riga, Latvia
2^2·3·5 Posts |
Quote:
Something like that I assume. 2 workers, each utilizing its own 32MB of cache. Last fiddled with by Jurzal on 2023-02-19 at 07:34 |
|
|
|
|
|
|
#9 | |
|
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/
2^4·199 Posts |
Quote:
Though your benchmarks in the benchmark thread seem a bit odd. Was there anything else running at the same time? Someone ran benchmarks of the 5800X3D and you can see it doesn't have a big increase when the FFT exceeds 32M unlike the other CPUs with 32MB per chiplet. I'm on mobile and too lazy to dig through the thread. |
|
|
|
|
|
|
#10 |
|
Jan 2023
Riga, Latvia
2^2×3×5 Posts |
The GPU was running in the background, along with Discord, HWiNFO64, and other background tasks.
5800X3D has 96 MB of L3 cache, while 5900X has 2x32 MB of L3 cache. I can test some other parameters, but I don't think I would come up with different results |
|
|
|