What is Arch, Pass 1, Pass 2, clm?
Hi, in the p95 benchmark there is an option to run a full pass benchmark to determine the optimal setup for my PC.
Can somebody explain what these fields mean? Thanks!
FFTlen=3360K, Type=3, Arch=4, Pass1=448, Pass2=7680, clm=4 (12 cores, 2 workers): 1.95, 1.86 ms. Throughput: 1050.86 iter/sec.
FFTlen=3360K, Type=3, Arch=4, Pass1=448, Pass2=7680, clm=2 (12 cores, 2 workers): 1.88, 1.86 ms. Throughput: 1068.46 iter/sec.
FFTlen=3360K, Type=3, Arch=4, Pass1=448, Pass2=7680, clm=1 (12 cores, 2 workers): 2.00, 1.92 ms. Throughput: 1020.24 iter/sec.
FFTlen=3360K, Type=3, Arch=4, Pass1=896, Pass2=3840, clm=4 (12 cores, 2 workers): 2.07, 1.94 ms. Throughput: 997.99 iter/sec.
FFTlen=3360K, Type=3, Arch=4, Pass1=896, Pass2=3840, clm=2 (12 cores, 2 workers): 1.93, 1.88 ms. Throughput: 1050.31 iter/sec.
FFTlen=3360K, Type=3, Arch=4, Pass1=896, Pass2=3840, clm=1 (12 cores, 2 workers): 1.93, 1.86 ms. Throughput: 1055.57 iter/sec.
FFTlen=3456K, Type=3, Arch=4, Pass1=384, Pass2=9216, clm=4 (12 cores, 2 workers): 1.97, 1.89 ms. Throughput: 1036.81 iter/sec.
FFTlen=3456K, Type=3, Arch=4, Pass1=384, Pass2=9216, clm=2 (12 cores, 2 workers): 1.93, 1.91 ms. Throughput: 1043.26 iter/sec.
FFTlen=3456K, Type=3, Arch=4, Pass1=384, Pass2=9216, clm=1 (12 cores, 2 workers): 1.94, 1.91 ms. Throughput: 1040.41 iter/sec.
FFTlen=3456K, Type=3, Arch=4, Pass1=768, Pass2=4608, clm=4 (12 cores, 2 workers): 2.06, 1.93 ms. Throughput: 1004.03 iter/sec.
FFTlen=3456K, Type=3, Arch=4, Pass1=768, Pass2=4608, clm=2 (12 cores, 2 workers): 1.93, 1.88 ms. Throughput: 1051.41 iter/sec.
FFTlen=3456K, Type=3, Arch=4, Pass1=768, Pass2=4608, clm=1 (12 cores, 2 workers): 1.89, 1.86 ms. Throughput: 1067.57 iter/sec.
FFTlen=3456K, Type=3, Arch=4, Pass1=1536, Pass2=2304, clm=4 (12 cores, 2 workers): 2.38, 2.17 ms. Throughput: 880.39 iter/sec.
FFTlen=3456K, Type=3, Arch=4, Pass1=1536, Pass2=2304, clm=2 (12 cores, 2 workers): 2.14, 1.94 ms. Throughput: 983.53 iter/sec.
FFTlen=3456K, Type=3, Arch=4, Pass1=1536, Pass2=2304, clm=1 (12 cores, 2 workers): 2.04, 1.95 ms. Throughput: 1003.56 iter/sec. |
From the prime95 source file commonb.c (beginning at line 9604 in v30.8b15; similar code also occurs elsewhere):
[CODE]
sprintf (buf, "FFTlen=%lu%s%s, Type=%d, Arch=%d, Pass1=%lu, Pass2=%lu, clm=%lu",
         (fftlen & 0x3FF) ? fftlen : fftlen / 1024,
         (fftlen & 0x3FF) ? "" : "K",
         plus1 ? " all-complex" : "",
         lldata.gwdata.FFT_TYPE,
         lldata.gwdata.ARCH,
         fftlen / (lldata.gwdata.PASS2_SIZE ? lldata.gwdata.PASS2_SIZE : 1),
         lldata.gwdata.PASS2_SIZE,
         lldata.gwdata.PASS1_CACHE_LINES / ((CPU_FLAGS & CPU_AVX512F) ? 8 : ((CPU_FLAGS & CPU_AVX) ? 4 : 2)));
[/CODE]
Arch = a numeric code for the CPU architecture
Pass1 = size of the first FFT pass across the data
Pass2 = size of the second FFT pass across the data
clm = cache line multiplier
See also at least the source files cpuid.c and cpuid.h regarding CPU architecture.
(George is on vacation. If I've botched the above, he may correct it after returning.) |
Thanks!
It's becoming clearer, but more questions arise: what is the cache line multiplier, and what is a pass across the data? :D |
[QUOTE=Jurzal;625174]Thanks!
It's becoming clearer, but more questions arise: what is the cache line multiplier, and what is a pass across the data? :D[/QUOTE]
The cache line multiplier is a low-level detail in the implementation of prime95/mprime. Its value can be 1, 2 or 4, and on different architectures one of the settings may be slightly faster than the others.
Do you understand what an FFT is? Very large FFTs are used to perform the multiplication/modular arithmetic. The FFT processing in prime95/mprime is done in two passes: effectively, pass 1 works on X-sized pieces of the FFT, and pass 2 takes Y pieces and combines them. If you multiply the two pass sizes together you get the full size of the FFT. Some programs break the FFT processing into 3 passes. It's an implementation detail that makes the code use memory and cache more efficiently.
ETA: You can use the search features on the forum and probably find much better answers than the ones I've given here. |
Some further reading:
[url]https://mersenneforum.org/showpost.php?p=1478&postcount=5[/url]
[url]https://www.mersenneforum.org/showpost.php?p=510721&postcount=7[/url] |
Thanks, will look into it!
Last question: how can I calculate the size in MB of an FFT? I'd like to understand at what point my 32 MB of cache is exhausted and RAM kicks in. Can I calculate which FFT length fits within 32 MB of cache? Thanks! |
I suspect pass 1 fitting in L1 or L2 is more likely and significant.
As a ballpark computation, if the FFT data are changed in place, L3 cache size / #workers / (8 bytes/DP word) = 32Mi / 2 / 8 = 2048Ki. That's smaller than the FFT lengths at the wavefront of DC (double-check) work. |
[QUOTE=kriesel;625189]I suspect pass 1 fitting in L1 or L2 is more likely and significant.
As a ballpark computation, if the FFT data are changed in place, L3 cache size / #workers / (8 bytes/DP word) = 32Mi / 2 / 8 = 2048Ki. That's smaller than wavefront DC.[/QUOTE]
I have a 5900X with 2x32 MB of L3 cache, so with 32Mi / 1 / 8 I should fit within the 4,096K range? Something like that, I assume: 2 workers, each using its own 32 MB of cache. |
[QUOTE=Jurzal;625198]I have a 5900X with 2x32 MB of L3 cache, so with 32Mi / 1 / 8 I should fit within the 4,096K range?
Something like that, I assume: 2 workers, each using its own 32 MB of cache.[/QUOTE]
Yes, roughly. Though your benchmarks in the benchmark thread seem a bit odd. Was there anything else running at the same time? Someone ran benchmarks of the 5800X3D, and you can see it doesn't show a big increase in timings when the FFT exceeds 32M, unlike the other CPUs with 32 MB per chiplet. I'm on mobile and too lazy to dig through the thread. |
The GPU was running in the background, along with Discord, HWiNFO64, and other background tasks.
The 5800X3D has 96 MB of L3 cache, while the 5900X has 2x32 MB of L3 cache. I can test some other parameters, but I don't think I would come up with different results. |