Best AVX-512 CPUs for large-footprint FFT-mul
Starting late last summer I ran a p-1 stage 1 to b1 = 10^7 on F33 on my Knights Landing cheapie-refurb mini-workstation. After installing a big wad of server dimm-RAM I've been running 10^8-sized stage 2 intervals using the stage 1 residue, with a view to starting a distributed such effort among interested forumites with suitable hardware.
But before risking wasting others' runtime, it's important to be sure the stage 1 result is correct. Our own Mike/Xyzzy has been kindly running a separate stage 1 computation on his Intel 18c36t i9, which was roughly 70% done (~10M stage 1 iterations of the needed 14427494) when he recently shut said machine down and sold it off. The problem is that, like most Intel manycore offerings, his machine had woefully inadequate memory bandwidth to keep those cores fed on a big-footprint (~4GB) FFT-modmul running data-hungry avx-512 8-fold-double code: using 16c32t he was getting 900-1000 ms/iter at 512M FFT, roughly half the speed of my KNL running out of the onboard 16GB HBM.
If one were targeting F33 stage 1 work, I wonder what the most bang-for-buck-ish non-KNL avx-512 option would be. One would want at least 4 cores but no more than 8 due to memory-bandwidth constraints, as large an L3 cache as possible (on the KNL the MCDRAM acts as such) and, lacking any kind of HBM, a mobo which supports fast high-bandwidth RAM, with DIMM slots filled with low-capacity but very fast sticks, say 16-32GB total. Maybe a 1-2-year-old used CPU, if the newer ones don't really offer much more max-throughput for the above type of big-footprint workloads?
If you have such a machine and are willing to do some timings, here's how:
o Get and build the current version of Mlucas, using the instructions [url=http://www.mersenneforum.org/mayer/README.html]here[/url]. If your system has < 24GB RAM, you'll need a couple of post-build tweaks to reduce the memory footprint; PM me for those once your automated 'bash makemake.sh' parallel build completes.
o 512M FFT will have a strong preference for power-of-2 threadcounts, and 2-threads-per-core assuming it's a hyperthreaded Intel CPU (AFAIK no AMD chips have avx-512 support at present). Assuming your machine has N physical cores and P = largest power of 2 <= N, you want to pin 2*P threads to the same subset of P physical cores.
Using the Intel core numbering convention: [i] ./Mlucas -iters 100 -fft 512M -f 33 -shift 0 -cpu 0:P-1,N:N+P-1 [/i] Thus e.g. on a 6c12t CPU, the args to the latter flag would be '-cpu 0:3,6:9'. The resulting timing captured in the fermat.cfg file will be ~10% pessimistic due to data-and-thread-init overhead. |
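For anyone who would rather not work out the core ranges by hand, here is a minimal Python sketch of the rule above. The helper name mlucas_cpu_arg is made up for illustration and is not part of Mlucas; it also assumes the Intel numbering convention mentioned above, where logical CPUs N..2N-1 are the second hyperthreads of physical cores 0..N-1.

```python
def mlucas_cpu_arg(n_cores: int) -> str:
    """Build a '-cpu' argument that pins 2*P threads to P physical cores.

    Hypothetical helper, not part of Mlucas. Assumes logical CPUs 0..N-1 are
    the first hyperthread of each physical core and N..2N-1 are the second
    hyperthreads of those same cores.
    """
    if n_cores < 1:
        raise ValueError("need at least one physical core")
    # Largest power of 2 that is <= n_cores.
    p = 1 << (n_cores.bit_length() - 1)
    return f"0:{p - 1},{n_cores}:{n_cores + p - 1}"


if __name__ == "__main__":
    for n in (4, 6, 8, 18):
        print(f"{n} cores: -cpu {mlucas_cpu_arg(n)}")
```

For the 6c12t example this reproduces the '0:3,6:9' string given above. If your OS numbers hyperthreads differently, the second range will be wrong, so check your own CPU topology first.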
[QUOTE=ewmayer;600947]I wonder what the most bang-for-buck-ish non-KNL avx-512 option would be.
[/quote] I have not seen any desktop chips, but this workstation looks good: [url]https://www.ebay.com/itm/352828567560?hash=item522638b008:g:S9IAAOSwkDJhiyLA&var=622531495174[/url] Newer chips run cooler, not like Skylake. |
[QUOTE=paulunderwood;600949]I have not seen any desktop chips, but this workstation looks good: [url]https://www.ebay.com/itm/352828567560?hash=item522638b008:g:S9IAAOSwkDJhiyLA&var=622531495174[/url]
Newer chips run cooler, not like Skylake.[/QUOTE] Thanks for the link - but in "bang for the buck" terms, it comes with no RAM, and is unlikely to outperform my refurb-KNL, which cost just $500, plus - once I had completed my stage 1 run - another $1400 (would be < $1000 currently) to upgrade with 192GB RAM for stage 2 work. Admittedly, it's a niche sort of optimization problem, and quite possibly a cheap used RAM-less KNL will prove the best option, I mainly wanted a sense of whether there were any consumer-grade Intel offerings which could provide similar total memory-bandwidth at comparable cost. |
[QUOTE=ewmayer;600953]Thanks for the link - but in "bang for the buck" terms, it comes with no RAM, and is unlikely to outperform my refurb-KNL, which cost just $500, plus - once I had completed my stage 1 run - another $1400 (would be < $1000 currently) to upgrade with 192GB RAM for stage 2 work
[/quote] The base price is for 96GB RAM. Another $500 gets you 192GB. It also comes with 4x 250GB NVMe. I thought the 38.5MB Level 3 cache would be attractive, despite 1/3 the memory bandwidth. Running 4-8 cores on one of its chips would probably trigger turbo, bumping it up to nearly 3.8GHz.
[quote] Admittedly, it's a niche sort of optimization problem, and quite possibly a cheap used RAM-less KNL will prove the best option, I mainly wanted a sense of whether there were any consumer-grade Intel offerings which could provide similar total memory-bandwidth at comparable cost.[/QUOTE]
Hopefully someone will run some benchmarks for you and a cheap desktop chip will give you what you want. :smile: I looked at NewEgg: an Intel 12700k ($400) plus an Asus Strix motherboard ($500) and 64GB DDR5 (~$600) dual channel. The chip will run AVX512 if the motherboard allows the E-cores to be disabled, leaving 8 cores. |
Just a note on AVX-512 vs FMA3 on a dual channel board. Used CpuSupportsAVX512F=0 or 1 to toggle AVX-512.
[code]3200K FFT DCLL on 60198527 @4700-4698 MHz on all cores reported by CPU-Z for both FMA3 and AVX-512 runs.
AVX-512        FMA3
2.88 ms/iter   3.025 ms/iter
2.87 ms/iter   3.022 ms/iter
2.91 ms/iter   3.018 ms/iter[/code]
Side note: my Ryzen 7 3800X (dual channel FMA3) is quicker than both at IIRC around 2.00 ms/iter. Will confirm #s when remote access to box possible. 1 worker 8 cores on all. |
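To put a number on the gap, here is a small Python sketch that averages the three timings per mode from the post above and reports the relative difference. The variable names are just for illustration, and three samples per mode is a small set, so treat the result as a rough figure.

```python
from statistics import mean

# ms/iter figures copied from the post above (3200K FFT, 1 worker, 8 cores).
avx512_ms = [2.88, 2.87, 2.91]
fma3_ms = [3.025, 3.022, 3.018]

avx512_avg = mean(avx512_ms)
fma3_avg = mean(fma3_ms)

# Relative throughput gain of AVX-512 over FMA3, in percent.
speedup_pct = (fma3_avg / avx512_avg - 1.0) * 100.0

print(f"AVX-512 mean: {avx512_avg:.3f} ms/iter")
print(f"FMA3 mean:    {fma3_avg:.3f} ms/iter")
print(f"AVX-512 faster by about {speedup_pct:.1f}%")
```

That works out to a gain of a bit under 5% for AVX-512 at this FFT length on this dual-channel board.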
[QUOTE=paulunderwood;600969]…64GB DDR5 (~$600) dual channel.[/QUOTE]We are currently running this job on a 32GB NUC.[CODE]top - 17:08:22 up 5 days, 11 min, 1 user, load average: 5.10, 5.27, 4.73
Tasks: 281 total,   1 running, 278 sleeping,   0 stopped,   2 zombie
%Cpu(s):  1.0 us,  0.3 sy, 50.0 ni, 48.2 id,  0.0 wa,  0.4 hi,  0.1 si,  0.0 st
GiB Mem :     31.1 total,     12.8 free,      8.1 used,     10.1 buff/cache
GiB Swap:      0.0 total,      0.0 free,      0.0 used.     22.1 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 121965 m         30  10   22.3g   5.0g   4.0m S 400.0  16.2   5230:10 ./mlucas -cpu 0:3[/CODE]:mike: |
[QUOTE=sdbardwick;600999]Just a note on AVX-512 vs FMA3 on a dual channel board. Used CpuSupportsAVX512F=0 or 1 to toggle AVX-512.
[code]3200K FFT DCLL on 60198527 @4700-4698 MHz on all cores reported by CPU-Z for both FMA3 and AVX-512 runs.
AVX-512        FMA3
2.88 ms/iter   3.025 ms/iter
2.87 ms/iter   3.022 ms/iter
2.91 ms/iter   3.018 ms/iter[/code]
Side note: my Ryzen 7 3800X (dual channel FMA3) is quicker than both at IIRC around 2.00 ms/iter. Will confirm #s when remote access to box possible. 1 worker 8 cores on all.[/QUOTE]
Yes, average ms/iter on R7 3800X is 2.022. Guess 2x16MB cache helps. The upcoming 5800X3D with 96MB cache will be interesting for larger FFTs. |
[QUOTE=sdbardwick;600999]Just a note on AVX-512 vs FMA3 on a dual channel board. Used CpuSupportsAVX512F=0 or 1 to toggle AVX-512.
[code]3200K FFT DCLL on 60198527 @4700-4698 MHz on all cores reported by CPU-Z for both FMA3 and AVX-512 runs.
AVX-512        FMA3
2.88 ms/iter   3.025 ms/iter
2.87 ms/iter   3.022 ms/iter
2.91 ms/iter   3.018 ms/iter[/code]
Side note: my Ryzen 7 3800X (dual channel FMA3) is quicker than both at IIRC around 2.00 ms/iter. Will confirm #s when remote access to box possible. 1 worker 8 cores on all.[/QUOTE]
What was the power usage like for AVX-512 vs FMA3? I suspect that FMA3 may be more efficient on your system(and many others). Was the same frequency held for the AVX-512 benchmark? |
[QUOTE=henryzz;601055]What was the power usage like for AVX-512 vs FMA3? I suspect that FMA3 may be more efficient on your system(and many others). Was the same frequency held for the AVX-512 benchmark?[/QUOTE]
According to Intel Extreme Tuning Utility, Package TDP for AVX-512 is 200W, FMA3 is 176W. Both stabilize at 4.7GHz, with the -512 running with an extra 0.1 V for core voltage. |
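One rough way to answer the efficiency question is energy per iteration: package watts times seconds per iteration. The Python sketch below combines the package power figures above with the mean ms/iter from the earlier 3200K FFT timings in this thread. It assumes those readings are steady-state averages and it ignores everything outside the CPU package (RAM, board, PSU losses), so it is a ballpark only; the names are made up for illustration.

```python
# Package power as reported by Intel XTU in the post above (watts), paired
# with the mean ms/iter of the three 3200K FFT timings posted earlier.
runs = {
    "AVX-512": {"watts": 200.0, "ms_per_iter": (2.88 + 2.87 + 2.91) / 3},
    "FMA3": {"watts": 176.0, "ms_per_iter": (3.025 + 3.022 + 3.018) / 3},
}


def joules_per_iter(watts: float, ms_per_iter: float) -> float:
    """Energy per iteration = power (W) * time per iteration (s)."""
    return watts * ms_per_iter / 1000.0


energy = {name: joules_per_iter(**r) for name, r in runs.items()}
for name, joules in energy.items():
    print(f"{name}: {joules:.3f} J/iter")
```

On these numbers FMA3 comes out somewhat lower in joules per iteration even though AVX-512 is a little faster, which fits the suspicion that FMA3 is the more efficient mode on this box.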
[QUOTE=sdbardwick;601083]According to Intel Extreme Tuning Utility, Package TDP for AVX-512 is 200W, FMA3 is 176W. Both stabilize at 4.7GHz, with the -512 running with an extra 0.1 V for core voltage.[/QUOTE]
I'm not familiar with the details of the apparent difference in running mprime in avx-512 vs fma3 mode. Could you enlighten me? My own code makes no such distinction, because AFAIK no Intel CPUs have the former without supporting the latter.
[b]Update:[/b] Paul Underwood has kindly agreed to run the stage 1 DC to completion, taking over from Mike/Xyzzy around iteration 9.4M. He's getting 502 ms/iter @512M FFT running 64c128t on his KNL, right around what I expected based on the 470 ms/iter I got on my KNL, which at 1.4 GHz clocks 0.1 GHz higher than his. At that rate, with ~5M iters left to go, the ETA for the DC is 29 days from now, assuming uninterrupted 24/7 running. |
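The ETA arithmetic can be checked with a few lines of Python. The takeover point is only given as "around iteration 9.4M", so done_iters below is an approximation, and the variable names are just for illustration.

```python
total_iters = 14_427_494   # stage 1 iterations needed (from the first post)
done_iters = 9_400_000     # approximate takeover point, "around iteration 9.4M"
sec_per_iter = 0.502       # 502 ms/iter at 512M FFT on the KNL

remaining = total_iters - done_iters
# 86,400 seconds per day, assuming uninterrupted 24/7 running.
eta_days = remaining * sec_per_iter / 86_400

print(f"{remaining:,} iterations left, about {eta_days:.1f} days")
```

This lands at a little over 29 days, in line with the estimate above.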
Intel plans to fuse off AVX-512 support on Alder Lake CPUs
even though it is on the chip. Previously they were kinda, possibly, maybe going to support it but have changed their minds. The link below is to my favorite leaks and rumors page, Gamer Meld. I have found them to be brand agnostic and very accurate. The Alder Lake section starts at 1:59 [URL]https://www.youtube.com/watch?v=LNQVX1YP7m4&t=207s[/URL] Up yours Intel !! |