Vulkan compute shaders
I've been trying out coding up a few simple things using Vulkan compute shaders, instead of OpenCL. It seems pretty cool so far.
I was thinking of taking a shot at converting mfakt[oc] into a Vulkan mfaktv. It would be fun to compare performance. Has anyone already tried something like this? Is it a dumb idea?
Last time I looked it seemed hard to do generic compute with Vulkan, because Vulkan is so focused on rendering, but that might have changed, and I have too little experience to be a good judge. It's worth exploring, especially if you've already had success implementing simple but definitely non-render-type compute in Vulkan. It might be much harder than OpenCL to get running performantly on different hardware types, and there might not be much to gain besides compatibility and maybe less overhead, but AFAIK no one has seriously tried and these are just my assumptions.
If you've found good resources on doing generic compute on Vulkan, please post them here; it's an interesting rabbit hole to learn about.
[QUOTE=M344587487;626509]Last time I looked it seemed hard to do generic compute with vulkan because vulkan is so focused on rendering, but that might have changed and I have little experience to be a good judge. It's worth exploring especially if you've had success implementing more trivial but definitely non-render-type compute in vulkan already. It might be much harder than OpenCL to get running performantly on different hardware types, and there might not be much to gain besides compatibility and maybe less overhead, but AFAIK no one has seriously tried and these are just my assumptions.
If you've found good resources on doing generic compute on vulkan please post them here, it's an interesting rabbit hole to learn about.[/QUOTE] I found [URL="https://github.com/Erkaman/vulkan_minimal_compute"]this[/URL] example and thought it did a nice job showing how to get started on the CPU side. I started with that. I think I can figure out the GPU compute side. I have a lot to learn wrt other things, for example the Vulkan memory model. I was able to build and run just by installing a few packages from the standard Ubuntu distro. I'll keep digging into it and see what happens. Thanks!
So I changed my path a bit. I've started converting gpuowl.cl into GLSL that compiles to SPIR-V to be used as a Vulkan compute shader. I'm testing little bits as I go, and learning a lot about gpuowl and Vulkan along the way. Fun! The bit-wise manipulation of floats/doubles had me stuck for a bit, but I've figured out how to make that work well now.
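For anyone hitting the same wall: the kind of bit-level float access involved can be sketched in Go (my illustration of the general idea only, not the actual gpuowl/GLSL code; GLSL exposes the same bits through functions like floatBitsToUint):

```go
package main

import (
	"fmt"
	"math"
)

// doubleBits splits an IEEE-754 double into its sign, biased
// exponent, and mantissa fields, the same bit-level access the
// shader needs for floats/doubles.
func doubleBits(x float64) (sign, exp, mant uint64) {
	bits := math.Float64bits(x)
	sign = bits >> 63
	exp = (bits >> 52) & 0x7FF
	mant = bits & ((1 << 52) - 1)
	return
}

func main() {
	// 1.5 = +1.1b * 2^0: sign 0, biased exponent 1023, mantissa 2^51
	s, e, m := doubleBits(1.5)
	fmt.Println(s, e, m) // 0 1023 2251799813685248
}
```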
Seems to be a lot of work, but useful. Respect!
Have you looked into this or something like it? [url]https://github.com/google/clspv[/url]
No idea how far along that project is, and it probably isn't as simple as pressing a button to transpile, but with luck it may be able to do some of the legwork.
[QUOTE=M344587487;626936]Have you looked into this or something like it? [url]https://github.com/google/clspv[/url]
No idea how far along that project is and it probably isn't as simple as press a button to transpile,but with luck it may be able to do some legwork.[/QUOTE] I did, but I think I should look at it again. Thanks!
I had a little spare time, so I wrote a very simple trial-factoring implementation as a Vulkan compute shader. It implements simple 96- or 128-bit integer math. It doesn't yet do any fancy Montgomery, Barrett, etc. optimizations. I wanted to make sure I understood enough to make a working TF first.
On a Radeon VII it can test > 100M k-values/sec, so not mfakt[co] speeds, but still interesting. Shader code is here: [url]https://github.com/mrh42/vtf/blob/main/tf.comp[/url]
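For anyone following along, the core TF test, checking whether a candidate q = 2kP+1 divides 2^P-1 via modular exponentiation, can be sketched in Go with 64-bit math (an illustration of the idea only, not the shader's 96/128-bit code, so it only works while q fits in 64 bits):

```go
package main

import (
	"fmt"
	"math/bits"
)

// mulmod computes (a*b) mod m through a full 128-bit intermediate,
// standing in for the shader's wide-integer limbs. Requires a, b < m.
func mulmod(a, b, m uint64) uint64 {
	hi, lo := bits.Mul64(a, b)
	_, rem := bits.Div64(hi, lo, m)
	return rem
}

// isFactor reports whether q = 2*k*p + 1 divides 2^p - 1,
// i.e. whether 2^p ≡ 1 (mod q), using square-and-multiply.
func isFactor(p, k uint64) bool {
	q := 2*k*p + 1
	r := uint64(1)
	for i := 63 - bits.LeadingZeros64(p); i >= 0; i-- {
		r = mulmod(r, r, q) // square
		if p&(1<<uint(i)) != 0 {
			r = mulmod(r, 2, q) // multiply by the base 2
		}
	}
	return r == 1
}

func main() {
	// M11 = 2047 = 23 * 89, and 23 = 2*1*11 + 1, so k = 1 is a hit.
	fmt.Println(isFactor(11, 1)) // true
}
```

The real kernel does exactly this loop per surviving candidate, just on wider limbs.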
Kudos! How did you choose a value for NK? What GPU utilization do you see in GPU-Z or similar, vs cpu utilization of the process in Task Manager (Win) / top (Linux)?
[CODE]const int ntestprimes = 160; // tweak as needed[/CODE]This seems small to me. You may gain more performance by sieving out more candidate factors, with more known small primes.
[QUOTE=kriesel;629349]Kudos! How did you choose a value for NK? What GPU utilization do you see in GPU-Z or similar, vs cpu utilization of the process in Task Manager (Win) / top (Linux)?
[CODE]const int ntestprimes = 160; // tweak as needed[/CODE]This seems small to me. You may gain more performance by sieving out more candidate factors, with more known small primes.[/QUOTE] Thanks! I chose NK, the chunk of work for each thread, to keep the time running on the GPU to around 150ms. Longer than that seems to make the console unusable; it seems that when a compute shader is running, everything else waits. Still looking at this.

The CPU does basically nothing: mapping/unmapping memory, checking the 64-bit results, and setting K for the next round. The Vulkan calls don't seem to add any overhead that I can see. It will max out my GPU at 250W if I let it run at 2200MHz. (On Linux.)

You are correct: as it is now, the sieve lets about 12% through. It should do better, down to around 8% I think, but it doesn't at the moment. That function needs a little work to make it more pipelined, maybe.

Now, if someone had an fft/ntt-based Square-mod function I could use... :)
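The filters that set that survival rate can be sketched in Go (a toy single-threaded illustration of the idea, with a made-up ten-prime list rather than the shader's staged ~200-prime sieve):

```go
package main

import "fmt"

// survives applies the cheap filters every real factor q = 2*k*p + 1
// of 2^p - 1 must pass: q ≡ 1 or 7 (mod 8), and q not divisible by
// any small sieve prime.
func survives(p, k uint64, primes []uint64) bool {
	q := 2*k*p + 1
	if m := q % 8; m != 1 && m != 7 {
		return false // factors of Mersenne numbers are ±1 mod 8
	}
	for _, s := range primes {
		if q%s == 0 {
			return false
		}
	}
	return true
}

func main() {
	primes := []uint64{3, 5, 7, 11, 13, 17, 19, 23, 29, 31}
	p, n, kept := uint64(262359187), 100000, 0
	for k := uint64(1); k <= uint64(n); k++ {
		if survives(p, k, primes) {
			kept++
		}
	}
	fmt.Printf("%.1f%% of k values survive\n", 100*float64(kept)/float64(n))
}
```

Adding more primes shrinks the surviving fraction, at the cost of more mod operations per k; that's the balance being tuned.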
With a few optimizations, it can test about 420M factors/second, or about 15 GHz-days/day (I think). Pretty good considering the really slow/simple squaring algorithm.
[QUOTE=mrh;629469]With a few optimizations, it can test about 420M factors/second, or about 15 ghz-days/day (I think). Pretty good considering the really slow/simple squaring algorithm.[/QUOTE]
We welcome the empirical around these here parts. Do you have more to show? Perspiring minds want to know.
[QUOTE=chalsall;629471]We welcome the empirical around these here parts.
Do you have more to show? Perspiring minds want to know.[/QUOTE] It is still a little rough, I only wanted to determine what could and could not be done with a compute shader. [url]https://github.com/mrh42/vtf[/url]
[QUOTE=mrh;629469]With a few optimizations, it can test about 420M factors/second, or about 15 ghz-days/day (I think). Pretty good considering the really slow/simple squaring algorithm.[/QUOTE]420 vs 100 is an admirable jump in performance. 15 GHD/d sounds anomalously low to me for 420M factor candidates/sec.
From a recent RTX 2080 Super mfaktc run,
[code]Starting trial factoring M299000123 from 2^79 to 2^80 (409.48 GHz-days)
k_min = 1010807125666800
k_max = 2021614251333651
Using GPU kernel "barrett87_mul32_gs"
   Date  Time | class   Pct |   time    ETA | GHz-d/day  Sieve   Wait
Apr 26 15:48 |     0   0.1% | 12.420  3h18m |   2967.22  82485  n.a.%
...
Apr 26 19:07 |  4617 100.0% | 12.495  0m00s |   2949.41  82485  n.a.%
no factor for M299000123 from 2^79 to 2^80 [mfaktc 0.21 barrett87_mul32_gs]
tf(): total time spent:  3h 19m  6.270s[/code]
Let's suppose that on average that is 2958 GHD/d. Duration was 11946.27 seconds, or 0.138267 days. Check mean sustained GHD/d: 409.48/0.138267 = 2961.5 GHD/d, 0.1189% higher than estimated above.

k range = k_max - k_min ~ 1.0108E15. Per the attachment of [url]https://www.mersenneforum.org/showpost.php?p=621318&postcount=9[/url], the density of primes in that k range (1E15 to 2E15) is ~2.9%. Optimistically assuming mfaktc approximates complete sieving, that would mean it is testing ~2.9E13 k values. ~2.9E13 / 11946.27 seconds ~ 2.4275E9 candidates/sec ~ 2.097E14 candidates/day. 2961.5 GHD/d for this RTX 2080 Super, vs. its 3072 rating at [url]https://www.mersenne.ca/mfaktc.php[/url], is close, and the difference is explainable, since this RTX 2080 Super is being operated at lower than nominal input power, and effective GHD/d varies by kernel and perhaps other variables. I tend to operate Radeon VIIs at reduced power too.

The Radeon VII GPU is rated for TF performance at [url]https://www.mersenne.ca/mfaktc.php[/url] as 1114 GHD/d. That's 1114/3072 = 0.3626 of the RTX 2080 Super throughput. If the 420M/sec figure is surviving candidates, that's 420E6/2.43E9 / 0.3626 ~ 47.7% of mfaktc's performance after adjusting for the GPUs' expected TF speed ratio. If the 420M/sec figure is raw k values before even the wheel sieve, it's ~2.9% of 47.7%, or ~1.4%. Achieving even 1% of state-of-the-art GPU TF software performance is more than nearly all GIMPS participants have coded.
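The sustained-rate arithmetic above can be checked mechanically; a trivial Go snippet using only the figures quoted from the log:

```go
package main

import "fmt"

func main() {
	// Cross-check the sustained throughput of the quoted mfaktc run.
	const ghd = 409.48       // assignment size in GHz-days
	const seconds = 11946.27 // tf(): total time spent
	days := seconds / 86400
	fmt.Printf("%.1f GHD/day sustained\n", ghd/days) // 2961.5 GHD/day sustained
}
```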
You may be able to increase performance by using more of the concepts listed [URL="https://www.mersenneforum.org/showthread.php?p=508523#post508523"]here[/URL]. I see you are already using some, and specifically not using others (Montgomery, Barrett) yet. It was unclear to me how you run the code on a specific exponent and bit level. The latest commit seems to contain exponent and k values inline in the code. I think it would make sense to repurpose some mfaktc/o code for implementing ini files, worktodo, checkpoint files, etc at some point.
Ah, I think you are correct. I did a quick time-to-complete estimate comparing vtf and mfakto, and vtf is maybe 20x slower. So maybe 140 GHD/d?
The current checked-in code can manage about 420M raw k-values/sec when using 512K threads (on M262359187). I have studied your TF reference manual quite a bit, thank you! I want to write a SqMod() that uses one of the advanced techniques, but I'm not quite there yet. I understand them in principle, but not enough yet to actually write the code...

Yes, all the params on the CPU side are just hard-coded. I can almost convert from bit level to k-value in my head now. :) I might add some of the file interface stuff, but I'm not sure where to go with this. Does the world need another TF variant? Probably not. I did this to learn how the Vulkan stuff works, and learning how trial factoring works was a bonus. (I wrote the TF code in Go first; so much easier to debug.)

At some point I want to understand how PRP works. The only way I know how to learn is trying to implement it myself. But that seems quite a bit more difficult. [QUOTE=kriesel;629512]420 vs 100 is an admirable jump in performance. 15 GHD/d sounds anomalously low to me for 420M factor candidates/sec. ... If the 420M/sec figure is surviving candidates, that's 420E6/2.43E9 / 0.3626 ~ 47.7% of mfaktx's performance after adjusting for GPU TF expected speed ratio. If the 420M/sec figure is raw k values before even the wheel sieve, it's ~2.9% of 47.7%, or ~1.4%. Achieving even 1% of state of the art GPU software TF performance is more than nearly all GIMPS participants have coded. You may be able to increase performance by using more of the concepts listed [URL="https://www.mersenneforum.org/showthread.php?p=508523#post508523"]here[/URL]. I see you are already using some, and specifically not using others (Montgomery, Barrett) yet. It was unclear to me how you run the code on a specific exponent and bit level. The latest commit seems to contain exponent and k values inline in the code.
I think it would make sense to repurpose some mfaktc/o code for implementing ini files, worktodo, checkpoint files, etc at some point.[/QUOTE]
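For readers who want the bit-level-to-k conversion spelled out: since q = 2kP+1, bit level b covers roughly k in [2^(b-1)/(2P), 2^b/(2P)). A Go sketch (my illustration; vtf itself hard-codes these values):

```go
package main

import (
	"fmt"
	"math/big"
)

// kRange converts a TF bit level to the k range it covers, from
// q = 2*k*p + 1: candidate factors q in [2^(b-1), 2^b) correspond
// to k in roughly [2^(b-1)/(2p), 2^b/(2p)). big.Int is used because
// bit levels above 64 overflow uint64.
func kRange(p uint64, bitLevel uint) (kMin, kMax *big.Int) {
	twoP := new(big.Int).SetUint64(2 * p)
	kMin = new(big.Int).Lsh(big.NewInt(1), bitLevel-1)
	kMin.Div(kMin, twoP)
	kMax = new(big.Int).Lsh(big.NewInt(1), bitLevel)
	kMax.Div(kMax, twoP)
	return
}

func main() {
	// The exponent and bit level from the mfaktc log quoted upthread.
	kMin, kMax := kRange(299000123, 80)
	fmt.Println(kMin, kMax)
}
```

The printed range should land very close to the k_min/k_max in that log (mfaktc rounds its bounds to class boundaries, so the last digits differ).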
[QUOTE=mrh;629501]It is still a little rough, I only wanted to determine what could and could not be done with a compute shader.
[url]https://github.com/mrh42/vtf[/url][/QUOTE] In case folks are interested, here are some things I did or thought about to make it fast:

- Very minimal data movement between the CPU and GPU. R/W to/from GPU memory is really slow from the CPU. So it writes some tables before the first call, but after that the CPU only looks at a couple of uint64_t values per call.
- Try to keep all the GPU "threads" doing useful work. I found my GPU likes a lot of threads, like 512k. So be very mindful of branching, etc. You don't want one edge case running while 1000s of threads are waiting for it to catch up.
- Keeping everyone working on something requires a little memory trade-off. Each thread starts with a range of 200-ish K values and builds an array of 30-ish to pass to the next sieve stage. Something like 13 stages of list shortening gave good results. I think this can still be improved; the first stage should be broken up.
- This pre-work is done with 32-bit P and 64-bit K values, but it still does a number of integer mod operations. I found that using around 200 small primes for sieve() was a good balance for saving work in the next step.
- For the K values that make it through the sieve, we finally create Q from P*2*K+1 and start squaring. This performs 10*log2(Q)*log2(P)-ish simple operations on 96-bit extended uints. These are fast, but that is a lot of operations, 20K maybe? I'll work on this next.

That last step works well across the threads, but there is still some variation between threads in the length of the lists of k-values to test. So there are some threads waiting around that could be doing work. I don't know what to do about that; maybe threads within a workgroup could balance out the work lists. There are some mechanisms for that.

It was certainly an interesting exercise; it took me back to my early days of writing SIMD code in Lisp for a Connection Machine, in 1990 maybe?
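The staged list-shortening idea can be sketched in Go (a toy single-threaded illustration with made-up batch sizes; the real thing runs one list per GPU thread with tuned stage counts):

```go
package main

import "fmt"

// sieveStage is one stage of list shortening: take the surviving k
// values, test q = 2*k*p + 1 against the next batch of sieve primes,
// and emit a shorter list for the next stage.
func sieveStage(p uint64, ks, primes []uint64) []uint64 {
	out := make([]uint64, 0, len(ks))
	for _, k := range ks {
		q := 2*k*p + 1
		ok := true
		for _, s := range primes {
			if q%s == 0 {
				ok = false
				break
			}
		}
		if ok {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	p := uint64(11)
	ks := make([]uint64, 0, 200)
	for k := uint64(1); k <= 200; k++ {
		ks = append(ks, k)
	}
	// Run survivors through successive prime batches; each stage
	// touches a shorter list than the one before.
	for _, batch := range [][]uint64{{3, 5}, {7, 11, 13}, {17, 19, 23}} {
		ks = sieveStage(p, ks, batch)
		fmt.Println(len(ks), "k values remain")
	}
}
```

The payoff of staging is that the expensive later primes only ever see the short lists the cheap early primes left behind.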
[QUOTE=mrh;629354]Now, if someone had a fft/ntt based Square-mod function I could use... :)[/QUOTE][url]https://github.com/DTolm/VkFFT[/url] may be useful, and offers competitive performance.
Re: who would need or want a Vulkan-based TF or PRP package for GIMPS, some scenarios:
1) OpenCL or CUDA driver not working, but Vulkan is
2) performance advantage
3) if it supported >32-bit exponents, TF on 4.3G-10G for mersenne.ca, or even higher exponents in special cases:
[url]https://www.mersenneforum.org/showpost.php?p=512904&postcount=5[/url]
[url]https://www.mersenneforum.org/showpost.php?p=567245&postcount=4[/url]
[url]https://www.mersenneforum.org/showpost.php?p=622464&postcount=11[/url]
Double Mersennes run on CUDA with mmff, but there's no OpenCL software for double Mersennes above MM31.
I made a little more progress today. The little shader can now process almost 2.5 billion K-values/second. Maybe 100 GHz-d/day? I'll have to do some longer runs to be sure.
Not optimized yet, but it seems to work, best I can tell. This is now using 128/256-bit uints. It has a Modulo function that uses floating point to estimate 128-bit values to subtract, leaving an accurate remainder. [url]https://github.com/mrh42/vtf/blob/main/tf.comp[/url]
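The floating-point-estimated modulo idea, sketched on 64-bit values in Go (my reconstruction of the general technique, not the shader's 128-bit code; it assumes m is at least 2 and small enough, and the estimate close enough, that the signed correction can't overflow):

```go
package main

import "fmt"

// fmod64 estimates the quotient n/m with a double, subtracts the
// estimated multiple of m, then corrects the small leftover error
// exactly, yielding a correct remainder despite the inexact estimate.
func fmod64(n, m uint64) uint64 {
	est := uint64(float64(n) / float64(m)) // may be off by a little either way
	r := int64(n - est*m)                  // wraps negative if est was too high
	for r < 0 {
		r += int64(m) // estimate overshot: add m back
	}
	for uint64(r) >= m {
		r -= int64(m) // estimate undershot: keep reducing
	}
	return uint64(r)
}

func main() {
	fmt.Println(fmod64(100, 7)) // 2
}
```

The float divide is cheap on a GPU, and the exact integer correction afterwards repairs whatever rounding error the estimate introduced.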
Another vulkan thing I learned. I can point my program at the Radeon VII vulkan device, or an "llvmpipe" device that runs on my CPU.
The unmodified SPIR-V code ran on my CPU, using 16 cores, at about 17M factors/sec (Xeon Gold 6146). Kinda cool for testing, I suppose.
The latest version of the TF compute shader code can process about 3.7 billion potential factors per second at 100W on my Radeon VII, and about 4.5 billion/sec at 250W. The main update was to use DEVICE_LOCAL memory for some data. Nowhere near mfakt* performance, but the code is also very simple.
This might be a useful starting point if someone wants to write something else for a GPU but doesn't want to start from scratch. [url]https://github.com/mrh42/vtf/blob/main/tf.comp[/url]
[QUOTE=mrh;630871]The latest version of the TF compute shader code can process about 3.7 billion potential factors per second at 100W on my Radeon VII. About 4.5 billion/sec at 250W. ...
This might be a useful starting point if someone wants to write something else for a GPU but doesn't want to start from scratch.[/QUOTE] Thank you for doing this. It takes some confidence to publish code, and then take peer review. The Scientific Method is not everyone's comfort zone... Is there a simple "make" system available to build and run this? If not, any install guide? I am impressed by this work.
[QUOTE=chalsall;630896]Thank you for doing this. It takes some confidence to publish code, and then take peer review.
The Scientific Method is not everyone's comfort zone... Is there a simple "make" system available to build and run this? If not, any install guide? I am impressed by this work.[/QUOTE] Thanks! I wasn't sure if there was interest. For now I've added some basic instructions to the README. I'll clean up the CPU side a bit and add a makefile in a while. [url]https://github.com/mrh42/vtf/blob/main/README.md[/url]

edit: I'm working on a version that doesn't require GL_EXT_shader_explicit_arithmetic_types_int64. I don't know if it will be better or worse. It's a bit of work, so it may not show up for a few days.
The initial version of the shader using 96/192-bit math based on 32-bit uints is about 40% faster.
Not 100% tested yet, but it seems promising. I'll probably drop the 64-bit version. This will also make it easy to support K values up to 96 bits. [url]https://github.com/mrh42/vtf/blob/main/tf32.comp[/url]
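The wide-math-from-32-bit-limbs approach can be sketched in Go with math/bits add-with-carry (my illustration of the limb idea, not tf32.comp itself):

```go
package main

import (
	"fmt"
	"math/bits"
)

// U96 is a 96-bit unsigned integer as three 32-bit limbs,
// little-endian, mirroring the idea of building wide math
// out of plain uints.
type U96 [3]uint32

// Add returns a+b mod 2^96, propagating the carry across limbs
// exactly the way a shader without 64-bit types would.
func (a U96) Add(b U96) U96 {
	var r U96
	var c uint32
	r[0], c = bits.Add32(a[0], b[0], 0)
	r[1], c = bits.Add32(a[1], b[1], c)
	r[2], _ = bits.Add32(a[2], b[2], c)
	return r
}

func main() {
	x := U96{0xFFFFFFFF, 0xFFFFFFFF, 0} // 2^64 - 1
	y := U96{1, 0, 0}
	fmt.Println(x.Add(y)) // [0 0 1], i.e. the carry ripples into the third limb: 2^64
}
```

Multiplication follows the same pattern with 32x32→64-bit partial products, which is why dropping the int64 extension is mostly mechanical work.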
[QUOTE=mrh;630986]The initial version of the shader using 96/192-bit math based on 32-bit uints is about 40% faster.[/QUOTE]
So... Perhaps that suggests that is the best path. Always ask for guidance, of course. I tend to learn from others.