#12 | |
|
If I May
"Chris Halsall"
Sep 2002
Barbados
2·11²·47 Posts |
Quote:
Do you have more to show? Perspiring minds want to know. |
#13 | |
|
"mrh"
Oct 2018
Temecula, ca
2⁴·3² Posts |
Quote:
https://github.com/mrh42/vtf |
#14 | |
|
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2⁴×3×163 Posts |
Quote:
From a recent RTX2080 Super mfaktc run:
Code:
    Starting trial factoring M299000123 from 2^79 to 2^80 (409.48 GHz-days)
     k_min =  1010807125666800
     k_max =  2021614251333651
    Using GPU kernel "barrett87_mul32_gs"
    Date    Time | class   Pct |   time     ETA | GHz-d/day    Sieve     Wait
    Apr 26 15:48 |    0   0.1% | 12.420   3h18m |   2967.22    82485    n.a.%
    ...
    Apr 26 19:07 | 4617 100.0% | 12.495   0m00s |   2949.41    82485    n.a.%
    no factor for M299000123 from 2^79 to 2^80 [mfaktc 0.21 barrett87_mul32_gs]
    tf(): total time spent:  3h 19m  6.270s

Duration was 11946.27 seconds, or 0.138267 days.
Check mean sustained GHD/d: 409.48/0.138267 = 2961.516 GHD/d, 0.1189% higher than estimated above.

k range = k_max - k_min ~ 1.0108E15. Per the attachment of https://www.mersenneforum.org/showpo...18&postcount=9 the density of primes in that k range 1E15 to 2E15 is ~2.9%. Optimistically assuming mfaktc is approximating complete sieving, that would mean it is testing ~2.9E13 k values.
~2.9E13 / 11946.27 seconds ~ 2.4275E9 candidates/sec ~ 2.097E14 candidates/day.

2961.5 GHD/day for the RTX2080 Super, vs. its 3072 rating at https://www.mersenne.ca/mfaktc.php, is close, and the variation is explainable, since this RTX2080 Super is being operated at lower than nominal input power, and effective GHD/d varies by kernel and perhaps other variables. I tend to operate Radeon VIIs at reduced power too.

The Radeon VII GPU is rated for TF performance at https://www.mersenne.ca/mfaktc.php as 1114 GHD/d. That's 1114/3072 = 0.3626 of the RTX 2080 Super throughput.
If the 420M/sec figure is surviving candidates, that's 420E6/2.43E9 / 0.3626 ~ 47.7% of mfaktx's performance after adjusting for the expected GPU TF speed ratio.
If the 420M/sec figure is raw k values before even the wheel sieve, it's ~2.9% of 47.7%, or ~1.4%.
Achieving even 1% of state-of-the-art GPU software TF performance is more than nearly all GIMPS participants have coded.

You may be able to increase performance by using more of the concepts listed here. I see you are already using some, and specifically not using others (Montgomery, Barrett) yet.

It was unclear to me how you run the code on a specific exponent and bit level. The latest commit seems to contain exponent and k values inline in the code. I think it would make sense to repurpose some mfaktc/o code for implementing ini files, worktodo, checkpoint files, etc. at some point.

Last fiddled with by kriesel on 2023-04-27 at 16:45
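The arithmetic in this post can be re-run as a short script. This is only a sketch: every number is copied from the post, the 2.9% survivor fraction is the post's own optimistic assumption, and the variable names are made up for illustration. Because it uses the unrounded k range, the results differ slightly from the rounded figures quoted above.

```python
# Figures copied from the mfaktc log and the post above.
ghz_days = 409.48
elapsed_s = 3 * 3600 + 19 * 60 + 6.27          # 3h 19m 6.270s
elapsed_days = elapsed_s / 86400
sustained_ghd_per_day = ghz_days / elapsed_days

k_min = 1010807125666800
k_max = 2021614251333651
survivor_fraction = 0.029                      # assumed density after sieving
candidates = (k_max - k_min) * survivor_fraction
candidates_per_s = candidates / elapsed_s

# Ratings quoted from mersenne.ca in the post.
rtx_rating, radeon_rating = 3072.0, 1114.0
speed_ratio = radeon_rating / rtx_rating

vtf_rate = 420e6                               # k values/sec reported for vtf
share_if_survivors = vtf_rate / candidates_per_s / speed_ratio
share_if_raw = share_if_survivors * survivor_fraction

print(f"sustained: {sustained_ghd_per_day:.1f} GHD/d")
print(f"candidates/sec: {candidates_per_s:.3e}")
print(f"vtf share if 420M/s are survivors: {share_if_survivors:.1%}")
print(f"vtf share if 420M/s are raw k:     {share_if_raw:.1%}")
```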
#15 | |
|
"mrh"
Oct 2018
Temecula, ca
2⁴·3² Posts |
Ah, I think you are correct. I did a quick estimate of the time to complete with vtf versus mfakto, and vtf is maybe 20x slower. So maybe 140 GHD/d?

The current checked-in code can manage about 420M raw k-values/sec when using 512K threads (on M262359187). I have studied your TF reference manual quite a bit, thank you! I want to write a SqMod() that uses one of the advanced techniques, but I'm not quite there yet. I understand them in principle, but not enough just yet to actually write the code...

Yes, all the params on the CPU side are just hard-coded. I can almost convert from bit-level to k-value in my head now. :)

I might add some of the file interface stuff, but I'm not sure where to go with this. Does the world need another TF variant? Probably not. I did this to learn how the Vulkan stuff works, and learning about how trial factoring works was a bonus. (I wrote the TF code in Go first, so much easier to debug.)

At some point I want to understand how PRP works. The only way I know how to learn is by trying to implement it myself. But that seems quite a bit more difficult.
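For readers who cannot yet do the bit-level to k-value conversion in their heads, here is a minimal sketch. Candidate factors of M_p have the form q = 2·k·p + 1, so a bit level 2^b corresponds to k ≈ 2^b / (2p). The function names and the `NUM_CLASSES` constant are hypothetical. Rounding k_min down to a multiple of 4620 is an observation that reproduces the k_min printed in the mfaktc log quoted earlier in the thread, not a statement about how vtf or mfaktc is implemented.

```python
NUM_CLASSES = 4620  # assumption: chosen because it reproduces the quoted log


def k_bounds(p: int, bit_lo: int, bit_hi: int) -> tuple[int, int]:
    """Return (k_lo, k_hi) so that q = 2*k*p + 1 spans 2^bit_lo .. 2^bit_hi."""
    k_lo = (1 << bit_lo) // (2 * p)
    k_hi = (1 << bit_hi) // (2 * p)
    return k_lo, k_hi


def align_down(k: int, classes: int = NUM_CLASSES) -> int:
    """Round k down to a multiple of `classes`."""
    return k - k % classes


# Exponent and bit levels from the mfaktc run quoted earlier in the thread.
k_lo, k_hi = k_bounds(299000123, 79, 80)
print(align_down(k_lo), k_hi)
```

The same helper applied to M262359187 gives the k range for whichever bit level is being worked on.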
#16 | |
|
"mrh"
Oct 2018
Temecula, ca
2⁴·3² Posts |
Quote:
- Very minimal data movement between the CPU and GPU. Reading and writing GPU memory from the CPU is really slow. So it writes some tables before the first call, but after that the CPU only looks at a couple of uint64_t values per call.
- Try to keep all the GPU "threads" doing useful work. I found my GPU likes a lot of them, like 512K threads. So be very mindful of branching, etc. You don't want one edge case running while thousands are waiting for it to catch up.
- Keeping everyone working on something requires a little memory trade-off. So each thread starts with a range of 200-ish K values, and builds an array of 30-ish to pass to the next sieve stage. Something like 13 stages of list shortening gave good results. I think this can still be improved; the first stage should be broken up.
- This pre-work is done with 32-bit P and 64-bit K values, but it still does a number of integer mod operations. So I found using around 200 small primes for the sieve() was a good balance for saving work in the next step.
- For the K values that make it through the sieve, we finally create Q from P*2*K+1 and start squaring. This will perform 10*log2(Q) * log2(P)-ish simple operations on 96-bit extended uints. These are fast, but that is a lot of operations, 20K maybe? I'll work on this next.

That last step works well across the threads, but there is still some variation in the length of the lists of k-values to test between threads. So there are some threads waiting around that could be doing work. I don't know what to do about that; maybe threads within a work group could balance out the work lists. There are some mechanisms for that.

It was certainly an interesting exercise. It took me back to my early days of writing SIMD code in Lisp for a Connection Machine in 1990, maybe?

Last fiddled with by mrh on 2023-04-28 at 01:28
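The sieve-then-square pipeline described in the list above can be sketched on the CPU in a few lines. This is an illustration only, not the shader's code: the function names are invented, the q mod 8 filter is the standard property of Mersenne factors and is an assumption about what a sieve stage would check, and the multi-stage list shortening across GPU threads is not modeled. Python integers stand in for the 96-bit arithmetic.

```python
def small_primes(count: int) -> list[int]:
    """First `count` odd primes by trial division (fine for a few hundred)."""
    primes: list[int] = []
    n = 3
    while len(primes) < count:
        if all(n % s for s in primes if s * s <= n):
            primes.append(n)
        n += 2
    return primes


def sieve_k(p: int, k_start: int, k_count: int, primes: list[int]):
    """Yield k whose q = 2*k*p + 1 survives the cheap filters."""
    for k in range(k_start, k_start + k_count):
        q = 2 * k * p + 1
        if q % 8 not in (1, 7):          # factors of M_p are +-1 mod 8
            continue
        if any(q % s == 0 for s in primes if s != q):
            continue                     # q has a small prime divisor
        yield k


def divides_mersenne(p: int, q: int) -> bool:
    """True if q divides 2^p - 1, using left-to-right square-and-multiply."""
    r = 1
    for bit in bin(p)[2:]:
        r = r * r % q                    # the squaring step
        if bit == "1":
            r = 2 * r % q                # multiply by the base 2
    return r == 1


p = 11                                   # M11 = 2047 = 23 * 89
primes = small_primes(200)               # "around 200 small primes"
survivors = list(sieve_k(p, 1, 10, primes))
factors = [2 * k * p + 1 for k in survivors if divides_mersenne(p, 2 * k * p + 1)]
print(survivors, factors)
```

The loop in `divides_mersenne` runs once per bit of P, which is where the log2(P) factor in the operation count comes from; the cost of each modular squaring supplies the rest.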
#17 | |
|
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2⁴·3·163 Posts |
Quote:
Re who would need or want a Vulkan-based TF or PRP package for GIMPS, some scenarios:
1) OpenCL or CUDA driver not working, but Vulkan is
2) performance advantage
3) If it supported >32-bit exponents, TF on 4.3G-10G for mersenne.ca, or even higher exponents in special cases:
https://www.mersenneforum.org/showpo...04&postcount=5
https://www.mersenneforum.org/showpo...45&postcount=4
https://www.mersenneforum.org/showpo...4&postcount=11
Double Mersennes can be done on CUDA with mmff, but there's no OpenCL software for double Mersennes above MM31.
#18 |
|
"mrh"
Oct 2018
Temecula, ca
10010000₂ Posts |
I made a little bit more progress today. The little shader can now process almost 2.5 billion K-values/second. Maybe 100 GHz-d/day? I'll have to do some longer runs to be sure.
Not optimized yet, but it seems to work, as best I can tell. This is now using 128/256-bit uints. It has a Modulo function that uses floating point to estimate 128-bit multiples of the modulus to subtract, leaving an accurate remainder. https://github.com/mrh42/vtf/blob/main/tf.comp
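The idea of a float-estimated modulo can be sketched as follows. This is not the code in tf.comp; it is an assumed variant in which the floating-point quotient estimate is deliberately shrunk so it never overshoots, and the exact integer subtraction is repeated until the remainder is below the modulus. `float_mod` and `SHRINK` are hypothetical names, and Python integers stand in for the 128/256-bit uints.

```python
import random

SHRINK = 1.0 - 2.0 ** -40   # bias the estimate low so a never goes negative


def float_mod(a: int, m: int) -> int:
    """Reduce a modulo m using float quotient estimates plus exact subtraction.

    Assumes 0 <= a < 2**1024 so that float(a) does not overflow.
    """
    assert a >= 0 and m > 0
    while a >= m:
        est = int(float(a) / float(m) * SHRINK)   # underestimate of a // m
        if est < 1:
            est = 1                               # safe because a >= m here
        a -= est * m                              # exact big-integer subtraction
    return a


# Spot check against exact arithmetic: 256-bit values, 128-bit moduli.
rng = random.Random(1)
for _ in range(1000):
    m = rng.getrandbits(128) | 1
    a = rng.getrandbits(256)
    assert float_mod(a, m) == a % m
```

Each pass removes roughly 40 bits from the value, so a 256-bit input reduced by a 128-bit modulus needs only a handful of iterations. A GPU version would trade the loop for a fixed number of correction steps.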
#19 |
|
"mrh"
Oct 2018
Temecula, ca
2⁴×3² Posts |
Another Vulkan thing I learned: I can point my program at the Radeon VII Vulkan device, or at an "llvmpipe" device that runs on my CPU.
The unmodified SPIR-V code ran on my CPU, using 16 cores, at about 17M factors/sec (Xeon Gold 6146). Kinda cool for testing, I suppose.
#20 |
|
"mrh"
Oct 2018
Temecula, ca
2⁴×3² Posts |
The latest version of the TF compute shader code can process about 3.7 billion potential factors per second at 100W on my Radeon VII, and about 4.5 billion/sec at 250W. The main update was to use DEVICE_LOCAL memory for some data. Nowhere near mfakt* performance, but the code is also very simple.
This might be a useful starting point if someone wants to write something else for a GPU but doesn't want to start from scratch. https://github.com/mrh42/vtf/blob/main/tf.comp
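The two power settings can be compared in candidates per joule. The snippet below only re-does the arithmetic on the figures reported in this post; the variable names are illustrative.

```python
# (watts -> potential factors per second), as reported in the post above
runs = {100: 3.7e9, 250: 4.5e9}

# Candidates tested per joule at each power limit.
per_joule = {watts: rate / watts for watts, rate in runs.items()}

throughput_gain = runs[250] / runs[100] - 1          # extra speed at 250W
efficiency_ratio = per_joule[100] / per_joule[250]   # 100W vs 250W efficiency

print(f"100W: {per_joule[100]:.2e} candidates/J")
print(f"250W: {per_joule[250]:.2e} candidates/J")
print(f"250W is {throughput_gain:.0%} faster; 100W is {efficiency_ratio:.2f}x as efficient")
```

On these numbers, running at 2.5 times the power buys a bit over a fifth more throughput, which fits the earlier remark in the thread about operating Radeon VIIs at reduced power.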
#21 | |
|
If I May
"Chris Halsall"
Sep 2002
Barbados
2·11²·47 Posts |
Quote:
The Scientific Method is not everyone's comfort zone... Is there a simple "make" system available to build and run this? If not, is there an install guide? I am impressed by this work. |
#22 | |
|
"mrh"
Oct 2018
Temecula, ca
90₁₆ Posts |
Quote:
https://github.com/mrh42/vtf/blob/main/README.md
edit: I'm working on a version that doesn't require GL_EXT_shader_explicit_arithmetic_types_int64. I don't know if it will be better or worse. It's a bit of work, so it may not show up for a few days.
Last fiddled with by mrh on 2023-05-18 at 21:10