![]() |
mfaktc: a CUDA program for Mersenne prefactoring
[B][edit by LaurV]
As the thread got longer and longer and people continue coming and asking "Where is this or that version of mfaktc?" I am editing this post to give the link to the folder which Xyzzy kindly created: [URL]http://www.mersenneforum.org/mfaktc/[/URL] Here you can find many different versions of mfaktc, for different OS-es, different Cuda drivers, bla bla. Select the suitable one for you, download it, clear some exponents! :smile: Thanks! [end of edit by LaurV][/B] Hi, maybe a bit offtopic... as a proof-of-concept I can do TF on my GTX 275 :) A straight forward implementation of the multiprecision arithmetic gives me ~15.000.000 tested factors per second for a 26bit exponent and factors up to 71bit. (The numbers are out of my mind, I hope they are correct) I've checked the correctness for some 1000 cases against a basic implementation using libgmp on CPU, no errors found. :) Currently there is no presieving of factor candidates... but I've alot cycles free on the CPU. ;) The quality of the code right now is.... "proof-of-concept" (means really ugly, alot of hardcoded stuff, ...) :D Most time is spent on the calculation if the reminder (using a simple basecase division). jasonp: if you read this: the montgomery multiplication suggested on the nvidia forum doesn't look suitable on CUDA for me. :( TheJudger |
[QUOTE=TheJudger;196991]
jasonp: if you read this: the montgomery multiplication suggested on the nvidia forum doesn't look suitable on CUDA for me. :( [/QUOTE] It doesn't map very well to CUDA, but there are few alternatives if you are not performing a very large number of multiplications with fixed modulus. |
Hi jasonp!
[QUOTE=jasonp;197012]It doesn't map very well to CUDA, but there are few alternatives if you are not performing a very large number of multiplications with fixed modulus.[/QUOTE] Please tell me which alternatives you're thinking about. Maybe you've allready tried by yourself some of them on CUDA? |
[QUOTE=TheJudger;197083]
Please tell me which alternatives you're thinking about. Maybe you've allready tried by yourself some of them on CUDA?[/QUOTE] No, I went straight to writing the Montgomery code because that's what I know :) The other alternatives are Barrett modular multiplication with a precomputed inverse, or split floating point / integer division. For a*b mod N, the latter uses the FPU to find the top 24 bits of the quotient q, converts to integer, does one step of Newton iteration and then subtracts q*N from a*b. This is the fastest method on x86 because the FPU can compute up to a 61-bit q in one step, but on a GPU floating point division is much faster than integer division, and floating point reciprocal instructions are faster still. So it may still be a win to do things that way. |
jasonp: thank you for your hints.
I had allready the barrett modular multiplicatoin on my radar but didn't take a deeper look at it so far. Current speed: 30 million checks per second on a 26bit exponent with factors up to 71bit :) Next on my to-do list: - more flexible input (e.g. the exponents are still hardcoded in the hostcode) - presieving of factor candidates - reduce the amount of data transfered from/to device - interleave host/device code |
Hi,
Next on my to-do list: - more flexible input (e.g. the exponents are still hardcoded in the hostcode): DONE! - presieving of factor candidates: 50% - reduce the amount of data transfered from/to device: DONE! - interleave host/device code: DONE! Raw speed on my GTX 275: ~34 million factor candidates per second for a 26bit exponent. presieving is supported in the code now, what is missing is a fast siever. The current code does "sieving" by calculation the remainder of each factor candidate which isn't fast, offcourse. So with presieving factor candidates with primes <= 11 *lol* I can factor M66xxxxxx to 60 bits in 80 seconds. This removes 2/3 of the factor candidates, with a good siever this should increase ;) (the prime95 help notes that ~95% of the candidates are removed with the siever). If I can do the same this will be speedup of ~5-6 (~33% vs. ~5% remaining factor candidates) compared to the current version. :) There is one known bug (introduced in the current version): if 2 (or more) factors are found at the same dataset often only one factor is returned. It is even possible that the returned factor is wrong (mixup of both factors) in this case. I haven't noticed a mixup so far but I'm sure it can occur. :( |
Hi,
found a bug in my horrible "siever" (actually it is no sieve right now), it removed only the half of the factor candidates which are not 1 or 7 mod 8. Sieving M66xxxxxx to 2^60 with presiving factor candidates up to 7: 57s Sieving M66xxxxxx to 2^60 with presiving factor candidates up to 11: 55s (siever is too slow, causes some stalls on the GPU code) I ran some tests from M66000000 to M66004000 and "verified" the factors known allready to the primenet and found some unknown factors (which I've checked in manually). |
Does anyone else get the feeling especially the mods that TheJudger's stuff should get its own thread and not hijack this one?
|
henryzz: good idea!
Maybe a mod can move all posts related to tf on CUDA to a separate thread? |
Hi,
improved siever (sieving the first 600 primes): Sieving M66xxxxxx to 2^60: 17 seconds Sieving M66xxxxxx to 2^64: 270 seconds Sieving M66xxxxxx from 2^64 to 2^65: 270s seconds I think I'm close to the first "public beta". :) I have some ideas in my mind that I want to test/implement before that. Regards, TheJudger |
Hi,
my first attempt to reduce the number of sieve initialisations to one per group failed greatly... :( The good news: the 2nd attempt works fine an took only ~10 lines of code :) sieving the first 2000 primes: TF M66xxxxxx to 2^60: 17 seconds (too much overhead for sieve initialisation...) sieving the first 4000 primes: TF M66xxxxxx to 2^64: 220 seconds :) |
All times are UTC. The time now is 12:29. |
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.