mfaktc: a CUDA program for Mersenne prefactoring
[B][edit by LaurV]
As the thread got longer and longer and people continue coming and asking "Where is this or that version of mfaktc?" I am editing this post to give the link to the folder which Xyzzy kindly created: [URL]http://www.mersenneforum.org/mfaktc/[/URL] Here you can find many different versions of mfaktc, for different OSes, different Cuda drivers, bla bla. Select the suitable one for you, download it, clear some exponents! :smile: Thanks! [end of edit by LaurV][/B] Hi, maybe a bit offtopic... as a proofofconcept I can do TF on my GTX 275 :) A straight forward implementation of the multiprecision arithmetic gives me ~15.000.000 tested factors per second for a 26bit exponent and factors up to 71bit. (The numbers are out of my mind, I hope they are correct) I've checked the correctness for some 1000 cases against a basic implementation using libgmp on CPU, no errors found. :) Currently there is no presieving of factor candidates... but I've alot cycles free on the CPU. ;) The quality of the code right now is.... "proofofconcept" (means really ugly, alot of hardcoded stuff, ...) :D Most time is spent on the calculation if the reminder (using a simple basecase division). jasonp: if you read this: the montgomery multiplication suggested on the nvidia forum doesn't look suitable on CUDA for me. :( TheJudger 
[QUOTE=TheJudger;196991]
jasonp: if you read this: the montgomery multiplication suggested on the nvidia forum doesn't look suitable on CUDA for me. :( [/QUOTE] It doesn't map very well to CUDA, but there are few alternatives if you are not performing a very large number of multiplications with fixed modulus. 
Hi jasonp!
[QUOTE=jasonp;197012]It doesn't map very well to CUDA, but there are few alternatives if you are not performing a very large number of multiplications with fixed modulus.[/QUOTE] Please tell me which alternatives you're thinking about. Maybe you've allready tried by yourself some of them on CUDA? 
[QUOTE=TheJudger;197083]
Please tell me which alternatives you're thinking about. Maybe you've allready tried by yourself some of them on CUDA?[/QUOTE] No, I went straight to writing the Montgomery code because that's what I know :) The other alternatives are Barrett modular multiplication with a precomputed inverse, or split floating point / integer division. For a*b mod N, the latter uses the FPU to find the top 24 bits of the quotient q, converts to integer, does one step of Newton iteration and then subtracts q*N from a*b. This is the fastest method on x86 because the FPU can compute up to a 61bit q in one step, but on a GPU floating point division is much faster than integer division, and floating point reciprocal instructions are faster still. So it may still be a win to do things that way. 
jasonp: thank you for your hints.
I had allready the barrett modular multiplicatoin on my radar but didn't take a deeper look at it so far. Current speed: 30 million checks per second on a 26bit exponent with factors up to 71bit :) Next on my todo list:  more flexible input (e.g. the exponents are still hardcoded in the hostcode)  presieving of factor candidates  reduce the amount of data transfered from/to device  interleave host/device code 
Hi,
Next on my todo list:  more flexible input (e.g. the exponents are still hardcoded in the hostcode): DONE!  presieving of factor candidates: 50%  reduce the amount of data transfered from/to device: DONE!  interleave host/device code: DONE! Raw speed on my GTX 275: ~34 million factor candidates per second for a 26bit exponent. presieving is supported in the code now, what is missing is a fast siever. The current code does "sieving" by calculation the remainder of each factor candidate which isn't fast, offcourse. So with presieving factor candidates with primes <= 11 *lol* I can factor M66xxxxxx to 60 bits in 80 seconds. This removes 2/3 of the factor candidates, with a good siever this should increase ;) (the prime95 help notes that ~95% of the candidates are removed with the siever). If I can do the same this will be speedup of ~56 (~33% vs. ~5% remaining factor candidates) compared to the current version. :) There is one known bug (introduced in the current version): if 2 (or more) factors are found at the same dataset often only one factor is returned. It is even possible that the returned factor is wrong (mixup of both factors) in this case. I haven't noticed a mixup so far but I'm sure it can occur. :( 
Hi,
found a bug in my horrible "siever" (actually it is no sieve right now), it removed only the half of the factor candidates which are not 1 or 7 mod 8. Sieving M66xxxxxx to 2^60 with presiving factor candidates up to 7: 57s Sieving M66xxxxxx to 2^60 with presiving factor candidates up to 11: 55s (siever is too slow, causes some stalls on the GPU code) I ran some tests from M66000000 to M66004000 and "verified" the factors known allready to the primenet and found some unknown factors (which I've checked in manually). 
Does anyone else get the feeling especially the mods that TheJudger's stuff should get its own thread and not hijack this one?

henryzz: good idea!
Maybe a mod can move all posts related to tf on CUDA to a separate thread? 
Hi,
improved siever (sieving the first 600 primes): Sieving M66xxxxxx to 2^60: 17 seconds Sieving M66xxxxxx to 2^64: 270 seconds Sieving M66xxxxxx from 2^64 to 2^65: 270s seconds I think I'm close to the first "public beta". :) I have some ideas in my mind that I want to test/implement before that. Regards, TheJudger 
Hi,
my first attempt to reduce the number of sieve initialisations to one per group failed greatly... :( The good news: the 2nd attempt works fine an took only ~10 lines of code :) sieving the first 2000 primes: TF M66xxxxxx to 2^60: 17 seconds (too much overhead for sieve initialisation...) sieving the first 4000 primes: TF M66xxxxxx to 2^64: 220 seconds :) 
All times are UTC. The time now is 12:29. 
Powered by vBulletin® Version 3.8.11
Copyright ©2000  2021, Jelsoft Enterprises Ltd.