20091125, 16:49  #1 
"Oliver"
Mar 2005
Germany
11×101 Posts 
mfaktc: a CUDA program for Mersenne prefactoring
[edit by LaurV]
As the thread got longer and longer and people continue coming and asking "Where is this or that version of mfaktc?" I am editing this post to give the link to the folder which Xyzzy kindly created: http://www.mersenneforum.org/mfaktc/ Here you can find many different versions of mfaktc, for different OSes, different Cuda drivers, bla bla. Select the suitable one for you, download it, clear some exponents! Thanks! [end of edit by LaurV] Hi, maybe a bit offtopic... as a proofofconcept I can do TF on my GTX 275 :) A straight forward implementation of the multiprecision arithmetic gives me ~15.000.000 tested factors per second for a 26bit exponent and factors up to 71bit. (The numbers are out of my mind, I hope they are correct) I've checked the correctness for some 1000 cases against a basic implementation using libgmp on CPU, no errors found. :) Currently there is no presieving of factor candidates... but I've alot cycles free on the CPU. ;) The quality of the code right now is.... "proofofconcept" (means really ugly, alot of hardcoded stuff, ...) :D Most time is spent on the calculation if the reminder (using a simple basecase division). jasonp: if you read this: the montgomery multiplication suggested on the nvidia forum doesn't look suitable on CUDA for me. :( TheJudger Last fiddled with by LaurV on 20121228 at 06:12 
20091125, 19:34  #2 
Tribal Bullet
Oct 2004
2×29×61 Posts 
It doesn't map very well to CUDA, but there are few alternatives if you are not performing a very large number of multiplications with fixed modulus.

20091126, 17:09  #3  
"Oliver"
Mar 2005
Germany
11×101 Posts 
Hi jasonp!
Quote:
Maybe you've allready tried by yourself some of them on CUDA? 

20091127, 02:43  #4  
Tribal Bullet
Oct 2004
2×29×61 Posts 
Quote:
The other alternatives are Barrett modular multiplication with a precomputed inverse, or split floating point / integer division. For a*b mod N, the latter uses the FPU to find the top 24 bits of the quotient q, converts to integer, does one step of Newton iteration and then subtracts q*N from a*b. This is the fastest method on x86 because the FPU can compute up to a 61bit q in one step, but on a GPU floating point division is much faster than integer division, and floating point reciprocal instructions are faster still. So it may still be a win to do things that way. 

20091127, 11:33  #5 
"Oliver"
Mar 2005
Germany
11×101 Posts 
jasonp: thank you for your hints.
I had allready the barrett modular multiplicatoin on my radar but didn't take a deeper look at it so far. Current speed: 30 million checks per second on a 26bit exponent with factors up to 71bit :) Next on my todo list:  more flexible input (e.g. the exponents are still hardcoded in the hostcode)  presieving of factor candidates  reduce the amount of data transfered from/to device  interleave host/device code 
20091203, 16:10  #6 
"Oliver"
Mar 2005
Germany
11×101 Posts 
Hi,
Next on my todo list:  more flexible input (e.g. the exponents are still hardcoded in the hostcode): DONE!  presieving of factor candidates: 50%  reduce the amount of data transfered from/to device: DONE!  interleave host/device code: DONE! Raw speed on my GTX 275: ~34 million factor candidates per second for a 26bit exponent. presieving is supported in the code now, what is missing is a fast siever. The current code does "sieving" by calculation the remainder of each factor candidate which isn't fast, offcourse. So with presieving factor candidates with primes <= 11 *lol* I can factor M66xxxxxx to 60 bits in 80 seconds. This removes 2/3 of the factor candidates, with a good siever this should increase ;) (the prime95 help notes that ~95% of the candidates are removed with the siever). If I can do the same this will be speedup of ~56 (~33% vs. ~5% remaining factor candidates) compared to the current version. :) There is one known bug (introduced in the current version): if 2 (or more) factors are found at the same dataset often only one factor is returned. It is even possible that the returned factor is wrong (mixup of both factors) in this case. I haven't noticed a mixup so far but I'm sure it can occur. :( Last fiddled with by TheJudger on 20091203 at 16:11 
20091205, 19:41  #7 
"Oliver"
Mar 2005
Germany
11·101 Posts 
Hi,
found a bug in my horrible "siever" (actually it is no sieve right now), it removed only the half of the factor candidates which are not 1 or 7 mod 8. Sieving M66xxxxxx to 2^60 with presiving factor candidates up to 7: 57s Sieving M66xxxxxx to 2^60 with presiving factor candidates up to 11: 55s (siever is too slow, causes some stalls on the GPU code) I ran some tests from M66000000 to M66004000 and "verified" the factors known allready to the primenet and found some unknown factors (which I've checked in manually). Last fiddled with by TheJudger on 20091205 at 19:53 
20091206, 09:09  #8 
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
2^{2}×13×113 Posts 
Does anyone else get the feeling especially the mods that TheJudger's stuff should get its own thread and not hijack this one?

20091206, 14:24  #9 
"Oliver"
Mar 2005
Germany
11×101 Posts 
henryzz: good idea!
Maybe a mod can move all posts related to tf on CUDA to a separate thread? Last fiddled with by akruppa on 20091207 at 12:41 Reason: Postings moved 
20091208, 09:48  #10 
"Oliver"
Mar 2005
Germany
2127_{8} Posts 
Hi,
improved siever (sieving the first 600 primes): Sieving M66xxxxxx to 2^60: 17 seconds Sieving M66xxxxxx to 2^64: 270 seconds Sieving M66xxxxxx from 2^64 to 2^65: 270s seconds I think I'm close to the first "public beta". :) I have some ideas in my mind that I want to test/implement before that. Regards, TheJudger 
20091215, 09:54  #11 
"Oliver"
Mar 2005
Germany
11·101 Posts 
Hi,
my first attempt to reduce the number of sieve initialisations to one per group failed greatly... :( The good news: the 2nd attempt works fine an took only ~10 lines of code :) sieving the first 2000 primes: TF M66xxxxxx to 2^60: 17 seconds (too much overhead for sieve initialisation...) sieving the first 4000 primes: TF M66xxxxxx to 2^64: 220 seconds :) Last fiddled with by TheJudger on 20091215 at 09:55 
Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
mfakto: an OpenCL program for Mersenne prefactoring  Bdot  GPU Computing  1668  20201222 15:38 
The P1 factoring CUDA program  firejuggler  GPU Computing  753  20201212 18:07 
grmfaktc: a CUDA program for generalized repunits prefactoring  MrRepunit  GPU Computing  32  20201111 19:56 
mfaktc 0.21  CUDA runtime wrong  keisentraut  Software  2  20200818 07:03 
World's seconddumbest CUDA program  fivemack  Programming  112  20150212 22:51 