![]() |
![]() |
#1 |
"Oliver"
Mar 2005
Germany
2·13·43 Posts |
![]()
[edit by LaurV (2012-Dec-28)]
As the thread got longer and longer and people continue coming and asking "Where is this or that version of mfaktc?" I am editing this post to give the link to the folder which Xyzzy kindly created: https://www.mersenneforum.org/mfaktc/ Here you can find many different versions of mfaktc, for different OS-es, different Cuda drivers, bla bla. Select the suitable one for you, download it, clear some exponents! ![]() Thanks! [end of edit by LaurV] [edit by James Heinrich (2022-Nov-06)] The mersenneforum mirror hasn't been updated in several years, current mfaktc builds are available here: https://download.mersenne.ca/mfaktc/ [end of edit by James Heinrich] Hi, maybe a bit offtopic... as a proof-of-concept I can do TF on my GTX 275 :) A straight forward implementation of the multiprecision arithmetic gives me ~15.000.000 tested factors per second for a 26bit exponent and factors up to 71bit. (The numbers are out of my mind, I hope they are correct) I've checked the correctness for some 1000 cases against a basic implementation using libgmp on CPU, no errors found. :) Currently there is no presieving of factor candidates... but I've alot cycles free on the CPU. ;) The quality of the code right now is.... "proof-of-concept" (means really ugly, alot of hardcoded stuff, ...) :D Most time is spent on the calculation if the reminder (using a simple basecase division). jasonp: if you read this: the montgomery multiplication suggested on the nvidia forum doesn't look suitable on CUDA for me. :( TheJudger Last fiddled with by James Heinrich on 2022-11-06 at 16:21 |
![]() |
![]() |
![]() |
#2 |
Tribal Bullet
Oct 2004
3×1,193 Posts |
![]()
It doesn't map very well to CUDA, but there are few alternatives if you are not performing a very large number of multiplications with fixed modulus.
|
![]() |
![]() |
![]() |
#3 | |
"Oliver"
Mar 2005
Germany
2×13×43 Posts |
![]()
Hi jasonp!
Quote:
Maybe you've allready tried by yourself some of them on CUDA? |
|
![]() |
![]() |
![]() |
#4 | |
Tribal Bullet
Oct 2004
3·1,193 Posts |
![]() Quote:
The other alternatives are Barrett modular multiplication with a precomputed inverse, or split floating point / integer division. For a*b mod N, the latter uses the FPU to find the top 24 bits of the quotient q, converts to integer, does one step of Newton iteration and then subtracts q*N from a*b. This is the fastest method on x86 because the FPU can compute up to a 61-bit q in one step, but on a GPU floating point division is much faster than integer division, and floating point reciprocal instructions are faster still. So it may still be a win to do things that way. |
|
![]() |
![]() |
![]() |
#5 |
"Oliver"
Mar 2005
Germany
2·13·43 Posts |
![]()
jasonp: thank you for your hints.
I had allready the barrett modular multiplicatoin on my radar but didn't take a deeper look at it so far. Current speed: 30 million checks per second on a 26bit exponent with factors up to 71bit :) Next on my to-do list: - more flexible input (e.g. the exponents are still hardcoded in the hostcode) - presieving of factor candidates - reduce the amount of data transfered from/to device - interleave host/device code |
![]() |
![]() |
![]() |
#6 |
"Oliver"
Mar 2005
Germany
2·13·43 Posts |
![]()
Hi,
Next on my to-do list: - more flexible input (e.g. the exponents are still hardcoded in the hostcode): DONE! - presieving of factor candidates: 50% - reduce the amount of data transfered from/to device: DONE! - interleave host/device code: DONE! Raw speed on my GTX 275: ~34 million factor candidates per second for a 26bit exponent. presieving is supported in the code now, what is missing is a fast siever. The current code does "sieving" by calculation the remainder of each factor candidate which isn't fast, offcourse. So with presieving factor candidates with primes <= 11 *lol* I can factor M66xxxxxx to 60 bits in 80 seconds. This removes 2/3 of the factor candidates, with a good siever this should increase ;) (the prime95 help notes that ~95% of the candidates are removed with the siever). If I can do the same this will be speedup of ~5-6 (~33% vs. ~5% remaining factor candidates) compared to the current version. :) There is one known bug (introduced in the current version): if 2 (or more) factors are found at the same dataset often only one factor is returned. It is even possible that the returned factor is wrong (mixup of both factors) in this case. I haven't noticed a mixup so far but I'm sure it can occur. :( Last fiddled with by TheJudger on 2009-12-03 at 16:11 |
![]() |
![]() |
![]() |
#7 |
"Oliver"
Mar 2005
Germany
2×13×43 Posts |
![]()
Hi,
found a bug in my horrible "siever" (actually it is no sieve right now), it removed only the half of the factor candidates which are not 1 or 7 mod 8. Sieving M66xxxxxx to 2^60 with presiving factor candidates up to 7: 57s Sieving M66xxxxxx to 2^60 with presiving factor candidates up to 11: 55s (siever is too slow, causes some stalls on the GPU code) I ran some tests from M66000000 to M66004000 and "verified" the factors known allready to the primenet and found some unknown factors (which I've checked in manually). Last fiddled with by TheJudger on 2009-12-05 at 19:53 |
![]() |
![]() |
![]() |
#8 |
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
11000010110002 Posts |
![]()
Does anyone else get the feeling especially the mods that TheJudger's stuff should get its own thread and not hijack this one?
|
![]() |
![]() |
![]() |
#9 |
"Oliver"
Mar 2005
Germany
21368 Posts |
![]()
henryzz: good idea!
Maybe a mod can move all posts related to tf on CUDA to a separate thread? Last fiddled with by akruppa on 2009-12-07 at 12:41 Reason: Postings moved |
![]() |
![]() |
![]() |
#10 |
"Oliver"
Mar 2005
Germany
111810 Posts |
![]()
Hi,
improved siever (sieving the first 600 primes): Sieving M66xxxxxx to 2^60: 17 seconds Sieving M66xxxxxx to 2^64: 270 seconds Sieving M66xxxxxx from 2^64 to 2^65: 270s seconds I think I'm close to the first "public beta". :) I have some ideas in my mind that I want to test/implement before that. Regards, TheJudger |
![]() |
![]() |
![]() |
#11 |
"Oliver"
Mar 2005
Germany
2×13×43 Posts |
![]()
Hi,
my first attempt to reduce the number of sieve initialisations to one per group failed greatly... :( The good news: the 2nd attempt works fine an took only ~10 lines of code :) sieving the first 2000 primes: TF M66xxxxxx to 2^60: 17 seconds (too much overhead for sieve initialisation...) sieving the first 4000 primes: TF M66xxxxxx to 2^64: 220 seconds :) Last fiddled with by TheJudger on 2009-12-15 at 09:55 |
![]() |
![]() |
![]() |
Thread Tools | |
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
mfakto: an OpenCL program for Mersenne prefactoring | Bdot | GPU Computing | 1724 | 2023-06-04 23:31 |
gr-mfaktc: a CUDA program for generalized repunits prefactoring | MrRepunit | GPU Computing | 42 | 2022-12-18 05:59 |
The P-1 factoring CUDA program | firejuggler | GPU Computing | 753 | 2020-12-12 18:07 |
mfaktc 0.21 - CUDA runtime wrong | keisentraut | Software | 2 | 2020-08-18 07:03 |
World's second-dumbest CUDA program | fivemack | Programming | 112 | 2015-02-12 22:51 |