mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

TheJudger 2009-11-25 16:49

mfaktc: a CUDA program for Mersenne prefactoring
 
[B][edit by LaurV]

As the thread got longer and longer and people continue coming and asking "Where is this or that version of mfaktc?" I am editing this post to give the link to the folder which Xyzzy kindly created:

[URL]http://www.mersenneforum.org/mfaktc/[/URL]

Here you can find many different versions of mfaktc, for different OS-es, different Cuda drivers, bla bla.
Select the suitable one for you, download it, clear some exponents! :smile:
Thanks!

[end of edit by LaurV][/B]

Hi,

maybe a bit offtopic...

as a proof-of-concept I can do TF on my GTX 275 :)
A straight forward implementation of the multiprecision arithmetic gives me ~15.000.000 tested factors per second for a 26bit exponent and factors up to 71bit. (The numbers are out of my mind, I hope they are correct)
I've checked the correctness for some 1000 cases against a basic implementation using libgmp on CPU, no errors found. :)

Currently there is no presieving of factor candidates... but I've alot cycles free on the CPU. ;)

The quality of the code right now is.... "proof-of-concept" (means really ugly, alot of hardcoded stuff, ...) :D

Most time is spent on the calculation if the reminder (using a simple basecase division).
jasonp: if you read this: the montgomery multiplication suggested on the nvidia forum doesn't look suitable on CUDA for me. :(

TheJudger

jasonp 2009-11-25 19:34

[QUOTE=TheJudger;196991]
jasonp: if you read this: the montgomery multiplication suggested on the nvidia forum doesn't look suitable on CUDA for me. :(
[/QUOTE]
It doesn't map very well to CUDA, but there are few alternatives if you are not performing a very large number of multiplications with fixed modulus.

TheJudger 2009-11-26 17:09

Hi jasonp!

[QUOTE=jasonp;197012]It doesn't map very well to CUDA, but there are few alternatives if you are not performing a very large number of multiplications with fixed modulus.[/QUOTE]

Please tell me which alternatives you're thinking about.
Maybe you've allready tried by yourself some of them on CUDA?

jasonp 2009-11-27 02:43

[QUOTE=TheJudger;197083]
Please tell me which alternatives you're thinking about.
Maybe you've allready tried by yourself some of them on CUDA?[/QUOTE]
No, I went straight to writing the Montgomery code because that's what I know :)

The other alternatives are Barrett modular multiplication with a precomputed inverse, or split floating point / integer division. For a*b mod N, the latter uses the FPU to find the top 24 bits of the quotient q, converts to integer, does one step of Newton iteration and then subtracts q*N from a*b. This is the fastest method on x86 because the FPU can compute up to a 61-bit q in one step, but on a GPU floating point division is much faster than integer division, and floating point reciprocal instructions are faster still. So it may still be a win to do things that way.

TheJudger 2009-11-27 11:33

jasonp: thank you for your hints.

I had allready the barrett modular multiplicatoin on my radar but didn't take a deeper look at it so far.

Current speed: 30 million checks per second on a 26bit exponent with factors up to 71bit :)

Next on my to-do list:
- more flexible input (e.g. the exponents are still hardcoded in the hostcode)
- presieving of factor candidates
- reduce the amount of data transfered from/to device
- interleave host/device code

TheJudger 2009-12-03 16:10

Hi,

Next on my to-do list:
- more flexible input (e.g. the exponents are still hardcoded in the hostcode): DONE!
- presieving of factor candidates: 50%
- reduce the amount of data transfered from/to device: DONE!
- interleave host/device code: DONE!

Raw speed on my GTX 275: ~34 million factor candidates per second for a 26bit exponent.

presieving is supported in the code now, what is missing is a fast siever. The current code does "sieving" by calculation the remainder of each factor candidate which isn't fast, offcourse.

So with presieving factor candidates with primes <= 11 *lol* I can factor M66xxxxxx to 60 bits in 80 seconds. This removes 2/3 of the factor candidates, with a good siever this should increase ;) (the prime95 help notes that ~95% of the candidates are removed with the siever). If I can do the same this will be speedup of ~5-6 (~33% vs. ~5% remaining factor candidates) compared to the current version. :)

There is one known bug (introduced in the current version): if 2 (or more) factors are found at the same dataset often only one factor is returned.
It is even possible that the returned factor is wrong (mixup of both factors) in this case. I haven't noticed a mixup so far but I'm sure it can occur. :(

TheJudger 2009-12-05 19:41

Hi,

found a bug in my horrible "siever" (actually it is no sieve right now), it removed only the half of the factor candidates which are not 1 or 7 mod 8.

Sieving M66xxxxxx to 2^60 with presiving factor candidates up to 7: 57s
Sieving M66xxxxxx to 2^60 with presiving factor candidates up to 11: 55s (siever is too slow, causes some stalls on the GPU code)

I ran some tests from M66000000 to M66004000 and "verified" the factors known allready to the primenet and found some unknown factors (which I've checked in manually).

henryzz 2009-12-06 09:09

Does anyone else get the feeling especially the mods that TheJudger's stuff should get its own thread and not hijack this one?

TheJudger 2009-12-06 14:24

henryzz: good idea!

Maybe a mod can move all posts related to tf on CUDA to a separate thread?

TheJudger 2009-12-08 09:48

Hi,

improved siever (sieving the first 600 primes):
Sieving M66xxxxxx to 2^60: 17 seconds
Sieving M66xxxxxx to 2^64: 270 seconds
Sieving M66xxxxxx from 2^64 to 2^65: 270s seconds

I think I'm close to the first "public beta". :)
I have some ideas in my mind that I want to test/implement before that.

Regards,
TheJudger

TheJudger 2009-12-15 09:54

Hi,

my first attempt to reduce the number of sieve initialisations to one per group failed greatly... :(

The good news: the 2nd attempt works fine an took only ~10 lines of code :)

sieving the first 2000 primes:
TF M66xxxxxx to 2^60: 17 seconds (too much overhead for sieve initialisation...)

sieving the first 4000 primes:
TF M66xxxxxx to 2^64: 220 seconds :)


All times are UTC. The time now is 01:53.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.