mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Reply
 
Thread Tools
Old 2009-11-25, 16:49   #1
TheJudger
 
TheJudger's Avatar
 
"Oliver"
Mar 2005
Germany

33·41 Posts
Exclamation mfaktc: a CUDA program for Mersenne prefactoring

[edit by LaurV]

As the thread got longer and longer and people continue coming and asking "Where is this or that version of mfaktc?" I am editing this post to give the link to the folder which Xyzzy kindly created:

http://www.mersenneforum.org/mfaktc/

Here you can find many different versions of mfaktc, for different OS-es, different Cuda drivers, bla bla.
Select the suitable one for you, download it, clear some exponents!
Thanks!

[end of edit by LaurV]


Hi,

maybe a bit offtopic...

as a proof-of-concept I can do TF on my GTX 275 :)
A straight forward implementation of the multiprecision arithmetic gives me ~15.000.000 tested factors per second for a 26bit exponent and factors up to 71bit. (The numbers are out of my mind, I hope they are correct)
I've checked the correctness for some 1000 cases against a basic implementation using libgmp on CPU, no errors found. :)

Currently there is no presieving of factor candidates... but I've alot cycles free on the CPU. ;)

The quality of the code right now is.... "proof-of-concept" (means really ugly, alot of hardcoded stuff, ...) :D

Most time is spent on the calculation if the reminder (using a simple basecase division).
jasonp: if you read this: the montgomery multiplication suggested on the nvidia forum doesn't look suitable on CUDA for me. :(

TheJudger

Last fiddled with by LaurV on 2012-12-28 at 06:12
TheJudger is offline   Reply With Quote
Old 2009-11-25, 19:34   #2
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

23·32·72 Posts
Default

Quote:
Originally Posted by TheJudger View Post
jasonp: if you read this: the montgomery multiplication suggested on the nvidia forum doesn't look suitable on CUDA for me. :(
It doesn't map very well to CUDA, but there are few alternatives if you are not performing a very large number of multiplications with fixed modulus.
jasonp is offline   Reply With Quote
Old 2009-11-26, 17:09   #3
TheJudger
 
TheJudger's Avatar
 
"Oliver"
Mar 2005
Germany

33×41 Posts
Default

Hi jasonp!

Quote:
Originally Posted by jasonp View Post
It doesn't map very well to CUDA, but there are few alternatives if you are not performing a very large number of multiplications with fixed modulus.
Please tell me which alternatives you're thinking about.
Maybe you've allready tried by yourself some of them on CUDA?
TheJudger is offline   Reply With Quote
Old 2009-11-27, 02:43   #4
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

23·32·72 Posts
Default

Quote:
Originally Posted by TheJudger View Post
Please tell me which alternatives you're thinking about.
Maybe you've allready tried by yourself some of them on CUDA?
No, I went straight to writing the Montgomery code because that's what I know :)

The other alternatives are Barrett modular multiplication with a precomputed inverse, or split floating point / integer division. For a*b mod N, the latter uses the FPU to find the top 24 bits of the quotient q, converts to integer, does one step of Newton iteration and then subtracts q*N from a*b. This is the fastest method on x86 because the FPU can compute up to a 61-bit q in one step, but on a GPU floating point division is much faster than integer division, and floating point reciprocal instructions are faster still. So it may still be a win to do things that way.
jasonp is offline   Reply With Quote
Old 2009-11-27, 11:33   #5
TheJudger
 
TheJudger's Avatar
 
"Oliver"
Mar 2005
Germany

110710 Posts
Default

jasonp: thank you for your hints.

I had allready the barrett modular multiplicatoin on my radar but didn't take a deeper look at it so far.

Current speed: 30 million checks per second on a 26bit exponent with factors up to 71bit :)

Next on my to-do list:
- more flexible input (e.g. the exponents are still hardcoded in the hostcode)
- presieving of factor candidates
- reduce the amount of data transfered from/to device
- interleave host/device code
TheJudger is offline   Reply With Quote
Old 2009-12-03, 16:10   #6
TheJudger
 
TheJudger's Avatar
 
"Oliver"
Mar 2005
Germany

33×41 Posts
Default

Hi,

Next on my to-do list:
- more flexible input (e.g. the exponents are still hardcoded in the hostcode): DONE!
- presieving of factor candidates: 50%
- reduce the amount of data transfered from/to device: DONE!
- interleave host/device code: DONE!

Raw speed on my GTX 275: ~34 million factor candidates per second for a 26bit exponent.

presieving is supported in the code now, what is missing is a fast siever. The current code does "sieving" by calculation the remainder of each factor candidate which isn't fast, offcourse.

So with presieving factor candidates with primes <= 11 *lol* I can factor M66xxxxxx to 60 bits in 80 seconds. This removes 2/3 of the factor candidates, with a good siever this should increase ;) (the prime95 help notes that ~95% of the candidates are removed with the siever). If I can do the same this will be speedup of ~5-6 (~33% vs. ~5% remaining factor candidates) compared to the current version. :)

There is one known bug (introduced in the current version): if 2 (or more) factors are found at the same dataset often only one factor is returned.
It is even possible that the returned factor is wrong (mixup of both factors) in this case. I haven't noticed a mixup so far but I'm sure it can occur. :(

Last fiddled with by TheJudger on 2009-12-03 at 16:11
TheJudger is offline   Reply With Quote
Old 2009-12-05, 19:41   #7
TheJudger
 
TheJudger's Avatar
 
"Oliver"
Mar 2005
Germany

33×41 Posts
Default

Hi,

found a bug in my horrible "siever" (actually it is no sieve right now), it removed only the half of the factor candidates which are not 1 or 7 mod 8.

Sieving M66xxxxxx to 2^60 with presiving factor candidates up to 7: 57s
Sieving M66xxxxxx to 2^60 with presiving factor candidates up to 11: 55s (siever is too slow, causes some stalls on the GPU code)

I ran some tests from M66000000 to M66004000 and "verified" the factors known allready to the primenet and found some unknown factors (which I've checked in manually).

Last fiddled with by TheJudger on 2009-12-05 at 19:53
TheJudger is offline   Reply With Quote
Old 2009-12-06, 09:09   #8
henryzz
Just call me Henry
 
henryzz's Avatar
 
"David"
Sep 2007
Cambridge (GMT)

164016 Posts
Default

Does anyone else get the feeling especially the mods that TheJudger's stuff should get its own thread and not hijack this one?
henryzz is offline   Reply With Quote
Old 2009-12-06, 14:24   #9
TheJudger
 
TheJudger's Avatar
 
"Oliver"
Mar 2005
Germany

33×41 Posts
Default

henryzz: good idea!

Maybe a mod can move all posts related to tf on CUDA to a separate thread?

Last fiddled with by akruppa on 2009-12-07 at 12:41 Reason: Postings moved
TheJudger is offline   Reply With Quote
Old 2009-12-08, 09:48   #10
TheJudger
 
TheJudger's Avatar
 
"Oliver"
Mar 2005
Germany

33·41 Posts
Default

Hi,

improved siever (sieving the first 600 primes):
Sieving M66xxxxxx to 2^60: 17 seconds
Sieving M66xxxxxx to 2^64: 270 seconds
Sieving M66xxxxxx from 2^64 to 2^65: 270s seconds

I think I'm close to the first "public beta". :)
I have some ideas in my mind that I want to test/implement before that.

Regards,
TheJudger
TheJudger is offline   Reply With Quote
Old 2009-12-15, 09:54   #11
TheJudger
 
TheJudger's Avatar
 
"Oliver"
Mar 2005
Germany

33×41 Posts
Default

Hi,

my first attempt to reduce the number of sieve initialisations to one per group failed greatly... :(

The good news: the 2nd attempt works fine an took only ~10 lines of code :)

sieving the first 2000 primes:
TF M66xxxxxx to 2^60: 17 seconds (too much overhead for sieve initialisation...)

sieving the first 4000 primes:
TF M66xxxxxx to 2^64: 220 seconds :)

Last fiddled with by TheJudger on 2009-12-15 at 09:55
TheJudger is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
mfakto: an OpenCL program for Mersenne prefactoring Bdot GPU Computing 1631 2020-08-15 10:29
The P-1 factoring CUDA program firejuggler GPU Computing 748 2019-11-23 16:36
"CUDA runtime version 0.0" when running mfaktc.exe froderik GPU Computing 4 2016-10-30 15:29
World's second-dumbest CUDA program fivemack Programming 112 2015-02-12 22:51
World's dumbest CUDA program? xilman Programming 1 2009-11-16 10:26

All times are UTC. The time now is 10:32.

Sat Aug 15 10:32:55 UTC 2020 up 2 days, 7:08, 0 users, load averages: 1.30, 1.43, 1.40

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.