mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Reply
 
Thread Tools
Old 2013-02-20, 20:39   #12
firejuggler
 
firejuggler's Avatar
 
"Vincent"
Apr 2010
Over the rainbow

22×7×103 Posts
Default

Since we were taking about mersenne test,my first ideas would include arrays,
multiple Millers-Rabin primality test, and modulo operation. But I realise that
would only work for the first phase of a P-1.

naive implementation :
split B1 range in as many core-1 as there is,
determine k, calculate 2*k*p+1, if prime (determined by MR test)=> k in array

last core would take care of the value in the array and determine if it divide Mp

Last fiddled with by firejuggler on 2013-02-20 at 20:47
firejuggler is offline   Reply With Quote
Old 2013-02-20, 23:39   #13
owftheevil
 
owftheevil's Avatar
 
"Carl Darby"
Oct 2012
Spring Mountains, Nevada

4738 Posts
Default

Quote:
Originally Posted by chalsall View Post
Hey, deliver a prototype P-1 GPU program and I'll give you my first born child.

(Please note: I'm intentionally childless.)
Does this mean that for a stage 1 only program, we would get half of your nonexistent first born?
owftheevil is offline   Reply With Quote
Old 2013-02-20, 23:56   #14
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

2×5,531 Posts
Default

Quote:
Originally Posted by owftheevil View Post
Does this mean that for a stage 1 only program, we would get half of your nonexistent first born?
Sure.

I'll even put you up in a hotel in the "Gap" for half a night....
chalsall is offline   Reply With Quote
Old 2013-02-23, 09:06   #15
nucleon
 
nucleon's Avatar
 
Mar 2003
Melbourne

5·103 Posts
Default

Thinking outside the box for a minute.

Is there any other primality/factoring tests that would easily suit GPUs and deliver pretty good search percentages over time? (similar to percentage chance of finding a factor close to the time of TF/P-1?)

Finding the round peg for a round hole analogy comes to mind.

-- Craig
nucleon is offline   Reply With Quote
Old 2013-02-26, 20:17   #16
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

5×2,351 Posts
Default

Quote:
Originally Posted by ET_ View Post
Some naif subjects to talk about:

- Parallelization of tasks
- Limitations due to the memory factor of the GPU (how far may we go having 0.5, 1, 2,3 or 6GB of memory?
- Limitations of the GPU shared memory.
- Description of steps 1 and 2 (from MersenneWiki I got a grasp of it, but a talk would explain more).
- use of streams to pass chunks of bytes to analyze.
- How to apply CuFFT library to the algorithm.
- Is a parallel Montgomery multiplication algorithm out of question for such algorithm?
Let's start with the last item: Montmul is used for multiply w.r.to *generic* moduli ... for Mersenne/Fermat/etc-mod arithmetic you want the specialized FFT-based implicit mod algos, same as for LL (or more general primality) testing.

All the core ops for p-1 are quite similar to what you know from the lens of LL testing, just a couple twists:

1. For stage 1 you need to decide whether you want to use a fixed bound B1 or allow for later "powering-up" of a previous stage 1 residue to a larger B1. Most efficient is a fixed B1, that way you can precompute the product of all the needed stage 1 prime powers and use the resulting bitstring for fast LR binary modpow;

2. For stage 2 you need to precompute a set of small powers of the stage 1 residue, and store those in RAM in forward-FFTed form. Those get pointwise-multiplied by a "live" forward-FFTed dataset (i.e. the dyadic-mul takes data from 2 vectors and combines them, as opposed to the single-vector-data squaring used in LL) and then iFFT/carry steps are as before. Multiple stage 2 intervals can run in parallel starting with a common stage 1 residue.

But what you really need to do first of all is some kind of "simple" implementation of the algorithm in HLL code ... useless to try to port an algorithm to specialized hardware unless you understand it in some detail.
ewmayer is offline   Reply With Quote
Old 2013-02-26, 21:31   #17
owftheevil
 
owftheevil's Avatar
 
"Carl Darby"
Oct 2012
Spring Mountains, Nevada

32·5·7 Posts
Default

Quote:
Originally Posted by ewmayer View Post
Let's start with the last item: Montmul is used for multiply w.r.to *generic* moduli ... for Mersenne/Fermat/etc-mod arithmetic you want the specialized FFT-based implicit mod algos, same as for LL (or more general primality) testing.

All the core ops for p-1 are quite similar to what you know from the lens of LL testing, just a couple twists:

1. For stage 1 you need to decide whether you want to use a fixed bound B1 or allow for later "powering-up" of a previous stage 1 residue to a larger B1. Most efficient is a fixed B1, that way you can precompute the product of all the needed stage 1 prime powers and use the resulting bitstring for fast LR binary modpow;

2. For stage 2 you need to precompute a set of small powers of the stage 1 residue, and store those in RAM in forward-FFTed form. Those get pointwise-multiplied by a "live" forward-FFTed dataset (i.e. the dyadic-mul takes data from 2 vectors and combines them, as opposed to the single-vector-data squaring used in LL) and then iFFT/carry steps are as before. Multiple stage 2 intervals can run in parallel starting with a common stage 1 residue.

But what you really need to do first of all is some kind of "simple" implementation of the algorithm in HLL code ... useless to try to port an algorithm to specialized hardware unless you understand it in some detail.
I haven't thought much about the stage 2 part yet, but for stage 1, The basic gpu kernels in CudaLucas can easily be modified to do everything expect providing the list of prime powers to multiply, multiplying that list of prime powers, and then finding the gcd at the end. The first two would not be that hard to implement, but I don't think I have the tenacity to code an efficient cuda gdc algorithm. However, there's no reason all this has to be done on the gpu.

Last fiddled with by owftheevil on 2013-02-26 at 21:32
owftheevil is offline   Reply With Quote
Old 2013-02-26, 22:04   #18
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

2DEB16 Posts
Default

GCD need be done infrequently enough - once at the end of stage 1 and perhaps every few hours during stage 2 - that one should just use some decent third-party code (GMP or George's optimized version of the Crandall/Buhler "giants library" GCD.
ewmayer is offline   Reply With Quote
Old 2013-02-27, 01:37   #19
owftheevil
 
owftheevil's Avatar
 
"Carl Darby"
Oct 2012
Spring Mountains, Nevada

32·5·7 Posts
Default

Pretty much what I had in mind. I was thinking at first to do the product of the prime powers with gmp too. Also, maybe its possible to save the result of stage 1 in mprime or prime95 readable format and let those programs finish up the stage 2. On many cards without much memory this would be useful.
owftheevil is offline   Reply With Quote
Old 2013-02-27, 02:39   #20
owftheevil
 
owftheevil's Avatar
 
"Carl Darby"
Oct 2012
Spring Mountains, Nevada

32·5·7 Posts
Default

Proto-cuda-pm1 now computes powers of 3. Now to teach it which powers of 3 to compute.

Anybody know where i can get some residues of relatively small powers of 3 mod 2^q - 1 for various value of q?
owftheevil is offline   Reply With Quote
Old 2013-02-27, 11:21   #21
henryzz
Just call me Henry
 
henryzz's Avatar
 
"David"
Sep 2007
Liverpool (GMT/BST)

178F16 Posts
Default

Quote:
Originally Posted by owftheevil View Post
Proto-cuda-pm1 now computes powers of 3. Now to teach it which powers of 3 to compute.

Anybody know where i can get some residues of relatively small powers of 3 mod 2^q - 1 for various value of q?
Run the following code through pfgw:
Code:
SCRIPT

DIM expo, 31
DIM base, 3
DIM max_n, 100
DIM min_n, 1
OPENFILEAPP r_file,results.txt



DIM n, min_n
DIM result, 0
DIM mers, 2^expo-1
DIMS rstr



LABEL next_n
POWMOD result,base,n,mers
WRITE r_file,result
SET n, n+1
IF (n<=max_n) THEN GOTO next_n

LABEL end
henryzz is offline   Reply With Quote
Old 2013-02-27, 13:20   #22
ET_
Banned
 
ET_'s Avatar
 
"Luigi"
Aug 2002
Team Italia

10010111110012 Posts
Default

Quote:
Originally Posted by henryzz View Post
Run the following code through pfgw:
Code:
SCRIPT

DIM expo, 31
DIM base, 3
DIM max_n, 100
DIM min_n, 1
OPENFILEAPP r_file,results.txt



DIM n, min_n
DIM result, 0
DIM mers, 2^expo-1
DIMS rstr



LABEL next_n
POWMOD result,base,n,mers
WRITE r_file,result
SET n, n+1
IF (n<=max_n) THEN GOTO next_n

LABEL end
Thanks Henry, you gave me a hint to solve another problem

Luigi
ET_ is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
mfaktc: a CUDA program for Mersenne prefactoring TheJudger GPU Computing 3622 2023-01-25 16:41
World's second-dumbest CUDA program fivemack Programming 112 2015-02-12 22:51
World's dumbest CUDA program? xilman Programming 1 2009-11-16 10:26
Factoring program need help Citrix Lone Mersenne Hunters 8 2005-09-16 02:31
Factoring program ET_ Programming 3 2003-11-25 02:57

All times are UTC. The time now is 00:58.


Sat Jan 28 00:58:57 UTC 2023 up 162 days, 22:27, 0 users, load averages: 0.93, 0.96, 1.05

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

≠ ± ∓ ÷ × · − √ ‰ ⊗ ⊕ ⊖ ⊘ ⊙ ≤ ≥ ≦ ≧ ≨ ≩ ≺ ≻ ≼ ≽ ⊏ ⊐ ⊑ ⊒ ² ³ °
∠ ∟ ° ≅ ~ ‖ ⟂ ⫛
≡ ≜ ≈ ∝ ∞ ≪ ≫ ⌊⌋ ⌈⌉ ∘ ∏ ∐ ∑ ∧ ∨ ∩ ∪ ⨀ ⊕ ⊗ 𝖕 𝖖 𝖗 ⊲ ⊳
∅ ∖ ∁ ↦ ↣ ∩ ∪ ⊆ ⊂ ⊄ ⊊ ⊇ ⊃ ⊅ ⊋ ⊖ ∈ ∉ ∋ ∌ ℕ ℤ ℚ ℝ ℂ ℵ ℶ ℷ ℸ 𝓟
¬ ∨ ∧ ⊕ → ← ⇒ ⇐ ⇔ ∀ ∃ ∄ ∴ ∵ ⊤ ⊥ ⊢ ⊨ ⫤ ⊣ … ⋯ ⋮ ⋰ ⋱
∫ ∬ ∭ ∮ ∯ ∰ ∇ ∆ δ ∂ ℱ ℒ ℓ
𝛢𝛼 𝛣𝛽 𝛤𝛾 𝛥𝛿 𝛦𝜀𝜖 𝛧𝜁 𝛨𝜂 𝛩𝜃𝜗 𝛪𝜄 𝛫𝜅 𝛬𝜆 𝛭𝜇 𝛮𝜈 𝛯𝜉 𝛰𝜊 𝛱𝜋 𝛲𝜌 𝛴𝜎𝜍 𝛵𝜏 𝛶𝜐 𝛷𝜙𝜑 𝛸𝜒 𝛹𝜓 𝛺𝜔