mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Reply
 
Thread Tools
Old 2011-12-01, 04:33   #12
Christenson
 
Christenson's Avatar
 
Dec 2010
Monticello

5·359 Posts
Default

Luigi:
I think you need to remember Jasonp's comment about speeding up Block Lanczos...namely, ask what happens if you have every GPU processor sieve the heck out of its (tiny) 64K block of memory, and return sparse results?

Anyway, TF process on mfaktc/mfakto is roughly: sieve on CPU for candidate factors, then calculate 2^p-1 mod candidate factors, success if result is 0.
Christenson is offline   Reply With Quote
Old 2011-12-01, 04:50   #13
LaurV
Romulan Interpreter
 
LaurV's Avatar
 
"name field"
Jun 2011
Thailand

41×251 Posts
Default

Quote:
Originally Posted by Christenson View Post
TF process on mfaktc/mfakto is roughly: sieve (on CPU) for candidate factors, then calculate (on GPU) 2^p-1 mod candidate factors.
you missed the most important, my addition in red
LaurV is offline   Reply With Quote
Old 2011-12-01, 06:52   #14
Christenson
 
Christenson's Avatar
 
Dec 2010
Monticello

5×359 Posts
Default

Quote:
Originally Posted by LaurV View Post
you missed the most important, my addition in red
Indeed...but what I am trying to get across to Luigi is that with 500+ parallel processors on a GPU, each with a "tiny" chunk of memory, that the naive eratosthenes sieve algorithm works as follows:

Set up large array of prime candidates in memory. Each processor gets a different prime number, starting at 2, and writes all the locations corresponding to multiples to "not prime".

But this is not the best way to use the GPU -- Instead, each processor has memory corresponding to a small range of prime number candidates, and then all processors in parallel evaluate whether each prime, starting at 2, has multiples within its range of memory(prime candidates) to be marked "not prime".

Some computation gets repeated, but enough long latencies to global memory get avoided to make up for it.

Likewise, mfaktc could do the same thing on a GPU when sieving for factor candidates to calculate 2^p-1 mod factor candidate. I'm thinking other sieving algorithms could and do take the same approach for efficient GPU implementation.

Now, before RDS comes along and tells me the above descriptions are naive, I am very aware that both mfaktc and jasonp's programs do a *lot* more optimisation over what's described above.
Christenson is offline   Reply With Quote
Old 2011-12-01, 09:48   #15
ET_
Banned
 
ET_'s Avatar
 
"Luigi"
Aug 2002
Team Italia

5·7·139 Posts
Default

Quote:
Originally Posted by Christenson View Post
Indeed...but what I am trying to get across to Luigi is that with 500+ parallel processors on a GPU, each with a "tiny" chunk of memory, that the naive eratosthenes sieve algorithm works as follows:

Set up large array of prime candidates in memory. Each processor gets a different prime number, starting at 2, and writes all the locations corresponding to multiples to "not prime".

But this is not the best way to use the GPU -- Instead, each processor has memory corresponding to a small range of prime number candidates, and then all processors in parallel evaluate whether each prime, starting at 2, has multiples within its range of memory(prime candidates) to be marked "not prime".

Some computation gets repeated, but enough long latencies to global memory get avoided to make up for it.

Likewise, mfaktc could do the same thing on a GPU when sieving for factor candidates to calculate 2^p-1 mod factor candidate. I'm thinking other sieving algorithms could and do take the same approach for efficient GPU implementation.

Now, before RDS comes along and tells me the above descriptions are naive, I am very aware that both mfaktc and jasonp's programs do a *lot* more optimisation over what's described above.
Actually, the time used in the "sieve" can be negligible if you do it once at start and just walk through the array of classes. Factor5 does it that way (even if I only used 16 classes out of 60 instead of 960 out of 4620).

I initially used to sieve a (very) big array holding all numbers: if you follow that path, then each processor may cancel its composites from the array. Then I changed the source and cut the big array into 16 (short) residual classes, calculated the first element of each class and cleared all multiples by addition: no need to resieve, just work with the "distances" between cleared positions. In Factor5 this is (already) done by each thread, i.e. could be handled by GPU processors. I did not (yet) study the sieve algorithm used by mfaktc, so I can't speak about it, but afaik it receives a range of Ks to test, so maybe the same idea is used.

Luigi
ET_ is offline   Reply With Quote
Reply



Similar Threads
Thread Thread Starter Forum Replies Last Post
AGM algorithms for ln(x) ewmayer Other Mathematical Topics 17 2018-09-12 16:06
algorithms for special factorizations jjcale Factoring 6 2011-07-28 02:06
Quantum Algorithms? ShiningArcanine Lounge 4 2007-12-16 12:11
Implementing algorithms, did I do this right? ShiningArcanine Programming 18 2005-12-29 21:47
do you think there are better algorithms to be discovered? ixfd64 Lounge 5 2004-03-29 06:07

All times are UTC. The time now is 15:08.


Fri Jul 7 15:08:55 UTC 2023 up 323 days, 12:37, 0 users, load averages: 1.30, 1.19, 1.16

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

≠ ± ∓ ÷ × · − √ ‰ ⊗ ⊕ ⊖ ⊘ ⊙ ≤ ≥ ≦ ≧ ≨ ≩ ≺ ≻ ≼ ≽ ⊏ ⊐ ⊑ ⊒ ² ³ °
∠ ∟ ° ≅ ~ ‖ ⟂ ⫛
≡ ≜ ≈ ∝ ∞ ≪ ≫ ⌊⌋ ⌈⌉ ∘ ∏ ∐ ∑ ∧ ∨ ∩ ∪ ⨀ ⊕ ⊗ 𝖕 𝖖 𝖗 ⊲ ⊳
∅ ∖ ∁ ↦ ↣ ∩ ∪ ⊆ ⊂ ⊄ ⊊ ⊇ ⊃ ⊅ ⊋ ⊖ ∈ ∉ ∋ ∌ ℕ ℤ ℚ ℝ ℂ ℵ ℶ ℷ ℸ 𝓟
¬ ∨ ∧ ⊕ → ← ⇒ ⇐ ⇔ ∀ ∃ ∄ ∴ ∵ ⊤ ⊥ ⊢ ⊨ ⫤ ⊣ … ⋯ ⋮ ⋰ ⋱
∫ ∬ ∭ ∮ ∯ ∰ ∇ ∆ δ ∂ ℱ ℒ ℓ
𝛢𝛼 𝛣𝛽 𝛤𝛾 𝛥𝛿 𝛦𝜀𝜖 𝛧𝜁 𝛨𝜂 𝛩𝜃𝜗 𝛪𝜄 𝛫𝜅 𝛬𝜆 𝛭𝜇 𝛮𝜈 𝛯𝜉 𝛰𝜊 𝛱𝜋 𝛲𝜌 𝛴𝜎𝜍 𝛵𝜏 𝛶𝜐 𝛷𝜙𝜑 𝛸𝜒 𝛹𝜓 𝛺𝜔