![]() |
|
|
#12 |
|
Dec 2010
Monticello
5·359 Posts |
Luigi:
I think you need to remember Jasonp's comment about speeding up Block Lanczos...namely, ask what happens if you have every GPU processor sieve the heck out of its (tiny) 64K block of memory, and return sparse results? Anyway, TF process on mfaktc/mfakto is roughly: sieve on CPU for candidate factors, then calculate 2^p-1 mod candidate factors, success if result is 0. |
|
|
|
|
|
#13 |
|
Romulan Interpreter
"name field"
Jun 2011
Thailand
41×251 Posts |
|
|
|
|
|
|
#14 |
|
Dec 2010
Monticello
5×359 Posts |
Indeed...but what I am trying to get across to Luigi is that with 500+ parallel processors on a GPU, each with a "tiny" chunk of memory, that the naive eratosthenes sieve algorithm works as follows:
Set up large array of prime candidates in memory. Each processor gets a different prime number, starting at 2, and writes all the locations corresponding to multiples to "not prime". But this is not the best way to use the GPU -- Instead, each processor has memory corresponding to a small range of prime number candidates, and then all processors in parallel evaluate whether each prime, starting at 2, has multiples within its range of memory(prime candidates) to be marked "not prime". Some computation gets repeated, but enough long latencies to global memory get avoided to make up for it. Likewise, mfaktc could do the same thing on a GPU when sieving for factor candidates to calculate 2^p-1 mod factor candidate. I'm thinking other sieving algorithms could and do take the same approach for efficient GPU implementation. Now, before RDS comes along and tells me the above descriptions are naive, I am very aware that both mfaktc and jasonp's programs do a *lot* more optimisation over what's described above. |
|
|
|
|
|
#15 | |
|
Banned
"Luigi"
Aug 2002
Team Italia
5·7·139 Posts |
Quote:
I initially used to sieve a (very) big array holding all numbers: if you follow that path, then each processor may cancel its composites from the array. Then I changed the source and cut the big array into 16 (short) residual classes, calculated the first element of each class and cleared all multiples by addition: no need to resieve, just work with the "distances" between cleared positions. In Factor5 this is (already) done by each thread, i.e. could be handled by GPU processors. I did not (yet) study the sieve algorithm used by mfaktc, so I can't speak about it, but afaik it receives a range of Ks to test, so maybe the same idea is used. Luigi |
|
|
|
|
![]() |
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| AGM algorithms for ln(x) | ewmayer | Other Mathematical Topics | 17 | 2018-09-12 16:06 |
| algorithms for special factorizations | jjcale | Factoring | 6 | 2011-07-28 02:06 |
| Quantum Algorithms? | ShiningArcanine | Lounge | 4 | 2007-12-16 12:11 |
| Implementing algorithms, did I do this right? | ShiningArcanine | Programming | 18 | 2005-12-29 21:47 |
| do you think there are better algorithms to be discovered? | ixfd64 | Lounge | 5 | 2004-03-29 06:07 |