mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Prime Gap Searches (https://www.mersenneforum.org/forumdisplay.php?f=131)
-   -   GPU Prime Gap searching (https://www.mersenneforum.org/showthread.php?t=26944)

MJansen 2021-06-21 12:15

GPU Prime Gap searching
 
[QUOTE=CraigLo;581445]Thanks. I use 1 1080 TI. My code doesn't work well for large numbers so I decided to switch to the max gap search until I have time to rewrite it. I planned to run it over the summer with the hope of finding 1 new record above 1432. I got lucky that this gap is so close to the previous max gap.[/QUOTE]

Cool actually, and I would love to know more about GPU programming. So many questions:
a. What code language do you use?
b. Is it under Windows or Linux (or a Linux dialect)?
c. Can you give a (simplified) code example?
d. How does a CUDA core compare to a CPU thread? Can you use all 1564 CUDA cores for prime gap hunting in parallel, just like CPU threads?
e. Is your energy bill going up by a lot?
f. I looked for GPUs, and the 1080 Ti is available second hand (introduced in 2017) but still at high prices (EUR 500-700), and some have the explicit note that they have not been used for mining bitcoins. Does mining (and I guess 24/7 prime gap hunting) wear out the GPU faster than usual?

Curious as ever,
Michiel Jansen

CraigLo 2021-06-22 07:02

[QUOTE=MJansen;581481]Cool actually, and I would love to know more about GPU programming. So many questions:
a. What code language do you use?[/QUOTE]
I use CUDA, which is an extension of C. It is Nvidia-specific. I looked at OpenCL when I started, but at the time it didn't have support for arithmetic carry. I'm not sure if this has changed.
[QUOTE]
b. Is it under Windows or Linux (or a Linux dialect)?[/QUOTE]
I've only used Linux, but others have used Windows.
[QUOTE]
c. Can you give a (simplified) code example?[/QUOTE]
The CUDA programming guide is good.
[url]https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html[/url]
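
For a first taste of the model, here is a minimal, textbook-style CUDA kernel (a generic illustration, nothing from the actual gap-search code); each thread squares one array element. Compile with nvcc:

[CODE]
#include <cstdio>

// Each thread computes one output element; the built-in variables
// blockIdx/blockDim/threadIdx give it a unique global index.
__global__ void square(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

int main() {
    const int n = 1024;
    int h[n], r[n];
    for (int i = 0; i < n; i++) h[i] = i;

    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h, n * sizeof(int), cudaMemcpyHostToDevice);

    square<<<n / 256, 256>>>(d_in, d_out, n);  // 4 blocks of 256 threads

    cudaMemcpy(r, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d squared is %d\n", 1000, r[1000]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
[/CODE]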
[QUOTE]
d. How does a CUDA core compare to a CPU thread? Can you use all 1564 CUDA cores for prime gap hunting in parallel, just like CPU threads?[/QUOTE]
See the programming guide for details.
The GPU is broken up into SMs (streaming multiprocessors). Each SM has a fixed amount of fast shared memory and a fixed number of registers that can be used by all threads in the SM. Each SM can run 128 threads at a time (this varies with the GPU architecture). Code is executed in warps of 32 threads; every thread in a warp executes the same instruction at the same time.

My GPU has 28 SMs * 128 = 3584 cores. They can all run at the same time. I have each thread process a different possible prime. My code slows down as the numbers get bigger, because each thread then uses too many resources to keep all the cores busy. To fix this I will need to have multiple threads process a single prime (a summer/fall project maybe, and I really need something like Karatsuba multiplication).

[QUOTE]
e. Is your energy bill going up by a lot?[/QUOTE]
The 1080 Ti has a peak of about 250 W. Over the winter I was heating my basement with primes, so it didn't cost me much. Now the cost is noticeable but pretty small.

[QUOTE]
f. I looked for GPUs, and the 1080 Ti is available second hand (introduced in 2017) but still at high prices (EUR 500-700), and some have the explicit note that they have not been used for mining bitcoins. Does mining (and I guess 24/7 prime gap hunting) wear out the GPU faster than usual?[/QUOTE]
I bought mine on eBay 3 years ago for $500. Chip prices are crazy right now. I've probably run it for less than a year in total since I've had it. I'm sure running it constantly will decrease its life, but I haven't had any problems with mine.

MJansen 2021-06-22 19:21

Hi Craig,

Thanks for the info! I have been reading up on the Nvidia guide, but it is a lot to take in, and it seems like a toolkit that requires you to deal with memory management, parallel programming, sending tasks to cores/warps/SMs, writing your own math functions, retrieving results, preventing deadlocks ... pfff, no easy task! Kudos to you for having come this far!

Conceptually I am trying to wrap my head around what can be done in parallel. You wrote that you have each thread (equals 32 cores? So 28 x 128 / 32 = 112 simultaneous threads? That would be a lot more than the 24 of my AMD 3900 ...) process a single prime (or do you mean prime gap?). Forgive me if I get this wrong, but this confuses me a little.

Regarding a parallel approach, my assumption would be (but I do not know if it is possible on a GPU) to have each of the 112 threads check a single candidate of a prime gap (with a Fermat test or Miller-Rabin test), and if no PRPs are found, feed in the next 112 candidates, until a PRP is found. Of course the workload could be smaller, so you can work on more intervals at a time, I guess.

Another approach I can think of would be sieving the interval around the primorial start value for composites, so only the stronger candidates remain to be checked with a pseudoprime test on the CPU. But I am totally new to this and nowhere near knowledgeable enough to have a good understanding of how to use the GPU to its fullest! Again my curiosity speaking: what conceptual approach have you implemented?

PS No room in my house needs an extra heater ;-) and 1564 was my mistake; I scrambled the correct core number (3584).

Kind regards
Michiel

CraigLo 2021-06-22 20:32

[QUOTE=MJansen;581560]Conceptually I am trying to wrap my head around what can be done in parallel. You wrote that you have each thread (equals 32 cores? So 28 x 128 / 32 = 112 simultaneous threads?) process a single prime (or do you mean prime gap?).[/QUOTE]

It is actually 3584 cores. I was trying to say that you cannot have 3584 separate programs running at the same time. The cores are broken up into blocks of 32, called warps, which all run the same code at the same time. You can still have all 3584 cores each running a separate Fermat test.
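
In kernel form, one candidate per thread looks roughly like this (a schematic sketch, not the actual search code; is_fermat_prp_65 is a dummy stand-in for the real multiprecision test):

[CODE]
// Dummy stand-in for the real 65-bit Fermat test (placeholder logic only!).
__device__ bool is_fermat_prp_65(unsigned long long c) {
    return (c & 1) != 0;
}

// One sieve survivor per thread. All 32 threads of a warp execute the same
// instructions in lockstep, but each on its own candidate.
__global__ void fermat_kernel(const unsigned long long *survivors, int n,
                              unsigned char *flags) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        flags[i] = is_fermat_prp_65(survivors[i]);
}
[/CODE]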

MJansen 2021-06-23 13:47

Hi Craig,

I am trying to compare my CPU-only calculations to the GPU's calculating power. Because of the differences in architecture it is not quite the same, but am I correct in saying that your GPU can run 3584 / 32 = 112 simultaneous gap searches (scripts) on the 112 warps? And that each warp can process 32 prime candidates at a time?

I read a thread elsewhere on this forum on GPU sieving (300+ pages ...); some notions:
a. the thread started already in 2009! There must be some people around who have built up knowledge about GPU programming since then.
b. if you use the full capacity of the GPU, your display will be hampered/frozen, so disconnect the GPU from the monitor and use a separate GPU for display.
c. developing cross-platform / generic GPU code is tricky, in the sense that NVIDIA driver updates, Windows or Unix/Linux updates, different video cards, etc. can cause errors in older (working) code.
d. not sure where I read it, but there was some mention of a penalty for using % operations on the GPU; is that something you encountered?

Kind regards
Michiel

CraigLo 2021-06-23 16:02

[QUOTE=MJansen;581618]...am I correct in saying that your GPU can run 3584 / 32 = 112 simultaneous gap searches (scripts) on the 112 warps? And that each warp can process 32 prime candidates at a time?[/QUOTE]

We should probably start a separate GPU thread.

You can have 3584 simultaneous gap searches. In my 65-bit optimized code, roughly 1/2 of the candidates are prime after sieving, so on average a prime turns up after about 2 tests. If I processed 32 candidates for a single gap, most of those tests would be wasted, giving up about a factor of 16 in performance. I think the term Nvidia uses for a warp is single instruction multiple thread (SIMT), which is similar to SIMD.

A single GPU core is going to be slower than a single CPU core. GPU operations are 32-bit. Integer multiplication is slower than floating-point multiplication (I think a 32-bit integer multiply takes ~7x the cycles of a 32-bit float multiply on mine). Integer division and mod are very slow on a GPU and should be avoided as much as possible. GPUs are designed for fast 32-bit float multiplies and fast memory access. Everything else is slower.

For an idea of the processing power the specs are listed here
[url]https://en.wikipedia.org/wiki/GeForce_10_series[/url]

The 1080 Ti has about 10000 GFLOPS. That counts a multiply and an accumulate as separate operations, so 5000 GMACs. Dividing by 7 gives me about 700E9 32-bit integer multiply-accumulates per second.

On the memory side, global memory bandwidth is about 480 GB/s and shared memory is about a factor of 10 higher.

In my experience, my internet browser is slow while the GPU is running. Everything else I do is unaffected, but I don't use that computer for much.

I've only used Linux and the GTX 10 series, so I'm not sure about cross-platform issues or driver updates. Different GPU architectures (9 series vs 10 series, etc.) have different instruction sets and features. Even within a single family, the optimal parameters will probably change due to different capabilities.

MJansen 2021-06-24 13:00

Craig wrote:
We should probably start a separate GPU thread.
MJ: Would be good; could anybody move the relevant posts to a separate thread? Or should I open a new thread altogether?


MJansen 2021-06-24 15:54

Wow, that was fast! Many thanks to whoever made this a separate thread!

Regarding the rest of the post, @Craig: I am going to try, in my own words, to summarize what I think I understand. This probably means I will not use the correct terms, so please cut me some slack in that regard:

I think you use your CPU (1 or more threads?) to sieve an interval and feed the remaining candidates to the GPU. In the GPU, a SIMT (let's use that term) has the code to send the candidates to the cores in that SIMT and perform a Fermat test. Is that the basic concept?

You have not yet mentioned how the results get stored or processed, so I will not guess at that.

Regarding sieving: Jens Kruse Andersen made a very efficient treesieve program (especially for larger gaps) that is 32-bit C code and outputs text files of the remaining candidates of an interval around the prime gap center; is that something that can be used? Let me rephrase: can you store an exe file on the GPU/SIMT and run it?

Or can you - using Linux - access the GMP library from the GPU to allow for fast multiplication?

I am not familiar with Linux (just Windows, sorry!), but there is software around (and maybe people here have their own sieving code; Dana has incorporated a very efficient sieving program in his Perl next_prime / prev_prime functions).

What else have we learned:
a. A GPU core is about half as fast as a CPU core (or even slower).
b. Division and modulo operations are very slow on a GPU.
c. Use the GPU's fast shared memory where possible (much faster than global memory), but the practical implications are not yet clear (to me at least).

Kind regards and thanks again for creating this thread!
Michiel

CraigLo 2021-06-24 22:59

[QUOTE=MJansen;581746]I think you use your CPU (1 or more threads?) to sieve an interval and feed the remaining candidates to the GPU. In the GPU, a SIMT (let's use that term) has the code to send the candidates to the cores in that SIMT and perform a Fermat test. Is that the basic concept?[/QUOTE]

I do all my sieving in the GPU. Sieving is limited by memory speeds which are much faster in the GPU. I run the sieving and prime check code at the same time. They work well together because one is memory bandwidth limited and the other is processor limited. While waiting for a memory access the GPU can switch to another warp that is ready to run. The GPU does this automatically and instantly. Running them at the same time will take less time than running one and then the other.
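
In CUDA terms, one way to express that overlap is two kernels in different streams (a schematic illustration only; the kernels here are empty placeholders for the real sieve and Fermat code):

[CODE]
// Placeholder kernels standing in for the real (memory-bound) sieve and the
// (compute-bound) Fermat-test code.
__global__ void sieve_kernel(unsigned *sieve_bits) { /* bandwidth-limited */ }
__global__ void fermat_kernel(unsigned long long *candidates) { /* compute-limited */ }

int main() {
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    unsigned *bits;
    unsigned long long *cand;
    cudaMalloc(&bits, 1 << 20);
    cudaMalloc(&cand, 1 << 20);

    // Launch both kernels without a synchronize in between: the scheduler is
    // free to overlap them, hiding sieve memory latency behind PRP arithmetic.
    sieve_kernel<<<1024, 256, 0, s0>>>(bits);    // sieve the NEXT interval
    fermat_kernel<<<1024, 256, 0, s1>>>(cand);   // PRP-check the CURRENT one

    cudaDeviceSynchronize();  // wait for both, swap buffers, repeat
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(bits);
    cudaFree(cand);
    return 0;
}
[/CODE]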

If you already had pre-computed candidates you could send them to the GPU for processing. The 65-bit code I'm running now is fast enough that even if you had pre-computed candidates I think you would be limited by how quickly you could send them from the CPU to the GPU.

MJansen 2021-06-25 15:09

[QUOTE=CraigLo;581810]I do all my sieving in the GPU. Sieving is limited by memory speeds which are much faster in the GPU. I run the sieving and prime check code at the same time. They work well together because one is memory bandwidth limited and the other is processor limited. While waiting for a memory access the GPU can switch to another warp that is ready to run. The GPU does this automatically and instantly. Running them at the same time will take less time than running one and then the other.

If you already had pre-computed candidates you could send them to the GPU for processing. The 65-bit code I'm running now is fast enough that even if you had pre-computed candidates I think you would be limited by how quickly you could send them from the CPU to the GPU.[/QUOTE]

This means at least the following two conceptual approaches:
A. using the GPU for pre-sieving and PRP test
B. pre-sieve on CPU and using the GPU for PRP test only

The latter approach was used in the thread I mentioned ([url]https://mersenneforum.org/showthread.php?t=12827[/url]), and there were regular remarks about the CPU not being fast enough, so the GPU was left waiting. So I guess your approach, with both pre-sieving and the PRP test on the GPU, circumvents that.

Usually pre-sieving depends on (a combination of) wheel-based approaches, modulo operations, or a gcd check for small factors. These methods depend on division or modulo-remainder calculations, so not the most efficient approach for a GPU, it seems. Treesieve is fast but also heavy on modulo operations. I admit I am curious how your 65-bit code handles the pre-sieving!

Regarding the hardware, I am not sure when I will have a GPU installed to start experimenting, so for the time being conceptual discussion only. I found a nice overview of NVIDIA GPUs and their specs: [url]https://www.studio1productions.com/Articles/NVidia-GPU-Chart.htm[/url] - this makes for a nice wishlist ;-)

Kind regards
Michiel

ATH 2021-06-25 17:42

[QUOTE=MJansen;581876]The latter approach was used in the thread I mentioned ([url]https://mersenneforum.org/showthread.php?t=12827[/url]) and there were regular remarks about the CPU being not fast enough so the GPU was waiting. So I guess your approach with both pre sieving and PRP test on the GPU circumvents that.[/QUOTE]

mfaktc can also sieve on the GPU: SieveOnGPU=1 in mfaktc.ini. I think almost everyone uses that now, and I think it was only in the very beginning of mfaktc that people were using the CPU for sieving, but I could be wrong about that.

MJansen 2021-06-25 18:41

[QUOTE=ATH;581898]mfaktc can also sieve on the GPU: SieveOnGPU=1 in mfaktc.ini. I think almost everyone uses that now, and I think it was only in the very beginning of mfaktc that people were using the CPU for sieving, but I could be wrong about that.[/QUOTE]

Thanks ATH! The thread is 300+ pages spanning over 10 years; I have not read them all yet ;-) Cool to see the sieving has been moved to the GPU though! Any indication where in the thread that change was announced?

Kind regards
Michiel

ATH 2021-06-25 19:39

[QUOTE=MJansen;581906]Any indication where in the thread that change is announced?[/QUOTE]

Post #1948. It seems it was released in version 0.20 in Dec 2012, about 3 years after the thread started, and SieveOnGPU was set to be on automatically from the start, so probably no one has used CPU sieving since then.

MJansen 2021-06-26 21:59

Hi,

In the mfaktc thread, the sieving on GPU seemed to get a boost after post #1610. No specifics though, as far as I could find. I remember Dana once posted that a PRP test is very fast, and the mfaktc thread seems to indicate that pre-sieving should be kept to a minimum (1000 primes tops). I have no data on real-life performance, but my intuitive starting point would be:

a. pre-sieve and PRP on the GPU;
b. I would try a wheel approach first as a base reference, and PRP the remaining candidates, i.e. 1 (or more?) Fermat test(s);
c. additional tests would be to play around with the number of primes in the pre-sieve, to determine the trade-off between pre-sieving and PRP-ing.

The question is how to avoid possible pseudoprimes. Is it enough to perform a second Fermat test in a different base, or do you need more tests?

Kind regards
Michiel

ATH 2021-06-27 15:37

[QUOTE=MJansen;582010]Question is how to avoid possible pseudoprimes? Is it enough to perform a second Fermat test in a different base, or do you need more tests?[/QUOTE]

Personally, in my own CPU program, if the first SPRP test is positive I do a Lucas test, so effectively a BPSW test. It is a lot slower than just a Fermat test, but it is almost sure not to let any pseudoprimes through, since BPSW pseudoprimes are so rare that none are known. I have no idea if a Lucas test is possible or feasible on a GPU.

But above 2[SUP]64[/SUP], Fermat and SPRP pseudoprimes are very rare, and the risk that a pseudoprime would also be "blocking" a record prime gap must be very, very low. I'm just not sure if it is low enough to be negligible.

CraigLo 2021-06-29 03:44

For sieving I start with a small wheel. I'm currently using 2310, but I haven't had a chance to test 30030 yet. Using 2310 gives me 480 possible candidates. I then sieve for each of the possible candidates. For example, the first candidate is 1, so I sieve for 1, 2311, 4621, 6931, etc. This starts with 2 more wheels of 5 primes each. The first is copied into each possible candidate to initialize the sieve. The second wheel is combined with the first using a bitwise OR. Then I do 5 rounds of sieving with 2048 primes each. These are done using shared memory. I sieve 2048 primes for 192x1024 elements at a time. I then copy these into my global sieve. I find this approach fastest up to around 100k primes. Above that, approaches sieving directly into global memory are faster.

In a single processing loop I do 80 of the shared sieve blocks. That means a single candidate gets sieved out to 192*1024*80 = 15.7E6. With 2310 elements in the first wheel, that means I get through 36E9 per loop. Using a minimum gap of 1300, it takes about 0.24 seconds per loop, which gives me 150E9 per sec. I haven't tested sieving-only in a while, but it is probably 2-3 times faster.
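
The 480 candidates are just the residues mod 2310 that are coprime to 2*3*5*7*11, i.e. phi(2310) = 1*2*4*6*10 = 480. A quick sanity check (illustration only; needs C++17 for std::gcd):

[CODE]
#include <cstdio>
#include <numeric>

int main() {
    // Count residues mod 2310 = 2*3*5*7*11 that survive the wheel,
    // i.e. are coprime to 2310. This is Euler's phi(2310).
    int count = 0;
    for (int r = 1; r < 2310; r++)
        if (std::gcd(r, 2310) == 1) count++;
    printf("%d candidates per wheel of 2310\n", count);  // prints 480
    return 0;
}
[/CODE]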

CraigLo 2021-06-29 03:49

I looked briefly at a Lucas test but haven't thought about implementing it yet. On a CPU how much slower is it than a Fermat test?

a1call 2021-06-29 11:15

Well, the experts seem to have chosen to stay silent so far, so I will give you a non-expert reply. You will have to know all the prime factors of p-1, so there is a factoring computation cost. And then you will have to do modular calculations, computationally as expensive as a single Fermat test, for each prime factor:
Mod(base,p)^((p-1)/eachPrimeFactor) != 1
Then you have to try enough "random" bases to get != 1 for all the found prime factors.
Normally it does not take more than a few random trials.

BTW, this is an excellent thread. Thank you for sharing your expertise without the common attitudes that seem to dominate the members here.

ETA: one issue is that you might hit a Fermat pseudoprime and will need logic to break the loop. While they are rare for a given base among larger numbers, there exists an infinitude of them for any given base, growing exponentially in size.

ATH 2021-07-15 12:18

I switched my program from the BPSW test (Miller-Rabin + Lucas) to 2 x Miller-Rabin instead, and it is now twice as fast. I'm pretty sure the BPSW test is overkill here. I'm using base 2 and then base 9375, which is one of the bases with few survivors when testing against all the 2-SPRPs in the Feitsma list up to 2[SUP]64[/SUP].

Craig: You are using only 1 [URL="https://en.wikipedia.org/wiki/Fermat_primality_test"]Fermat test[/URL]? Did you consider switching to the [URL="https://en.wikipedia.org/wiki/Miller%E2%80%93Rabin_primality_test"]Miller-Rabin[/URL] (SPRP) test?
The complexity according to Wikipedia is O(log[SUP]3[/SUP]n) for MR and O(log[SUP]2[/SUP]n*loglog n) for Fermat, so it should not be that much slower, but maybe it is hard to implement on a GPU?
The advantage is that there are no Carmichael numbers for the SPRP test (numbers where all bases incorrectly show PRP), and there are fewer composite SPRPs than Fermat PRPs.

I'm not sure, when doing 1 Fermat or 1 MR test, how likely it is that a false PRP will block a big prime gap; it seems very unlikely, but the question is whether it is unlikely enough.
2-SPRP numbers were pretty rare when I tried searching above 2[SUP]64[/SUP] a few years ago, and if you look at the Feitsma list, there are 808 2-SPRPs in the last 10[SUP]15[/SUP] below 2[SUP]64[/SUP], or 1 every 1.24*10[SUP]12[/SUP] on average. I'm not sure about the numbers for Fermat PRPs.


Is your program testing all PRPs surviving your sieve? The way I do it is to set a minimum gap I want to find, in my case 1320, because that is the highest I have found above 2[SUP]64[/SUP]. After I have sieved an interval, I jump 1320 ahead of the last PRP I found and then test backwards until I find a PRP; then I jump ahead 1320 again and repeat.
If I happen to get from 1320 down to 0 without finding a PRP, I search upwards again from 1320 until I find a new "record" gap. That way I'm sure no gap above or equal to my minimum is missed, but I will not find any gaps below the minimum. On average my program only gets from 1320 down to 1100-1200 before it finds a PRP.
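
In code, that strategy reads roughly as follows (my sketch of the description above; is_prp() is assumed, standing in for the real SPRP test):

[CODE]
#include <cstdint>
#include <cstdio>

// Stand-in for the real PRP test (2-SPRP, or 2-SPRP + 9375-SPRP).
bool is_prp(uint64_t n);

// Jump min_gap ahead of the last PRP, then test backwards; assumes
// last_prp is odd and min_gap is even, so only odd candidates are tested.
void scan(uint64_t last_prp, uint64_t end, uint64_t min_gap) {
    while (last_prp < end) {
        uint64_t n = last_prp + min_gap;            // jump ahead
        while (n > last_prp && !is_prp(n)) n -= 2;  // scan downwards
        if (n == last_prp) {                        // empty window: record gap,
            n = last_prp + min_gap;                 // now search upwards
            while (!is_prp(n)) n += 2;
            printf("gap %llu after %llu\n",
                   (unsigned long long)(n - last_prp),
                   (unsigned long long)last_prp);
        }
        last_prp = n;  // continue from the newest PRP found
    }
}
[/CODE]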

I'm not sure if that would work on a GPU. If you use all the threads to PRP-test the same interval, it probably will not; but if each thread tests its own small interval, it could be useful - but maybe you are already using this strategy? If so, I guess your minimum gap is set to 1000.

CraigLo 2021-07-16 05:05

It probably wouldn't be too hard to convert my Fermat code to an MR test. The time complexity should be the same. I think the O(log[SUP]2[/SUP]n*loglog n) you wrote for Fermat assumes an FFT multiply, which is only faster for very large numbers.

[url]http://www.janfeitsma.nl/math/psp2/statistics[/url]
From the Feitsma list there is about one 2-PRP every 3E11 from 1E19 to 2[SUP]64[/SUP]. This is about 4x more than the 2-SPRPs. I think there will be about 500 gaps >= 1400 from 2[SUP]64[/SUP] to 2[SUP]65[/SUP], and each will require about 70 prime checks on average. The chance that a gap above 1400 from 2[SUP]64[/SUP] to 2[SUP]65[/SUP] is missed due to a PRP is about 1 in 3E11/500/70 = 1 in 8E6.

I'm currently saving gaps >= 1300 using an approach similar to yours. All gaps found in the GPU above 1300 are sent back to the CPU for a more thorough check using the gmp library.

MJansen 2021-09-03 09:49

I bought a laptop with a GPU (GeForce GTX 1650 Mobile: 1024 CUDA cores, 16 SMs of 64 cores each) for testing. I will be using Windows, however (not Linux), and will install the necessary programs next week. A quick search showed that I will have to use C++ (new to me ...) to code.

I looked at the pseudoprime problem, and the solution ATH uses seems most interesting (and fast), but I also have no clue yet as to how to implement it on a GPU.

I have been looking at Dana's Perl code, but I am not sure this is the correct 64-bit code: [url]https://metacpan.org/dist/Math-Prime-Util/source/lib/Math/Prime/Util/PP.pm[/url]

For prime gaps, the fastest I found was using his next_prime / prev_prime code.

sub next_prime {
my($n) = @_;
_validate_positive_integer($n);
return $_prime_next_small[$n] if $n <= $#_prime_next_small;
# This turns out not to be faster.
# return $_primes_small[1+_tiny_prime_count($n)] if $n < $_primes_small[-1];

return Math::BigInt->new(MPU_32BIT ? "4294967311" : "18446744073709551629")
if ref($n) ne 'Math::BigInt' && $n >= MPU_MAXPRIME;
# n is now either 1) not bigint and < maxprime, or (2) bigint and >= uvmax

if ($n > 4294967295 && Math::Prime::Util::prime_get_config()->{'gmp'}) {
return Math::Prime::Util::_reftyped($_[0], Math::Prime::Util::GMP::next_prime($n));
}

if (ref($n) eq 'Math::BigInt') {
do {
$n += $_wheeladvance30[$n%30];
} while !Math::BigInt::bgcd($n, B_PRIM767)->is_one ||
!_miller_rabin_2($n) || !is_extra_strong_lucas_pseudoprime($n);
} else {
do {
$n += $_wheeladvance30[$n%30];
} while !($n%7) || !_is_prime7($n);
}
$n;
}

And it seems like he uses a small wheel (30) and a Miller-Rabin base-2 test followed by a full Lucas test, so in effect a full BPSW test? Is this a correct assumption?

LaurV 2021-09-05 06:20

Please use code tags when you post code; it takes less space on the screen (especially for long routines, and not all of us have high-resolution, super-wide monitors), and it makes the code easier to read (guess what, not all of us are Perl experts either :razz:). Moreover, Mike may add some syntax highlighting for Perl in the future, which will make our life even easier :mike: :unsure:

(edit: albeit code in quotes looks crap, as I just found out from below... grrr :yucky:)

[QUOTE=MJansen;587141][CODE]sub next_prime {
  my($n) = @_;
  _validate_positive_integer($n);
  return $_prime_next_small[$n] if $n <= $#_prime_next_small;
  # This turns out not to be faster.
  # return $_primes_small[1+_tiny_prime_count($n)] if $n < $_primes_small[-1];

  return Math::BigInt->new(MPU_32BIT ? "4294967311" : "18446744073709551629")
    if ref($n) ne 'Math::BigInt' && $n >= MPU_MAXPRIME;
  # n is now either 1) not bigint and < maxprime, or (2) bigint and >= uvmax

  if ($n > 4294967295 && Math::Prime::Util::prime_get_config()->{'gmp'}) {
    return Math::Prime::Util::_reftyped($_[0], Math::Prime::Util::GMP::next_prime($n));
  }

  if (ref($n) eq 'Math::BigInt') {
    do {
      $n += $_wheeladvance30[$n%30];
    } while !Math::BigInt::bgcd($n, B_PRIM767)->is_one
         || !_miller_rabin_2($n) || !is_extra_strong_lucas_pseudoprime($n);
  } else {
    do {
      $n += $_wheeladvance30[$n%30];
    } while !($n%7) || !_is_prime7($n);
  }
  $n;
}[/CODE][/QUOTE]

danaj 2021-09-07 09:06

[QUOTE=MJansen;587141]I have been looking at Dana's Perl code, but not sure this is the correct 64 bit code: [url]https://metacpan.org/dist/Math-Prime-Util/source/lib/Math/Prime/Util/PP.pm[/url][/quote]

No, that is fall-back code used for big numbers without GMP. It works, but it's both ugly and slow. I admit some of the C code is ugly as well; the whole thing kind of grew organically from a tiny project.

[quote]And it seems like he uses a small wheel (30) and a Miller_Rabin base 2 test followed by a full Lucas, so in effect a full BPSW test? Is this a correct assumption?[/QUOTE]

That's mostly true, but not for inputs > 120 bits. The part about BPSW [b]is[/b] true. For 64-bit inputs, the results are deterministic (like most math libraries/packages). For larger inputs, the results have passed extra-strong BPSW.


[B][SIZE="4"]64-bit:[/SIZE][/B]

64-bit next_prime: [url]https://github.com/danaj/Math-Prime-Util/blob/410f67b110f35c9aa7fb560d448871c5d9871cb8/util.c#L127[/url]

64-bit prev_prime: [url]https://github.com/danaj/Math-Prime-Util/blob/410f67b110f35c9aa7fb560d448871c5d9871cb8/util.c#L155[/url]

The algorithm is:
1. If the value is tiny, use a fixed data set.
2. If the value is in our small saved sieve, use that.
3. Skip forward in a mod-30 wheel doing a primality test.

Steps 1 and 2 are useful for the many calls done with small values.

The primality test: [url]https://github.com/danaj/Math-Prime-Util/blob/410f67b110f35c9aa7fb560d448871c5d9871cb8/primality.c#L1274[/url]

That's hard-coded tiny trial division followed by "AES" BPSW, which is deterministic for 64-bit inputs: an MR base-2 test followed by a fast Lucas test.

On an x86 for native inputs, there seems to be no benefit in running a Fermat test rather than a Miller-Rabin test. For that matter, even with GMP I wasn't able to measure any advantage.

For the full test, a hashed Miller-Rabin set (2 or 3 total tests needed) is slightly faster for base-2 pseudoprimes, but not obviously faster for primes. On platforms without fast Montgomery math, it may be different.


[B][SIZE="4"]Larger than 64-bit:[/SIZE][/B]

next_prime: [url]https://github.com/danaj/Math-Prime-Util-GMP/blob/db88b861fedb16f879500c592d218aa7a39d19f3/gmp_main.c#L332[/url]

prev_prime: [url]https://github.com/danaj/Math-Prime-Util-GMP/blob/db88b861fedb16f879500c592d218aa7a39d19f3/gmp_main.c#L360[/url]

surround_primes: [url]https://github.com/danaj/Math-Prime-Util-GMP/blob/db88b861fedb16f879500c592d218aa7a39d19f3/gmp_main.c#L389[/url]

For next and prev prime, with values less than 121 bits we use a very similar simple algorithm. Skip forward using a mod-30 wheel, do trivial trial division, then call the primality test.

primality test: [url]https://github.com/danaj/Math-Prime-Util-GMP/blob/db88b861fedb16f879500c592d218aa7a39d19f3/primality.c#L1471[/url]

which is pretests (trial division), then ES BPSW (base-2 Miller-Rabin then extra-strong Lucas test).

pretests: [url]https://github.com/danaj/Math-Prime-Util-GMP/blob/db88b861fedb16f879500c592d218aa7a39d19f3/gmp_main.c#L108[/url]

which does overly complicated amounts of trial division. This includes cached large GCDs (because GMP does those very quickly), and also a choice between simple trial division and a treesieve routine (D. Bernstein describes the concept at a high level, and Jens K Andersen has a nice detailed description of it; my implementation may or may not be particularly efficient, but it's much faster than one-at-a-time trial division for large enough inputs).


This leaves numbers larger than 120 bits, as well as surround_primes, which is meant for prime gaps. Here we do a partial sieve with a width of about 30 merits and a convoluted "educated guess" as to the depth that will give us the best performance. After the sieve, the ES BPSW test is done on the candidates in order. It repeats the process if no result is found, of course, though at 30 merits this isn't common.

surround_primes sieves 20 merits on either side of the given input, then tests candidates. If both candidates are not found, it repeats with 40, 80 merits, etc. to find whichever one (or both) haven't been found.


I have had on my todo list for a long time, going over the method in [url]https://arxiv.org/abs/2012.03771[/url], which is claimed to be faster.

SethTro 2021-09-07 22:10

[QUOTE=danaj;587432]
I have had on my todo list for a long time, going over the method in [url]https://arxiv.org/abs/2012.03771[/url], which is claimed to be faster.[/QUOTE]

That's my paper :)

Nothing in it helps with very small primorials or single calls to surround_primes. But it has a number of useful speedups for running many sequential surround_primes.

I did recently add a [URL="https://github.com/sethtroisi/prime-gap/blob/develop/gap_test_gpu.cu"]GPU tester to my repository[/URL] that uses CGBN, which I've found to be very useful for doing math on 256-2048 bit numbers. The CGBN repository had a pre-written Miller-Rabin implementation I made use of; on my 1080 Ti it's much faster than using all 12 cores of my Ryzen 3900x.

CraigLo 2021-09-08 06:35

[QUOTE=MJansen;587141]I bought a laptop with a GPU (Geforce 1650 mobile, 1024 cuda cores, 16 SM's of 64 cores each) for testing.[/QUOTE]

Nice. I started with an old laptop that had a 940MX GPU (384 cores, I think).

[QUOTE=SethTro;587467]I did recently add a GPU tester to my repository that uses CGBN which I've found to be very useful for doing math on 256-2048 bit numbers. [/QUOTE]

I haven't had a chance to look at the CGBN library yet. It wasn't available when I started a few years ago. One of the big improvements I made to my 65-bit code was a 65-bit multiplier. Instead of representing the 65-bit number as 3 32-bit limbs, which requires 9 multiplies

[CODE](x1 x2 x3) * (y1 y2 y3) = [ x1y1 | x1y2+x2y1 | x1y3+x2y2+x3y1 | x2y3+x3y2 | x3y3 ][/CODE]

I use

[CODE](1 x2 x3) * (1 y2 y3) = [ 1 | x2+y2 | x2y2+x3+y3 | x2y3+x3y2 | x3y3 ][/CODE]

which requires 4 multiplies and 4 additions. For squaring, 6 multiplies become 3. This gives me about a factor of 2 improvement. Do you know if CGBN does this, or if it would be possible to add it?
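
Viewed at the 64-bit level, X = 2^64 + x and Y = 2^64 + y give X*Y = 2^128 + (x+y)*2^64 + x*y, so only the 64x64 product x*y needs real multiplies - exactly the four 32x32 products above. For reference, that piece can be written like this (a generic illustration, not Craig's kernel):

[CODE]
#include <stdint.h>

// 64x64 -> 128-bit product built from 32-bit limbs: the four multiplies are
// x2*y2, x2*y3, x3*y2, x3*y3 in the notation of the post above.
typedef struct { uint64_t lo, hi; } u128;

u128 mul_64x64(uint64_t x, uint64_t y) {
    uint64_t xh = x >> 32, xl = x & 0xffffffffULL;   // x2, x3
    uint64_t yh = y >> 32, yl = y & 0xffffffffULL;   // y2, y3
    uint64_t ll = xl * yl;                           // x3*y3
    uint64_t lh = xl * yh;                           // x3*y2
    uint64_t hl = xh * yl;                           // x2*y3
    uint64_t hh = xh * yh;                           // x2*y2
    // Fold the two middle partial products, tracking the carry.
    uint64_t mid = (ll >> 32) + (lh & 0xffffffffULL) + (hl & 0xffffffffULL);
    u128 r;
    r.lo = (mid << 32) | (ll & 0xffffffffULL);
    r.hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    return r;
}
[/CODE]

The remaining 2^128 + (x+y)*2^64 terms cost only additions, which is where the saving over the 9-multiply version comes from.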

SethTro 2021-09-08 17:13

[QUOTE=CraigLo;587493]which requires 4 multiplies and 4 additions. For squaring, 6 multiplies becomes 3 multiplies. This gives me about a factor of 2 improvement. Do you know if CGBN does this or if it would be possible to add it?[/QUOTE]

CGBN doesn't have BITS+1 optimizations, and I doubt the author will want to add them. It's much more oriented around large input numbers.

MJansen 2021-09-09 07:37

I have not been looking at the forum too regularly lately, lots of posts:

[QUOTE=danaj;587432]No, that is fall-back code if using big numbers without GMP. It works but it's both ugly and slow. I admit some of the C code mis ugly as well, and the whole thing kind of grew organically from a tiny project.
....[/QUOTE]

Thanks Dana, appreciate the answer and the pointers, will look into them!

@Dana/@Craig, I have a question regarding bignum libraries in C++ under Windows; let me elaborate a little first:
The fastest library for bignum calculations is GMP, and GMP is native under Unix/Linux but not under Windows. Perl has a GMP library incorporated in the Strawberry core files, so under Perl it is relatively easy to call GMP functions.

C++ itself does not seem to have a bignum library built in (that I could find; maybe I did not search in the right place?). @Craig: how did you solve this? What bignum library are you using in CUDA C++?

If one would like to use GMP under C++ on Windows, Google will show some links about MinGW and using GMP from that instance. Does anybody have any experience with setting that up? Help would be appreciated!

Kind regards
Michiel

CraigLo 2021-09-09 17:44

I use GMP for my CPU code, but I've only used Linux. I have my laptop set up so I can boot into either Linux or Windows. I'm not sure if that will work for you.

For the GPU stuff I have written all my own bignum routines. It looks like CGBN is a good option now but it didn't exist when I started. I haven't tried it yet. There might be some instances where a specialized routine will be faster (e.g. 65-bit numbers).

robert44444uk 2021-09-09 17:49

[QUOTE=MJansen;587546]If one would like to use GMP under C++ on Windows, Google will show some links about MinGW and using GMP from that instance. Does anybody have any experience with setting that up? Help would be appreciated![/QUOTE]

I made the migration to Linux under the Windows 10 operating system, using Ubuntu 20.04 LTS. I don't really understand what I'm doing, but with help from others I can now run Linux codes.

MJansen 2021-09-09 21:14

Thanks Craig! I was wondering: after installing Strawberry Perl, a gmp.h file is installed on the laptop in the directory C:\Strawberry\c\include. The C++ code seems to accept a reference to it (#include <gmp.h>). Could this work? I will try later with some test code, but what are your thoughts? Worth a try? Or does the GPU not work well with GMP, in your experience?

I spotted this thread online: [url]https://stackoverflow.com/questions/14126393/c-program-using-gmp-library[/url]

If GMP and Cuda do not mix, I am interested in your homebrew code ;-)

I found some references to other libraries for GPU:
CUMP [url]https://github.com/skystar0227/CUMP[/url]
XMP [url]https://github.com/NVlabs/xmp[/url]
Campary [url]https://homepages.laas.fr/mmjoldes/campary/[/url]

Any thoughts on those?

And an article regarding GMP implementation:
[url]http://individual.utoronto.ca/haojunliu/courses/ECE1724_Report.pdf[/url]

I get the feeling that programming for a GPU will require a lot of programming and research ...

Kind regards
Michiel




CraigLo 2021-09-10 06:39

I think you will be better off with a library that is designed to work well with the GPU architecture, instead of trying to run GMP on the GPU.

CUMP looks like it is for floating-point numbers.
XMP looks like an old version of CGBN (XMP 2.0).
Campary doesn't have any documentation that I could find.

I would start with CGBN. It is written by NVIDIA and well tested. It will handle larger numbers than my code at the moment and it is a more complete library.

You will probably still want GMP for your CPU code. I haven't used MinGW or run GMP/CUDA on Windows so I can't help there.

CraigLo 2021-09-10 06:47

[QUOTE=SethTro;587514]CGBN doesn't have BITS+1 optimizations, and I doubt the author will want to add them. It's much more oriented around large input numbers.[/QUOTE]

I was reading through some of the CGBN documentation.

It says CGBN currently requires 4, 8, 16 or 32 threads per CGBN group. What is a CGBN group? Is it a single bignum?

Also, the size must be evenly divisible by 32. Does this mean it won't work for a 65-bit number? Or can you treat it as a 96-bit number with leading 0s and everything will work correctly?

SethTro 2021-09-10 07:46

[QUOTE=CraigLo;587623]
It says CGBN currently requires 4, 8, 16 or 32 thread per CGBN group. What is a CGBN group? Is it a single bignum?
[/QUOTE]

It's a single bignum (made up of 32-bit limbs).

[QUOTE=CraigLo;587623]
Also, the size must be evenly divisible by 32. Does this mean it won't work for a 65 bit number? Or can you treat it as a 96 bit number with leading 0s and everything will work correctly?[/QUOTE]

IMO CGBN isn't going to be of much use for 65-bit numbers; maybe for 96-bit, and definitely for 128-bit numbers, but not for 65 bits.

MJansen 2021-09-10 07:54

Hi Craig,

Appreciated; that seems like sound advice! The goal is to get started with some coding, and since CGBN is NVIDIA-supported, it might develop further in the future. Will try CGBN. Another advantage is that it already has a probable prime test incorporated (Miller-Rabin, according to Seth).

Regarding your response about CGBN groups, we'll have to find out ;-) I have to say the CUDA terminology is not that self-explanatory (SM, kernel, thread, block, cores, ...).


@Robert, I am hesitant to switch to Linux just yet; I will stick to Windows for now.

Regards
Michiel

Bobby Jacobs 2021-10-14 22:37

This is very cool. Now, when do you believe you will have the new code working to confirm the new maximal prime gaps?

MJansen 2021-10-19 17:12

[QUOTE=Bobby Jacobs;590601]This is very cool. Now, when do you believe you will have the new code working to confirm the new maximal prime gaps?[/QUOTE]

Hi Bobby,
not sure who you are reacting to. If it was me, let me be clear: I am not going to put energy into making a deterministic gap search from 2^64 onwards. I will be experimenting with coding for a GPU, to see how much more efficient that is compared to the traditional CPU gap search.

Update on that:
I took a detour and have now installed Ubuntu under Windows (WSL2?) and am currently learning the basics of the GMP datatypes' eccentricities. GMP is not very intuitive, I must admit, and I have not begun to look for speed optimizations yet. So it is slow going still, but I have a basic understanding of Ubuntu, GMP and C++. In summary: still confused, but at a higher level ;-)

And coding for a GPU would be the next step, after I get to grips with GMP, C++ and CUDA (necessary to talk to the GPU). So I am not sure how this journey (in my spare time) will develop, but I am still finding it interesting!

Kind regards
Michiel

MJansen 2021-10-19 19:30

@Dana: I tried GMP_nextprime (primorial value = 10000455041*3607#/210), and it took 39 seconds under Ubuntu, with GMP compiled with g++. A simple gcd wheel of 9699690 took only 18 seconds to find the gap end on the plus side (+26734), and around 42 secs for the minus side (-47298).

Dana's Perl module (prime utils under Ubuntu Perl) takes only 39 seconds to find both the plus and the minus side of the gap, yet it also uses the gmp_nextprime code. Not sure where the speed-up is coming from; maybe the datatype of n is important here. Will look into it further ...

If somebody has a cpu script that improves Dana's 39 seconds, I am interested ;-)


henryzz 2021-10-20 10:19

[QUOTE=MJansen;591074]I took a detour and have now installed Ubuntu under Windows (WSL2?) and am currently learning the basics of the GMP datatypes' eccentricities.[/QUOTE]

You may be aware of this already, but mpz_class (the C++ bindings for GMP) is quite a bit less clunky than mpz_t, although some functions require mpz_t (which can easily be extracted from an mpz_class).
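
A small example of the difference (assuming gmpxx is installed; the numbers are just for flavor, taken from the earlier timing test):

[CODE]
// Do arithmetic through mpz_class; reach for get_mpz_t() only where a
// C-level GMP function needs the raw mpz_t.
// Compile: g++ demo.cpp -lgmpxx -lgmp
#include <gmpxx.h>
#include <iostream>

int main() {
    mpz_class p, q, m("10000455041");
    mpz_primorial_ui(p.get_mpz_t(), 3607);        // 3607#: C API wants mpz_t
    mpz_class n = m * p / 210;                    // clean operator syntax
    mpz_nextprime(q.get_mpz_t(), n.get_mpz_t());  // C API again
    mpz_class gap = q - n;
    std::cout << "next prime is at +" << gap << "\n";
    return 0;
}
[/CODE]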

MJansen 2021-10-20 11:13

Hi Henryzz,

I saw that there are some samples with mpz_class, but I have so far only used the mpz_t datatype, after mixing up different types in the wrong places ;-) Thanks for the tip!

Kind regards
Michiel

Bobby Jacobs 2021-10-23 18:18

[QUOTE=MJansen;591074]
Hi Bobby,
not sure who you are reacting to.
[/QUOTE]

I was reacting to you. Since this thread used to be part of the "New Maximal Gaps" thread, I assumed that you were using this code to find maximal gaps greater than 2[SUP]64[/SUP].

MJansen 2021-10-30 21:25

I did an update on the conceptual gap search approach I want to take, and I am curious if there are other thoughts (discourse is the academic basis for improvement, after all):

1. pretest
Steps:
Use primorials for pretesting, or better, primorial intervals like:
pri1 = P(1)*P(2)*...*P(20),
pri2 = P(21)*P(22)*...*P(40),
etc.
Q: or P(1-100), P(101-200), ...? --> testing needed!

Determine the remainder R1: R1 = Multiplier*P(x)#/Divider % pri1.

After that, go through the interval [0, +2, +4, +6, etc.], and for each interval value not yet set to composite, determine if gcd(pri1, R1 + interval value) != 1; if so, set it to composite, else go to the next interval value (see the sketch after these steps).

Then determine the remainder R2 = M*P(x)#/D % pri2, loop through the interval, and for each interval value not already found composite, determine if gcd(pri2, R2 + interval value) != 1; if so, set it to composite, else go to the next interval value.
Etc., etc., till the desired depth of presieving is reached.

Note: Jens KA had a ballpark figure of max presieve prime = (d/2.5)^2.5 with d = log10(M*P(x)#/D); this was way too high according to my tests; I use 40,000*x if x > 250.
For M*P(504)#/210, JKA's formula would mean d = ca. 1500 and a max prime in the region of 2.31*10^17. In my approach it would mean stopping at the prime a little over 2*10^7. There is a trade-off, and the larger the x, the more interesting presieving becomes, of course.

After the presieve is done, check whether the number of remaining candidates in the interval is below a threshold count; if so, the chance of finding a large gap is bigger, so discard the denser intervals (time is too short to test all possible intervals anyway; look where the chances are higher).

Q: 1 or 2 intervals to pretest?
1 interval [start = -20*x, end = +20*x], later split into plus and minus for step 3, or
2 intervals, plus [0, +2, +4, +6, ..., +20*x] and minus [0, -2, -4, -6, ..., -20*x], pretested separately in two threads?

2. PRP (CPU PFGW, or GPU code to be written) the remaining candidates.
Q: I tend towards skipping the PRP step altogether, but some testing is needed!

3. Perform an extra-strong BPSW test on the remaining candidates, to avoid missing an even bigger gap.
If a PRP is found below 3*P(x), stop the search and move to the next Multiplier.

Q: how to get the GPU involved - for step 3 only? Or would it make sense to do some presieving or even PRP-ing on the GPU?
Q: does the GPU-CGBN implementation of the Miller-Rabin test result in strong enough PRPs?
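
As referenced in step 1, here is a host-side sketch of the gcd pre-sieve (my reading of the step, using GMP's C++ interface; all names are illustrative):

[CODE]
// pri1 is a block primorial, e.g. P(1)*...*P(20); center is M*P(x)#/D.
// Candidates are center + 2*j; composite[j] marks them crossed off.
// Compile: g++ presieve.cpp -lgmpxx -lgmp
#include <gmpxx.h>
#include <vector>

void gcd_presieve(const mpz_class &center, const mpz_class &pri1,
                  std::vector<bool> &composite) {
    mpz_class R1 = center % pri1;    // one big mod per primorial block
    for (size_t j = 0; j < composite.size(); j++) {
        if (composite[j]) continue;  // already hit by an earlier block
        mpz_class v = R1 + 2 * j;
        mpz_class g;
        mpz_gcd(g.get_mpz_t(), v.get_mpz_t(), pri1.get_mpz_t());
        if (g != 1) composite[j] = true;  // shares a factor with the block
    }
}
[/CODE]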

In general: I found Multipliers with remainder 1 mod 30 to be effective (as are multipliers with remainder 7, 11, 13, 17, 19, 23 or 29 mod 30) for finding bigger primes. Remainders 5 and 25 give more gaps over merit 20, but not the highest merits. I am running some extensive tests on the efficiency of the Multiplier (M) and Divider (D) that I will share when they are done.

Oh, and I am making a complete switch from Windows to Ubuntu C/C++ for gap searching at the moment. It is a lot faster, and after an initial steep learning curve it seems to get easier every day.

Kind regards
Michiel

CraigLo 2021-11-01 06:33

I use the GPU for everything except for some initialization and verifying large gaps.


I skip step 3 (BPSW). For numbers around 2[SUP]64[/SUP] there is only about a 1 in 1E12 chance of a candidate being a 2-SPRP. The chances are even lower for larger numbers.


I use a different sieving approach. I think what I use is the same as Seth's hybrid approach.


I sieve for the next interval at the same time I am doing prime checks. It is efficient to do a memory bandwidth limited operation at the same time as a computation limited operation. The latencies for the memory operations can be hidden by the computations in another kernel.

MJansen 2021-11-01 21:46

[QUOTE=CraigLo;592178]I use a different sieving approach. I think what I use is the same as Seth's hybrid approach.[/QUOTE]

I have to say I really like the way you and Seth are thinking! Not conventional, but exploring new angles. I am trying to wrap my head around the article Seth wrote regarding the hybrid approach, and I will probably have to ask him for a masterclass ;-) Especially about the specifics needed to make it applicable in code. I don't want to just copy, but to understand what is happening and why.

I am still not coding for the GPU yet, so a serious gap to close in practical and theoretical knowledge!

PS I will PM you to see how your sieve method compares to the treesieve Jens developed. That is the fastest sieving method I have come across yet.

Kind regards
Michiel

CraigLo 2021-11-02 15:43

Treesieve looks like it is useful when you want to compute C%p for the same C with a lot of different primes. I've never tried it, but it seems better suited to trial division than to sieving.
[URL]https://www.mersenneforum.org/showpost.php?p=152965&postcount=8[/URL]
Is there a description of tree sieve anywhere?



Here are the basics of my approach. Let's take an easy example and use A*11#/6.

With a divider of 6 we want A%6 to be 1 or 5. We choose one of them. I'll use 1, so A will be 1, 7, 13, ...


Find offsets from 1*11#/6 that can be prime after removing multiples of 2, 3, 5, 7, 11.
With 2 we eliminate ..., -5, -3, -1, 1, 3, 5, ...
With 3 we eliminate ..., -4, -1, 2, 5, ...
With 5 we eliminate ..., -10, -5, 0, 5, 10, ...
With 7 we eliminate ..., -14, -7, 0, 7, 14, ...
With 11 we eliminate ..., -22, -11, 0, 11, 22, ...
This leaves possible primes of ..., -8, -6, -2, 4, 6, 12, ...


Now sieve starting with 13 using each of the possible prime offsets.

With 13 and an offset of -2 we want to eliminate all cases where ((1 + 6*i)*11#/6 - 2)%13 = 0
The first i where this is true is 8.
We can add multiples of 13 to this and the results will still be divisible by 13.
So, for 13 and an offset of -2 we eliminate 8, 21, 34, ...
Repeat for all primes in our sieve and all offsets.
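
The "first i" can be computed directly with modular inverses instead of by scanning. A self-contained sketch (my code, valid for odd primes p < 2^32 that divide neither 6 nor K):

[CODE]
#include <stdint.h>

// Modular exponentiation; used for inverses via Fermat's little theorem.
// Safe without 128-bit math as long as p < 2^32.
static uint64_t powmod(uint64_t b, uint64_t e, uint64_t p) {
    uint64_t r = 1;
    b %= p;
    while (e) {
        if (e & 1) r = r * b % p;
        b = b * b % p;
        e >>= 1;
    }
    return r;
}

// First sieve index for prime p and offset off: solves
// (1 + 6*i)*K + off == 0 (mod p); the hits are then i, i+p, i+2p, ...
static uint64_t first_hit(uint64_t p, uint64_t K, int64_t off) {
    uint64_t o = (uint64_t)(((-off) % (int64_t)p + (int64_t)p) % (int64_t)p);
    uint64_t t = o * powmod(K % p, p - 2, p) % p;      // t = (1 + 6i) mod p
    return (t + p - 1) % p * powmod(6, p - 2, p) % p;  // i = (t - 1)/6 mod p
}
[/CODE]

For p = 13, K = 385 (= 11#/6) and off = -2, first_hit returns 8, matching the example above.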


When doing a search I pick a fixed number of bits for the prime numbers. I usually pick a multiple of 32 which is a little more efficient. I then choose a primorial so that I can run for a couple days without needing to change the number of bits in my primes.


When looking for merits >25 I use offsets up to +/- 20*merit.


In the GPU I use 2 sieving algorithms. The first uses small primes and works in shared memory. The second uses large primes and works in global memory. I have an idea for a third that I need to experiment with that would go in between these 2. I sieve as deeply (in multiplier) as memory will allow.


I've never tried optimizing this in a CPU but I would probably take the same approach. One algorithm for small primes where the sieve interval fits in cache and one algorithm for larger primes.

MJansen 2021-11-03 18:28

[QUOTE=CraigLo;592293]Treesieve looks like it is useful when you want to compute C%p for the same C with a lot of different primes. I've never tried it, but it seems like it is better for trial division than sieving.
[URL]https://www.mersenneforum.org/showpost.php?p=152965&postcount=8[/URL]
Is there a description of tree sieve anywhere?
[/QUOTE]

I do not have the code; Jens does (C code, with maybe some machine language incorporated, IIRC). I only have a compiled exe file for Windows and some idea of what it does. But to be clear: every gap search is basically an interval search and must deal with sieving the interval. Treesieve was developed for large gaps (say P(x)# with x > 250, but I think initially it was even for formats other than primorials).

In my words, how I understand it:
The idea is to set up a tree of nodes, where each node (= a product of more and more primes) is computed once and used to determine the remainder of the interval start (say IntStart = M*P(x)#/D - 25*P(x)) with respect to that node.

The tree has levels as Jens described in the link you gave, where the base level is made up of nodes consisting of the product of say 10 primes each.
Node(level=0,node=1)=P(1)*P(2)*P(3)*...*P(10)=6469693230
Node(level=0,node=2)=P(11)*P(12)*P(13)*...*P(20)
Node(level=0,node=3)=P(21)*P(22)*P(23)*...*P(30)
Node(level=0,node=4)=P(31)*P(32)*P(33)*...*P(40)

Level 1 has the multiplied nodes of level 0, say:
Node(level=1,node=1)=Node(0,1)*Node(0,2)
Node(level=1,node=2)=Node(0,3)*Node(0,4)

Level 2 has the multiplied nodes of level 1, say:
Node(level=2,node=1)=Node(1,1)*Node(1,2)

And assume Node(2,1) is on the order of sqrt(IntStart) (you have to build the tree with this in mind).

Now determine IntStart%Node(2,1) = R(2,1).
Use this remainder R(2,1) to determine:
R(1,1)=R(2,1)%Node(1,1)
R(1,2)=R(2,1)%Node(1,2)

And move to the lowest level:
R(0,1)=R(1,1)%Node(0,1)
R(0,2)=R(1,1)%Node(0,2)

R(0,3)=R(1,2)%Node(0,3)
R(0,4)=R(1,2)%Node(0,4)

Now determine the remainder for the individual primes (e.g. R(0,1)%3), so you can use that remainder to start crossing off candidates in the prime gap interval.
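
As a toy illustration of the reduction down the tree (my own sketch using GMP, with tiny two-level nodes of 5 primes each instead of Jens' 10, and an arbitrary IntStart):

[CODE]/* Remainder tree, toy sized: reduce IntStart once by the big top node,
   then reuse that remainder for the cheaper reductions below it. */
#include <gmp.h>
#include <stdio.h>

int main(void) {
    const unsigned long g1[] = {2, 3, 5, 7, 11};     /* level-0 node 1 */
    const unsigned long g2[] = {13, 17, 19, 23, 29}; /* level-0 node 2 */
    mpz_t n01, n02, n11, start, r11, r01, r02;
    mpz_inits(n01, n02, n11, start, r11, r01, r02, NULL);

    mpz_set_ui(n01, 1);
    mpz_set_ui(n02, 1);
    for (int i = 0; i < 5; i++) mpz_mul_ui(n01, n01, g1[i]);
    for (int i = 0; i < 5; i++) mpz_mul_ui(n02, n02, g2[i]);
    mpz_mul(n11, n01, n02);               /* level-1 node = product of children */

    /* Interval start; the value here is arbitrary for the illustration. */
    mpz_set_str(start, "123456789012345678901234567890", 10);

    mpz_mod(r11, start, n11);             /* one expensive reduction at the top */
    mpz_mod(r01, r11, n01);               /* cheap reductions toward the leaves */
    mpz_mod(r02, r11, n02);

    /* Finally reduce by the individual primes to get the sieve offsets. */
    for (int i = 0; i < 5; i++)
        printf("IntStart %% %lu = %lu\n", g1[i], mpz_fdiv_ui(r01, g1[i]));
    for (int i = 0; i < 5; i++)
        printf("IntStart %% %lu = %lu\n", g2[i], mpz_fdiv_ui(r02, g2[i]));

    mpz_clears(n01, n02, n11, start, r11, r01, r02, NULL);
    return 0;
}[/CODE]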

That is it in a nutshell; I hope it is clear. From experience I can say it works really well for big numbers.


[QUOTE=CraigLo;592293]Here are the basics of my approach. Let's take an easy example and use A*11#/6.

With a divider of 6 we want A%6 to be 1 or 5. We choose one of them. I'll use 1, so A will be 1, 7, 13, ...


Find offsets from 1*11#/6 that can be prime after removing multiples of 2, 3, 5, 7, 11.
With 2 we eliminate ..., -5, -3, -1, 1, 3, 5, ...
With 3 we eliminate ..., -4, -1, 2, 5, ...
With 5 we eliminate ..., -10, -5, 0, 5, 10, ...
With 7 we eliminate ..., -14, -7, 0, 7, 14, ...
With 11 we eliminate ..., -22, -11, 0, 11, 22, ...
This leaves possible primes of ..., -8, -6, -2, 4, 6, 12, ...


Now sieve starting with 13 using each of the possible prime offsets.

With 13 and an offset of -2 we want to eliminate all cases where ((1 + 6*i)*11#/6 - 2)%13 = 0
The first i where this is true is 8.
We can add multiples of 13 to this and the results will still be divisible by 13.
So, for 13 and an offset of -2 we eliminate 8, 21, 34, ...
Repeat for all primes in our sieve and all offsets.


When doing a search I pick a fixed number of bits for the prime numbers. I usually pick a multiple of 32 which is a little more efficient. I then choose a primorial so that I can run for a couple days without needing to change the number of bits in my primes.


When looking for merits >25 I use offsets up to +/- 20*merit.


In the GPU I use 2 sieving algorithms. The first uses small primes and works in shared memory. The second uses large primes and works in global memory. I have an idea for a third that I need to experiment with that would go in between these 2. I sieve as deeply (in multiplier) as memory will allow.


I've never tried optimizing this in a CPU but I would probably take the same approach. One algorithm for small primes where the sieve interval fits in cache and one algorithm for larger primes.[/QUOTE]

Your approach looks a lot like mine in the sense that I discard all even numbers while sieving. My code looks a little complex, but the sieving is more efficient (CPU only! Sieving seems like an ideal parallel process, so that should most definitely improve the speed of gap searches). My code is directly based on the Sieve of Eratosthenes: you can discard all odd values that match x^2 + 2kx.

By the by:
NP = x^2 + 2kx is the formula for all odd non-primes,
with x an odd whole number >= 3 and k a whole number >= 0.

Use graph.exe to show a graphic representation of that formula. Also nice: let graph.exe show the equation -NP = x^2 + 2kx (in the 4th quadrant, where k is negative). This shows where Fermat got his idea for NP = X^2 - Y^2. Based on the formula NP = x^2 + 2kx, you can be more exact:
NP + k^2 = x^2 + 2kx + k^2 = (x+k)^2
NP = (x+k)^2 - k^2.
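
That last rewrite is exactly Fermat's method: try a = ceil(sqrt(NP)), a+1, ... until a^2 - NP is a perfect square. A quick toy sketch (my own example number; compile with -lm):

[CODE]/* Fermat factorization via NP = (x+k)^2 - k^2: search a with a*a - N square. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const long N = 5959;                 /* odd composite, 59 * 101 */
    for (long a = (long)ceil(sqrt((double)N)); ; a++) {
        long b2 = a * a - N;
        long b = (long)sqrt((double)b2);
        if (b * b == b2) {               /* N = (a-b)(a+b) */
            printf("%ld = %ld * %ld\n", N, a - b, a + b);
            break;                       /* prints: 5959 = 59 * 101 */
        }
    }
    return 0;
}[/CODE]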

Another way of coming to this formula is:
NP = x * y (x and y whole numbers >= 3)
Say x <= y, with x and y both odd; then y = x + 2k with k a whole number >= 0
Then NP = x (x + 2k) = x^2 + 2kx

Note:
The formula for the non-primes (NP = x^2 + 2kx) moves asymptotically to k = -1/2x, and since that is the place where no non-prime will ever be, it seems like a good place to look for primes in the imaginary mathematical world.
My point: the Riemann critical line is an intrinsic property of the primes.
The better question is: what is the relation between the Riemann zeros and the primes themselves? Can they be used to determine the actual primes?

If anybody gets inspired and proves the Riemann Hypothesis by this, at least mention me in your proof. A nice share of the prize money, in the order of USD 100,000, would do nicely as well!

Kind regards
Michiel

CraigLo 2021-11-04 06:26

The advantage of my approach is that you can do much better than the Sieve of Eratosthenes. When using a wheel with the first 250 primes, only about 1 in 13 numbers can be prime. It's even better than this when using deficient primorials. My sieving method only has to sieve numbers which aren't already eliminated by the primorial. A Sieve of Eratosthenes that skips only the even numbers still has to handle every odd number, so you end up having to remove about 8 times more elements.
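
You can check the 1 in 13 figure by multiplying out (1 - 1/p) over the first 250 primes. A throwaway sketch:

[CODE]/* Fraction of numbers surviving a wheel built from the first 250 primes. */
#include <stdio.h>

int main(void) {
    double density = 1.0;
    int count = 0;
    for (int n = 2; count < 250; n++) {
        int is_prime = 1;
        for (int d = 2; d * d <= n; d++)
            if (n % d == 0) { is_prime = 0; break; }
        if (is_prime) { density *= 1.0 - 1.0 / n; count++; }
    }
    printf("surviving fraction = %.4f (about 1 in %.1f)\n",
           density, 1.0 / density);      /* about 0.076, i.e. 1 in 13 */
    return 0;
}[/CODE]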

I think you are sieving for one multiplier at a time. Are you re-computing the mods for each multiplier? The mods only need to be computed once. If I was going to sieve one multiplier at a time I would use the following:
Compute A*P#/Divider%p, for each prime p
Compute P#%p, for each prime p and use this to adjust the starting position from one multiplier to the next.

Using the example from before: A*11#/6, A = 1, 7, 13, ...
with p = 13
1*11#/6%13 = 8, so for A = 1 we can remove offsets 5, 18, 31, 44, ... and -8, -21, -34, ...

11#%13 = 9, so each time we move to the next multiplier we just need to add 4 to the previous numbers.

For 7*11#/6 we start with 9, (7*11#/6 + 9)%13 = 0
For 13*11#/6 we start with 13
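
In code the whole update is an add and a wrap. A small sketch with the example numbers (the real thing would keep one such start per prime and offset):

[CODE]/* Incremental start-offset update: 11# = 2310, divider 6 (11#/6 = 385), p = 13. */
#include <stdio.h>

int main(void) {
    const long p = 13;
    const long rem  = 385 % p;           /* 1*11#/6 % 13 = 8 */
    const long step = 2310 % p;          /* 11# % 13 = 9 */
    long start = (p - rem) % p;          /* first offset hit for A = 1: 5 */

    for (long A = 1; A <= 13; A += 6) {
        printf("A = %2ld: offsets %ld, %ld, ... hit multiples of 13\n",
               A, start, start + p);
        start = (start + p - step) % p;  /* add 4 (mod 13) for the next multiplier */
    }
    return 0;                            /* starts: 5, 9, 0 (0 = 13 mod 13) */
}[/CODE]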

MJansen 2021-11-04 07:57

[QUOTE=CraigLo;592397]The advantage of my approach is that you can do much better than the Sieve of Eratosthenes. When using a wheel with the first 250 primes, only about 1 in 13 numbers can be prime. It's even better than this when using deficient primorials. My sieving method only has to sieve numbers which aren't already eliminated by the primorial. A Sieve of Eratosthenes that skips only the even numbers still has to handle every odd number, so you end up having to remove about 8 times more elements.

I think you are sieving for one multiplier at a time. Are you re-computing the mods for each multiplier? The mods only need to be computed once. If I was going to sieve one multiplier at a time I would use the following:
Compute A*P#/Divider%p, for each prime p
Compute P#%p, for each prime p and use this to adjust the starting position from one multiplier to the next.

Using the example from before: A*11#/6, A = 1, 7, 13, ...
with p = 13
1*11#/6%13 = 8, so for A = 1 we can remove offsets 5, 18, 31, 44, ... and -8, -21, -34, ...

11#%13 = 9, so each time we move to the next multiplier we just need to add 4 to the previous numbers.

For 7*11#/6 we start with 9, (7*11#/6 + 9)%13 = 0
For 13*11#/6 we start with 13[/QUOTE]

Hi Craig,
you are correct. I do make too many calculations atm, mainly because I do not keep to one divider and also because I do not search one primorial at a time, but go through a wide range of them. The optimizations you (and Seth) are using make sense if the search is more limited in P(x)# and D. You find many more 30+ gaps than I do, so you guys must be doing something right ;-)

This is also why I like the article (and the insights) of Seth. I will incorporate those ideas into my new approach. There are some other ideas I would like to explore; one has to do with a more efficient number representation:
there are 8 candidates that remain in a wheel of 30. You can represent them as a matrix of 8 by N elements. The 3rd element (residue 11) of byte 11, counting from 0, represents 11 + 11*30 = 341.

(Note: the 8 candidates, represented as 0 or 1, make up one byte.) Sieve each bit separately in an interval and you have a better parallel sieving method that I think can work on a GPU; see the sketch below.
Changing to a wheel of 210 gives 48 candidates, but I am under the impression that an SM has 64 cores, so what to do with the remaining 16 cores?
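
A toy version of that byte layout (my sketch; for brevity it finds composites by trial division per candidate instead of striding a real sieve):

[CODE]/* Wheel-30 byte layout: one byte covers 30 numbers, one bit per residue.
   Bit 2 (residue 11) of byte 11 stands for 11 + 11*30 = 341. */
#include <stdio.h>
#include <string.h>

static const int res[8] = {1, 7, 11, 13, 17, 19, 23, 29};

int main(void) {
    enum { BYTES = 12 };                  /* covers 0 .. 359 */
    unsigned char sieve[BYTES];
    memset(sieve, 0xFF, sizeof sieve);    /* every candidate starts as possible */

    for (int b = 0; b < BYTES; b++)
        for (int i = 0; i < 8; i++) {
            int n = 30 * b + res[i];
            for (int d = 7; d * d <= n; d += 2)   /* 2, 3, 5 handled by the wheel */
                if (n % d == 0) { sieve[b] &= (unsigned char)~(1 << i); break; }
        }

    /* 341 = 11*31, so bit 2 of byte 11 should now be cleared. */
    printf("341 is %s\n", (sieve[11] >> 2) & 1 ? "a candidate" : "composite");
    return 0;
}[/CODE]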

Anywayz, interesting times!

And one more thing about NP = x^2 + 2kx:
if k = -1/2x + 1/2, then NP = x^2 + 2x(-1/2x + 1/2) = x
and if k = -1/2x - 1/2, then NP = x^2 + 2x(-1/2x - 1/2) = -x.
Here you will find the trivial zeros.

CraigLo 2021-11-04 14:58

I always wondered why Seth claimed a factor of 10000 improvement. I would have expected less than 10. Re-computing the mods probably explains it.

In my sieve I use 1 bit per multiplier per possible prime offset so memory use is as efficient as possible.

There are a couple of ways to parallelize. You can have each threadblock work on a different set of possible prime offsets. You can have each threadblock work with a different set of primes. There might be others.
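
For the first split, a sketch of what one kernel could look like (layout and table names are assumptions, not a description of my code): d_first would be indexed [offset][prime] and hold the first multiplier i at which that offset is divisible by that prime, with one bit per multiplier marking composites.

[CODE]__global__ void sieve_by_offset(unsigned int *d_bits, int n_mult,
                                const int *d_primes, const int *d_first,
                                int n_primes)
{
    // Each threadblock owns the bit array of one possible-prime offset.
    unsigned int *bits = d_bits + (size_t)blockIdx.x * (n_mult / 32);

    // Threads of the block stride over the sieve primes.
    for (int j = threadIdx.x; j < n_primes; j += blockDim.x) {
        int p = d_primes[j];
        for (int i = d_first[blockIdx.x * n_primes + j]; i < n_mult; i += p)
            atomicOr(&bits[i / 32], 1u << (i % 32));   // mark multiplier composite
    }
}
// launch: sieve_by_offset<<<n_offsets, 256>>>(...), with n_mult a multiple of 32[/CODE]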

MJansen 2021-11-04 18:35

[QUOTE=CraigLo;592416]I always wondered why Seth claimed a factor of 10000 improvement. I would have expected less than 10. Re-computing the mods probably explains it.

In my sieve I use 1 bit per multiplier per possible prime offset so memory use is as efficient as possible.

There are a couple of ways to parallelize. You can have each threadblock work on a different set of possible prime offsets. You can have each threadblock work with a different set of primes. There might be others.[/QUOTE]

Hi Craig,

it is not so much about the storage of the (in essence boolean) candidate-or-composite value, but about sieving on more GPU cores at the same time. I don't know how to make this clear yet; I will let it sink in and stew for a while, do some tests, and come back on the sieving part. Maybe a varied approach is in order: one for fixed P(x)# and divider, and another for broader searches. Thanks for thinking along and sharing! Appreciated

Kind regards
Michiel

