2016-07-30, 18:03  #1 
Aug 2002
2·3^{3} Posts 
Nvidia GPU for PRP testing Proth numbers?
It looks like the llrCUDA thread has not had much activity in years.
Did it, or another program, work out to be viable for Proth-style numbers, k*2^n+1? 
2016-07-30, 22:08  #2 
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2
2×3^{3}×13^{2} Posts 
It is viable but relatively slow. (Relative to other possible uses for a GPU.)
Think about it this way: would you use llrCUDA on a GPU for a month to find one prime, or use another GPU app (searching for primes of other forms) and find five of the same size? 
2016-07-31, 04:23  #3 
Aug 2002
110110_{2} Posts 
Good points, although a GPU sieve for specialized prime chains is unlikely as well :(

2016-08-01, 00:08  #4  
Sep 2006
The Netherlands
677 Posts 
Quote:
status: busy improving its speed.

intention: In k * 2^n - 1, the first goal is to sieve a single k <= 32 bits over a domain of n bigger than PG usually is doing. A range of 10M for the n's shouldn't be big trouble, is the expectation. First incarnations will only sieve a single k. Maybe later on a few more kernels for sieving multiple k's, as those are so easy to add. Sieving a couple of k's definitely should be faster, yet if you really want to sieve thousands of different k's at the same time, we'll have to see how interesting the GPU is for that.

Note I do target right now the Fermi generation of GPUs. Bought myself a GTX 580 for 30 euro second hand. If it works great on that GPU, obviously newer generations of GPUs should be fast as well (though I don't know for sure about the 600 series). Though we have to see how fast Kepler is gonna be for larger ranges: it just has less cache per block, as it has more streamcores (CUDA cores) in each SM, 32 for Fermi versus 192 for Kepler. Both have 48KB of L1 data cache available (which Nvidia confusingly calls 'shared memory'). 

2016-08-01, 09:15  #5  
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
2·2,861 Posts 
Quote:


2016-08-01, 10:41  #6  
Sep 2006
The Netherlands
677 Posts 
Quote:
Actually they have less memory per core nowadays. The Fermi has, in each multiprocessor with 32 streamcores (nowadays called CUDA cores), 64KB of L1 which gets split into instruction and data cache. If you manage to get your program small enough that it fits in the 16KB L1 instruction cache, you can use the remaining 48KB. That, however, has to be divided over 2 warps of 32 cores, so you actually have 24KB for each warp. I would call such a warp a 'thread', as it executes the same code over all cores at more or less the same time. Yet they have some artificial definition where something that runs on 1 core is called a thread. Actually nothing runs there, as it's a vector processor of course; it runs on the SM (the streaming multiprocessor), which you can see as a core.

The 6xx series has 192 streamcores in each SM yet also 48KB of L1 data cache. A faster cache, which sure will boost things, but each warp of 32 streamcores has a lot less of it. To make things weird, the manual says it can do just 4 hardware warps: 4 x 32 = 128 cores, as if 64 cores go unused. Yet let's assume 4 warps: 48 / 4 = 12KB. Realize that with 64-bit numbers this is 1536 quadwords you have, or in my case 3072 integers.

To make things worse, it is not normal L1 data cache. The L1 internally is split up into 32 memory banks, working in parallel. In itself that is genius, yet if 2 cores read from the same memory bank at the same time you get a memory bank conflict. Measurements here indicate you really don't want that. It's duck slow. Turtle slow, even. So the current implementation needs a rewrite in order to generate fewer bank conflicts; when I test it, it is simply too slow. Fully avoiding them is impossible. The current algorithm uses a hashtable with a linked list, and I'm using a 32-bit hash. That's simply not very efficient use of the huge calculation strength the GPU has.

A new implementation will change that to a huge bit table and then use 2 passes to reduce the number of baby steps and giant steps until we have nearly nothing left. It allows more bits per hash key as well. Took me a while to figure that one out, though it is a trivial algorithm. We'll see whether it will speed things up. 
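The hashtable-based baby-step giant-step search the post describes can be sketched in plain Python. This is an illustrative sketch only (a dict stands in for the hashtable-with-linked-list; names are invented, nothing here is taken from the actual GPU code):

```python
# Baby-step giant-step (BSGS): find n with 2^n ≡ target (mod p).
# A Python dict plays the role of the baby-step hash table.

from math import isqrt

def bsgs_log2(target, p):
    """Return n with pow(2, n, p) == target % p, or None if no such n exists."""
    m = isqrt(p) + 1                       # step size, m^2 >= p
    # Baby steps: store 2^j mod p for j = 0..m-1 in a hash table.
    table = {pow(2, j, p): j for j in range(m)}
    # Giant steps: multiply target by 2^(-m) repeatedly, look for a table hit.
    factor = pow(2, -m, p)                 # modular inverse of 2^m (Python 3.8+)
    gamma = target % p
    for i in range(m):
        if gamma in table:
            return i * m + table[gamma]
        gamma = gamma * factor % p
    return None
```

On a GPU the table lookups are exactly where the bank conflicts described above bite, since many cores probe the shared-memory hash table at once.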

2016-08-01, 11:53  #7  
Banned
"Luigi"
Aug 2002
Team Italia
2×2,383 Posts 
Quote:


2016-08-01, 12:19  #8 
Sep 2006
The Netherlands
677_{10} Posts 
Right now: Riesel.
Would need some help with how to solve Proth in that case. To see if a small prime p divides a Riesel number k * 2^n - 1: compute k^(p-2) == the inverse of k (mod p), then try to find the 2^n (mod p) that matches it, using the BSGS algorithm to find the correct n (if it exists). How to solve it for Proth I don't know. Anyone? 
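A minimal sketch of the Riesel check just described, assuming p is an odd prime not dividing k (brute force over n stands in for the BSGS step here; the function name is invented for illustration):

```python
# p divides k*2^n - 1 exactly when 2^n ≡ k^(-1) (mod p),
# and k^(-1) ≡ k^(p-2) (mod p) by Fermat's little theorem.

def riesel_divisor_ns(k, p, n_max):
    """All n in [1, n_max] for which the prime p divides k*2**n - 1."""
    inv_k = pow(k, p - 2, p)          # inverse of k mod p via Fermat
    return [n for n in range(1, n_max + 1) if pow(2, n, p) == inv_k]

# Spot check: every reported n really gives a multiple of p.
for n in riesel_divisor_ns(5, 13, 50):
    assert (5 * 2**n - 1) % 13 == 0
```

Since 2^n mod p is periodic, the hits form an arithmetic progression in n, which is what makes a single BSGS solve per prime enough for a whole n-range.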
2016-08-01, 12:39  #9 
Sep 2006
The Netherlands
677 Posts 
We get to: k * 2^n + 1 == 0 (mod p) <==> 2^n == -1/k (mod p) <==> 2^n == (-k)^(p-2) == -(k^(p-2)) <==> 2^n == (p-k)^(p-2) (mod p)
Looks correct? 
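The chain can be spot-checked numerically with small, made-up values of k and p (a sketch only, not anything from an actual sieve):

```python
# If the odd prime p divides k*2^n + 1, then
# 2^n ≡ -1/k ≡ (p-k)^(p-2) (mod p).

def proth_target(k, p):
    """The residue that 2**n must hit mod p, i.e. (p-k)**(p-2) mod p."""
    return pow(p - k, p - 2, p)

# k=3, p=13: whenever 13 divides 3*2^n + 1, 2^n must equal proth_target(3, 13).
for n in range(1, 100):
    if (3 * 2**n + 1) % 13 == 0:
        assert pow(2, n, 13) == proth_target(3, 13)
```

The last step of the chain works because p - 2 is odd, so (-k)^(p-2) == -(k^(p-2)) == (p-k)^(p-2) (mod p).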
2016-08-01, 12:58  #10 
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
13132_{8} Posts 
I am fairly sure there should be a formula which covers all c. Have a look at the srsieve source and you should find it.

2016-08-02, 01:57  #11 
Aug 2002
54_{10} Posts 
Nice work, and it sounds a bit complex dealing with the Nvidia GPU.
Any chance this would work for fixed n and a large k range, k * 2^n + 1? Or a small n range with a large k range, say k ranging up to 4 billion? My projects with Cunningham primes are still using NewPGen (485 MB bitmap limit) and I would love to have another option. 
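For the fixed-n case asked about here, no discrete log is needed at all: with n fixed, k * 2^n + 1 ≡ 0 (mod p) solves directly as k ≡ -2^(-n) (mod p), so each odd prime just strikes out an arithmetic progression of k's. A hedged sketch under that assumption (the function name is invented, and only odd primes are allowed):

```python
# Fixed n, large k range: for each odd prime p, the bad k's are
# k ≡ -2^(-n) (mod p), a simple arithmetic progression.

def sieve_fixed_n(n, k_max, primes):
    """Return k in [1, k_max] with k*2**n + 1 not divisible by any given odd prime."""
    survivors = [True] * (k_max + 1)
    for p in primes:
        k0 = (-pow(2, -n, p)) % p     # smallest k >= 1 with p | k*2^n + 1
        for k in range(k0, k_max + 1, p):
            survivors[k] = False
    return [k for k in range(1, k_max + 1) if survivors[k]]
```

Because each prime touches only every p-th k, memory for the survivor bitmap is the main cost, which is the same pressure NewPGen's 485 MB bitmap limit reflects.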