#1
Aug 2002
1101102 Posts
It looks like the llrCUDA thread has not had much activity in years.
Did it or another program work out to be viable for Proth-style numbers, k*2^n±1?
#2
"Serge"
Mar 2008
San Diego, Calif.
32×7×163 Posts
It is viable but relatively slow (relative to other possible uses for a GPU).
Think about it this way: would you use llrCUDA on a GPU for a month to find one prime, or use another GPU app (searching for primes of other forms) and find five of the same size?
#3
Aug 2002
668 Posts
Good points, although a GPU sieve for specialized prime chains is unlikely as well :(
#4
Sep 2006
The Netherlands
3×269 Posts
Quote:
Status: busy improving its speed. Intention: for k * 2^n - 1, the first goal is to sieve a single k <= 32 bits over a domain of n larger than PG usually does. A range of 10M for the n's shouldn't be big trouble, is the expectation. First incarnations will only sieve a single k. Maybe later on a few more kernels for sieving multiple k's, as those are easy to add. Sieving a couple of k's should definitely be faster, yet if you really want to sieve thousands of different k's at the same time, we'll have to see how interesting the GPU is for that.

Note that I currently target Fermi-generation GPUs. I bought myself a GTX 580 for 30 euro second hand. If it works well on that GPU, newer generations should obviously be fast as well (though I don't know for sure about the 600 series). We'll have to see how fast Kepler is for larger ranges: it has little cache per block, as it has more stream cores (CUDA cores) in each SM, 32 for Fermi versus 192 for Kepler, while both have 48KB of L1 data cache available (which NVIDIA confusingly calls 'shared memory').
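As a rough illustration of the kind of sieve the quote describes (a single fixed k, Riesel side k*2^n - 1, over a range of n), here is a minimal CPU sketch in Python. All names are mine, not from the code under discussion, and the discrete-log step is brute-forced over one period of 2 mod p; a real sieve would use BSGS there.

```python
# Hypothetical CPU sketch of the sieve described above: one fixed k
# (Riesel side, k*2^n - 1) over a range of n.

def small_primes(limit):
    """Sieve of Eratosthenes: all primes p <= limit."""
    flags = bytearray([1]) * (limit + 1)
    flags[0:2] = b"\x00\x00"
    for p in range(2, int(limit ** 0.5) + 1):
        if flags[p]:
            flags[p * p :: p] = bytearray(len(flags[p * p :: p]))
    return [p for p in range(2, limit + 1) if flags[p]]

def sieve_riesel(k, n_lo, n_hi, p_limit):
    """Candidates n in [n_lo, n_hi] such that no prime p <= p_limit
    divides k*2^n - 1 (except when k*2^n - 1 is p itself)."""
    survivors = set(range(n_lo, n_hi + 1))
    for p in small_primes(p_limit):
        if k % p == 0:
            continue
        inv_k = pow(k, p - 2, p)  # k^(-1) mod p by Fermat's little theorem
        # find the first exponent n0 with 2^n0 == inv_k, and the order of 2
        n0 = order = None
        x = 1
        for e in range(1, p):
            x = x * 2 % p
            if n0 is None and x == inv_k:
                n0 = e
            if x == 1:
                order = e
                break
        if n0 is None:
            continue  # 2^n never hits k^(-1), so p divides no k*2^n - 1
        # p divides k*2^n - 1 exactly when n == n0 (mod order)
        start = n_lo + (n0 - n_lo) % order
        for n in range(start, n_hi + 1, order):
            if k * (1 << n) - 1 != p:  # never strike out p itself
                survivors.discard(n)
    return sorted(survivors)
```

For example, with k=3 and n from 1 to 6, only n=5 is struck out by primes up to 20 (3*2^5 - 1 = 95 = 5·19); the other values are 5, 11, 23, 47, 191.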
#5
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
3·23·89 Posts
Quote:
#6
Sep 2006
The Netherlands
3×269 Posts
Quote:
Actually they have less memory per core nowadays. Fermi has, in each multiprocessor of 32 stream cores (nowadays called CUDA cores), 64KB of L1 which gets split between instruction cache and data cache. If you manage to get your program small enough that it fits in the 16KB L1 instruction cache, you can use the remaining 48KB. That, however, has to be divided between 2 warps of 32 cores, so you actually have 24KB per warp. I would call such a warp a 'thread', as it executes the same code over all cores at more or less the same time; yet NVIDIA has the artificial definition that whatever runs on 1 core is called a thread. Actually nothing runs there, as it's a vector processor of course. It runs on the SM (the streaming multiprocessor), which you can see as a core.

The 6xx series has 192 stream cores in each SM, yet also 48KB of L1 data cache. A faster cache, which will surely boost things, but each warp of 32 stream cores gets a lot less of it. To make things weird, the manual says it can only do 4 hardware warps: 4 × 32 = 128 cores, as if 64 cores go unused. Let's assume 4 warps anyway: 48 / 4 = 12KB. Realize that with 64-bit numbers this is 1536 quadwords, or in my case 3072 integers.

To make things worse, it is not normal L1 data cache. The L1 is internally split up into 32 memory banks working in parallel. In itself that is genius, yet if 2 cores read from the same memory bank at the same time, you get a memory bank conflict. Measurements here indicate you really don't want that; it's duck slow, turtle slow even. So the current implementation needs a rewrite in order to generate fewer bank conflicts; when I test it, it is simply too slow. Fully avoiding them is impossible. The current algorithm uses a hash table with a linked list, and I'm using a 32-bit hash. That's simply not a very efficient use of the huge calculation strength the GPU has.

A new implementation will change that to a huge bit table and then use 2 passes to reduce the number of baby steps and giant steps until we have nearly nothing left. It allows more bits per hash key as well. Took me a while to figure that one out, though it is a trivial algorithm. We'll see whether it speeds things up.
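For what it's worth, the baby-step giant-step search the post refers to looks roughly like this on a CPU; a plain Python dict plays the role of the hash table, which is exactly the lookup structure that causes the bank conflicts described above when placed in the GPU's shared memory. This is a generic BSGS sketch, not the poster's implementation.

```python
import math

def bsgs_log2(a, p):
    """Baby-step giant-step: least n with 2^n == a (mod p), or None
    if a is not a power of 2 mod p.  p must be an odd prime."""
    m = math.isqrt(p - 1) + 1
    baby = {}
    x = 1
    for j in range(m):                 # baby steps: store 2^j -> j
        baby.setdefault(x, j)
        x = x * 2 % p
    factor = pow(2, (p - 2) * m, p)    # (2^(-1))^m == 2^(-m) (mod p)
    gamma = a % p
    for i in range(m):                 # giant steps: a * 2^(-i*m)
        if gamma in baby:
            return i * m + baby[gamma]
        gamma = gamma * factor % p
    return None
```

For the sieve, `a` would be k^(p-2) mod p, and a hit means p divides k*2^n - 1; a `None` means p divides no member of the sequence.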
#7
Banned
"Luigi"
Aug 2002
Team Italia
5·7·139 Posts
Quote:
#8
Sep 2006
The Netherlands
807 Posts
Right now: Riesel.
I would need some help with how to solve Proth in that case. To see whether a small prime p divides a Riesel number k * 2^n - 1, solve 2^n == k^(-1) == k^(p-2) (mod p), then use the BSGS algorithm to find the correct n (if it exists). How to solve it for Proth I don't know. Anyone?
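In code, the Riesel test described here is just a comparison of two modular powers (a quick sanity check of the identity, not the sieve itself; the function name is mine):

```python
def divides_riesel(k, n, p):
    """True iff odd prime p (not dividing k) divides k*2^n - 1:
    k*2^n == 1 (mod p)  <=>  2^n == k^(-1) == k^(p-2) (mod p)."""
    return pow(2, n, p) == pow(k, p - 2, p)

# 3*2^5 - 1 = 95 = 5 * 19, while 3*2^4 - 1 = 47 is not divisible by 5
print(divides_riesel(3, 5, 5), divides_riesel(3, 5, 19), divides_riesel(3, 4, 5))
# prints: True True False
```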
#9
Sep 2006
The Netherlands
3×269 Posts
We get: k * 2^n + 1 == 0 (mod p) <=> 2^n == -1/k (mod p) <=> 2^n == (-k)^(p-2) == -k^(p-2) (mod p) <=> 2^n == (p-k)^(p-2) (mod p).
Looks correct?
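The chain checks out for odd p, since p-2 is odd and so (-k)^(p-2) == -k^(p-2) (mod p). A quick numerical sanity check (function name is mine):

```python
def divides_proth(k, n, p):
    """True iff odd prime p (not dividing k) divides k*2^n + 1,
    via the derivation above: 2^n == (p-k)^(p-2) (mod p)."""
    return pow(2, n, p) == pow((p - k) % p, p - 2, p)

# 3*2^3 + 1 = 25 = 5^2 and 3*2^2 + 1 = 13, but 5 does not divide 3*2^1 + 1 = 7
print(divides_proth(3, 3, 5), divides_proth(3, 2, 13), divides_proth(3, 1, 5))
# prints: True True False
```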
#10
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
3×23×89 Posts
I am fairly sure there is a formula which covers all c. Have a look at the srsieve source and you should find it.
#11
Aug 2002
2·33 Posts
Nice work, and it sounds a bit complex dealing with the NVIDIA GPU.
Any chance this would work for fixed n and a large k range, k * 2^n±1? Or a small n range with a large k range, say k ranging up to 4 billion? My Cunningham chain projects are still using NewPGen (485 MB bitmap limit) and I would love to have another option.
Similar Threads

| Thread | Thread Starter | Forum | Replies | Last Post |
|---|---|---|---|---|
| Why have ECM testing for known non-prime Mersenne numbers? | sd235 | Information & Answers | 12 | 2018-12-06 17:56 |
| Testing my numbers | Vicodin | Information & Answers | 21 | 2017-05-02 05:11 |
| Small Request for Proth Numbers | c10ck3r | Programming | 6 | 2013-11-28 03:18 |
| Can gmp-ecm be optimised for Proth numbers? | geoff | GMP-ECM | 2 | 2008-12-17 13:28 |
| testing big numbers | sagan_fan | Math | 8 | 2002-10-09 21:20 |