mersenneforum.org  

Old 2016-07-30, 18:03   #1
Angular
 
Aug 2002

2×3³ Posts
Nvidia GPU for PRP testing Proth numbers?

It looks like the llrCUDA thread has not had much activity in years.

Did it, or another program, work out to be viable for Proth-style numbers, k*2^n±1?
Old 2016-07-30, 22:08   #2
Batalov
 
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

9,127 Posts

It is viable but relatively slow. (Relative to other possible uses for a GPU.)

Think about it this way - would you use llrCUDA on a GPU for a month to find one prime, or use another GPU app (searching for primes of other forms) and find five of the same size?
Old 2016-07-31, 04:23   #3
Angular
 
Aug 2002

2·3³ Posts

Good points, although a GPU sieve for specialized prime chains seems unlikely as well :(
Old 2016-08-01, 00:08   #4
diep
 
Sep 2006
The Netherlands

1010100101₂ Posts

Quote:
Originally Posted by Angular
Good points, although a GPU sieve for specialized prime chains seems unlikely as well :(
Busy with a Riesel sieve in CUDA as we speak :)

Status: busy improving its speed.
Intention: in k * 2^n - 1, the first goal is to sieve a single k <= 32 bits over a domain of n larger than PrimeGrid usually does.

The expectation is that a range of 10M for the n's shouldn't be big trouble.

First incarnations will only sieve a single k. Maybe later on a few more kernels for sieving multiple k's, as those are easy to add. Sieving a couple of k's should definitely be faster, yet if you really want to sieve thousands of different k's at the same time, we'll have to see how interesting the GPU is for that.

Note that right now I target the Fermi generation of GPUs. Bought myself a GTX 580 for 30 euro second hand. If it works well on that GPU, newer generations obviously should be fast as well (though I don't know for sure about the 600 series). We still have to see how fast Kepler is going to be for larger ranges - it just has little cache per block, as it has more stream cores (CUDA cores) in each SM: 32 for Fermi versus 192 for Kepler. Both have 48KB of L1 data cache available (which Nvidia confusingly calls 'shared memory').
Old 2016-08-01, 09:15   #5
henryzz
Just call me Henry
 
"David"
Sep 2007
Cambridge (GMT/BST)

2·2,861 Posts

Quote:
Originally Posted by diep
Busy with a Riesel sieve in CUDA as we speak :) […]
BSGS? rogue commented a while back that memory would be an issue. Of course cards these days have a bit more memory, but not enough to sieve more than a very limited number of k's, from what I recall.
Old 2016-08-01, 10:41   #6
diep
 
Sep 2006
The Netherlands

677₁₀ Posts

Quote:
Originally Posted by henryzz
BSGS? rogue commented a while back that memory would be an issue. Of course cards these days have a bit more memory, but not enough to sieve more than a very limited number of k's, from what I recall.
Yes, BSGS gets used.

Actually they have less memory per core nowadays. Fermi has, in each multiprocessor of 32 stream cores (nowadays called CUDA cores), 64KB of L1 which gets split into instruction cache and data cache. If you manage to get your program small enough that it fits in the 16KB L1 instruction cache, you can use the remaining 48KB.

That however has to be divided between 2 warps of 32 cores, so you actually have 24KB for each warp. I would call such a warp a 'thread', as it executes the same code over all cores more or less in lockstep.

Yet they have the somewhat artificial definition that whatever runs on 1 core is called a thread. Actually nothing runs there, as it's a vector processor of course; things run on the SM (the streaming multiprocessor), which you can see as a core.

The 6xx series has 192 stream cores in each SM, yet also 48KB of L1 data cache - a faster cache, which surely will boost things. So each warp of 32 stream cores has a lot less. To make things weird, the manual says it can only run 4 hardware warps.

4 × 32 = 128 cores - as if 64 cores on it go unused.

Yet let's assume 4 warps: 48 / 4 = 12KB. Realize that with 64-bit numbers this gives you 1536 quadwords - or in my case, 3072 integers.

To make things worse, it is not a normal L1 data cache. Internally the L1 is split up into 32 memory banks working in parallel. In itself that is genius, yet if 2 cores read from the same memory bank at the same time, you get a memory bank conflict. Measurements here indicate you really don't want that. It's dead slow. Turtle slow, even.

So the current implementation needs a rewrite in order to generate fewer bank conflicts. When I test it, it is simply too slow.

Fully avoiding them is impossible.
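
To make the bank-conflict point concrete, here is a minimal hypothetical kernel (my illustration, not diep's sieve code) contrasting a conflict-free pattern, a worst-case 32-way conflict, and the usual padding trick:

Code:
// Hypothetical illustration of Fermi shared-memory banking (not diep's code).
// Shared memory has 32 banks of 4-byte words: word i lives in bank i % 32,
// and two lanes of a warp touching different words in one bank serialize.
__global__ void bank_demo(unsigned *out)
{
    __shared__ unsigned table[32 * 33];   // rows padded from 32 to 33 words
    int lane = threadIdx.x % 32;

    // Fill the table so the reads below are well defined.
    for (int i = threadIdx.x; i < 32 * 33; i += blockDim.x)
        table[i] = i;
    __syncthreads();

    unsigned a = table[lane];        // conflict-free: 32 consecutive words, 32 banks
    unsigned b = table[lane * 32];   // 32-way conflict: every lane hits bank 0
    unsigned c = table[lane * 33];   // padded stride: lane*33 % 32 == lane, no conflict
    out[threadIdx.x] = a + b + c;
}

(Reads where all lanes hit the same word are broadcast and cost nothing extra; it is the different-words-in-one-bank case that serializes.)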

The current algorithm uses a hashtable with a linked list, and I'm using a 32-bit hash. That simply doesn't make very efficient use of the huge calculation strength the GPU has.

A new implementation will change that to a huge bit table, and then use 2 passes to reduce the number of baby steps and giant steps until we have nearly nothing left. It allows more bits per hash key as well.

Took me a while to figure that one out - though it is a trivial algorithm.

We'll see whether it will speed it up.
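
A rough sketch of how such a two-pass bit-table filter could look (my reading of the description above; the table size and hash are assumptions, not diep's actual code):

Code:
/* Hypothetical sketch of the two-pass bit-table filter. Marking and probing
   are one bit operation each - no pointers to chase as with the old
   hashtable of linked lists. The table can give false positives (weeded out
   by the exact second pass) but never false negatives, so no n is missed. */
#include <string.h>

#define HASH_BITS 24                       /* assumed: 2^24 bits = 2MB table */
#define HASH(x)   ((unsigned)((x) & ((1u << HASH_BITS) - 1)))

static unsigned char bittab[1u << (HASH_BITS - 3)];

static void bit_set(unsigned h)  { bittab[h >> 3] |= (unsigned char)(1u << (h & 7)); }
static int  bit_test(unsigned h) { return (bittab[h >> 3] >> (h & 7)) & 1; }

/* Pass 1: mark the baby steps 2^0 .. 2^(m-1) mod p. */
static void mark_baby_steps(unsigned long long p, unsigned m)
{
    unsigned long long x = 1;
    memset(bittab, 0, sizeof bittab);
    for (unsigned j = 0; j < m; j++) {
        bit_set(HASH(x));
        x = x * 2 % p;
    }
}

/* Probe one giant-step value y: a set bit means "recheck exactly in pass 2".
   Unlike the hashtable, the bit alone doesn't say which j matched - the
   second pass recovers that for the few survivors. */
static int probe(unsigned long long y)
{
    return bit_test(HASH(y));
}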
Old 2016-08-01, 11:53   #7
ET_
Banned
 
"Luigi"
Aug 2002
Team Italia

2×2,383 Posts

Quote:
Originally Posted by diep
Yes, BSGS gets used. […]
Is your software working on the minus side only, or also on the plus side (Riesel / Sierpinski)?
Old 2016-08-01, 12:19   #8
diep
 
Sep 2006
The Netherlands

677 Posts

Right now: Riesel.

I would need some help with how to solve Proth in that case.

To see if a small prime p divides a Riesel number k * 2^n - 1:

To solve it: k^(p-2) == k^(-1) (mod p) by Fermat,
then try to find the 2^n (mod p) that equals this inverse,
then use the BSGS algorithm to find the correct n (if it exists).

How to solve it for Proth I don't know. Anyone?
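
As a toy CPU-side illustration of this recipe (my sketch, not diep's code; p = 23 and k = 3 are assumed small values chosen so that an n exists, since 3 * 2^3 - 1 = 23):

Code:
/* Toy sketch of the Riesel step above: find n with 2^n == k^(p-2) (mod p),
   i.e. p | k*2^n - 1, via baby-step giant-step. Compile with -lm. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

typedef unsigned long long u64;

static u64 mulmod(u64 a, u64 b, u64 p) { return a * b % p; } /* fine for p < 2^32 */

static u64 powmod(u64 b, u64 e, u64 p)
{
    u64 r = 1;
    for (b %= p; e; e >>= 1) {
        if (e & 1) r = mulmod(r, b, p);
        b = mulmod(b, b, p);
    }
    return r;
}

/* Solve 2^n == t (mod p); return -1 if no n exists (then p eliminates nothing). */
static long long bsgs(u64 t, u64 p)
{
    u64 m = (u64)ceil(sqrt((double)p));
    u64 *baby = malloc(m * sizeof *baby);          /* baby[j] = 2^j mod p */
    u64 x = 1;
    for (u64 j = 0; j < m; j++) { baby[j] = x; x = mulmod(x, 2, p); }

    u64 step = powmod(powmod(2, m, p), p - 2, p);  /* (2^m)^-1 via Fermat */
    u64 y = t % p;                                 /* y = t * 2^(-i*m) */
    for (u64 i = 0; i < m; i++) {
        for (u64 j = 0; j < m; j++)                /* linear scan: toy only; the */
            if (baby[j] == y) {                    /* bit table replaces this lookup */
                free(baby);
                return (long long)(i * m + j);
            }
        y = mulmod(y, step, p);
    }
    free(baby);
    return -1;
}

int main(void)
{
    u64 p = 23, k = 3;
    long long n = bsgs(powmod(k, p - 2, p), p);    /* target: k^-1 mod p */
    if (n >= 0)
        printf("p=%llu divides k*2^n - 1 first at n=%lld\n", p, n);
    else
        printf("p=%llu never divides k*2^n - 1\n", p);
    return 0;
}

Once one n is found, p divides k*2^n' - 1 for every n' == n (mod ord_p(2)), which is how a single p eliminates many candidates across a 10M range of n's.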
Old 2016-08-01, 12:39   #9
diep
 
Sep 2006
The Netherlands

677 Posts

We get to: k * 2^n + 1 == 0 (mod p) <==> 2^n == -1/k (mod p) <==> 2^n == (-k)^(p-2) == -k^(p-2) (mod p) (since p - 2 is odd) <==> 2^n == (p-k)^(p-2) (mod p)

Looks correct?
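
For what it's worth, the chain checks out on toy values (p = 13 and k = 3, assumed purely for illustration): (p-k)^(p-2) = 10^11 == 4 == 2^2 (mod 13), so n = 2, and indeed 3 * 2^2 + 1 = 13.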
Old 2016-08-01, 12:58   #10
henryzz
Just call me Henry
 
"David"
Sep 2007
Cambridge (GMT/BST)

2·2,861 Posts

I am fairly sure there should be a formula which covers all c. Have a look at the srsieve source and you should find it.
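
Presumably that formula is just the reduction above with c kept symbolic (an inference from the two cases in this thread, not checked against the srsieve source): for p not dividing k, k * 2^n + c == 0 (mod p) <==> 2^n == -c * k^(p-2) (mod p). Setting c = -1 gives the Riesel target k^(p-2), and c = +1 gives -k^(p-2) == (p-k)^(p-2), the Proth one.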
Old 2016-08-02, 01:57   #11
Angular
 
Aug 2002

110110₂ Posts

Nice work, and it sounds a bit complex dealing with the Nvidia GPU.

Any chance this would work for fixed n and a large k range, k * 2^n±1?
Or a small n range with a large k range of, say, 4 billion?

My projects with Cunningham chains are still using NewPGen (485 MB bitmap limit), and I would love to have another option.