mersenneforum.org > Great Internet Mersenne Prime Search > PrimeNet
Old 2011-05-11, 11:46   #540
axn
 

Quote:
Originally Posted by Christenson View Post
Question: would P-1 on CUDA offer the same kind of speedup that is offered by TF on CUDA?
Answer: P-1 on CUDA will offer the same kind of speedup that is offered by LL on CUDA
Old 2011-05-11, 14:12   #541
diep
 

Quote:
Originally Posted by axn View Post
Answer: P-1 on CUDA will offer the same kind of speedup that is offered by LL on CUDA
You can design an adapted version of the algorithm that randomly looks for factors of limited size, say 256 or 512 bits; that means you can run it on every core rather than needing all cores to calculate at the million-bit level. Then, in a similar way to TF, you can check whether it divides P.

That would of course give a massive boost, and might find factors for a few percent of candidates.

Regards,
Vincent
Old 2011-05-11, 14:17   #542
diep
 

Quote:
Originally Posted by axn View Post
Answer: P-1 on CUDA will offer the same kind of speedup that is offered by LL on CUDA
A literal implementation of P-1 will of course not run at the same speed as LL, because it is difficult to implement the GCD efficiently on a GPU.

You will have many compute units idle.
Old 2011-05-11, 15:26   #543
axn
 

Quote:
Originally Posted by diep View Post
You can design an adapted version of the algorithm that randomly looks for factors of limited size, say 256 or 512 bits; that means you can run it on every core rather than needing all cores to calculate at the million-bit level. Then, in a similar way to TF, you can check whether it divides P.
That would not be a P-1 algorithm. The P-1 algorithm doesn't really use any randomisation, nor does it find factors of specific sizes.

Quote:
Originally Posted by diep View Post
A literal implementation of P-1 will of course not run at the same speed as LL, because it is difficult to implement the GCD efficiently on a GPU.

You will have many compute units idle.
Maybe so. However, gcd computation is only a _very_ small fraction of the total runtime. For a given FFT size, you can essentially treat gcd as a constant-time operation.
Old 2011-05-11, 15:41   #544
diep
 

Quote:
Originally Posted by axn View Post
That would not be a P-1 algorithm. P-1 algorithm doesn't really use any randomisation, nor does it find factors of specific sizes.
I don't have Pollard's original publication. The Wikipedia article uses a random number for initialization. See: http://en.wikipedia.org/wiki/Pollard's_p_−_1_algorithm

"randomly pick a coprime to n (note: we can actually fix a, random selection here is not imperative)"

Crandall & Pomerance write it differently in their book Prime Numbers, I see now: they use c = 2 as a starting point and note that you can just as well use a random number there.
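For what it's worth, stage 1 of P-1 fits in a few lines of Python. This is a toy sketch, not how Prime95 does it; it just illustrates that the base choice barely matters, with one caveat for Mersenne numbers noted in the comments.

```python
from math import gcd

def primes_up_to(limit):
    """Tiny sieve of Eratosthenes."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(sieve[i * i :: i]))
    return [i for i, v in enumerate(sieve) if v]

def pminus1_stage1(n, b1, a=3):
    # Raise a to every prime power <= B1.  Whether a is fixed or random
    # makes no difference to which factors are found.  Note a = 2 would
    # be a poor base for Mersenne numbers, since 2^p == 1 (mod 2^p - 1).
    for p in primes_up_to(b1):
        pk = p
        while pk <= b1:
            a = pow(a, p, n)
            pk *= p
    return gcd(a - 1, n)

# M13 * M17 = 8191 * 131071: 8191 - 1 = 2 * 3^2 * 5 * 7 * 13 is
# 13-smooth, 131071 - 1 = 2 * 3 * 5 * 17 * 257 is not, so stage 1
# with B1 = 13 pulls out the factor 8191.
print(pminus1_stage1(8191 * 131071, 13))   # -> 8191
```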

Quote:
Maybe so. However, gcd computation is only a _very_ small fraction of the total runtime. For a given FFT size, you can essentially treat gcd as a constant-time operation.
Well, modern GPUs allow really high IPC; it is difficult for me to see how to avoid losing a significant amount of that. A single multiplication mod P costs roughly 3n + 2n log n operations.

Many of the log n steps are not dependent on RAM speed; there you get about a factor 3 reduction, so the RAM-dependent part is roughly 4 + (log n - 9)/3. The initial O(n) pass of the GCD is RAM-bound, as the GPU doesn't have enough cache. After that you run into the problem of reducing modulo numbers that are not of a special form, i.e. not essentially powers of 2, which is rather slow on a GPU.

The total price of the GCD could easily be more than the price of an FFT.

Obviously you can regard this as linear, but in fact you slow down by more than a factor of 2 compared to an LL test.

A factor of 2 is a significant slowdown.
Old 2011-05-11, 20:23   #545
ET_

Quote:
Originally Posted by diep View Post
The total price of the GCD could easily be more than the price of an FFT.

Obviously you can regard this as linear, but in fact you slow down by more than a factor of 2 compared to an LL test.

A factor of 2 is a significant slowdown.
Can't you devote that last stage to the CPU?

Quote:
Originally Posted by diep View Post
There surely is room for an adapted version of it on a GPU. Yet if you put in all that effort anyway, why not go for a new algorithm? It's not so hard to design a new one.

If you take a look at old ones anyway, I'd argue: why not take a look at Pomerance's? Basically it needs to sieve quickly.
May I have some more pointers on it?

Luigi
Old 2011-05-11, 20:44   #546
diep
 

Quote:
Originally Posted by ET_ View Post
Can't you devote that last stage to the CPU?
Internal bandwidth on the GPU is, say, 2x170 GB/s up to 190 GB/s, depending on which propaganda channel you believe more. 140 GB/s is what I measured at home, though the card is also serving as a video card at the same time.

When the 'butterfly' (or rather the unit calculation of an NTT) works well on the GPU, the limitation for that FFT is the bandwidth to RAM.

Say we fetch units of 96 bits, stored as 3 x 32 bits, from RAM. We need two of those per unit calculation, and we store two back.

So in total we read and write 4 units * 3 integers * 4 bytes = 48 bytes.
The GPU can deliver that 140 GB/s / 48 B = 2.92 billion times per second.

We can execute 1.35T instructions per second on the GPU, so the RAM is already quite a problem: so to speak, we have about 500 cycles available, or can execute about 500 instructions, per limb.

Now, if we communicate with the CPU: on my mainboard the link speed from CPU to GPU is 2.0 GB/s, and from GPU to CPU it is 2.20 GB/s.

Sure, it's also serving as a video card at the same time, but let me not use that excuse any longer, as it won't matter much in this case :)

So the gap between what the mainboard delivers over PCIe and the bandwidth of the GPU's RAM is far too large, while the GPU's own RAM bandwidth is already no great luxury for the transform.
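As a sanity check, the arithmetic above can be reproduced in a few lines. The 140 GB/s and 1.35T instructions/s are the figures measured and quoted in this post, not properties of GPUs in general.

```python
bandwidth = 140e9        # bytes/s, as measured at home (above)
instr_rate = 1.35e12     # GPU instructions/s, as quoted above

# One unit calculation reads 2 units and writes 2 units,
# each unit being 3 x 32-bit integers:
bytes_moved = 4 * 3 * 4                      # 48 bytes per unit calculation
units_per_sec = bandwidth / bytes_moved      # ~2.92e9 per second
instr_budget = instr_rate / units_per_sec    # ~463, i.e. "roughly 500"

print(f"{units_per_sec:.3g} unit calcs/s, {instr_budget:.0f} instructions each")
```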

Quote:

May I have some more pointers at it?

Luigi
http://en.wikipedia.org/wiki/Quadratic_sieve

Last fiddled with by diep on 2011-05-11 at 20:49
Old 2011-05-11, 21:04   #547
ET_

Quote:
Originally Posted by diep View Post
Bandwidth internally at the GPU is say 2x170GB/s up to 190GB/s. [...] So the difference in bandwidth between what the mainboard over the pci-e delivers versus the bandwidth of the GPU's RAM is too much of a difference, whereas the GPU's bandwidth already is not huge luxury for the transform.
I'll have to study this better; I was confusing it with the GCD final stage of ECM.
Obviously, a continuous flow of data from GPU to CPU and vice versa is out of the question.

Thank you for the link!

Luigi
Old 2011-05-12, 01:34   #548
Christenson
 

Quote:
Originally Posted by axn View Post
Answer: P-1 on CUDA will offer the same kind of speedup that is offered by LL on CUDA
Axn, roughly how much LL speedup do we get on CUDA? (10-100x is typical for TF; I assume it's not nearly as good for LL.) Also, Vincent, the high-end CUDA cards have almost as much memory on them as my CPUs, typically a few GB. Given that an LL test requires perhaps 100 MB of RAM, can we run multiple LL tests in parallel on one CUDA card for better GPU utilization?

I was hoping for a better understanding of P-1; I take it it spends the majority of its time doing FFTs and inverses?
Old 2011-05-12, 01:47   #549
bsquared
 

Quote:
Originally Posted by diep View Post
http://en.wikipedia.org/wiki/Quadratic_sieve
What does this have to do with P-1? The two algorithms are designed to achieve very different goals. QS is useless for Mersenne candidates with exponents greater than 512 or so (and that's being fairly generous).
Old 2011-05-12, 04:29   #550
axn
 

Quote:
Originally Posted by Christenson View Post
Axn, roughly how much LL speedup do we get on CUDA? (10-100 is typical for TF; I assume it's not nearly as good for LL).
Definitely less. From the numbers I have seen here, on the order of 3x-10x (compared to a single core of a high-end CPU). Maybe higher for the really high-end cards.

Quote:
Originally Posted by Christenson View Post
I was hoping for a better understanding of P-1; I take it it spends the majority of its time doing FFTs and inverses?
Essentially, it is a large modular exponentiation operation -- lots of multiplication. So, yes.
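To make that concrete, stage 1 collapses into a single giant pow() call. This is a sketch, with Python's arbitrary-precision arithmetic standing in for the FFT-based multiplication a real implementation would use:

```python
from math import gcd, lcm

def pminus1(n, b1, a=3):
    # Stage 1 is one big modular exponentiation: the exponent is
    # lcm(1..B1), so every B1-smooth group order divides it, and the
    # gcd then reveals any factor p of n with p - 1 being B1-smooth.
    e = lcm(*range(2, b1 + 1))
    return gcd(pow(a, e, n) - 1, n)

# Same toy example as a prime-by-prime loop would give: 8191 * 131071,
# where 8191 - 1 is 13-smooth and 131071 - 1 is not.
print(pminus1(8191 * 131071, 13))   # -> 8191
```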