#540

Jun 2003
2×2,543 Posts
#541

Sep 2006
The Netherlands
36 Posts

Quote:

That'll give a massive boost of course and might find a few percent.

Regards, Vincent
#542

Sep 2006
The Netherlands
36 Posts

Quote:

You will have many compute units idle.
#543

Jun 2003
2×2,543 Posts

Quote:

Maybe so. However, the gcd computation is only a _very_ small fraction of the total runtime. For a given FFT size, you can essentially treat the gcd as a constant-time operation.
#544

Sep 2006
The Netherlands
36 Posts

Quote:

"randomly pick a coprime to n (note: we can actually fix a, random selection here is not imperative)"

Crandall & Pomerance write it differently in their book on primes, I see now. They use c = 2 as a starting point and note that you can just as well use a random number there.

Quote:

Many of the log n steps are not dependent upon RAM speed; there you get about a factor 3 reduction. So the RAM-dependent part is roughly 4 + (log n - 9) / 3. The initial O(n) pass of the GCD is dependent upon the RAM, as the GPU doesn't have enough cache. After that you stumble upon the problem of taking non-easy moduli, that is, moduli that are not some power of 2; that is rather slow on a GPU. The total price of the GCD could easily be more than the price of an FFT. Obviously you can see this as linear, but in fact you slow down by over a factor 2 compared to an LL test, and a factor 2 is a significant slowdown.
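Since the post above flags general (non-power-of-2) modulo as slow on a GPU, one standard workaround is a binary (Stein) gcd, which needs only shifts, subtractions and parity tests. A minimal Python sketch of the idea follows; this is a CPU-side illustration of the algorithm, not the GPU kernel the thread is discussing:

```python
def binary_gcd(u, v):
    # Stein's binary gcd: shifts, subtractions and parity tests only,
    # avoiding the general-purpose modulo that the post says is slow on a GPU.
    if u == 0:
        return v
    if v == 0:
        return u
    # factor out the power of two common to u and v
    shift = ((u | v) & -(u | v)).bit_length() - 1
    u >>= (u & -u).bit_length() - 1          # make u odd
    while v:
        v >>= (v & -v).bit_length() - 1      # make v odd
        if u > v:
            u, v = v, u
        v -= u                               # v becomes even (or zero)
    return u << shift

print(binary_gcd(48, 36))   # -> 12
```

The while loop runs O(log n) times for n-bit inputs, which is the "log n steps" cost the post is budgeting against RAM speed.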
#545

Banned
"Luigi"
Aug 2002
Team Italia
61·79 Posts

Quote:

Quote:

Luigi
#546

Sep 2006
The Netherlands
36 Posts

Internal bandwidth on the GPU is, say, 2×170 GB/s up to 190 GB/s, depending on which propaganda channel you believe more. The 140 GB/s is what I measured at home, though the card is also serving as a videocard at the same time.

When the 'butterfly', or rather the unit calculation of an NTT, works well on the GPU, then the limitation for that FFT is the bandwidth to the RAM. Say we get units of 96 bits, stored as 3 × 32 bits, from the RAM. We need two of those for a unit calculation and we store two. So in total we read and write 4 units × 3 integers × 4 bytes = 48 bytes. The GPU can deliver this 140 GB/s / 48 bytes = 2.92 billion times per second. We can execute 1.35T instructions per second on the GPU, so the RAM is already quite a problem: so to speak we've got about 500 cycles available, or can execute about 500 instructions, per limb.

Now if we communicate with the CPU: on my mainboard the link speed from CPU to GPU is 2.0 GB/s and from GPU to CPU is 2.2 GB/s. Sure, it's also serving as a videocard at the same time, but let me no longer use that excuse, as it won't matter much in this case :)

So the difference between the bandwidth the mainboard delivers over PCIe and the bandwidth of the GPU's RAM is too large, whereas the GPU's bandwidth already is no great luxury for the transform.

Quote:

Last fiddled with by diep on 2011-05-11 at 20:49
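diep's arithmetic above can be replayed in a few lines. All constants here come from his post itself (measured at home or claimed), not from a spec sheet:

```python
# Back-of-envelope check of the NTT bandwidth estimate in the post above.
bytes_per_unit = 3 * 4                    # one 96-bit unit stored as 3 x 32-bit ints
units_moved = 4                           # 2 reads + 2 writes per unit calculation
bytes_per_calc = units_moved * bytes_per_unit      # 48 bytes moved per butterfly

measured_bw = 140e9                       # bytes/s, the post's measured figure
peak_issue = 1.35e12                      # instructions/s the GPU can execute

calcs_per_sec = measured_bw / bytes_per_calc       # butterflies the RAM can feed
instr_per_limb = peak_issue / calcs_per_sec        # compute budget per limb

print(bytes_per_calc, round(calcs_per_sec / 1e9, 2), round(instr_per_limb))
# -> 48 2.92 463
```

So the "roughly 500 instructions per limb" in the post is the ratio of peak instruction issue to the RAM-limited butterfly rate, about 463 with these numbers.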
#547

Banned
"Luigi"
Aug 2002
Team Italia
1001011010011₂ Posts

Quote:

Obviously, a continuous flow of data from GPU to CPU and vice versa is to be excluded.

Thank you for the link!

Luigi
#548

Dec 2010
Monticello
5·359 Posts

Quote:

I was hoping for a better understanding of P-1; I take it it spends the majority of its time doing FFTs and inverses?
#549

"Ben"
Feb 2007
7×503 Posts

Quote:
#550

Jun 2003
2×2,543 Posts

Quote:

Essentially, it is a large modular exponentiation operation -- lots of multiplication. So, yes.
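To make the "large modular exponentiation" concrete: stage 1 of P-1 raises a base (c = 2 in Crandall & Pomerance's presentation, mentioned earlier in the thread) to every prime power up to the bound B1, then takes a single gcd at the end. A toy Python sketch, with illustrative numbers chosen for this example; a real implementation does the exponentiation with FFT multiplication modulo 2^p - 1:

```python
import math

def pminus1_stage1(n, b1, a=2):
    # Stage 1 of Pollard's P-1: one big modular exponentiation
    # (a raised to every prime power <= b1), then a single gcd.
    for p in range(2, b1 + 1):
        if all(p % d for d in range(2, int(p**0.5) + 1)):   # p is prime
            pk = p
            while pk <= b1:          # include prime powers p, p^2, ... <= b1
                a = pow(a, p, n)
                pk *= p
    return math.gcd(a - 1, n)

# Illustrative composite: 9901 - 1 = 2^2 * 3^2 * 5^2 * 11 is 11-smooth,
# so B1 = 25 exposes that factor; 10007 - 1 = 2 * 5003 is not smooth,
# so the gcd stays a proper divisor.
print(pminus1_stage1(9901 * 10007, 25))   # -> 9901
```

This also illustrates the earlier point about gcd cost: the loop performs many modular multiplications, while the gcd is executed exactly once at the end of the stage.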