[QUOTE=Christenson;261100]Question: would P-1 on CUDA offer the same kind of speedup that is offered by TF on CUDA?[/QUOTE]
Answer: P-1 on CUDA will offer the same kind of speedup that is offered by [B]LL[/B] on CUDA
[QUOTE=axn;261108]Answer: P-1 on CUDA will offer the same kind of speedup that is offered by [B]LL[/B] on CUDA[/QUOTE]
You could design an adapted version of the algorithm that randomly searches for factors of limited size, say 256 or 512 bits. That way you can run it on every core, rather than needing all the cores to cooperate on a computation at the million-bit level. Then, as in TF, you check whether each candidate divides P. That would give a massive boost of course, and might find a few percent. Regards, Vincent
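For reference, the TF-style divisibility check Vincent mentions is cheap per candidate: q divides 2^n - 1 exactly when 2^n ≡ 1 (mod q), and for prime n every factor has the form q = 2kn + 1. A minimal sketch in Python (illustrative only; real TF kernels also sieve the candidates and use the fact that q must be ±1 mod 8):

```python
def tf_candidates(n, max_k):
    """Trial-factor 2**n - 1 (n prime): any factor q has the form 2*k*n + 1."""
    found = []
    for k in range(1, max_k + 1):
        q = 2 * k * n + 1
        # q divides 2**n - 1  iff  2**n is congruent to 1 (mod q)
        if pow(2, n, q) == 1:
            found.append(q)
    return found
```

For example, `tf_candidates(11, 10)` finds `[23, 89]`, the two prime factors of 2^11 - 1 = 2047.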
[QUOTE=axn;261108]Answer: P-1 on CUDA will offer the same kind of speedup that is offered by [B]LL[/B] on CUDA[/QUOTE]
A literal implementation of P-1 will of course not run at the same speed as LL, because it is difficult to implement the GCD efficiently on a GPU. You will have many compute units sitting idle.
[QUOTE=diep;261120]You can design an adapted version algorithm which randomly looks for factors of limited size. Say 256 or 512 bits; that means you can run it at every core rather than needing all cores to calculate at million bitlevel. In similar way like TF then you can see whether it divides P.[/QUOTE]
That would not be a P-1 algorithm. The P-1 algorithm doesn't really use any randomisation, nor does it find factors of specific sizes. [QUOTE=diep;261122]A litterary implementation of P-1 will of course not work at the same speed like LL, because it is difficult to implement the GCD at an efficient manner at GPU. You will have many compute units idle.[/QUOTE] Maybe so. However, the gcd computation is only a _very_ small fraction of the total runtime. For a given FFT size, you can essentially treat the gcd as a constant-time operation.
[QUOTE=axn;261131]That would not be a P-1 algorithm. P-1 algorithm doesn't really use any randomisation, nor does it find factors of specific sizes.
[/quote] I don't have Pollard's original publication. The Wikipedia article initialises with a random number; see [url]http://en.wikipedia.org/wiki/Pollard's_p_−_1_algorithm[/url]: "randomly pick a coprime to n (note: we can actually fix a, random selection here is not imperative)". Crandall & Pomerance write it differently in their book, I see now: they use c = 2 as a starting point and note that you can just as well use a random number there.

[quote] Maybe so. However, gcd computation is only a _very_ small fraction of the total runtime. For a given FFT size, you can essentially treat gcd as a constant-time operation.[/QUOTE] Well, modern GPUs allow really high IPC, and it is difficult for me to see how to avoid losing a significant amount of it. A single multiplication mod P costs roughly 3n + 2n log n operations. Many of the log n steps are not dependent on RAM speed; you get about a factor 3 reduction there, so the RAM-dependent part is roughly 4 + (log n - 9) / 3 passes.

The initial O(n) pass of the GCD is RAM-bound, as the GPU doesn't have enough cache. After that you run into the problem of taking awkward moduli, i.e. not powers of 2, which is rather slow on a GPU. The total price of the GCD could easily exceed the price of an FFT. You can call that linear, but in practice you slow down by over a factor 2 compared to an LL test, and a factor 2 is a significant slowdown.
[QUOTE=diep;261136]
The total price of the GCD could easily be more than the price of a FFT. Obviously you can see this as lineair, but in fact you slow down over factor 2 compared to a LL. Factor 2 is a significant slowdown.[/QUOTE] Can't you devote that last stage to the CPU? [QUOTE=diep;261136]There sure is room for an adapted version of it at a gpu. Yet if you do all that effort anyway, why not go for a new algorithm, it's not so hard to design a new one. If you take a look to old ones anyway i'd argue why not take a look at pomerance? As basically it needs to sieve quick.[/QUOTE] May I have some more pointers to it? :smile: Luigi
[QUOTE=ET_;261180]Can't you devote that last stage to the CPU?
[/quote] Bandwidth internally on the GPU is said to be anywhere from 2x170GB/s up to 190GB/s, depending on which propaganda channel you believe. The 140GB/s is what I measured at home, though the card was also serving as a video card at the same time.

When the 'butterfly', or rather the unit calculation of an NTT, works well on the GPU, the limiting factor for the FFT is bandwidth to RAM. Say we fetch units of 96 bits, stored as 3 x 32-bit words, from RAM. We need two of them per unit calculation and we store two, so in total we read and write 4 units * 3 integers * 4 bytes = 48 bytes. The GPU can deliver this 140GB/s / 48 = 2.92 billion times per second, while it can execute 1.35T instructions per second. So RAM is already quite a problem: so to speak, we have about 500 cycles available, or can execute about 500 instructions, per limb.

Now suppose we communicate with the CPU. On my mainboard the link speed from CPU to GPU is 2.0 GB/s and from GPU to CPU is 2.20 GB/s. Sure, the card is also serving as a video card at the same time, but let me drop that excuse, as it won't matter much in this case :) The difference between what the mainboard delivers over PCI-e and the bandwidth of the GPU's RAM is simply too large, and the GPU's own bandwidth is already no great luxury for the transform.

[quote] May I have some more pointers at it? :smile: Luigi[/QUOTE] [url]http://en.wikipedia.org/wiki/Quadratic_sieve[/url]
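Vincent's arithmetic is easy to reproduce, taking his measured 140 GB/s and the quoted 1.35T instructions/s at face value:

```python
bandwidth = 140e9           # bytes/s, the GPU RAM bandwidth measured in the post
bytes_per_unit = 4 * 3 * 4  # 2 reads + 2 writes of a 3 x 32-bit (96-bit) unit
instr_rate = 1.35e12        # instructions/s quoted for the GPU

units_per_sec = bandwidth / bytes_per_unit    # unit calculations RAM can feed per second
instr_per_unit = instr_rate / units_per_sec   # instructions available per unit
print(round(units_per_sec / 1e9, 2), round(instr_per_unit))  # prints: 2.92 463
```

So "roughly 500 instructions per limb" is, more precisely, about 463.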
[QUOTE=diep;261183]So the difference in bandwidth between what the mainboard over the pci-e delivers versus the bandwidth of the GPU's RAM is too much of a difference, whereas the GPU's bandwidth already is not huge luxury for the transform.[/QUOTE] I'll have to study this better; I was confusing it with the final-stage GCD of ECM :redface: Obviously, a continuous flow of data from GPU to CPU and vice versa is out of the question. Thank you for the link! Luigi
[QUOTE=axn;261108]Answer: P-1 on CUDA will offer the same kind of speedup that is offered by [B]LL[/B] on CUDA[/QUOTE]
Axn, roughly how much LL speedup do we get on CUDA? (10-100x is typical for TF; I assume it's not nearly as good for LL.) Also, Vincent, the high-end CUDA cards have almost as much memory on them as my CPUs, typically a few gig. Given that an LL test requires perhaps 100 MB of RAM, can we run multiple LLs in parallel on one CUDA card for better GPU utilization? I was hoping for a better understanding of P-1; I take it it spends the majority of its time doing FFTs and inverses?
[QUOTE=diep;261183]
[url]http://en.wikipedia.org/wiki/Quadratic_sieve[/url][/QUOTE] What does this have to do with P-1? The two algorithms are designed to achieve very different goals. QS is useless for Mersenne candidates with exponents greater than 512 or so (and that's being fairly generous).
[QUOTE=Christenson;261195]Axn, roughly how much LL speedup do we get on CUDA? (10-100 is typical for TF; I assume it's not nearly as good for LL).[/quote]
Definitely less. From the numbers I have seen here, on the order of 3x-10x (compared to a single core of a high-end CPU). Maybe higher for the really high-end cards. [QUOTE=Christenson;261195]I was hoping for a better understanding of P-1; I take it it spends the majority of its time doing FFTs and inverses?[/QUOTE] Essentially, it is one large modular exponentiation operation: lots of multiplications. So, yes.
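A sketch of why the two workloads look alike: stage-1 P-1 is one long square-and-multiply chain mod Mp = 2^p - 1, i.e. the same kind of big multiplications an LL test performs, and the reduction mod a Mersenne number is a cheap shift-and-add. This is illustrative Python using schoolbook bignums; real GIMPS code does the multiplications with FFTs, which is exactly why P-1 throughput tracks LL throughput on the same hardware.

```python
def mod_mersenne(x, p, mp):
    # x mod (2^p - 1), using 2^p = 1 (mod Mp): fold the high bits down
    while x > mp:
        x = (x >> p) + (x & mp)
    return 0 if x == mp else x

def powmod_mersenne(a, e, p):
    # square-and-multiply: each step is one big multiply plus a cheap fold
    mp = (1 << p) - 1
    r = 1
    while e:
        if e & 1:
            r = mod_mersenne(r * a, p, mp)
        a = mod_mersenne(a * a, p, mp)
        e >>= 1
    return r
```

For instance, `powmod_mersenne(3, 100, 13)` agrees with Python's built-in `pow(3, 100, 2**13 - 1)`.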