[QUOTE=heliosh;488402]Yes, I have the opencl drivers installed and gpuowl appears to detect the GTX 1050.
[CODE]ii  nvidia-opencl-common     390.48-3     amd64  NVIDIA OpenCL driver - common files
ii  nvidia-opencl-dev:amd64  9.1.85-4+b1  amd64  NVIDIA OpenCL development files[/CODE]But as far as I've read in this thread, nvidia GPUs aren't supported.[/QUOTE] Yes, Nvidia GPUs are not supported yet, though there are other Mersenne programs for Nvidia, like CUDALucas.
Improved recovery from Windows TDRs
Windows TDRs have occasionally been observed to derail a gpuOwL run on AMD GPUs (gpuOwL 1.9 and RX 550). A recovery method tested over the last few days with CUDAPm1 and a GTX 480 may also apply here. See the detailed writeup at [URL]http://www.mersenneforum.org/showpost.php?p=488288&postcount=37[/URL]
GpuOwl future
I'd like to discuss a bit the medium-term evolution of GpuOwl; here are some of my opinions.
GpuOwl used powers-of-two FFT sizes, and it so happened that those worked well (fast) in the under-80M exponent region. But that area is exhausted now. Prime95 (on CPU) is very fast on any FFT size (i.e., including non-power-of-two). OTOH GpuOwl (on GPU) is not fast enough, IMO, on the NPOT (non-power-of-two) FFTs that I experimented with. As such, in the region immediately above 80M, I see Prime95 as the better tool for PRP.

If GpuOwl remains on powers-of-two only, then it would best be used at the exponent ranges that are maximal for some power of two. In particular, those are:
- under about 153M for the 8M FFT,
- under about 300-305M for the 16M FFT.

Unfortunately the 100M-digit exponents, which start around 320M, are just out of the reach of the 16M FFT, while at the same time the 32M FFT is overkill for 320M. So while the 150M or 300M regions would bring in good credits, those exponents also have drawbacks: not as likely to produce the next Mersenne prime, and not 100M digits.
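As a rough sanity check on the limits quoted above, they imply an average of about 18.2 bits carried per FFT word (153M / 8Mi words ≈ 18.2). A minimal Python sketch, where the BITS_PER_WORD constant is back-derived from the quoted limits for illustration, not taken from gpuOwl's source:

```python
# Rough FFT-size-to-exponent reach for power-of-two double-precision FFTs.
# BITS_PER_WORD is an assumed average back-derived from the limits quoted
# above (~153M at 8M, ~300-305M at 16M); gpuOwl's real limits depend on
# round-off behavior, not just this constant.
BITS_PER_WORD = 18.2

def max_exponent(fft_size_words: int) -> int:
    """Approximate largest exponent a given FFT length can handle."""
    return int(fft_size_words * BITS_PER_WORD)

for m in (4, 8, 16, 32):
    words = m * 1024 * 1024
    print(f"{m}M FFT -> exponents up to ~{max_exponent(words) / 1e6:.0f}M")
```

This reproduces the gap described above: the 16M FFT tops out near 305M, short of the ~332M where 100M-digit exponents begin, while the 32M FFT reaches far past it.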
Some thoughts and questions on what's next for gpuOwL
1 Attachment(s)
[QUOTE=preda;488705]I'd like to discuss a bit the medium-term evolution of GpuOwl; here are some of my opinions.
GpuOwl used powers-of-two FFT sizes, and it so happened that those worked well (fast) in the under-80M exponent region. But that area is exhausted now. Prime95 (on CPU) is very fast on any FFT size (i.e., including non-power-of-two). OTOH GpuOwl (on GPU) is not fast enough, IMO, on the NPOT (non-power-of-two) FFTs that I experimented with. As such, in the region immediately above 80M, I see Prime95 as the better tool for PRP.

If GpuOwl remains on powers-of-two only, then it would best be used at the exponent ranges that are maximal for some power of two. In particular, those are:
- under about 153M for the 8M FFT,
- under about 300-305M for the 16M FFT.

Unfortunately the 100M-digit exponents, which start around 320M, are just out of the reach of the 16M FFT, while at the same time the 32M FFT is overkill for 320M. So while the 150M or 300M regions would bring in good credits, those exponents also have drawbacks: not as likely to produce the next Mersenne prime, and not 100M digits.[/QUOTE] There's no PRP Mersenne code that will run on NVIDIA GPUs currently. Either adapting to NVIDIA's OpenCL or to CUDA would open new possibilities there. (GpuOwlN?) If it brings a speed advantage to NVIDIA, we all win.

I was unable to test the -M61 exponent limits at 8M in gpuOwL v1.9, because of the program's exponent cap based on the -DP capabilities. But at 4M, it handled an exponent about 7.5 percent higher. That's likely to be similar at 8M or 16M, so it could provide a little more reach. Implementing a 10,000K length seems likely to overshadow the 8M -M61 in both reach and speed, as implementing the 5000K did to the 4M -M61. Would that also be the case for a 20,000K length versus the 16M -M61?

It's my understanding that Prime95 is currently limited to ~595M (and can be pushed through extreme measures to 604M). It might be decades before we _NEED_ a 32M FFT, but scouting the terrain well before the army and settlers get there is worthwhile.
The clLucas timings indicated there might be something to be gained by implementing ~6M and 12M lengths. What lengths did you experiment with?

1/log10(2) = 3.321928..., so 100M-digit decimal Mersenne numbers are 2^p - 1 where p > ~332,192,806 (log10(Mp) > 99,999,999). A 16M -M61 is likely to fall a percent or two short of 100M digits. Is there a way you could stretch it a bit further? Or does it require a 20,000K DP FFT or something near it?

Can you go higher than -M61 effectively? (I'm guessing that would involve Toom-Cook on the bottom, to simulate triple- or quad-precision integer arithmetic, which could be too slow.) Is total throughput greater or less if running -M61 and -DP simultaneously?

There's certainly plenty of work within reach of the 8M -DP FFT. Documentation, bug fixes, & other maintenance are always welcome.
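The digit-count arithmetic above is easy to verify: 2^p has floor(p*log10(2)) + 1 decimal digits, and since 2^p is never a power of 10, 2^p - 1 has the same count. A short Python check of the 100M-digit threshold:

```python
import math

def mersenne_digits(p: int) -> int:
    """Decimal digits of 2^p - 1 (same as 2^p, which is never a power of 10)."""
    return math.floor(p * math.log10(2)) + 1

# Smallest exponent whose Mersenne number reaches 100 million digits:
# we need p*log10(2) >= 99,999,999, i.e. p >= 99,999,999 / log10(2).
threshold = math.ceil(99_999_999 / math.log10(2))
print(threshold)                    # 332192807
print(mersenne_digits(threshold))   # 100000000
```

So p = 332,192,806 still gives 99,999,999 digits, and 332,192,807 is the first exponent crossing 100M digits, matching the "p > ~332,192,806" bound quoted above.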
[QUOTE=kriesel;488767]There's no PRP Mersenne code that will run on NVIDIA gpus currently.[/QUOTE]
OK, I have an Nvidia GPU :). It's a 660M, but should be enough for testing. First I'll be looking into tapping into cuFFT.

[QUOTE]I was unable to test in gpuOwL V1.9 the -M61's exponent limits at 8M, because of the exponent cap in the program based on the -DP capabilities.[/QUOTE]Sorry for that. That limit is simple to change in the source if you could re-compile. About M61, last time I tried it, it seemed so slow as to be not interesting, so I didn't continue to investigate that.

[QUOTE]1/log10(2) = 3.321928..., so 100M-digit decimal Mersenne numbers are 2^p - 1 where p > ~332,192,806.[/QUOTE]Yep, I stand corrected. The point was, this exponent size is too large for the 16M FFT, and too small for the 32M FFT. So not good for POT FFTs.

---

Given the current GPU-vs-CPU prices, I think Prime95 on CPU might be a better choice than GPU for PRP at the wavefront. I try to find an edge for GpuOwl, a niche where it offers some advantage over the CPU. Achieving same-or-worse than the CPU is not interesting.
[QUOTE=preda;488832]OK, I have an Nvidia GPU :). It's a 660M, but should be enough for testing. First I'll be looking into tapping into cuFFT.
Sorry for that. That limit is simple to change in the source if you could re-compile. About M61, last time I tried it, it seemed so slow as to be not interesting, so I didn't continue to investigate that. Yep, I stand corrected. The point was, this exponent size is too large for the 16M FFT, and too small for the 32M FFT. So not good for POT FFTs. --- Given the current GPU-vs-CPU prices, I think Prime95 on CPU might be a better choice than GPU for PRP at the wavefront. I try to find an edge for GpuOwl, a niche where it offers some advantage over the CPU. Achieving same-or-worse than the CPU is not interesting.[/QUOTE]

A quad-core CPU at 3.60 GHz spits out a result in 14-15 days, while the GPU spits out a result in 4-5 days. I hope you keep maintaining gpuOwL into the future.
[QUOTE=preda;488832]OK, I have an Nvidia GPU :). It's a 660M, but should be enough for testing. First I'll be looking into tapping into cuFFT.
Sorry for that. That limit is simple to change in the source if you could re-compile. About M61, last time I tried it, it seemed so slow as to be not interesting, so I didn't continue to investigate that. Yep, I stand corrected. The point was, this exponent size is too large for the 16M FFT, and too small for the 32M FFT. So not good for POT FFTs. Given the current GPU-vs-CPU prices, I think Prime95 on CPU might be a better choice than GPU for PRP at the wavefront. I try to find an edge for GpuOwl, a niche where it offers some advantage over the CPU. Achieving same-or-worse than the CPU is not interesting.[/QUOTE]

Re starting on NVIDIA, that's great news. When you have something to test, let us know; I could test it on several other models.

Keep making good code that is the fastest or only available PRP for the GPUs in question. How it's applied is the users' responsibility. (Although you could code in a simple bit of coaching that some exponents launched might be more effectively run on the CPU, then run it anyway.)

Please go back and tweak v1.9 for the M61 limits and any other maintenance items. It's more effective if the changes are applied to your GitHub repository. I'd like to see a merge build, where v1.9's DP and M61 transforms and the 2M (optional), 4M, and 8M lengths, plus the 5000K DP, and whatever's next, are rolled into one package. Then users have one executable to deal with, whether they're doing first-time checks, double checks, or whatever, versus time. Such convenience adds up when it's dozens of GPUs.
Reference material
I was offered "a blog area to consolidate all of your pdfs and guides and stuff" and accepted.
Feel free to have a look and suggest content. (G-rated only ;)

General interest GPU-related reference material: [URL]http://www.mersenneforum.org/showthread.php?t=23371[/URL]
gpuOwL primality testing on OpenCL GPUs: [URL]http://www.mersenneforum.org/showthread.php?t=23391[/URL]

Future updates to material previously posted in this thread will probably occur on the blog threads and not here. Having in-place updates without a time limit makes it more manageable there.
Congratz!
GP2's excellent tutorials on Cloud Computing and other topics should also be collected this way.
[QUOTE=kladner;489093]Congratz!
GP2's excellent tutorials on Cloud Computing and other topics should also be collected this way.[/QUOTE] Thanks. I haven't seen those GP2 tutorials. (I don't do cloud Mersenne computing myself.) Glad the info is out there for those who do, or are curious or considering it.
[QUOTE=kriesel;489094]Thanks.
Haven't seen those GP2 tutorials. (Don't do cloud Mersenne computing myself.) Glad the info is out there for those who do, or are curious or considering it.[/QUOTE] Oops. Those [U]are[/U] collected in a sub-thread of Hardware: [URL]http://www.mersenneforum.org/forumdisplay.php?f=134[/URL]
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.