mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   What size numbers can CudaLucas handle? (https://www.mersenneforum.org/showthread.php?t=23737)

robertfrost 2018-10-26 12:26

What size numbers can CudaLucas handle?
 
I'm currently performing a Lucas Lehmer test on a 100 million digit prime using CudaLucas. Can it handle numbers that large?

kriesel 2018-10-26 13:18

[QUOTE=robertfrost;498795]I'm currently performing a Lucas Lehmer test on a 100 million digit prime using CudaLucas. Can it handle numbers that large?[/QUOTE]
Yes, and considerably larger. See the reference material at [URL]https://www.mersenneforum.org/forumdisplay.php?f=154[/URL]. The attachment in post two of [URL]https://www.mersenneforum.org/showthread.php?t=23371[/URL] lists the commonly used GPU software for Mersenne hunting and gives nutshell descriptions of their limits. There are also bug and wish lists for several programs, including CUDALucas, in application-specific threads. This material is being actively maintained; several updates were made yesterday.

robertfrost 2018-10-26 14:06

oh dear:(
 
Thanks - I searched for ages without finding that (before I asked here). The exponent in question is 3.3*10^8, which looks to be above the limit. Does that mean I must abandon my test and find another way?


EDIT... SORRY, IT GOES UP TO 1*10^9 doesn't it? So I'm okay. Not sure if I'm being daft.
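Converting between a Mersenne exponent and its decimal digit count is a one-liner; a quick sketch (the helper name is mine, not from any of the programs discussed here):

```python
import math

def mersenne_digits(p):
    """Decimal digit count of the Mersenne number 2^p - 1."""
    # For any p > 1, log10(2^p - 1) is effectively p * log10(2).
    return int(p * math.log10(2)) + 1

print(mersenne_digits(332_000_000))  # just under 100 million digits
print(mersenne_digits(10**9))        # a gigadigit-class exponent: ~301 million digits
print(2**31 - 1)                     # CUDALucas's hard exponent cap: 2147483647
```

So an exponent of 3.3*10^8 gives a number of roughly 10^8 digits, comfortably under both the 10^9 range and the 2^31-1 cap discussed below.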

kriesel 2019-01-07 16:36

[QUOTE=robertfrost;498805]SORRY, IT GOES UP TO 1*10^9 doesn't it?[/QUOTE]
Actually, upon further investigation, it turns out CUDALucas theoretically goes up to 2[SUP]31[/SUP]-1. It will FFT-benchmark and thread-benchmark up to 256M length, and its max exponent is capped at 2147483647. See the attachment at post 3 of

[url]https://www.mersenneforum.org/showthread.php?t=23371[/url] and the CUDALucas reference thread linked from that thread.

LaurV 2019-01-09 06:46

Yes, CUDALucas is limited to a signed 32-bit word for the exponent, but you will hit the FFT limit imposed by the card's memory sooner, unless you rewrite the cuFFT library yourself.

kriesel 2019-01-09 11:05

[QUOTE=LaurV;505373]Yes, CUDALucas is limited to a signed 32-bit word for the exponent, but you will hit the FFT limit imposed by the card's memory sooner, unless you rewrite the cuFFT library yourself.[/QUOTE]
A quick test on GTX1080Ti:
[CODE]Wed Jan 09 04:41:05 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 378.78 Driver Version: 378.78 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro 2000 WDDM | 0000:02:00.0 On | N/A |
|100% 78C P0 N/A / N/A | 88MiB / 1024MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... WDDM | 0000:03:00.0 Off | N/A |
| 66% 82C P2 220W / 250W | 1619MiB / 11264MiB | 100% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1868 C ... Documents\mfaktc q2000\mfaktc-win-64.exe N/A |
| 1 4644 C ...CUDALucas2.06beta-CUDA8.0-Windows-x64.exe N/A |
+-----------------------------------------------------------------------------+[/CODE][CODE]Continuing M999999937 @ iteration 4302 with fft length 57344K, 0.00% done

| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Jan 09 04:45:26 | M999999937 5000 0xb723ad2cf90fefd5 | 57344K 0.18750 40.3755 28.18s | 473:09:25:34 0.00% |
| Jan 09 04:46:07 | M999999937 6000 0x00c230e56a4bc3ca | 57344K 0.20313 40.6178 40.61s | 472:20:17:29 0.00% |
| Jan 09 04:46:48 | M999999937 7000 0x7d01674dde8ecc02 | 57344K 0.18945 40.9224 40.92s | 472:22:59:37 0.00% |
[/CODE]Run time, reliability, and hardware life are probably issues before GPU memory is. Run time per exponent/primality test applies equally to PRP as to LL.
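As a sanity check on that log (assuming "57344K" means 57344*1024 double-precision FFT words, which is the usual CUDALucas convention), the bits packed per FFT word and the raw size of one buffer come out to:

```python
p = 999_999_937
fft_len = 57344 * 1024           # "57344K" from the CUDALucas log above
bits_per_word = p / fft_len      # how many residue bits each FFT word carries
data_mib = fft_len * 8 / 2**20   # one double-precision buffer of that length, in MiB

print(round(bits_per_word, 2))   # ~17 bits per word
print(round(data_mib))           # 448 MiB per buffer; several buffers plus cuFFT
                                 # workspace explain the ~1.6 GiB nvidia-smi reports
```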

Extrapolating linearly (which is optimistic; above 2G the code gets a bit bigger). Note that while I was composing this, as the GPU warmed up, the projected run time increased by about 0.5% beyond what's tabulated here:

[CODE]p VRAM GB runtime (years per exponent)
M1G 1.62 1.3
M2G 3.24 2.6
M3G 4.86 3.9
M3.7G 5.99 4.8
M4G 6.48 5.2
M5G 8.10 6.5
M6G 9.72 7.8
M6.8G 11.02 8.8
M7G 11.34 9.1[/CODE]An 8 GB or even 6 GB card seems adequate for gigadigit exponents, if it is fast enough. (Yes, that would also take some coding extensions.)
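The VRAM column above is a straight linear extrapolation; it can be reproduced from the M1G row alone (the 1.62 GB anchor is read off the table, and using it beyond that point is the same optimistic assumption as above):

```python
def vram_gb(p):
    """Linear VRAM extrapolation, anchored at 1.62 GB for p = 1e9 (table's M1G row)."""
    return 1.62 * p / 1e9

# Reproduce a few rows of the table:
for p in (1e9, 3.7e9, 6.8e9):
    print(f"M{p:.1e}: {vram_gb(p):.2f} GB")
```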

Any idea why signed int was used instead of unsigned for exponent, or how hard it would be to change (hidden complications)?

kriesel 2019-01-10 03:53

Oops
 
Please disregard the run times in the preceding post. The only one that's credible is the 1.3 years for M1G. The run times should be scaling at approximately p[SUP]2.1[/SUP], not linearly. The extrapolation table has been adjusted and extended to include estimates for some typical GPU memory capacities, and posted at [URL]https://www.mersenneforum.org/showpost.php?p=505493&postcount=7[/URL].
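The corrected scaling is easy to apply yourself; a sketch anchored at the one credible data point above (1.3 years for M1G, with the p^2.1 exponent taken from the post):

```python
def runtime_years(p):
    """Run-time estimate scaling as p^2.1, anchored at 1.3 years for p = 1e9."""
    return 1.3 * (p / 1e9) ** 2.1

print(round(runtime_years(1e9), 1))  # 1.3
print(round(runtime_years(2e9), 1))  # ~5.6, versus the 2.6 the linear table gave
```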

LaurV 2019-01-10 09:16

For the record, cuFFT uses more memory than gwnum/P95 does, and not always transparently to the user. I was never able to run a 100M-digit LL test (332M+ exponent) on my GTX 580s with 1.5 GB of memory (I still own 4 of them, only 2 in production; the other 2 are shelved, with no PCIe slots available). It will not say that it can't run, but you get a lot of strange errors and mismatches somewhere after a million iterations (for example), and you are never able to finish a test.

With the 3 GB version of the same card, you can go up to about 550M (I can't remember the exact numbers; I had 2 such cards and sold them years ago).

However, my 6 GB Titans are currently testing [M]M666666667[/M] (ETA in ~4 months) and have no problem with it.

Your CPU does the calculation sequentially, and therefore one iteration of LL does not need much memory. On the GPU, all the butterflies are done at the same time in parallel, so cuFFT operates on all the data at once, more or less (well, this is not really true, but that is the idea), so it needs more memory than the few MB you give to P95 for LL tests.

More than that I can't say, but you don't know whether it works until you actually complete a test at that size - backed up by a parallel run on a second card, of course, otherwise you lose the time. I get mismatch errors and need to resume regularly (2-3 times per month) at the clocks I push the Titans to.

kriesel 2019-01-10 13:14

[QUOTE=LaurV;505513]For the record, cuFFT uses more memory than gwnum/P95 does, and not always transparently to the user.[/QUOTE]Have you tried using nvidia-smi to show GPU memory usage? GPU-Z is useful for some things, but compared to nvidia-smi it seems to show memory usage approximately mod 4 GB.


All times are UTC.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.