[QUOTE=aaronhaviland;500255]Success compiling with MPIR.
64-bit binary attached. Requires CUDA 10 and a GPU with Compute Capability >= 3.5. Unsure of other requirements; I'm not too familiar with Windows dependencies. [CODE]Microsoft Windows [Version 10.0.17134.407] ...[/CODE][/QUOTE] Congrats, and thanks for sharing the exe. Which commit is this, 1165353? Any chance of a CUDA 8 build? Or perhaps you could share the process of setting up a Windows build environment? I may give your exe a spin on Win7, but I would rather not upgrade all GPU systems' drivers to CUDA 10 capability until I know it works on Win7 and the necessary driver version doesn't impact throughput. Speed and limit testing of v0.20 will be completed first. Plus, many of my GPUs are CC 2.x. Is there any reason to believe this version will handle big exponents, like the 511M one you reported issues with earlier?
[QUOTE=kriesel;500258]Congrats, and thanks for sharing the exe. Which commit is this, 1165353?
Any chance of a CUDA 8 build? Or perhaps you could share the process of setting up a Windows build environment? I may give your exe a spin on Win7, but I would rather not upgrade all GPU systems' drivers to CUDA 10 capability until I know it works on Win7 and the necessary driver version doesn't impact throughput. Speed and limit testing of v0.20 will be completed first. Plus, many of my GPUs are CC 2.x. Is there any reason to believe this version will handle big exponents, like the 511M one you reported issues with earlier?[/QUOTE] It's commit b456ecbffc908927ccb37d0240f66af6ef2e4bb. I can try to set up a Win7/CUDA 8 VM build environment, but I make no promises. There have been no functional code changes yet that would improve the ability to process higher exponents; so far it's mostly been housekeeping.
Thanks!
I'll try it on my GTX 1080 Ti when I've got some time.
[QUOTE=kriesel;500245]
I've seen in recent testing that sometimes CUDAPm1 significantly underutilized GPU memory in stage 2. Not sure what that's about, or if it's still present in your modified version.[/QUOTE]The above was based on GPU-Z's indication of memory usage. I now think the issue is with GPU-Z. [CODE]During a CUDAPm1 v0.20 run on a 300M exponent, stage 2, nvidia-smi reports on a GTX 1080 Ti:

    FB Memory Usage
        Total : 11264 MiB
        Used  :  4967 MiB
        Free  :  6297 MiB
    Utilization
        Gpu     : 99 %
        Memory  : 74 %
        Encoder :  0 %
        Decoder :  0 %

while GPU-Z 2.14.0 reports for the same GPU at the same time:

    memory usage (dedicated)  750 MB
    memory usage (dynamic)     43 MB
    total                     793 MB

4967 - 4096 = 871
871 - 793 = 78[/CODE]It's not clear whether GPU-Z's numbers are decimal MB or MiB. But there seems, at least sometimes, to be a large discrepancy, more than 2^32 bytes, between what nvidia-smi and GPU-Z report as memory used on large-memory GPUs. Or maybe what GPU-Z reports is only a small subset of the total used. As I recall, it seemed to be a good indicator on GPUs with 4 GB of memory or less. HWMonitor indicates yet different usage figures and terms: Memory 25%, Frame buffer 75%
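The discrepancy arithmetic above can be checked directly. This is a minimal sketch, assuming both tools count in binary MiB; the figures are the nvidia-smi and GPU-Z readings quoted above:

```python
MIB = 1024 ** 2  # bytes per MiB

nvidia_smi_used = 4967 * MIB   # nvidia-smi "Used" figure
gpuz_total = (750 + 43) * MIB  # GPU-Z dedicated + dynamic

# The gap between the two tools' reports exceeds 2**32 bytes (4 GiB):
gap = nvidia_smi_used - gpuz_total
print(gap > 2 ** 32)           # -> True

# If GPU-Z's counter were wrapping at 4 GiB (a 32-bit counter),
# the unexplained remainder would be small:
print((gap - 2 ** 32) // MIB)  # -> 78 (MiB), matching 871 - 793 above
```

If GPU-Z's numbers were instead decimal MB, the remainder shifts by a few tens of MiB but the gap still exceeds 2^32 bytes either way.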
I seem to recall making some modifications to the memory allocations prior to my first git commit, but I cannot recall what they are.
We have to remember that it checks the available RAM before stage 1, as part of the bounds calculations: [CODE]CUDA reports 7473M of 7949M GPU memory free.
Using threads: norm1 256, mult 128, norm2 64.
Using up to 7350M GPU memory.
Selected B1=660000, B2=14520000, 4.02% chance of finding a factor
Starting stage 1 P-1, M58039669, B1 = 660000, B2 = 14520000, fft length = 3200K
...
Starting stage 2.
Using b1 = 660000, b2 = 14520000, d = 2310, e = 12, nrp = 240
Zeros: 650369, Ones: 742591, Pairs: 145550
Processing 1 - 240 of 480 relative primes.[/CODE]But this memory is not actually allocated until much later, and the amount available could have changed in the meantime. We have to be very careful not to exceed it, because doing so causes fatal errors, and we have no control over other applications that may also be using the same memory. One reason the code uses less memory than is available is that (based on my understanding, at least) it: [LIST=1][*]Determines the value of nrp based on the available memory and FFT size (and for some reason restricts it to 4 GiB on Windows; possibly a 32-bit issue, or something left over from older CUDA versions?)[*]Determines how many passes it takes to process all the relative primes[*]Balances the passes so they're all the same size.[/LIST]E.g. for my above exponent: [LIST=1][*]nrp is initially 287 (which would use all of the available RAM)[*]480 relative primes therefore require ~1.7 passes[*]Rounding up makes that 2 passes, so nrp = 240 relative primes per pass, instead of two wildly different-sized passes (287 in the first and 193 in the second)[*]Actual RAM usage is 240*x instead of 287*x.[/LIST]I'm not sure of [I]all[/I] the reasons for this, but the one I can definitely be thankful for is that it is much less likely to crash from insufficient memory.
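The pass-balancing steps described above can be sketched as follows. This is a hypothetical reimplementation for illustration, not CUDAPm1's actual code; `balance_passes` and its signature are my own invention:

```python
import math

def balance_passes(nrp_max, total_rp):
    """Split total_rp relative primes into equal-sized passes,
    given that at most nrp_max of them fit in GPU memory at once."""
    # Round the number of passes up so no pass exceeds nrp_max...
    passes = math.ceil(total_rp / nrp_max)  # e.g. ceil(480 / 287) = 2
    # ...then re-divide so every pass is the same size.
    nrp = math.ceil(total_rp / passes)      # e.g. ceil(480 / 2) = 240
    return passes, nrp

print(balance_passes(287, 480))  # -> (2, 240): two balanced passes of 240
```

With the example from the log above, 287 relative primes would fit in memory, but 480 of them need ~1.7 passes; rounding up to 2 passes and rebalancing gives 240 per pass, using 240*x of memory instead of 287*x.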
I like this nvidia-smi view because it's a nice, simple summary, and I still get to see how much memory each application is using:
[CODE]+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.73       Driver Version: 410.73       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   56C    P2   126W / 185W |   6891MiB /  7949MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1181      G   /usr/lib/xorg/Xorg                           191MiB |
|    0      2298      G   cinnamon                                      94MiB |
|    0      9334      G   /usr/lib/firefox/firefox                       3MiB |
|    0     15779      C   ./CUDAPm1                                   6587MiB |
|    0     16292      G   /usr/lib/firefox/firefox                       3MiB |
+-----------------------------------------------------------------------------+[/CODE]
[QUOTE=aaronhaviland;500309]
One reason the code uses less memory than is available is that (based on my understanding, at least) it: [LIST=1][*]Determines the value of nrp based on the available memory and FFT size (and for some reason restricts it to 4 GiB on Windows; possibly a 32-bit issue, or something left over from older CUDA versions?)[/LIST][/QUOTE] If addressed per byte, 4 GiB is all a 32-bit (4-byte) pointer can address.
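As a quick sanity check on that figure:

```python
# A 32-bit pointer distinguishes 2**32 byte addresses,
# which is exactly 4 GiB:
print(2 ** 32 // 1024 ** 3)  # -> 4
```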
In CUDAPm1 v0.20, with higher exponents, and particularly on smaller-memory GPUs, NRP is smaller, and I see instances of several passes sometimes followed by a runt final pass. I'm even occasionally seeing small prime numbers as the value of NRP for most passes of a run.
1 Attachment(s)
[QUOTE=science_man_88;500314]If addressed per byte, 4 GiB is all a 32-bit (4-byte) pointer can address.[/QUOTE]
Yeah... that's why I'm speculating it might be a 32-bit-specific issue. Anyway, here's the CUDA 8.0, 64-bit, compute capability 2.0 binary I didn't promise (lol). Completely untested... I'm actually attaching it here so I can download it when I reboot into Windows.
[QUOTE=aaronhaviland;500317]Yeah... that's why I'm speculating it might be a 32-bit-specific issue.
Anyway, here's the CUDA 8.0, 64-bit, compute capability 2.0 binary I didn't promise (lol). Completely untested... I'm actually attaching it here so I can download it when I reboot into Windows.[/QUOTE] Thanks! Is that the same commit as the other image, or today's (c1afcee...)?
[QUOTE=kriesel;500319]Thanks! Is that the same commit as the other image, or today's (c1afcee...)?[/QUOTE]
Same commit. c1afcee is effectively the same; it's just prior to the minor fixes I needed for VS to process the build.