Okay, obviously I need to add some more verbosity and safety checks around certain spots to diagnose these silent halts.
I still feel like there's something wrong with the malloc in Stef42's case. It might be that even though X GB of RAM is available, not all of it is available to a single malloc, and we currently don't have any code to deal with that. I didn't think to check the square() kernel timings for invalid results, so I'll need to add a check for that as well. As a side note, I am not sure 23040K is actually the best FFT length for this exponent, so I find it interesting that it's what's being chosen. I may need to enforce invalidation of previously run timings if we suspect there are issues with them.
Why don't these match?[CODE]Best time for fft = 23040K, time: 1.6869, t1 = 512, [COLOR=red][B]t2 = 64, t3 = 1024[/B][/COLOR]
Using threads: norm1 512, [COLOR=Red][B]mult 128, norm2 128[/B][/COLOR].[/CODE]fft file excerpt (from a quick v0.22 -cufftbench 1 32768 1):[CODE]16384 294471259 11.1004
18432 330441847 12.4128
18816 337176443 13.8883
20480 366326371 14.2631
20736 370806323 15.4177
21168 378363589 15.5279
23040 411074273 15.8456
23328 416101459 16.4934
23625 421284407 18.0096
24192 431175197 18.3017
25088 446794913 18.3473
32768 580225813 18.7128[/CODE]Given there's no averaging (tries=1), it looks OK to select 23040K for a 400M exponent:[CODE]fft size = 21168K, ave time = 15.5279 msec, max-ave = 0.00000
fft size = 21384K, ave time = 19.1372 msec, max-ave = 0.00000
fft size = 21504K, ave time = 16.2552 msec, max-ave = 0.00000
fft size = 21560K, ave time = 20.1739 msec, max-ave = 0.00000
fft size = 21600K, ave time = 16.3524 msec, max-ave = 0.00000
fft size = 21609K, ave time = 16.1015 msec, max-ave = 0.00000
fft size = 21840K, ave time = 21.0564 msec, max-ave = 0.00000
fft size = 21870K, ave time = 17.1037 msec, max-ave = 0.00000
fft size = 21875K, ave time = 17.4419 msec, max-ave = 0.00000
fft size = 21952K, ave time = 16.6437 msec, max-ave = 0.00000
fft size = 22000K, ave time = 20.8197 msec, max-ave = 0.00000
fft size = 22050K, ave time = 18.0051 msec, max-ave = 0.00000
fft size = 22113K, ave time = 21.7934 msec, max-ave = 0.00000
fft size = 22176K, ave time = 19.7003 msec, max-ave = 0.00000
fft size = 22275K, ave time = 22.6468 msec, max-ave = 0.00000
fft size = 22295K, ave time = 23.8952 msec, max-ave = 0.00000
fft size = 22400K, ave time = 17.7254 msec, max-ave = 0.00000
fft size = 22464K, ave time = 19.3179 msec, max-ave = 0.00000
fft size = 22500K, ave time = 18.3411 msec, max-ave = 0.00000
fft size = 22528K, ave time = 19.3020 msec, max-ave = 0.00000
fft size = 22638K, ave time = 23.7112 msec, max-ave = 0.00000
fft size = 22680K, ave time = 17.3189 msec, max-ave = 0.00000
fft size = 22750K, ave time = 22.4622 msec, max-ave = 0.00000
fft size = 22932K, ave time = 20.8813 msec, max-ave = 0.00000
fft size = 23040K, ave time = 15.8456 msec, max-ave = 0.00000[/CODE]
[QUOTE=aaronhaviland;500632]In prior Windows releases, this program would not make use of more than 4GiB of video RAM. I lifted that restriction for this build, because I found no issues with it on my 8GiB RTX 2070. The only other cards I had available were 3GiB and 2GiB, so I didn't bother trying them.
Noticing that you have a device with 11GiB, I'm very curious to find out if there was another reason for this limitation that I hadn't been able to determine. Especially since you mention it "starts filling the GPU memory", which it's trying to malloc, and failing. Could you please do me a favour and fiddle with the UnusedMem value in the .ini file, and see if you can determine a value that doesn't crash? I would start with a value around 7168, as that would simulate the old 4GiB limitation (11GiB - 4GiB = 7GiB = 7168MiB).[/QUOTE] I tried M89326001 and had the same issue as Stef42 with my GTX 1080 Ti (11GiB): stage 2 would quit without an error message. I put[CODE]UnusedMem=7168[/CODE]in the CUDAPm1.ini file and it seems to run stage 2 now. So you're on to something :).
[QUOTE=aaronhaviland;500704]Okay, obviously I need to add some more verbosity and safety checks around certain spots to diagnose these silent halts. [...][/QUOTE] So far I have managed to finish it with:[CODE]UnusedMem=2048[/CODE]
Okay, thanks both of you. Evidently there's a limit to cudaMalloc that differs from what is actually "free" RAM, and it varies depending on the card and the system. At this point, I believe the most likely culprit is the difference between free memory and free contiguous memory; the latter is not a number we can query, but rather one we have to determine by trial and error (according to what I've been able to google, at least).
I'm going to try to modify the code so that it keeps trying to malloc progressively smaller amounts until it finds a value that works. But that might take a couple of days to figure out, since changing the RAM size will force a recalculation of other things that I haven't quite worked out yet.
1 Attachment(s)
Could either one of you please run this app on one of the offending cards and let me know the output? I threw it together really quickly (CUDA 10), but it reports what the driver says is free, and the max cudaMalloc size it can claim. This would confirm the suspicions.
It defaults to device #0. Let me know if you need it to point to a different device. Since it was a quick build, I didn't include code for command-line options.
[QUOTE=aaronhaviland;500923]Could either one of you please run this app on one of the offending cards, and let me know the output? [...][/QUOTE] Output:[CODE]C:\Users\steph\Downloads>cudamalloctest.exe
Cuda reported Free VRAM: 9314MiB
Total VRAM: 11264MiB
Max cudaMalloc: 9314MiB[/CODE]
gcd impact
1 Attachment(s)
On a dual X5650 Xeon HP 600, with prime95 workers using 2 cores each, when CUDAPm1 (0.20) uses a single core for gcd computations, it idles another core (stopping one of the 6 prime95 workers). Duration for p~380M is about 18 minutes per gcd. The impact will be higher with 3-core or larger workers. The related GPU is also idle during this time, but its VRAM stays committed.
list updated
The CUDAPm1 bug and wish list has been somewhat updated.
Stef42's GPU RAM issue has been added. Various fixes have been verified and indicated. [URL]https://www.mersenneforum.org/showpost.php?p=488534&postcount=3[/URL]
gcd time and fail (0.20); excessive roundoff in 0.22
[QUOTE=kriesel;501353]On a dual X5650 Xeon HP 600, with prime95 workers using 2 cores each, when CUDAPm1 (0.20) uses a single core for gcd computations, it idles another core [...][/QUOTE]
On a dual Xeon E5520 Lenovo D20, when CUDAPm1 v0.20 uses a single core for the stage 1 gcd computation, it idles the GPU for about 39 minutes with p~414M. A prime95 instance is not much affected in this case, since hyperthreading is enabled on this system; task manager shows prime95's 50% utilization unaffected. The gcd fails. The same t file used to start the 0.20 run was attempted on v0.22 but fails the roundoff error check within the next 100 iterations. The history of this exponent: the run was started on a GTX 1060, which failed with an out-of-memory crash in stage 2 after picking a higher than expected NRP; a retry there failed, wanting 4GB; a restart from late stage 1 on a GTX 1050 Ti failed the stage 1 gcd; a Quadro 5000 run from the late stage 1 file failed the stage 1 gcd; a GTX 480 try completed stage 1 through "found no factor". Running through a collection of c and t files (5 total) representing late stage 1 and just after the stage 1 gcd, neither v0.20 nor v0.22 can carry the computation forward to completion on the GTX 1080 Ti.[CODE]batch wrapper reports (re)launch at Tue 12/04/2018 9:39:35.52
reset count 0 of max 3
CUDAPm1 v0.20
------- DEVICE 0 -------
name                GeForce GTX 1080 Ti
Compatibility       6.1
clockRate (MHz)     1620
memClockRate (MHz)  5505
totalGlobalMem      zu
totalConstMem       zu
l2CacheSize         2883584
sharedMemPerBlock   zu
regsPerBlock        65536
warpSize            32
memPitch            zu
maxThreadsPerBlock  1024
maxThreadsPerMP     2048
multiProcessorCount 28
maxThreadsDim[3]    1024,1024,64
maxGridSize[3]      2147483647,65535,65535
textureAlignment    zu
deviceOverlap       1
CUDA reports 10988M of 11264M GPU memory free.
Using threads: norm1 32, mult 32, norm2 32.
Using up to 5285M GPU memory.
Selected B1=3250000, B2=77187500, 3.64% chance of finding a factor
Using B1 = 3215000 from savefile.
Continuing stage 1 from a partial result of M414000007 fft length = 23328K, iteration = 4625001
M414000007, 0x4f7c556075b4f7f3, n = 23328K, CUDAPm1 v0.20
Stage 1 complete, estimated total time = 63:37:29

batch wrapper reports (re)launch at Tue 12/04/2018 10:26:30.31
reset count 0 of max 3
CUDAPm1 v0.22
------- DEVICE 0 -------
name                GeForce GTX 1080 Ti
Compatibility       6.1
clockRate (MHz)     1620
memClockRate (MHz)  5505
totalGlobalMem      11811160064
totalConstMem       65536
l2CacheSize         2883584
sharedMemPerBlock   49152
regsPerBlock        65536
warpSize            32
memPitch            2147483647
maxThreadsPerBlock  1024
maxThreadsPerMP     2048
multiProcessorCount 28
maxThreadsDim[3]    1024,1024,64
maxGridSize[3]      2147483647,65535,65535
textureAlignment    512
deviceOverlap       1
CUDA reports 10988M of 11264M GPU memory free.
Using threads: norm1 512, mult 64, norm2 32.
Using up to 10752M GPU memory.
Selected B1=3990000, B2=93765000, 3.86% chance of finding a factor
Using B1 = 3215000 from savefile.
Continuing stage 1 from a partial result of M414000007 fft length = 23328K, iteration = 4625001
Iteration = 4625100, err = 0.5 >= 0.40, quitting.
Estimated time spent so far: 63:33:25
batch wrapper reports exit at Tue 12/04/2018 10:28:09.65[/CODE]
[QUOTE=Stef42;500595]I have reserved the exponent through GPU72.com.
Worktodo does indeed look like this: [CODE]Pfactor=N/A,1,2,89326001,-1,76,2[/CODE]A few assignments were completed from GPU72.com before this one. Funny thing was that similar exponents in the 89M range only used roughly 4300MB of memory.[/QUOTE] My only "beef" with it is that it will not accept the long form where one can specify the bounds: [CODE]Pminus1=1,2,<exponent>,-1,100000000,1000000000,65[/CODE]I never had any luck trying to run it this way.