
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   The P-1 factoring CUDA program (https://www.mersenneforum.org/showthread.php?t=17835)

aaronhaviland 2018-11-22 03:30

Okay, obviously I need to add some more verbosity and safety checks around certain spots to diagnose these silent halts.

I still feel like there's something wrong with the malloc in Stef42's case. It may be that even though X GB of RAM is available, not all of it is available to a single malloc, and we currently don't have any code to deal with that.
I didn't think to check the square() kernel timings for invalid results, so I'll need to add a check for that as well. As a side note, though, I'm not sure 23040K is actually the best FFT length for this exponent, so I find it interesting that it's what's being chosen. I may need to force invalidation of previously run timings if we suspect there are issues with them.

kriesel 2018-11-22 07:10

Why don't these match?[CODE]Best time for fft = 23040K, time: 1.6869, t1 = 512, [COLOR=red][B]t2 = 64, t3 = 1024[/B][/COLOR]
Using threads: norm1 512, [COLOR=Red][B]mult 128, norm2 128[/B][/COLOR].[/CODE]fft file excerpt (from a quick v0.22 -cufftbench 1 32768 1)[CODE]16384 294471259 11.1004
18432 330441847 12.4128
18816 337176443 13.8883
20480 366326371 14.2631
20736 370806323 15.4177
21168 378363589 15.5279
23040 411074273 15.8456
23328 416101459 16.4934
23625 421284407 18.0096
24192 431175197 18.3017
25088 446794913 18.3473
32768 580225813 18.7128[/CODE]Given there's no averaging (tries=1), it looks OK to select 23040K for a ~400M exponent:[CODE]fft size = 21168K, ave time = 15.5279 msec, max-ave = 0.00000
fft size = 21384K, ave time = 19.1372 msec, max-ave = 0.00000
fft size = 21504K, ave time = 16.2552 msec, max-ave = 0.00000
fft size = 21560K, ave time = 20.1739 msec, max-ave = 0.00000
fft size = 21600K, ave time = 16.3524 msec, max-ave = 0.00000
fft size = 21609K, ave time = 16.1015 msec, max-ave = 0.00000
fft size = 21840K, ave time = 21.0564 msec, max-ave = 0.00000
fft size = 21870K, ave time = 17.1037 msec, max-ave = 0.00000
fft size = 21875K, ave time = 17.4419 msec, max-ave = 0.00000
fft size = 21952K, ave time = 16.6437 msec, max-ave = 0.00000
fft size = 22000K, ave time = 20.8197 msec, max-ave = 0.00000
fft size = 22050K, ave time = 18.0051 msec, max-ave = 0.00000
fft size = 22113K, ave time = 21.7934 msec, max-ave = 0.00000
fft size = 22176K, ave time = 19.7003 msec, max-ave = 0.00000
fft size = 22275K, ave time = 22.6468 msec, max-ave = 0.00000
fft size = 22295K, ave time = 23.8952 msec, max-ave = 0.00000
fft size = 22400K, ave time = 17.7254 msec, max-ave = 0.00000
fft size = 22464K, ave time = 19.3179 msec, max-ave = 0.00000
fft size = 22500K, ave time = 18.3411 msec, max-ave = 0.00000
fft size = 22528K, ave time = 19.3020 msec, max-ave = 0.00000
fft size = 22638K, ave time = 23.7112 msec, max-ave = 0.00000
fft size = 22680K, ave time = 17.3189 msec, max-ave = 0.00000
fft size = 22750K, ave time = 22.4622 msec, max-ave = 0.00000
fft size = 22932K, ave time = 20.8813 msec, max-ave = 0.00000
fft size = 23040K, ave time = 15.8456 msec, max-ave = 0.00000[/CODE]

VictordeHolland 2018-11-22 09:29

[QUOTE=aaronhaviland;500632]In prior Windows releases, this program would not make use of more than 4GiB of video RAM. I lifted that restriction for this build, because I found no issues with it on my 8GiB RTX 2070. The only other cards I had available were 3GiB and 2GiB, so I didn't bother trying them.

Noticing that you have a device with 11GiB, I'm very curious to find out whether there was another reason for this limitation that I hadn't been able to determine. Especially since you mention it "starts filling the GPU memory": it's trying to malloc, and failing.

If you could, please do me a favour and fiddle with the UnusedMem value in the .ini file, and see if you can determine a value that doesn't crash. I would start with a value around 7168, as that would simulate the old 4GiB limitation (11GiB - 4GiB = 7GiB; 7 * 1024 = 7168).[/QUOTE]
I tried M89326001 and had the same issue as Stef42 with my GTX 1080 Ti (11GiB): stage 2 would quit without an error message. I put:
[code]
UnusedMem=7168[/code]in the CUDAPm1.ini file and it seems to run stage 2 now. So you're on to something :).

Stef42 2018-11-23 20:25

[QUOTE=aaronhaviland;500704]Okay, obviously I need to add some more verbosity and safety checks around certain spots to diagnose these silent halts.

I still feel like there's something wrong with the malloc in Stef42's case. It may be that even though X GB of RAM is available, not all of it is available to a single malloc, and we currently don't have any code to deal with that.
I didn't think to check the square() kernel timings for invalid results, so I'll need to add a check for that as well. As a side note, though, I'm not sure 23040K is actually the best FFT length for this exponent, so I find it interesting that it's what's being chosen. I may need to force invalidation of previously run timings if we suspect there are issues with them.[/QUOTE]

So far I have managed to finish it with:
[QUOTE]UnusedMem=2048
[/QUOTE]

aaronhaviland 2018-11-23 23:52

Okay, thanks to both of you. Evidently there's a limit to cudaMalloc that differs from what is actually "free" RAM, and it varies depending on the card and the system. At this point, I believe the most likely culprit is the difference between free memory and free contiguous memory; the latter is not a number we can query, but rather one we have to determine by trial and error (according to what I've been able to google, at least).

I'm going to try to modify it so that it keeps trying to malloc progressively smaller amounts until it finds a value that works. That might take a couple of days to figure out, since changing the RAM size forces a recalculation of other things that I haven't quite worked out yet.

aaronhaviland 2018-11-25 04:15

1 Attachment(s)
Could either one of you please run this app on one of the offending cards and let me know the output? I threw it together really quickly (CUDA 10), but it reports what the driver says is free, and the maximum size cudaMalloc can claim. This would confirm the suspicions.

It defaults to device #0. Let me know if you need it to point to a different device. Since it was a quick build, I didn't include code for command-line options.

Stef42 2018-11-25 19:35

[QUOTE=aaronhaviland;500923]Could either one of you please run this app on one of the offending cards and let me know the output? I threw it together really quickly (CUDA 10), but it reports what the driver says is free, and the maximum size cudaMalloc can claim. This would confirm the suspicions.

It defaults to device #0. Let me know if you need it to point to a different device. Since it was a quick build, I didn't include code for command-line options.[/QUOTE]

Output

[QUOTE]C:\Users\steph\Downloads>cudamalloctest.exe
Cuda reported
Free VRAM: 9314MiB
Total VRAM: 11264MiB
Max cudaMalloc: 9314MiB[/QUOTE]

kriesel 2018-11-30 20:51

gcd impact
 
1 Attachment(s)
On a dual-X5650 Xeon HP 600, with prime95 workers using 2 cores each, when CUDAPm1 (0.20) uses a single core for gcd computations, it idles another core (stopping one of the 6 prime95 workers). Duration for p~380M is about 18 minutes per gcd. The impact will be higher with 3-core or larger workers. The related GPU is also idle during this time, but its VRAM stays committed.

kriesel 2018-12-03 03:40

list updated
 
The CUDAPm1 bug and wish list has been somewhat updated.
Stef42's gpu ram issue has been added.
Various fixes have been verified and indicated.
[URL]https://www.mersenneforum.org/showpost.php?p=488534&postcount=3[/URL]

kriesel 2018-12-04 17:06

gcd time and fail (0.20); excessive roundoff in 0.22
 
[QUOTE=kriesel;501353]On a dual-X5650 Xeon HP 600, with prime95 workers using 2 cores each, when CUDAPm1 (0.20) uses a single core for gcd computations, it idles another core (stopping one of the 6 prime95 workers). Duration for p~380M is about 18 minutes per gcd. The impact will be higher with 3-core or larger workers. The related GPU is also idle during this time, but its VRAM stays committed.[/QUOTE]
On a dual-Xeon E5520 Lenovo D20, when CUDAPm1 v0.20 uses a single core for the stage 1 gcd computation, it idles the GPU for about 39 minutes with p~414M. A prime95 instance is not much affected in this case, since hyperthreading is enabled on this system, so Task Manager shows prime95's 50% utilization unaffected. The gcd fails.

The same t file used to start the 0.20 run was attempted on v0.22, but it fails the roundoff error check within the next 100 iterations.
History of this exponent: the run was started on a GTX 1060, which failed with an out-of-memory crash in stage 2 after picking a higher than expected NRP, and a retry there also failed, wanting 4GB; a restart from late stage 1 on a GTX 1050 Ti failed the stage 1 gcd; a Quadro 5000 run from the late stage 1 file also failed the stage 1 gcd; a GTX 480 attempt completed stage 1 through "found no factor".

Running through a collection of c and t files (5 total), representing late stage 1 and just after the stage 1 gcd, neither v0.20 nor v0.22 can carry the computation forward to completion on the GTX 1080 Ti.
[CODE]batch wrapper reports (re)launch at Tue 12/04/2018 9:39:35.52 reset count 0 of max 3
CUDAPm1 v0.20
------- DEVICE 0 -------
name GeForce GTX 1080 Ti
Compatibility 6.1
clockRate (MHz) 1620
memClockRate (MHz) 5505
totalGlobalMem zu
totalConstMem zu
l2CacheSize 2883584
sharedMemPerBlock zu
regsPerBlock 65536
warpSize 32
memPitch zu
maxThreadsPerBlock 1024
maxThreadsPerMP 2048
multiProcessorCount 28
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment zu
deviceOverlap 1

CUDA reports 10988M of 11264M GPU memory free.
Using threads: norm1 32, mult 32, norm2 32.
Using up to 5285M GPU memory.
Selected B1=3250000, B2=77187500, 3.64% chance of finding a factor
Using B1 = 3215000 from savefile.
Continuing stage 1 from a partial result of M414000007 fft length = 23328K, iteration = 4625001
M414000007, 0x4f7c556075b4f7f3, n = 23328K, CUDAPm1 v0.20
Stage 1 complete, estimated total time = 63:37:29batch wrapper reports (re)launch at Tue 12/04/2018 10:26:30.31 reset count 0 of max 3
CUDAPm1 v0.22
------- DEVICE 0 -------
name GeForce GTX 1080 Ti
Compatibility 6.1
clockRate (MHz) 1620
memClockRate (MHz) 5505
totalGlobalMem 11811160064
totalConstMem 65536
l2CacheSize 2883584
sharedMemPerBlock 49152
regsPerBlock 65536
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 2048
multiProcessorCount 28
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment 512
deviceOverlap 1

CUDA reports 10988M of 11264M GPU memory free.
Using threads: norm1 512, mult 64, norm2 32.
Using up to 10752M GPU memory.
Selected B1=3990000, B2=93765000, 3.86% chance of finding a factor
Using B1 = 3215000 from savefile.
Continuing stage 1 from a partial result of M414000007 fft length = 23328K, iteration = 4625001
Iteration = 4625100, err = 0.5 >= 0.40, quitting.
Estimated time spent so far: 63:33:25

batch wrapper reports exit at Tue 12/04/2018 10:28:09.65 [/CODE]

storm5510 2018-12-13 14:23

[QUOTE=Stef42;500595]I have reserved the exponent through GPU72.com.
Worktodo does indeed look like this:

[CODE]Pfactor=N/A,1,2,89326001,-1,76,2 [/CODE]A few assignments were completed from GPU72.com before this one. The funny thing was that similar exponents in the 89M range only used roughly 4300MB of memory.[/QUOTE]

My only "beef" with it is that it will not accept the long form where one can specify the bounds:

[CODE]Pminus1=1,2,<exponent>,-1,100000000,1000000000,65[/CODE]I never had any luck running it this way.

