mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing
Old 2018-11-22, 03:30   #661
aaronhaviland
 
Jan 2011
Dudley, MA, USA

1001001₂ Posts

Okay, obviously I need to add some more verbosity and safety checks around certain spots to diagnose these silent halts.

I still feel like there's something wrong with the malloc in Stef42's case. It may be that even though X GB of RAM is available, not all of it is available to a single malloc, and we currently don't have any code to deal with that.
I didn't think to check the square() kernel timings for invalid results, so I'll need to add a check for that as well. As a side note, I'm not sure 23040K is actually the best FFT length for this exponent, so I find it interesting that it's what is being chosen. I may need to enforce invalidation of previously run timings if we suspect there are issues with them.

Last fiddled with by aaronhaviland on 2018-11-22 at 03:31
Old 2018-11-22, 07:10   #662
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5,419 Posts

Why don't these match?
Code:
Best time for fft = 23040K, time: 1.6869, t1 = 512, t2 = 64, t3 = 1024 
 Using threads: norm1 512, mult 128, norm2 128.
fft file excerpt (from a quick v0.22 -cufftbench 1 32768 1)
Code:
16384  294471259  11.1004
18432  330441847  12.4128
18816  337176443  13.8883
20480  366326371  14.2631
20736  370806323  15.4177
21168  378363589  15.5279
23040  411074273  15.8456
23328  416101459  16.4934
23625  421284407  18.0096
24192  431175197  18.3017
25088  446794913  18.3473
32768  580225813  18.7128
Given there's no averaging (tries = 1), it looks OK to select 23040K for a ~400M exponent:
Code:
fft size = 21168K, ave time = 15.5279 msec, max-ave = 0.00000
fft size = 21384K, ave time = 19.1372 msec, max-ave = 0.00000
fft size = 21504K, ave time = 16.2552 msec, max-ave = 0.00000
fft size = 21560K, ave time = 20.1739 msec, max-ave = 0.00000
fft size = 21600K, ave time = 16.3524 msec, max-ave = 0.00000
fft size = 21609K, ave time = 16.1015 msec, max-ave = 0.00000
fft size = 21840K, ave time = 21.0564 msec, max-ave = 0.00000
fft size = 21870K, ave time = 17.1037 msec, max-ave = 0.00000
fft size = 21875K, ave time = 17.4419 msec, max-ave = 0.00000
fft size = 21952K, ave time = 16.6437 msec, max-ave = 0.00000
fft size = 22000K, ave time = 20.8197 msec, max-ave = 0.00000
fft size = 22050K, ave time = 18.0051 msec, max-ave = 0.00000
fft size = 22113K, ave time = 21.7934 msec, max-ave = 0.00000
fft size = 22176K, ave time = 19.7003 msec, max-ave = 0.00000
fft size = 22275K, ave time = 22.6468 msec, max-ave = 0.00000
fft size = 22295K, ave time = 23.8952 msec, max-ave = 0.00000
fft size = 22400K, ave time = 17.7254 msec, max-ave = 0.00000
fft size = 22464K, ave time = 19.3179 msec, max-ave = 0.00000
fft size = 22500K, ave time = 18.3411 msec, max-ave = 0.00000
fft size = 22528K, ave time = 19.3020 msec, max-ave = 0.00000
fft size = 22638K, ave time = 23.7112 msec, max-ave = 0.00000
fft size = 22680K, ave time = 17.3189 msec, max-ave = 0.00000
fft size = 22750K, ave time = 22.4622 msec, max-ave = 0.00000
fft size = 22932K, ave time = 20.8813 msec, max-ave = 0.00000
fft size = 23040K, ave time = 15.8456 msec, max-ave = 0.00000
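A minimal sketch of how a benchmark table like the excerpt above could drive FFT selection. This is an illustration, not CUDAPm1's actual code; it assumes rows of (FFT length in K, maximum exponent, average time in ms) and picks the fastest length whose exponent limit covers p:

```c
#include <assert.h>
#include <stddef.h>

/* One row of a cufftbench result file: FFT length in K, the largest
 * exponent that length can handle, and the benchmarked time in ms. */
struct fft_row { int fft_k; long max_exponent; double avg_ms; };

/* Pick the FFT length with the lowest benchmark time among rows big
 * enough for exponent p; returns -1 if no row qualifies. */
int pick_fft(const struct fft_row *rows, size_t n, long p)
{
    int best_k = -1;
    double best_ms = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (rows[i].max_exponent < p)
            continue;                  /* too small for this exponent */
        if (best_k == -1 || rows[i].avg_ms < best_ms) {
            best_k = rows[i].fft_k;
            best_ms = rows[i].avg_ms;
        }
    }
    return best_k;
}
```

Fed the excerpt above, a ~400M exponent lands on 23040K, since it is both the smallest qualifying length and, in this run, also the fastest one.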

Last fiddled with by kriesel on 2018-11-22 at 07:39
Old 2018-11-22, 09:29   #663
VictordeHolland
 
 
"Victor de Hollander"
Aug 2011
the Netherlands

2³×3×7² Posts

Quote:
Originally Posted by aaronhaviland View Post
In prior Windows releases, this program would not make use of more than 4GiB of video RAM. I lifted that restriction for this build, because I found no issues with it on my 8GiB RTX 2070. The only other cards I had available were 3GiB and 2GiB, so I didn't bother trying them.

Noticing that you have a device with 11GiB, I'm very curious to find out if there was another reason for this limitation that I hadn't been able to determine. Especially since you mention it "starts filling the GPU memory", which it's trying to malloc, and failing.

If you could please do me a favour and fiddle with the UnusedMem value in the .ini file, and see if you can determine a value that doesn't crash. I would start with a value of around 7168, as that would simulate the old 4GiB limitation (11 GiB − 4 GiB = 7 GiB; 7 × 1024 = 7168).
I tried M89326001 and had the same issue as Stef42 with my GTX 1080 Ti (11GiB): stage 2 would quit without an error message. I put:
Code:
UnusedMem=7168
in the CUDAPm1.ini file, and it seems to run stage 2 now. So you're on to something :).
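For illustration, assuming UnusedMem is a MiB value held back from the card's memory when sizing allocations (a guess based on the 11 GiB minus 4 GiB arithmetic in the quoted post, not CUDAPm1's documented behaviour), the cap works out like this:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the UnusedMem setting: the .ini value (MiB)
 * is held back from the card's memory, and the program targets the
 * rest.  An assumption for illustration, not CUDAPm1's actual code. */
size_t usable_mib(size_t total_mib, size_t unused_mem_mib)
{
    return total_mib > unused_mem_mib ? total_mib - unused_mem_mib : 0;
}
```

With an 11 GiB (11264 MiB) card, UnusedMem=7168 leaves 4096 MiB, reproducing the old 4 GiB ceiling.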
Old 2018-11-23, 20:25   #664
Stef42
 
Feb 2012
the Netherlands

111010₂ Posts

Quote:
Originally Posted by aaronhaviland View Post
Okay, obviously I need to add some more verbosity and safety checks around certain spots to diagnose these silent halts.

I still feel like there's something wrong with the malloc in Stef42's case. It may be that even though X GB of RAM is available, not all of it is available to a single malloc, and we currently don't have any code to deal with that.
I didn't think to check the square() kernel timings for invalid results, so I'll need to add a check for that as well. As a side note, I'm not sure 23040K is actually the best FFT length for this exponent, so I find it interesting that it's what is being chosen. I may need to enforce invalidation of previously run timings if we suspect there are issues with them.
So far I have managed to finish it with:
Quote:
UnusedMem=2048
Old 2018-11-23, 23:52   #665
aaronhaviland
 
Jan 2011
Dudley, MA, USA

73 Posts

Okay, thanks, both of you. Evidently there's a limit to cudaMalloc that differs from what is actually "free" RAM, and it varies with the card and the system. At this point, I believe the most likely culprit is the difference between free memory and free contiguous memory; the latter is not a number we can query, but rather one we have to determine by trial and error (according to what I've been able to google, at least).

I'm going to try to modify the code so that it keeps trying to malloc progressively smaller amounts until it finds a value that works. But that might take a couple of days to figure out, since changing the RAM size will force a recalculation of other things that I haven't quite worked out yet.
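That progressive back-off idea can be sketched with the allocator behind a function pointer, so the strategy runs without a GPU; in CUDAPm1 the hook would wrap cudaMalloc, and fake_gpu_alloc below is a made-up stand-in:

```c
#include <assert.h>
#include <stddef.h>

/* Allocation attempt behind a function pointer so the strategy can be
 * tested without a GPU; in CUDAPm1 this would wrap cudaMalloc.
 * Returns 0 on success, nonzero on failure (like a cudaError_t). */
typedef int (*alloc_fn)(void **ptr, size_t bytes);

/* Try to allocate `want` units, stepping down by `step` after each
 * failure, stopping once the next attempt would fall below `floor_sz`.
 * Returns the size obtained, or 0 if nothing worked. */
size_t alloc_with_backoff(alloc_fn try_alloc, void **out,
                          size_t want, size_t step, size_t floor_sz)
{
    size_t sz = want;
    for (;;) {
        if (try_alloc(out, sz) == 0)
            return sz;                 /* success at this size */
        if (sz < floor_sz + step)      /* next step would go below floor */
            break;
        sz -= step;
    }
    *out = NULL;
    return 0;
}

/* Made-up stand-in for cudaMalloc: pretend only 9 units are available
 * to a single allocation, however much is reported "free" overall. */
static int fake_gpu_alloc(void **ptr, size_t bytes)
{
    if (bytes > 9)
        return 1;
    *ptr = (void *)1;                  /* dummy non-NULL handle */
    return 0;
}
```

Starting at 11 units and stepping down by 1 lands on 9, mirroring a card whose largest single cudaMalloc is smaller than its reported free total.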
Old 2018-11-25, 04:15   #666
aaronhaviland
 
Jan 2011
Dudley, MA, USA

73 Posts

Could either one of you please run this app on one of the offending cards and let me know the output? I threw it together really quickly (CUDA 10), but it reports what the driver says is free, and the maximum cudaMalloc size it can claim. This would confirm the suspicions.

It defaults to device #0. Let me know if you need it to point to a different device. Since it was a quick build, I didn't include code for command-line options.
Attached Files
File Type: exe cudaMallocTest.exe (112.0 KB, 91 views)
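The probing side can be a binary search over trial allocations. A sketch of how a tool like cudaMallocTest might find the largest single allocation, with a hypothetical stand-in (fake_cuda_malloc) in place of cudaMalloc/cudaFree:

```c
#include <assert.h>
#include <stddef.h>

/* Allocation attempt behind a function pointer; in the real tool this
 * would cudaMalloc the requested size and cudaFree it on success.
 * Returns 0 on success, nonzero on failure. */
typedef int (*alloc_fn)(void **ptr, size_t bytes);

/* Binary-search the largest size a single allocation can obtain,
 * between 0 and `hi` (e.g. the driver-reported free MiB). */
size_t probe_max_alloc(alloc_fn try_alloc, size_t hi)
{
    size_t lo = 0, best = 0;
    while (lo <= hi) {
        size_t mid = lo + (hi - lo) / 2;
        void *p = NULL;
        if (try_alloc(&p, mid) == 0) {
            best = mid;                /* mid fits: look higher */
            lo = mid + 1;
        } else {
            if (mid == 0)
                break;                 /* even 0 "fails": give up */
            hi = mid - 1;              /* mid too big: look lower */
        }
    }
    return best;
}

/* Hypothetical stand-in: a card reporting 11264 MiB free but only
 * 9314 MiB claimable by one allocation. */
static int fake_cuda_malloc(void **ptr, size_t mib)
{
    (void)ptr;
    return mib <= 9314 ? 0 : 1;
}
```

Each probe allocates and immediately frees, so the search itself leaves the card's memory untouched.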

Last fiddled with by aaronhaviland on 2018-11-25 at 04:16
Old 2018-11-25, 19:35   #667
Stef42
 
Feb 2012
the Netherlands

2·29 Posts

Quote:
Originally Posted by aaronhaviland View Post
Could either one of you please run this app on one of the offending cards and let me know the output? I threw it together really quickly (CUDA 10), but it reports what the driver says is free, and the maximum cudaMalloc size it can claim. This would confirm the suspicions.

It defaults to device #0. Let me know if you need it to point to a different device. Since it was a quick build, I didn't include code for command-line options.
Output

Quote:
C:\Users\steph\Downloads>cudamalloctest.exe
Cuda reported
Free VRAM: 9314MiB
Total VRAM: 11264MiB
Max cudaMalloc: 9314MiB

Last fiddled with by Stef42 on 2018-11-25 at 19:36
Old 2018-11-30, 20:51   #668
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

152B₁₆ Posts
gcd impact

On a dual-X5650 Xeon HP 600, with prime95 workers using 2 cores each, when CUDAPm1 (0.20) uses a single core for gcd computations, it idles another core (stops one of the 6 prime95 workers). Duration for p~380M is about 18 minutes per gcd. The impact will be higher with 3-core or larger workers. The related GPU is also idle, with its VRAM still committed, during this time.
Attached Thumbnail: s1gcd on m400m idles a cpu core for 18 minutes.png (19.9 KB)

Last fiddled with by kriesel on 2018-11-30 at 20:52
Old 2018-12-03, 03:40   #669
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5,419 Posts
list updated

The CUDAPm1 bug and wish list has been somewhat updated.
Stef42's gpu ram issue has been added.
Various fixes have been verified and indicated.
https://www.mersenneforum.org/showpo...34&postcount=3
Old 2018-12-04, 17:06   #670
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5,419 Posts
gcd time and fail (0.20); excessive roundoff in 0.22

Quote:
Originally Posted by kriesel View Post
On a dual-X5650 Xeon HP 600, with prime95 workers using 2 cores each, when CUDAPm1 (0.20) uses a single core for gcd computations, it idles another core (stops one of the 6 prime95 workers). Duration for p~380M is about 18 minutes per gcd. The impact will be higher with 3-core or larger workers. The related GPU is also idle, with its VRAM still committed, during this time.
On a dual-Xeon E5520 Lenovo D20, when CUDAPm1 v0.20 uses a single core for stage 1 gcd computations, it idles the GPU for about 39 minutes with p~414M. A prime95 instance is not much affected in this case, since hyperthreading is enabled on this system, so Task Manager shows prime95's 50% utilization unaffected. The gcd fails.

The same t file used to start the 0.20 run was attempted on v0.22, but it fails the roundoff error check within the next 100 iterations.
The history of this exponent: the run was started on a GTX 1060, which failed with an out-of-memory crash in stage 2 after picking a higher-than-expected NRP; a retry there also failed, wanting 4GB; a restart from a late stage 1 save on a GTX 1050 Ti failed the stage 1 gcd; a Quadro 5000 started from the late stage 1 file also failed the stage 1 gcd; a GTX 480 try completed stage 1 through "found no factor".

Running through a collection of c and t files (5 total), representing late stage 1 and just after the stage 1 gcd, neither v0.20 nor v0.22 can carry the computation forward to completion on the GTX 1080 Ti.
Code:
batch wrapper reports (re)launch at Tue 12/04/2018  9:39:35.52 reset count 0 of max 3 
CUDAPm1 v0.20
------- DEVICE 0 -------
name                GeForce GTX 1080 Ti
Compatibility       6.1
clockRate (MHz)     1620
memClockRate (MHz)  5505
totalGlobalMem      zu
totalConstMem       zu
l2CacheSize         2883584
sharedMemPerBlock   zu
regsPerBlock        65536
warpSize            32
memPitch            zu
maxThreadsPerBlock  1024
maxThreadsPerMP     2048
multiProcessorCount 28
maxThreadsDim[3]    1024,1024,64
maxGridSize[3]      2147483647,65535,65535
textureAlignment    zu
deviceOverlap       1

CUDA reports 10988M of 11264M GPU memory free.
Using threads: norm1 32, mult 32, norm2 32.
Using up to 5285M GPU memory.
Selected B1=3250000, B2=77187500, 3.64% chance of finding a factor
Using B1 = 3215000 from savefile.
Continuing stage 1 from a partial result of M414000007 fft length = 23328K, iteration = 4625001
M414000007, 0x4f7c556075b4f7f3, n = 23328K, CUDAPm1 v0.20
Stage 1 complete, estimated total time = 63:37:29
batch wrapper reports (re)launch at Tue 12/04/2018 10:26:30.31 reset count 0 of max 3 
CUDAPm1 v0.22
------- DEVICE 0 -------
name                GeForce GTX 1080 Ti
Compatibility       6.1
clockRate (MHz)     1620
memClockRate (MHz)  5505
totalGlobalMem      11811160064
totalConstMem       65536
l2CacheSize         2883584
sharedMemPerBlock   49152
regsPerBlock        65536
warpSize            32
memPitch            2147483647
maxThreadsPerBlock  1024
maxThreadsPerMP     2048
multiProcessorCount 28
maxThreadsDim[3]    1024,1024,64
maxGridSize[3]      2147483647,65535,65535
textureAlignment    512
deviceOverlap       1

CUDA reports 10988M of 11264M GPU memory free.
Using threads: norm1 512, mult 64, norm2 32.
Using up to 10752M GPU memory.
Selected B1=3990000, B2=93765000, 3.86% chance of finding a factor
Using B1 = 3215000 from savefile.
Continuing stage 1 from a partial result of M414000007 fft length = 23328K, iteration = 4625001
Iteration = 4625100, err =   0.5 >= 0.40, quitting.
Estimated time spent so far: 63:33:25

batch wrapper reports exit at Tue 12/04/2018 10:28:09.65
Old 2018-12-13, 14:23   #671
storm5510
Random Account
 
 
Aug 2009

3644₈ Posts

Quote:
Originally Posted by Stef42 View Post
I have reserved the exponent through GPU72.com.
Worktodo does indeed look like this:

Code:
Pfactor=N/A,1,2,89326001,-1,76,2
A few assignments were completed from GPU72.com before this one. The funny thing was that similar exponents in the 89M range only used roughly 4300MB of memory.
My only "beef" with it is that it will not accept the long form where one can specify the bounds:

Code:
Pminus1=1,2,<exponent>,-1,100000000,1000000000,65
I never had any luck in trying to run it this way.
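For reference, the fields in the two forms appear to line up as follows, inferred from the lines above and prime95's worktodo conventions; treat the field names as an assumption:

```text
Pfactor=AID,k,b,n,c,how_far_factored,tests_saved
Pminus1=k,b,n,c,B1,B2,how_far_factored
```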