#661
Jan 2011
Dudley, MA, USA
1001001₂ Posts
Okay, obviously I need to add more verbosity and safety checks around certain spots to diagnose these silent halts.

I still feel like there's something wrong with the malloc in Stef42's case. It may be that even though X GB of RAM is available, not all of it is available to a single malloc, and we currently don't have any code to deal with that. I didn't think to check the square() kernel timings for invalid results, so I'll need to add a check for that as well. As a side note, I'm not sure 23040K is actually the best FFT length for this exponent, so I find it interesting that it's what's being chosen. I may need to force invalidation of previously run timings if we suspect there are issues with them.

Last fiddled with by aaronhaviland on 2018-11-22 at 03:31
#662
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,419 Posts
Why don't these match?
Code:
Best time for fft = 23040K, time: 1.6869, t1 = 512, t2 = 64, t3 = 1024
Using threads: norm1 512, mult 128, norm2 128.
Code:
16384  294471259  11.1004
18432  330441847  12.4128
18816  337176443  13.8883
20480  366326371  14.2631
20736  370806323  15.4177
21168  378363589  15.5279
23040  411074273  15.8456
23328  416101459  16.4934
23625  421284407  18.0096
24192  431175197  18.3017
25088  446794913  18.3473
32768  580225813  18.7128
Code:
fft size = 21168K, ave time = 15.5279 msec, max-ave = 0.00000
fft size = 21384K, ave time = 19.1372 msec, max-ave = 0.00000
fft size = 21504K, ave time = 16.2552 msec, max-ave = 0.00000
fft size = 21560K, ave time = 20.1739 msec, max-ave = 0.00000
fft size = 21600K, ave time = 16.3524 msec, max-ave = 0.00000
fft size = 21609K, ave time = 16.1015 msec, max-ave = 0.00000
fft size = 21840K, ave time = 21.0564 msec, max-ave = 0.00000
fft size = 21870K, ave time = 17.1037 msec, max-ave = 0.00000
fft size = 21875K, ave time = 17.4419 msec, max-ave = 0.00000
fft size = 21952K, ave time = 16.6437 msec, max-ave = 0.00000
fft size = 22000K, ave time = 20.8197 msec, max-ave = 0.00000
fft size = 22050K, ave time = 18.0051 msec, max-ave = 0.00000
fft size = 22113K, ave time = 21.7934 msec, max-ave = 0.00000
fft size = 22176K, ave time = 19.7003 msec, max-ave = 0.00000
fft size = 22275K, ave time = 22.6468 msec, max-ave = 0.00000
fft size = 22295K, ave time = 23.8952 msec, max-ave = 0.00000
fft size = 22400K, ave time = 17.7254 msec, max-ave = 0.00000
fft size = 22464K, ave time = 19.3179 msec, max-ave = 0.00000
fft size = 22500K, ave time = 18.3411 msec, max-ave = 0.00000
fft size = 22528K, ave time = 19.3020 msec, max-ave = 0.00000
fft size = 22638K, ave time = 23.7112 msec, max-ave = 0.00000
fft size = 22680K, ave time = 17.3189 msec, max-ave = 0.00000
fft size = 22750K, ave time = 22.4622 msec, max-ave = 0.00000
fft size = 22932K, ave time = 20.8813 msec, max-ave = 0.00000
fft size = 23040K, ave time = 15.8456 msec, max-ave = 0.00000

Last fiddled with by kriesel on 2018-11-22 at 07:39
#663
"Victor de Hollander"
Aug 2011
the Netherlands
2³×3×7² Posts
Quote:
Code:
UnusdedMem=7168
#664
Feb 2012
the Netherlands
111010₂ Posts
Quote:
Quote:
#665
Jan 2011
Dudley, MA, USA
73 Posts
Okay, thanks to both of you. Clearly there's a limit to cudaMalloc that differs from what is actually "free" RAM, and it varies depending on the card and the system. At this point, I believe the most likely culprit is the difference between free memory and free contiguous memory; the latter is not a number we can query, but rather one we have to determine by trial and error (according to what I've been able to google, at least).

I'm going to modify the code so that it keeps trying to malloc progressively smaller amounts until it finds a value that works. That might take a couple of days to figure out, since changing the RAM size forces a recalculation of other things that I haven't quite worked out yet.
#666
Jan 2011
Dudley, MA, USA
73 Posts
Could either of you please run this app on one of the offending cards and let me know the output? I threw it together quickly (CUDA 10); it reports how much memory the driver says is free, and the largest cudaMalloc it can actually claim. This would confirm the suspicion.

It defaults to device #0. Let me know if you need it pointed at a different device; since it was a quick build, I didn't include command-line options.

Last fiddled with by aaronhaviland on 2018-11-25 at 04:16
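For anyone curious, such a probe might look roughly like the following (a guess at the approach, not the actual app posted; it needs the CUDA toolkit and a GPU, queries cudaMemGetInfo for what the driver says is free, then bisects for the largest single cudaMalloc that succeeds):

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    size_t free_b = 0, total_b = 0;
    cudaSetDevice(0);                        /* device #0, as in the post */
    cudaMemGetInfo(&free_b, &total_b);
    printf("driver reports %zuM free of %zuM total\n",
           free_b >> 20, total_b >> 20);

    /* Bisect for the largest single cudaMalloc that succeeds. */
    size_t lo = 0, hi = free_b;
    while (hi - lo > ((size_t)1 << 20)) {    /* stop at 1 MB resolution */
        size_t mid = lo + (hi - lo) / 2;
        void *p = NULL;
        if (cudaMalloc(&p, mid) == cudaSuccess) {
            cudaFree(p);
            lo = mid;                        /* mid fits; try larger */
        } else {
            cudaGetLastError();              /* clear the error, go smaller */
            hi = mid;
        }
    }
    printf("largest single cudaMalloc: ~%zuM\n", lo >> 20);
    return 0;
}
```

The gap between the two printed numbers is exactly the "free vs. free contiguous" difference discussed above.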
#667
Feb 2012
the Netherlands
2·29 Posts
Quote:
Quote:
Last fiddled with by Stef42 on 2018-11-25 at 19:36 |
#668
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
152B₁₆ Posts
On a dual X5650 Xeon HP 600, with prime95 workers using 2 cores each, CUDAPm1 (0.20) using a single core for GCD computations idles another core (it stops one of the 6 prime95 workers). Duration for p~380M is about 18 minutes per GCD. The impact will be higher with workers of 3 or more cores. The related GPU is also idle during this time, though its VRAM stays committed.

Last fiddled with by kriesel on 2018-11-30 at 20:52
#669
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,419 Posts
The CUDAPm1 bug and wish list has been updated somewhat. Stef42's GPU RAM issue has been added, and various fixes have been verified and marked as such. https://www.mersenneforum.org/showpo...34&postcount=3
#670
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,419 Posts
Quote:
The same t file used to start the 0.20 run was attempted on v0.22 but fails the roundoff error check within the next 100 iterations. The history of this exponent: the run was started on a GTX 1060, which crashed out of memory in stage 2 after picking a higher-than-expected NRP; a retry there also failed, wanting 4 GB; a restart from a late stage 1 file on a GTX 1050 Ti failed the stage 1 GCD; a Quadro 5000 starting from the late s1 file also failed the s1 GCD; a GTX 480 attempt completed stage 1 through "found no factor". Running through a collection of c and t files (5 total), representing late stage 1 and just after the stage 1 GCD, neither v0.20 nor v0.22 can carry the computation forward to completion on the GTX 1080 Ti. Code:
batch wrapper reports (re)launch at Tue 12/04/2018 9:39:35.52
reset count 0 of max 3
CUDAPm1 v0.20
------- DEVICE 0 -------
name                GeForce GTX 1080 Ti
Compatibility       6.1
clockRate (MHz)     1620
memClockRate (MHz)  5505
totalGlobalMem      zu
totalConstMem       zu
l2CacheSize         2883584
sharedMemPerBlock   zu
regsPerBlock        65536
warpSize            32
memPitch            zu
maxThreadsPerBlock  1024
maxThreadsPerMP     2048
multiProcessorCount 28
maxThreadsDim[3]    1024,1024,64
maxGridSize[3]      2147483647,65535,65535
textureAlignment    zu
deviceOverlap       1
CUDA reports 10988M of 11264M GPU memory free.
Using threads: norm1 32, mult 32, norm2 32.
Using up to 5285M GPU memory.
Selected B1=3250000, B2=77187500, 3.64% chance of finding a factor
Using B1 = 3215000 from savefile.
Continuing stage 1 from a partial result of M414000007 fft length = 23328K, iteration = 4625001
M414000007, 0x4f7c556075b4f7f3, n = 23328K, CUDAPm1 v0.20
Stage 1 complete, estimated total time = 63:37:29
batch wrapper reports (re)launch at Tue 12/04/2018 10:26:30.31
reset count 0 of max 3
CUDAPm1 v0.22
------- DEVICE 0 -------
name                GeForce GTX 1080 Ti
Compatibility       6.1
clockRate (MHz)     1620
memClockRate (MHz)  5505
totalGlobalMem      11811160064
totalConstMem       65536
l2CacheSize         2883584
sharedMemPerBlock   49152
regsPerBlock        65536
warpSize            32
memPitch            2147483647
maxThreadsPerBlock  1024
maxThreadsPerMP     2048
multiProcessorCount 28
maxThreadsDim[3]    1024,1024,64
maxGridSize[3]      2147483647,65535,65535
textureAlignment    512
deviceOverlap       1
CUDA reports 10988M of 11264M GPU memory free.
Using threads: norm1 512, mult 64, norm2 32.
Using up to 10752M GPU memory.
Selected B1=3990000, B2=93765000, 3.86% chance of finding a factor
Using B1 = 3215000 from savefile.
Continuing stage 1 from a partial result of M414000007 fft length = 23328K, iteration = 4625001
Iteration = 4625100, err = 0.5 >= 0.40, quitting. Estimated time spent so far: 63:33:25
batch wrapper reports exit at Tue 12/04/2018 10:28:09.65
#671
Random Account
Aug 2009
3644₈ Posts
Quote:
Code:
Pminus1=1,2,<exponent>,-1,100000000,1000000000,65
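For anyone reading along, my understanding of prime95's worktodo syntax (treat this annotated reading as a hedge, not gospel) is that the fields are k, b, n, c, then B1, B2, and the trial-factoring depth in bits:

```
# Pminus1=k,b,n,c,B1,B2[,how_far_factored]
#   k*b^n+c = 1*2^<exponent>-1, i.e. a Mersenne number
#   B1 = 100000000 (1e8), B2 = 1000000000 (1e9)
#   65 = exponent assumed trial-factored to 2^65
Pminus1=1,2,<exponent>,-1,100000000,1000000000,65
```

(The # lines are annotation only, not part of the worktodo entry.)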