#716
"Ghetto_Child"
Jul 2014
Montreal, QC, Canada
518 Posts
So I'm using a 4GB eVGA GTX 770 Classified (Kepler). I was originally unable to run any version of CUDAPm1 built against anything higher than CUDA 5.5 on this Win7 HP 64-bit PC, currently set to default clock speeds. I kept getting "the program can't start because api-ms-win-crt-runtime-l1-1-0.dll is missing"; same message if I tried to launch GeForce Experience. I tried installing Service Pack 1, but that didn't help; a waste of time and space. Then I installed Microsoft KB2999226. That seems to have solved it, and I can now run versions higher than CUDA 5.5.
According to this GPU's specs, I should be able to run CUDA 10. For the most part the CUDA 8 build seems to run OK, but I can't understand why CUDA 10, in both CUDAPm1 v0.21 and v0.22, fails right from the start. Attached are my lengthy screen output logs.
Last fiddled with by GhettoChild on 2019-06-11 at 02:19
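The missing api-ms-win-crt-runtime-l1-1-0.dll belongs to the Universal CRT that KB2999226 (or the Visual C++ 2015 redistributable) installs. For anyone hitting the same message, a minimal check, assuming the DLL's standard 64-bit Windows location:
Code:
rem check whether the Universal CRT DLL installed by KB2999226 is present
dir %windir%\System32\api-ms-win-crt-runtime-l1-1-0.dll
rem dir sets errorlevel 1 when the file is not found
if errorlevel 1 echo Universal CRT missing - install KB2999226 or the VC++ 2015 redistributable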
#717
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2⁴×3×163 Posts
Code:
fft size = 30870K, ave time = 61.7556 msec, max-ave = 0.08239
C:/Users/Aaron/Documents/Visual Studio 2017/Projects/CUDAPm1/CUDAPm1/bench.cu(202) : cudaSafeCall() Runtime API error 30: unknown error.
Try splitting the benchmark into two ranges, something like -cufftbench 1 16384 followed by -cufftbench 16384 32768. I've never had to go that low, but gpu models vary. API error 30 seems to be associated with the Windows TDR (timeout detection and recovery). You may want to try increasing the timeout period in the registry settings. Not using the system interactively during benchmarking is also recommended.
Code:
I:\GIMPS GPU Computing\CUDAPm1>cudapm1_win64_20181114_CUDA_10_v021.exe -threadbench 1 32768 2>&1
CUDAPm1 v0.21
can't parse options
Use CUDAPm1 -cufftbench 4608 4608 2 for a threadbench at 4608K, 2 passes. I usually wrap that in a batch file with a for loop, with only the fft lengths listed in the fft file, using parameter substitution. The zero values for benchmark timings are a problem. The 0x0000000000000001 res64 values are a problem. I see you're running CUDAPm1 v0.21; I have no experience with that version, but much will be similar to v0.20, which you could try. Or v0.22. For some background info see https://www.mersenneforum.org/showthread.php?t=23389 If CUDA8 runs well on your setup, use it.
Last fiddled with by kriesel on 2019-06-11 at 04:57
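If error 30 really is the TDR firing, the registry change referred to above can be sketched roughly as below; TdrDelay is the standard timeout value (in seconds) under the GraphicsDrivers key, the 60-second figure is only an example, and the change needs an elevated prompt plus a reboot to take effect:
Code:
rem raise the Windows TDR (timeout detection and recovery) limit to 60 seconds
rem run from an elevated command prompt; reboot for the change to take effect
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 60 /f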
#718
"Ghetto_Child"
Jul 2014
Montreal, QC, Canada
518 Posts
I've used an assortment of versions, and the uploads are from both v0.21 and v0.22. v0.20 CUDA 5.5 never gave these problems on the identical scripted tests. I run the same script of work as a "self test" batch file on each version the first time, before I start new work. CUDA8 v0.22 seems to work, as it passes most of the tests; I'm only having an issue at stage 2 of M120002191 (FFT 6912K), where it just crashes without status. No version of CUDA10 passes any tests; I either get residues of zeros ending with a 1, or an error code. CUDA10 v0.22 did complete the cufftbench once without issues from 1-32768: "totalavg: 488.4344 totallen: 8133237 totaltime: 16651.6445". I couldn't get it to pass again after that one. The drive these are running on does have less than 4GB of free space, but it's an extra drive, not the main OS drive.
Regarding your suggestion about splitting the cufftbench: this is all in a batch file that I run. I don't know of a way to make the batch file pause between commands for input to continue; as well, splitting the work into two batch files defeats the purpose of lengthy unattended scripting. Doesn't a higher CUDA level mean higher performance? Isn't CUDA10 faster than CUDA8? That's why I'm trying to get it to work. Also, although CUDA8 seems to work, it crashed at stage 2 of M120002191 (FFT 6912K), while v0.20 CUDA 5.5 completed that exponent and larger ones without issue.
Last fiddled with by GhettoChild on 2019-06-11 at 21:00
#719
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1111010010000₂ Posts
Quote:
Originally Posted by GhettoChild
I don't know of a way to make the batch file pause between commands for input to continue; as well, splitting the work into two batch files defeats the purpose of lengthy unattended scripting.
Code:
cudapm1 -cufftbench 1 16384 2 >>logfile
rename "(whatever) fft.txt" "(whatever) fft save.txt"
if errorlevel 1 goto exit
: preceding detects rename failure and bails, preventing loss of the first cufftbench pass
cudapm1 -cufftbench 16384 32768 2 >>logfile
echo whatever prompt you want, for interactive action via a separate process, such as: merge fft files via a text editor, then press a key here
pause
(threadbench portion etc. of the batch file)
:exit
(end of batch file)
https://www.computerhope.com/pausehlp.htm
Quote:
Originally Posted by GhettoChild
Doesn't a higher CUDA level mean higher performance? Isn't CUDA10 faster than CUDA8?
See also https://www.mersenneforum.org/showpo...47&postcount=8 and its attachment.
Last fiddled with by kriesel on 2019-06-11 at 22:21
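As for the "(threadbench portion etc. of the batch file)" placeholder above, a minimal sketch of the for loop with parameter substitution described earlier could look like this; the fft lengths listed are placeholders only and should be replaced by the lengths that actually appear in your generated fft file:
Code:
rem run a 2-pass thread benchmark at each fft length (in K) taken from the fft file
rem the lengths below are placeholders - substitute the ones in your own fft file
for %%L in (4608 5120 5760 6144 6912) do (
  cudapm1 -cufftbench %%L %%L 2 >>logfile
)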
#720
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2⁴·3·163 Posts
I've also noticed that in some cases (small-RAM gpus) a higher CUDA level means a slightly lower maximum benchmarkable fft length, perhaps a couple of percent lower.
#721
"Ghetto_Child"
Jul 2014
Montreal, QC, Canada
518 Posts
I was able to complete v0.22 CUDA8 up to M200001187; iteration 32 had error 0.4074, so that's an acceptable test for me for now. Increasing the unused memory parameter from the default 100 to 300 allowed it to reach this far. The CUDA10 fft bench completed and came out slower than CUDA8, like you stated; however, all other test exponents and selftests produced residues of zeros ending in a 1.
#722
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2⁴·3·163 Posts
In a nutshell, a GTX1080 Ti gpu, with 11GB, goes underutilized in at least two ways with one CUDAPm1 v0.20 instance. During stage 2, memory usage is ~4.5GB. During stage 1 gcd and stage 2 gcd, gpu memory is still occupied but the gcd is done on a cpu core and the gpu cores sit idle. For p~48M, on my test system, that's ~10% of the time that the gpu cores are idle.
Initial testing indicates that running two instances with a slight stagger gives about 9.5% higher aggregate throughput. Presumably this is because one instance makes full use of the gpu cores while the other is waiting on the cpu core to perform a gcd. After each instance has completed a couple of exponents through both stages, the stagger appears stable.
I've started logging in GPU-Z to check whether any gpu idle or low-utilization periods occur. So far, in almost an hour of logging, there's only about 20 seconds of gpu core idle indication. Peak gpu memory usage is ~8593MB when stage 2 of the two instances coincides, indicating three instances probably would not fit without memory contention. Since cudapm1 queries available gpu ram at the beginning and determines bounds for later use based on that and the NRP values, it might run into insufficient-memory problems with three or more instances. I think the case for any incremental throughput from a third instance is weak.
(All testing done in Windows 7 on a dual-E5520-Xeon system. The effect would be proportionally smaller with a proportionally faster cpu core.)
Last fiddled with by kriesel on 2019-07-05 at 20:22
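A minimal sketch of launching two staggered instances from a batch file, assuming each instance lives in its own folder with its own worktodo and ini files; the folder names, executable name, and the two-minute stagger are placeholders only:
Code:
rem launch the first CUDAPm1 instance from its own working directory
start "pm1-a" /D "C:\CUDAPm1\instance_a" cudapm1.exe
rem wait about two minutes so the two instances stay offset in their stages
timeout /t 120 /nobreak
rem launch the second instance from a separate directory
start "pm1-b" /D "C:\CUDAPm1\instance_b" cudapm1.exe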
#723
1976 Toyota Corona years forever!
"Wayne"
Nov 2006
Saskatchewan, Canada
14CD₁₆ Posts
#724
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2⁴×3×163 Posts
Quote:
One instance: ~38:46 baseline.
Two instances: ~70:08, throughput ~1/35:04, ~10.56% faster than a single instance, after I adjusted the stagger between the two instances (by stopping one for two minutes, then resuming it) to eliminate a brief, recurring 20-30 second period of idle gpu cores.
For single-instance run times, NRP, bounds selected by the program, etc., versus gpu model and exponent, see the attachments at https://www.mersenneforum.org/showpo...73&postcount=9 and the couple of posts preceding it. On the GTX1080 Ti, at 90M, a single instance takes about 2.5 hours; the run time scaling is about p^2.05.
Note: I don't think running multiple instances will work well at p > ~61M on the 11GB GTX1080 Ti because of stage 2 memory requirements. A 16GB gpu should be OK to p ~ 89M.
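As a rough consistency check on those figures, assuming the ~38:46 baseline is for p near 48M and the p^2.05 run-time scaling holds, scaling the 2.5-hour 90M figure back down gives approximately
2.5 h × (48/90)^2.05 ≈ 2.5 h × 0.28 ≈ 0.69 h ≈ 41 minutes,
which is in the same ballpark as the measured single-instance baseline.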
#725
Jul 2003
Behind BB
2·7·11·13 Posts
#726
Bemusing Prompter
"Danny"
Dec 2002
California
2504₁₀ Posts
I believe the 0.22 fork has some regressions.