mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing
Old 2019-06-11, 02:18   #716
GhettoChild
 
"Ghetto_Child"
Jul 2014
Montreal, QC, Canada

518 Posts
Advice please?

So I'm using a 4GB eVGA GTX 770 Classified (Kepler) on this Win7 HP 64-bit PC, currently set to default clock speeds. I was originally unable to run any version of CUDAPm1 built for CUDA higher than 5.5: I kept getting "the program can't start because api-ms-win-crt-runtime-l1-1-0.dll is missing", and the same message if I tried to launch GeForce Experience. I tried installing Service Pack 1, but that didn't help; waste of time and space. Then I installed Microsoft KB2999226. That seems to have solved it, and I can now run versions higher than CUDA 5.5.

So according to this GPU's specs, I should be able to run CUDA 10. For the most part CUDA 8 seems to run OK, but I can't understand why CUDA 10, in both CUDAPm1 v0.21 and v0.22, just fails right from the start. Attached are my lengthy screen output logs.
Attached Files
File Type: txt UL_cudapm1_CUDA_10_v021 Self Tests log.txt (62.5 KB, 159 views)
File Type: txt UL_CUDAPm1-0.22-cuda10 Self Tests log.txt (124.6 KB, 157 views)
File Type: txt UL_failed1 GeForceGTX770_fft.txt (3.0 KB, 157 views)
File Type: txt UL_failed1 v021 GeForceGTX770_threads.txt (31 Bytes, 149 views)
File Type: txt UL_failed2 GeForceGTX770_threads.txt (62 Bytes, 156 views)

Last fiddled with by GhettoChild on 2019-06-11 at 02:19
Old 2019-06-11, 04:14   #717
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2⁴×3×163 Posts

Code:
fft size = 30870K, ave time = 61.7556 msec, max-ave = 0.08239
C:/Users/Aaron/Documents/Visual Studio 2017/Projects/CUDAPm1/CUDAPm1/bench.cu(202) : cudaSafeCall() Runtime API error 30: unknown error.
Try splitting up your -cufftbench run. (Save the results of the first run under a different filename before making the second, or the second will overwrite the first.)
-cufftbench 1 16384
-cufftbench 16384 32768
I've never had to go that low, but gpu models vary. API error 30 seems to be associated with the Windows TDR (timeout detection and recovery) mechanism. You may want to try increasing the timeout period in the registry settings. Not using the system interactively during benchmarking is also recommended.
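For reference, a minimal sketch of raising the TDR timeout from an elevated command prompt. TdrDelay under the GraphicsDrivers key is the Microsoft-documented value; the 30-second figure here is only an example, a reboot is required afterward, and registry edits are at your own risk:

```bat
rem Raise the GPU timeout detection and recovery delay to 30 seconds (example value).
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 30 /f
```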

Code:
I:\GIMPS GPU Computing\CUDAPm1>cudapm1_win64_20181114_CUDA_10_v021.exe -threadbench 1 32768  2>&1 
CUDAPm1 v0.21
can't parse options
The CUDALucas syntax for threadbench does not work in CUDAPm1. Use
CUDAPm1 -cufftbench 4608 4608 2
for a threadbench at 4608K, 2 passes.
I usually wrap that in a batch file and a for loop, with only the fft lengths listed in the fft file, using parameter substitution.
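As a minimal sketch of that batch-file loop (fftlengths.txt is a hypothetical file holding one fft length, in K, per line, copied from the fft file):

```bat
rem fftlengths.txt is hypothetical: one fft length (in K) per line, taken from the fft file
for /F %%L in (fftlengths.txt) do cudapm1 -cufftbench %%L %%L 2 >>threadbench.log
```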

The zero values for benchmark timings are a problem.
The 0x0000000000000001 res64 values are a problem.

I see you're running CUDAPm1 v0.21. I have no experience with that version. But much will be similar to v0.20, which you could try. Or v0.22.

For some background info see https://www.mersenneforum.org/showthread.php?t=23389
If CUDA8 runs well on your setup, use it.

Last fiddled with by kriesel on 2019-06-11 at 04:57
Old 2019-06-11, 20:53   #718
GhettoChild
 
"Ghetto_Child"
Jul 2014
Montreal, QC, Canada

518 Posts

I've used an assortment of versions; the uploads are from both v0.21 and v0.22. v0.20 CUDA 5.5 never gave these problems running the identical scripted tests. I run the same script of work as a "Self Test" batch file on each version the first time, before I start new work. CUDA8 v0.22 seems to work, as it passes most of the tests; I'm only having an issue at Stage 2 of M120002191 at FFT 6912K, where it just crashes without status. No version of CUDA10 passes any tests; I either get residues of zeros ending with a 1, or an error code. CUDA10 v0.22 did complete the cufftbench once without issues from 1-32768: "totalavg: 488.4344 totallen: 8133237 totaltime: 16651.6445". I couldn't get it to pass again after that one. The drive these are running on does have less than 4GB free space, but it's an extra drive, not the main OS drive.

With your suggestion about splitting the cufftbench: this is all one batch file that I run. I don't know of a way to make the batch file pause between commands and wait for input to continue; besides, splitting the work into two batch files defeats the purpose of lengthy, unattended scripting.

Don't higher CUDA versions mean higher performance? Isn't CUDA10 faster than CUDA8? That's why I'm trying to get it to work. Also, although CUDA8 seems to work, it crashed at Stage 2 of M120002191 at FFT 6912K, while v0.20 CUDA 5.5 completed that exponent and larger ones without issue.

Last fiddled with by GhettoChild on 2019-06-11 at 21:00
Old 2019-06-11, 21:33   #719
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1111010010000₂ Posts

Quote:
Originally Posted by GhettoChild View Post
I'm only having an issue at Stage 2 M120002191 FFT 6912K it just crashes without status.
Sometimes a relaunch from that point (or two or three relaunches) will coax it to finish. It depends on getting different roundoff behavior from one attempt to the next. I've hit rough spots that simply would not run on certain cards. Even a different card of the same gpu model could run what one card couldn't; the gpu BIOS versions differed. Higher and lower exponents worked OK, but around 85M would not run to completion on a certain Quadro 2000. There was another case around 171M, as I recall. The author described the stage 2 silent halt as occurring due to excessive roundoff error. Sometimes a retry produces less roundoff error.
Quote:
With your suggestion about splitting the cufftbench, this is all a batch file that I run. I don't know of a way to make the batch file pause between commands for input to continue; as well splitting the work into two batch files equally defeats the purpose of lengthy-unattended scripting.
MODIFY the batch file, to split the cufftbench task into pieces. Instead of "cudapm1 -cufftbench 1 32768 2 >>logfile" in the batch file:
Code:
cudapm1 -cufftbench 1 16384 2 >>logfile
rename "(whatever) fft.txt" "(whatever) fft save.txt"
if errorlevel 1 goto exit
:: the preceding line detects a rename failure and bails out, preventing loss of the first cufftbench pass
cudapm1 -cufftbench 16384 32768 2 >>logfile
echo Whatever prompt you want for interactive action via a separate process, such as: merge the fft files via a text editor, then press a key here
pause

(threadbench portion etc. of the batch file)

:exit
(end of batch file)
https://www.computerhope.com/pausehlp.htm

Quote:
Doesn't higher CUDA suites mean higher performance? Isn't CUDA10 faster than CUDA8? That's why I'm trying to get it to work. Also despite CUDA8 seems to work, it crashed at Stage 2 M120002191 FFT 6912K but v0.20 CUDA5.5 completed that exponent and larger without issue.
No. If it won't run, CUDA10's performance is zero. I've not heard, that I recall, of any significant performance advantage to CUDA10 on older cards. CUDA10 is necessary for compatibility with some new cards that have a higher compute capability level than is supported in CUDA 8 or CUDA 9.x. A card that requires CUDA10 can't run on CUDA8. Continuing downward, CUDA8 or above is required for the GTX10xx family.

I've often seen CUDA levels higher than required for a card give LESS performance than older CUDA levels also supported on the same hardware. For example, CUDA 5 or 6 is generally faster than CUDA 8 in CUDALucas on a GTX480, which is a compute capability 2.0 card. Even NVIDIA makes no claims of performance improvements for the fft library in CUDA10.x, which is what counts for CUDAPm1 and CUDALucas performance; see https://developer.nvidia.com/cuda-toolkit/whatsnew. All that stuff they list as improved there? Not used in CUDALucas or CUDAPm1.
See also https://www.mersenneforum.org/showpo...47&postcount=8 and its attachment.

Last fiddled with by kriesel on 2019-06-11 at 22:21
Old 2019-06-11, 22:34   #720
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2⁴·3·163 Posts

I've also noticed that in some cases (small-ram gpus) a higher CUDA level means a slightly lower maximum benchmarkable fft length; perhaps a couple percent lower.
Old 2019-06-14, 22:24   #721
GhettoChild
 
"Ghetto_Child"
Jul 2014
Montreal, QC, Canada

518 Posts

I was able to complete v0.22 CUDA8 up to M200001187; iteration 32 had error 0.4074, so that's an acceptable test for me for now. Increasing the unused memory parameter from the default 100 to 300 allowed it to reach this far. The CUDA10 fft bench completed and came out slower than CUDA8, as you stated; however, all other test exponents and selftests resulted in residues of zeros ending in a 1.
Attached Files
File Type: txt CUDAPm1-0.22-cuda8 Self Tests log.txt (210.8 KB, 166 views)
File Type: txt CUDAPm1-0.22-cuda10 Self Tests log.txt (256.3 KB, 157 views)
Old 2019-07-05, 19:46   #722
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2⁴·3·163 Posts
Dual instances, 9% throughput gain on gtx1080Ti for DC-range exponent P-1

In a nutshell, a GTX1080 Ti gpu, with 11GB, goes underutilized in at least two ways with one CUDAPm1 v0.20 instance. During stage 2, memory usage is ~4.5GB. During stage 1 gcd and stage 2 gcd, gpu memory is still occupied but the gcd is done on a cpu core and the gpu cores sit idle. For p~48M, on my test system, that's ~10% of the time that the gpu cores are idle.

Initial testing indicates that running two instances with a slight stagger gives about 9.5% higher aggregate throughput. Presumably this is due to one instance making full use of the gpu cores while the other is waiting on the cpu core to perform a gcd. After each instance has completed a couple of exponents through both stages, the stagger appears stable. I've initiated logging in GPU-Z to check whether any gpu idle or low utilization percentage occurs. So far, in almost an hour of logging, there's only about 20 seconds of gpu core idle indication.

Peak gpu memory usage is ~8593MB when stage 2 of two instances coincide, indicating 3 instances probably would not fit without memory contention. Since cudapm1 queries available gpu ram at the beginning and determines bounds for later use based on that, and NRP values, it might run into insufficient-memory problems with 3 or more instances. I think the case for any potential incremental throughput from a third instance is weak.
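A back-of-envelope check of the memory arithmetic above, using only the figures quoted in this post (the 11 GB card size is taken as 11×1024 MB):

```python
# Rough check of how many stage 2 instances fit on the 11 GB card,
# using the observed ~8593 MB peak when two stage 2 phases coincide.
gpu_ram_mb = 11 * 1024                # GTX 1080 Ti
peak_two_instances_mb = 8593          # observed peak, two coinciding stage 2 runs

per_instance_mb = peak_two_instances_mb / 2   # ~4.2 GB per stage 2 instance
for n in (2, 3):
    need = n * per_instance_mb
    fits = need < gpu_ram_mb
    print(f"{n} instances need ~{need:.0f} MB: {'fits' if fits else 'does not fit'}")
```

Two instances fit with headroom; a third would exceed the card's memory, consistent with the conclusion above.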

(All testing done in Windows 7, dual-E5520-Xeon system. Effect would be proportionally smaller with a proportionally faster cpu core.)

Last fiddled with by kriesel on 2019-07-05 at 20:22
Old 2019-07-06, 04:34   #723
petrw1
1976 Toyota Corona years forever!
 
 
"Wayne"
Nov 2006
Saskatchewan, Canada

14CD₁₆ Posts

Quote:
Originally Posted by kriesel View Post
In a nutshell, a GTX1080 Ti gpu,
Did I miss it, or did you post the actual time it would take to complete a P-1?
For example, a current P-1 in the 85M range to recommended bounds?

Thanks
Old 2019-07-06, 13:02   #724
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2⁴×3×163 Posts

Quote:
Originally Posted by petrw1 View Post
Did I miss it, or did you post the actual time it would take to complete a P-1?
For example, a current P-1 in the 85M range to recommended bounds?

Thanks
On the gtx1080 Ti, at p~48M (DC wavefront), summing the stage 1 and stage 2 timings in the console output:
one instance: ~38:46 (baseline)
two instances: ~70:08 each, i.e. one result per ~35:04 in aggregate, ~10.56% faster than the single instance, after I adjusted the stagger between the two instances (by stopping one for two minutes, then resuming it) to eliminate a brief recurring 20-30 second period of gpu core idle.
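The quoted percentage follows from the two timings; a quick sketch of the arithmetic (mm:ss figures as given in this post):

```python
# Aggregate throughput of two staggered instances vs. one instance.
def to_seconds(mmss: str) -> int:
    m, s = mmss.split(":")
    return int(m) * 60 + int(s)

single = to_seconds("38:46")               # one instance, seconds per result
per_result_dual = to_seconds("70:08") / 2  # two instances, aggregate seconds per result

gain = single / per_result_dual - 1.0
print(f"~{gain:.2%} higher throughput")    # close to the ~10.56% quoted above
```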

For single-instance run times, nrp, bounds selected by the program etc, versus gpu model and exponent, see the attachments at https://www.mersenneforum.org/showpo...73&postcount=9 and the couple of posts preceding.

On the gtx1080 Ti, at 90M a single instance is about 2.5 hours; the run time scaling is about p^2.05.
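As a rough consistency check (not new data), extrapolating the ~38:46 measured at p~48M by the ~p^2.05 scaling lands near the 2.5-hour figure at 90M:

```python
# Extrapolate single-instance P-1 run time from p ~ 48M to p ~ 90M
# using the ~p^2.05 run-time scaling quoted above.
base_minutes = 38 + 46 / 60                      # ~38:46 at p ~ 48M
predicted_hours = base_minutes * (90 / 48) ** 2.05 / 60
print(f"~{predicted_hours:.1f} hours at 90M")    # in the ballpark of the 2.5 hours quoted
```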

Note, I don't think running multiple instances will work well at p>~61M on the 11GB GTX1080Ti because of stage 2 memory requirements. A 16GB gpu should be ok to p~89M.
Old 2019-07-06, 15:55   #725
masser
 
 
Jul 2003
Behind BB

2·7·11·13 Posts

Quote:
Originally Posted by kriesel View Post
... goes underutilized in at least two ways with one CUDAPm1 v0.20 instance...
Perhaps I missed something, but why v0.20, instead of v0.22?
Old 2019-07-06, 20:44   #726
ixfd64
Bemusing Prompter
 
 
"Danny"
Dec 2002
California

2504₁₀ Posts

I believe the 0.22 fork has some regressions.