mersenneforum.org CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW)

2019-03-20, 17:40   #2751
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

23×919 Posts

Quote:
 Originally Posted by James Heinrich What about in-between versions, like 9.0 and 9.1 ?
In my view, optional; I recall reports by one of the GPU apps' programmers that they were buggy. I think that applies to 7.0 too. Not sure about 5.0 or 6.0.

2019-04-12, 05:54   #2752
ATH

Einyen
Dec 2003
Denmark

2·17·101 Posts

I used 3 different Titan Black cards for CUDALucas over the past 3+ years and never had a single bad result. Now I have had 2 bad results in the last 2 weeks, plus at least 2 more, not yet verified, that I suspect are bad, and I am finding more errors now that I have started comparing residues with CPU runs.

Binaries self-compiled with CUDA 5.0 and CUDA 10.1 and a downloaded CUDA 8 binary all produce errors. I ran the long self test -r 1 several times with all of them without a single error. I also downloaded GPUMem Test 1.2 and ran it many times without any errors. The card is not overheating and there are no high roundoff errors.

Here is an example of a residue error compared to an EC2 CPU run (with ECC RAM). The residue at 27.9M is fine and the residues at 28M and beyond are bad, with no indication that anything went wrong:

Code:
| Apr 11 04:34:16 | M47743063 27400000 0xf08fddb36e736a8b | 2592K 0.17920 1.6372 163.72s | 9:20:09 57.39% |
| Apr 11 04:37:00 | M47743063 27500000 0xb25497a24137fb90 | 2592K 0.17920 1.6394 163.94s | 9:17:23 57.60% |
| Apr 11 04:39:44 | M47743063 27600000 0x6eaf49afd9912e56 | 2592K 0.17920 1.6374 163.74s | 9:14:37 57.80% |
| Apr 11 04:42:28 | M47743063 27700000 0xc071fedbcffee5e7 | 2592K 0.17920 1.6395 163.95s | 9:11:51 58.01% |
| Apr 11 04:45:11 | M47743063 27800000 0xfbd811ee93cdeed6 | 2592K 0.17920 1.6369 163.69s | 9:09:04 58.22% |
| Apr 11 04:47:55 | M47743063 27900000 0x5d0a374a75323427 | 2592K 0.17920 1.6391 163.91s | 9:06:18 58.43% |
| Apr 11 04:50:39 | M47743063 28000000 0x177c2ce48730d4a6 | 2592K 0.17920 1.6373 163.73s | 9:03:32 58.64% |
| Apr 11 04:53:23 | M47743063 28100000 0x6f0e3b06882f59ef | 2592K 0.17920 1.6392 163.92s | 9:00:46 58.85% |
| Apr 11 04:56:07 | M47743063 28200000 0x2da418d38e412bf0 | 2592K 0.17920 1.6376 163.76s | 8:58:00 59.06% |

When I recently installed CUDA 10.1 the driver was updated, so I suspected a bad driver version. I have now updated to driver 419.67, but the errors still appear.

The errors I have found so far were all at 47M exponents with the 2592K FFT, so it might be a problem at that particular FFT size. Now I'm running the same exponent at 2744K FFT and comparing residues.

The last thing I can think of is to update the driver again to 425.31, which came out yesterday; the last driver update was only from 419.xx, I think, to 419.67, so this would be a bigger update. After that I'm out of ideas. Can anyone else think of something? I'm afraid it is bad RAM on the card, even though GPUMemtest and the CUDALucas long tests are not finding anything.

Last fiddled with by ATH on 2019-04-12 at 06:00
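The residue comparison described in the post above can be automated. The following is only a minimal sketch, not part of CUDALucas or any GIMPS tool: the function names `parse_residues` and `first_mismatch` are made up for illustration, and the regular expression assumes the progress-line layout shown in the Code block of that post. The reference value for iteration 28000000 is a placeholder, since the post does not give the CPU residues; only the 27.9M residue is known to match.

```python
import re

# Matches one progress line in the layout shown in the post, e.g.
# | Apr 11 04:47:55 | M47743063 27900000 0x5d0a374a75323427 | 2592K ...
LINE_RE = re.compile(
    r"\|\s*\w{3}\s+\d+\s+[\d:]+\s*\|\s*M(\d+)\s+(\d+)\s+(0x[0-9a-fA-F]{16})\s*\|"
)


def parse_residues(log_text):
    """Return {iteration: residue} for every progress line found in log_text."""
    residues = {}
    for match in LINE_RE.finditer(log_text):
        residues[int(match.group(2))] = match.group(3).lower()
    return residues


def first_mismatch(gpu, reference):
    """Return the lowest iteration present in both runs whose residues differ.

    Returns None when all shared iterations agree.
    """
    for iteration in sorted(set(gpu) & set(reference)):
        if gpu[iteration] != reference[iteration]:
            return iteration
    return None


# Two lines copied from the GPU log in the post.
GPU_LOG = """
| Apr 11 04:47:55 | M47743063 27900000 0x5d0a374a75323427 | 2592K 0.17920 1.6391 163.91s | 9:06:18 58.43% |
| Apr 11 04:50:39 | M47743063 28000000 0x177c2ce48730d4a6 | 2592K 0.17920 1.6373 163.73s | 9:03:32 58.64% |
"""

# Reference (CPU) residues. The 27.9M value equals the GPU one because the
# post says that residue is fine. The 28M value is a PLACEHOLDER, not the
# real CPU residue, which the post does not list.
REFERENCE = {
    27900000: "0x5d0a374a75323427",
    28000000: "0x0000000000000000",
}

if __name__ == "__main__":
    print(first_mismatch(parse_residues(GPU_LOG), REFERENCE))
```

Knowing the first diverging iteration narrows the fault to one report interval, which is the window to rerun when testing a different FFT size, driver or clock setting.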
2019-04-12, 06:20   #2753
diep

Sep 2006
The Netherlands

2×13×31 Posts

Of course it is important to figure out what it is, as the odds are high that it is something simple. Yet the important question is: is your card properly watercooled to room temperature?

For nearly 2 decades now, the chips produced have performed best at or just under 19C. Deviating from that, so extreme cold, or heat of say 40C+, will cause the chip to wear out. Errors start to appear then; in short, the chip is clocked too high at that point.

You're probably running that GPU 24/7 or close to it, and it just wasn't designed to do that with its stock air cooler.

edit: if you clock the chip lower, does it still give wrong results?

In that sense you can see every GPU as overclocked too much. Of course the Teslas should suffer from this too, but as an extra precaution they have ECC.

Your GPU does have CRC32 on the GDDR5, which is also very effective. So if it isn't something simple, the first test you could carry out is to check whether you get CRC32 errors on the GDDR5. If none of the other possibilities turns out to be the cause, the wires of the chip have burned themselves up over time. As simple as that.

edit2: the chip also will eat up to 20%+ more juice (electricity) when it is hot. This +20% extra power usage could already start to occur at 50C.

Last fiddled with by diep on 2019-04-12 at 06:27

2019-04-12, 06:32   #2754
ATH

Einyen
Dec 2003
Denmark

110101101010₂ Posts

No, it is not water cooled. It is an old Titan Black from 2014, and I'm not sure I could get water cooling for it even if I wanted to. It is running at 76C, but I do not think that is the problem: I have been running the cards at 70C-82C for 3 years without a single bad result until now. Of course it would be better to cool it to room temperature, but I think very few people have that good a setup, and most can run CUDALucas anyway.

Last fiddled with by ATH on 2019-04-12 at 06:34

2019-04-12, 06:41   #2755
diep

Sep 2006
The Netherlands

326₁₆ Posts

Quote:
 Originally Posted by ATH No, it is not water cooled, it is an old Titan Black from 2014, I'm not sure I could get water cooling for it even if I wanted. It is running at 76C, but I do not think that is the problem. I have been running the cards at 70C-82C for 3 years without a single bad result until now. Of course it would be better to cool it to room temperature, but I think very few people have that good of a setup, and most can run CUDALucas anyway.
Oh, at this temperature the first thing you should consider is removing all the dust from the GPU's cooler. Remove a few screws, unclip the plastic shroud, remove all the dust, put it back together and try again.

I've got a Titan-Z here, which is 2 times your chip on a single card :)
And I'm watercooling it of course.

Yeah, there are excellent water blocks for it. Add a cheap $15 pump, some tubing and a radiator with a fan. Radiators are easy to find online.

The water blocks are sometimes not so cheap. I happen to have 2 left here of the kind that covers only the GPU chip itself, as I first intended to use those before I found a good one online that covers the entire GPU.

Watercooling only the chip means you still need to blow a massive amount of air onto the rest of the GPU and regularly remove dust. So a water block that covers the entire card is the best to have.

GDDR5: I didn't look up at what temperature it starts to get errors.
Most ECC RAM can handle a tad higher temperature than 'normal' RAM, which starts to give errors at around 80-85C.

So the easiest thing to check is for CRC32 errors, of course AFTER removing all the dust from the card's cooler.

edit: "start to give errors OVER TIME". So after a while.

Last fiddled with by diep on 2019-04-12 at 06:43

2019-04-12, 07:17   #2756
ATH

Einyen
Dec 2003
Denmark

2·17·101 Posts

These cards are at least 70C when they are brand new and running CUDALucas, so 76C is not that high, and I do clean out dust regularly. I'm not getting water cooling for this old card.

How do I check for CRC32 errors? I'm not coding CUDA applications myself.

Last fiddled with by ATH on 2019-04-12 at 07:18

2019-04-12, 07:48   #2757
diep

Sep 2006
The Netherlands

2×13×31 Posts

Quote:
 Originally Posted by ATH These cards are at least 70 C when they are brand new and running CUDALucas, so 76C is not that high, and I do clean out dust regularly. I'm not getting water cooling for this old card. How do I check for CRC32 errors? I'm not coding CUDA applications myself.
I consider 76C really high. It's too close to the error level IMHO.
And it'll eat 30% more power or so.

As for measuring the CRC32 errors: good question. Every GDDR5 line has a CRC32.
There was a simple answer to how to do that, but I FORGOT what it was. Maybe ask again somewhere on the Nvidia forum.

2019-04-12, 09:39   #2758
James Heinrich

"James Heinrich"
May 2004
ex-Northern Ontario

3×23×59 Posts

Quote:
 Originally Posted by diep ... so extreme cold or hot say 40C+ will cause the chip to wear out ... the chip also will eat up to 20%+ more juice (electricity) when it is hot. This +20% extra power usage could already start to occur at 50C.
Quote:
 Originally Posted by diep I consider 76C really high
I don't. Your previous claim of 40C borders on impossible. Even idle I don't expect any GPU to get much below 40C. Anything below 80C is perfectly fine in my world. It's only once the temperature gets to 90C that I become concerned. Lower is better of course, but not to the temperatures you appear to be targeting, unless you leave your computer outside in a Canadian winter.

Last fiddled with by James Heinrich on 2019-04-12 at 09:42

2019-04-12, 11:28   #2759
diep

Sep 2006
The Netherlands

326₁₆ Posts

Quote:
 Originally Posted by James Heinrich I don't. Your previous claim of 40C borders on impossible. Even idle I don't expect any GPU to get much below 40C. Anything below 80C is perfectly fine in my world. It's only once the temperature gets to 90C that I become concerned. Lower is better of course, but not to the temperatures you appear to be targeting, unless you leave your computer outside in a Canadian winter.
You are trying to vent 350 watts of power through a few square centimeters using air cooling on this GPU. That's an utterly disturbed way to do it. The result of that is absurdly high temperatures on the GPUs.

That eats massively more power to start with. Easily 30% more power than when the chip is cooled to room temperature.

Chips produced with ASML machines have simply worked like that for a while now.

With simple watercooling you get that down a lot, though not very close to room temperature. That's because I want the radiator fans to make very little noise.

I'm busy now with an upgrade of the watercooling here so that less water evaporates, as I use a huge aquarium to silence the (Eheim) water pump completely.

Will post a picture when it works - maybe somewhere next week.

2019-04-12, 14:05   #2760
VBCurtis

"Curtis"
Feb 2005
Riverside, CA

12765₈ Posts

Quote:
 Originally Posted by diep The result of that is absurd high temperatures on the GPU's.
What you call absurd, everyone else calls normal, from the manufacturers to gamers to enthusiasts doing computation on CUDA cards. The tone of your posts is that you have the One True Way and everyone else is a fool; that makes you look the fool for being so adamant.

I hope you're not this abrasively dogmatic in real life, too.

2019-04-12, 14:30   #2761
GhettoChild

"Ghetto_Child"
Jul 2014
Montreal, QC, Canada

29₁₆ Posts

Highly Annoying Medium Level Bug Report

I've made this complaint about previous versions before, and you've managed to code it back in again. It's quite an annoying error because it causes an immense loss of processing time and wastes money on utility bills. The program doesn't know which interval is most efficient to restart from after a round-off error. It uses the last screen-output iteration interval instead of whichever iteration interval is smaller, or at least just the checkpoint iteration interval.

My settings are to check the round-off error on each iteration. In the example I'm posting, both my checkpoint writes and my screen output are set to every 100,000 iterations in the ini file. I also pass the command-line flag that sets the screen output to 100,000 iterations, but I manually increased this value to 500,000 iterations using the interactive keys while the program was running. Wasteful results below:

CUDALucas2.05.1-CUDA8.0-Windows-x64.exe

Code:
Y
-- report_iter increased to 500000
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Apr 10 16:07:21 | M89951461 27500000 ****************** | 4860K 0.33594 72.8459 10362.19s | 36:18:35:00 30.57% |
| Apr 10 22:58:20 | M89951461 28000000 ****************** | 4860K 0.33008 49.3178 24658.93s | 36:06:21:35 31.12% |
| Apr 11 05:49:13 | M89951461 28500000 ****************** | 4860K 0.34375 49.3050 24652.49s | 35:19:50:03 31.68% |
| Apr 11 12:37:13 | M89951461 29000000 ****************** | 4860K 0.32422 48.9598 24479.94s | 35:09:31:07 32.23% |
| Apr 11 19:29:07 | M89951461 29500000 ****************** | 4860K 0.34375 49.4265 24713.27s | 35:01:00:58 32.79% |
Round off error at iteration = 29891700, err = 0.39063 > 0.35, fft = 4860K.
Restarting from last checkpoint to see if the error is repeatable.
Using threads: square 64, splice 256.
Continuing M89951461 @ iteration 29500001 with fft length 4860K, 32.80% done
Round off error at iteration = 29609100, err = 0.35156 > 0.35, fft = 4860K.
The error persists.
Trying a larger fft until the next checkpoint.
Using threads: square 64, splice 32.
Continuing M89951461 @ iteration 29500001 with fft length 5120K, 32.80% done
z
-- fft count 177
-- current fft 5120K
-- smallest fft for this exponent 4860K
-- largest fft for this exponent 6480K
-- square threads 64
-- splice threads 32
-- checkpoint interval 100000
-- report interval 500000
-- error check interval 100
-- error reset percent 85
-- error limit 40
-- polite flag 1
-- polite value 10
-- sleep flag 0
-- sleep value 100
-- 64 bit carry flag 0
-- save all checkpoints flag 0
-- device number 0
-- savefile folder savefiles
-- ini file CUDALucas.ini
-- input file worktodo.txt
-- results file results.txt

Last fiddled with by GhettoChild on 2019-04-12 at 14:34 Reason: added error clarification
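The cost of the behaviour reported above can be put into numbers. This is a hedged sketch only, not CUDALucas source code: the function names are hypothetical, and it assumes that a usable checkpoint file exists at every multiple of the checkpoint interval, which the post implies but the log does not prove. The inputs (error at iteration 29891700, report interval 500000, checkpoint interval 100000, about 49.4265 ms/It) are taken from the log in the post.

```python
def restart_from_report(error_iter, report_interval):
    """Restart point the post describes: the last screen-report boundary."""
    return (error_iter // report_interval) * report_interval


def restart_from_checkpoint(error_iter, checkpoint_interval):
    """Restart point the post asks for: the last checkpoint boundary.

    Assumes a checkpoint was actually saved at that iteration.
    """
    return (error_iter // checkpoint_interval) * checkpoint_interval


def redo_hours(error_iter, restart_iter, ms_per_iter):
    """Hours spent recomputing the iterations between restart and error."""
    return (error_iter - restart_iter) * ms_per_iter / 3_600_000


if __name__ == "__main__":
    error_iter = 29_891_700   # from "Round off error at iteration = 29891700"
    ms_per_iter = 49.4265     # ms/It column of the last report line

    from_report = restart_from_report(error_iter, 500_000)
    from_checkpoint = restart_from_checkpoint(error_iter, 100_000)

    print(from_report, redo_hours(error_iter, from_report, ms_per_iter))
    print(from_checkpoint, redo_hours(error_iter, from_checkpoint, ms_per_iter))
```

Under these assumptions the report-based restart repeats 391,700 iterations (roughly 5.4 hours at this speed), while a checkpoint-based restart would repeat 91,700 iterations (roughly 1.3 hours).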


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
