mersenneforum.org  

Old 2019-03-20, 17:40   #2751
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

Quote:
Originally Posted by James Heinrich
What about in-between versions, like 9.0 and 9.1?
In my view, optional; I recall reports from one of the GPU apps' programmers that those versions were buggy. I think that applies to 7.0 too. Not sure about 5.0 or 6.0.
Old 2019-04-12, 05:54   #2752
ATH
Einyen
 
Dec 2003
Denmark

I used 3 different Titan Black cards for CUDALucas over the past 3+ years and never had a single bad result. Now I have 2 bad results in the last 2 weeks and at least 2 more, not yet verified, that I suspect are bad, and I am finding more errors now that I have started comparing residues with CPU runs.

Binaries self-compiled with CUDA 5.0 and CUDA 10.1, and a downloaded CUDA 8 binary, all have errors. I ran the long self-test (-r 1) several times with all of them and got not a single error. I also downloaded GPUMem Test 1.2 and ran it many times without any errors.

The card is not overheating and there are no high roundoff errors. Here is an example of a residue error compared to an EC2 CPU run (with ECC RAM). The residue at 27.9M is fine; the residues at 28M and beyond are bad, with no indication that anything went wrong:
Code:
|  Apr 11  04:34:16  |  M47743063  27400000  0xf08fddb36e736a8b  |  2592K  0.17920   1.6372  163.72s  |      9:20:09  57.39%  |
|  Apr 11  04:37:00  |  M47743063  27500000  0xb25497a24137fb90  |  2592K  0.17920   1.6394  163.94s  |      9:17:23  57.60%  |
|  Apr 11  04:39:44  |  M47743063  27600000  0x6eaf49afd9912e56  |  2592K  0.17920   1.6374  163.74s  |      9:14:37  57.80%  |
|  Apr 11  04:42:28  |  M47743063  27700000  0xc071fedbcffee5e7  |  2592K  0.17920   1.6395  163.95s  |      9:11:51  58.01%  |
|  Apr 11  04:45:11  |  M47743063  27800000  0xfbd811ee93cdeed6  |  2592K  0.17920   1.6369  163.69s  |      9:09:04  58.22%  |
|  Apr 11  04:47:55  |  M47743063  27900000  0x5d0a374a75323427  |  2592K  0.17920   1.6391  163.91s  |      9:06:18  58.43%  |
|  Apr 11  04:50:39  |  M47743063  28000000  0x177c2ce48730d4a6  |  2592K  0.17920   1.6373  163.73s  |      9:03:32  58.64%  |
|  Apr 11  04:53:23  |  M47743063  28100000  0x6f0e3b06882f59ef  |  2592K  0.17920   1.6392  163.92s  |      9:00:46  58.85%  |
|  Apr 11  04:56:07  |  M47743063  28200000  0x2da418d38e412bf0  |  2592K  0.17920   1.6376  163.76s  |      8:58:00  59.06%  |
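
For reference, a comparison like this can be scripted rather than done by eye. A minimal sketch in Python, assuming two logs in the CUDALucas format shown above; the filenames are hypothetical:
Code:
# Compare per-iteration residues from two CUDALucas-style logs and report
# the first iteration where they diverge.
import re

RESIDUE_LINE = re.compile(r'M(\d+)\s+(\d+)\s+(0x[0-9a-f]{16})')

def residues(path):
    """Map iteration -> 64-bit residue string for one log file."""
    out = {}
    with open(path) as f:
        for line in f:
            m = RESIDUE_LINE.search(line)
            if m:
                out[int(m.group(2))] = m.group(3)
    return out

gpu = residues('gpu_run.log')  # hypothetical filenames
cpu = residues('cpu_run.log')
for it in sorted(gpu.keys() & cpu.keys()):
    if gpu[it] != cpu[it]:
        print(f'First mismatch at iteration {it}: {gpu[it]} vs {cpu[it]}')
        break
else:
    print('All common iterations match.')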

When I recently installed CUDA 10.1 the driver was updated, so I suspected a bad driver version. I have now updated to driver 419.67, but errors still appear.

The errors I have found so far were all on 47M exponents at 2592K FFT, so it might be a problem at that particular FFT size. Now I'm running the same exponent at 2744K FFT and comparing residues.

The last thing I can think of is updating the driver again, to 425.31 which came out yesterday; the last driver update was only from 419.xx, I think, to 419.67, so this would be a bigger update. After that I'm out of ideas. Can anyone else think of something?

I'm afraid it is bad RAM on the card, even though GPUMem Test and the CUDALucas long tests are not finding anything.

Last fiddled with by ATH on 2019-04-12 at 06:00
Old 2019-04-12, 06:20   #2753
diep
 
Sep 2006
The Netherlands

Quote:
Originally Posted by ATH
[full post quoted above, snipped]
Of course it is important to figure out what it is, as the odds are high it is something simple. Yet the important question is: is your card properly watercooled to room temperature?

For nearly 2 decades now, the chips produced have performed best at or just under 19C. Deviation from that, whether extreme cold or heat (say 40C+), will cause the chip to wear out and errors start to appear - in short, it is then clocked too high for its temperature.

You're probably running that GPU 24/7, or nearly so, and it just wasn't designed to do that with the default air-cooled body it has.

edit: if you clock the chip lower, does it still produce any wrong results?

In that sense you can see every GPU as overclocked too much. Of course the Teslas should also suffer from this, but as an extra precaution they got ECC.

Your GPU does have CRC32 on the GDDR5, which is also very effective. So the first test you could carry out, if it isn't something simple, is whether you are getting CRC32 errors on the GDDR5. If all other causes are ruled out, the wires of the chip have burned themselves up over time. As simple as that.

edit2: the chip will also eat up to 20%+ more juice (electricity) when it is hot. This +20% extra power usage could already start to occur at 50C.
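
A minimal way to watch this claimed temperature/power relationship, assuming nvidia-smi is installed; temperature.gpu and power.draw are documented query fields, though some cards report power.draw as "[N/A]":
Code:
# Log GPU temperature vs. power draw once a minute to see whether power
# rises with temperature under a constant CUDALucas load.
import subprocess, time

while True:
    out = subprocess.run(
        ['nvidia-smi', '--query-gpu=temperature.gpu,power.draw',
         '--format=csv,noheader'],
        capture_output=True, text=True).stdout.strip()
    print(time.strftime('%H:%M:%S'), out)  # e.g. "04:34:16 76, 238.52 W"
    time.sleep(60)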

Last fiddled with by diep on 2019-04-12 at 06:27
Old 2019-04-12, 06:32   #2754
ATH
Einyen
 
Dec 2003
Denmark

No, it is not water cooled; it is an old Titan Black from 2014, and I'm not sure I could get water cooling for it even if I wanted to.

It is running at 76C, but I do not think that is the problem. I have been running the cards at 70C-82C for 3 years without a single bad result until now.


Of course it would be better to cool it to room temperature, but I think very few people have that good a setup, and most can run CUDALucas anyway.

Last fiddled with by ATH on 2019-04-12 at 06:34
Old 2019-04-12, 06:41   #2755
diep
 
Sep 2006
The Netherlands

Quote:
Originally Posted by ATH
[full post quoted above, snipped]
Oh, at this temperature the first thing you should consider is removing all the dust from the GPU's cooling body. Remove a few screws, click away the plastic shroud, remove all the dust, put it back, and try again.

I've got a Titan-Z here that is 2 of your chips on a single card :)
And I'm watercooling it, of course.

Yeah, there are excellent watercooling bodies for it. Add a cheap $15 pump, some tubing, and a radiator with a fan. The radiator is easy to find online.

The cooling bodies are sometimes not so cheap - and some cover just the GPU chip itself. I happen to have 2 of those left here, as I first intended to use one before finding online a good cooling body that covers the entire card.

Watercooling only the chip means you still need to blow massive amounts of air onto the rest of the card and regularly remove dust. So a cooling body that covers the entire card is the best to have.

GDDR5 - I didn't look up at what temperature it starts to get errors.
Most ECC RAM can handle a tad higher temperature than 'normal' RAM, which starts to give errors at around 80-85C.

So the easiest thing to check is CRC32 errors - of course AFTER removing all the dust from the card's cooling body.

edit: "start to give errors OVER TIME". So after a while.

Last fiddled with by diep on 2019-04-12 at 06:43
Old 2019-04-12, 07:17   #2756
ATH
Einyen
 
Dec 2003
Denmark

These cards run at 70C or more even when brand new and running CUDALucas, so 76C is not that high, and I do clean out the dust regularly. I'm not getting water cooling for this old card.

How do I check for CRC32 errors? I'm not coding CUDA applications myself.

Last fiddled with by ATH on 2019-04-12 at 07:18
Old 2019-04-12, 07:48   #2757
diep
 
Sep 2006
The Netherlands

Quote:
Originally Posted by ATH
[full post quoted above, snipped]
I consider 76C really high. It's too close to the error level, IMHO.
And it'll eat 30% or so more power.

As for measuring the CRC32 - good question. Every GDDR5 line has a CRC32.
There was a simple answer for how to do it, yet I FORGOT what it was. Maybe ask again on an Nvidia forum somewhere.
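
The closest check available from standard tools (an assumption on my part, not necessarily the answer diep forgot) is nvidia-smi's documented ECC report. A Titan Black has no ECC, so it will likely report N/A, and as far as I know the GDDR5 link CRC retries are handled in hardware and not exposed to software:
Code:
# Dump whatever memory-error counters the driver exposes.
# On ECC-capable cards (Teslas, some Quadros) this reports corrected and
# uncorrected error counts; consumer cards typically show N/A.
import subprocess

print(subprocess.run(['nvidia-smi', '-q', '-d', 'ECC'],
                     capture_output=True, text=True).stdout)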
Old 2019-04-12, 09:39   #2758
James Heinrich
 
"James Heinrich"
May 2004
ex-Northern Ontario

Quote:
Originally Posted by diep
... extreme cold or heat (say 40C+) will cause the chip to wear out
the chip will also eat up to 20%+ more juice (electricity) when it is hot. This +20% extra power usage could already start to occur at 50C.
Quote:
Originally Posted by diep
I consider 76C really high
I don't. Your previous claim of 40C borders on impossible: even at idle I don't expect any GPU to get much below 40C. Anything below 80C is perfectly fine in my world; it's only once the temperature gets to 90C that I become concerned. Lower is better of course, but not down to the temperatures you appear to be targeting, unless you leave your computer outside in a Canadian winter.

Last fiddled with by James Heinrich on 2019-04-12 at 09:42
Old 2019-04-12, 11:28   #2759
diep
 
Sep 2006
The Netherlands

Quote:
Originally Posted by James Heinrich
[full post quoted above, snipped]
You are trying to ventilate 350 watts of power through a few square centimeters using air cooling on this GPU. That is an utterly misguided way to do it. The result is absurdly high temperatures on the GPUs.

That eats massively more power to start with - easily 30% more than when the chip is cooled to room temperature.

Chips produced with ASML machines have simply worked like that for a while now.

With simple watercooling you get that down a lot, though not very close to room temperature - that's because I want the radiator fans to make very little noise.

I'm busy now with an upgrade of the watercooling there so that less water evaporates, as I use a huge aquarium to silence the (Eheim) water pump completely.

Will post a picture when it works - maybe sometime next week.
Old 2019-04-12, 14:05   #2760
VBCurtis
 
"Curtis"
Feb 2005
Riverside, CA

Quote:
Originally Posted by diep
The result is absurdly high temperatures on the GPUs.
What you call absurd, everyone else calls normal - from the manufacturers to gamers to enthusiasts doing computation on CUDA cards. The tone of your posts is that you have the One True Way, and everyone else is a fool; that makes you look the fool for being so adamant.

I hope you're not this abrasively dogmatic in real life, too.
Old 2019-04-12, 14:30   #2761
GhettoChild
 
"Ghetto_Child"
Jul 2014
Montreal, QC, Canada

Highly Annoying Medium-Level Bug Report

I've made this complaint about previous versions before, and you've managed to code it back in again. It's quite an annoying error because it causes an immense loss of processing time and wasted utility bills.

The program doesn't know which interval is most efficient to restart from after a round-off error. It uses the last screen-output iteration interval instead of taking whichever iteration interval is smaller, or at least the checkpoint iteration interval. My settings check the round-off on every iteration. In the example I'm posting, my checkpoint writes are set to every 100,000 iterations in the ini file, as is my screen output; I also pass command-line flags setting the screen output to 100,000 iterations, but I manually increased that value to 500,000 iterations with the interactive keys while the program was running. Wasteful results below:

CUDALucas2.05.1-CUDA8.0-Windows-x64.exe
Code:
Y
 -- report_iter increased to 500000

|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Apr 10  16:07:21  |  M89951461  27500000  ******************  |  4860K  0.33594  72.8459 10362.19s  |  36:18:35:00  30.57%  |
|  Apr 10  22:58:20  |  M89951461  28000000  ******************  |  4860K  0.33008  49.3178 24658.93s  |  36:06:21:35  31.12%  |
|  Apr 11  05:49:13  |  M89951461  28500000  ******************  |  4860K  0.34375  49.3050 24652.49s  |  35:19:50:03  31.68%  |
|  Apr 11  12:37:13  |  M89951461  29000000  ******************  |  4860K  0.32422  48.9598 24479.94s  |  35:09:31:07  32.23%  |
|  Apr 11  19:29:07  |  M89951461  29500000  ******************  |  4860K  0.34375  49.4265 24713.27s  |  35:01:00:58  32.79%  |
Round off error at iteration = 29891700, err = 0.39063 > 0.35, fft = 4860K.
Restarting from last checkpoint to see if the error is repeatable.

Using threads: square 64, splice 256.

Continuing M89951461 @ iteration 29500001 with fft length 4860K, 32.80% done

Round off error at iteration = 29609100, err = 0.35156 > 0.35, fft = 4860K.
The error persists.
Trying a larger fft until the next checkpoint.

Using threads: square 64, splice 32.

Continuing M89951461 @ iteration 29500001 with fft length 5120K, 32.80% done

z
  -- fft count                      177
  -- current fft                    5120K
  -- smallest fft for this exponent 4860K
  -- largest fft for this exponent  6480K
  -- square threads                 64
  -- splice threads                 32
  -- checkpoint interval            100000
  -- report interval                500000
  -- error check interval           100
  -- error reset percent            85
  -- error limit                    40
  -- polite flag                    1
  -- polite value                   10
  -- sleep flag                     0
  -- sleep value                    100
  -- 64 bit carry flag              0
  -- save all checkpoints flag      0
  -- device number                  0
  -- savefile folder                savefiles
  -- ini file                       CUDALucas.ini
  -- input file                     worktodo.txt
  -- results file                   results.txt
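
To illustrate the fix being requested, a minimal sketch of the restart rule (hypothetical names, not the actual CUDALucas source):
Code:
# On a roundoff error, resume from the most recent *written checkpoint*,
# which depends on the checkpoint interval, not on the (possibly much
# larger) screen-report interval.
def restart_iteration(error_iter, checkpoint_interval):
    # Round down to the last checkpoint boundary, +1 to continue from it.
    return (error_iter // checkpoint_interval) * checkpoint_interval + 1

# Values from the log above: error at iteration 29,891,700 with checkpoints
# every 100,000 -> should resume at 29,800,001, not at the report-interval
# boundary 29,500,001 that the program actually used.
print(restart_iteration(29_891_700, 100_000))  # 29800001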

Last fiddled with by GhettoChild on 2019-04-12 at 14:34 Reason: added error clarification