mersenneforum.org CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW)

2019-04-12, 14:31   #2762
diep

Sep 2006
The Netherlands

1100100110₂ Posts

Quote:
 Originally Posted by VBCurtis What you call absurd, everyone else calls normal- from the manufacturers to gamers to enthusiasts doing computation on CUDA cards. The tone of your posts is that you have the One True Way, and everyone else is a fool; that makes you look the fool for being so adamant. I hope you're not this abrasively dogmatic in real life, too.
It is the standard because they have packed too much horsepower into a handful of square centimeters of air ventilation - but it is not normal if you look at it from a logical viewpoint.

It's just the way the PCI-e card clicks into the computer - derived from old standards dating back decades, when the cards you clicked into a computer didn't need much cooling. No one has yet thought about a better solution, in order to keep it backwards compatible so you can click it into something like an 80s PC - provided you have a newer motherboard inside.

That still doesn't make it a good solution.

The machines that produce these CPUs and GPUs - in Nvidia's case that will be TSMC, producing them on ASML machines - well, for over two decades now those ASML machines have preferred just under room temperature as the ideal chip temperature.

If those stuck in the past compliment you, you're doing something wrong.

2019-04-12, 15:23   #2763
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7352₁₀ Posts

Quote:
 Originally Posted by GhettoChild I've made this complaint in previous versions before and you've managed to code it back in again. It's quite an annoying error because it causes an immense loss of processing time and wastes utility bills. The program doesn't know which interval is most efficient to restart from upon round-off errors. It's using the last screen output iteration interval instead of taking whichever is the smaller iteration interval, or at least just the checkpoint iteration interval. My settings are to check the round-off on each iteration. In the example I'm posting, my checkpoint writes are set to every 100,000 iterations in the ini file, as is my screen output. However, I also pass the command line flags to set the screen output to 100,000 iterations, but I manually increased this value to 500,000 iterations using the interactive keys while the program was running. Wasteful results below: CUDALucas2.05.1-CUDA8.0-Windows-x64.exe
Code:
Y -- report_iter increased to 500000
|   Date    Time    |    Test Num    Iter    Residue    |    FFT   Error     ms/It     Time    |       ETA      Done   |
| Apr 10 16:07:21 | M89951461 27500000 ****************** | 4860K 0.33594 72.8459 10362.19s | 36:18:35:00 30.57% |
| Apr 10 22:58:20 | M89951461 28000000 ****************** | 4860K 0.33008 49.3178 24658.93s | 36:06:21:35 31.12% |
| Apr 11 05:49:13 | M89951461 28500000 ****************** | 4860K 0.34375 49.3050 24652.49s | 35:19:50:03 31.68% |
| Apr 11 12:37:13 | M89951461 29000000 ****************** | 4860K 0.32422 48.9598 24479.94s | 35:09:31:07 32.23% |
| Apr 11 19:29:07 | M89951461 29500000 ****************** | 4860K 0.34375 49.4265 24713.27s | 35:01:00:58 32.79% |
Round off error at iteration = 29891700, err = 0.39063 > 0.35, fft = 4860K.
Restarting from last checkpoint to see if the error is repeatable.
Using threads: square 64, splice 256.
Continuing M89951461 @ iteration 29500001 with fft length 4860K, 32.80% done
Round off error at iteration = 29609100, err = 0.35156 > 0.35, fft = 4860K.
The error persists. Trying a larger fft until the next checkpoint.
Using threads: square 64, splice 32.
Continuing M89951461 @ iteration 29500001 with fft length 5120K, 32.80% done
z
-- fft count 177
-- current fft 5120K
-- smallest fft for this exponent 4860K
-- largest fft for this exponent 6480K
-- square threads 64
-- splice threads 32
-- checkpoint interval 100000
-- report interval 500000
-- error check interval 100
-- error reset percent 85
-- error limit 40
-- polite flag 1
-- polite value 10
-- sleep flag 0
-- sleep value 100
-- 64 bit carry flag 0
-- save all checkpoints flag 0
-- device number 0
-- savefile folder savefiles
-- ini file CUDALucas.ini
-- input file worktodo.txt
-- results file results.txt
I think that by interactively increasing the screen output interval to 500k, you've also told it to write checkpoints only every 500k iterations. It can't restart from a checkpoint file more recent than the latest one that exists. I suggest you set the screen update interval to 100K or 50K, to get the maximum loss of iterations per resume down to around 1/2 to 1 hour.
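To put rough numbers on that (my own back-of-envelope figures, using the ~49.3 ms/iteration shown in the log above, not anything CUDALucas reports itself):

```python
# Rough estimate of time lost per round-off restart for a given checkpoint
# interval. Assumes errors land uniformly within an interval, so the average
# rollback is half the interval; ms_per_iter is taken from the log above.

def rollback_cost_minutes(checkpoint_interval, ms_per_iter=49.3):
    avg_lost_iters = checkpoint_interval / 2
    return avg_lost_iters * ms_per_iter / 1000 / 60

print(rollback_cost_minutes(500_000))  # ~205 min lost on average at 500K
print(rollback_cost_minutes(100_000))  # ~41 min at 100K
print(rollback_cost_minutes(50_000))   # ~21 min at 50K
```

So each resume at the 500K interval throws away roughly three hours of GPU time on average, versus well under an hour at 100K or 50K.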

Please update to v2.06 (May 5 2017 version). It contains checks for certain invalid interim residues, that v2.05.1 does not.

2019-04-12, 15:50   #2764
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2³×919 Posts

Quote:
 Originally Posted by diep Of course it is important to figure out what it is, as odds are high it is something simple - yet the important question is: is your card properly watercooled to room temperature? For nearly two decades now, the chips produced have performed best at or just under 19C. Deviation from that, so extreme cold or heat of say 40C+, will cause the chip to wear out - errors start to appear then - or in short, it is then clocked too high. You're probably running that GPU 24/24 or nearly so, and it just wasn't designed to do that with the default air-cooled body it has.
Manufacturer's spec is 95C max for the Titan Black. https://www.geforce.com/hardware/des...specifications
Various other models range 87 to 105C max spec.

All my gpus run air cooled. Some have wattage comparable to the Titan Black's 250W rating. Ambient temp around all these systems is higher than the 19C you recommend for gpu running temp. To cool my fleet to such a temperature would be cost prohibitive. To try to bring the chip temperature down to 19C might require circulating brine, and would cause condensation problems in the systems!

They run cooler in CUDALucas, at ~80% of rated watts, than in Mfaktc at ~100%. Most run 24/7/365. Some are thermally limiting. No gpu here is as cool as 40C, even at idle in a running system. Cumulative error rate is quite acceptable, comparable to prime95 rates. There was one GTX480 that developed bad memory errors and became reliably error prone; that was a factory-overclocked model bought used, and it was replaced. This is a summary of gpu-decades of run time experience so far.

Last fiddled with by kriesel on 2019-04-12 at 16:06

2019-04-12, 15:52   #2765
ATH
Einyen

Dec 2003
Denmark

6552₈ Posts

Quote:
 Originally Posted by GhettoChild CUDALucas2.05.1-CUDA8.0-Windows-x64.exe
Do NOT use version 2.05: it has known errors that will sometimes give bad results. You HAVE to use 2.06. Even though it is called "Beta", no one got around to removing that beta tag; it is the only good version.

2019-04-12, 15:59   #2766
ATH
Einyen

Dec 2003
Denmark

2×17×101 Posts

Quote:
 Originally Posted by diep The machines that produce these CPU's and GPU's - in case of Nvidia that'll be TSMC which is producing them using ASML machines - those ASML machines already for far over a decade or 2, they prefer just under room temperature as the ideal chip temperature.
If you ask everyone in GIMPS using GPU I bet 90%+ are running at 70C or more, and a good number of them are running at 80C or more.

Are you running a GPU with mfaktc or CUDALucas with water cooling? What temperature is it at? Can you back up your extreme claims?

I would love to run a GPU at room temperature, but it does not sound realistic. I have never tried water cooling myself, on either CPU or GPU, so I'm hesitant to try it, and definitely not on an old Titan Black; maybe in my next computer, whenever that will be.

Last fiddled with by ATH on 2019-04-12 at 16:01

2019-04-12, 17:10   #2767
GhettoChild

"Ghetto_Child"
Jul 2014

41 Posts

Quote:
 Originally Posted by kriesel I think that by increasing the screen output time to 500k intervals interactively, you've told it to only save checkpoint intervals every 500k intervals. It can't restart from a checkpoint file more recent than the latest that exists. I suggest you adjust screen update interval to 100K or 50K, to get maximum loss of iteration per resume down to around 1/2 to 1 hour. Please update to v2.06 (May 5 2017 version). It contains checks for certain invalid interim residues, that v2.05.1 does not.
I tested that theory; screen output has no effect on the checkpoint iteration or the "savepoint on exit". The cEXPONENT and tEXPONENT files still update according to the instructed values or events, independent of the screen output value. In other words, when I lowered screen output to 50,000, the checkpoint on the hard drive did not update, but the restart point on round-off error changed to intervals even smaller than the actual checkpoint value, which resulted in less reprocessing time but 10x more output lines printed on screen.

I have not tried version 2.06 yet because only a beta version is available, no full release, while 2.05.1 is a full release, not a beta.

Last fiddled with by GhettoChild on 2019-04-12 at 17:16

2019-04-12, 17:31   #2768
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2³·919 Posts

Quote:
 Originally Posted by GhettoChild I tested that theory; screen output has no effect on the checkpoint iteration or the "savepoint on exit". The cEXPONENT and tEXPONENT files still update according to the instructed values or events, independent of the screen output value. In other words, when I lowered screen output to 50,000, the checkpoint on the hard drive did not update, but the restart point on round-off error changed to intervals even smaller than the actual checkpoint value, which resulted in less reprocessing time but 10x more output lines printed on screen. I have not tried version 2.06 yet because only a beta version is available, no full release, while 2.05.1 is a full release, not a beta.
So what checkpoint interval are you specifying in cudalucas.ini or at the command line or interactively?
From cudalucas.ini:
Code:
# ErrorIterations tells how often the roundoff error is checked. Larger values
# give shorter iteration times, but introduce some uncertainty as to the actual
# maximum roundoff error that occurs during the test. Default is 100.
# ReportIterations is the same as the -x option; it determines how often
# screen output is written. Default is 10000.
# CheckpointIterations is the same as the -c option; it determines how often
# checkpoints are written. Default is 100000.
# Each of these values should be of the form k * 10^n with k = 1, 2, or 5.

ErrorIterations=100
ReportIterations=10000
CheckpointIterations=100000
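As an aside on that "k * 10^n with k = 1, 2, or 5" rule from the ini comments, here is a quick way to check whether a chosen interval satisfies it (my own helper sketch, not part of CUDALucas):

```python
# Check that an interval value is of the form k * 10^n with k in {1, 2, 5},
# per the rule documented in cudalucas.ini.

def is_valid_interval(v):
    if v <= 0:
        return False
    while v % 10 == 0:   # strip trailing powers of ten
        v //= 10
    return v in (1, 2, 5)

print(is_valid_interval(100000))  # True  (1 * 10^5)
print(is_valid_interval(50000))   # True  (5 * 10^4)
print(is_valid_interval(300000))  # False (k = 3 is not allowed)
```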

2019-04-12, 18:34   #2769
ATH
Einyen

Dec 2003
Denmark

2×17×101 Posts

I sent a message to Dubslow, flashjh and owftheevil and asked them to remove CUDALucas 2.03 and 2.05 from the site and to remove "Beta" from 2.06: https://sourceforge.net/projects/cudalucas/files/

Last fiddled with by ATH on 2019-04-12 at 18:39
2019-04-12, 19:39   #2770
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2³×919 Posts

Quote:
 Originally Posted by ATH I sent a message to Dubslow, flashjh and owftheevil and asked them to remove CUDALucas 2.03 and 2.05 from the site and to remove "Beta" from 2.06: https://sourceforge.net/projects/cudalucas/files/
As I recall, flashjh in response to https://www.mersenneforum.org/showpo...postcount=2708 and the next couple messages had added it to his queue. Can't find the post ATM.

2019-04-12, 20:55   #2771
GhettoChild

"Ghetto_Child"
Jul 2014

41₁₀ Posts

Quote:
 Originally Posted by kriesel So what checkpoint interval are you specifying in cudalucas.ini or at the command line or interactively? From cudalucas.ini:
Exactly what I posted: I have 100,000 set in the ini file for both screen output and checkpoint interval. When I run the command line I add "-x 100000", but I don't specify the -t flag at all since it's already in the ini file. After I'm satisfied with the program running, I use the "Y" input to increase the screen output to every 500,000. Then I just wait throughout the day until an error occurs. Same if I lower screen output using "y" down to 50,000. No matter what, it uses the screen output interval as the last checkpoint, instead of at least the checkpoint interval value, or whichever of the two is smaller.

2019-04-13, 05:21   #2772
GhettoChild

"Ghetto_Child"
Jul 2014
Montreal, QC, Canada

29₁₆ Posts

I think I just now saw the tEXPONENT and cEXPONENT files update in sync with screen output instead of the checkpoint interval this time; but it's hard to be certain. I'm going to try adding the -c command line flag with 100000 for the checkpoint value to see if this error occurs that way. After that I'll try setting screen output larger than checkpoint right in the ini file to begin with, and use no command line flags/switches, to see if the error also occurs.
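If the behavior really is as described, the restart choice being asked for could be sketched like this (hypothetical names for illustration only, not actual CUDALucas source):

```python
# Sketch of the desired restart behavior: on a round-off error, resume from
# the last iteration at which a checkpoint was actually written to disk,
# rather than from the last screen-report boundary.

def restart_iteration(error_iter, checkpoint_interval):
    # Last checkpoint boundary at or before the iteration where the error hit.
    return (error_iter // checkpoint_interval) * checkpoint_interval

# From the log earlier in the thread: an error at iteration 29,891,700 with
# 100K checkpoints should resume at 29,800,000, not at the 29,500,000
# report boundary the program actually used.
print(restart_iteration(29_891_700, 100_000))  # 29800000
```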

