#2597
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,419 Posts

Quote:
Last fiddled with by kriesel on 2017-04-19 at 03:15 Reason: add empirical data
#2598 |
"Oliver"
Mar 2005
Germany
111110 Posts
OK, got my hands on another set of P100s, this time the higher-clocked Tesla P100-SXM2-16GB. Compared to the Tesla P100-PCIE-16GB, these babies have a 300 W TDP (instead of 250 W) and somewhat higher base and boost clock rates. Memory bandwidth is exactly the same... and CUDALucas performance is too! So it seems CUDALucas is completely memory-bandwidth bound on the P100!
Oliver
#2599 |
Random Account
Aug 2009
22·3·163 Posts
2.06 Beta runs very well on my hardware. The only difference I see is that the round-off error values are a bit higher. With 2.05 they stayed around 0.05; with the Beta they are running 0.62 to 0.75. I don't know if this is worth mentioning, but I guess it won't hurt.
Thanks!
#2600
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,419 Posts
Quote:
#2601 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,419 Posts
Correction: CUDAPm1 terminates if the round-off error exceeds 0.40; CUDALucas changes fft length and resumes from the last checkpoint if the round-off error exceeds 0.35.
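The two behaviors described above can be sketched as follows. This is only an illustration of the thresholds, not the actual CUDALucas or CUDAPm1 source; the function names and return values are hypothetical.

```python
# Hypothetical sketch of the round-off handling described above.
# CUDAPm1: terminate when the round-off error exceeds 0.40.
# CUDALucas: grow the fft length and resume from the last checkpoint
# when the round-off error exceeds 0.35.

CUDAPM1_LIMIT = 0.40
CUDALUCAS_LIMIT = 0.35

def cudapm1_check(roundoff):
    """Terminate the run once the error passes the hard limit."""
    if roundoff > CUDAPM1_LIMIT:
        return "terminate"
    return "continue"

def cudalucas_check(roundoff, fft_lengths, current_fft):
    """Step up to the next available fft length and restart from checkpoint."""
    if roundoff <= CUDALUCAS_LIMIT:
        return "continue", current_fft
    bigger = [f for f in fft_lengths if f > current_fft]
    if not bigger:
        return "terminate", current_fft  # nothing larger left to try
    return "resume_from_checkpoint", min(bigger)
```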
#2602
Random Account
Aug 2009
22·3·163 Posts
Quote:
Edit: I just noticed a serious error in my post about the Beta. The round-off error values should be 0.062 to 0.075. I fudged the decimal point. Sorry!
Last fiddled with by storm5510 on 2017-05-09 at 04:20 Reason: Correction
#2603 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,419 Posts
Feature request: device confirmation
There is a chain of events that can lead to multiple instances of CUDALucas (or a mix with CUDAPm1, mfaktc, etc.), launched with different -d numbers, unintentionally running on the same physical GPU at the same time, with considerable slowdown from sharing one device and perhaps multiplexing overhead. It can also have other unintended effects, depending on the cards' differences in speed, memory capacity, and reliability, and on the types of runs. This sequence has been observed repeatedly on v2.05.1, and I believe it could affect all currently available versions that support multiple device numbers on multi-GPU systems. I have not detected incorrect Lucas-Lehmer test results produced by this unintentional GPU sharing or switching, but I haven't ruled that out either; I believe it is very likely to produce incorrect fft or threads benchmark results.

Please consider adding device confirmations, such as GPU model, bus ID, or BIOS version, as command line options. This would both help identify when the events occur, so they can be addressed, and help prevent or reduce the negative impact when the event chain occurs. More detail follows.

Part one, status quo
--------------------
Example system, Condorette: 3 GPUs, all running CUDALucas, mixed versions; Windows 7 64-bit, current on updates; NVIDIA driver 378.66:
GTX 1050 Ti   v2.06beta  -d 0
Quadro 2000   v2.05.1    -d 1
Quadro 2000   v2.05.1    -d 2
Logged Quadro 2000 timings went to roughly double duration (half speed) when two instances ran on one Quadro 2000, days after the last system change: from 35-37 ms/iter to ~76 ms/iter on the same exponents, persisting for weeks until system shutdown.
BIOS versions and bus numbers of the cards, as reported by GPU-Z instances, differ as follows:
A  -d 0  GTX 1050 Ti  86.07.22.00.50  Bus 40 Device 0  (furthest from the CPUs physically)
B  -d 1  Quadro 2000  70.06.0D.00.02  Bus 15 Device 0  (middle card)
C  -d 2  Quadro 2000  70.06.31.02.02  Bus 28 Device 0  (closest to the CPUs physically)

Example without the requested options (in separate directories, run inside batch files):
a) CUDALucas2.06beta-CUDA6.5-Windows-win32.exe -d 0 >>cl.txt
b) CUDALucas2.05.1-CUDA6.5-Windows-win32.exe -d 1 >>cl.txt
c) CUDALucas2.05.1-CUDA6.5-Windows-win32.exe -d 2 >>cl.txt

At startup all cards are cool and the above example is fine. But if or when card B overheats and shuts down in self-defense, a driver timeout is detected and Windows attempts to restart it. If the GPU is still too hot, since it is being heated by the other cards nearby, it fails to restart. (In my experience, the GPU generally does not restart until the workload on one or both neighbors is halted so things cool down, and the system is restarted. Rarely, if I recall correctly, the GPU is restartable hours later.)

So now instances a and c are running, and the corresponding GPU-Z instances can read their respective sensors, but the sensor data and actual clock-rate data for B are no longer available. Instance b of CUDALucas detects an issue and terminates. The batch file it was launched from launches a new executable using -d 1 again; since GPU B is now invisible, -d 1 now apparently maps to GPU C according to Windows. Meanwhile instance c of CUDALucas is still running successfully on GPU C, so instances b and c timeshare GPU C, with some performance loss. If instance c is later stopped and relaunched, it fails to find a device 2, prints the usual error message, and halts, leaving instance b the full use of GPU C. I have confirmed this chain of events by observing logged iteration-timing changes after manual instance halts and restarts.
Changes in clock rates in the GPU-Z instances corresponding to each physical GPU also confirm it. In the case of a Quadro 2000, shutdown occurs after GPU-Z displays a GPU temperature of 98 C. Stated temperature limits vary by GPU model: GTX 480, 105 C; GTX 1070, 94 C; GTX 1050 Ti, 97 C. (I've found that Quadro 2000 and 4000 temperature limits are not stated in NVIDIA spec sheets.)

If the highest-numbered card overheats and drops out, device remapping does not occur. If any other card does, multiple cards can remap to lower device numbers; for example, in a 4-GPU system, if -d 0 drops out, all the others move down by one for new launches.

The effect of a card dropping out varies. In one system I have a high-reliability GPU as -d 0 running CUDALucas and a GPU of low default reliability (it needs to be clocked slower than its default rates) as -d 1 doing P-1. If the high-reliability one dropped out, and the iffy GPU had restarted at a higher clock rate that produces errors, a restart of CUDALucas would put the continuation of a lengthy LL test on the iffy GPU and cause bad residues, which I want to avoid. (Tools like MSI Afterburner or EVGA Precision XOC can adjust the clock rates, but I've observed that the rates don't stay set, eventually resetting to the higher value that re-enables memory errors. I have not yet tried modifying and flashing the GPU BIOS to cap the rates lower.)

When remapping puts a run on a different GPU model with less memory, it might fail to execute at the larger fft lengths. It could cause real havoc during benchmarking runs: timings from the wrong GPU model, some fft benchmark timings double what they should be due to accidentally sharing a GPU for part of a run, or a run failing to complete because of resource limits such as memory size differing between models.
Part two, proposed feature additions
------------------------------------
Example with the requested options added, and possible syntax:
CUDALucas2.06beta-CUDA6.5-Windows-win32.exe -d 0 -m "GTX 1050 Ti" -p "Bus 40 Device 0" -b 86.07.22.00.50 -warn >>cl.txt
CUDALucas2.06beta-CUDA6.5-Windows-win32.exe -d 1 -m "Quadro 2000" -p "Bus 15 Device 0" -b 70.06.0D.00.02 -halt >>cl.txt
CUDALucas2.06beta-CUDA6.5-Windows-win32.exe -d 2 -m "Quadro 2000" -p "Bus 28 Device 0" -b 70.06.31.02.02 >>cl.txt

Here:
-m is the GPU model identifier (as CUDALucas logs it in -info output; these may not be unique within a system, and some are not on mine)
-p is the PCI bus identifier (I think these are unique within one system)
-b is the GPU BIOS version identifier (I haven't seen a case where these match within a system, but it might happen)

Ideally, -m, -p, and -b could be present in any permutation (none, any one, any two, or all three, in any order). The -warn or -halt options could be limited to occur only following whatever non-null selection of -b, -m, -p is used.

Note, BIOS versions can be very different or very similar. Examples I've seen include:
86.07.22.00.50   (all 5 fields differ from the next two; 8 of 10 digits differ)
70.06.31.02.01
70.06.31.02.02   (rightmost field differs by one)

With these options, the executable queries the hardware selected by -d for matches to the expected parameters specified for model, bus ID, and/or BIOS version (preferably unique within the system). It prints to stdout the confirmation options specified and the responses obtained. On any mismatch it warns, and goes ahead computing anyway, if -warn is specified; it halts if -halt is specified. If neither -warn nor -halt is specified, it warns of the mismatch on stdout and proceeds to execute (default -warn). The current behavior is equivalent to don't detect, don't warn. I suppose a -nowarn option could also be included, but I don't see much utility in that. (end)
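The proposed -m/-p/-b matching could look roughly like this. The properties dict stands in for whatever the CUDA runtime would report for the device chosen by -d; the function name, dict keys, and -warn/-halt handling here are all assumptions for illustration, not CUDALucas code.

```python
# Sketch of the proposed device-confirmation check (-m/-p/-b with -warn/-halt).
# `props` stands in for the runtime-reported properties of device -d;
# names and structure are illustrative, not CUDALucas's actual code.

import sys

def confirm_device(props, expect_model=None, expect_bus=None,
                   expect_bios=None, on_mismatch="warn"):
    """Compare queried device properties against the expected -m/-p/-b values.

    Any combination of the three checks may be given (none, one, two, or all).
    Returns True if every specified check matches; on a mismatch, warns and
    continues (the proposed default) or halts, depending on on_mismatch.
    """
    checks = [("model", expect_model, props.get("model")),
              ("bus",   expect_bus,  props.get("bus")),
              ("bios",  expect_bios, props.get("bios"))]
    mismatches = [(name, want, got) for name, want, got in checks
                  if want is not None and want != got]
    for name, want, got in mismatches:
        print(f"device {name} mismatch: expected {want!r}, found {got!r}",
              file=sys.stderr)
    if mismatches and on_mismatch == "halt":
        raise SystemExit("device confirmation failed; halting")
    return not mismatches

# Example: GPU B dropped out, so -d 1 now actually maps to GPU C.
gpu_c = {"model": "Quadro 2000", "bus": "Bus 28 Device 0",
         "bios": "70.06.31.02.02"}
ok = confirm_device(gpu_c, expect_model="Quadro 2000",
                    expect_bus="Bus 15 Device 0",
                    expect_bios="70.06.0D.00.02")  # -warn behavior: False
```

Note that the model check alone would not catch the remapping in the two-Quadro-2000 case above; only the bus or BIOS check distinguishes them, which is why the post suggests allowing any combination.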
#2604 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,419 Posts
Ugh. Overnight, a CUDALucas v2.05.1 run went from fine to pointless just over halfway through. This was not the system or GPU that was overheating previously.
| Jun 04 23:44:45 | M80409907 42500000 0x1b919bf0d5dc5667 | 4320K 0.29297 37.1880 743.76s | 19:19:45:29 52.85% |
| Jun 04 23:57:09 | M80409907 42520000 0x26de355425603463 | 4320K 0.28125 37.2359 744.71s | 19:19:25:02 52.87% |
| Jun 05 00:09:28 | M80409907 42540000 0x0000000000000002 | 4320K 0.28125 36.9652 739.30s | 19:19:04:25 52.90% |
| Jun 05 00:20:54 | M80409907 42560000 0x0000000000000002 | 4320K 0.23906 34.3121 686.24s | 19:18:42:01 52.92% |
... and so on until manual detection around 10 AM. About three GPU-weeks lost. System backups failed.

I now have the savefiles ini parameter set to one, the executable switched to 2.06beta, and the run relaunched. V2.05 would continue to compute meaningless 0x02 residues at full fft length. The April 18 beta computes 0x02 residues until the next checkpoint, then terminates after detecting the illegal residue; the batch-file wrapper relaunches it, and the cycle repeats. The May 5 build of 2.06beta does the same as the April 18 build. So I conclude that an illegal residue and bad c and t files produced by v2.05.1 are detected, but not recovered from, by v2.06beta through the April 18 and May 5 builds. I manually renamed the c and t files to prevent restarting from them. After a bit of running, the savefiles directory looks like:

Directory of C:\Users\Ken\My Documents\cl-quadro2000-2\savefiles
06/05/2017  11:03 AM    .
06/05/2017  11:03 AM    ..
06/05/2017  10:38 AM             0 .empty.txt
06/05/2017  11:59 AM    10,051,280 s80409907.100000.fda890b7e00cf3cd.cls
06/05/2017  11:03 AM    10,051,280 s80409907.14136.73bcfcd608d670c6.cls
06/05/2017  10:38 AM    10,051,280 s80409907.43616986.0000000000000002.cls
06/05/2017  10:52 AM    10,051,280 s80409907.43618538.0000000000000002.cls
               5 File(s)     40,205,120 bytes

It looks like the naming convention for savefiles is s[exponent].[iteration].[hex 64-bit residue].cls. What's the procedure for actually using a valid savefile if needed? Halt CUDALucas, delete or rename or move the c and t files, move or copy an s file and rename it to be a c or t file, then relaunch CUDALucas?
The readme file covers the creation of savefiles but seems to be silent about using them. (end)
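The restore procedure guessed at above (halt, set aside the c and t files, rename a known-good savefile into place, relaunch) could be scripted roughly as below. The s[exponent].[iteration].[residue].cls convention is taken from the directory listing above, but the checkpoint filenames (c/t followed by the exponent) are an assumption to verify against your own working directory before trusting this.

```python
# Rough sketch of the savefile-restore procedure hypothesized above:
# pick the newest savefile whose residue is not the bogus 0x...0002 value,
# set aside the existing checkpoint (c/t) files, and copy the savefile
# into place as the checkpoint. Filename conventions are assumptions.

import os
import shutil

BAD_RESIDUE = "0000000000000002"

def pick_savefile(savedir, exponent):
    """Return (iteration, name) of the highest-iteration .cls savefile
    whose residue looks plausible, or None if there is none."""
    best = None
    for name in os.listdir(savedir):
        parts = name.split(".")
        # expected form: s<exponent>.<iteration>.<hex residue>.cls
        if len(parts) == 4 and parts[0] == f"s{exponent}" and parts[3] == "cls":
            iteration, residue = int(parts[1]), parts[2]
            if residue != BAD_RESIDUE and (best is None or iteration > best[0]):
                best = (iteration, name)
    return best

def restore(savedir, workdir, exponent):
    """Set aside c/t files and install the chosen savefile as the checkpoint."""
    best = pick_savefile(savedir, exponent)
    if best is None:
        raise FileNotFoundError("no usable savefile found")
    for prefix in ("c", "t"):
        ckpt = os.path.join(workdir, f"{prefix}{exponent}")
        if os.path.exists(ckpt):
            os.rename(ckpt, ckpt + ".bad")   # keep the bad file for inspection
    shutil.copy(os.path.join(savedir, best[1]),
                os.path.join(workdir, f"c{exponent}"))
    return best
```

Run with CUDALucas halted; on relaunch it would then continue from the restored checkpoint, losing only the iterations since that savefile was written.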
#2605 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
124538 Posts
If a CUDALucas Lucas-Lehmer test in progress is moved from a slow GPU to a faster GPU, the total required time apparently is not recalculated for the second GPU, so the ETA does not match what a calculation from the faster GPU's iteration speed would indicate. Excerpts from an example run follow, with some comments interspersed.
Continuing M80443463 @ iteration 39520001 with fft length 4320K, 49.13% done
|  Date    Time   | Test Num   Iter      Residue          |  FFT   Error   ms/It   Time   |     ETA     Done   |
| Jun 22 11:39:57 | M80443463 39540000 0xc5ddef5105433496 | 4320K 0.34375 13.6289 272.56s | 24:22:45:04 49.15% |
| Jun 22 11:44:30 | M80443463 39560000 0xb1e6ea9c33c1a3fe | 4320K 0.32813 13.6285 272.57s | 24:22:14:02 49.17% |
| Jun 22 11:49:03 | M80443463 39580000 0x43b2046313d7a7b3 | 4320K 0.31250 13.6283 272.56s | 24:21:43:02 49.20% |
| Jun 22 11:53:36 | M80443463 39600000 0x7fc26f721831079a | 4320K 0.29688 13.6281 272.56s | 24:21:12:04 49.22% |
| Jun 22 11:58:08 | M80443463 39620000 0xd6bab5c2754eac22 | 4320K 0.30859 13.6589 273.17s | 24:20:41:08 49.25% |
| Jun 22 12:02:41 | M80443463 39640000 0x66aca7edf49b4250 | 4320K 0.31250 13.6275 272.55s | 24:20:10:13 49.27% |
| Jun 22 12:07:13 | M80443463 39660000 0xe2578942679761ce | 4320K 0.30884 13.6281 272.56s | 24:19:39:19 49.30% |
| Jun 22 12:11:46 | M80443463 39680000 0x751514eb98ae8498 | 4320K 0.29688 13.6285 272.57s | 24:19:08:28 49.32% |
| Jun 22 12:16:19 | M80443463 39700000 0x3af1e6f8d1d5b626 | 4320K 0.31250 13.6287 272.57s | 24:18:37:37 49.35% |
| Jun 22 12:20:52 | M80443463 39720000 0xddb86eb30a41f069 | 4320K 0.30078 13.6584 273.16s | 24:18:06:49 49.37% |
| Jun 22 12:25:24 | M80443463 39740000 0xb0f8e032c204279b | 4320K 0.32031 13.6276 272.55s | 24:17:36:02 49.40% |

0.0136276 seconds times 80443463 times (1 - 0.494) = about 6.42 days, not 24.71 days. The ETA is dropping about 30.8 minutes per roughly 5.5 minutes of run time, a ratio of about 5.57. This example occurred when moving an exponent from a Quadro 2000 to a GTX 1050 Ti. The version is CUDALucas v2.06beta, 32-bit Windows build, compiled May 5 2017 @ 12:33:52.

Hours later, after fft size changes, it adjusts the ETA:
| Jun 22 17:55:21 | M80443463 41180000 0xa62927d07ad536a8 | 4320K 0.31250 13.6277 272.55s | 23:05:46:37 51.19% |
Round off error at iteration = 41191900, err = 0.35938 > 0.35, fft = 4320K.
Restarting from last checkpoint to see if the error is repeatable.
Using threads: square 32, splice 256.
Continuing M80443463 @ iteration 41180001 with fft length 4320K, 51.19% done
Round off error at iteration = 41191900, err = 0.35938 > 0.35, fft = 4320K.
The error persists. Trying a larger fft until the next checkpoint.
Using threads: square 32, splice 256.
Continuing M80443463 @ iteration 41180001 with fft length 4608K, 51.19% done
| Jun 22 18:05:07 | M80443463 41200000 0x0cb6bac9222e41be | 4608K 0.07031 13.0346 260.67s | 5:22:05:24 51.21% |
Resettng fft.
Using threads: square 32, splice 256.
Continuing M80443463 @ iteration 41200001 with fft length 4320K, 51.22% done
| Jun 22 18:09:40 | M80443463 41220000 0xc5e12eeadc77d748 | 4320K 0.31641 13.6281 272.54s | 6:04:29:02 51.24% |

Perhaps expected duration is calculated only when beginning an exponent or changing fft size. How about doing it at checkpoint-save intervals?
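The arithmetic used above (current per-iteration time times remaining iterations) is cheap enough to redo at every checkpoint save. A minimal sketch, using the figures from this run:

```python
# Recompute the ETA from the current ms/iter and remaining iterations,
# as proposed above for checkpoint-save intervals. Figures from the run shown.

def eta_seconds(exponent, fraction_done, ms_per_iter):
    """Remaining wall time: remaining iterations x current per-iteration time.

    An LL test of M(p) takes p - 2 iterations; the -2 is negligible here.
    """
    remaining_iters = (exponent - 2) * (1.0 - fraction_done)
    return remaining_iters * ms_per_iter / 1000.0

# At GTX 1050 Ti speed on the moved exponent: about 6.42 days,
# not the 24.71 days the program was displaying.
days = eta_seconds(80443463, 0.494, 13.6276) / 86400
```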
#2606 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,419 Posts
Prime95 estimates completion date/time for multiple entries in its worktodo file. It's very handy for scheduling.
CUDALucas computes an estimated time of completion (ETA) for the current exponent only, while it is in progress. It would be useful to have an option in CUDALucas to read the entire worktodo file and compute and display the time span and completion date for each worktodo line, at the outset of a run, following the usual header and optional info section. The option could be -w or -work. The output might look something like the following (for a GTX 1050 Ti or similar-speed card): Code:
Work Queue Status
Start-date start-time  exponent   current-iteration  run-length  iteration-time  total-time-est  completion-estimate  %-complete
Jun 22 18:05:07        M80443463  41200000           4608K       13.0346         12d 3:13:40     Jun 28 16:10:31      51.21%
Jun 28 16:10:31        M43161917         0           2304K        6.0141          3d 0:06:20     Jul 01 16:16:51       0.00%
(any additional exponents queued would follow in the list)
Total of 2 exponents queued, occupying an estimated 15d 3:20:00 total, 8d 22:11:44 remaining.
Code:
Get current date and time
Set start date and time for first work as current date and time
Zero exponent count, estimated run time total, estimated remaining time total
open worktodo file for read
Output header for work table
While (!EOF) {
read a line of worktodo
parse to obtain exponent
increment count of valid lines containing work
if there is a checkpoint file for the exponent in the working directory to resume from{
read it to obtain fft length and saved iteration number (or error handling if read fails or values obtained are not valid)
} else {
determine fft length for the exponent
assume zero iterations performed
}
perhaps, if it is the first valid work line, save exponent for resumption or start after work estimation
look up the iteration time for the fft length
compute duration in seconds as (exponent-2) iterations times iteration-time divided by 1000
compute percent-done as iterations done / (exponent-2) * 100
compute remaining duration for this exponent as duration * (1 - percent-done/100)
compute completion date and time as start date and time plus remaining duration
output line
set start date and time (for next worktodo line) as completion date and time for this line's exponent
add this exponent's estimated time to total estimated run time
add this exponent's estimated remaining duration to total remaining time estimate
}
close worktodo
output summary line with count of exponents and total estimated run time.
(continue; resume or start first exponent in worktodo, if there is at least one valid work line)
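The loop above could be fleshed out roughly as follows. The worktodo representation, fft-selection rule, and per-fft timing table here are simplified stand-ins (real worktodo lines and per-card timings vary), so treat this as a sketch of the chaining logic, not a drop-in patch.

```python
# Sketch of the proposed -work option: walk the worktodo entries, estimate
# each run's length from a per-fft iteration-time table, and chain the
# completion times. Tables and input format are simplified assumptions.

from datetime import datetime, timedelta

# assumed ms/iter for the card, keyed by fft length in K (illustrative values)
ITER_MS = {2304: 6.0141, 4320: 13.6276, 4608: 13.0346}

def fft_for(exponent):
    """Crude stand-in for CUDALucas's fft-length selection."""
    if exponent < 43500000:
        return 2304
    return 4320 if exponent < 80500000 else 4608

def work_queue_status(worktodo, checkpoints, start=None):
    """worktodo: list of exponents; checkpoints: {exponent: iterations done}.

    Returns (rows, total run time, total remaining time); each row is
    (start, exponent, iterations done, fft, full duration, finish, % done).
    """
    start = start or datetime.now()
    rows, total, remaining = [], timedelta(), timedelta()
    for p in worktodo:
        done_iters = checkpoints.get(p, 0)      # resume point, if any
        fft = fft_for(p)
        full = timedelta(seconds=(p - 2) * ITER_MS[fft] / 1000.0)
        pct = 100.0 * done_iters / (p - 2)
        left = full * (1.0 - pct / 100.0)
        finish = start + left
        rows.append((start, p, done_iters, fft, full, finish, pct))
        start = finish            # next entry begins when this one ends
        total += full
        remaining += left
    return rows, total, remaining

rows, total, remaining = work_queue_status(
    [80443463, 43161917], {80443463: 41200000})
```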
Comments? (end)
Last fiddled with by kriesel on 2017-06-24 at 20:40 Reason: distinguish full run time versus remaining run time totals
#2607
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
152B16 Posts
Quote:
16384 294471259 132.2527 corresponds to more than a year; 32768 580225813 282.8326 corresponds to an LL test run taking up to 5.2 years on a Quadro 2000; 38880 685923253 396.8878, up to 8.6 years. FFT or threads benchmarking over wide ranges of fft lengths has revealed some serious anomalies on certain GPU and CUDA-level combinations.
Last fiddled with by kriesel on 2017-07-11 at 03:14
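Those run lengths check out with a quick back-of-the-envelope: multiply the largest exponent at each fft length by the benchmarked ms/iter (assuming, as the triples above suggest, that the middle number is the largest exponent testable at that fft length).

```python
# Back-of-envelope check of the run lengths quoted above:
# (fft length, largest exponent at that length, benchmarked ms/iter).
bench = [(16384, 294471259, 132.2527),
         (32768, 580225813, 282.8326),
         (38880, 685923253, 396.8878)]

# exponent iterations x seconds per iteration, converted to years
years = [p * ms / 1000.0 / 86400 / 365.25 for _, p, ms in bench]
# roughly 1.2, 5.2, and 8.6 years on the Quadro 2000
```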