We have never experienced a GPU error until tonight. So far the error is not reproducible. We are not sure how to diagnose it. It could be the memory or the GPU or the driver.
The first error happened with the default (3M) FFT. We then ran it again with a larger (4M) FFT to eliminate possible rounding problems. The error went away. But when we reran the default FFT the error did not show up again. So we will let it run overnight and see what develops.
[CODE]
2020-12-10 19:55:57 gfx804-0 57884161 FFT: 3M 1K:6:256 (18.40 bpw)
2020-12-10 19:55:57 gfx804-0 Expected maximum carry32: 424C0000
2020-12-10 19:55:57 gfx804-0 OpenCL args "-DEXP=57884161u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=6u -DPM1=0 -DAMDGPU=1 -DMAX_ACCURACY=1 -DWEIGHT_STEP_MINUS_1=0x8.3c97b67d7c268p-4 -DIWEIGHT_STEP_MINUS_1=-0xa.e000341d8b4f8p-5 -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2020-12-10 19:55:57 gfx804-0 ASM compilation failed, retrying compilation using NO_ASM
2020-12-10 19:55:59 gfx804-0 OpenCL compilation in 1.69 s
2020-12-10 19:56:04 gfx804-0 57884161 OK 0 loaded: blockSize 400, 0000000000000003
2020-12-10 19:56:04 gfx804-0 validating proof residues for power 8
2020-12-10 19:56:04 gfx804-0 Proof using power 8
2020-12-10 19:56:16 gfx804-0 57884161 OK 800 0.00%; 10302 us/it; ETA 6d 21:38; 2b49902d4f6905c2 (check 4.16s)
2020-12-10 19:57:56 gfx804-0 57884161 OK 10000 0.02%; 10388 us/it; ETA 6d 23:00; 292e73fc56b7ff86 (check 4.17s)
2020-12-10 19:59:44 gfx804-0 57884161 OK 20000 0.03%; 10362 us/it; ETA 6d 22:33; fc68ecb5bf035d79 (check 4.18s)
2020-12-10 20:01:31 gfx804-0 57884161 OK 30000 0.05%; 10348 us/it; ETA 6d 22:18; c087f22eb0605bc2 (check 4.16s)
2020-12-10 20:03:19 gfx804-0 57884161 EE 40000 0.07%; 10398 us/it; ETA 6d 23:04; e8e70a03278c7bee (check 4.17s)
2020-12-10 20:03:24 gfx804-0 57884161 OK 30000 loaded: blockSize 400, c087f22eb0605bc2
2020-12-10 20:05:13 gfx804-0 57884161 OK 40000 0.07%; 10475 us/it; ETA 7d 00:19; e8e70a03278c7bee (check 4.20s) 1 errors
2020-12-10 20:07:01 gfx804-0 57884161 OK 50000 0.09%; 10340 us/it; ETA 6d 22:07; 166a6e02aca42253 (check 4.17s) 1 errors
2020-12-10 20:08:48 gfx804-0 57884161 EE 60000 0.10%; 10313 us/it; ETA 6d 21:39; 23fc7a31e5763224 (check 4.16s) 1 errors
2020-12-10 20:08:53 gfx804-0 57884161 OK 50000 loaded: blockSize 400, 166a6e02aca42253
2020-12-10 20:10:40 gfx804-0 57884161 OK 60000 0.10%; 10351 us/it; ETA 6d 22:16; 23fc7a31e5763224 (check 4.14s) 2 errors
2020-12-10 20:12:28 gfx804-0 57884161 OK 70000 0.12%; 10359 us/it; ETA 6d 22:22; bff8335d5596df84 (check 4.14s) 2 errors
2020-12-10 20:14:16 gfx804-0 57884161 OK 80000 0.14%; 10355 us/it; ETA 6d 22:16; 176d39434271b754 (check 4.17s) 2 errors
2020-12-10 20:16:04 gfx804-0 57884161 OK 90000 0.16%; 10410 us/it; ETA 6d 23:08; 78ab0ddd577e4377 (check 4.16s) 2 errors
[/CODE]
[CODE]
2020-12-10 20:29:47 gfx804-0 57884161 FFT: 4M 1K:8:256 (13.80 bpw)
2020-12-10 20:29:47 gfx804-0 Expected maximum carry32: 3350000
2020-12-10 20:29:47 gfx804-0 OpenCL args "-DEXP=57884161u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=8u -DPM1=0 -DAMDGPU=1 -DWEIGHT_STEP_MINUS_1=0x9.7bac6e6a40e48p-6 -DIWEIGHT_STEP_MINUS_1=-0x8.4260f87783e98p-6 -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2020-12-10 20:29:47 gfx804-0 ASM compilation failed, retrying compilation using NO_ASM
2020-12-10 20:29:50 gfx804-0 OpenCL compilation in 3.27 s
2020-12-10 20:29:57 gfx804-0 57884161 OK 0 loaded: blockSize 400, 0000000000000003
2020-12-10 20:29:57 gfx804-0 validating proof residues for power 8
2020-12-10 20:29:57 gfx804-0 Proof using power 8
2020-12-10 20:30:14 gfx804-0 57884161 OK 800 0.00%; 14209 us/it; ETA 9d 12:28; 2b49902d4f6905c2 (check 5.74s)
2020-12-10 20:32:31 gfx804-0 57884161 OK 10000 0.02%; 14240 us/it; ETA 9d 12:55; 292e73fc56b7ff86 (check 5.74s)
2020-12-10 20:34:59 gfx804-0 57884161 OK 20000 0.03%; 14203 us/it; ETA 9d 12:18; fc68ecb5bf035d79 (check 5.74s)
2020-12-10 20:37:26 gfx804-0 57884161 OK 30000 0.05%; 14204 us/it; ETA 9d 12:16; c087f22eb0605bc2 (check 5.74s)
2020-12-10 20:39:54 gfx804-0 57884161 OK 40000 0.07%; 14204 us/it; ETA 9d 12:14; e8e70a03278c7bee (check 5.74s)
2020-12-10 20:42:22 gfx804-0 57884161 OK 50000 0.09%; 14203 us/it; ETA 9d 12:10; 166a6e02aca42253 (check 5.74s)
2020-12-10 20:44:50 gfx804-0 57884161 OK 60000 0.10%; 14205 us/it; ETA 9d 12:10; 23fc7a31e5763224 (check 5.75s)
2020-12-10 20:47:18 gfx804-0 57884161 OK 70000 0.12%; 14207 us/it; ETA 9d 12:10; bff8335d5596df84 (check 5.75s)
2020-12-10 20:49:46 gfx804-0 57884161 OK 80000 0.14%; 14226 us/it; ETA 9d 12:25; 176d39434271b754 (check 5.80s)
2020-12-10 20:52:15 gfx804-0 57884161 OK 90000 0.16%; 14321 us/it; ETA 9d 13:54; 78ab0ddd577e4377 (check 5.81s)
[/CODE]:sad: |
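For anyone wondering where the "bpw" (bits per word) figures in those FFT lines come from, here is a minimal sketch. It assumes that "3M" and "4M" mean 3 × 2^20 and 4 × 2^20 FFT words and that bpw is simply the exponent divided by that word count. That reading is an assumption, not taken from the gpuowl source, but it does reproduce the 18.40 and 13.80 printed in the logs. The function name is made up for illustration.

```python
def bits_per_word(exponent: int, fft_size_m: int) -> float:
    """Average number of exponent bits carried by each FFT word.

    Assumes an FFT size of "N M" means N * 2**20 words (hypothetical
    reading of the log line, consistent with the values it prints).
    """
    words = fft_size_m * 2**20
    return exponent / words


# The two runs from the post: same exponent, default 3M FFT vs. larger 4M FFT.
for exponent, fft_m in [(57884161, 3), (57884161, 4)]:
    print(f"{exponent} @ {fft_m}M: {bits_per_word(exponent, fft_m):.2f} bpw")
```

Fewer bits per word means more headroom against rounding error, which is why rerunning with the larger FFT is a reasonable way to rule rounding out before blaming the hardware.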
I get the EE error when I overclock the memory a little too far, e.g. 2200 MHz instead of 2100 MHz (at most 200 MHz over is possible).
|
More errors, even with a larger than necessary FFT length:
[CODE]
2020-12-11 02:42:04 config: -prp 77936867 -log 10000 -fft 5M
2020-12-11 02:42:04 gfx804-0 77936867 FFT: 5M 1K:10:256 (14.87 bpw)
2020-12-11 02:42:04 gfx804-0 Expected maximum carry32: 79A0000
2020-12-11 02:42:04 gfx804-0 OpenCL args "-DEXP=77936867u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DPM1=0 -DAMDGPU=1 -DWEIGHT_STEP_MINUS_1=0xc.87616ccfba158p-7 -DIWEIGHT_STEP_MINUS_1=-0xb.696d4cbe9cep-7 -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2020-12-11 02:42:04 gfx804-0 ASM compilation failed, retrying compilation using NO_ASM
2020-12-11 02:42:10 gfx804-0 OpenCL compilation in 5.61 s
2020-12-11 02:42:19 gfx804-0 77936867 OK 0 loaded: blockSize 400, 0000000000000003
2020-12-11 02:42:41 gfx804-0 77936867 OK 800 0.00%; 18441 us/it; ETA 16d 15:14; 1579c241dc63eca6 (check 7.46s)
2020-12-11 02:45:38 gfx804-0 77936867 OK 10000 0.01%; 18432 us/it; ETA 16d 15:00; fc4f135f7cf4ad29 (check 7.46s)
2020-12-11 02:48:50 gfx804-0 77936867 OK 20000 0.03%; 18431 us/it; ETA 16d 14:55; 3cd1bd9d5e09cbc5 (check 7.45s)
2020-12-11 02:52:02 gfx804-0 77936867 OK 30000 0.04%; 18434 us/it; ETA 16d 14:56; c4e0ff35e3290d98 (check 7.46s)
2020-12-11 02:55:13 gfx804-0 77936867 EE 40000 0.05%; 18434 us/it; ETA 16d 14:53; dffe1b1b0d748128 (check 7.44s)
2020-12-11 02:55:22 gfx804-0 77936867 OK 30000 loaded: blockSize 400, c4e0ff35e3290d98
2020-12-11 02:58:34 gfx804-0 77936867 OK 40000 0.05%; 18437 us/it; ETA 16d 14:56; dffe1b1b0d748128 (check 7.46s) 1 errors
2020-12-11 03:01:46 gfx804-0 77936867 OK 50000 0.06%; 18434 us/it; ETA 16d 14:49; 52e286945371ed29 (check 7.46s) 1 errors
2020-12-11 03:04:57 gfx804-0 77936867 OK 60000 0.08%; 18433 us/it; ETA 16d 14:45; 0945da4dc08bdd95 (check 7.44s) 1 errors
2020-12-11 03:08:09 gfx804-0 77936867 EE 70000 0.09%; 18430 us/it; ETA 16d 14:38; 7131fa4eb77f4bb2 (check 7.39s) 1 errors
2020-12-11 03:08:17 gfx804-0 77936867 OK 60000 loaded: blockSize 400, 0945da4dc08bdd95
2020-12-11 03:11:18 gfx804-0 77936867 OK 70000 0.09%; 17402 us/it; ETA 15d 16:24; 7131fa4eb77f4bb2 (check 7.03s) 2 errors
2020-12-11 03:14:19 gfx804-0 77936867 OK 80000 0.10%; 17399 us/it; ETA 15d 16:17; 8d76071d27ee4221 (check 7.03s) 2 errors
2020-12-11 03:17:20 gfx804-0 77936867 EE 90000 0.12%; 17398 us/it; ETA 15d 16:13; 0bacff453b2f470e (check 7.01s) 2 errors
2020-12-11 03:17:28 gfx804-0 77936867 OK 80000 loaded: blockSize 400, 8d76071d27ee4221
2020-12-11 03:20:29 gfx804-0 77936867 OK 90000 0.12%; 17402 us/it; ETA 15d 16:18; 0bacff453b2f470e (check 7.03s) 3 errors
2020-12-11 03:23:30 gfx804-0 77936867 OK 100000 0.13%; 17399 us/it; ETA 15d 16:12; 6d7296b9e2830f50 (check 7.03s) 3 errors
2020-12-11 03:26:31 gfx804-0 77936867 OK 110000 0.14%; 17399 us/it; ETA 15d 16:08; 8cbfd4435622bda7 (check 7.04s) 3 errors
2020-12-11 03:29:32 gfx804-0 77936867 OK 120000 0.15%; 17402 us/it; ETA 15d 16:10; 79ae5dad855057ad (check 7.04s) 3 errors
2020-12-11 03:32:33 gfx804-0 77936867 OK 130000 0.17%; 17400 us/it; ETA 15d 16:05; 50c97bcbf876231f (check 7.04s) 3 errors
2020-12-11 03:35:35 gfx804-0 77936867 OK 140000 0.18%; 17402 us/it; ETA 15d 16:04; e1db15f897271496 (check 7.05s) 3 errors
2020-12-11 03:38:36 gfx804-0 77936867 EE 150000 0.19%; 17401 us/it; ETA 15d 16:00; 127631386c6a9b17 (check 7.02s) 3 errors
2020-12-11 03:38:44 gfx804-0 77936867 EE 140000 loaded: blockSize 400, 0000000000000000 (expected e1db15f897271496)
2020-12-11 03:38:44 gfx804-0 Exiting because "error on load"
2020-12-11 03:38:44 gfx804-0 Bye
[/CODE] |
Mike, what have your temps been like?
I think you live farther south than I do, and our temperatures have been crazy: 60 degrees plus, which is insane for December. Has this caused your hardware to run hotter than normal? |
[QUOTE=Xyzzy;565934]More errors, even with a larger than necessary FFT length[/QUOTE]
Changing the FFT length won't help (as you found out). The most likely problems are:
1) the computational units are running too hot or too fast
2) the memory is running too hot or too fast
3) an inadequate or flaky power supply
If you can reduce speed or increase voltage, do that until the errors go away. |
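If you are stepping clocks down (or voltage up) and want to compare runs, it helps to count the failed checks rather than eyeball the log. Below is a rough sketch that tallies OK and EE check lines. The regular expression is derived only from the v6.11-style lines posted in this thread, not from any official format description, and it assumes one log entry per line. `CheckSummary` and `summarize` are hypothetical names used for illustration.

```python
import re
from dataclasses import dataclass, field

# Matches periodic check lines such as
# "2020-12-10 20:03:19 gfx804-0 57884161 EE 40000 0.07%; 10398 us/it; ..."
# "loaded:" lines have no percentage field, so they are skipped on purpose.
CHECK_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<device>.+?) (?P<exponent>\d+) (?P<status>OK|EE)\s+"
    r"(?P<iteration>\d+)\s+[\d.]+%; (?P<us_per_it>\d+) us/it"
)


@dataclass
class CheckSummary:
    ok: int = 0
    ee: int = 0
    ee_iterations: list = field(default_factory=list)

    @property
    def error_rate(self) -> float:
        """Fraction of periodic checks that failed."""
        total = self.ok + self.ee
        return self.ee / total if total else 0.0


def summarize(log_text: str) -> CheckSummary:
    """Count OK and EE check lines in a log (assumes one entry per line)."""
    summary = CheckSummary()
    for line in log_text.splitlines():
        match = CHECK_LINE.match(line.strip())
        if match is None:
            continue  # config, compile, "loaded:" and exit lines
        if match["status"] == "EE":
            summary.ee += 1
            summary.ee_iterations.append(int(match["iteration"]))
        else:
            summary.ok += 1
    return summary


# A few lines copied from the 3M run earlier in the thread.
SAMPLE = """\
2020-12-10 20:01:31 gfx804-0 57884161 OK 30000 0.05%; 10348 us/it; ETA 6d 22:18; c087f22eb0605bc2 (check 4.16s)
2020-12-10 20:03:19 gfx804-0 57884161 EE 40000 0.07%; 10398 us/it; ETA 6d 23:04; e8e70a03278c7bee (check 4.17s)
2020-12-10 20:03:24 gfx804-0 57884161 OK 30000 loaded: blockSize 400, c087f22eb0605bc2
2020-12-10 20:05:13 gfx804-0 57884161 OK 40000 0.07%; 10475 us/it; ETA 7d 00:19; e8e70a03278c7bee (check 4.20s) 1 errors
"""

result = summarize(SAMPLE)
print(result.ok, result.ee, result.ee_iterations, f"{result.error_rate:.0%}")
```

Run it on the log from each clock or voltage setting and keep the setting where the EE count stays at zero over a long run. The newer v7.2 log format shown later in the thread lays its fields out differently and would need its own pattern.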
Mike, what gpu?
Some of them "develop unique personalities".

I have a GTX 1080 that is anywhere from solid for days, to occasional EE, to EE EE EE Bye in PRP/GEC, even though there's a desk fan blowing on it and it's running stock clocks. It seems to go from fine to stopping within a few degrees C of ambient variation. The system has a new 750W power supply.

One of the Radeon VIIs does not like P-1, and has an hour-long failure to transfer to host, or something like that, followed by a persistent switch from normal clocks to an unchangeable 570 MHz GPU clock (loaded or not) until the system is restarted. Run PRP/GEC on it, and it's fine. The others on the same system are unaffected by whatever's happening with that one.

A 4GB RX550 on another system has taken to just stalling, with no gpuowl progress for hours or days, and GPU-Z displaying full clock rate whether the stalled process is left alone or killed. Again, it takes a system restart to clear that. The 2GB RX550 in the same box is unaffected. I think the driver or something gets confused about which device is which; running on -d 0 and -d 1 both land on the 2GB then.

[SPOILER]I know, switch to Linux and all will be solved.[/SPOILER] |
[QUOTE=kriesel;565950]Mike, what gpu?
Some of them "develop unique personalities". [/QUOTE] See post 2618: Radeon Pro WX 2100. |
It is a Radeon Pro WX 2100. We can pull that card out and put in a different card and everything is rock solid, so it must be the card. We were running it at whatever the default settings were so it should have been okay. The GPU temperature was steady at 80C. The PSU is a newish top-end "platinum" Corsair. Aiming a fan at the system didn't change anything.
Life is too short to deal with broken hardware so it will be replaced. Thanks for all of the tips! Here is a picture of our current system. :mike: |
ASRock Deskmini A300W, AMD A8-9600, Radeon R7 IGPU, 16GB DDR-4, SSD, Windows 10.
gpuowl V6.11-380
[CODE]
2020-12-05 18:14:05 gpuowl v6.11-380-g79ea0cc
2020-12-05 18:14:05 config: -iters 200000 -prp 77936867
2020-12-05 18:58:18 Bristol Ridge-0 77936867 OK 200000 0.26%; 13201 us/it; ETA 11d 21:04; f0b04b45b0855bd2 (check 5.19s)
2020-12-05 18:58:23 Bristol Ridge-0 Stopping, please wait..
2020-12-05 18:58:33 Bristol Ridge-0 77936867 OK 200800 0.26%; 12658 us/it; ETA 11d 09:20; 895b034c5473a608 (check 5.21s)
[/CODE]
gpuowl V7.2-21
[CODE]
2020-12-13 15:08:17 GpuOwl VERSION v7.2-21-g28dbf88
2020-12-13 15:08:17 config: -iters 200000 -prp 77936867
2020-12-13 15:50:54 Bristol Ridge-0 77936867 190000 0.24% f37f068f014b18a0 13328 us/it
2020-12-13 15:53:07 Bristol Ridge-0 77936867 Stopping, please wait..
2020-12-13 15:53:13 Bristol Ridge-0 77936867 OK 200000 0.26% f0b04b45b0855bd2 13337 us/it + check 5.46s + save 0.21s; ETA 12d 00:00
[/CODE] |
[URL="https://www.techpowerup.com/gpu-specs/msi-gt-710-low-profile-1-gb.b5859"]GT 710[/URL]
[CODE]
2020-12-14 14:28:46 GeForce GT 710-0 OpenCL compilation in 1.65 s
2020-12-14 14:29:17 GeForce GT 710-0 77936867 OK 0 loaded: blockSize 400, 0000000000000003
2020-12-14 14:30:41 GeForce GT 710-0 77936867 OK 800 0.00%; 69538 us/it; ETA 62d 17:26; 1579c241dc63eca6 (check 27.78s)
2020-12-14 14:41:41 GeForce GT 710-0 77936867 OK 10000 0.01%; 68817 us/it; ETA 62d 01:39; fc4f135f7cf4ad29 (check 27.57s)
2020-12-14 14:53:36 GeForce GT 710-0 77936867 OK 20000 0.03%; 68739 us/it; ETA 61d 23:46; 3cd1bd9d5e09cbc5 (check 27.55s)
2020-12-14 14:53:36 GeForce GT 710-0 77936867 OK 30000 0.04%; 68725 us/it; ETA 61d 23:16; c4e0ff35e3290d98 (check 27.57s)
[/CODE]:ouch: |
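To put those ETA figures in context, here is a small sketch of the arithmetic. It assumes the ETA is just the remaining iteration count (exponent minus current iteration) multiplied by the reported us/it. That is a guess at how the number is derived, not something confirmed from the gpuowl source. It reproduces the "6d 21:38" from the first check line of the WX 2100 run, and lands within a minute of the GT 710's first ETA, presumably because the printed us/it is rounded. The `eta` function is a made-up name.

```python
def eta(exponent: int, iteration: int, us_per_it: int) -> str:
    """Estimate time remaining, formatted like the log's 'Nd HH:MM'.

    Assumes a PRP test needs roughly `exponent` iterations in total
    (an assumption made for this sketch).
    """
    remaining_s = (exponent - iteration) * us_per_it // 1_000_000
    days, rem = divmod(remaining_s, 86_400)
    hours, rem = divmod(rem, 3_600)
    return f"{days}d {hours:02d}:{rem // 60:02d}"


# WX 2100, 3M FFT, first check line: the log printed "ETA 6d 21:38".
print(eta(57884161, 800, 10302))
# GT 710, first check line: the log printed "ETA 62d 17:26".
print(eta(77936867, 800, 69538))
```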
[QUOTE=M344587487;563394][url]..Based on the slides it should be 56-65% the cost of an A100, your number looks about right. The chance of a consumer version is ~0%, but there may be a pro variant. The poorly binned dies have to go somewhere, maybe there'll be an MI90 instead which would be a sad development.[/QUOTE]
Will anyone here ever get their hands on this AMD ROCm™ Compatible gold medal favorite? :rgerbicz: [URL="https://www.amd.com/en/products/server-accelerators/instinct-mi100"]https://www.amd.com/en/products/server-accelerators/instinct-mi100[/URL] |