mersenneforum.org

I'm having quality issues on skylake as well (https://www.mersenneforum.org/showthread.php?t=21476)

nucleon 2016-07-30 14:23

I tried lower exponents and did "stuff" to try to induce the error. I ran a double check of 10000139, which takes <2 hours, and did things like start my washing machine and exhaust memory to force pagefile usage. But I couldn't cause an error: the residue matched.

But no luck.

So I decided to test M39295301 again, but this time I set "InterimFiles=600000" in prime.txt so I could go back in time for troubleshooting purposes, i.e. do disastrous activity 'x', see what happens, and go back to the prior residue.

So I have something 'weird'.

Interim residues went to zero.

[CODE][Sat Jul 30 16:21:06 2016]
M39295301 interim We4 residue CC3C48495000037E at iteration 12000000
M39295301 interim We4 residue 3C49659327DD0526 at iteration 12000001
M39295301 interim We4 residue F44162741F0EE10A at iteration 12000002
[Sat Jul 30 16:48:48 2016]
M39295301 interim We4 residue 0000000000000000 at iteration 12600000
M39295301 interim We4 residue 0000000000000000 at iteration 12600001
M39295301 interim We4 residue 0000000000000000 at iteration 12600002
etc...
Each residue thereafter was all 0s.
[/CODE]
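
A run of all-zero interim residues like the one above is easy to miss by eye in a long log. Here is a minimal sketch of a scanner for it, assuming the log lines have exactly the format shown in this post. The function name `find_zero_residues` is made up for illustration and is not part of Prime95.

```python
import re

# Matches lines such as:
#   M39295301 interim We4 residue CC3C48495000037E at iteration 12000000
# The line format is taken from the log excerpt in this post.
RESIDUE_LINE = re.compile(
    r"^M(?P<exponent>\d+) interim \w+ residue (?P<residue>[0-9A-Fa-f]{16}) "
    r"at iteration (?P<iteration>\d+)"
)


def find_zero_residues(lines):
    """Return (exponent, iteration) pairs whose interim residue is all zeros."""
    hits = []
    for line in lines:
        match = RESIDUE_LINE.match(line.strip())
        if match and int(match.group("residue"), 16) == 0:
            hits.append(
                (int(match.group("exponent")), int(match.group("iteration")))
            )
    return hits


if __name__ == "__main__":
    sample = [
        "[Sat Jul 30 16:21:06 2016]",
        "M39295301 interim We4 residue CC3C48495000037E at iteration 12000000",
        "[Sat Jul 30 16:48:48 2016]",
        "M39295301 interim We4 residue 0000000000000000 at iteration 12600000",
    ]
    for exponent, iteration in find_zero_residues(sample):
        print(f"M{exponent}: zero residue at iteration {iteration}")
```

Timestamp lines and anything else that does not match the pattern are simply skipped.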

I thought that was weird. I checked the saves:

[CODE]$ ls -lart
-rwxrwx---+ 1 4911984 Jul 30 16:21 p2E95301.020
-rwxrwx---+ 1 3733828 Jul 30 16:48 p2E95301.021[/CODE]
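
The size drop between the .020 and .021 saves could also be caught automatically. The sketch below assumes, based only on what I saw in this one run, that interim saves for the same exponent should all be the same size. The helper name `find_size_changes` is hypothetical.

```python
from pathlib import Path


def find_size_changes(directory, pattern):
    """Return (name, size, previous_size) for every file whose size differs
    from the file before it, with files taken in name order."""
    changes = []
    previous_size = None
    for path in sorted(Path(directory).glob(pattern)):
        size = path.stat().st_size
        if previous_size is not None and size != previous_size:
            changes.append((path.name, size, previous_size))
        previous_size = size
    return changes


if __name__ == "__main__":
    # Pattern taken from the save names in this post; adjust for other exponents.
    for name, size, previous in find_size_changes(".", "p2E95301.*"):
        print(f"{name}: {size} bytes, previous save was {previous} bytes")
```

A flagged file is only a hint to go and look at the residue for that save, not proof of corruption.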

Well, I thought that was super weird. Why did the interim saves drop in size?

So I ran the same test again from the "16:21 / *.020" save. This time the .021 save was the same size as the .020 save, with a non-zero residue.

[CODE]
[Sat Jul 30 20:12:53 2016]
M39295301 interim We4 residue 64DD890177223A30 at iteration 12600000
M39295301 interim We4 residue E9CF3C57BEBBD866 at iteration 12600001
M39295301 interim We4 residue FED40900A4BD82C6 at iteration 12600002[/CODE]

[CODE]$ ls -alrt p2E95301.021
-rwxrwx---+ 1 4911984 Jul 30 20:12 p2E95301.021[/CODE]

So I checked my command history to see what I did then.

[CODE]Jul 30 16:27:06> ./CUDAPm1-v0.20.exe[/CODE]

So that's a worry. I started CUDAPm1 at 16:27, between the good 16:21 save and the zeroed 16:48 save. That looks like some serious corruption by the CUDA/GPU drivers. My GPU is a Titan Black.
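
The timeline check I did by hand (was anything launched between the last good save and the first bad one?) can be sketched as follows. The two timestamp formats are the ones shown in this post; the helper names are made up for illustration, and the year has to be supplied for the command history because it does not record one.

```python
from datetime import datetime


def parse_log_time(text):
    """Parse a log timestamp such as 'Sat Jul 30 16:21:06 2016'."""
    return datetime.strptime(text, "%a %b %d %H:%M:%S %Y")


def parse_history_time(text, year):
    """Parse a command-history timestamp such as 'Jul 30 16:27:06'.
    The history has no year, so the caller supplies it."""
    return datetime.strptime(f"{text} {year}", "%b %d %H:%M:%S %Y")


def ran_between(event, last_good, first_bad):
    """True if the event falls inside the window where the corruption happened."""
    return last_good <= event <= first_bad


if __name__ == "__main__":
    last_good = parse_log_time("Sat Jul 30 16:21:06 2016")
    first_bad = parse_log_time("Sat Jul 30 16:48:48 2016")
    launch = parse_history_time("Jul 30 16:27:06", 2016)
    print(ran_between(launch, last_good, first_bad))
```

This only shows that the two events overlap in time, not that one caused the other.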

Still doing more testing.

nucleon 2016-08-07 02:16

I think I'm happy it's fixed now.

The machine has done 5 matching double checks in a row: 3 new exponents and 2 exponents done previously. That is a longer streak than it has managed before.

Resolution?

Given the randomness of the issue, it's probably not just one of the items below. But here's what I did:
- removed and reinstalled the SATA AHCI drivers
- removed the old GPU drivers and reinstalled the nvidia drivers
- changed the user temp directory and Windows\Temp directory variables to point to a fresh directory, then rebooted (not all services pick up the change)

Nvidia temp files:

[CODE]
Directory of U:\TEMP\NVIDIA Corporation\NV_Cache

05/08/2016 08:01 PM <DIR> .
05/08/2016 08:01 PM <DIR> ..
02/08/2016 10:43 PM 16,384 22d784da9a2078597920020ef1ee250e_fce8395c8fd8a85e_15f74c7777689be5_0_0.bin
02/08/2016 10:43 PM 4,096 22d784da9a2078597920020ef1ee250e_fce8395c8fd8a85e_15f74c7777689be5_0_0.toc
02/08/2016 10:43 PM 16,384 22d784da9a2078597920020ef1ee250e_fce8395c8fd8a85e_15f74c7777689be5_1_0.bin
02/08/2016 10:43 PM 4,096 22d784da9a2078597920020ef1ee250e_fce8395c8fd8a85e_15f74c7777689be5_1_0.toc
03/08/2016 01:57 AM 1,048,576 22d784da9a2078597920020ef1ee250e_fce8395c8fd8a85e_15f74c7777689be5_1_1.bin
04/08/2016 09:11 PM 262,144 22d784da9a2078597920020ef1ee250e_fce8395c8fd8a85e_15f74c7777689be5_1_1.toc
03/08/2016 01:27 PM 16,384 22d784da9a2078597920020ef1ee250e_fce8395c8fd8a85e_3b623872478f08e_0_0.bin
03/08/2016 01:27 PM 4,096 22d784da9a2078597920020ef1ee250e_fce8395c8fd8a85e_3b623872478f08e_0_0.toc
05/08/2016 08:01 PM 16,384 22d784da9a2078597920020ef1ee250e_fce8395c8fd8a85e_f3279b66e87c6f22_0_0.bin
05/08/2016 08:01 PM 4,096 22d784da9a2078597920020ef1ee250e_fce8395c8fd8a85e_f3279b66e87c6f22_0_0.toc
03/08/2016 01:28 PM 16,384 9542647a3283db9e625cd51c00efe7d_fce8395c8fd8a85e_b82f6581efe09e9e_0_0.bin
03/08/2016 01:28 PM 4,096 9542647a3283db9e625cd51c00efe7d_fce8395c8fd8a85e_b82f6581efe09e9e_0_0.toc
03/08/2016 01:28 PM 1,048,576 9542647a3283db9e625cd51c00efe7d_fce8395c8fd8a85e_b82f6581efe09e9e_0_1.bin
03/08/2016 01:28 PM 262,144 9542647a3283db9e625cd51c00efe7d_fce8395c8fd8a85e_b82f6581efe09e9e_0_1.toc
03/08/2016 01:32 PM 16,777,216 9542647a3283db9e625cd51c00efe7d_fce8395c8fd8a85e_b82f6581efe09e9e_0_2.bin
03/08/2016 01:34 PM 4,194,304 9542647a3283db9e625cd51c00efe7d_fce8395c8fd8a85e_b82f6581efe09e9e_0_2.toc
03/08/2016 01:35 PM 16,777,216 9542647a3283db9e625cd51c00efe7d_fce8395c8fd8a85e_b82f6581efe09e9e_0_3.bin
17 File(s) 40,472,576 bytes
2 Dir(s) 8,476,454,912 bytes free[/CODE]

Variables changed:

[CODE]C:\>set | find "U:"
TEMP=U:\TEMP
TMP=U:\TEMP
[/CODE]

Both the user and system versions of these were changed. I created a dummy RAM drive (U:) to relocate these files, as it's almost impossible to clear out the temp directory: there's always one file locked.

-- Craig

nucleon 2016-08-23 19:11

I feel like an idiot. Pretty much ignore the above.

My GPU blew up a few days ago.

The magic smoke was released. I could smell the smoke.

My guess is something blew or shorted on the part of the GPU board that regulates power.

I tried powering up the PC again, but it seems the PSU's over-voltage or over-current protection kicked in and didn't allow bootup.

As soon as I replaced the GPU, all OK. I'm back up and no errors since.

So I'd say the power components on the GPU board have been playing up for a while.

Possibly since my interstate relocation in March this year.

-- Craig

LaurV 2016-08-24 07:42

Ha! The famous "mosfet bug" hit you too! Those cards have a design error: the mosfet that powers the memory on the back side is not properly dimensioned or not properly covered by the cooler, and it burns when you stress the memory. That is why LL/cudaLucas can kill the card but TF will not, as mfaktc does not do many memory transfers.

I started to "collect" Titan cards with this issue, hoping to be able to repair them. Three of them will be sent to me soon by airsquirrels, adding to my existing "stock" (another two damaged, one repaired). So, if you decide to part with it toward the rubbish bin, better send it to me and I may pay you the postal fees, in case you will not try to skin me (sometimes the postal fees are more than the new goods, hehe, and I never "imported" electronics from Australia). In case I can repair it, I may give it a new life, producing for GIMPS.

