mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   Keeping cuda working over Ubuntu upgrades (https://www.mersenneforum.org/showthread.php?t=20852)

fivemack 2016-01-15 21:36

Keeping cuda working over Ubuntu upgrades
 
After a few automatic upgrades and a reboot, nvidia-smi is telling me

[code]
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
[/code]

and indeed 'lsmod | grep nv' gives no output.

Presumably I need to cause the driver to get rebuilt against the new kernel version, but I can't see how you do that.

Batalov 2016-01-15 21:46

I've had the same problem for years while I was using my home workstation for both work and cuda. (I don't anymore; easier to run cuda computations on the cloud and keep home computer lightly loaded... and in Windows so the kids can use it)

What I've gathered is that NVIDIA makes this non-automatable deliberately. I still have to type 'accept' every time I install
[CODE]# ssh into a new EC2 node
sudo yum -y update
sudo yum -y install tcsh wget bc perl unzip gcc gcc-c++ openssh-clients diffutils gmp-devel kernel-devel-`uname -r`

wget http://us.download.nvidia.com/XFree86/Linux-x86_64/358.16/NVIDIA-Linux-x86_64-358.16.run
sudo sh ./NVIDIA-Linux-x86_64-358.16.run
[/CODE]

frmky 2016-01-15 22:15

[QUOTE=Batalov;422605]I still have to type 'accept' every time I install
[CODE]sudo sh ./NVIDIA-Linux-x86_64-358.16.run -s
[/CODE][/QUOTE]
Add -s as above.

Mark Rose 2016-01-15 22:29

Are you using dkms? It's used to recompile modules on kernel upgrades.
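
For anyone following along, here is a minimal sketch of how one might check what dkms has built and ask it to rebuild for a given kernel. It assumes the driver was registered with dkms when it was installed and that the matching headers package exists (on Ubuntu this is usually `linux-headers-<version>`). The function name `dkms_rebuild_cmds` is made up for illustration, and the script only prints the commands rather than running them.

```shell
#!/bin/sh
# Print (not run) the commands to inspect and rebuild dkms modules
# for a kernel version; defaults to the running kernel.
# dkms_rebuild_cmds is a hypothetical helper name, not part of dkms.
dkms_rebuild_cmds() {
    kver="${1:-$(uname -r)}"
    # Show which modules dkms knows about and for which kernels they are built.
    echo "dkms status"
    # dkms needs the kernel headers to compile; assumed Ubuntu package name.
    echo "sudo apt-get install linux-headers-${kver}"
    # Build and install every registered module for this kernel.
    echo "sudo dkms autoinstall -k ${kver}"
    # Try loading the module without a reboot (module name assumed to be nvidia).
    echo "sudo modprobe nvidia"
}

dkms_rebuild_cmds "$@"
```

If `dkms status` lists no nvidia entry at all, the driver was probably installed without dkms support, and the rebuild step has nothing to work on.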

xilman 2016-01-15 23:29

[QUOTE=fivemack;422602]Presumably I need to cause the driver to get rebuilt against the new kernel version, but I can't see how you do that.[/QUOTE]
Move to Gentoo. It all "just works" :smile:

Well, it does for me anyway.

Paul

fivemack 2016-01-16 11:58

[QUOTE=Mark Rose;422614]Are you using dkms? It's used to recompile modules on kernel upgrades.[/QUOTE]

I believe I'm using dkms, but all I see it doing is deleting old versions of the module when I do apt-get autoremove to clean up the huge pile of old kernels filling my unreasonably-small /boot partition.

fivemack 2016-01-16 12:11

[QUOTE=Batalov;422605]I've had the same problem for years while I was using my home workstation for both work and cuda. (I don't anymore; easier to run cuda computations on the cloud and keep home computer lightly loaded... and in Windows so the kids can use it) [/quote]

Don't you find running CUDA computations on the cloud expensive? I'm paying probably £200/year for electricity for the GTX580, though I suppose a g2.2xlarge at spot price is 5p/hour so that's only a factor two.
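
The "factor two" can be checked with shell integer arithmetic. This is a rough sketch that assumes the instance runs every hour of a 365-day year at the quoted 5p/hour; both figures come from the post, and the variable names are illustrative.

```shell
#!/bin/sh
# Yearly cost of a spot instance at 5p/hour, running all year (assumption).
pence_per_hour=5
hours_per_year=$((24 * 365))
cloud_pence=$((pence_per_hour * hours_per_year))
# Convert to whole pounds and compare with the 200 GBP/year electricity figure.
cloud_pounds=$((cloud_pence / 100))
electricity_pounds=200
echo "cloud: ${cloud_pounds} GBP/year, electricity: ${electricity_pounds} GBP/year"
# Ratio scaled by 10 so integer arithmetic keeps one (truncated) decimal place.
ratio_x10=$((cloud_pounds * 10 / electricity_pounds))
echo "ratio: $((ratio_x10 / 10)).$((ratio_x10 % 10))"
```

That works out to about £438 a year for the cloud instance, a little over twice the electricity bill.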

fivemack 2016-01-16 12:54

Installing on a fresh machine is basically fine.

But I'm now in a situation where nvidia-smi can't find the device, and downloading the .deb and doing 'sudo apt-get install cuda' just tells me 'cuda is already the newest version'.

'sudo apt-get purge cuda; sudo apt-get install cuda' also does very little.

[code]
sudo apt-get remove cuda
sudo apt-get install cuda
[/code]

re-downloads Java and half of X11, and still leaves me in a situation where nvidia-smi can't find the device.

I'll try again using the run-file that nvidia ship; after a reboot (I really would prefer a solution with no reboots - this is a compute node, I aim to have twelve gnfs-lasieve jobs running 24/365) I get a new exciting unhelpful message

[code]
pumpkin@pumpkin:~$ nvidia-smi
Failed to initialize NVML: GPU access blocked by the operating system
[/code]

fivemack 2016-01-16 14:08

On further examination, the card has fallen off the PCIe bus entirely: lspci | grep -i nv returns nothing. Maybe Ubuntu is not to blame.
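
The checks used in this thread can be put in the order that narrows the fault fastest: bus first, then kernel module, then nvidia-smi. Below is a small sketch of that ordering. The function name `classify_gpu_state` and its messages are invented for illustration; it takes the output of `lspci` and `lsmod` as strings so the logic can be tried on a machine without a GPU.

```shell
#!/bin/sh
# Classify the failure from captured lspci and lsmod output.
# $1: output of `lspci`, $2: output of `lsmod`
# classify_gpu_state is a hypothetical helper, not an NVIDIA tool.
classify_gpu_state() {
    if ! printf '%s\n' "$1" | grep -qi 'nvidia'; then
        # Nothing on the bus: reinstalling drivers cannot help.
        echo "card not visible on the PCIe bus: suspect hardware, not the driver"
    elif ! printf '%s\n' "$2" | grep -q '^nvidia'; then
        # Card is enumerated but no nvidia* module is loaded.
        echo "card present but kernel module not loaded: rebuild or reload the driver"
    else
        echo "card and module present: check nvidia-smi output"
    fi
}

# On a real machine (left commented out so the sketch runs anywhere):
# classify_gpu_state "$(lspci)" "$(lsmod)"
```

Running the bus check first would have avoided the reinstall attempts above, since no driver change can bring back a card that lspci does not list.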

chris2be8 2016-01-16 16:37

[QUOTE=Batalov;422605]
What I've gathered is that NVIDIA makes this non-automatable deliberately. I still have to type 'accept' every time I install
[CODE]# ssh into a new EC2 node
sudo yum -y update
sudo yum -y install tcsh wget bc perl unzip gcc gcc-c++ openssh-clients diffutils gmp-devel kernel-devel-`uname -r`

wget http://us.download.nvidia.com/XFree86/Linux-x86_64/358.16/NVIDIA-Linux-x86_64-358.16.run
sudo sh ./NVIDIA-Linux-x86_64-358.16.run
[/CODE][/QUOTE]

Read up on Expect. I used the perl expect module to automate setting a new user's password. There's also a program called expect which is easier to call from a shell script (other parts of the user setup script were already in perl so that was the obvious choice in my case).

You only need to install expect on the system you are connecting from, not on the new node; it can automate responses within an SSH session.
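
As a rough sketch of that idea, the shell function below writes out a small expect script that runs the installer over SSH and answers the licence question. The helper name `make_expect_script`, the host argument, and the prompt fragment `accept` that it waits for are assumptions for illustration; the installer's real prompt text may differ and would need checking by hand first.

```shell
#!/bin/sh
# Emit an expect script that runs the installer over SSH and types 'accept'.
# make_expect_script is a hypothetical helper; the pattern "accept" is a
# guessed fragment of the licence prompt, not the installer's verified text.
make_expect_script() {
    host="$1"
    cat <<EOF
set timeout 600
spawn ssh -t $host "sudo sh ./NVIDIA-Linux-x86_64-358.16.run"
expect "accept"
send "accept\r"
expect eof
EOF
}

# Usage (expect is needed only on the local machine):
# make_expect_script user@new-node | expect -f -
```

The script is generated rather than run directly so it can be inspected before being piped to expect.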

Chris


All times are UTC.