mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   ubuntu-15.10 nvidia-355 driver fails to detect two dissimilar GPUs (https://www.mersenneforum.org/showthread.php?t=20655)

fivemack 2015-11-13 20:18

ubuntu-15.10 nvidia-355 driver fails to detect two dissimilar GPUs
 
[code]
$ nvidia-smi
Fri Nov 13 20:08:05 2015
+------------------------------------------------------+
| NVIDIA-SMI 355.11     Driver Version: 355.11         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 970     Off  | 0000:02:00.0     Off |                  N/A |
|  0%   27C    P0    43W / 160W |     15MiB /  4093MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[/code]

[code]
$ lspci | grep -i nvi
02:00.0 VGA compatible controller: NVIDIA Corporation GM204 [GeForce GTX 970] (rev a1)
02:00.1 Audio device: NVIDIA Corporation GM204 High Definition Audio Controller (rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation GF110 [GeForce GTX 580] (rev a1)
03:00.1 Audio device: NVIDIA Corporation GF110 High Definition Audio Controller (rev a1)
[/code]

[code]
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2014 NVIDIA Corporation
Built on Thu_Jul_17_21:41:27_CDT_2014
Cuda compilation tools, release 6.5, V6.5.12
[/code]

This nvcc won't compile for compute_10, so I removed the references to it from msieve/b40c/Makefile, but I still get:

[code]
pumpkin@pumpkin:~/msieve-cuda/trunk/X$ time ../msieve -g 0 -np1 "stage1_norm=1e25 0,1000"
error (line 71): CUDA_ERROR_NO_DEVICE
[/code]

Moreover:
[code]
pumpkin@pumpkin:~/msieve-cuda/trunk/X$ dpkg -l "*nvi*"
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=================================-=====================-=====================-=======================================================================
un libgl1-nvidia-alternatives <none> <none> (no description available)
rc nvidia-352 352.41-0ubuntu1 amd64 NVIDIA binary driver - version 352.41
ii nvidia-355 355.11-0ubuntu0~gpu15 amd64 NVIDIA binary driver - version 355.11
un nvidia-common <none> <none> (no description available)
un nvidia-compute-profiler <none> <none> (no description available)
un nvidia-cuda-debugger <none> <none> (no description available)
ii nvidia-cuda-dev 6.5.14-2 amd64 NVIDIA CUDA development files
ii nvidia-cuda-doc 6.5.14-2 all NVIDIA CUDA and OpenCL documentation
ii nvidia-cuda-gdb 6.5.14-2 amd64 NVIDIA CUDA Debugger (GDB)
un nvidia-cuda-profiler <none> <none> (no description available)
ii nvidia-cuda-toolkit 6.5.14-2 amd64 NVIDIA CUDA development toolkit
un nvidia-driver-binary <none> <none> (no description available)
un nvidia-libopencl1 <none> <none> (no description available)
un nvidia-libopencl1-352 <none> <none> (no description available)
un nvidia-libopencl1-352-updates <none> <none> (no description available)
un nvidia-libopencl1-dev <none> <none> (no description available)
ii nvidia-opencl-dev:amd64 6.5.14-2 amd64 NVIDIA OpenCL development files
un nvidia-opencl-icd <none> <none> (no description available)
ii nvidia-opencl-icd-352 352.55-0ubuntu0~gpu15 amd64 NVIDIA OpenCL ICD
rc nvidia-opencl-icd-352-updates 352.41-0ubuntu1 amd64 NVIDIA OpenCL ICD
un nvidia-opencl-icd-355 <none> <none> (no description available)
un nvidia-opencl-profiler <none> <none> (no description available)
un nvidia-persistenced <none> <none> (no description available)
ii nvidia-prime 0.8.1 amd64 Tools to enable NVIDIA's Prime
ii nvidia-profiler 6.5.14-2 amd64 NVIDIA Profiler for CUDA and OpenCL
ii nvidia-settings 358.09-0ubuntu0~gpu15 amd64 Tool for configuring the NVIDIA graphics driver
un nvidia-settings-binary <none> <none> (no description available)
un nvidia-vdpau-driver <none> <none> (no description available)
ii nvidia-visual-profiler 6.5.14-2 amd64 NVIDIA Visual Profiler for CUDA and OpenCL
[/code]
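
The listing above contains installed packages from more than one driver branch (nvidia-355 at 355.11 next to nvidia-opencl-icd-352 at 352.55 and nvidia-settings at 358.09). Whether that has anything to do with the failure is not established in this thread. As a rough sketch, assuming `dpkg -l` output in the format shown, the installed branches could be pulled out like this (the function names are made up for illustration):

```shell
#!/bin/sh
# Print "<package> <version>" for installed (status "ii") packages whose
# version starts with a three-digit driver number such as 355.11.
# Reads `dpkg -l` output from stdin.
installed_driver_packages() {
    awk '$1 == "ii" && $3 ~ /^[0-9][0-9][0-9]\.[0-9]/ { print $2, $3 }'
}

# Print the distinct driver branches (the digits before the first dot),
# one per line, sorted.
driver_branches() {
    installed_driver_packages |
        awk '{ v = $2; sub(/\..*$/, "", v); print v }' |
        sort -u
}
```

For example, `dpkg -l "*nvi*" | driver_branches` prints one line per branch; more than one line means packages from different driver releases are installed side by side.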

fivemack 2015-11-13 20:52

If I swap the cards round so the 580 is in the top slot, nvidia-smi only picks up the 580, and msieve still gives me
[code]
pumpkin@pumpkin:~/msieve-cuda/trunk/X$ time ../msieve -g 0 -np1 "stage1_norm=1e25 0,1000"
error (line 71): CUDA_ERROR_NO_DEVICE
[/code]
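
The symptom so far is a mismatch between the GPUs visible on the PCI bus and the GPUs the driver enumerates. A minimal sketch of that comparison follows. The function names are hypothetical, and it assumes `lspci` and `nvidia-smi -L` (which prints one `GPU n:` line per device) are available.

```shell
#!/bin/sh
# Count NVIDIA display controllers in `lspci` output read from stdin.
# Matches "VGA compatible controller" and "3D controller" entries only,
# so the HDMI audio functions of the cards are not counted.
count_pci_gpus() {
    grep -E -i -c '(VGA compatible|3D) controller: NVIDIA' || true
}

# Count GPUs in `nvidia-smi -L` output read from stdin.
count_driver_gpus() {
    grep -E -c '^GPU [0-9]+:' || true
}

# Compare the two counts: $1 = GPUs on the bus, $2 = GPUs the driver reports.
compare_gpu_counts() {
    if [ "$1" -eq "$2" ]; then
        echo "OK: driver sees all $1 NVIDIA GPU(s)"
    else
        echo "MISMATCH: lspci shows $1 NVIDIA GPU(s), driver reports $2"
        return 1
    fi
}
```

On a live system this could be run as `compare_gpu_counts "$(lspci | count_pci_gpus)" "$(nvidia-smi -L | count_driver_gpus)"`. With the output posted above it would report 2 on the bus against 1 in the driver.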

Mark Rose 2015-11-13 21:04

You may wish to add msieve to the title.

chris2be8 2015-11-13 22:44

This sounds like a hardware problem. But here are a few things to check in case it's a config problem.

Does lspci show them? What driver does lspci -v show for them?
Do they both work with anything else, e.g. gmp-ecm?
Does dmesg or syslog show anything interesting?
Does either card work if it's the only card in the system?

Is the PSU able to feed both cards?
Does the motherboard manual say both slots are suitable for a GPU?
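
The driver question above can be answered mechanically. This is a sketch only, with a made-up function name, assuming the usual `lspci -v` layout in which each device starts with its slot address and the bound driver appears on a "Kernel driver in use:" line:

```shell
#!/bin/sh
# Print "<slot> <driver>" for every NVIDIA VGA/3D controller found in
# `lspci -v` (or `lspci -k`) output read from stdin. A device without a
# "Kernel driver in use" line is reported with driver "(none)".
nvidia_driver_bindings() {
    awk '
        function emit() {
            if (slot != "") print slot, (drv == "" ? "(none)" : drv)
        }
        # A line starting with a slot address (optionally with a PCI
        # domain prefix) begins a new device record.
        /^([0-9a-f][0-9a-f][0-9a-f][0-9a-f]:)?[0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]\.[0-7] / {
            emit(); slot = ""; drv = ""
            if ($0 ~ /(VGA compatible|3D) controller: NVIDIA/) slot = $1
        }
        /Kernel driver in use:/ { if (slot != "") drv = $NF }
        END { emit() }
    '
}
```

Run as `sudo lspci -v | nvidia_driver_bindings`. For the log checks, `dmesg | grep -i -E 'nvidia|nvrm'` and `grep -i nvrm /var/log/syslog` are reasonable starting points, since the NVIDIA kernel module prefixes its kernel messages with NVRM.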

Chris

fivemack 2015-11-14 07:57

It's a 1 kW PSU; the system has in the recent past worked successfully with two GTX 580 cards running simultaneously. However, since then I have reinstalled the OS (previously it was Ubuntu 13.10, now it is 15.10) and replaced one of the GTX 580s with a GTX 970 (in the same slot). I was not able to get GPGPU to work on the new OS before adding the new card. I suspect this is just a config or driver problem, but I don't know how to attack it.

Output of 'sudo lspci -v' looks OK to me:

[code]
02:00.0 VGA compatible controller: NVIDIA Corporation GF110 [GeForce GTX 580] (rev a1) (prog-if 00 [VGA controller])
Subsystem: CardExpert Technology Device 0401
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
Memory at b8000000 (64-bit, prefetchable) [size=128M]
Memory at c0000000 (64-bit, prefetchable) [size=32M]
I/O ports at e000 [size=128]
[virtual] Expansion ROM at fb000000 [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [100] Virtual Channel
Capabilities: [128] Power Budgeting <?>
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Kernel driver in use: nvidia

02:00.1 Audio device: NVIDIA Corporation GF110 High Definition Audio Controller (rev a1)
Subsystem: CardExpert Technology Device 0401
Flags: bus master, fast devsel, latency 0, IRQ 17
Memory at fb080000 (32-bit, non-prefetchable) [size=16K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Endpoint, MSI 00
Kernel driver in use: snd_hda_intel

03:00.0 VGA compatible controller: NVIDIA Corporation GM204 [GeForce GTX 970] (rev a1) (prog-if 00 [VGA controller])
Subsystem: Gigabyte Technology Co., Ltd Device 36bc
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at f8000000 (32-bit, non-prefetchable) [size=16M]
Memory at a0000000 (64-bit, prefetchable) [size=256M]
Memory at b0000000 (64-bit, prefetchable) [size=32M]
I/O ports at d000 [size=128]
Expansion ROM at f9000000 [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] #19
Kernel driver in use: nvidia
[/code]

chris2be8 2015-11-14 16:48

I'm running my GTX 970 on openSUSE 13.2 so can't help with driver issues on Ubuntu.

The lspci output suggests both cards are OK from a hardware viewpoint.

Here's what lspci shows about my card (in case it helps):
[code]
4core:~ # lspci -v -s 01:00
01:00.0 VGA compatible controller: NVIDIA Corporation Device 13c2 (rev a1) (prog-if 00 [VGA controller])
Subsystem: eVga.com. Corp. Device 3978
Flags: bus master, fast devsel, latency 0, IRQ 48
Memory at f6000000 (32-bit, non-prefetchable) [size=16M]
Memory at e0000000 (64-bit, prefetchable) [size=256M]
Memory at f0000000 (64-bit, prefetchable) [size=32M]
I/O ports at e000 [size=128]
[virtual] Expansion ROM at f7000000 [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024
Capabilities: [900] #19
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia

01:00.1 Audio device: NVIDIA Corporation Device 0fbb (rev a1)
Subsystem: eVga.com. Corp. Device 3978
Flags: bus master, fast devsel, latency 0, IRQ 17
Memory at f7080000 (32-bit, non-prefetchable) [size=16K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Endpoint, MSI 00
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
[/code]
Check the output from dmesg and syslog. There might be a clue there. If not, all I can suggest is removing one card and trying to get the other working on its own. Then swap cards and repeat. Then hope you can get them both going.

Sorry I can't help any more.

Chris

Mark Rose 2015-11-14 18:31

[QUOTE=fivemack;416105]If I swap the cards round so the 580 is in the top slot, nvidia-smi only picks up the 580, and still gives me
[code]
pumpkin@pumpkin:~/msieve-cuda/trunk/X$ time ../msieve -g 0 -np1 "stage1_norm=1e25 0,1000"
error (line 71): CUDA_ERROR_NO_DEVICE
[/code][/QUOTE]

I missed this earlier. This confirms the problem is not in the application itself but in the driver or hardware. Have you confirmed that SLI is disabled in the BIOS, if there is a setting for it? In addition to /var/log/syslog, I would also check the X server log (usually /var/log/Xorg.0.log) to see whether X detects both cards.

Also, what is the output of ls -l /dev/nv* ?

It's possible the driver isn't making the device node for the second card. Since I use the onboard graphics in one system I have to manually make these with a script.
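
The script mentioned above is not shown in the thread. The following is a hypothetical stand-in, not that script. It assumes the character device numbers given in NVIDIA's Linux installation documentation (major 195, minor N for GPU N, minor 255 for the control node), and it only prints the commands so they can be reviewed before anything is created.

```shell
#!/bin/sh
# Print the mknod commands needed to create missing NVIDIA device nodes.
#   $1 = number of GPUs expected
#   $2 = device directory (defaults to /dev)
# Assumption: major 195, minor N for /dev/nvidiaN, minor 255 for nvidiactl.
# Nodes that already exist are skipped. Nothing is created by this function.
nvidia_mknod_commands() (
    count="$1"
    devdir="${2:-/dev}"
    i=0
    while [ "$i" -lt "$count" ]; do
        [ -e "$devdir/nvidia$i" ] || echo "mknod -m 666 $devdir/nvidia$i c 195 $i"
        i=$((i + 1))
    done
    [ -e "$devdir/nvidiactl" ] || echo "mknod -m 666 $devdir/nvidiactl c 195 255"
    exit 0
)
```

For a two-card system, `nvidia_mknod_commands 2` lists whatever is missing under /dev. If the list looks right, it can be applied with `nvidia_mknod_commands 2 | sudo sh`.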

fivemack 2015-11-14 19:46

To try another app, I downloaded the CUDALucas source and built it:

[code]
pumpkin@pumpkin:~/cudalucas-build/cudalucas-code$ ./CUDALucas -d 0 -threadbench 1 16 5 0

device_number >= device_count ... exiting
(This is probably a driver problem)
[/code]

fivemack 2015-11-17 21:40

I ended up installing Ubuntu 14.04 and then the .deb of drivers distributed by NVIDIA; that appears to work.

henryzz 2015-11-18 12:55

Maybe 16.04 will work. For something like CUDA support I would guess that the LTS versions are more likely to work.


All times are UTC.