mersenneforum.org  

Old 2016-01-24, 16:23   #1
bgbeuning
 
Dec 2014

3·5·17 Posts
Linux and multiple GPU

I have a Linux (Ubuntu 14.04 desktop 64-bit) box with 8 GPUs.
The kernel sees all 8 at boot.

Quote:
[ 15.482436] nvidia: module license 'NVIDIA' taints kernel.
[ 15.492355] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 15.504224] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0000:0f:00.0 on minor 1
[ 15.504392] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0000:10:00.0 on minor 2
[ 15.504494] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0000:11:00.0 on minor 3
[ 15.504576] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0000:12:00.0 on minor 4
[ 15.504656] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0000:15:00.0 on minor 5
[ 15.504727] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0000:16:00.0 on minor 6
[ 15.504797] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0000:17:00.0 on minor 7
[ 15.504870] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0000:18:00.0 on minor 8

There are /dev files for all 8 devices.

Quote:
crw-rw-rw- 1 root root 195, 0 Jan 24 10:24 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 Jan 24 10:24 /dev/nvidia1
crw-rw-rw- 1 root root 195, 2 Jan 24 10:24 /dev/nvidia2
crw-rw-rw- 1 root root 195, 3 Jan 24 10:24 /dev/nvidia3
crw-rw-rw- 1 root root 195, 4 Jan 24 10:24 /dev/nvidia4
crw-rw-rw- 1 root root 195, 5 Jan 24 10:24 /dev/nvidia5
crw-rw-rw- 1 root root 195, 6 Jan 24 10:24 /dev/nvidia6
crw-rw-rw- 1 root root 195, 7 Jan 24 10:24 /dev/nvidia7
crw-rw-rw- 1 root root 195, 255 Jan 24 10:22 /dev/nvidiactl
crw-rw-rw- 1 root root 248, 0 Jan 24 10:24 /dev/nvidia-uvm

'mfaktc -d 0' through 'mfaktc -d 6' all work, but 'mfaktc -d 7' fails with an error.

Quote:
CUDA version info
binary compiled for CUDA 4.20
CUDA runtime version 4.20
CUDA driver version 7.50
cudaSetDevice(7) failed
cudaGetLastError() returned 10: invalid device ordinal

Any idea why I cannot use the 8th GPU?

Actually there are 2 Linux boxes, each with 8 GPUs.
Both can only access 7 of 8. The GPUs are in a Dell c410x,
with an iPASS cable from the Dell box going to a PCI card
in each Linux box.

Thanks,
Old 2016-01-24, 19:05   #2
airsquirrels
 
 
"David"
Jul 2015
Ohio

11·47 Posts

What is the output of nvidia-smi?

Are all of the GPUs the same?
Old 2016-01-24, 19:16   #3
bgbeuning
 
Dec 2014

3·5·17 Posts

It sees 7 GPUs.
The seller claimed they were all Tesla M2090.
Maybe it has a mix.

Quote:
Sun Jan 24 14:12:06 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.68     Driver Version: 352.68         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M2070         Off  | 0000:0F:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |     80MiB /  5375MiB |     16%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M2070         Off  | 0000:11:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |     80MiB /  5375MiB |      9%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M2070         Off  | 0000:12:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |     80MiB /  5375MiB |     14%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M2070         Off  | 0000:15:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |     80MiB /  5375MiB |     14%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla M2070         Off  | 0000:16:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |     80MiB /  5375MiB |      9%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla M2070         Off  | 0000:17:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |     80MiB /  5375MiB |     18%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla M2070         Off  | 0000:18:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |     80MiB /  5375MiB |     10%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      2190    C   ../mfaktc-0.20/mfaktc.exe                       68MiB |
|    1      2191    C   ../mfaktc-0.20/mfaktc.exe                       68MiB |
|    2      2192    C   ../mfaktc-0.20/mfaktc.exe                       68MiB |
|    3      2193    C   ../mfaktc-0.20/mfaktc.exe                       68MiB |
|    4      2194    C   ../mfaktc-0.20/mfaktc.exe                       68MiB |
|    5      2195    C   ../mfaktc-0.20/mfaktc.exe                       68MiB |
|    6      2196    C   ../mfaktc-0.20/mfaktc.exe                       68MiB |
+-----------------------------------------------------------------------------+
Old 2016-01-24, 19:39   #4
airsquirrels
 
 
"David"
Jul 2015
Ohio

11×47 Posts

My biggest Nvidia system only has 6 GPUs so I can't tell if that's a driver limit.

Run "lspci -mm | grep VGA". Does it show another internal GPU?

If possible, turn that one off; an onboard GPU caused me trouble with 8 AMD GPUs.
Old 2016-01-24, 20:07   #5
Mark Rose
 
 
"/X\(‘-‘)/X\"
Jan 2013

29×101 Posts

That might be it. Each GPU needs a bit of BIOS memory. Nvidia supports at least 8 GPUs.
Old 2016-01-25, 01:20   #6
bgbeuning
 
Dec 2014

255₁₀ Posts

Quote:
$ lspci -mm | grep -i vga
1a:04.0 "VGA compatible controller" "ASPEED Technology, Inc." "ASPEED Graphics Family" -r10 "Inventec Corporation" "Device 0047"

Since the Tesla GPU cards have no video output, I think it kinda needs the VGA driver.

The documentation (for the Dell c410x) explains that each GPU needs 4K I/O ports
but an x86 only has 64K I/O ports. So one machine cannot host 16 GPUs
(16 x 4K would use up the entire port space).
When I look at /proc/ioports, it only shows about 1/4 of the I/O ports in use.
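
The arithmetic behind that limit can be written out as a small Python sketch. It uses only the two figures from the post above (4K ports per GPU, 64K ports in total); the `reserved` parameter is a made-up knob for whatever the rest of the system claims, not a number from the Dell documentation.

```python
IO_PORT_SPACE = 64 * 1024   # size of the x86 I/O port space, as stated above
PORTS_PER_GPU = 4 * 1024    # per-GPU requirement quoted from the c410x docs


def max_gpus_by_io_ports(reserved=0):
    """Upper bound on GPUs that fit in the I/O port space.

    `reserved` is a hypothetical count of ports already claimed by other
    devices; the real value depends on the host and is not given here.
    """
    return (IO_PORT_SPACE - reserved) // PORTS_PER_GPU


if __name__ == "__main__":
    # With nothing else using I/O ports, 16 GPUs would fill the space exactly.
    print(max_gpus_by_io_ports())
    # Reserve a single 4K block for everything else and the bound drops by one.
    print(max_gpus_by_io_ports(reserved=4 * 1024))
```

By this estimate 8 GPUs need only half of the port space, so port exhaustion alone would not obviously explain a missing 8th card.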

The box can be reconfigured to appear as 4 GPUs on each of 4 iPASS ports, but then I
need to dedicate 4 Linux boxes to hosting the GPUs. Maybe that will be
another weekend project.

Thanks,
Old 2016-01-25, 01:33   #7
Mark Rose
 
 
"/X\(‘-‘)/X\"
Jan 2013

29×101 Posts

I would also ask on http://devtalk.nvidia.com
Old 2016-01-25, 02:18   #8
airsquirrels
 
 
"David"
Jul 2015
Ohio

11×47 Posts

Is the system running headless? Linux generally doesn't require an actual VGA display; I run mine headless.

I also have some scripts that generate a fake display for the purpose of accessing certain nvidia-settings options.

Also, check nvidia-xconfig/xorg.conf; if that is set up incorrectly it can block some GPUs from appearing. Better yet, stop X altogether and see if you get the same device count.

Somewhere in the dmesg log there should be an entry indicating a problem with the card, if there is one.

Do all 8 show in lspci -vvv? Any differences in link speed?

Finally, what host CPU(s)? Is this an external PCIe backplane with a lane switch, or is it using CPU lanes? lspci will give us link speeds for each card.
Old 2016-01-25, 17:31   #9
TheJudger
 
 
"Oliver"
Mar 2005
Germany

1111₁₀ Posts

Hi!
  • As 'nvidia-smi' only shows GPUs 0 to 6 (7 GPUs), I think it is safe to assume this is NOT an mfaktc issue. The GPU at PCI 0000:10:00.0 is missing from the output, although this bus ID is shown earlier in the kernel log.
  • I have had my hands on many systems with 8 GPUs so far; most driver versions should easily handle this.
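
The comparison above (bus IDs in the kernel log versus bus IDs in the nvidia-smi output) can be sketched in Python. This is only an illustration using text copied from posts #1 and #3; the helper names `bus_ids` and `missing_gpus` are made up for this example, and on a live system the two inputs would come from `dmesg` and `nvidia-smi` instead.

```python
import re

# PCI address in domain:bus:device.function form, e.g. 0000:0f:00.0
BUS_ID = re.compile(r"\b[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\b")

# Kernel log lines from post #1.
KERNEL_LOG = """\
[ 15.504224] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0000:0f:00.0 on minor 1
[ 15.504392] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0000:10:00.0 on minor 2
[ 15.504494] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0000:11:00.0 on minor 3
[ 15.504576] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0000:12:00.0 on minor 4
[ 15.504656] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0000:15:00.0 on minor 5
[ 15.504727] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0000:16:00.0 on minor 6
[ 15.504797] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0000:17:00.0 on minor 7
[ 15.504870] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0000:18:00.0 on minor 8
"""

# Bus-Id column values from the nvidia-smi output in post #3.
SMI_BUS_IDS = """\
0000:0F:00.0 0000:11:00.0 0000:12:00.0 0000:15:00.0
0000:16:00.0 0000:17:00.0 0000:18:00.0
"""


def bus_ids(text):
    """Return the set of PCI bus IDs found in text, lower-cased."""
    return {match.lower() for match in BUS_ID.findall(text)}


def missing_gpus(kernel_log, smi_output):
    """Bus IDs the kernel initialized that nvidia-smi does not list."""
    return sorted(bus_ids(kernel_log) - bus_ids(smi_output))


if __name__ == "__main__":
    print(missing_gpus(KERNEL_LOG, SMI_BUS_IDS))  # ['0000:10:00.0']
```

Lower-casing matters here because the kernel log prints `0f` while nvidia-smi prints `0F` for the same card.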

Maybe(!) it is as simple as a missing power cable for one of the cards.

Oliver

P.S. perhaps the system is better used for CUDALucas?
Old 2016-01-25, 18:09   #10
ixfd64
Bemusing Prompter
 
 
"Danny"
Dec 2002
California

2367₁₀ Posts

Slightly off-topic, but do you have one of these?

http://www.amax.com/hpc/productdetai..._id=XG-4802Gk8
Old 2016-01-25, 19:21   #11
airsquirrels
 
 
"David"
Jul 2015
Ohio

11·47 Posts

Quote:
Originally Posted by ixfd64
Slightly off-topic, but do you have one of these?

http://www.amax.com/hpc/productdetai..._id=XG-4802Gk8

From the picture, it looks like that's based on the Supermicro board/chassis. That's exactly what hosts my sets of 8 Fury X cards.

All times are UTC.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.