#23
3²×853 Posts
Thanks for getting back to me. I might get some C6100s then and network them with InfiniBand. I have been trying it with a Supermicro server from the same era, but that won't work with more than four cards despite having multiple PCIe slots, so I need something else.
Does GPUDirect work for you, i.e. peer-to-peer communication between the cards, e.g. as determined by the simpleP2P CUDA sample program? I can't get this to work even with four cards on my existing board. Did you have to do anything special to get support for 8 cards per cable to show up in the web interface? Mine only goes up to 4, and I gather from the manual that this is expected for older versions of the C410X, but mine has a sticker on top saying that 8 are supported.
#24
Dec 2014
3·5·17 Posts
One C6100 contains 4 independent nodes, so you only need one C6100.
FYI, a C6100 can run on either 120 or 240 VAC. I've never tried GPUDirect, only mfaktc and CUDALucas. It was some work getting Linux set up right to talk to it. You need to have X Windows installed even though the GPUs do not have video outputs. Also, the default Ubuntu driver for NVIDIA cards is nouveau, which does not support CUDA. I will need to power up the box to look at the GUI.
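For what it's worth, the usual fix on Ubuntu is to blacklist nouveau so the NVIDIA driver can bind instead. A minimal sketch - the TARGET variable is only there so you can try it somewhere harmless first; on the real machine the file belongs in /etc/modprobe.d, written as root, followed by update-initramfs -u and a reboot:

```shell
# Sketch: blacklist the nouveau driver so the NVIDIA CUDA driver can bind.
# TARGET=/tmp is just for a dry run; use /etc/modprobe.d on the server.
TARGET=${TARGET:-/tmp}
cat > "$TARGET/blacklist-nouveau.conf" <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
cat "$TARGET/blacklist-nouveau.conf"   # show what was written
```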
#25
4494₁₀ Posts
Many thanks for replying again. That's useful information. After playing around with IPMI commands, I'm now fairly certain that there is nothing I can do to make mine work in 8:1 mode, despite the sticker saying that it does! I am wondering now whether the system board might have been replaced at some point. This at least is one feature for which you seem to have been more fortunate than me. I'm glad you pointed out that some cards didn't work - I have only tested 8 so far and must be sure to test the others before the short return period runs out.
I might try to buy a system board with a newer revision at some point to get 8:1 mode working. Does yours have a revision number printed on it, and does it have a sticker saying whether it supports 8:1 or not? My revision sticker is along the top, near the front, and the 8-to-1 sticker is towards the back, near the other sticker with PCIe information on the top. If I know what yours says, it might help me figure out which system board revision I need. Thanks again so far.
#26
Dec 2014
3·5·17 Posts
Mine has a small sticker on the front with a long number, then "rev a00".
The back has the "8 to 1 supported" sticker. This mode uses iPASS ports PCIe3 & PCIe1. If you do get a C6100, the 4 nodes should let you use "4 to 1" mode and cover all the GPUs. Mine came with short iPASS cables, but longer cables are available.
#27
5×7×13×17 Posts
Thanks again! I have a specialized use in mind that would benefit from all 8 cards in one machine, so I am probably going to buy a new middle board to see if I can get 8:1 to work. Then I would also only need a cheaper C6100 with just two boards.
It seems that there are 3 boards in the C410X, so I'm hoping that only the main board needs to be replaced and not also the boards with the iPASS connectors. Please could you run two commands on Linux and tell me the output when in 8:1 mode? It will help me to see which bridge connections are added in 8:1 mode, which should tell me which board is modified - some of the bridges are on the main board and others on the iPASS boards:
lspci | grep PEX
cat /proc/ioports
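To make the comparison concrete, one way is to capture the bridge list once in each mode and diff the captures. The PEX lines below are made-up stand-ins, not real C410X output - they are only there to show the shape of the comparison:

```shell
# Sketch: save `lspci | grep PEX` once per mode, then diff to see which
# bridges 8:1 mode adds. The two sample captures are illustrative only.
cd "$(mktemp -d)"
printf '0b:00.0 PCI bridge: PLX Technology PEX 8647\n' > pex-4to1.txt
printf '0b:00.0 PCI bridge: PLX Technology PEX 8647\n0e:04.0 PCI bridge: PLX Technology PEX 8647\n' > pex-8to1.txt
diff pex-4to1.txt pex-8to1.txt || true   # diff exits 1 when the lists differ
```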
#28
Dec 2014
255₁₀ Posts
nvidia-smi shows 7 GPUs available.
lspci | grep PEX
Quote:
Quote:
#29
3³×59 Posts
Thanks! But was the output from ioports cut off? The GPUs appear to be completely missing there...
#30
Dec 2014
377₈ Posts
The ioports output is not cut off.
dmesg | grep -i pci | grep io
gives 533 lines of output. Below is the first part of it; the rest seems to repeat. It sounds like the kernel allocated I/O ports but then released them. Quote:
#31
2×349 Posts
This is interesting - the server I am currently using doesn't free the resources like that, likely because it runs Red Hat rather than Ubuntu and so has an older kernel. I will try that instead.
This raises the possibility that all 16 cards in one machine might work after all, given an appropriate machine with two x16 slots, despite the warning in the documentation: if the I/O ports are used only during configuration and then released afterwards, the I/O space limitation seemingly no longer applies.

Anyway, the lack of I/O space explains why your eighth card is not working - at least, that's what the log appears to say. Perhaps some other device was initialized in the middle of I/O space and released later; the best solution might be to identify it. Failing that, you could disable other devices associated with the I/O ports in use. I imagine that devices 01:00.0 and 01:00.1 are the onboard ethernet - you could verify this with lspci - so you probably can't do much about those. What is 1a:04.0? You can find out (sorry if this is telling you something you already know) with:
lspci -s 1a:04.0
If it is a RAID controller, then you probably don't need AHCI mode and could disable the onboard SATA in the BIOS. That would free up enough I/O space for the eighth card. Even if you do need onboard SATA, you could try temporarily switching to IDE mode in the BIOS - drive performance would be slower, but you might get your eighth GPU. If this is just a server and you don't need local peripherals, disabling USB in the BIOS would free the I/O ports associated with UHCI, which would also likely let the remaining GPU fit into I/O space. Changing things like the AHCI version settings in the BIOS might help too.

Now, along the lines of my earlier remark about 16 cards, it might be possible simply to rescan the eighth GPU after the other I/O ports have been released. Then you wouldn't need to disable any other hardware. I have some suggested commands below; this is still relatively new to me - I have been reading into it to try to fix the problems with my existing server - but I think something here might help.
Before I begin: some of these commands might crash the server, so I want to mention that in advance - I hope I don't appear condescending, but I also want to include an appropriate warning so that I am not responsible for any data loss or other issues. Everything here requires sudo.

First, be sure that the "released" messages have appeared in dmesg and that /proc/ioports does not include the GPUs, as in the sample you showed me. Then you could try:
echo 1 > /sys/bus/pci/devices/0000:0e:04.0/rescan
echo 1 > /sys/bus/pci/devices/0000:10:00.0/rescan
For the first line, it might be more appropriate to rescan the pci_bus entry:
cd /sys/bus/pci/devices/0000:0e:04.0/pci_bus
cd <whatever is in there>
echo 1 > rescan
You might also need to rescan the affected PEX 8647 bridge, though I suspect this might rescan too many of the other GPUs and fill the I/O space again:
echo 1 > /sys/bus/pci/devices/0000:0b:00.0/rescan
Again, it might be more appropriate to rescan the entry inside pci_bus. I am essentially just suggesting that you rescan the affected GPU and the bridges whose BARs could not be assigned. This assumes that GPU 0000:10:00.0 is still the missing one; if it changes on each boot, these values will need to be modified. Similarly, if the bridge IDs with the missing BARs in the supplied log sample change on each boot, those values need to change in these commands too.

You may need to remove the existing improperly-configured GPU first. To do that, run this sequence instead (again, check that the ioport range is still truncated, in case any of the above commands filled it again - you might need to reboot if they did):
echo 1 > /sys/bus/pci/devices/0000:10:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:0e:04.0/rescan
Again, rescanning the other bridge too might help, or might cause problems.
Also again, perhaps the bridge should be rescanned from its pci_bus entry rather than directly like this. Afterwards, you can see whether the card shows up, if not in nvidia-smi, then at least in /proc/ioports. It would also be interesting to see the dmesg output. If nothing appears in nvidia-smi, the nvidia driver may need to be reloaded:
modprobe nvidia
Maybe run modprobe -r nvidia to remove it first, then reload. I'm not sure whether that makes any difference in this case.

If this doesn't work, then setpci can probably be used to manually assign the missing I/O space for GPU 8 and the associated bridges once it becomes available again - perhaps you know how to do that? I would be interested to see the result if you do. Otherwise, I can try to make suggestions, but will probably need to ask you for some more information from other commands first.

Finally, just out of my own interest, please can you send me the output from:
lspci | grep -i nvidia
I am looking not only for NVIDIA GPUs, but to see whether you have an NVIDIA bridge chip on your host interface card.
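In case it helps, here is the remove-then-rescan sequence gathered into one hedged script. The addresses are the ones quoted in this thread and may differ on your machine or change between boots; with DRY_RUN=1 (the default) it only prints what it would write, so nothing is touched until you set DRY_RUN=0 and run it as root:

```shell
#!/bin/sh
# Sketch of the remove-then-rescan sequence for the GPU that lost I/O space.
# Addresses are from this thread; adjust them to match your own lspci output.
GPU=0000:10:00.0        # the improperly-configured GPU
BRIDGE=0000:0e:04.0     # the bridge above it
DRY_RUN=${DRY_RUN:-1}   # 1 = print only; 0 = actually write (needs root)

poke() {  # poke <value> <sysfs-file>
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: echo $1 > $2"
  else
    echo "$1" > "$2"
  fi
}

poke 1 "/sys/bus/pci/devices/$GPU/remove"     # drop the half-configured GPU
poke 1 "/sys/bus/pci/devices/$BRIDGE/rescan"  # rescan from its bridge
```

Afterwards, check /proc/ioports and dmesg again, and reload the nvidia module if the card still doesn't show in nvidia-smi.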
#32
11²×59 Posts
One other thing - please can you send the output from:
lspci -tv
It might be necessary to attach it as a text file. It should help me see what is going on with your BAR errors a little more clearly. Thanks.
#33
3·877 Posts
And yet one more thing - this should help me finally understand what I need to do to get my C410x working in 8:1 mode, i.e. it should tell me what minimum board version I need. Please can you run the following command from any Linux computer on the same network as your C410x:
ipmitool -H x.x.x.x -U root -P root fru print
[where x.x.x.x is your C410x IP, the root after -U is the username, and the root after -P should be changed to the password]. You might need:
sudo apt-get install ipmitool
if it doesn't work right away. I am particularly interested in the Board Mfg Date and Board Part Number. My entries are:
Board Mfg Date : Sun Jan 9 04:52:00 2011
Board Part Number : A00
I imagine that yours will have a later Mfg date and may have part number A01 or something like that, in which case I know what to buy to get 8:1 on mine. I would just like to make sure that yours isn't an A02 or something even higher. Curiously, the product serial (which is the Dell service tag) shows on the Dell support website that this unit shipped in the 8:1 configuration! So I guess they put the wrong board in by mistake, or else replaced it later with the wrong one under warranty. Anyway, at least that gives me more confidence that 8:1 will work with the right new board.
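If it helps to pull out just those two fields, the FRU dump can be piped through grep. The sample text below reuses the values quoted in this post so the filter can be seen working; on the real chassis you would pipe the ipmitool command into the same grep:

```shell
# Filter a FRU dump down to the two fields of interest. For real use:
#   ipmitool -H x.x.x.x -U root -P root fru print | grep -E 'Board Mfg Date|Board Part Number'
# The sample input below reuses the values quoted earlier in the thread.
fru_sample='Board Mfg Date        : Sun Jan  9 04:52:00 2011
Board Part Number     : A00'
printf '%s\n' "$fru_sample" | grep -E 'Board Mfg Date|Board Part Number'
```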