#45
3×13×61 Posts
After posting my earlier message, it occurred to me that you may not actually have taken it apart far enough to see the iPass circuit boards. At first I thought you had disassembled the whole thing, but a glance at the manual suggests that you might have been able to get the info you gave me just by removing the fans. If so, I don't want to trouble you to take everything out just to see the iPass boards! The information probably wouldn't help that much anyway.
The strange thing here is that my board serial number is only 19 lower than yours, so they were probably part of the same batch. Very odd that it doesn't work on mine! I might try opening it tomorrow and checking that the jumpers match yours. From other photos I have seen, it also looks like there is a CMOS battery on the board - I might take that out for a couple of minutes in case it resets something.

One other thing: what is your firmware version, as displayed in the web interface? With that in mind, have you considered a firmware update? An upgrade is risky, of course - it takes several minutes and any failure could leave you needing to replace the main board - but perhaps it might help with the cards in the slots that don't work. However, I would be very surprised if it helped with the I/O address issue that is stopping the eighth card from working.

Also, did you try removing and re-inserting (with power off) the GPU cards that don't work, just in case they are not seated correctly?

I looked at the lspci -tv output and was surprised by the result, so my most recent post about the order of remove/rescan stands. For further progress, setpci might be the answer. I haven't used it before, but I more or less know what it does - the aim is to explicitly reallocate the I/O ports that could not be allocated initially, now that they have been freed.

Please can you send the output from:

Code:
lspci -vv -s 10:00.0
lspci -vv -s 11:00.0

This assumes that 10:00.0 is still the card that doesn't work when you next boot. The idea is just to compare the output for a working and a non-working card in order to identify the ports that weren't allocated, although I am not certain how well this will work, given that the other ports have been released anyway.

It might also be useful to see:

Code:
dmesg | grep 11:00.0
dmesg | grep 10:00.0
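Once you have both, something like this should line them up (just a sketch - the file names are arbitrary, and I'm assuming 10:00.0 is still the bad card):

Code:
sudo lspci -vv -s 10:00.0 > bad.txt
sudo lspci -vv -s 11:00.0 > good.txt
diff bad.txt good.txt | grep -iE "region|i/o"

The I/O BAR should show up in the "Region" lines, so the grep is just to cut the diff down to the interesting part.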
#46
Dec 2014
3×5×17 Posts
Web firmware version is 1.25.

Not using sudo leaves the capabilities as <no access>.
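For example (the exact wording may vary with the pciutils version):

Code:
lspci -s 10:00.0 -vv | grep Capabilities        # as a normal user: <access denied>
sudo lspci -s 10:00.0 -vv | grep Capabilities   # as root: the full capability list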
Last fiddled with by bgbeuning on 2016-05-23 at 00:45
#47
11×673 Posts
Try this (noting that it might cause the system to crash or lock up):
Code:
sudo setpci -v -s 10:00.0 BASE_ADDRESS_5=0000dc01

Then try rescanning the bus, as before. Here I am trying to manually assign some I/O space to the GPU that isn't working, and then get it picked up by the bus. However, compatible settings may also need to be put in place manually on the bridges. For that, it might help to see:

Code:
sudo lspci -s 0e:04.0 -vv
sudo lspci -s 0b:00.0 -vv

The above were two bridges that gave errors in the logs you showed earlier. It would also be helpful to have two equivalent ones that did not:

Code:
sudo lspci -s 14:04.0 -vv
sudo lspci -s 09:04.0 -vv

Also, please can you try the following; it might give a different tree output:

Code:
lspci -btv
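(To be clear, by rescanning I mean the usual sysfs method - a minimal sketch, assuming a reasonably recent kernel:)

Code:
# optionally forget the device first, then rescan the whole bus
echo 1 | sudo tee /sys/bus/pci/devices/0000:10:00.0/remove
echo 1 | sudo tee /sys/bus/pci/rescan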
#48
2²×593 Posts
Please can you also send the output from the simpleP2P program in the CUDA samples? It would be very helpful for me to know whether GPUDirect is working on your system. To do this, go to the CUDA samples directory. If you haven't already installed the samples, just run:

Code:
cuda-install-samples-7.5.sh <dirname>

Then, in the directory that you just specified, go to:

Code:
0_Simple/simpleP2P/

Run make, then run ./simpleP2P when make has finished. In the initial part of my output the P2P checks say "no"; I'm hoping that yours will say "yes".
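Putting the steps together in one go - this assumes the installer creates its default NVIDIA_CUDA-7.5_Samples directory under the path you give it (adjust if yours differs):

Code:
cuda-install-samples-7.5.sh ~
cd ~/NVIDIA_CUDA-7.5_Samples/0_Simple/simpleP2P
make
./simpleP2P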
Last fiddled with by anonymous on 2016-05-23 at 02:40
#49
Dec 2014
3·5·17 Posts
When I powered up the C410X, 4 boards did not start. After pushing their reset buttons, 2 started but 2 are still off. The result is that the PCI 10 address is online this time.

I am pretty convinced the I/O ports are why I am only seeing 7 of 8 cards, and for me the simplest solution is to switch to 4:1 mode and use all 4 nodes in the C6100.

Here is some of the output you asked for. The simpleP2P output said "yes" instead of "no".
#50
2²×3×5×163 Posts
Thanks - it's good to know that GPUDirect works.
The new lspci output unfortunately didn't give me any more ideas about your issue. I would suggest trying the IPMI commands for the C410X that I linked to before: http://www.dell.com/support/article/us/en/04/SLN244176. Sometimes a few of my cards do not start, and I am able to fix the problem by sending power commands to the individual slots, as described in that link.
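The commands have roughly this shape - <bmc-ip> and <password> are placeholders, and so are the netfn/cmd/slot bytes in the second line (take the real ones from the Dell article), so treat it as a template rather than something to run literally:

Code:
# overall chassis power state
ipmitool -I lanplus -H <bmc-ip> -U root -P <password> chassis power status
# per-slot power uses a vendor-specific raw command
ipmitool -I lanplus -H <bmc-ip> -U root -P <password> raw <netfn> <cmd> <slot> <on/off>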
#51
2⁴×3²×11 Posts
Just a brief update about how things eventually turned out for me:
I purchased a Dell C6220 II on eBay. It uses newer Sandy Bridge or Ivy Bridge CPUs, compared with the Nehalem or Westmere CPUs in the C6100, and it supports full 64-bit addressing for PCI, so it can potentially work with more GPUs and with newer GPUs. It is also available in a configuration with 2x2U nodes that have 2 full-height PCI slots per node, rather than the 4x1U nodes of the C6100 with only one low-profile PCI slot each.

I also eventually found that the system board in the C410X did not need to be replaced to enable 8:1 mode; instead, the upper iPass board needed to be replaced. Both versions have the same part number (H0NJN), but the new one says "Rev 2.1" on the PCB while the original says "Rev 1.0", and the new one has a sticker marked "DJCN65" where the older one is marked "BGCN15". I now have 8:1 working without a new system board, with firmware 1.35 loaded.

8 GPUs per node works fine with my C6220 II. I even tried a dual HIC card (NVIDIA P894, which requires a full-height PCI Express slot), but found that I could get only up to 10 GPUs per node - there were insufficient I/O ports available for more. Even that required disabling the mezzanine card slot (in which I have an Infiniband card) and the RAID controller, and booting only from the onboard SD card slot.

There is a UEFI option in the BIOS that I haven't tried yet - I am hopeful that it might allow an extra 1-2 GPUs per node by freeing some I/O ports that are currently allocated to bridges but seemingly unused.
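In case anyone else hits the same wall: if I understand correctly, legacy PCI I/O space is only 64 KiB in total and each bridge window is 4 KiB-aligned, so it runs out after roughly 16 windows. A quick way to see who holds what (just a sketch):

Code:
# the kernel's map of the 64 KiB legacy I/O port space
cat /proc/ioports
# per-device I/O BARs and per-bridge I/O windows
sudo lspci -vv | grep -iE "^[0-9a-f]+:|i/o (ports|behind)"

Last fiddled with by anonymous on 2016-06-08 at 12:45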
#52
Dec 2014
3·5·17 Posts
Sounds like you are making great progress. Glad to hear it.
It's sad the NVIDIA driver is closed source; otherwise someone could change it to:
1. Allocate I/O ports
2. Init the GPU
3. Release the I/O ports
4. Repeat for all GPUs
#53
Bamboozled!
May 2003
Down not across
11662₁₀ Posts
"Someone" could also ask Nvidia to make those changes ...
Thread | Thread Starter | Forum | Replies | Last Post |
Running multiple ecm's | johnadam74 | GMP-ECM | 21 | 2019-10-27 18:04 |
Multiple GPU's Windows 10 | cardogab7341 | GPU Computing | 5 | 2015-08-09 13:57 |
Using multiple PCs | numbercruncher | Information & Answers | 18 | 2014-04-17 00:17 |
Running multiple copies of mprime on Linux | hc_grove | Software | 3 | 2004-10-10 15:34 |
Multiple systems/multiple CPUs. Best configuration? | BillW | Software | 1 | 2003-01-21 20:11 |