GitBook: [#116] Add notes on restoring functionality of "missing" GPUs
msherman64 authored and gitbook-bot committed Sep 20, 2022
1 parent a7f8166 commit cb4470e
Showing 1 changed file with 14 additions and 7 deletions.
21 changes: 14 additions & 7 deletions docs/operations/troubleshooting/troublesome-hardware.md
@@ -6,8 +6,6 @@ description: An exciting menu of footguns

The following hardware has known issues. Sometimes this is because the hardware or software is flaky, and sometimes it's because extensive use uncovers edge cases.



## Networking

### Dell OS10 Switches
@@ -18,7 +16,7 @@ Dell switches running OS10 default to zero-touch-provisioning mode. Even if you

To check, run `show ztd-status`

``
\`\`

```
# The bad state
@@ -56,8 +54,6 @@ You'll observe that, despite no useful logs, some ports are up but not passing t

You can check this by launching a known-good node, then running `show mac-address-table interface <node_port_interface>`



To check your current config, run `show spanning-tree`

As a workaround, set the switch to `RSTP` instead of `Rapid-PVST` mode.
@@ -79,9 +75,20 @@ Spanning tree enabled protocol rstp with force-version rstp
...
```



### Dell FX2 Chassis Switch

TODO

## GPUs

### nvidia-smi reports "no devices found" or "could not communicate with driver"

This error occurs occasionally on GPU nodes, with no consistent cause. It is usually a software problem and can generally be resolved by some combination of the following steps:

* On Ubuntu, run the distribution's driver installation utility (`ubuntu-drivers`) and reboot
* Manually install the most recent NVIDIA drivers using the instance's package manager and reboot
* Install the `nvidia-dkms` module and reboot
* On Ubuntu, install the hardware enablement stack (`linux-generic-hwe-<version>`) and its recommended packages, and reboot
* Install the appropriate Linux kernel headers and image and reboot
* Check the kernel module blacklist to see whether the NVIDIA drivers have been blacklisted
* Search the `dmesg` logs for "nvidia" and look for obvious errors that may indicate a hardware issue
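
The last two checks in the list can be scripted. The following is a minimal sketch, not a definitive procedure: the function names `find_nvidia_blacklist` and `nvidia_log_lines` are hypothetical, and it assumes the blacklist lives in the usual `/etc/modprobe.d` directory and that the driver tags its kernel messages with `nvidia` or `NVRM`.

```shell
#!/bin/sh
# Hypothetical helpers for the blacklist and dmesg checks listed above.

# Print modprobe config lines that blacklist an nvidia module.
# $1: directory to search (assumed default: /etc/modprobe.d).
find_nvidia_blacklist() {
    dir="${1:-/etc/modprobe.d}"
    grep -rHiE '^[[:space:]]*blacklist[[:space:]]+nvidia' "$dir" 2>/dev/null
}

# Filter kernel log text on stdin down to lines that mention the driver.
nvidia_log_lines() {
    grep -iE 'nvidia|NVRM'
}

# Example usage on an affected node:
#   find_nvidia_blacklist
#   sudo dmesg | nvidia_log_lines
#   lsmod | grep -i nvidia    # is the kernel module loaded at all?
```

No output from `find_nvidia_blacklist` suggests the driver is not blacklisted in that directory; other locations such as `/usr/lib/modprobe.d` may also be worth checking.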
