Here is how this started: on our lab machines, every time we rebooted after an update, running nvidia-smi would fail with the following message:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

In other words, the NVIDIA driver had stopped working again.

Install the driver with the package manager instead of a .run file

A likely cause is that the driver was installed with the installer bundled in the CUDA runfile.

During installation, people usually run:

sudo sh cuda_<Cuda>_<Driver>_linux.run

rather than

sudo sh cuda_<Cuda>_<Driver>_linux.run --dkms

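Without DKMS, the driver's kernel module is built only for the kernel that was running at install time, so the driver stops working once the kernel is upgraded. That pinpoints the problem: apt upgrade installs a new kernel, the reboot makes it take effect, and the driver can no longer load.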
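To confirm this diagnosis on a given machine, one option is to check whether an NVIDIA kernel module actually exists for the kernel in question. The function below is only a rough sketch: has_nvidia_module is a hypothetical helper name, and it assumes modules live under /lib/modules/<release> and are named nvidia.ko (possibly compressed), which may not hold for every packaging.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if an NVIDIA kernel module exists for a
# given kernel release.
# Usage: has_nvidia_module <kernel-release> [modules-root]
has_nvidia_module() {
    local release="$1"
    local root="${2:-/lib/modules}"   # assumed default module location
    # Matches nvidia.ko and compressed variants such as nvidia.ko.xz.
    # A missing directory makes find print nothing, so the check fails.
    [ -n "$(find "$root/$release" -name 'nvidia.ko*' -print -quit 2>/dev/null)" ]
}
```

For example, `has_nvidia_module "$(uname -r)" || echo "no NVIDIA module for the running kernel"` reports the broken state described above. If DKMS is installed, `dkms status` also lists which modules it manages and for which kernels.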

So the driver should be installed through the package manager instead.

First, add the graphics drivers PPA and install ubuntu-drivers-common, Ubuntu's driver detection tool.

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt install ubuntu-drivers-common

Then use the following command to see which driver should be installed. Taking an RTX 2080 Ti as an example:

ubuntu-drivers devices

Output:

== /sys/devices/pci0000:64/0000:64:00.0/0000:65:00.0/0000:66:10.0/0000:68:00.0 ==
modalias : pci:v000010DEd00001E04sv00001458sd000037C4bc03sc00i00
vendor   : NVIDIA Corporation
driver   : nvidia-driver-450-server - distro non-free
driver   : nvidia-driver-450 - distro non-free
driver   : nvidia-driver-460 - distro non-free recommended
driver   : nvidia-driver-440-server - distro non-free
driver   : nvidia-driver-410 - third-party free
driver   : nvidia-driver-415 - third-party free
driver   : nvidia-driver-418-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
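If you would rather not read the list by eye, the recommended package can be picked out with a small filter. This is a minimal sketch: recommended_driver is a hypothetical helper name, and it assumes the output format shown above, where the recommended entry is a "driver" line ending in the word "recommended". Other versions of ubuntu-drivers may print something different.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the package that `ubuntu-drivers devices`
# marks as recommended. Reads the command's output from stdin, so it can
# also be run against saved output.
recommended_driver() {
    # Fields of a driver line: "driver", ":", <package>, "-", ..., [flag]
    awk '$1 == "driver" && $NF == "recommended" { print $3 }'
}
```

Usage: `ubuntu-drivers devices | recommended_driver` should print nvidia-driver-460 for the listing above.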

The listing shows that the 460 driver is the recommended one. You can also use

apt-cache search nvidia-driver

to confirm the package name. Once you have confirmed it, install the package directly and reboot.

sudo apt install nvidia-driver-460 -y
sudo reboot

After rebooting, run nvidia-smi. If it prints its status table, the installation succeeded.
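Since the whole problem was the driver silently breaking after a reboot, it can be handy to script this check, for example to run it after each update. The sketch below relies only on nvidia-smi exiting with a non-zero status when it cannot talk to the driver. check_driver is a hypothetical helper name, and its optional argument (the command to run) exists purely so the function can be tried without a GPU.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether the NVIDIA driver responds.
# Usage: check_driver [command]   (command defaults to nvidia-smi)
check_driver() {
    local smi="${1:-nvidia-smi}"
    if "$smi" >/dev/null 2>&1; then
        echo "driver OK"
    else
        # Covers both a failing nvidia-smi and a missing binary.
        echo "driver not responding"
        return 1
    fi
}
```

Calling `check_driver` with no argument after a reboot prints "driver OK" when nvidia-smi succeeds and returns a non-zero status otherwise.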

Afterword

In memory of the hours I spent reinstalling the driver over and over before I finally figured out the cause.


Reference

  1. https://forums.developer.nvidia.com/t/nvidia-driver-not-work-after-reboot-on-ubuntu/70831/2
  2. https://gitpress.io/@chchang/install-nvidia-driver-cuda-pgstrom-in-ubuntu-1804