
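One-click NVIDIA driver installation on Ubuntu 22.04
- Download the driver from the official NVIDIA website
- My environment is an NVIDIA RTX A5000
- NVIDIA documentation for reference
- Before installing the driver, confirm your GPU model with lspci | grep -i vga (the device shows up as a number such as 2231) and look that number up for reference
- Docker support for NVIDIA: the Docker section at the end of the script is commented out (wrapped between the <<! and ! lines); remove those two marker lines to enable it
- Be sure to reboot the server afterwards with reboot; the driver does not take effect until you do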
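The model check from the list above can be scripted. The following is a minimal sketch, not part of the original script: the function name `nvidia_device_ids` is hypothetical, and it assumes the `lspci -nn` output format, where each device line ends with a `[vendor:device]` pair and NVIDIA's vendor ID is `10de` (the same ID that appears in the pci-ids URL inside the script).

```shell
# Read `lspci -nn` style text on stdin and print the device ID of every
# line that carries an NVIDIA [10de:xxxx] pair, one ID per line.
# Hypothetical helper for illustration only.
nvidia_device_ids() {
    sed -n 's/.*\[10de:\([0-9a-f]\{4\}\)\].*/\1/p'
}
```

On a real host this could be fed with `lspci -nn | grep -iE 'vga|3d controller' | nvidia_device_ids`, so that only display devices are considered.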
vim /nvidia_install.sh
#!/bin/bash
# -*- coding: utf-8 -*-
# Author: CIASM
# update 2025/02/27
# make.ha

<<!
# check nvidia
lspci | grep -i vga
http://pci-ids.ucw.cz/mods/PC/10de/2204

#add-apt-repository ppa:graphics-drivers/ppa
!

echo "remove nvidia"
# Quote the pattern so the shell hands it to apt instead of expanding it against local files
apt remove -y 'nvidia*'

echo "add nvidia repo"
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

echo "check host nvidia"
ubuntu-drivers devices

echo "install nvidia"
apt-get update
apt install -y nvidia-driver-535

echo "install docker NVIDIA GPU"
apt install -y nvidia-container-toolkit

echo "install NVIDIA CUDA Toolkit"
apt install -y nvidia-cuda-toolkit

echo "nvidia persist mode"
nvidia-smi -pm 1

# Docker NVIDIA support configuration
<<!
echo "docker daemon.json"
rm -rf /etc/docker/daemon.json
cat <<'EOF'>>/etc/docker/daemon.json
{"registry-mirrors": ["https://registry.hub.docker.com","https://ccr.ccs.tencentyun.com","https://dockerproxy.com","https://hub-mirror.c.163.com","https://docker.mirrors.sjtug.sjtu.edu.cn","https://docker.nju.edu.cn","https://registry-k8s-io.mirrors.sjtug.sjtu.edu.cn","https://docker.m.daocloud.io","https://docker.mirrors.ustc.edu.cn","https://mirror.iscas.ac.cn","https://s64h8lpn.mirror.aliyuncs.com","https://atomhub.openatom.cn","https://mirror.baidubce.com","https://docker.1panel.live","https://proxy.1panel.live","https://image.cloudlayer.icu","https://docker-0.unsee.tech","https://docker.tbedu.top","https://pull.loridocker.com","https://docker.melikeme.cn","https://docker.imgdb.de","https://docker.hlmirror.com","https://docker.kejilion.pro","https://hub.rat.dev","https://dockerpull.pw","https://hub.fast360.xyz","https://docker.xuanyuan.me","https://docker.1ms.run","https://xdark.top","https://func.ink","https://lispy.org"],"insecure-registries": ["192.168.11.40"],"runtimes": {"nvidia": {"args": [],"path": "nvidia-container-runtime"}}
}
EOF

echo "restart docker"
systemctl restart docker

echo "test cuda docker"
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
!
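Once the Docker section of the script has been enabled and run, it can be worth confirming that daemon.json really declares the NVIDIA runtime. The sketch below is an assumption-laden helper rather than anything from the original script: `nvidia_runtime_status` is a hypothetical name, and it does a plain text match on the `"path"` entry instead of parsing the JSON, so it is only a quick sanity check.

```shell
# Read a Docker daemon.json on stdin and report whether it contains a
# "path" entry pointing at nvidia-container-runtime.
# Text match only, not a JSON parse; hypothetical helper for illustration.
nvidia_runtime_status() {
    if grep -q '"path"[[:space:]]*:[[:space:]]*"nvidia-container-runtime"'; then
        echo "nvidia runtime configured"
    else
        echo "nvidia runtime missing"
    fi
}
```

A typical call would be `nvidia_runtime_status < /etc/docker/daemon.json`.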
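Run the one-click NVIDIA driver installation
- Be sure to reboot the server afterwards with reboot; the driver does not take effect until you do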
bash /nvidia_install.sh
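Since the driver only takes effect after a reboot, one way to tell whether that reboot is still outstanding is to look for the `nvidia` kernel module. This is a rough sketch under the assumption that the input follows the `/proc/modules` layout (module name in the first column); the function name `nvidia_module_status` is hypothetical.

```shell
# Read /proc/modules style text on stdin and report whether a module
# named exactly "nvidia" is listed. Hypothetical helper for illustration.
nvidia_module_status() {
    if awk '$1 == "nvidia" { found = 1 } END { exit !found }'; then
        echo "nvidia module loaded"
    else
        echo "nvidia module not loaded, a reboot may still be pending"
    fi
}
```

For example: `nvidia_module_status < /proc/modules`.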
Check the installed NVIDIA driver
nvidia-smi
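Beyond reading the nvidia-smi table by eye, the driver version can be compared against the branch the script installs (nvidia-driver-535). The helper below is a sketch; `check_driver_branch` is a hypothetical name, and it assumes one version string per line, which is what `nvidia-smi --query-gpu=driver_version --format=csv,noheader` prints.

```shell
# Read driver version strings on stdin, one per line, and label each as
# "match" or "mismatch" against the expected major branch (default 535).
# Hypothetical helper for illustration.
check_driver_branch() {
    expected="${1:-535}"
    while IFS= read -r version; do
        [ -n "$version" ] || continue
        # ${version%%.*} keeps only the text before the first dot
        if [ "${version%%.*}" = "$expected" ]; then
            echo "$version match"
        else
            echo "$version mismatch"
        fi
    done
}
```

It could be used as `nvidia-smi --query-gpu=driver_version --format=csv,noheader | check_driver_branch 535`.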

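NVIDIA persistence mode
- Persistence mode (-pm) keeps the NVIDIA driver loaded even when no client, such as nvidia-smi or a CUDA program, is using the GPU. This avoids the driver load delay each time a new GPU program starts. The setting itself does not survive a reboot, so it has to be applied again after every boot.
- 1: enable persistence mode
- 0: disable persistence mode
- Takes effect immediately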
echo "nvidia persist mode"
nvidia-smi -pm 1
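To see which GPUs still need the command above, for example after a reboot, the persistence state can be read back with nvidia-smi's CSV query output. This is a small sketch; `gpus_without_persistence` is a hypothetical name, and it assumes input lines of the form `index, persistence_mode` as produced by `--query-gpu=index,persistence_mode --format=csv,noheader`.

```shell
# Read "index, persistence_mode" CSV lines on stdin and print the index
# of every GPU whose persistence mode is not "Enabled".
# Hypothetical helper for illustration.
gpus_without_persistence() {
    awk -F', *' 'NF >= 2 && $2 != "Enabled" { print $1 }'
}
```

Example use: `nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader | gpus_without_persistence`. Each index it prints can then be passed to `nvidia-smi -i <index> -pm 1`.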

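Enabling and disabling NVIDIA ECC
- ECC (error-correcting code) detects and corrects errors in device memory. Enabling ECC improves system stability and guards against data corruption caused by memory errors.
- 1: enable ECC
- 0: disable ECC
- Takes effect after a reboot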
nvidia-smi -e 1
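Because an ECC change only applies after a reboot, nvidia-smi reports both a current and a pending ECC mode. The sketch below compares the two; `gpus_awaiting_ecc_reboot` is a hypothetical name, and it assumes CSV lines of the form `index, current, pending` from `--query-gpu=index,ecc.mode.current,ecc.mode.pending --format=csv,noheader`.

```shell
# Read "index, ecc.mode.current, ecc.mode.pending" CSV lines on stdin and
# print the index of every GPU whose pending mode differs from the
# current one, meaning the change is waiting for a reboot.
# Hypothetical helper for illustration.
gpus_awaiting_ecc_reboot() {
    awk -F', *' 'NF >= 3 && $2 != $3 { print $1 }'
}
```

Example use: `nvidia-smi --query-gpu=index,ecc.mode.current,ecc.mode.pending --format=csv,noheader | gpus_awaiting_ecc_reboot`.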

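Resetting the ECC error counters (-p)
- The -p option resets the ECC error counters. When ECC is enabled, these counters track the number of memory errors detected. Resetting them helps with monitoring and troubleshooting, because any errors reported afterwards are new.
- 0/VOLATILE: errors counted since the driver was last loaded
- 1/AGGREGATE: errors counted over the lifetime of the GPU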
nvidia-smi -p 0
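Before resetting, it may be useful to record how many errors the counters currently hold. The following sketch adds up whatever numeric counters it is given; `sum_ecc_errors` is a hypothetical name, and the query field names in the usage note are the ones listed by `nvidia-smi --help-query-gpu` on the drivers I am aware of, so check them against your own driver.

```shell
# Read CSV lines of ECC counters on stdin and print their sum.
# Fields that are not plain non-negative integers, such as "[N/A]" on
# GPUs without ECC, are ignored. Hypothetical helper for illustration.
sum_ecc_errors() {
    awk -F', *' '
        { for (i = 1; i <= NF; i++) if ($i ~ /^[0-9]+$/) total += $i }
        END { print total + 0 }
    '
}
```

Example use: `nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv,noheader | sum_ecc_errors`.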