Making k8s GPU-aware: using k8s-device-plugin (pitfalls already worked through)

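1. Definition

The NVIDIA device plugin for Kubernetes is a DaemonSet that lets you automatically:

Expose the number of GPUs on each node of the cluster
Keep track of the health of the GPUs
Run GPU-enabled containers in the Kubernetes cluster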

2. Prerequisites

NVIDIA drivers ~= 384.81
nvidia-docker >= 2.0 || nvidia-container-toolkit >= 1.7.0 (>= 1.11.0 to use integrated GPUs on Tegra-based systems)
nvidia-container-runtime configured as the default low-level runtime
Kubernetes version >= 1.10
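
The minimums above can be checked mechanically. What follows is a minimal sketch in Python, assuming you have already collected the version strings by hand (for example from nvidia-smi and kubectl version). The helper names parse_version and meets_minimum are hypothetical, and the sketch treats "~= 384.81" simply as "at least 384.81", which is an assumption.

```python
import re

def parse_version(text):
    """Turn '535.104.05' or 'v1.28.2' into a tuple of ints, e.g. (535, 104, 5)."""
    return tuple(int(part) for part in re.findall(r"\d+", text))

def meets_minimum(found, minimum):
    """Compare two dotted version strings numerically, component by component."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad the shorter tuple with zeros so that '1.7' and '1.7.0' compare equal.
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b

# Minimums copied from the prerequisite list (container toolkit variant).
MINIMUMS = {
    "nvidia-driver": "384.81",  # assumption: '~=' is read as '>='
    "nvidia-container-toolkit": "1.7.0",
    "kubernetes": "1.10",
}

if __name__ == "__main__":
    # Hypothetical example values; substitute what your own node reports.
    found = {
        "nvidia-driver": "535.104.05",
        "nvidia-container-toolkit": "1.14.3",
        "kubernetes": "v1.28.2",
    }
    for name, minimum in MINIMUMS.items():
        status = "ok" if meets_minimum(found[name], minimum) else "too old"
        print(f"{name}: {found[name]} (need >= {minimum}): {status}")
```

The comparison is numeric rather than textual, so 1.9 correctly counts as older than 1.10.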

3. Installation

kubectl apply -f nvidia-device-plugin.yml

The YAML content is as follows:

# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: 10.5.5.25:8080/nvidia/k8s-device-plugin:v0.17.0-ubi9
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
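
Once the DaemonSet pods are running, each GPU node should advertise an nvidia.com/gpu resource. Here is a minimal sketch of how to check that, assuming kubectl is on the PATH and pointed at the cluster. gpu_allocatable is a hypothetical helper that only reads the JSON printed by "kubectl get nodes -o json".

```python
import json
import subprocess

def gpu_allocatable(nodes_json):
    """Map node name -> allocatable nvidia.com/gpu count (0 if not advertised)."""
    result = {}
    for node in json.loads(nodes_json).get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        result[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return result

def main():
    # Needs a working kubectl context, so it is not called automatically.
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    for name, count in gpu_allocatable(out).items():
        print(f"{name}: {count} GPU(s)")
```

Call main() on a machine with cluster access. A GPU node that reports 0 is in the state described under "Problems encountered": the plugin is deployed but found no GPU to register.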

4. Testing

4.1 Write a YAML manifest and run a test pod

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

4.2 Check the gpu-pod logs

[Image: screenshot of the gpu-pod log output]
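
For the vectoradd sample image, a successful run prints a line reading "Test PASSED". The sketch below checks for that line, assuming the pod is named gpu-pod as in 4.1. vectoradd_passed and fetch_logs are hypothetical helpers, and kubectl is only invoked when you call fetch_logs yourself.

```python
import subprocess

def vectoradd_passed(log_text):
    """True if some log line is exactly 'Test PASSED' (surrounding spaces ignored)."""
    return any(line.strip() == "Test PASSED" for line in log_text.splitlines())

def fetch_logs(pod="gpu-pod"):
    """Return the pod's logs via 'kubectl logs'; needs access to the cluster."""
    return subprocess.run(
        ["kubectl", "logs", pod],
        check=True, capture_output=True, text=True,
    ).stdout
```

On a machine with cluster access, print(vectoradd_passed(fetch_logs())) should print True once the pod has completed.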

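5. Problems encountered

After installation, no GPU information showed up on the node. /etc/docker/daemon.json showed that the container toolkit was already installed, but docker info reported that the runtime was still runc. Suspecting this was the cause, I set default-runtime as follows: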

{
  "data-root": "/data/docker_data",
  "insecure-registries": [
    "192.168.237.50:8080",
    "127.0.0.0/8"
  ],
  "registry-mirrors": [
    "192.168.237.50:8080",
    "https://docker.m.daocloud.io",
    "https://docker.unsee.tech",
    "https://docker.1panel.live",
    "http://mirrors.ustc.edu.cn",
    "https://docker.chenby.cn",
    "http://mirror.azure.cn",
    "https://dockerpull.org",
    "https://dockerhub.icu",
    "https://hub.rat.dev",
    "https://proxy.1panel.live",
    "https://docker.1panel.top",
    "https://docker.m.daocloud.io",
    "https://docker.1ms.run",
    "https://docker.ketches.cn",
    "https://mirror.aliyuncs.com"
  ],
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  },
  "default-runtime": "nvidia"
}

Here 192.168.237.50:8080 is the private registry. daemon.json is plain JSON, so do not put // comments in the real file.
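
The fix comes down to two things: the nvidia runtime must be registered under "runtimes", and "default-runtime" must point at it. A minimal sketch that checks both on the text of a daemon.json is shown below. default_runtime_problems is a hypothetical helper, not part of Docker or the NVIDIA toolkit.

```python
import json

def default_runtime_problems(config_text, wanted="nvidia"):
    """Return a list of problems in a daemon.json text; an empty list means it looks fine."""
    try:
        config = json.loads(config_text)
    except json.JSONDecodeError as err:
        # Also triggered by // comments, which JSON does not allow.
        return [f"not valid JSON: {err}"]
    if not isinstance(config, dict):
        return ["top level is not a JSON object"]
    problems = []
    if wanted not in config.get("runtimes", {}):
        problems.append(f"runtime {wanted!r} is not registered under 'runtimes'")
    if config.get("default-runtime") != wanted:
        current = config.get("default-runtime")
        problems.append(f"'default-runtime' is {current!r}, expected {wanted!r}")
    return problems
```

To use it, read /etc/docker/daemon.json and pass its contents to default_runtime_problems. After editing the file, Docker has to be restarted before docker info reflects the new default runtime.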

With that change, k8s was finally able to use the GPU.