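1, Definition
The NVIDIA device plugin for Kubernetes is a DaemonSet that lets you automatically:
- Expose the number of GPUs on each node of the cluster
- Keep track of the health of the GPUs
- Run GPU-enabled containers in the Kubernetes cluster
2, Prerequisites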
- NVIDIA drivers ~= 384.81
- nvidia-docker >= 2.0 || nvidia-container-toolkit >= 1.7.0 (>= 1.11.0 to use integrated GPUs on Tegra-based systems)
- nvidia-container-runtime configured as the default low-level runtime
- Kubernetes version >= 1.10
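The version requirements above can be checked with a small helper. The following is a minimal sketch, assuming GNU `sort -V` is available and treating each requirement as a simple minimum version (a simplification of the `~=` driver constraint). `version_ge` is a hypothetical helper name, not part of any NVIDIA or Kubernetes tool.

```shell
# version_ge A B: succeeds when version A >= version B.
# Assumes GNU coreutils `sort -V` (version sort).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Example usage on a GPU node (requires nvidia-smi to be installed):
#   driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
#   version_ge "$driver" 384.81 && echo "driver OK: $driver"
```

The helper only compares version strings. It does not check whether nvidia-container-runtime is configured as the default runtime, which is covered in section 5.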
3, Installation
kubectl apply -f nvidia-device-plugin.yml
The content of the YAML file is as follows:
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: 10.5.5.25:8080/nvidia/k8s-device-plugin:v0.17.0-ubi9
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
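Once the DaemonSet is applied, it is worth confirming that the plugin pods are running and that the nodes advertise the `nvidia.com/gpu` resource. A minimal sketch, assuming `kubectl` has access to the cluster. `nodes_with_gpu` is a hypothetical helper that filters the two-column table printed by the `kubectl get nodes` command shown in the comments.

```shell
# nodes_with_gpu: reads a "NODE GPU" table on stdin (header on the first line)
# and prints the names of nodes whose GPU column is a positive number.
nodes_with_gpu() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}

# Example usage (needs cluster access):
#   kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds -o wide
#   kubectl get nodes \
#     -o 'custom-columns=NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#     | nodes_with_gpu
```

If a GPU node is missing from the output, the plugin pod on that node probably could not see the GPU. The container runtime configuration is a common cause.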
4, Testing
4.1 Write a YAML file and run a GPU job
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
4.2 View the log of gpu-pod
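The log can be read with `kubectl logs` after the pod has finished. The sketch below assumes the test manifest was saved as `gpu-pod.yaml` (an assumed file name) and relies on the CUDA vectorAdd sample printing a `Test PASSED` line on success. `vectoradd_passed` is a hypothetical helper name.

```shell
# vectoradd_passed: reads a pod log on stdin and succeeds if it contains
# the "Test PASSED" line printed by the CUDA vectorAdd sample.
vectoradd_passed() {
  grep -q 'Test PASSED'
}

# Example usage (gpu-pod.yaml is an assumed file name for the manifest in 4.1):
#   kubectl apply -f gpu-pod.yaml
#   kubectl get pod gpu-pod          # wait until the pod has completed
#   kubectl logs gpu-pod | vectoradd_passed && echo "GPU test passed"
```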
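5, Problems encountered
After the installation finished, no GPU information showed up on the nodes. Checking /etc/docker/daemon showed that the container toolkit was already installed, but docker info reported that the runtime was still runc. I suspected this was the cause, so I set default-runtime as follows: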
{
  "data-root": "/data/docker_data",
  "insecure-registries": [
    "192.168.237.50:8080",
    "127.0.0.0/8"
  ],
  "registry-mirrors": [
    "192.168.237.50:8080",
    "https://docker.m.daocloud.io",
    "https://docker.unsee.tech",
    "https://docker.1panel.live",
    "http://mirrors.ustc.edu.cn",
    "https://docker.chenby.cn",
    "http://mirror.azure.cn",
    "https://dockerpull.org",
    "https://dockerhub.icu",
    "https://hub.rat.dev",
    "https://proxy.1panel.live",
    "https://docker.1panel.top",
    "https://docker.m.daocloud.io",
    "https://docker.1ms.run",
    "https://docker.ketches.cn",
    "https://mirror.aliyuncs.com"
  ],
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  },
  "default-runtime": "nvidia"
}
Here 192.168.237.50:8080 is the private registry. JSON does not allow comments, so this note must not be written inside the file itself.
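Docker only reads this configuration at startup, so the daemon has to be restarted before the new default runtime takes effect. A minimal sketch, assuming a systemd-managed Docker and the usual config path /etc/docker/daemon.json (both assumptions). `default_runtime_of` is a hypothetical helper that does a plain text match and is not a real JSON parser.

```shell
# default_runtime_of FILE: prints the value of "default-runtime" found in a
# Docker daemon config file. Plain text match only, not a JSON parser.
default_runtime_of() {
  sed -n 's/.*"default-runtime"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

# Example usage (assumes systemd and the usual config path):
#   default_runtime_of /etc/docker/daemon.json    # value in the config file
#   sudo systemctl restart docker                 # apply the new config
#   docker info --format '{{.DefaultRuntime}}'    # value the daemon reports
```

If the node still shows no GPUs after the restart, deleting the device plugin pod on that node so that the DaemonSet recreates it may help. This is a suggestion rather than a step taken in the original setup.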
With this change, Kubernetes was finally able to use the GPU.