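Introduction
Our lab has many GPU servers, and before running any code I had to log in to each one in turn to check whether its GPUs were already in use. So I wrote this small tool.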
https://github.com/rikonaka/watchcorgi
Sample output
curl http://127.0.0.1:7070/info
>> 2023-06-03 12:01:31 [watchcorgi]
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|   name  |cpu[s]|cpu[u]|              gpu device             |gpu[u]|       gpu[m]      |   gpu user   |update time|
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|   gpu1  | 0.0 %| 0.0 %|      A100-PCIE-40GB(460.106.00)     |  0 % |  0 MiB/40536 MiB  |     null     |  12:01:22 |
|         |      |      |      A100-PCIE-40GB(460.106.00)     | 17 % |  0 MiB/40536 MiB  |              |           |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|   gpu2  | 0.0 %| 0.0 %|  NVIDIA GeForce RTX 3090(515.65.01) |  0 % |  2 MiB/24576 MiB  |   StainAtt   |  12:01:30 |
|         |      |      |  NVIDIA GeForce RTX 3090(515.65.01) | 91 % |12611 MiB/24576 MiB|              |           |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|   gpu3  | 0.0 %| 0.0 %|NVIDIA GeForce GTX 1080 Ti(530.30.02)|  0 % |  0 MiB/11264 MiB  |     null     |  12:01:24 |
|         |      |      |NVIDIA GeForce GTX 1080 Ti(530.30.02)|  1 % |  0 MiB/11264 MiB  |              |           |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|   gpu4  | 0.0 %| 0.2 %|                                     |      |                   | driver failed|  12:01:25 |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|   gpu5  | 0.0 %| 0.0 %|NVIDIA GeForce RTX 2080 Ti(530.30.02)|  0 % |  0 MiB/11264 MiB  |     null     |  12:01:20 |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|   gpu6  | 0.1 %| 0.0 %|         Quadro P5000(510.54)        | 100 %|16145 MiB/16384 MiB|      CNN     |  12:01:29 |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|   gpu7  | 0.0 %| 0.0 %|      A100-PCIE-40GB(460.106.00)     |  0 % |39262 MiB/40536 MiB|    API-Net   |  12:01:28 |
|         |      |      |      A100-PCIE-40GB(460.106.00)     |  0 % |  3 MiB/40536 MiB  |              |           |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|   gpu8  | 0.0 %| 0.0 %|NVIDIA GeForce RTX 2080 Ti(510.47.03)|  0 % |  1 MiB/11264 MiB  |     null     |  12:01:26 |
|         |      |      |NVIDIA GeForce RTX 2080 Ti(510.47.03)|  0 % |  1 MiB/11264 MiB  |              |           |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|   gpu9  | 0.0 %| 0.0 %|  NVIDIA A100-PCIE-40GB(525.116.03)  | 83 % |18796 MiB/40960 MiB|OpenHGNN_final|  12:01:23 |
|         |      |      |                                     |      |                   |   StainAtt   |           |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|  gpu10  | 0.0 %| 0.0 %| NVIDIA GeForce RTX 3090(525.116.03) |  0 % |  0 MiB/24576 MiB  |     null     |  12:01:28 |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|  gpu11  | 0.5 %| 4.2 %|   NVIDIA A100-PCIE-40GB(515.65.01)  | 91 % | 3671 MiB/40960 MiB|     liif     |  12:01:26 |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
|  gpu12  | 0.0 %| 0.0 %| NVIDIA GeForce RTX 4090(525.116.03) |  0 % |  0 MiB/24564 MiB  |     null     |  12:01:18 |
+---------+------+------+-------------------------------------+------+-------------------+--------------+-----------+
Powered by Rust
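Basic installation
Before installing, make sure Redis is available on the machine that will run the server.
Download the client and server programs separately. Put the client on each of your GPU servers, and put the server on any other machine.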
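A quick way to confirm that Redis is reachable before starting the server is to send it a PING. The helper below is only a sketch: the function name `redis_ready` and the `REDIS_CLI` override are made up for illustration, and the host and port defaults (127.0.0.1 and 6379) are Redis's usual defaults, not values taken from watchcorgi.

```shell
# Hypothetical helper (not part of watchcorgi): succeed only if Redis answers PING.
# $1 = Redis host (assumed default 127.0.0.1), $2 = Redis port (assumed default 6379).
# REDIS_CLI may point at a different redis-cli binary if it is not on PATH.
redis_ready() {
    reply="$("${REDIS_CLI:-redis-cli}" -h "${1:-127.0.0.1}" -p "${2:-6379}" ping 2>/dev/null)"
    # redis-cli prints PONG when the server is up and answering.
    [ "$reply" = "PONG" ]
}
```

For example, `redis_ready && echo "redis ok"` on the server machine prints the message only when Redis responds.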
https://github.com/rikonaka/watchcorgi/releases
Then run the client and the server. For the client, the --address parameter takes the update URL of the machine hosting the server; the default port is 7070.
watchcorgi-client --server gpu --address http://YOUR_SERVER_IP:7070/update --interval 9
For the server, set the listen address and listen port:
watchcorgi-server --address 0.0.0.0 --port 7070
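Once both programs are running, the /info endpoint from the sample output can be used to confirm that the server is up and receiving updates. The following is a minimal sketch; `watchcorgi_info` is a hypothetical helper name, and the curl options are just one reasonable choice.

```shell
# Hypothetical helper (not part of watchcorgi): print the status table from a server.
# $1 = server host or IP, $2 = port (defaults to 7070, the port used in this post).
watchcorgi_info() {
    host="$1"
    port="${2:-7070}"
    # --fail makes curl exit non-zero on HTTP errors; --max-time avoids hanging forever.
    curl --silent --show-error --fail --max-time 5 "http://${host}:${port}/info"
}
```

Calling `watchcorgi_info 127.0.0.1` on the server machine should print a table like the one above; a non-zero exit status means the server could not be reached.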
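Of course, the better option is to manage both with systemd.
systemd installation
The repository already provides a service file. Download it and adjust its contents to match your setup. One version of the service file is reproduced here.
This is the client's file. Remember to change the executable path to wherever your binary actually lives; to save effort, you can also just drop the binary into /usr/bin.
Set --address to the IP and port of the machine hosting the server. --interval is how often a monitoring packet is sent, with a minimum of 1. --server is the server type; the default here is a GPU server, and if you also have CPU servers, use cpu.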
[Unit]
Description=Watchcorgi Client Service
After=network.target

[Service]
Type=simple
User=root
Restart=on-failure
RestartSec=5s
ExecStart=/usr/bin/watchcorgi-client --server gpu --address http://192.168.1.206:7070/update --interval 9
ExecReload=/usr/bin/watchcorgi-client --server gpu --address http://192.168.1.206:7070/update --interval 9
LimitNOFILE=1048576

[Install]
WantedBy=multi-user.target
Next is the server's service file. The --address and --port here are the address and port the backend listens on; change them as needed.
[Unit]
Description=Watchcorgi Server Service
After=network.target

[Service]
Type=simple
User=root
Restart=on-failure
RestartSec=5s
ExecStart=/usr/bin/watchcorgi-server --address 0.0.0.0 --port 7070
ExecReload=/usr/bin/watchcorgi-server --address 0.0.0.0 --port 7070
LimitNOFILE=1048576

[Install]
WantedBy=multi-user.target
Then run the following:
# On each GPU server (client)
cp watchcorgi-client.service /etc/systemd/system
systemctl enable watchcorgi-client.service
systemctl start watchcorgi-client.service
# On the machine hosting the server
cp watchcorgi-server.service /etc/systemd/system
systemctl enable watchcorgi-server.service
systemctl start watchcorgi-server.service
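After starting the units, it is worth checking that they actually stayed up. The sketch below uses the standard `systemctl is-active` and `journalctl` commands; the helper name `check_unit` is hypothetical and not something shipped with watchcorgi.

```shell
# Hypothetical helper (not part of watchcorgi): report whether a unit is running,
# and print its most recent log lines when it is not.
# $1 = unit name, e.g. watchcorgi-client.service
check_unit() {
    unit="$1"
    if systemctl is-active --quiet "$unit"; then
        echo "$unit: active"
    else
        echo "$unit: NOT active"
        # Show the last 20 journal lines to help diagnose the failure.
        journalctl -u "$unit" -n 20 --no-pager
        return 1
    fi
}
```

Run `check_unit watchcorgi-client.service` on a GPU server, or `check_unit watchcorgi-server.service` on the machine hosting the server.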
Frontend
There isn't one… I can't write pretty web pages. If anyone has the skills, feel free to build one. Command line forever!
A frontend can request
http://YOUR_SERVER_IP:7070/info2
to get the same data as JSON.
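To see what the JSON looks like before writing a frontend, the endpoint can be fetched from the command line. This is a rough sketch: `watchcorgi_json` is a hypothetical helper, the JSON layout is not documented in this post so the code makes no assumptions about its fields, and pretty-printing relies on `python3` being installed (the raw response is printed otherwise).

```shell
# Hypothetical helper (not part of watchcorgi): fetch /info2 and pretty-print the JSON.
# $1 = server host or IP, $2 = port (defaults to 7070).
watchcorgi_json() {
    url="http://$1:${2:-7070}/info2"
    if command -v python3 >/dev/null 2>&1; then
        # json.tool also exits non-zero if the response is not valid JSON.
        curl --silent --show-error --fail --max-time 5 "$url" | python3 -m json.tool
    else
        # No python3 available: print the raw response as-is.
        curl --silent --show-error --fail --max-time 5 "$url"
    fi
}
```

For example, `watchcorgi_json YOUR_SERVER_IP` prints the response with indentation, which makes it easier to see which fields a web page would need to render.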