If a deep learning training run is interrupted, the GPU memory it allocated often fails to be released. These notes record the cleanup steps for future reference.
The symptom is the following error when the next run starts (the CUDA runtime cannot initialize because a leftover process still holds almost all of the GPU memory):
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory
1. Check whether this is actually the problem: nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130 Driver Version: 384.130 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN V Off | 00000000:01:00.0 On | N/A |
| 39% 53C P2 36W / 250W | 11959MiB / 12055MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1017 G /usr/lib/xorg/Xorg 298MiB |
| 0 1834 G /opt/teamviewer/tv_bin/TeamViewer 6MiB |
| 0 2045 G compiz 177MiB |
| 0 4118 G ...-token=D609226DD6A56AEBB70B08FB7BC10F2E 78MiB |
| 0 4603 G ...uest-channel-token=11061898972785214487 59MiB |
| 0 16481 C python3 418MiB |
| 0 16537 C python3 10916MiB |
+-----------------------------------------------------------------------------+
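If the full table is awkward to read (for example over SSH), nvidia-smi can also report just the numbers. A minimal sketch using its built-in query options (standard flags, though the exact fields available depend on the driver/nvidia-smi version):
# overall memory usage per GPU
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
# only compute (Type C) processes and the GPU memory each one holds
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv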
2. Process 16537 is the culprit; kill it:
kill -9 16537
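If a crashed run no longer shows up in nvidia-smi but its memory is still marked as used, fuser can list everything that keeps the NVIDIA device files open. This is an extra option beyond the original notes; be careful with the kill variant, since it hits every GPU user, including Xorg and compiz from the table above:
# show all processes that still hold /dev/nvidia* open
sudo fuser -v /dev/nvidia*
# kill them all at once (this also kills the desktop session using the GPU)
# sudo fuser -k /dev/nvidia*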
3. Keep monitoring the GPU (the 3 means refresh every 3 seconds):
watch -n 3 nvidia-smi
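As an alternative with the same effect, nvidia-smi has a built-in loop mode, and watch's -d flag highlights the fields that changed between refreshes:
nvidia-smi -l 3           # refresh every 3 seconds without watch
watch -n 3 -d nvidia-smi  # highlight what changed on each refresh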
4. Monitor CPU and memory:
top -d 1
free -m
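To see at a glance which processes are eating the most memory on the CPU side (assumes the GNU ps from procps that ships with Ubuntu):
# top 10 processes sorted by resident memory
ps aux --sort=-%mem | head -n 10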
5. Clear the kernel's cached memory (page cache, dentries and inodes):
sudo sh -c 'echo 1 > /proc/sys/vm/drop_caches'
sudo sh -c 'echo 2 > /proc/sys/vm/drop_caches'
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
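For reference: echo 1 drops only the page cache, echo 2 drops dentries and inodes, and echo 3 drops both, so the last command alone is enough. Running sync first writes dirty pages back to disk so more of the cache can actually be freed (a general Linux note, not specific to this machine):
# flush dirty pages, then drop page cache plus dentries/inodes in one step
sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'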