Problem
The following error occurred at the start of training:
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch.
Environment
- os: Ubuntu 20.04.6 LTS
- python: v3.8.10
- torch: v2.0.1
- cuda: v11.7
- gpu: A100 * 2 (MIG enabled)
Solution
- Single GPU training
Setting the environment variable CUDA_VISIBLE_DEVICES to 0 resolved the error:
export CUDA_VISIBLE_DEVICES=0
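- Multi GPU training
Not possible for now: with CUDA 11 and CUDA 12, a single process can use only one MIG instance, so one process cannot train across several MIG instances.
For reference, MIG is supported on A100 from CUDA 11 and on H100 from CUDA 12.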
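Since each process is limited to one MIG instance, one workaround is to run a separate process per instance, each with its own CUDA_VISIBLE_DEVICES. The sketch below illustrates this under a few assumptions: the helper names (`parse_mig_uuids`, `env_for_device`, `launch_one_process_per_instance`) and the `train.py` script are hypothetical, and the regular expression assumes that `nvidia-smi -L` prints each MIG device with a `(UUID: MIG-...)` suffix, which may differ across driver versions. The processes it starts are independent jobs (for example, separate experiments), not a single multi-GPU training run, so the limitation described above still applies.

```python
import os
import re
import subprocess
import sys
from typing import Dict, Iterable, List, Optional

# Assumption: `nvidia-smi -L` lists MIG devices as "... (UUID: MIG-...)".
MIG_UUID_PATTERN = re.compile(r"\(UUID:\s*(MIG-[^)\s]+)\)")


def parse_mig_uuids(nvidia_smi_listing: str) -> List[str]:
    """Extract MIG device UUIDs from the text printed by `nvidia-smi -L`."""
    return MIG_UUID_PATTERN.findall(nvidia_smi_listing)


def env_for_device(device: str,
                   base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment that exposes exactly one CUDA device.

    `device` can be an index such as "0" (the single-GPU fix above)
    or a MIG UUID string.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = device
    return env


def launch_one_process_per_instance(script: str,
                                    devices: Iterable[str]) -> List[subprocess.Popen]:
    """Start one independent Python process per device (hypothetical helper)."""
    return [
        subprocess.Popen([sys.executable, script], env=env_for_device(device))
        for device in devices
    ]


# Example usage (not executed here; requires an NVIDIA driver and a train.py):
#   listing = subprocess.run(["nvidia-smi", "-L"], capture_output=True,
#                            text=True, check=True).stdout
#   procs = launch_one_process_per_instance("train.py", parse_mig_uuids(listing))
#   for p in procs:
#       p.wait()
```

The variable is passed through the child's environment rather than set inside the training script because CUDA_VISIBLE_DEVICES has to be in place before CUDA is initialized in that process.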
Reference 1: https://github.com/pytorch/pytorch/issues/38616 (RuntimeError: num_gpus <= 16 INTERNAL ASSERT FAILED · Issue #38616 · pytorch/pytorch)
Reference 2: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#cuda-visible-devices (NVIDIA Multi-Instance GPU User Guide)
Reference 3: https://stackoverflow.com/questions/73175008 (How to make multiple GPUs visible with os.environ["CUDA_VISIBLE_DEVICES"] using GPU_IDs)