GPU Server
GPU Policy
- Each user is given a daily quota of 24 GPU-hours. Once the quota is exceeded, all of the user's GPU processes (on both machines) are killed. The quota is reset at 00:00 (UTC+8) every day. GPU time is calculated solely by how long a user holds a GPU context; other metrics, such as GPU load and number of processes, are irrelevant.
- You may use any amount of GPU memory as long as you have quota available. However, to prevent resource exhaustion (GPU memory is non-preemptive), if all GPUs on the same machine are running out of free memory (less than 3 GiB of free memory left on each of them), the system will kill all GPU processes of the heaviest GPU memory consumer, provided that user's combined memory usage on both machines exceeds 2 GiB (see the sketch after this list for checking free memory).
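To see how close a machine is to that free-memory threshold, a minimal sketch using the NVML Python bindings may help. It assumes the nvidia-ml-py package is available on the server; the 3 GiB figure simply mirrors the policy above.

import pynvml  # provided by the nvidia-ml-py package (assumed to be installed)

FREE_MEMORY_FLOOR = 3 * 1024 ** 3  # 3 GiB, the per-GPU free-memory limit from the policy above

pynvml.nvmlInit()
try:
    gpu_count = pynvml.nvmlDeviceGetCount()
    low_memory_gpus = 0
    for index in range(gpu_count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {index}: {mem.free / 1024 ** 3:.1f} GiB free of {mem.total / 1024 ** 3:.1f} GiB")
        if mem.free < FREE_MEMORY_FLOOR:
            low_memory_gpus += 1
    if low_memory_gpus == gpu_count:
        print("Every GPU is below 3 GiB free; the heaviest memory consumer may be killed.")
finally:
    pynvml.nvmlShutdown()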
Useful Tips
- Take a look at the current GPU status (with nvidia-smi or nvtop) before running GPU jobs. ws-status also shows your remaining GPU quota.
- PyTorch defaults to using only the first GPU, and TensorFlow defaults to using all GPUs on the machine (which consumes a lot of your GPU quota). To change this, pass the environment variable CUDA_VISIBLE_DEVICES=[GPU ID (0 to 7)] to the process. For example, run your training script with the command:
CUDA_VISIBLE_DEVICES=0 python train.py
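To confirm the restriction took effect, a small check like the one below may help. It assumes PyTorch is installed; CUDA_VISIBLE_DEVICES must be set before the process initializes CUDA, either on the command line as above or via os.environ before the first CUDA call.

import os
# In-script alternative to the command line above; must run before the
# first CUDA call so that only GPU 0 is visible to this process.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")

import torch

# With CUDA_VISIBLE_DEVICES=0 this prints 1, and that GPU appears as cuda:0.
print("Visible GPUs:", torch.cuda.device_count())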
- TensorFlow allocates all of the GPU memory by default. Use the following code to change this:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand instead of reserving it all up front
sess = tf.Session(config=config)
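The snippet above is TensorFlow 1.x style (ConfigProto / Session). If the machine has TensorFlow 2.x installed (an assumption about the environment), the equivalent setting is to enable memory growth on the physical GPUs, roughly:

import tensorflow as tf

# Grow GPU memory allocation on demand instead of reserving everything;
# must be called before any GPU has been initialized.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)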