GPU pods

Running GPU pods

Use this definition to create your own pod and deploy it to Kubernetes:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-example
spec:
  containers:
  - name: gpu-container
    image: gitlab-registry.nrp-nautilus.io/prp/jupyter-stack/prp:latest
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1
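
Assuming the definition above is saved as gpu-pod.yaml (the filename is up to you), you can deploy it and check that the GPU is visible inside the container:

kubectl create -f gpu-pod.yaml
kubectl exec -it gpu-pod-example -- nvidia-smi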

This example requests 1 GPU device; you can request up to 8 per node. If your pod requests GPU devices, Kubernetes will automatically schedule it to an appropriate node, so there's no need to specify the location manually.

You should always delete your pod when your computation is done to free the GPUs for other users. Whenever possible, use Jobs with an actual script instead of sleep to make sure your pod is not wasting GPU time, as in the sketch below. If you have never used Kubernetes before, see the tutorial.
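
For example, a Job like the following releases its GPU as soon as the script exits (a minimal sketch; the command is a placeholder for your actual script):

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job-example
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: gpu-container
        image: gitlab-registry.nrp-nautilus.io/prp/jupyter-stack/prp:latest
        command: ["python", "-u", "train.py"]  # placeholder: replace with your actual script
        resources:
          limits:
            nvidia.com/gpu: 1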

Requesting high-demand GPUs

Certain kinds of GPUs have much higher specs than the others, and to avoid wasting them on regular jobs, your pods will only be scheduled on those if you request the type explicitly (see the example after the list).

Currently those include:

  • A40
  • A100
  • K40
  • V100
  • RTX6000
  • RTX8000
  • TITANRTX
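
For instance, to request an A100 explicitly, you can match the gpu-type node label with a nodeSelector (a minimal sketch; the nodeAffinity form shown under Choosing GPU type below works the same way):

spec:
  nodeSelector:
    gpu-type: A100  # schedules only on nodes labeled with this GPU type
  containers:
  - name: gpu-container
    image: gitlab-registry.nrp-nautilus.io/prp/jupyter-stack/prp:latest
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1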

Requesting many GPUs

Since 1- and 2-GPU jobs can block nodes from accepting 4- and 8-GPU jobs, some nodes are reserved for the larger jobs. Once you submit a job requesting 4 or 8 GPUs, a controller will automatically add the required toleration; you don't need to do anything manually.

If we see more demand, we’ll add the reservation to more nodes.
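
No extra fields are needed in the pod spec; the controller keys off the GPU count in your resource limits. A fragment:

    resources:
      limits:
        nvidia.com/gpu: 4  # 4- and 8-GPU requests get the reserved-node toleration added automatically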

Choosing GPU type

We have a variety of GPU flavors attached to Nautilus. The table below describes the types of GPUs available for use, but it is not kept up to date; it's better to query the actual cluster information (e.g. kubectl get nodes -L gpu-type).
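
For example (the second command is a sketch using standard kubectl custom-columns; the backslashes escape the dots in the resource name):

# List nodes with their gpu-type label
kubectl get nodes -L gpu-type

# Also show how many GPUs each node exposes
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.capacity.nvidia\.com/gpu'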

If you need more GPU memory, use this table or the official specs to choose the type:

GPU Type     Memory size (GB)
1080         8
M4000        8
1080Ti       11
2080Ti       11
TITAN X/XP   12
Tesla K40    12
Tesla T4     16
TITAN RTX    24
3090Ti       24
Tesla V100   32
A100         40
RTX8000      48
A40          48

NOTE: Not all nodes are available to all users. You can ask about the resources available to you in Matrix and check the resources page. Labs connecting their hardware to our cluster have preferential access to all our resources.

To use a specific type of GPU, add an affinity definition to your pod YAML file. The example below requests a 1080Ti GPU:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-type
            operator: In
            values:
            - 1080Ti
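
For completeness, a full pod combining the earlier example with this affinity block might look like this (same image and names as above):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-example
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-type
            operator: In
            values:
            - 1080Ti
  containers:
  - name: gpu-container
    image: gitlab-registry.nrp-nautilus.io/prp/jupyter-stack/prp:latest
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1

Since the operator is In, you can list several acceptable GPU types under values to give the scheduler more options.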