GPU passthrough
This tutorial shows how to see the available GPUs on a site, grant tenants access to selected GPUs, and express GPU requirements in an application specification. Currently NVIDIA and Intel GPUs are supported.
Prerequisites
There are different ways to mount NVIDIA GPUs into containers depending on the container runtime. Intel GPUs have no special prerequisites and are detected and mounted automatically.
The Edge Enforcer detects the supported methods of running NVIDIA GPUs and performs GPU discovery at start-up. When changing the Docker daemon configuration in this regard, or updating the CDI specification, make sure to restart the Edge Enforcer (after restarting the Docker daemon, if applicable) so that it picks up the latest configuration.
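If both Docker and the Edge Enforcer are managed by systemd, the restart sequence could look like the following sketch; the edge-enforcer unit name is hypothetical and depends on how the Edge Enforcer was installed on the host.
# Restart Docker first so the new daemon.json / CDI configuration takes effect
sudo systemctl restart docker
# Then restart the Edge Enforcer so it re-runs GPU discovery (hypothetical unit name)
sudo systemctl restart edge-enforcer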
NVIDIA CDI specification
This method is available under the Podman container runtime and newer versions of Docker (starting with version 25.0).
The procedure to create a CDI specification for NVIDIA devices on a host is defined in the NVIDIA Container Toolkit User Guide. The container engine (Podman or Docker) must be able to find (e.g. in the /etc/cdi directory) and read the CDI specification, as generated by nvidia-ctk cdi generate.
The Avassa platform will probe the devices labelled nvidia.com/gpu=all and list the available devices with this label.
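As a sketch of the NVIDIA Container Toolkit procedure, the CDI specification could be generated and verified along the following lines; the output path is an example and may differ per installation.
# Generate a CDI specification covering all NVIDIA devices on the host
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# List the device names that are now available through CDI
nvidia-ctk cdi list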
If the NVIDIA Docker runtime is configured, it takes precedence over the CDI specification.
NVIDIA Docker runtime
This method is available only under the Docker container runtime. When running with Podman, refer to the CDI interface described above.
In order for the system to gain access to NVIDIA GPUs on a host, and to be able to pass GPUs through to applications, nvidia-container-toolkit must be installed on the host as described in the NVIDIA Container Toolkit User Guide, and the NVIDIA runtime must be configured in Docker's daemon.json. It should not be set as the default runtime, as the Avassa platform will selectively use this runtime for the containers that require it. The Docker runtime must be called nvidia for the Avassa platform to detect it.
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
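On hosts where the NVIDIA Container Toolkit is installed, this entry does not have to be written by hand; as a sketch, the toolkit can generate it (it does not set the NVIDIA runtime as default unless explicitly asked to). Restart the Docker daemon and the Edge Enforcer afterwards.
# Add the "nvidia" runtime entry to /etc/docker/daemon.json
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker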
Showing the available GPUs
As a site provider, to confirm that the expected GPU has been found on all
hosts in a site and to see the GPU parameters as discovered by the Avassa
platform, use the following supctl
command:
supctl show -s stockholm-sergel system cluster hosts --fields hostname,gpus
- hostname: stockholm-sergel-001
  gpus:
    - id: GPU-b75c47d9-5fb4-63e0-a07b-ff2633af741c
      vendor: NVIDIA
      name: Tesla M60
      serial: "0321017046575"
      memory: 7680 MiB
      driver-version: 525.60.13
      compute-mode: Default
      compute-capability: "5.2"
      display-mode: Enabled
      labels: []
    - id: GPU-ee1b2a5c-3cd0-0c4a-a240-d87c22748a35
      vendor: NVIDIA
      name: Tesla M60
      serial: "0321017046575"
      memory: 7680 MiB
      driver-version: 525.60.13
      compute-mode: Default
      compute-capability: "5.2"
      display-mode: Enabled
      labels: []
From this output, we can see that two identical NVIDIA GPUs were discovered on the single host within the stockholm-sergel site.
Creating a GPU label
In the general case the application owner does not have access to the list of all GPUs on the site (unless this role coincides with the site provider). For this reason, an application specification cannot refer directly to a GPU unit; instead, it refers to GPU labels created by the site provider and granted to the application owner.
Here is an example of creating the simplest possible GPU label, which refers to all GPUs within the site. The label is created in the system settings, which means it becomes available on all sites in the system.
supctl create system settings <<EOF
gpu-labels:
  - label: all
    all: true
EOF
An example of a more specific GPU label, valid only on the stockholm-sergel site, selecting any one of the "Tesla"-brand NVIDIA GPUs.
supctl merge system sites stockholm-sergel <<EOF
gpu-labels:
  - label: any-tesla
    max-number-gpus: 1
    gpu-patterns:
      - vendor == "NVIDIA", name == "*Tesla*"
EOF
Note the gpu-patterns expression. It is possible to match on any parameter that appears in the GPU list from the previous step. See the reference documentation for the system settings object for a detailed description of the gpu-patterns expression syntax.
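For instance, a hypothetical label matching only NVIDIA GPUs with at least 8 GiB of memory could be sketched as follows, using the same pattern syntax:
supctl merge system sites stockholm-sergel <<EOF
gpu-labels:
  - label: large-memory
    gpu-patterns:
      - vendor == "NVIDIA", memory >= "8 GiB"
EOF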
Verify that the GPU labels are available on a specific site (stockholm-sergel
in this case):
supctl show -s stockholm-sergel system cluster hosts --fields hostname,gpu-labels
- hostname: stockholm-sergel-001
  gpu-labels:
    - name: all
      matching-gpus:
        - id: GPU-b75c47d9-5fb4-63e0-a07b-ff2633af741c
        - id: GPU-ee1b2a5c-3cd0-0c4a-a240-d87c22748a35
    - name: any-tesla
      max-number-gpus: 1
      matching-gpus:
        - id: GPU-b75c47d9-5fb4-63e0-a07b-ff2633af741c
        - id: GPU-ee1b2a5c-3cd0-0c4a-a240-d87c22748a35
By comparing the GPU IDs in this list to the IDs from the GPU list in the previous step, we can see that both labels are available on the site and that both labels match both GPUs. The difference is the max-number-gpus parameter, which indicates that a container referencing the any-tesla label is assigned only one of the matching-gpus, not both. The other GPU may still be assigned to a different container referencing the same label on the site.
Granting a subtenant access to a GPU label
This step is only required if there is a subtenant that needs to be granted GPU access. If there are no subtenants and the tenant that configured the gpu-labels in the previous step is the one to run applications, then this step may be skipped.
By default, labels created as described in the previous step are only accessible to the site provider. In order for its subtenants to gain access to the GPUs, the site provider needs to create a tenant-specific resource-profile and assign the GPU labels to this profile, in a similar fashion as described in the Device Discovery tutorial.
We use an application owner tenant, acme, as example here.
To create a new resource-profile called t-acme-gpu
:
supctl create resource-profiles <<EOF
name: t-acme-gpu
gpu-labels:
  - name: all
  - name: any-tesla
EOF
Next, assign the resource-profile to the tenant globally. Note that this command replaces any resource-profile already assigned; if this is not desired, the currently assigned resource-profile may instead be updated with the list of gpu-labels, as sketched after the command below.
supctl merge tenants acme <<EOF
resource-profile: t-acme-gpu
EOF
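Alternatively, to keep an already assigned resource-profile and just add the GPU labels to it, the labels could be merged into that profile. A sketch, assuming the existing profile is named t-acme (a hypothetical name) and that supctl merge works against resource-profiles as it does for the other objects above:
supctl merge resource-profiles t-acme <<EOF
gpu-labels:
  - name: all
  - name: any-tesla
EOF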
Note that there may only be one resource-profile per tenant as a global setting; however, it may be refined on a per-site basis.
Assuming that the stockholm-sergel
site is already assigned to tenant acme
:
supctl show tenants acme assigned-sites --fields name
- name: stockholm-sergel
The tenant can now see the list of GPUs visible to them on this site:
supctl show -s stockholm-sergel assigned-sites stockholm-sergel --fields gpus,gpu-labels
gpus:
  - id: GPU-b75c47d9-5fb4-63e0-a07b-ff2633af741c
    vendor: NVIDIA
    name: Tesla M60
    serial: "0321017046575"
    memory: 7680 MiB
    driver-version: 525.60.13
    compute-mode: Default
    compute-capability: "5.2"
    display-mode: Enabled
  - id: GPU-ee1b2a5c-3cd0-0c4a-a240-d87c22748a35
    vendor: NVIDIA
    name: Tesla M60
    serial: "0321017046575"
    memory: 7680 MiB
    driver-version: 525.60.13
    compute-mode: Default
    compute-capability: "5.2"
    display-mode: Enabled
gpu-labels:
  - name: all
    matching-gpus:
      - id: GPU-b75c47d9-5fb4-63e0-a07b-ff2633af741c
      - id: GPU-ee1b2a5c-3cd0-0c4a-a240-d87c22748a35
  - name: any-tesla
    max-number-gpus: 1
    matching-gpus:
      - id: GPU-b75c47d9-5fb4-63e0-a07b-ff2633af741c
      - id: GPU-ee1b2a5c-3cd0-0c4a-a240-d87c22748a35
Creating an application with GPU passthrough
The system is now ready for an application requiring GPU access to be deployed. The following is an example of such an application:
name: sample-gpu-app
version: 0.0.1
services:
  - name: s
    mode: replicated
    replicas: 1
    containers:
      - name: c
        image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
        entrypoint:
          - /bin/bash
        cmd:
          - "-c"
          - sleep infinity
        gpu:
          labels:
            - all
To create this application specification:
cat sample-gpu-app.yaml | supctl create applications
In order to deploy this application on the stockholm-sergel site, we need an application deployment object:
name: sample-gpu-dep
application: sample-gpu-app
application-version: "0.0.1"
placement:
  match-site-labels: system/name=stockholm-sergel
To create this application deployment:
cat sample-gpu-dep.yaml | supctl create application-deployments
Once the application is deployed, we can see that both GPUs available on the host are passed through to the container:
supctl show -s stockholm-sergel \
applications sample-gpu-app service-instances s-1 \
--fields oper-status,containers/[name,gpus]
oper-status: running
containers:
  - name: c
    gpus:
      - id: GPU-b75c47d9-5fb4-63e0-a07b-ff2633af741c
      - id: GPU-ee1b2a5c-3cd0-0c4a-a240-d87c22748a35
This is expected, because the application requested all GPUs that carry the GPU label all, and there were no other limitations, neither on the label nor in the application specification. Had the application referred to the any-tesla label, only one GPU would have been passed through to the application because of the max-number-gpus limit on the label.
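For comparison, a minimal sketch of the container's gpu section referring to the any-tesla label instead:
gpu:
  labels:
    - any-tesla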
In order to verify that the GPUs are visible inside the container, we may use the nvidia-smi binary, which has been mounted into the container through the utility NVIDIA driver capability.
supctl do -s stockholm-sergel applications sample-gpu-app \
service-instances s-1 containers c exec nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla M60 Off | 00000000:00:1D.0 Off | 352 |
| N/A 26C P0 38W / 150W | 0MiB / 7680MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M60 Off | 00000000:00:1E.0 Off | 352 |
| N/A 35C P0 38W / 150W | 0MiB / 7680MiB | 49% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
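The GPU UUIDs printed by the driver can also be cross-checked against the id values reported by the Avassa platform, assuming extra arguments are passed through to exec as in this sketch; nvidia-smi -L lists each GPU together with its UUID.
supctl do -s stockholm-sergel applications sample-gpu-app \
  service-instances s-1 containers c exec nvidia-smi -L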
We may also try to run the NVIDIA test application included in this image.
supctl do -s stockholm-sergel applications sample-gpu-app \
service-instances s-1 containers c exec /tmp/vectorAdd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
Requesting passthrough of a subset of GPUs matching a GPU label
In certain cases a site provider may grant access to a larger number of GPUs than the application requires. In this case, it is possible to further limit the number of GPUs passed through into the container, or even write an expression to select GPUs matching certain parameters.
A modified example of the application specification mentioned in the previous step:
name: sample-gpu-app
version: 0.0.2
services:
  - name: s
    mode: replicated
    replicas: 1
    containers:
      - name: c
        image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
        entrypoint:
          - /bin/bash
        cmd:
          - "-c"
          - sleep infinity
        gpu:
          labels:
            - all
          gpu-patterns:
            - vendor == "NVIDIA", memory >= "4 GiB", driver-version > "515", display-mode == "Enabled"
          number-gpus: 1
In this example the application is still requesting a GPU matching the GPU label all, but further refines the requested GPUs with the gpu-patterns and number-gpus statements.
The gpu-patterns statement allows the application owner to select a GPU corresponding to a certain set of parameters. In this example the GPU memory must be at least 4 GiB, the NVIDIA driver version must be newer than 515, and the GPU must have a connected display. This does not really narrow down the set of GPUs in this particular example, as both GPUs on this site match the expression, but it makes sense in environments where different GPU models are present. The syntax of the gpu-patterns expression is described in detail in the reference documentation for the application object.
The number-gpus statement indicates the exact number of GPUs to be passed through into the container. If the number of GPUs matching the gpu-labels and gpu-patterns expressions is greater than the desired number, exactly the desired number is selected by the scheduler and passed through to the container. If the number of matching GPUs is lower than the desired number, the application fails to start.
As a result of deploying this application one GPU is passed through to the container:
supctl show -s stockholm-sergel \
applications sample-gpu-app service-instances s-1 \
--fields oper-status,containers/[name,gpus]
oper-status: running
containers:
  - name: c
    gpus:
      - id: GPU-b75c47d9-5fb4-63e0-a07b-ff2633af741c