
Configuring Intel Arc GPUs with Kubernetes Dynamic Resource Allocation (DRA)

Overview

This article walks through configuring Intel Arc 140T GPUs with Kubernetes Dynamic Resource Allocation (DRA), which graduated to GA in the v1.34 release.1 DRA provides native GPU scheduling where resources are claimed and released with pods, making it ideal for AI inference, video transcoding, and other GPU-accelerated workloads. We'll enable the required feature gates, configure CDI paths for Talos compatibility, and deploy the Intel GPU Resource Driver.

Tip: Having trouble? See v0.9.0 for reference.

Before You Begin

Prerequisites

Why This Approach

GPU access methods:

Method               | Use Case                                                    | Article
Talos i915 extension | Creates /dev/dri devices on host                            | Talhelper Cluster Bootstrap
DRA (this article)   | Kubernetes-native GPU allocation via ResourceClaims         | Here
hostPath mount       | Simple /dev/dri passthrough for apps that don't support DRA | Plex Intel GPU Transcoding

DRA (Dynamic Resource Allocation)2 provides Kubernetes-native GPU management - resources are claimed and released with pods rather than always allocated, with native multi-GPU support and standard APIs. It's ideal for AI workloads needing dynamic allocation or resource constraints. For simpler apps that just need GPU access (like Plex), hostPath works fine.
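For contrast, the hostPath route skips claims entirely and mounts the device nodes straight into the pod. A minimal sketch (the pod, container, and image names are illustrative, not from any real chart):

```yaml
---
# hostPath alternative: pass /dev/dri directly into a pod (no DRA involved).
# Suits apps like Plex whose charts don't support ResourceClaims.
apiVersion: v1
kind: Pod
metadata:
    name: transcode-app # illustrative
spec:
    containers:
        - name: app
          image: example/app # illustrative
          volumeMounts:
              - name: dri
                mountPath: /dev/dri
          securityContext:
              privileged: true # or grant the render group via supplementalGroups
    volumes:
        - name: dri
          hostPath:
              path: /dev/dri
```

The trade-off: the scheduler has no idea the GPU is in use, so nothing stops every pod on the node from sharing (or contending for) the same device.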

Verify GPU Devices

The i915 extension installs GPU drivers. Verify devices are visible.

DRI (Direct Rendering Infrastructure)

talosctl -n 192.168.1.30 ls /dev/dri
talosctl -n 192.168.1.31 ls /dev/dri

Expected output:

card0
renderD128

Enable DRA and Configure CDI

DRA requires feature gates on control plane and kubelet. Talos also needs CDI path configuration - the Intel driver writes device specs to /var/run/cdi, but containerd doesn't check there by default.3
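For context, a CDI spec is just a small YAML file enumerating device nodes and the edits to apply to a container; containerd reads it from the configured spec dirs. The Intel driver generates the real file - the shape is roughly (illustrative sketch, field values are assumptions):

```yaml
# Illustrative CDI spec, as containerd would read it from /var/run/cdi.
# The actual file is generated by the Intel GPU Resource Driver.
cdiVersion: "0.6.0"
kind: intel.com/gpu
devices:
    - name: "0"
      containerEdits:
          deviceNodes:
              - path: /dev/dri/card0
              - path: /dev/dri/renderD128
```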

TalosConfig

talos/talconfig.yaml:

---
clusterName: homelab-cluster
# ... existing cluster config ...
cniConfig:
    name: flannel

# ========== ADD to talconfig.yaml (after cniConfig): DRA feature gates ==========
controlPlane:
    patches:
        - |-
            cluster:
              apiServer:
                extraArgs:
                  feature-gates: DynamicResourceAllocation=true
                  runtime-config: resource.k8s.io/v1=true
              controllerManager:
                extraArgs:
                  feature-gates: DynamicResourceAllocation=true
              scheduler:
                extraArgs:
                  feature-gates: DynamicResourceAllocation=true
# ==============================================================
nodes:
    - hostname: talos-node-1
      # ... existing node config (ipAddress, installDiskSelector, networkInterfaces) ...
      schematic:
          customization:
              systemExtensions:
                  officialExtensions:
                      - siderolabs/i915
                      - siderolabs/intel-ucode
                      - siderolabs/iscsi-tools
      # ========== ADD to talos-node-1: DRA kubelet + CDI paths ==========
      patches:
          - |-
              machine:
                kubelet:
                  extraArgs:
                    feature-gates: DynamicResourceAllocation=true
                files:
                  - path: /etc/cri/conf.d/20-customization.part
                    op: create
                    content: |
                      [plugins."io.containerd.cri.v1.runtime"]
                        cdi_spec_dirs = ["/var/run/cdi"]
      # ================================================================

    - hostname: talos-node-2
      # ... existing node config ...
      schematic:
          # ... existing schematic ...
      # ========== ADD to talos-node-2: Same patches as node-1 ==========
      patches:
          - |-
              machine:
                kubelet:
                  extraArgs:
                    feature-gates: DynamicResourceAllocation=true
                files:
                  - path: /etc/cri/conf.d/20-customization.part
                    op: create
                    content: |
                      [plugins."io.containerd.cri.v1.runtime"]
                        cdi_spec_dirs = ["/var/run/cdi"]
      # =====================================================================

Apply Configuration

Regenerate and apply:

cd ~/homelab/talos

SOPS_AGE_KEY_FILE=<(op document get "sops-key | homelab") \
  talhelper genconfig

# Apply to node 1 (will reboot - changing machine files requires it)
talosctl apply-config \
  --nodes 192.168.1.30 \
  --file clusterconfig/homelab-cluster-talos-node-1.yaml

# Wait for cluster health (check via control plane)
talosctl -n 192.168.1.30 health --wait-timeout 5m

Note: The health check retries until all checks pass, then exits. You'll see transient errors like connection refused, missing static pods, or no ready pods as control plane components restart with the new feature gates. This is expected (~1-2 minutes).

# Apply to node 2 (will also reboot)
talosctl apply-config \
  --nodes 192.168.1.31 \
  --file clusterconfig/homelab-cluster-talos-node-2.yaml

# Wait for cluster health (check via control plane)
talosctl -n 192.168.1.30 health --wait-timeout 5m

Verify DRA API

kubectl api-resources | grep deviceclasses

Expected:

deviceclasses                            resource.k8s.io/v1   false   DeviceClass

This confirms the DRA API is registered. If DRA is NOT enabled, this command returns nothing.

Commit Changes

cd ~/homelab
git add talos/talconfig.yaml
git commit -m "feat(talos): enable DRA feature gates and CDI paths"

Deploy Intel GPU Resource Driver

The Intel GPU Resource Driver4 is different from the older device plugin - it registers GPUs with DRA instead of advertising extended resources. The upstream manifests need patches for Talos compatibility:

  1. Namespace: Add PSA labels (Talos enforces pod security)
  2. DaemonSet: Remove /etc/cdi volume mount (Talos /etc is read-only)

Init Workspace

cd ~/homelab
export KUBECONFIG=$(pwd)/talos/clusterconfig/kubeconfig
mkdir -p k8s/core/gpu
git checkout -b dev

Namespace Patch

k8s/core/gpu/namespace-patch.yaml:

---
apiVersion: v1
kind: Namespace
metadata:
    name: intel-gpu-resource-driver
    labels:
        pod-security.kubernetes.io/enforce: privileged
        pod-security.kubernetes.io/audit: privileged
        pod-security.kubernetes.io/warn: privileged

DaemonSet Patch

k8s/core/gpu/daemonset-patch.json:

[
    {
        "op": "replace",
        "path": "/spec/template/spec/volumes/2",
        "value": {
            "name": "cdi",
            "hostPath": {
                "path": "/var/run/cdi"
            }
        }
    }
]

Note: The upstream manifest mounts /etc/cdi from the host, but Talos's /etc is read-only. This patch redirects the cdi volume to /var/run/cdi (writable). The driver writes CDI specs to /etc/cdi inside the container → /var/run/cdi on the host → containerd reads them there.
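The JSON patch assumes the cdi volume sits at index 2 of the DaemonSet's volume list in the pinned upstream revision; re-check that index if you bump the ref. After patching, the rendered volume looks like this (abbreviated sketch):

```yaml
# Rendered result of the JSON patch (abbreviated).
volumes:
    # ...volumes 0 and 1 from upstream, unchanged...
    - name: cdi
      hostPath:
          path: /var/run/cdi
```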

ResourceClaimTemplate

k8s/core/gpu/resourceclaimtemplate.yaml:

---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
    name: intel-gpu
    namespace: default
spec:
    spec:
        devices:
            requests:
                - name: gpu
                  exactly:
                      deviceClassName: gpu.intel.com

Kustomization

k8s/core/gpu/kustomization.yaml:

---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
    # Intel GPU Resource Driver (uses v1 API, Helm chart lags behind)
    - https://github.com/intel/intel-resource-drivers-for-kubernetes//deployments/gpu/base?ref=826b331a873dd302ce8fe314513d5b8affc039f2
    - resourceclaimtemplate.yaml
patches:
    # Talos compatibility
    - path: namespace-patch.yaml
    - path: daemonset-patch.json
      target:
          kind: DaemonSet
          name: intel-gpu-resource-driver-kubelet-plugin

Note: We reference the upstream manifests directly because the Helm chart (0.7.0) still uses the v1beta1 API, which doesn't exist in Kubernetes 1.34; the main branch uses the v1 API.

Core Kustomization

Add gpu to k8s/core/kustomization.yaml:

---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
    - tailscale
    - metallb
    - ingress-nginx.flux.yaml
    - longhorn
    - gpu

Commit Changes

git add k8s/core/gpu/ k8s/core/kustomization.yaml
git commit -m "feat(gpu): add Intel GPU resource driver"

Deploy

Merge and Push

git checkout main
git merge --ff-only dev
git push

Reconcile Flux

flux reconcile kustomization sync
flux get kustomizations

Verify DRA

DRA Components

# Check resource driver pods
kubectl get pods -n intel-gpu-resource-driver

# Check GPU resources discovered
kubectl get resourceslices

# Check device class registered
kubectl get deviceclasses

# Verify CDI specs are being created (pick either node IP)
talosctl -n 192.168.1.30 ls /var/run/cdi

Expected:

  • Pods: Running on both nodes
  • ResourceSlices: One per node (e.g., talos-node-1-gpu.intel.com-xxxxx)
  • DeviceClass: gpu.intel.com
  • CDI: Should show spec files like intel.com-gpu.yaml (not empty)

Test GPU Access

Deploy Test Pod

Deploy a test pod using ResourceClaim:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
  - name: test
    image: busybox
    command: ["sh", "-c", "ls -la /dev/dri && sleep 30"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: intel-gpu
EOF

# Check status and logs
kubectl get pod gpu-test
kubectl logs gpu-test

Expected output:

total 0
drwxr-xr-x    2 root     root            80 Dec 19 07:06 .
drwxr-xr-x    6 root     root           360 Dec 19 07:06 ..
crw-rw-rw-    1 root     root      226,   0 Dec 19 07:06 card0
crw-rw-rw-    1 root     root      226, 128 Dec 19 07:06 renderD128

Clean Up Test

kubectl delete pod gpu-test

Next Steps

With DRA configured, GPUs are available for workloads that support ResourceClaims. For applications that don't support DRA (like Plex's Helm chart), use hostPath mounts instead.

See: Plex Intel GPU Transcoding (uses hostPath for GPU access)

Resources

Footnotes

  1. Kubernetes, "Kubernetes v1.34: DRA has graduated to GA," kubernetes.io. Accessed: Dec. 16, 2025. [Online]. Available: https://kubernetes.io/blog/2025/09/01/kubernetes-v1-34-dra-updates/

  2. Kubernetes, "Dynamic Resource Allocation," kubernetes.io. Accessed: Dec. 16, 2025. [Online]. Available: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/

  3. R. Broersma, "Talos Linux and Dynamic Resource Allocation," broersma.dev. Accessed: Dec. 16, 2025. [Online]. Available: https://broersma.dev/talos-linux-and-dynamic-resource-allocation-beta/

  4. Intel, "Intel GPU Resource Driver," github.com. Accessed: Dec. 16, 2025. [Online]. Available: https://github.com/intel/intel-resource-drivers-for-kubernetes
