Homelab
Intel Arc Kubernetes DRA
Configuring Intel Arc GPUs with Kubernetes Dynamic Resource Allocation (DRA)
Overview
This article walks through configuring Intel Arc 140T GPUs with Kubernetes Dynamic Resource Allocation (DRA). DRA provides native GPU scheduling where resources are claimed and released with pods, making it ideal for AI inference, video transcoding, and other GPU-accelerated workloads. We'll enable the required feature gates, configure CDI paths for Talos compatibility, and deploy the Intel GPU Resource Driver.
| Tip: | Having trouble? See v0.9.0 for reference. |
Before You Begin
Prerequisites
- MetalLB, Longhorn, and Ingress-NGINX setup completed
- Both nodes have Intel Arc 140T GPUs
- Talos extensions installed (i915, intel-ucode) from Talhelper Cluster Bootstrap
- Talos 1.9+ (containerd 2.0 with CDI support)
- Kubernetes 1.34+ (DRA GA) [1]
Why This Approach
GPU access methods:
| Method | Use Case | Article |
|---|---|---|
| Talos i915 extension | Creates /dev/dri devices on host | Talhelper Cluster Bootstrap |
| DRA (this article) | Kubernetes-native GPU allocation via ResourceClaims | Here |
| hostPath mount | Simple /dev/dri passthrough for apps that don't support DRA | Plex Intel GPU Transcoding |
DRA (Dynamic Resource Allocation) [2] provides Kubernetes-native GPU management - resources are claimed and released with pods rather than always allocated, with native multi-GPU support and standard APIs. It's ideal for AI workloads needing dynamic allocation or resource constraints. For simpler apps that just need GPU access (like Plex), hostPath works fine.
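For contrast, a hostPath-based pod (the approach the Plex article takes) skips claims entirely and mounts the device nodes directly. This is only a sketch; the pod name and securityContext here are illustrative assumptions:

```yaml
# Sketch of the non-DRA alternative: direct /dev/dri passthrough.
# The scheduler has no awareness of GPU usage with this approach.
apiVersion: v1
kind: Pod
metadata:
  name: hostpath-gpu-example  # hypothetical name
spec:
  containers:
    - name: app
      image: busybox
      volumeMounts:
        - name: dri
          mountPath: /dev/dri
      securityContext:
        privileged: true  # simplest way to open the device nodes; tighten per workload
  volumes:
    - name: dri
      hostPath:
        path: /dev/dri
```

With hostPath, every such pod sees every GPU, which is fine for a single-app node but gives Kubernetes no way to arbitrate between GPU consumers - exactly the gap DRA closes.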
Verify GPU Devices
The i915 extension installs the GPU driver, which exposes DRI (Direct Rendering Infrastructure) devices on the host. Verify they are visible on both nodes:
talosctl -n 192.168.1.30 ls /dev/dri
talosctl -n 192.168.1.31 ls /dev/dri
Expected output:
card0
renderD128
Enable DRA and Configure CDI
DRA requires feature gates on control plane and kubelet. Talos also needs CDI path configuration - the Intel driver writes device specs to /var/run/cdi, but containerd doesn't check there by default. [3]
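For context, a CDI spec is a small YAML file that maps a vendor device name to the device nodes a container should receive. A representative example (field values are illustrative, not copied from the Intel driver's actual output):

```yaml
# Illustrative CDI spec, e.g. /var/run/cdi/intel.com-gpu.yaml on the host.
cdiVersion: "0.6.0"
kind: intel.com/gpu
devices:
  - name: "0"
    containerEdits:
      deviceNodes:
        - path: /dev/dri/card0
        - path: /dev/dri/renderD128
```

containerd resolves an allocated device (e.g. intel.com/gpu device "0") against these specs at container creation, which is why cdi_spec_dirs must point at the directory the driver actually writes to.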
TalosConfig
talos/talconfig.yaml:
---
clusterName: homelab-cluster
# ... existing cluster config ...
cniConfig:
  name: flannel
# ========== ADD to talconfig.yaml (after cniConfig): DRA feature gates ==========
controlPlane:
  patches:
    - |-
      cluster:
        apiServer:
          extraArgs:
            feature-gates: DynamicResourceAllocation=true
            runtime-config: resource.k8s.io/v1=true
        controllerManager:
          extraArgs:
            feature-gates: DynamicResourceAllocation=true
        scheduler:
          extraArgs:
            feature-gates: DynamicResourceAllocation=true
# ==============================================================
nodes:
  - hostname: talos-node-1
    # ... existing node config (ipAddress, installDiskSelector, networkInterfaces) ...
    schematic:
      customization:
        systemExtensions:
          officialExtensions:
            - siderolabs/i915
            - siderolabs/intel-ucode
            - siderolabs/iscsi-tools
    # ========== ADD to talos-node-1: DRA kubelet + CDI paths ==========
    patches:
      - |-
        machine:
          kubelet:
            extraArgs:
              feature-gates: DynamicResourceAllocation=true
          files:
            - path: /etc/cri/conf.d/20-customization.part
              op: create
              content: |
                [plugins."io.containerd.cri.v1.runtime"]
                cdi_spec_dirs = ["/var/run/cdi"]
    # ================================================================
  - hostname: talos-node-2
    # ... existing node config ...
    schematic:
      # ... existing schematic ...
    # ========== ADD to talos-node-2: Same patches as node-1 ==========
    patches:
      - |-
        machine:
          kubelet:
            extraArgs:
              feature-gates: DynamicResourceAllocation=true
          files:
            - path: /etc/cri/conf.d/20-customization.part
              op: create
              content: |
                [plugins."io.containerd.cri.v1.runtime"]
                cdi_spec_dirs = ["/var/run/cdi"]
    # =====================================================================
Apply Configuration
Regenerate and apply:
cd ~/homelab/talos
SOPS_AGE_KEY_FILE=<(op document get "sops-key | homelab") \
talhelper genconfig
# Apply to node 1 (will reboot - files change requires it)
talosctl apply-config \
--nodes 192.168.1.30 \
--file clusterconfig/homelab-cluster-talos-node-1.yaml
# Wait for cluster health (check via control plane)
talosctl -n 192.168.1.30 health --wait-timeout 5m
| Note: | The health check retries until all checks pass, then exits. You'll see transient errors like connection refused, missing static pods, or no ready pods as control plane components restart with new feature gates. This is expected (~1-2 minutes). |
# Apply to node 2 (will also reboot)
talosctl apply-config \
--nodes 192.168.1.31 \
--file clusterconfig/homelab-cluster-talos-node-2.yaml
# Wait for cluster health (check via control plane)
talosctl -n 192.168.1.30 health --wait-timeout 5m
Verify DRA API
kubectl api-resources | grep deviceclasses
Expected:
deviceclasses resource.k8s.io/v1 false DeviceClass
This confirms the DRA API is registered. If DRA is NOT enabled, this command returns nothing.
Commit Changes
cd ~/homelab
git add talos/talconfig.yaml
git commit -m "feat(talos): enable DRA feature gates and CDI paths"
Deploy Intel GPU Resource Driver
The Intel GPU Resource Driver [4] is different from the device plugin - it registers GPUs with DRA. The upstream manifests need patches for Talos compatibility:
- Namespace: Add PSA labels (Talos enforces pod security)
- DaemonSet: Redirect the /etc/cdi volume mount to /var/run/cdi (Talos /etc is read-only)
Init Workspace
cd ~/homelab
export KUBECONFIG=$(pwd)/talos/clusterconfig/kubeconfig
mkdir -p k8s/core/gpu
git checkout -b dev
Namespace Patch
k8s/core/gpu/namespace-patch.yaml:
---
apiVersion: v1
kind: Namespace
metadata:
  name: intel-gpu-resource-driver
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged
DaemonSet Patch
k8s/core/gpu/daemonset-patch.json:
[
  {
    "op": "replace",
    "path": "/spec/template/spec/volumes/2",
    "value": {
      "name": "cdi",
      "hostPath": {
        "path": "/var/run/cdi"
      }
    }
  }
]
| Note: | The upstream mounts /etc/cdi from host, but Talos /etc is read-only. This patch redirects the cdi volume to /var/run/cdi (writable). The driver writes CDI specs to /etc/cdi inside the container → /var/run/cdi on host → containerd reads them. |
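If the patch applies cleanly, the third entry under .spec.template.spec.volumes in the rendered DaemonSet should come out as:

```yaml
# Expected result of the JSON patch (volume index 2):
volumes:
  - name: cdi
    hostPath:
      path: /var/run/cdi
```

The container-side mount path is untouched, so the driver still writes to /etc/cdi from its own point of view; only the host side of the bind changes.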
ResourceClaimTemplate
k8s/core/gpu/resourceclaimtemplate.yaml:
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: intel-gpu
  namespace: default
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.intel.com
Kustomization
k8s/core/gpu/kustomization.yaml:
---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Intel GPU Resource Driver (uses v1 API, Helm chart lags behind)
  - https://github.com/intel/intel-resource-drivers-for-kubernetes//deployments/gpu/base?ref=826b331a873dd302ce8fe314513d5b8affc039f2
  - resourceclaimtemplate.yaml
patches:
  # Talos compatibility
  - path: namespace-patch.yaml
  - path: daemonset-patch.json
    target:
      kind: DaemonSet
      name: intel-gpu-resource-driver-kubelet-plugin
| Note: | We reference the upstream manifests directly because the Helm chart (0.7.0) still uses the v1beta1 API, which doesn't exist in Kubernetes 1.34. The main branch uses the v1 API. |
Core Kustomization
Add gpu to k8s/core/kustomization.yaml:
---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - tailscale
  - metallb
  - ingress-nginx.flux.yaml
  - longhorn
  - gpu
Commit Changes
git add k8s/core/gpu/ k8s/core/kustomization.yaml
git commit -m "feat(gpu): add Intel GPU resource driver"
Deploy
Merge and Push
git checkout main
git merge --ff-only dev
git push
Reconcile Flux
flux reconcile kustomization sync
flux get kustomizations
Verify DRA
DRA Components
# Check resource driver pods
kubectl get pods -n intel-gpu-resource-driver
# Check GPU resources discovered
kubectl get resourceslices
# Check device class registered
kubectl get deviceclasses
# Verify CDI specs are being created (pick either node IP)
talosctl -n 192.168.1.30 ls /var/run/cdi
Expected:
- Pods: Running on both nodes
- ResourceSlices: One per node (e.g., talos-node-1-gpu.intel.com-xxxxx)
- DeviceClass: gpu.intel.com
- CDI: Should show spec files like intel.com-gpu.yaml (not empty)
Test GPU Access
Deploy Test Pod
Deploy a test pod using ResourceClaim:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
    - name: test
      image: busybox
      command: ["sh", "-c", "ls -la /dev/dri && sleep 30"]
      resources:
        claims:
          - name: gpu
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: intel-gpu
EOF
# Check status and logs
kubectl get pod gpu-test
kubectl logs gpu-test
Expected output:
total 0
drwxr-xr-x 2 root root 80 Dec 19 07:06 .
drwxr-xr-x 6 root root 360 Dec 19 07:06 ..
crw-rw-rw- 1 root root 226, 0 Dec 19 07:06 card0
crw-rw-rw- 1 root root 226, 128 Dec 19 07:06 renderD128
Clean Up Test
kubectl delete pod gpu-test
Next Steps
With DRA configured, GPUs are available for workloads that support ResourceClaims. For applications that don't support DRA (like Plex's Helm chart), use hostPath mounts instead.
See: Plex Intel GPU Transcoding (uses hostPath for GPU access)
Resources
Footnotes
1. Kubernetes, "Kubernetes v1.34: DRA has graduated to GA," kubernetes.io. Accessed: Dec. 16, 2025. [Online]. Available: https://kubernetes.io/blog/2025/09/01/kubernetes-v1-34-dra-updates/
2. Kubernetes, "Dynamic Resource Allocation," kubernetes.io. Accessed: Dec. 16, 2025. [Online]. Available: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/
3. R. Broersma, "Talos Linux and Dynamic Resource Allocation," broersma.dev. Accessed: Dec. 16, 2025. [Online]. Available: https://broersma.dev/talos-linux-and-dynamic-resource-allocation-beta/
4. Intel, "Intel GPU Resource Driver," github.com. Accessed: Dec. 16, 2025. [Online]. Available: https://github.com/intel/intel-resource-drivers-for-kubernetes