Rolling Secure Re-image: SecureBoot and TPM Encryption for Talos Nodes

Overview

In this article we re-image each node with UEFI SecureBoot and TPM-based disk encryption. Using the SecureBoot ISO and updated configuration prepared in the previous article, we perform a rolling re-image, one node at a time, while maintaining etcd quorum and workload availability. The same process applies when adding new nodes to the cluster.

Before You Begin

Prerequisites

Rolling Re-image Strategy

With a 3-node control plane cluster, we re-image one node at a time while maintaining etcd quorum (2 of 3 nodes)1.

Cluster identity is preserved: The cluster CA, etcd CA, and secrets in talsecret.sops.yaml do not change. Only the boot method and encryption config change. Nodes rejoin the same cluster with the same certificates and tokens. Your talosconfig remains valid throughout the entire process.
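
Because cluster identity is preserved, you can spot-check at any point that the existing talosconfig still authenticates by querying any node that is currently up (node-2's IP shown here as an example; any running node works):

talosctl --nodes 192.168.10.31 --endpoints 192.168.10.31 version

Expected: Client and server versions print with no certificate errors.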

Process for each node:

  1. Drain workloads from the node
  2. Enter BIOS and enable SecureBoot in setup mode
  3. Boot from SecureBoot USB (keys auto-enroll, enters maintenance mode)
  4. Apply config with --insecure (node installs with SecureBoot + TPM encryption)
  5. Verify SecureBoot and encryption
  6. Uncordon and proceed to next node

Cluster endpoint: The kubeconfig endpoint is node-1 (192.168.10.30). While node-1 is being re-imaged, kubectl commands will fail. The talosctl commands in this article target nodes directly with --insecure or --nodes, so they work regardless. Once node-1 is back and uncordoned, kubectl resumes working for the remaining nodes.
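
If you are unsure where your kubeconfig points, print its configured API endpoint (plain kubectl, no extra tooling assumed):

kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'

Expected: An address on node-1 (192.168.10.30), confirming that kubectl depends on node-1 being up.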

Longhorn impact: With 2 replicas across 3 nodes, each volume's replicas exist on 2 of the 3 nodes. The drain step before each wipe is critical: Longhorn's default drain policy (block-if-contains-last-replica) blocks the drain if the node holds the last healthy replica of any volume2. After each wipe, wait for replicas to fully rebuild before proceeding to the next node.
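
To confirm which drain policy is in effect before starting (assuming Longhorn's default setting name, node-drain-policy):

kubectl -n longhorn-system get settings.longhorn.io node-drain-policy \
  -o jsonpath='{.value}'

Expected: block-if-contains-last-replica.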

Adding future nodes: This same process applies when expanding the cluster. Generate a node config with talhelper, prepare the SecureBoot USB, and follow the per-node steps below.
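
As a rough sketch of that flow, assuming talconfig.yaml lives under talos/ as in the earlier articles and the new node's block has already been added to it:

cd talos
# regenerate per-node configs; the new node's YAML lands next to the existing ones
talhelper genconfig
ls clusterconfig/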

Re-image Node

Complete one node at a time to maintain etcd quorum. Repeat this entire section for each node.

Per-Node Values

Value        Node 1                             Node 2                             Node 3
Hostname     talos-node-1                       talos-node-2                       talos-node-3
IP           192.168.10.30                      192.168.10.31                      192.168.10.32
Config file  homelab-cluster-talos-node-1.yaml  homelab-cluster-talos-node-2.yaml  homelab-cluster-talos-node-3.yaml

Tip: With identical hardware, label each machine with its hostname and IP to avoid confusion during physical operations.

Set these variables at the start of each node pass. All commands below use them:

TALOS_NODE_HOSTNAME=talos-node-1
TALOS_NODE_IP=192.168.10.30
TALOS_NODE_CONFIG=homelab-cluster-talos-node-1.yaml

Verify Cluster Health

Confirm the cluster is healthy before draining:

kubectl get nodes

Expected: All nodes show Ready.

talosctl --nodes $TALOS_NODE_IP health

Expected: All checks pass.

Drain Node

kubectl drain $TALOS_NODE_HOSTNAME \
  --ignore-daemonsets \
  --delete-emptydir-data

Wait for workloads to migrate to other nodes. The drain retries on Longhorn instance-manager PodDisruptionBudget violations, which typically resolve within ~40 seconds. If the drain fails with transient connection errors, re-run the same command; the node is already cordoned and the drain resumes where it left off.
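
To confirm nothing but DaemonSet pods and control-plane static pods remain on the node before wiping it, list the pods still scheduled there (standard kubectl field selector, nothing extra assumed):

kubectl get pods -A -o wide \
  --field-selector spec.nodeName=$TALOS_NODE_HOSTNAME

Expected: Only DaemonSet-managed pods and control-plane static pods remain.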

Save Pre-Wipe State

Note the MAC address for Wake on LAN before wiping (HW ADDR column):

talosctl --nodes $TALOS_NODE_IP get links | grep -E "^NODE|enp"
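
If you later need to power the node on remotely, a Wake on LAN tool can send the magic packet to that address; for example, with the wakeonlan utility (the MAC below is a placeholder, substitute the HW ADDR you noted):

wakeonlan aa:bb:cc:dd:ee:ff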

Save the Longhorn disk spec for this node. The disk wipe creates a new filesystem UUID, and we'll need to re-add the disk entry after the node rejoins:

TALOS_NODE_DISK_SPEC=$(kubectl get nodes.longhorn.io \
  -n longhorn-system \
  $TALOS_NODE_HOSTNAME \
  -o jsonpath='{.spec.disks}')

echo "$TALOS_NODE_DISK_SPEC" | python3 -m json.tool

Wipe Disk

Wipe the existing Talos installation. The SecureBoot ISO includes the talos.halt_if_installed kernel argument, which prevents it from booting when an existing installation is detected on disk. Wiping first avoids this.

Insert the SecureBoot USB, then wipe and reboot:

talosctl --nodes $TALOS_NODE_IP reset --reboot

Be ready to press Delete to enter BIOS as the node reboots.

Configure BIOS

Press Delete immediately as the GEEKOM splash screen appears to enter BIOS (Aptio Setup - AMI):

  1. Verify these existing settings are preserved from the original installation3:

     Setting                    Location                                    Expected
     Wake Up by LAN             Advanced > Power Management Configuration   Enabled
     Power-On after Power-Fail  Advanced > Power Management Configuration   Power On
     Turbo Mode                 Advanced > CPU Configuration                Enabled

  2. Navigate to Boot and verify USB is first in boot priority (should be preserved from the original installation3):

     Setting               Expected
     Boot Device Priority  USB KEY: UEFI... first

  3. Navigate to Security > Secure Boot:

     Setting              Before    Action
     Secure Boot Control  Disabled  Enable
     Secure Boot Mode     Custom    Set to Custom (if not already)

  4. Navigate to Security > Secure Boot > Expert Key Management:

     Setting                Before   Action
     Factory Key Provision  Enabled  Disable (prevents factory keys from being re-enrolled on reboot)

     Action               Details
     Reset To Setup Mode  Clears all keys (PK, KEK, db, dbx) and puts firmware into Setup Mode so the Talos ISO can auto-enroll Sidero Labs signing keys on first boot4. Select No when prompted to "reset without saving"; we save everything together in the next step.

  5. Save and exit (F4). The system reboots and boots from the USB.

Boot from SecureBoot USB

The node boots from USB into the systemd-boot menu. Select Enroll Secure Boot keys: auto before the timeout. This enrolls the Sidero Labs signing keys into the UEFI firmware4. The node then reboots and boots into maintenance mode.

Important: If the enrollment succeeded, the console shows STAGE: Maintenance and SECUREBOOT shows true in the next step. If SECUREBOOT shows false, the menu timed out before enrollment. Reboot and select the option again.

Verify Maintenance Mode

The node gets a temporary DHCP address in maintenance mode. Check the console screen for the IP and set it:

TALOS_NODE_DHCP_IP=<IP from console>
talosctl --nodes $TALOS_NODE_DHCP_IP get securitystate --insecure

Expected: SECUREBOOT shows true.

Apply Config

Apply the config using the DHCP address from the console. The config contains the static IP, so the node switches to its assigned IP after installation.

talosctl apply-config --insecure \
  --nodes $TALOS_NODE_DHCP_IP \
  --file talos/clusterconfig/$TALOS_NODE_CONFIG

The --insecure flag is required because the node is in maintenance mode without TLS certificates5. The node installs with the SecureBoot installer, creates LUKS2-encrypted partitions sealed with TPM6, and reboots.

Important: Remove the USB drive before the node reboots. Otherwise it will boot from the ISO again instead of the installed system.
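
To double-check the stanza driving the encryption behavior, you can inspect the generated config before or after applying it (a sketch assuming yq is installed and the config layout from the previous article):

yq '.machine.systemDiskEncryption' talos/clusterconfig/$TALOS_NODE_CONFIG

Expected: provider: luks2 with a single tpm key in slot 0 for both state and ephemeral.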

Wait for Node

talosctl --nodes $TALOS_NODE_IP health --wait-timeout 10m

The health check may report some nodes are not schedulable. This is expected since the node is still cordoned from the drain. Press Ctrl+C once all other checks pass.

Verify SecureBoot

talosctl --nodes $TALOS_NODE_IP get securitystate

Expected: SECUREBOOT shows true.

Verify Encryption

talosctl --nodes $TALOS_NODE_IP get volumeconfig STATE -o yaml

Expected: The encryption section shows provider: luks2 with tpm key type in slot 0.

talosctl --nodes $TALOS_NODE_IP get volumeconfig EPHEMERAL -o yaml

Expected: Same encryption config for EPHEMERAL.

Uncordon Node

kubectl uncordon $TALOS_NODE_HOSTNAME

Verify Cluster State

kubectl get nodes

Expected: All nodes show Ready.

talosctl --nodes $TALOS_NODE_IP etcd members

Expected: Three etcd members. The re-imaged node rejoins the existing etcd cluster automatically. The same IP and cluster CA from talsecret.sops.yaml allow it to re-establish membership.

Verify Data Replication

Confirm data is safe before cleaning up stale Longhorn state from the disk wipe.

1. Verify PVCs are bound:

kubectl get pvc -A

Expected: All active PVCs show Bound status.

2. Verify each volume has a running replica:

kubectl get replicas -n longhorn-system -o wide | grep running

Expected: Each active PVC has at least one running replica. Volumes will show as degraded (1 replica instead of 2). This is expected after a node wipe.

The disk wipe leaves stale replica records and a mismatched disk UUID that prevent Longhorn from rebuilding. Fix both:

3. Delete stale replicas. Longhorn will refuse to delete the last available replica for any volume:

kubectl get replicas -n longhorn-system \
  | grep stopped \
  | awk '{print $1}' \
  | xargs -I {} kubectl delete replica {} \
    -n longhorn-system 2>&1 \
  | grep -v "cannot delete"

4. Fix the disk UUID. The new filesystem has a different UUID than what Longhorn has cached. Disable scheduling, remove the stale disk entry, then re-add it to force Longhorn to pick up the new UUID.

Disable scheduling on the stale disk:

TALOS_NODE_DISK=$(kubectl get nodes.longhorn.io \
  -n longhorn-system \
  $TALOS_NODE_HOSTNAME \
  -o json | \
  python3 -c "import sys,json; d=json.load(sys.stdin); \
    print(list(d['spec']['disks'].keys())[0])")

kubectl patch nodes.longhorn.io \
  -n longhorn-system \
  $TALOS_NODE_HOSTNAME \
  --type merge \
  -p "{\"spec\":{\"disks\":{\
    \"$TALOS_NODE_DISK\":{\"allowScheduling\":false}}}}"

Remove the stale disk entry:

kubectl patch nodes.longhorn.io \
  -n longhorn-system \
  $TALOS_NODE_HOSTNAME \
  --type json \
  -p '[{"op":"replace","path":"/spec/disks","value":{}}]'

Wait for Longhorn to finish syncing, then re-add the disk entry using the spec saved in Save Pre-Wipe State. If the patch returns a "syncing" error, wait a few seconds and retry:

kubectl patch nodes.longhorn.io \
  -n longhorn-system \
  $TALOS_NODE_HOSTNAME \
  --type merge \
  -p "{\"spec\":{\"disks\":$TALOS_NODE_DISK_SPEC}}"

Longhorn detects the disk with the correct UUID and begins rebuilding replicas.

5. Verify replicas are rebuilding:

kubectl get replicas -n longhorn-system -o wide | grep -v stopped

Expected: Each active volume should have 2 replicas in running state across different nodes. If replicas only exist on one node, wait a few minutes and check again.

6. Verify workload pods are running:

kubectl get pods -A -o wide | grep -E "plex|minecraft|factorio"

Expected: Workload pods show Running status.

Wait for all attached volumes to return to healthy:

kubectl get volumes -n longhorn-system

Expected: All attached volumes show healthy robustness. Volumes in the detached or unknown state are not in use and can be ignored.

Important: Wait for all attached volumes to show healthy before starting the next node. Draining a node while volumes are degraded risks losing all replicas.
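
To avoid polling by hand, a small wait loop works; this sketch assumes the Longhorn Volume CRD reports status.state and status.robustness, which is how current releases expose them:

while kubectl get volumes.longhorn.io -n longhorn-system \
    -o jsonpath='{range .items[?(@.status.state=="attached")]}{.status.robustness}{"\n"}{end}' \
    | grep -qv '^healthy$'; do
  echo "attached volumes still rebuilding..."
  sleep 10
done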

Repeat for Remaining Nodes

Return to Drain Node and repeat the process for the next node using the values from the Per-Node Values table. Complete all nodes before proceeding.

Verify All Nodes

After all nodes are re-imaged, confirm workloads reconciled across the full cluster.

flux get kustomizations

Expected: All kustomizations show Ready. Flux reconciles all K8s resources (NetworkPolicies, HelmReleases, encrypted StorageClasses, Tailscale connector, MetalLB pool) from Git after nodes rejoin.

kubectl get pods -A | grep -v Running | grep -v Completed

Expected: Only header line. All pods should be Running or Completed.

Clean Up Detached Volumes

The old pre-encryption PVCs are no longer in use. Verify they are detached and have no bound PVC before deleting:

kubectl get volumes -n longhorn-system \
  | grep detached
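
To cross-check the PVC side, look up each detached volume's PV; for dynamically provisioned Longhorn volumes the PV name matches the volume name (an assumption worth verifying in your cluster):

for vol in $(kubectl get volumes -n longhorn-system \
    | grep detached | awk '{print $1}'); do
  kubectl get pv "$vol" --ignore-not-found
done

Expected: No PV shows Bound; volumes whose PV was already deleted print nothing.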

Delete all detached volumes:

kubectl get volumes -n longhorn-system \
  | grep detached \
  | awk '{print $1}' \
  | xargs -I {} kubectl delete volume {} \
    -n longhorn-system

Next Steps

All three nodes now boot with UEFI SecureBoot and TPM-sealed LUKS2 encrypted partitions. Follow this same process when adding new nodes.

For the complete security overview and Talos Security Checklist progress, see: Security Hardening Series

Resources

Footnotes

  1. etcd Authors, "Frequently Asked Questions," etcd.io. Accessed: Mar. 2, 2026. [Online]. Available: https://etcd.io/docs/v3.5/faq/

  2. Longhorn Authors, "Node Failure," longhorn.io. Accessed: Mar. 2, 2026. [Online]. Available: https://longhorn.io/docs/1.6.0/high-availability/node-failure/

  3. lelopez, "Talos Linux USB Installation," lelopez.io. Accessed: Mar. 3, 2026. [Online]. Available: /blog/homelab-v2-03-talos-linux-usb-install

  4. Sidero Labs, "SecureBoot," docs.siderolabs.com. Accessed: Mar. 3, 2026. [Online]. Available: https://docs.siderolabs.com/talos/v1.12/platform-specific-installations/bare-metal-platforms/secureboot

  5. Sidero Labs, "talosctl CLI Reference," docs.siderolabs.com. Accessed: Mar. 2, 2026. [Online]. Available: https://docs.siderolabs.com/talos/v1.12/reference/cli/

  6. Sidero Labs, "Disk Encryption," docs.siderolabs.com. Accessed: Mar. 2, 2026. [Online]. Available: https://docs.siderolabs.com/talos/v1.12/configure-your-talos-cluster/storage-and-disk-management/disk-encryption
