Restoring a Kubernetes app isn't just kubectl apply
February 2026 — backup is easy, restore is where you find out if your backup actually works
Every infrastructure guide talks about backups. Almost none talk honestly about restores.
Taking a backup is a one-way operation: copy data somewhere safe. Restoring is a multi-step, stateful, order-dependent process that involves coordinating multiple Kubernetes subsystems simultaneously. For a typical self-hosted application running on Kubernetes with PostgreSQL via CNPG and persistent volumes via Longhorn, a restore involves:
- Restoring the Longhorn volume from an S3 snapshot
- Recovering the CNPG cluster from its S3 backup
- Waiting for the recovered database to become ready
- Applying the application manifests with the correct image and configuration
- Verifying the application is healthy and connected to its data
Run a step out of order or miss a dependency, and you get an application that starts with an empty database, a CNPG cluster that can't find its WAL files, or a volume attached to the wrong PVC. These failures are often silent — everything looks like it's running until you check the logs.
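Those silent failure modes are all checkable after the fact. A minimal sketch of a post-restore verification pass — the specific checks and their inputs are hypothetical, not taken from any particular tool:

```python
# Sketch: turn the silent failure modes above into explicit checks.
# The inputs (row count, WAL gap flag, PVC binding) are assumed to come
# from a database query and the Kubernetes API in a real implementation.

def verify_restore(row_count: int, wal_gap: bool, pvc_bound: bool) -> list[str]:
    """Return a list of problems that would otherwise fail silently."""
    problems = []
    if row_count == 0:
        problems.append("database restored but empty - wrong recovery source?")
    if wal_gap:
        problems.append("CNPG cannot find WAL files - check object store path")
    if not pvc_bound:
        problems.append("volume not bound to the expected PVC")
    return problems

# A healthy restore reports no problems:
print(verify_restore(row_count=1200, wal_gap=False, pvc_bound=True))  # []
```

An empty problem list is the success criterion; anything else aborts the restore loudly instead of letting the application start against the wrong data.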
full-app-restore is a tool that handles this coordination — a REST API and web UI for managing restoration jobs across Longhorn and CNPG.
The problem with manual restores
Manual restoration follows a runbook. Runbooks drift. The person who wrote the runbook isn’t always available during the incident that requires a restore. Steps that “obviously” need to happen in order aren’t obvious to someone who’s never done the restore before and is under pressure to get a service back online.
More subtly: manual restores are hard to test. Testing a restore means actually doing a restore, which means either taking a production cluster offline or maintaining a separate environment. Most people don’t test their restores regularly. They find out it’s broken during an actual incident.
A tool that codifies the restore process makes it:
- Testable — you can run it against a staging environment on a schedule
- Auditable — the tool records what happened, in what order, and what the outcome was
- Repeatable — the same sequence runs every time, not whatever the operator remembers
What the tool does
The REST API exposes restoration jobs as a resource. You describe the application you want to restore — which namespace, which Longhorn volumes, which CNPG cluster, which S3 backup — and the tool executes the sequence:
Discovery: before you select anything, the tool scans the cluster for existing Longhorn volume backups and CNPG backup locations. You see what’s available and when each backup was taken, including point-in-time recovery options from CNPG’s WAL archive.
Coordination: the tool knows that the CNPG cluster must be fully ready before the application manifests are applied, and that the Longhorn PVC must be bound before the CNPG cluster starts (if they’re on the same underlying storage). Dependency order is encoded in the tool, not in a human’s memory.
Recovery point visualisation: the web UI shows the available recovery points on a timeline. You can see the last backup taken, whether there are more recent WAL files for PITR, and what data would be present at each point in time.
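Encoding the dependency order can be as simple as a graph of steps and a topological sort. The step names below are hypothetical — the tool's internal naming is not documented here — but the ordering constraints mirror the sequence described above:

```python
from graphlib import TopologicalSorter

# Each step maps to the steps that must complete before it.
# Step names are illustrative, not the tool's actual identifiers.
steps = {
    "restore-longhorn-volume": [],
    "bind-pvc": ["restore-longhorn-volume"],
    "recover-cnpg-cluster": ["bind-pvc"],
    "wait-cnpg-ready": ["recover-cnpg-cluster"],
    "apply-app-manifests": ["wait-cnpg-ready"],
    "verify-health": ["apply-app-manifests"],
}

# static_order() raises CycleError on contradictory constraints,
# which catches a mis-specified runbook before anything executes.
order = list(TopologicalSorter(steps).static_order())
print(order)
```

The point of the graph form is that adding a new dependency is a one-line change, and an impossible ordering fails at plan time rather than mid-restore.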
The S3 configuration
Both Longhorn and CNPG backup to S3. The tool needs S3 credentials to list and access backups. These are configured globally:
```yaml
# global-s3-config.yaml
s3:
  endpoint: "https://s3.homelab.local"
  bucket: "cluster-backups"
  region: "us-east-1"
  accessKey: "..."
  secretKey: "..."

# Tool discovers backup paths automatically:
#   longhorn/  ← Longhorn volume backups
#   cnpg/      ← CNPG cluster backups + WAL archive
```
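The discovery step amounts to listing object keys under those two prefixes and grouping them by name. A sketch, assuming the path convention shown above — the actual key layout written by Longhorn and CNPG is more elaborate:

```python
# Sketch of backup discovery: group S3 object keys under the assumed
# layout (longhorn/<volume>/..., cnpg/<cluster>/...) by backup name.

def discover_backups(keys):
    """Map backup name -> set of sources ('longhorn', 'cnpg') seen in S3."""
    found = {}
    for key in keys:
        parts = key.split("/")
        if len(parts) >= 2 and parts[0] in ("longhorn", "cnpg"):
            found.setdefault(parts[1], set()).add(parts[0])
    return found

# Hypothetical key listing for one application:
keys = [
    "longhorn/myapp-data/backup-2026-02-01/blocks",
    "cnpg/myapp/base/20260201T000000/data.tar",
    "cnpg/myapp/wals/0000000100000000000000A1",
]
print(discover_backups(keys))
```

In a real implementation the listing would come from a paginated S3 `ListObjectsV2` call, and the per-key metadata (timestamps, WAL ranges) is what feeds the recovery-point timeline.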
The tool constructs the correct restore request for CNPG — a Cluster resource with bootstrap.recovery.source and externalClusters pointing to the S3 location:
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: db-myapp-recovered
  namespace: myapp
spec:
  bootstrap:
    recovery:
      source: myapp-backup
  externalClusters:
    - name: myapp-backup
      barmanObjectStore:
        destinationPath: s3://cluster-backups/cnpg/myapp
        endpointURL: https://s3.homelab.local
        s3Credentials:
          accessKeyId:
            name: s3-credentials
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: s3-credentials
            key: SECRET_ACCESS_KEY
```
The tool generates this resource, applies it, and waits for the cluster to reach Ready state before proceeding.
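Generate-apply-wait is a small loop once the manifest is a data structure. A sketch, with the Kubernetes apply and status lookup stubbed out as injected functions — the readiness check against CNPG's reported phase is an assumption about how a tool like this would poll:

```python
import time

def recovery_cluster(name, namespace, source, dest, endpoint):
    """Build a CNPG Cluster manifest equivalent to the YAML above."""
    return {
        "apiVersion": "postgresql.cnpg.io/v1",
        "kind": "Cluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "bootstrap": {"recovery": {"source": source}},
            "externalClusters": [{
                "name": source,
                "barmanObjectStore": {
                    "destinationPath": dest,
                    "endpointURL": endpoint,
                    "s3Credentials": {
                        "accessKeyId": {"name": "s3-credentials", "key": "ACCESS_KEY_ID"},
                        "secretAccessKey": {"name": "s3-credentials", "key": "SECRET_ACCESS_KEY"},
                    },
                },
            }],
        },
    }

def wait_ready(get_phase, timeout=600, interval=10):
    """Poll the cluster's reported phase until healthy, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_phase() == "Cluster in healthy state":
            return True
        time.sleep(interval)
    return False
```

Keeping the manifest as a plain structure makes the S3 location, cluster name, and namespace parameters of the restore job rather than strings buried in a template.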
The web UI
The REST API handles the machine-readable side. The web UI handles the operational side — the person who needs to restore a service at 2am and wants to see progress without reading JSON.
The timeline view shows available recovery points visually. You click the point in time you want to restore to. You see the steps executing in real time. Each step shows its status (waiting, running, complete, failed) and a log of what happened.
For the common case — “restore the most recent backup of application X” — it’s three clicks: select application, confirm recovery point, start restore. The tool handles the rest.
Testing your restores
The tool makes it practical to test restore procedures regularly. A test restore script:
```bash
#!/bin/bash
# Run weekly against staging namespace
RESTORE_API="http://full-app-restore.tools.svc.cluster.local:8080"
TARGET_NAMESPACE="myapp-staging"

# Trigger restore of production backup into staging namespace
curl -X POST "$RESTORE_API/api/restore" \
  -H "Content-Type: application/json" \
  -d '{
    "application": "myapp",
    "targetNamespace": "'"$TARGET_NAMESPACE"'",
    "recoveryPoint": "latest",
    "longhornVolumes": ["myapp-data"],
    "cnpgCluster": "db-myapp-cluster"
  }'

# Wait and check result
# ...alert if restore fails
```
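The wait-and-check step is a polling loop against the restore job's status. A sketch in Python — the shape of the status endpoint is an assumption, so the HTTP call is injected as a function rather than hard-coded:

```python
import time

# Hedged sketch: poll an assumed per-job status endpoint until the restore
# reaches a terminal state. fetch_status stands in for an HTTP GET whose
# URL and response shape depend on the actual API.

def wait_for_restore(fetch_status, timeout=1800, interval=30):
    """Return 'complete', 'failed', or 'timeout' for a restore job."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("complete", "failed"):
            return status
        time.sleep(interval)
    return "timeout"
```

Treating a timeout as a distinct outcome matters for the weekly test: a restore that hangs is just as much an alert as one that fails outright.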
Running this weekly against a staging namespace proves two things: the backups are restorable (a backup that fails to restore is not a backup), and the tool and its configuration are working correctly. By the time you need to use it in production, you’ve already verified it works.
What makes this harder than it looks
The tricky part of CNPG recovery is WAL replay. CNPG doesn’t just restore the base backup — it replays the WAL archive from the backup point up to the selected recovery target. This takes time proportional to how much write activity the database has seen since the last base backup.
If the base backup is from last Monday and today is Friday, a full restore might replay 4 days of WAL. For a busy database, that’s potentially hours. The tool surfaces this — it estimates replay time based on the WAL archive size — so you can decide whether to wait for a full replay or accept data loss by targeting a specific transaction ID.
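The estimate itself is back-of-the-envelope arithmetic. The numbers below are assumptions for illustration — WAL segment size defaults to 16 MiB in PostgreSQL, but replay throughput varies wildly with hardware and workload:

```python
# Rough replay-time estimate: WAL volume since the base backup divided by
# an assumed replay throughput. 50 MiB/s is an illustrative guess, not a
# measured figure.

def estimate_replay_seconds(wal_bytes: int, replay_bytes_per_sec: int = 50 * 1024**2) -> float:
    """Replay time scales roughly with WAL volume since the base backup."""
    return wal_bytes / replay_bytes_per_sec

# Hypothetical busy database: one 16 MiB segment per minute for 4 days.
segments = 4 * 24 * 60
wal = segments * 16 * 1024**2
hours = estimate_replay_seconds(wal) / 3600
print(f"~{hours:.1f} h to replay {wal / 1024**3:.0f} GiB of WAL")
```

Even a rough number like this is enough for the operational decision the tool surfaces: wait for full replay, or target an earlier recovery point and accept the data loss.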
The other tricky part is cleanup after a failed restore attempt. Half-completed Longhorn PVCs, a CNPG cluster stuck in recovery, leftover Kubernetes resources in the target namespace — these need to be cleaned up before retrying. The tool tracks what it created and can roll back a failed restore attempt.
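Tracking created resources for rollback is essentially a stack, unwound in reverse creation order so dependents are removed before the things they depend on. A sketch with the delete action injected — in a real tool it would be Kubernetes API deletes:

```python
# Sketch of tracked-resource rollback for a failed restore attempt.
# Resources are recorded as they are created and deleted newest-first,
# so e.g. a CNPG cluster goes before the PVC it was built on.

class RestoreAttempt:
    def __init__(self, delete):
        self._delete = delete  # injected: (kind, name) -> None
        self._created = []

    def record(self, kind: str, name: str) -> None:
        """Note a resource this attempt created."""
        self._created.append((kind, name))

    def rollback(self) -> None:
        """Delete everything this attempt created, newest first."""
        while self._created:
            kind, name = self._created.pop()
            self._delete(kind, name)
```

Because the attempt only ever deletes what it recorded, a rollback can't touch pre-existing resources in the target namespace — the property that makes retrying safe.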
The honest state of self-hosted backup
Most self-hosters have backups that work in theory. They've verified that the backup job runs. They haven't tested a restore from scratch in the last six months.
Build something that makes restore testing automatic. Run it. Fix what breaks. Then fix it again when you upgrade CNPG or Longhorn and the restore procedure changes. The work of maintaining a working restore procedure is ongoing.
But it’s the only work that matters when everything goes wrong.