Gino Eising
Nerd by Nature
Feb 19, 2026 6 min read

Restoring a Kubernetes app isn't just kubectl apply


February 2026 — backup is easy, restore is where you find out if your backup actually works

Every infrastructure guide talks about backups. Almost none talk honestly about restores.

Taking a backup is a one-way operation: copy data somewhere safe. Restoring is a multi-step, stateful, order-dependent process that involves coordinating multiple Kubernetes subsystems simultaneously. For a typical self-hosted application running on Kubernetes with PostgreSQL via CNPG and persistent volumes via Longhorn, a restore involves:

  1. Restoring the Longhorn volume from an S3 snapshot
  2. Recovering the CNPG cluster from its S3 backup
  3. Waiting for the recovered database to become ready
  4. Applying the application manifests with the correct image and configuration
  5. Verifying the application is healthy and connected to its data

Run a step out of order or miss a dependency, and you get an application that starts with an empty database, or a CNPG cluster that can’t find its WAL files, or a volume that’s attached to the wrong PVC. Recovery fails silently, and you don’t notice until you check the logs.

full-app-restore is a tool that handles this coordination — a REST API and web UI for managing restoration jobs across Longhorn and CNPG.


The problem with manual restores

Manual restoration follows a runbook. Runbooks drift. The person who wrote the runbook isn’t always available during the incident that requires a restore. Steps that “obviously” need to happen in order aren’t obvious to someone who’s never done the restore before and is under pressure to get a service back online.

More subtly: manual restores are hard to test. Testing a restore means actually doing a restore, which means either taking a production cluster offline or maintaining a separate environment. Most people don’t test their restores regularly. They find out it’s broken during an actual incident.

A tool that codifies the restore process makes it:

  • Testable — you can run it against a staging environment on a schedule
  • Auditable — the tool records what happened, in what order, and what the outcome was
  • Repeatable — the same sequence runs every time, not whatever the operator remembers

What the tool does

The REST API exposes restoration jobs as a resource. You describe the application you want to restore — which namespace, which Longhorn volumes, which CNPG cluster, which S3 backup — and the tool executes the sequence:

sequenceDiagram
    participant API as full-app-restore API
    participant L as Longhorn
    participant PG as CNPG
    participant K as Kubernetes
    API->>L: List available volume snapshots
    L-->>API: Snapshot list with timestamps
    API->>L: Create PVC from selected snapshot
    L-->>API: PVC ready
    API->>PG: Create cluster recovery from S3
    PG-->>API: Cluster recovering...
    API->>PG: Wait for cluster Ready
    PG-->>API: Cluster ready
    API->>K: Apply application manifests
    K-->>API: Deployment created
    API->>K: Wait for pods Ready
    K-->>API: Application healthy

Discovery: before you select anything, the tool scans the cluster for existing Longhorn volume backups and CNPG backup locations. You see what’s available and when each backup was taken, including point-in-time recovery options from CNPG’s WAL archive.
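To make the discovery output concrete, here is a small sketch (in Python, purely illustrative, with made-up timestamps) of the kind of summary it produces: from a list of base backups and WAL segments, derive the latest base backup and how far forward PITR can reach:

```python
# Illustrative only: the real tool lists S3 and queries CNPG; here the
# backup metadata is hard-coded. Timestamps are invented for the example.
from datetime import datetime

backups = [
    {"app": "myapp", "kind": "base", "taken": datetime(2026, 2, 16, 2, 0)},
    {"app": "myapp", "kind": "base", "taken": datetime(2026, 2, 9, 2, 0)},
    {"app": "myapp", "kind": "wal",  "taken": datetime(2026, 2, 19, 13, 45)},
]

def recovery_window(items):
    """Return (latest base backup, latest reachable PITR point)."""
    bases = [b["taken"] for b in items if b["kind"] == "base"]
    wals = [b["taken"] for b in items if b["kind"] == "wal"]
    latest_base = max(bases)
    # PITR can target any point between the base backup and the last
    # archived WAL segment; with no newer WAL, only the base is reachable.
    return latest_base, max(wals + [latest_base])

start, end = recovery_window(backups)
print(start, end)
```

Anything between `start` and `end` is a valid recovery target; anything after `end` is lost.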

Coordination: the tool knows that the CNPG cluster must be fully ready before the application manifests are applied, and that the Longhorn PVC must be bound before the CNPG cluster starts (if they’re on the same underlying storage). Dependency order is encoded in the tool, not in a human’s memory.
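As a sketch of what "encoded in the tool" means (illustrative Python, not the tool's actual internals), the sequence can be an ordered list of gated steps: each one runs only after the previous succeeds, and a failure stops the pipeline with the failing step named:

```python
# Hypothetical restore pipeline. Step names mirror the sequence described
# above; the lambdas stand in for real Longhorn/CNPG/Kubernetes API calls.
from typing import Callable

def run_restore(steps: list[tuple[str, Callable[[], bool]]]) -> list[str]:
    """Run steps in order; raise at the first failure, naming the step."""
    completed = []
    for name, action in steps:
        if not action():
            raise RuntimeError(f"restore failed at step: {name}")
        completed.append(name)
    return completed

steps = [
    ("restore-longhorn-pvc", lambda: True),  # PVC must be bound first
    ("recover-cnpg-cluster", lambda: True),  # DB recovery from S3
    ("wait-cnpg-ready",      lambda: True),  # gate: cluster fully Ready
    ("apply-app-manifests",  lambda: True),  # only after the DB is ready
    ("wait-pods-ready",      lambda: True),  # final health check
]

print(run_restore(steps))
```

The point is that the ordering lives in data, not in an operator's memory: reordering the list is a reviewable code change, not a 2am judgment call.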

Recovery point visualisation: the web UI shows the available recovery points on a timeline. You can see the last backup taken, whether there are more recent WAL files for PITR, and what data would be present at each point in time.


The S3 configuration

Both Longhorn and CNPG back up to S3. The tool needs S3 credentials to list and access those backups. These are configured globally:

# global-s3-config.yaml
s3:
  endpoint: "https://s3.homelab.local"
  bucket: "cluster-backups"
  region: "us-east-1"
  accessKey: "..."
  secretKey: "..."

# Tool discovers backup paths automatically:
# longhorn/  ← Longhorn volume backups
# cnpg/      ← CNPG cluster backups + WAL archive

The tool constructs the correct restore request for CNPG — a Cluster resource with bootstrap.recovery.source and externalClusters pointing to the S3 location:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: db-myapp-recovered
  namespace: myapp
spec:
  bootstrap:
    recovery:
      source: myapp-backup
  externalClusters:
    - name: myapp-backup
      barmanObjectStore:
        destinationPath: s3://cluster-backups/cnpg/myapp
        endpointURL: https://s3.homelab.local
        s3Credentials:
          accessKeyId:
            name: s3-credentials
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: s3-credentials
            key: SECRET_ACCESS_KEY

The tool generates this resource, applies it, and waits for the cluster to reach Ready state before proceeding.
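A sketch of that templating step (illustrative Python, not the tool's code): build the recovery `Cluster` resource shown above from an app name plus the global S3 settings. The field names follow the CNPG v1 API; the builder function itself is hypothetical:

```python
# Hypothetical manifest builder matching the YAML example above.
def recovery_cluster(app: str, namespace: str, bucket: str, endpoint: str) -> dict:
    source = f"{app}-backup"
    return {
        "apiVersion": "postgresql.cnpg.io/v1",
        "kind": "Cluster",
        "metadata": {"name": f"db-{app}-recovered", "namespace": namespace},
        "spec": {
            "bootstrap": {"recovery": {"source": source}},
            "externalClusters": [{
                "name": source,
                "barmanObjectStore": {
                    # cnpg/<app> matches the discovered S3 layout
                    "destinationPath": f"s3://{bucket}/cnpg/{app}",
                    "endpointURL": endpoint,
                    "s3Credentials": {
                        "accessKeyId": {"name": "s3-credentials", "key": "ACCESS_KEY_ID"},
                        "secretAccessKey": {"name": "s3-credentials", "key": "SECRET_ACCESS_KEY"},
                    },
                },
            }],
        },
    }

manifest = recovery_cluster("myapp", "myapp", "cluster-backups", "https://s3.homelab.local")
print(manifest["metadata"]["name"])
```

Generating the resource rather than hand-editing it means the `source` name and the `externalClusters` entry can never drift out of sync, which is a common way manual CNPG recoveries fail.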


The web UI

The REST API handles the machine-readable side. The web UI handles the operational side — the person who needs to restore a service at 2am and wants to see progress without reading JSON.

The timeline view shows available recovery points visually. You click the point in time you want to restore to. You see the steps executing in real time. Each step shows its status (waiting, running, complete, failed) and a log of what happened.

For the common case — “restore the most recent backup of application X” — it’s three clicks: select application, confirm recovery point, start restore. The tool handles the rest.


Testing your restores

The tool makes it practical to test restore procedures regularly. A test restore script:

#!/bin/bash
# Run weekly against staging namespace

RESTORE_API="http://full-app-restore.tools.svc.cluster.local:8080"
TARGET_NAMESPACE="myapp-staging"

# Trigger restore of production backup into staging namespace
curl -X POST "$RESTORE_API/api/restore" \
  -H "Content-Type: application/json" \
  -d '{
    "application": "myapp",
    "targetNamespace": "'$TARGET_NAMESPACE'",
    "recoveryPoint": "latest",
    "longhornVolumes": ["myapp-data"],
    "cnpgCluster": "db-myapp-cluster"
  }'

# Wait and check result
# ...alert if restore fails
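The "wait and check" step is elided in the script above. One way to fill it in is a small status interpreter that the script can poll against; the status payload shape here is an assumption, not the tool's documented API:

```python
# Hypothetical job-status interpreter for the weekly test script.
# The {"state": ...} payload shape is assumed; consult the tool's
# actual API for the real response format.

def evaluate(status: dict) -> str:
    """Map a restore-job status payload to the script's next action."""
    state = status.get("state")
    if state == "complete":
        return "ok"
    if state == "failed":
        return "alert"   # page someone: the backup did not restore
    return "poll"        # still running (or unknown): check again later

print(evaluate({"state": "complete"}))
print(evaluate({"state": "failed", "step": "recover-cnpg-cluster"}))
print(evaluate({"state": "running"}))
```

The important property is that an unknown or in-progress state keeps polling rather than silently passing: a test restore that never finishes should eventually alert too, via a timeout around the poll loop.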

Running this weekly against a staging namespace proves two things: the backups are restorable (a backup that fails to restore is not a backup), and the tool and its configuration are working correctly. By the time you need to use it in production, you’ve already verified it works.


What makes this harder than it looks

The tricky part of CNPG recovery is WAL replay. CNPG doesn’t just restore the base backup — it replays the WAL archive from the backup point up to the selected recovery target. This takes time proportional to how much write activity the database has seen since the last base backup.

If the base backup is from last Monday and today is Friday, a full restore might replay 4 days of WAL. For a busy database, that’s potentially hours. The tool surfaces this — it estimates replay time based on the WAL archive size — so you can decide whether to wait for a full replay or accept data loss by targeting a specific transaction ID.
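The estimate itself is back-of-the-envelope arithmetic: WAL archive size divided by an assumed replay throughput. The 10 MB/s figure below is an assumption for illustration, not a measured CNPG number; real replay speed depends heavily on hardware and workload:

```python
# Illustrative replay-time estimate. The default throughput of 10 MiB/s
# is an assumption; benchmark your own cluster for a realistic figure.

def estimate_replay_seconds(wal_bytes: int, replay_mib_per_s: float = 10.0) -> float:
    """Estimate WAL replay duration from archive size and throughput."""
    return wal_bytes / (replay_mib_per_s * 1024 * 1024)

# Four busy days producing ~50 GiB of WAL:
hours = estimate_replay_seconds(50 * 1024**3) / 3600
print(f"{hours:.1f} hours")  # ~1.4 hours at these assumptions
```

Even a rough number like this changes the decision at restore time: knowing a full replay costs an hour and a half, versus not knowing whether it costs minutes or a day, is what makes the PITR-versus-data-loss trade-off an informed one.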

The other tricky part is cleanup after a failed restore attempt. Half-completed Longhorn PVCs, a CNPG cluster stuck in recovery, leftover Kubernetes resources in the target namespace — these need to be cleaned up before retrying. The tool tracks what it created and can roll back a failed restore attempt.
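The tracking-and-rollback idea can be sketched in a few lines (illustrative Python, with made-up resource names): record each resource as it is created, then delete in reverse creation order on failure, so dependents go before their dependencies:

```python
# Hypothetical cleanup tracker for a failed restore attempt.
# Resource kinds/names are examples, not output from the real tool.

class RestoreTransaction:
    def __init__(self):
        self.created = []  # (kind, name) tuples in creation order

    def record(self, kind: str, name: str) -> None:
        self.created.append((kind, name))

    def rollback(self) -> list[tuple[str, str]]:
        """Resources to delete, in reverse creation order."""
        return list(reversed(self.created))

tx = RestoreTransaction()
tx.record("PersistentVolumeClaim", "myapp-data-restored")
tx.record("Cluster", "db-myapp-recovered")
tx.record("Deployment", "myapp")

print(tx.rollback())
```

Reverse order matters: deleting the Deployment before the CNPG Cluster, and the Cluster before the PVC, avoids tearing storage out from under a database that is still stuck mid-recovery.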


The honest state of self-hosted backup

Most self-hosters have backups that work in theory. They’ve verified that the backup job runs. They haven’t tested a restore from scratch in the last six months.

Build something that makes restore testing automatic. Run it. Fix what breaks. Then fix it again when you upgrade CNPG or Longhorn and the restore procedure changes. The work of maintaining a working restore procedure is ongoing.

But it’s the only work that matters when everything goes wrong.