Gino Eising
Gino Eising
Nerd by Nature
Apr 10, 2026 27 min read

cluster-shepherd: The AI Ops Agent That Actually Knows Your Cluster

thumbnail for this post

April 2026 — what happens when you stop treating AI as a search engine and start treating it as a co-pilot with real cluster access

Let me describe a scene you’ve probably lived.

It’s Saturday afternoon. A service is misbehaving. You open ChatGPT or Claude, paste the output of kubectl get all -A — a wall of YAML that immediately consumes a third of your context window — and ask “why is my pod crashlooping?”. The AI gives you a confident, well-structured answer. You follow it. Nothing improves. You paste more logs. It hallucinates a missing ConfigMap that doesn’t exist. You chase it for an hour. You eventually fix it yourself with kubectl describe pod like you always do.

The AI wasn’t wrong because it’s bad. It was wrong because it was blind. It had no access to your cluster. It had no memory of the last incident. It couldn’t call the Kubernetes API. It was reasoning from text you pasted, in a context that was already half-full, about a system it had never touched.

Today I want to show you the alternative I built — and why I built it the way I did.


Why go to these lengths at all

Since December I’ve been building a self-hosted alternative to Google’s ecosystem. Email, photos, video calls, file storage, authentication — the works. Not because the Google products are bad. They’re exceptional. That’s precisely the problem.

We got spoiled. A decade of free, polished, fast, always-available cloud services trained an entire generation — including the people in our lives who are not technical — to expect infrastructure that simply works. Your mum’s photos sync the moment she takes them. Your family’s shared calendar just updates. The video call starts in two seconds on any device. Google spent billions of dollars and thousands of engineers building that experience. And then they handed it to you for free in exchange for the right to know everything about you.

The bill is coming due in ways that feel increasingly uncomfortable. Not just privacy — though that matters. Jurisdiction. Dependency. The quiet understanding that an American corporation now holds the photos of your kids, your medical appointment history, your location for the past ten years, and that the terms under which they hold it can change with a policy update, an acquisition, or a political climate you have no influence over. I decided I’d rather do the work.

But here’s the thing nobody tells you when you start this journey: you are not replacing a product, you are replacing an engineering team. When Nextcloud loses a file sync, there is no support ticket. When Authentik’s database corrupts during an upgrade, there is no rollback button. When your family’s photos stop syncing at Christmas because a Kubernetes pod ran out of memory, you are the on-call engineer. At midnight. While everyone else is opening presents.

And the people relying on these services — family, friends you’ve convinced to move their email — they didn’t sign up for the complexity. They signed up for “it works like Google, but it’s ours.” They do not want to hear about pod restarts. They want their photos.

This is the real reason reliability matters so much, and why the safety architecture of cluster-shepherd is not over-engineering. It’s the minimum viable protection for a one-person team running services that real people depend on. You cannot afford to let an AI agent delete the wrong database because it was confident and fast and had no context. You cannot afford alert fatigue that causes you to miss the one warning that actually mattered. You cannot afford to spend your Saturday debugging a ghost CRD instead of being present with your family.

The cloud services spoiled us with reliability we now have to rebuild ourselves. The goal of everything in this post — the guardrails, the RAG memory, the ephemeral feature clusters, the CAN bus filtering — is to get close enough to that bar that the people relying on you never notice the difference.

So you turn to AI. Because you’re one person, the complexity is real, and the promise of AI-assisted infrastructure sounds like exactly the leverage you need. And that’s where the next problem starts.


The anti-pattern: AI as a clipboard reader

Before cluster-shepherd, the workflow for AI-assisted infrastructure work looked roughly like this:

  1. Something breaks
  2. Open chat window
  3. Paste logs, YAML manifests, event output
  4. Ask the AI
  5. Get an answer that may or may not apply to your actual system
  6. Paste more things when it doesn’t work
  7. Context fills up — AI starts losing coherence
  8. Start a new chat, losing all accumulated context
  9. Fix it manually anyway

This is not AI-assisted operations. This is AI as a fancy search engine that happens to have your logs pasted into it.

The failure modes are well-documented if you’ve spent time in this space. Context window exhaustion is the kindest one — the AI just starts forgetting what you told it earlier. Worse is confident hallucination: the AI invents a resource name, a flag, an endpoint that doesn’t exist in your cluster, and you follow the suggestion because it sounds plausible. Worst of all is what I’d call yolo execution — when you’re exhausted at 11pm, the AI says “just run this,” and you do, without fully reading it.

I wanted to build something that eliminates all three failure modes by design.


The other anti-pattern: giving AI the keys

There is a second tribe, and I belong to it too. These are the people who got frustrated with the clipboard approach and went the other direction entirely. Instead of pasting output to the AI, they gave the AI direct access to the cluster. A kubeconfig in the environment. An agentic tool — Antigravity, Claude Code, Cursor with MCP — that can run kubectl commands autonomously. Ask it to fix something, walk away, come back and see what happened.

I killed my home cluster this way. Multiple times. Databases gone. Authentik destroyed. Immich wiped. Not because the AI was malicious — because it was confident, fast, and had no idea what it didn’t know.

The failure mode is different from the clipboard approach but in some ways worse. With clipboard AI, the worst case is you follow bad advice you could have questioned. With agentic AI and direct cluster access, the worst case is the AI takes ten actions in thirty seconds while you’re reading the first one, and by the time you say “wait, stop” it has already deleted the PVC.

Here is a real example of what that looks like:

Agent: I'll fix the authentik deployment by removing the broken init container.
Agent: Deleting pod authentik-worker-7d9f6b... done.
Agent: Scaling deployment to 0 to apply changes... done.
Agent: Updating the deployment manifest... done.
Agent: Scaling back to 1... done.
Agent: It seems the database connection is failing. Let me check the secret...
Agent: The secret references a CNPG cluster that appears to be in a bad state.
       I'll delete and recreate the PostgreSQL cluster to resolve this.
Agent: Deleting postgresql cluster authentik-prd... done.

That last line is the one that ends your evening. A CloudNativePG cluster deletion with prune: true in Flux means the PVC goes with it. The database is gone. The AI was not confused — it genuinely believed it was solving the problem. It had no context for what a CNPG deletion actually means in terms of data loss. It had never seen that cluster fail before. It had no memory of the last time this pattern led to disaster.

This is the russian roulette problem. Not every pull of the trigger ends badly. Ninety percent of the time the agentic tool does something reasonable and you feel like a genius for setting it up. Then on the tenth time, in a sequence you didn’t fully read, it deletes something irreplaceable.

The uncomfortable truth is that AI confidence does not correlate with AI correctness — especially for infrastructure operations. A model that has never seen your cluster before will answer with exactly the same tone and certainty whether it’s right or catastrophically wrong. You cannot tell from the output. You can only tell from the aftermath.

This is why cluster-shepherd’s design starts from a different premise. The AI does not have direct cluster access. It has access to a carefully scoped set of tools, every write operation goes through a risk gate backed by the actual history of your cluster, and anything non-trivial requires a human to type approve_task. The AI is fast and knowledgeable. The architecture is what keeps it from being dangerous.


What cluster-shepherd actually is

cluster-shepherd is a Go MCP server that runs inside your Kubernetes cluster, bound to a real ServiceAccount with carefully scoped RBAC permissions. It is not a chatbot wrapper. It is not an agent that calls kubectl by SSHing somewhere and hoping for the best. It talks directly to the Kubernetes API, the Flux API, and your GitOps repository. When it tells you a deployment is healthy, it just asked the API.

It connects to LibreChat over HTTP SSE and exposes itself as an MCP server — which means any model LibreChat supports (Gemini, Deepseek, Claude, Mistral) can use its tools natively through the standard Model Context Protocol. You pick your LLM from the dropdown. The cluster access comes with it.

The full architecture looks like this:

graph TD U[You in LibreChat] -->|chat message| LC[LibreChat] LC -->|SSE — listTools| MCP[cluster-shepherd
MCP Server] LC -->|streamGenerateContent + tools| LLM[LLM of choice
Gemini / Deepseek / Claude] LLM -->|functionCall: get_flux_status| LC LC -->|callTool| MCP MCP -->|REST| K8S[Kubernetes API] MCP -->|REST| FLUX[Flux Kustomization API] MCP -->|SSH| NODES[Cluster Nodes] MCP -->|HTTPS| GITLAB[GitLab Files API
fluxcd repo] MCP -->|POST /query| CORTEX[Cluster Cortex
RAG Gateway] CORTEX -->|dense + sparse search| QDRANT[Qdrant
vector store] CORTEX -->|graph context| CNPG[PostgreSQL
known_errors / solutions] MCP -->|POST /risk| CORTEX

The LLM never touches the cluster directly. All cluster operations go through the MCP server, which enforces guardrails before anything writes, deletes, or restarts. The LLM is the reasoning layer. cluster-shepherd is the hands.


The tool inventory

At the time of writing, cluster-shepherd exposes 16 tools across four categories.

Observation tools — read-only, always safe to call:

  • get_cluster_alerts — scans all namespaces for non-Running pods and Warning events; the first thing the agent calls when you say “is everything okay?”
  • get_pod_logs — fetches the last N lines from any pod, any container
  • get_node_status — returns Ready/NotReady status and conditions for every node
  • get_flux_status — lists all Flux Kustomizations with their ready state, suspension status, and last revision
  • list_workloads — returns a trimmed summary of all Deployments and StatefulSets (more on this in a moment)
  • query_cortex — searches the RAG knowledge base for verified solutions, past incidents, and cluster history

Node tools — SSH-based, direct node access:

  • get_node_list — returns all nodes and their SSH addresses
  • run_node_cmd — executes an allowlisted command on a node (df, journalctl, systemctl status, etc.)

Action tools — write operations, all gated behind assess_risk:

  • restart_workload — triggers a rollout restart for a Deployment or StatefulSet
  • scale_workload — scales a Deployment to a given replica count
  • run_flux_op — suspends, resumes, or reconciles a Flux Kustomization
  • gitops_change — reads, creates, updates, or deletes a file in the GitOps repository (more on this below)
  • delete_crds — deletes all CRDs for a given domain (e.g. longhorn.io) via the Kubernetes apiextensions API

Guardrail tools — the safety layer:

  • assess_risk — sends the proposed action to the Cortex RAG gateway for risk scoring against past incidents; returns auto, review, or reject
  • queue_task — queues a task for human approval when risk is not auto
  • approve_task — approves a queued task and returns the action for execution
  • list_pending_tasks — shows what’s waiting for approval

The agent is instructed to call assess_risk before any write operation. If the recommendation is not auto, it queues the task and waits. You approve it. It executes. No surprises.


Three layers of safety

This is not an afterthought. The entire design starts from a single principle: the cost of a wrong action in production is asymmetrically higher than the cost of asking first.

Layer 1: RBAC

The ServiceAccount the server runs as has precisely the permissions it needs and nothing more. Full read on pods, nodes, events, deployments, configmaps, PVCs. Patch/update on deployments and statefulsets for restarts and scaling. Get/list/delete on CRDs for cleanup operations. Get/list/watch/patch on Flux Kustomizations. Nothing else.

It cannot create new namespaces. It cannot delete pods. It cannot touch secrets. If the MCP server were somehow compromised, the blast radius is bounded by what the ServiceAccount can do — and that list was written deliberately, line by line.

Layer 2: the risk gate

Every action tool description tells the agent: MUST call assess_risk first. The assess_risk tool sends the proposed action text to the Cluster Cortex RAG gateway, which runs hybrid search against four knowledge collections:

  • cluster_events — every Warning event ingested since the server started
  • chat_history — prior conversations with outcomes
  • manifests — GitOps repo files (ingested at deploy time)
  • documentation — operational runbooks

Cortex returns a risk score, a recommendation (auto / review / reject), a list of similar past incidents, and reasoning. If it finds that the same restart caused a cascade failure three weeks ago, the risk score goes up and the recommendation becomes review. The agent queues the task instead of executing.

Layer 3: the human queue

queue_task writes a pending task to disk (the shepherd’s data path PVC). list_pending_tasks shows what’s waiting. approve_task marks it approved and returns the original action so the agent can execute it.

This means that even if the agent is wrong about risk — even if Cortex scores something auto that you wouldn’t be comfortable with — you can always override by asking the agent to queue it first. The queue is the escape hatch. It exists precisely because AI confidence and AI correctness are not the same thing.

sequenceDiagram participant U as You participant A as Agent participant MCP as cluster-shepherd participant C as Cortex U->>A: "restart the immich deployment" A->>MCP: assess_risk({action: "restart immich/immich"}) MCP->>C: POST /risk C-->>MCP: {score: 0.2, recommendation: "auto", reasoning: "no recent crashes"} MCP-->>A: auto — safe to proceed A->>MCP: restart_workload(namespace: immich, name: immich, kind: Deployment) MCP-->>A: {"status":"restarted","kind":"Deployment","namespace":"immich","name":"immich"} A-->>U: "Done — immich restarted. Previous pod terminated gracefully." Note over U,C: High-risk scenario U->>A: "scale immich to 0 replicas" A->>MCP: assess_risk({action: "scale immich to 0"}) MCP->>C: POST /risk C-->>MCP: {score: 0.8, recommendation: "review", reasoning: "scale-to-zero is destructive"} MCP-->>A: review required A->>MCP: queue_task({action: "scale immich to 0", reason: "user request"}) MCP-->>A: task queued — ID: task_8f3a A-->>U: "Queued as task_8f3a — call approve_task when ready" U->>A: "approve task_8f3a" A->>MCP: approve_task({id: "task_8f3a"}) MCP-->>A: approved — executing A->>MCP: scale_workload(namespace: immich, name: immich, replicas: 0) MCP-->>A: scaled to 0 A-->>U: "Done."

The context problem: why pasting YAML kills you

Here is a real number. A single Kubernetes Deployment object serialised to JSON is between 4KB and 40KB, depending on how many annotations, labels, and managed fields it carries. A cluster with 40 Deployments is between 160KB and 1.6MB. That is between 40,000 and 400,000 tokens. A model with a 128K context window — Gemini Flash, Deepseek V3 — is full after the first tool call.

I learned this the hard way. The original list_workloads implementation returned raw appsv1.Deployment objects from the Kubernetes client-go library. The agent called it, Deepseek’s context filled up in a single round-trip, and the next message was a confused half-answer that referenced resources the model had already forgotten.

The fix is what I call WorkloadSummary:

type WorkloadSummary struct {
    Namespace string   `json:"namespace"`
    Name      string   `json:"name"`
    Ready     string   `json:"ready"`    // "3/3", "1/2", etc.
    Images    []string `json:"images"`
}

Four fields. The agent knows enough to reason about the workload — whether it’s healthy, what’s running — without consuming the full object graph. A 40-deployment cluster now returns roughly 8KB of JSON instead of potentially 1.6MB.

The same philosophy applies everywhere. get_cluster_alerts returns structured alert summaries, not raw event lists. get_flux_status returns a condensed view of each Kustomization. get_pod_logs caps at 200 lines. Every tool was designed with the question: what is the minimum information the agent needs to reason correctly?

MCP over SSE also means tool results go directly into the LLM’s context as structured data, not as text you pasted. The LLM knows a tool call happened, it knows what it returned, and it can cite it. When the agent says “immich is Ready 2/2 running image v2.5.6,” it’s reading from a structured JSON field it fetched 400ms ago — not remembering something you told it in a previous message.


RAG: the memory the LLM was always missing

This was the most important addition of the day, and it’s the one that changes the fundamental character of the system.

Every LLM has a knowledge cutoff. More importantly, every LLM has no knowledge of your cluster. It doesn’t know that the Longhorn operator was uninstalled three weeks ago but its CRDs were never cleaned up. It doesn’t know that the last time you restarted the authentik deployment during a database migration it caused a 20-minute outage. It doesn’t know that the “readyToUse: false” on your CNPG VolumeSnapshot is a known cosmetic bug in the openebs-lvm-localpv driver, not a real failure.

Cluster Cortex is the RAG layer that gives the agent that memory. It runs in-cluster (you can read about how it was built in an earlier post) and exposes a /query endpoint that does hybrid retrieval: dense vector search using nomic-embed-text, sparse BM25 keyword search, reciprocal rank fusion to merge the results, and cross-encoder reranking with bge-reranker-base. The PostgreSQL graph layer adds structured context: known errors with their resolution history, prior solutions that have been verified to work.

The query_cortex tool exposes this directly to the agent:

query_cortex(
  query: "Longhorn CRD conversion webhook failures causing kube-apiserver restarts",
  top_k: 5,
  collections: "cluster_events,documentation"
)

What comes back is a merged_context field — a pre-formatted block containing verified solutions, matching cluster events, and prior conversation outcomes. The agent reads this before answering. Instead of hallucinating a cause, it cites the cluster’s own history.

I shipped query_cortex this afternoon. It proved useful within hours. The agent correctly identified that a kube-apiserver restart loop was caused by orphaned Longhorn CRDs trying to call a conversion webhook on a service that no longer existed — because Cortex had ingested those Warning events and the cluster manifests. The LLM on its own would have needed me to describe all of that. Cortex knew it already.

This is also the answer to the russian roulette problem. The agentic tools that killed my cluster weren’t wrong because they were stupid — they were wrong because they were reasoning from zero context about a system with years of accumulated state. Cortex is that accumulated state, made queryable. An agent that has read your cluster’s history before recommending an action is a fundamentally different thing from an agent operating on vibes and training data. After one day of running with RAG grounding, the answers started feeling like they came from someone who knew the cluster. That shift in trust is not a small thing.

The ingestion pipeline runs continuously. Every 5 minutes, cluster-shepherd ingests all current Warning events from the Kubernetes API into Cortex. The agent is always working against a knowledge base that is at most 5 minutes stale.


Observability without the mental tax

I’ve spent years as an SRE at banks and large international companies. And in 99% of those environments, the observability story looked roughly like this: enormous investment in Prometheus, Grafana, alerting pipelines, runbooks, on-call rotations, dashboard reviews — and at the end of it, engineers who were simultaneously overwhelmed by noise and still surprised by incidents.

The sweet spot between effort and effect was almost always missed. Not because the engineers weren’t good. Because the model is fundamentally broken.

Traditional observability is a symptom glorification machine. A metric tells you CPU is high. An alert fires. You open the dashboard — which someone built six months ago for a slightly different version of the problem — and you stare at graphs that confirm the CPU is high. You cross-reference the log aggregator. You search for the right query. You check the runbook. Forty minutes later you understand what happened. The alert told you something is wrong. Everything after that was you doing the actual work.

Your car has a CAN bus — a network that carries thousands of signals per second from every sensor in the vehicle. Misfires, voltage fluctuations, transient sensor glitches, micro-hesitations in the throttle response. If you wired every single one of those signals to a dashboard light or a speaker alert, you’d pull over in a panic within the first five minutes and the car would be perfectly fine. Modern cars generate what would look, unfiltered, like a continuous stream of failures. Almost none of it needs your attention.

The art is in the filtering. The ECU knows which signals are noise, which are patterns worth watching, and which cross the threshold that requires the driver’s attention — that single amber light on the dashboard that says something needs your attention, but you can finish the drive. Not too early. Not too late. Not too often. Just right.

A fully instrumented Kubernetes cluster is no different. Thousands of events, restarts, probe failures, scheduling decisions, certificate renewals happening every minute. Piped unfiltered into a chat room or a dashboard, it is technically complete and practically useless. The alert fatigue is not a configuration problem. It is the inevitable result of surfacing everything and expecting humans to do the filtering.

The car dashboard doesn’t have a thousand gauges. It has a warning light and a mechanic. The warning light says attention needed. The mechanic — who knows the car, its history, what failed last time — tells you what it actually means and what to do about it.

That’s what the RAG layer changes.

Instead of maintaining alert rules that fire on symptoms, cluster-shepherd ingests Warning events continuously into Cortex. Instead of building dashboards that show metrics without context, query_cortex correlates an alert with the cluster’s event history, known error patterns, and prior solutions — and returns an answer, not a graph. When the kube-apiserver starts restarting, the agent doesn’t surface a CPU spike and leave you to figure out why. It surfaces the Longhorn conversion webhook error, cross-references the fact that the longhorn-system namespace is absent, and tells you the 22 orphaned CRDs are the cause.

The feature cluster model makes this even more powerful. Every feature-headscale.djieno.com cluster runs with the same ingestion pipeline. Metrics, events, and logs flow into Cortex during the feature’s lifetime. If the headscale deployment causes an unusual memory growth pattern, that gets ingested. If there’s a DNS resolution failure at startup, that’s in the event stream. When the PR is reviewed, you’re not just looking at “did the tests pass” — you’re looking at a cluster that the agent has been watching, with a Cortex knowledge base that now includes everything that happened during that feature’s development.

This gates promotion in a way no dashboard ever could. Not “here are some graphs, good luck” — but “during this feature’s lifetime, these events occurred, these patterns were observed, here is the risk assessment for merging to production.” The observability becomes an input to a decision, not a wall of information you have to interpret yourself.

The shift is this: instead of spending engineering time building and maintaining observability infrastructure, you spend it on the system that makes observability actionable. Cortex costs almost nothing to run — a small FastAPI gateway, a Qdrant instance, a PostgreSQL cluster that was already there for other things. The dashboards it replaces would have cost weeks to build and months to maintain.

You still need metrics. You still need logs. But you don’t need to spend your time staring at them. That’s what the agent is for.


GitOps write access: fixing things in git, not just in the cluster

Most infrastructure AI tools, if they have write access at all, write directly to the cluster. That means their changes are:

  • Not tracked
  • Not auditable
  • Not repeatable
  • Wiped on the next Flux reconciliation

I built gitops_change differently. When the agent needs to make a persistent change — removing a HelmRelease for an application that was uninstalled, updating a manifest, deleting an orphaned resource definition — it commits the change to the GitOps repository via the GitLab Files API. Flux picks it up, reconciles it, and the change is in git history with a commit message.

gitops_change(
  operation: "delete",
  file_path: "clusters/production/apps-openwebui-prd.yaml",
  commit_message: "chore(shepherd): remove openwebui — replaced by librechat"
)

No git binary in the container. No SSH key to the repo needed. Just the GitLab Files API, a personal access token scoped to the single repository, and a clean commit in the log. The entire fluxcd repo is the agent’s durable memory for cluster-level changes.

This matters because GitOps is not just a deployment pattern — it’s an audit trail. When you come back in three months and wonder why the openwebui kustomization was removed, there’s a commit with a timestamp and a message. When the agent makes a change and Flux rejects it because of a syntax error, the reconciliation failure is visible in Flux’s own status. The agent can see that too with get_flux_status and fix it.


A real session from today

This is not a demo scenario. This is what happened this morning, before the blog was written.

You: "the kube-apiserver keeps restarting, can you look into it?"

Agent: [calls get_cluster_alerts]
       → finds Warning events: "failed calling webhook longhorn-conversion.longhorn.io"
       
       [calls query_cortex with "longhorn conversion webhook kube-apiserver"]
       → Cortex returns: 22 Longhorn CRDs present, longhorn-system namespace absent,
         conversion webhook service longhorn-conversion.longhorn.io unreachable
         
       Agent: "The kube-apiserver is restarting because 22 Longhorn CRDs still exist after
               Longhorn was uninstalled. Every CRD list operation triggers a call to the
               conversion webhook service, which no longer exists. I recommend deleting all
               longhorn.io CRDs. Calling assess_risk first."
               
       [calls assess_risk]
       → risk: 0.6, recommendation: "review" — "CRD deletion is irreversible"
       
       [calls queue_task]
       → task_lh_001 queued

You: "approve task_lh_001"

Agent: [calls approve_task]
       [calls delete_crds(domain: "longhorn.io")]
       → 22 CRDs deleted
       
       Agent: "Done. All 22 longhorn.io CRDs removed. The kube-apiserver should stop
               restarting within the next reconciliation cycle."

The whole session took four minutes. The root cause — orphaned conversion webhooks from a decommissioned storage operator — is exactly the kind of thing an LLM working from pasted text would hallucinate around. Cortex knew the Longhorn namespace was gone. The agent connected the dots.

(Deepseek, for the record, got confused and kept suggesting SSH-based kubectl commands. The tool description improvements I made after that experience are documented in the Gemini MCP silent failures post.)


Why this approach and not that one

There are a few obvious alternatives, and it’s worth being explicit about why I didn’t use them.

Why not just give the AI a kubeconfig and let it run kubectl?

Because kubectl is an all-or-nothing credential. A kubeconfig that can read pods can usually also exec into them, list secrets, and with the right cluster setup, do significantly more than you’d want an AI agent doing unsupervised. The ServiceAccount model gives precise, auditable, revocable permissions at the API level. There is no exec permission. There is no access to secrets.

Why not use an existing SRE agent framework?

k8sGPT, Robusta, KubeAI — they all exist and they all solve narrower versions of this problem. k8sGPT is excellent at diagnosis but has limited write operations and no human approval workflow. Robusta has good alerting integration but ties you to a specific model. I wanted full control over the tool surface, the safety architecture, and the RAG integration — and I wanted it in Go, small, and running in-cluster without a phone-home dependency.

Why not use Claude Code directly?

This question actually came up during today’s session, about ten minutes after I’d spent the morning building the whole thing. The answer is: because the agent runs in LibreChat, accessible from any browser over HTTPS with OIDC auth. It’s not tied to my laptop. It’s always running, always connected to live cluster data, available from any device. Claude Code is a fantastic coding assistant. cluster-shepherd is a production operations tool. They serve different contexts.


The philosophy: automate to the fullest, but earn the trust first

Here is the thing about infrastructure automation that doesn’t get said enough: automation without trust is worse than no automation. If you can’t explain what the automation will do before it does it, if you can’t see why it made a decision, if you can’t stop it before a bad call executes — you will eventually have an incident that’s harder to recover from than anything the automation would have prevented.

cluster-shepherd is designed around the opposite philosophy. Every action is explained before it runs. Every write operation has been through a risk gate that cites specific past incidents. Every operation that’s not clearly safe gets queued for human eyes. The agent doesn’t have write access to secrets. It can’t exec into pods. It can’t create new resources from scratch. Its GitOps changes are commits in a repo you own.

The goal is not to remove humans from the loop. The goal is to make the loop so fast and well-informed that being in it feels effortless rather than burdensome. You shouldn’t have to spend forty minutes pasting logs to find out that your kube-apiserver is restarting because of a ghost CRD. The agent should tell you that in thirty seconds, show you the evidence, propose the fix, and wait for you to say yes.

That’s the target. I’m close.


What comes next: the cluster is not a pet anymore

Do you remember the server naming arguments?

The passionate debates in the office about whether the new box should be named after a planet, a Greek god, or a character from a film nobody else had seen. The colleague who named theirs after their cat. The one who had a spreadsheet. The unwritten rule that you didn’t touch thor because Dave set it up in 2019 and nobody quite knew what was on it anymore.

Kubernetes killed the pet server. But somewhere along the way, many people turned the cluster into the new pet. One cluster, carefully tended, upgraded nervously, never quite reproducible, its state a combination of Flux manifests and three years of kubectl apply -f one-liners nobody committed. You know it intimately. You worry about it when you’re on holiday. When it breaks badly enough, you feel it personally.

The next evolution I’m building toward does away with that entirely.

The idea is ephemeral feature clusters. When I want to test a new tool, a new operator, a new configuration — I don’t touch the production cluster. Instead, a new cluster spins up from scratch, named after the feature: feature-headscale.djieno.com, feature-cnpg-upgrade.djieno.com. It gets the full baseline stack deployed via the same FluxCD manifests, the same Kustomizations, the same postBuild variable substitution that production uses — just with a different domain suffix and its own isolated environment.

The feature gets developed and tested there. When it’s ready, a pull request promotes it to main. When the PR merges, the feature cluster is deleted instantly. No migration. No cleanup. No lingering state. The cluster was never a pet — it was a workspace that got thrown away when the work was done.

graph LR B[branch: feature/headscale] -->|cluster boots| C[feature-headscale.djieno.com] C --> BS[baseline stack
cert-manager · ESO · ingress-nginx · authentik] C --> F[feature workloads
headscale · headscale-admin] C --> T[automated tests
headscale.feature-headscale.djieno.com] T -->|pass| PR[Pull Request → main] PR -->|merged| D[cluster deleted] PR -->|merged| PROD[production.djieno.com
same manifests, new postBuild vars]

cluster-shepherd fits naturally into this model. The same MCP server, the same tools, the same risk gate — but scoped to the feature cluster. The agent can observe the feature deployment, check if the headscale pods came up healthy, verify the ingress resolves, and flag anything that needs attention before promotion. Cortex ingests events from the feature cluster the same way it does from production. If the same misconfiguration caused a problem in a previous feature cluster, the risk gate knows.

What this means in practice: Kubernetes becomes infrastructure you reason about, not infrastructure you remember. The cluster is not thor. It’s not named after anything. It’s a reproducible artefact that proves a feature works, and then it’s gone. The knowledge lives in git and in Cortex — not in the cluster’s accumulated state.

We spent years arguing about server names because the servers were precious. They were precious because rebuilding them was hard and slow and full of undocumented decisions. When rebuilding is a ten-minute automated pipeline, nothing is precious anymore. And that is exactly the point.

The infrastructure doesn’t need to be quiet for you to stay calm. It needs to be disposable enough that nothing breaking it is the end of the world.


References