Gino Eising
Nerd by Nature
Apr 15, 2026

The AI That Monitored Your Cluster Just Brought It Down

“Why can’t I see the new photos?”

That’s how the outage started. Not with a PagerDuty alert or a Grafana dashboard turning red, but with a casual question from my wife. I was already deep in the weeds debugging a glitch in Nextcloud Talk, but when I tried to refresh my own dashboard, it didn’t just load slowly; it never came back at all. Immich was gone. Mail was gone. The search index was a black hole.

When I finally managed to shell into the cluster, the numbers were absurd. A 16-core node was sitting at a load average of 153. The CPU wasn’t just pegged; it was screaming. The irony, of course, is that the component causing the meltdown was cortex-gateway, a piece of AI-driven infrastructure I’d built specifically to make the cluster more observable. I had built a sentinel to watch the gates, and it had decided to burn the house down.

The Setup: A Homelab with a Brain

To understand how this happened, you need a quick map of the terrain. I run a three-node Kubernetes cluster at home—a mix of x86 and ARM64 hardware—that handles everything my family needs: email (Mailu), photos (Immich), communication (Nextcloud), and general file storage.

To make managing this mess easier, I’ve been integrating AI. The centerpiece is cluster-cortex, a Retrieval-Augmented Generation (RAG) system. It’s designed to be the cluster’s “memory.” Its gateway, cortex-gateway, listens to Alertmanager webhooks and embeds incidents as vectors into a Qdrant database. The embeddings themselves are generated by Ollama, running on node02, my compute workhorse equipped with an AMD GPU.
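
For orientation, the alert path looks roughly like the sketch below. This is my reconstruction, not the actual gateway code: the web framework, endpoint path, embedding model, collection name, and service addresses are all assumptions. Only the synchronous call to Ollama’s /api/embeddings and the write into Qdrant are taken from the incident itself.

```python
# Minimal sketch of the cortex-gateway alert path. Framework, endpoint path,
# model name, collection name, and addresses are assumptions; only the
# synchronous Ollama /api/embeddings call and the Qdrant write come from the post.
import uuid

import requests
from fastapi import FastAPI, Request

app = FastAPI()
OLLAMA = "http://node02:11434"   # Ollama on the compute node (assumed address)
QDRANT = "http://qdrant:6333"    # Qdrant service (assumed address)

@app.post("/alerts")             # Alertmanager webhook receiver (assumed path)
async def alerts(request: Request) -> dict:
    payload = await request.json()
    for alert in payload.get("alerts", []):
        summary = alert.get("annotations", {}).get("summary", "")
        text = f'{alert.get("labels", {}).get("alertname", "unknown")}: {summary}'
        # The expensive step: a synchronous embedding call costing 24-30 s of CPU.
        emb = requests.post(f"{OLLAMA}/api/embeddings",
                            json={"model": "nomic-embed-text", "prompt": text},
                            timeout=120).json()["embedding"]
        # Store the vector together with the raw alert as payload.
        requests.put(f"{QDRANT}/collections/incidents/points",
                     json={"points": [{"id": str(uuid.uuid4()),
                                       "vector": emb,
                                       "payload": alert}]},
                     timeout=30)
    return {"embedded": len(payload.get("alerts", []))}
```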

To keep the AI up to date, I have a daily ritual: the brain-ingest-chats CronJob. Every morning at 09:30, it scrapes my LibreChat history and pushes it through the same /embed endpoint. It’s a convenient setup, but as it turns out, it’s also a perfect blueprint for a disaster.
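
The ingest side is just as simple, and that simplicity is the problem. Conceptually it boils down to a loop like the following sketch; the file layout, service name, and request shape are my assumptions. Note what is missing: no deadline, no rate limit, no back-pressure. A large enough export just keeps the loop running.

```python
# Sketch of what brain-ingest-chats effectively does: walk the exported chat
# history and push every message through the gateway's /embed endpoint, one
# synchronous request at a time. Path and payload shape are assumptions.
import json

import requests

GATEWAY = "http://cortex-gateway:8080"   # assumed service name and port

def ingest(history_path: str = "/data/librechat-export.json") -> None:
    with open(history_path) as fh:
        conversations = json.load(fh)
    for convo in conversations:
        for message in convo.get("messages", []):
            # Each call below costs roughly 24-30 s of CPU on node02 via Ollama.
            requests.post(f"{GATEWAY}/embed",
                          json={"text": message.get("text", ""),
                                "source": "librechat"},
                          timeout=300)

if __name__ == "__main__":
    ingest()
```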

The Kubernetes Alert Death Spiral: A Post-Mortem

The hum of the cluster is usually a comforting sound, a testament to GitOps managing a fleet of applications. Node02 quietly handles the heavy lifting, and the RAG pipeline usually operates in the background without a peep. Each embedding call is resource-intensive, consuming 24-30 seconds of heavy CPU on node02’s 16 cores, but under normal conditions, the system absorbs this easily.

The Ingest That Wouldn’t Quit

On this particular morning, at 09:30, the brain-ingest-chats CronJob started as usual. There was no immediate sign of trouble. But today, for reasons yet unknown—perhaps a particularly voluminous chat history or an obscure bug in the ingestion logic—the job did not complete its typical 30-minute run. It persisted, an unwelcome guest that extended its stay for seven agonizing hours, continuously hammering cortex-gateway with embedding requests.

Each request triggered a fresh, heavy CPU spike. The load average began its insidious climb. From a baseline of around 20, it crept to 60, then leaped to 100. By now, the alarms should have been blaring, but the system was already caught in a feedback loop. The load average eventually peaked at a staggering 153 over node02’s 16 cores. The system was attempting 9.5 times more work than it had capacity for.

The Cascade of Failure

With node02’s CPU maxed out, liveness probes—Kubernetes' way of checking if a container is still healthy—began to time out. kubectl logs reported ominous “command timed out after 5s” messages. The kubelet started killing pods for failing these critical health checks.

A cascade of failures ensued: Immich, Mailu, SearxNG, and, most critically, cortex-gateway itself, all entered a state of CrashLoopBackOff.

And then, the death spiral truly began. More crashes meant more alerts from Alertmanager. Each alert, ironically, triggered a webhook POST to cortex-gateway. But cortex-gateway was already trapped in its own crash loop. When it managed to restart for a brief moment, it immediately attempted to embed the deluge of new alerts, inadvertently adding more load to Ollama on the already saturated node02.

This additional CPU demand caused even more pods to fail their probes, generating still more alerts, and thus feeding the very spiral that was bringing the cluster to its knees. Meanwhile, the brain-ingest-chats CronJob, governed by the Kubernetes Job controller, was relentlessly respawned after each of its pods was deleted, ensuring the initial trigger of the crisis continued its destructive work.

The end state was grim. cat /proc/stat confirmed zero idle CPU ticks. Pressure Stall Information (PSI) in /proc/pressure/cpu read some avg10=91.70: over the trailing ten seconds, processes on node02 spent 91% of their time stalled, waiting for CPU.
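
If you want to reproduce those two readings from a debug pod, something like this minimal sketch does it (both files are host-wide even inside a container, so no special mounts are needed):

```python
# Sample idle CPU ticks from /proc/stat twice and read the "some" CPU pressure
# from PSI. On a fully saturated node the idle delta comes out as zero.
import time

def idle_ticks() -> int:
    with open("/proc/stat") as fh:
        return int(fh.readline().split()[4])   # 4th value after "cpu" is idle

def cpu_pressure_some_avg10() -> float:
    with open("/proc/pressure/cpu") as fh:
        return float(fh.readline().split()[1].split("=")[1])  # "some avg10=..."

if __name__ == "__main__":
    before = idle_ticks()
    time.sleep(1)
    print("idle ticks gained in 1 s:", idle_ticks() - before)
    print("PSI some avg10:", cpu_pressure_some_avg10())
```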

The Frantic Diagnosis

The first indication of widespread trouble came from a user: “Immich is broken, RAG is broken, mail is broken… what the hell is going on.”

A quick kubectl get pods -A revealed a sea of CrashLoopBackOff statuses. Attempting to get a quantitative measure with kubectl top nodes returned <unknown> for node02; the node was so saturated that even the metrics-server couldn’t scrape it.

I deployed a debug pod onto node02. Inside, uptime immediately screamed the problem: a load average of 153.79. Sifting through the Ollama logs revealed the core culprit: a continuous stream of POST /api/embeddings 200 24s entries, originating every few seconds from the IP address of the cortex-gateway pod. The embedding loop was identified.
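
To quantify the loop rather than eyeball it, you can count embedding calls per client IP straight from the Ollama log. A rough sketch follows; the exact log layout is an assumption, and only the presence of the request path and a client IP on each access line is relied on.

```python
# Count /api/embeddings requests per client IP from an Ollama log piped on stdin,
# e.g. `kubectl logs deploy/ollama | python3 count_embeds.py` (deployment name assumed).
import re
import sys
from collections import Counter

ip_pattern = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3})")
counts = Counter()

for line in sys.stdin:
    if "/api/embeddings" not in line:
        continue
    match = ip_pattern.search(line)
    if match:
        counts[match.group(1)] += 1

for ip, n in counts.most_common(5):
    print(f"{ip}: {n} embedding calls")
```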

Intervention and Secondary Damage

The solution demanded a multi-pronged, forceful intervention:

  1. Stop the Ingest: I deleted the brain-ingest-chats Job object immediately. Simply killing the pod was futile, as the Job controller would just restart it. I then suspended the associated CronJob.
  2. Break the Alert Loop: I scaled the cortex-gateway deployment to replicas: 0. This was critical, as simply deleting pods would trigger FluxCD’s reconciliation, immediately bringing the gateway back up to start embedding more alerts. I also suspended the Flux Kustomization for the gateway to prevent this automatic “healing.” (A scripted version of both steps is sketched right after this list.)
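
For completeness, here is what those two interventions look like scripted with the kubernetes Python client. In reality this was plain kubectl plus flux suspend kustomization; the namespace is an assumption, and a CronJob-created Job carries a timestamp suffix rather than the bare name shown here.

```python
# The same two interventions via the kubernetes Python client. Namespace and
# exact Job name are assumptions; the Flux Kustomization suspension is not shown.
from kubernetes import client, config

config.load_kube_config()
batch = client.BatchV1Api()
apps = client.AppsV1Api()
NS = "cortex"  # assumed namespace

# 1. Stop the ingest: delete the Job so the controller stops respawning pods,
#    then suspend the CronJob so the next scheduled run doesn't restart the fire.
batch.delete_namespaced_job("brain-ingest-chats", NS, propagation_policy="Foreground")
batch.patch_namespaced_cron_job("brain-ingest-chats", NS, {"spec": {"suspend": True}})

# 2. Break the alert loop: scale the gateway to zero replicas.
apps.patch_namespaced_deployment_scale("cortex-gateway", NS, {"spec": {"replicas": 0}})
```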

As the ingest stopped and the gateway stayed dead, node02’s load average began its slow descent: 153 to 64, then to 20, finally settling around 6.47 within about 15 minutes. The system exhaled.

But the damage didn’t stop at CPU overload. Secondary failures surfaced:

  • Immich: The server pod failed to restart properly because the SeaweedFS CSI volumes had lost their critical .immich marker files during the disruption. I had to run a manual init pod to recreate them.
  • Mailu: The webmail and front pods, having crash-looped over 300 times, were stuck in an exponential backoff state. They required manual pod deletion to reset their restart counts.

Reflection: The Architecture of a Death Spiral

Looking back at the manifests and the git history, the failure wasn’t a “bug” in the traditional sense. There was no null pointer exception, no syntax error, and no logic flaw that caused a crash. Every individual component performed exactly as it was instructed to.

The failure was emergent. It was a system-level collapse born from a lack of systemic reasoning.

The cortex-gateway was designed with a seductive logic: “Why not embed every Alertmanager alert in real-time? Then, when something breaks, the AI will have the immediate context of the last ten failures to help me debug.” On paper, this is a productivity win. In practice, it was the construction of a high-speed rail line directly into a wall.

The AI assistant that designed this integration (my current self, in a previous session) focused entirely on the “happy path.” It saw two components—an alert source and a vector database—and wired them together. It didn’t ask about the CPU cost of a single embedding call under load. It didn’t consider the latency of a synchronous webhook. Most critically, it failed to recognize the classic “positive feedback loop” pattern.

The sequence was a perfect storm: a bloated ingestion job pushed the CPU to its limit → the cluster started throwing resource-pressure alerts → the cortex-gateway received those alerts → it triggered more embedding calls to Ollama → CPU load increased → more alerts were generated.
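
To see why this diverges rather than merely degrading, a deliberately crude toy model helps. Every constant below is invented; the point is only the shape of the curve: once load crosses capacity, alerts feed the embedding queue, and the queue feeds the load.

```python
# Toy model of the feedback loop with made-up constants. Each pending embedding
# adds load; once load exceeds the core count, failing probes generate alerts,
# and every alert queues another embedding. The queue never drains.
CORES = 16
COST = 1.5          # load contributed by one in-flight embedding (invented)
INGEST_RATE = 3     # embeddings the stuck ingest job adds per tick (invented)
DRAIN_RATE = 2      # embeddings node02 can finish per tick (invented)

pending, base_load = 0, 8.0
for tick in range(8):
    pending += INGEST_RATE
    load = base_load + pending * COST
    overload = max(0.0, load - CORES)
    new_alerts = int(overload)          # failing probes -> Alertmanager alerts
    pending += new_alerts               # each alert is POSTed back to /embed
    pending -= min(pending, DRAIN_RATE) # whatever Ollama actually finishes
    print(f"tick {tick}: load~{load:.0f}, alerts={new_alerts}, embed queue={pending}")
```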

The system was literally screaming for help, and the mechanism designed to listen to those screams was the very thing choking the life out of the machine. There is a certain dry irony in building an “AI-powered observability suite” that creates a blackout so complete that you can’t even use the AI to ask why the cluster is down.

Analysis: Where the Human Gate Failed

If a senior SRE had reviewed the PR for the cortex-gateway webhook, the conversation would have looked very different. An experienced engineer doesn’t just ask, “Does this work?” They ask, “How does this fail?”

A human review would have hit three immediate red flags:

First, the Resource Budget. A human would have realized that calling a heavy embedding model synchronously inside a webhook—especially one tied to a monitoring system—is an architectural sin. You don’t put a heavy compute task in the critical path of your alerting pipeline.

Second, the Feedback Loop. Any engineer who has survived a production outage knows that monitoring systems must be decoupled from the resources they monitor. Using the same GPU/CPU instance for both the production workload and the alert-ingestion workload is a recipe for a death spiral. The moment the system struggles, the monitoring overhead scales linearly (or exponentially) with the failure, accelerating the collapse.

Third, the Metric Blindness. The brain-ingest-chats job had a history of running for 16 and 27 hours. To an AI assistant, this is just a “long-running job.” To a human, this is a flashing neon sign saying Something is wrong with the ingestion logic. A human would have set a timeout or an alert for any job exceeding its expected 30-minute window. The AI simply deployed the job and moved on to the next task, treating the completion time as a detail rather than a diagnostic.
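
The durable fix for that third flag is a one-line activeDeadlineSeconds on the Job template, so Kubernetes kills the run itself. Even an interim watchdog would have caught it; here is a sketch, with the namespace and thresholds as assumptions.

```python
# Watchdog sketch: flag any Job running longer than its expected window.
# Namespace and thresholds are assumptions; the durable fix is setting
# activeDeadlineSeconds on the Job template so Kubernetes enforces the limit.
from datetime import datetime, timezone

from kubernetes import client, config

MAX_MINUTES = {"brain-ingest-chats": 45}   # expected ~30 min, so alert past 45

config.load_kube_config()
batch = client.BatchV1Api()

for job in batch.list_namespaced_job("cortex").items:   # assumed namespace
    if job.status.active and job.status.start_time:
        age = (datetime.now(timezone.utc) - job.status.start_time).total_seconds() / 60
        # CronJob-created Jobs carry a suffix, so match on the owning CronJob's name.
        owner = (job.metadata.owner_references or [None])[0]
        name = owner.name if owner else job.metadata.name
        if age > MAX_MINUTES.get(name, 120):
            print(f"ALERT: job {job.metadata.name} running for {age:.0f} min")
```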

There is a fourth red flag that is harder to write about, because it implicates the human in the loop more directly. During the session where the components were being built and debugged, the question of CPU and resource limits was raised. It came up. The AI acknowledged it, noted that Ollama was running without a CPU cap, and moved on. The concern was logged and discarded. No limit was set. This is arguably the most uncomfortable data point in the whole incident: the human quality gate activated — the right question was asked — and the answer still didn’t land in the implementation. The AI treated “acknowledged” as equivalent to “resolved.” It isn’t. A junior engineer who does this gets a comment on their pull request. An AI assistant that does this ships to production.

This highlights the fundamental gap in current AI-assisted engineering. AI agents are remarkably proficient at “Local Correctness.” They can write a perfectly functioning Python function or a syntactically flawless Kubernetes manifest. However, they struggle with “Systemic Correctness.” They lack the visceral, scar-tissue memory of having been paged at 3:00 AM because a log-aggregator accidentally DDOSed the database. They’ve read about cascading failures in textbooks, but they don’t “feel” the risk of a synchronous webhook.

The Path Forward: Senior Review for Junior Agents

The meta-lesson here is that AI assistants, regardless of how “intelligent” they seem, are effectively hyper-productive junior engineers. They can produce a month’s worth of boilerplate in an afternoon, but they lack the skepticism required for systems design.

To prevent the next death spiral, we have to move the quality gate from “Does it work?” to “What is the blast radius?”

In a practical sense, this means implementing the “defensive” patterns the AI ignored:

  • Rate Limiting: The webhook should have a hard cap on embeddings per minute.
  • Circuit Breakers: If Ollama response times spike, the gateway should stop embedding and simply log the alert (both patterns are sketched in code after this list).
  • Resource Isolation: Batch ingestion and real-time queries should never share the same compute priority.
  • Timeout Vigilance: No CronJob should be allowed to run for 27 hours without triggering a high-priority alert.
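
The first two items are only a few lines of code. Here is a minimal sketch of a per-minute cap plus a latency-based circuit breaker in front of the embedding call; the thresholds are illustrative, and embed_and_store is a stand-in for the real Ollama/Qdrant path shown earlier.

```python
# Per-minute cap plus a latency circuit breaker in front of the embedding call.
# Thresholds are illustrative; embed_and_store is a placeholder.
import logging
import time

class EmbeddingGuard:
    def __init__(self, max_per_minute: int = 10, latency_budget_s: float = 40.0):
        self.max_per_minute = max_per_minute
        self.latency_budget_s = latency_budget_s
        self.window_start = time.monotonic()
        self.count = 0
        self.tripped_until = 0.0

    def allow(self) -> bool:
        now = time.monotonic()
        if now < self.tripped_until:              # circuit open: skip embedding entirely
            return False
        if now - self.window_start >= 60:         # roll the one-minute window
            self.window_start, self.count = now, 0
        if self.count >= self.max_per_minute:     # hard cap reached for this minute
            return False
        self.count += 1
        return True

    def record(self, duration_s: float) -> None:
        if duration_s > self.latency_budget_s:    # Ollama is struggling: open circuit for 5 min
            self.tripped_until = time.monotonic() + 300

def embed_and_store(alert: dict) -> None:
    """Stand-in for the real synchronous Ollama + Qdrant call."""
    time.sleep(0.1)

guard = EmbeddingGuard()

def handle_alert(alert: dict) -> None:
    if not guard.allow():
        # Degrade gracefully: keep the alert in the logs instead of embedding it.
        logging.warning("embedding skipped (rate limit or open circuit): %s", alert)
        return
    started = time.monotonic()
    embed_and_store(alert)
    guard.record(time.monotonic() - started)
```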

The irony of this outage is that it was caused by a system intended to make the cluster more resilient. It’s a reminder that in complex systems, the safety mechanisms themselves can become the primary vector of failure. The most important question we can ask when integrating AI into our infrastructure isn’t “What can this do for us?” but rather “What happens when this fails?”

If you don’t have a human in the loop asking that question, you aren’t building a smarter system—you’re just building a faster way to crash.

The most humbling part of the whole ordeal was the recovery. After the dust settled and the node finally stopped swapping, I fed the logs into the very same AI that had written the buggy code. In about ten seconds, it pinpointed the recursive loop, explained exactly why the embedding job had overrun, and suggested the fix.

It’s a strange place to be on the capability curve. AI assistants are currently brilliant at building things that work, but they are almost entirely blind to how those things can break. They can write a complex K8s controller in seconds, but they have no innate concept of a “death spiral” or a resource leak.

The lesson isn’t to stop using AI—that would be like stopping the use of compilers because you might write a bug. The lesson is to treat AI like a blisteringly fast junior engineer: incredibly productive, dangerously confident, and absolutely never allowed to merge their own code.
