Gino Eising
Nerd by Nature
Apr 25, 2026 9 min read

Wie Is Wie: Watching Two AIs Debug a Network I Broke on Purpose

There is a children’s game in the Netherlands called Wie is Wie — the Dutch version of Guess Who. You sit across from your opponent with a board of cartoon faces. You take turns asking yes/no questions. The goal is not to gather information in every possible category. The goal is to flip over as many faces as possible with a single question.

“Does your character wear glasses?” Twelve faces gone in one move.

I have had this game in my head for years as a debugging mental model. Network broken? Something is off in Kubernetes? A request is hanging? The natural instinct is to reach for the tools you know — check the logs, read the config, poke at the manifests. That instinct is wrong. The first move should eliminate half the suspects. One question. Most cards face down.

I wanted to see whether modern AI assistants played the same game. So I broke something, watched them troubleshoot, and took notes.


The Setup

The home cluster is three nodes behind a Mikrotik CCR2004 router running BGP ECMP. MetalLB on each node announces a pool of VIPs to the Mikrotik, the Mikrotik installs equal-cost routes to all three nodes, and inbound connections are distributed by connection hash. It is, as I wrote in the Varnish post, obviously overkill for a homelab. It is also excellent infrastructure for intermittent bugs.

Node              IP          Role
node02            10.1.1.24   Main compute, AMD GPU
storage1          10.1.1.37   NAS, control-plane
orange-pi-max-1   10.1.1.2    ARM64 edge node
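
The MetalLB side of that setup boils down to three small resources. This is a hedged sketch rather than the exact manifests from the repo; the ASNs, peer address, and pool range are illustrative:

apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: mikrotik
  namespace: metallb-system
spec:
  myASN: 64512          # node-side ASN (illustrative)
  peerASN: 64511        # ASN configured on the CCR2004 (illustrative)
  peerAddress: 10.1.1.1 # router address (illustrative)
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: vip-pool
  namespace: metallb-system
spec:
  addresses:
    - 10.1.1.230-10.1.1.239   # illustrative range; the .230 VIP is the real one
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: vip-advertisement
  namespace: metallb-system
spec:
  ipAddressPools:
    - vip-pool

With no nodeSelectors on the BGPAdvertisement, every node running a MetalLB speaker announces the VIPs, which is exactly what gives the Mikrotik its three equal-cost routes.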

The specific problem: orange-pi-max-1 had a bridge interface (br0) configured at MTU 1500, but its physical NIC was capped at 1400. Everything at or under 1400 bytes passed through fine. Anything larger — a TLS handshake, an HTTPS response, a moderately sized JSON payload — was silently dropped at the NIC level without an ICMP error, without a log line, without any indication that anything was wrong.
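
The mismatch itself is easy to confirm once you think to look for it. A quick check, using the interface names from this node; the ping sizes add up to full packets once the 28 bytes of ICMP and IP headers are counted:

# bridge MTU vs. the physical NIC underneath it
ip link show br0 | grep -o 'mtu [0-9]*'
ip link show enP3p49s0 | grep -o 'mtu [0-9]*'

# probe from another machine with DF set:
# 1372 + 28 = 1400 bytes should pass, 1472 + 28 = 1500 should hang on the broken node
ping -c 3 -M do -s 1372 10.1.1.2
ping -c 3 -M do -s 1472 10.1.1.2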

The Mikrotik routing table showed three active BGP routes for every VIP. ECMP was working. From the router’s perspective, all three nodes were healthy and receiving traffic. And indeed they were — two thirds of connections worked perfectly. One third hung until timeout, because one third of connections were being routed to a node that silently ate anything above 1400 bytes.

The symptom was intermittent TLS hangs for external HTTPS traffic. The kind of bug that is very easy to misattribute to a flaky certificate, a CDN hiccup, or a tired operator’s imagination.

I handed this to two AI assistants. Claude and Gemini Flash. Neither knew the MTU was the problem. Neither knew which node was the culprit. Here is what they did with that information.


What the AIs Did

Gemini’s Opening Move

Gemini started with architecture. It wanted to understand the CRS326 switch configuration, checked whether L2MTU settings might differ from interface MTU, looked at the Mikrotik hardware spec, and proposed that the switch itself might be the bottleneck. It was thorough. It was wrong. The switch and router were both at 1500 MTU — a single interface print would have closed that line of inquiry in thirty seconds.
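
For what it is worth, that thirty-second check is a single command on RouterOS; the name filter here is illustrative, and both the interface MTU and the L2MTU show up in the detailed print:

# on the CRS326 / CCR2004, both mtu and l2mtu appear in the output
/interface print detail where name~"ether"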

When the switch theory didn’t hold, Gemini pivoted to systemd-networkd logs and claimed they showed DHCP-advertised MTU values. They did not — networkctl status shows the current interface state, not the DHCP negotiation history. Gemini was reading the output correctly and interpreting it incorrectly.

The more fundamental issue was that it was still asking questions in the wrong direction.

Claude’s Opening Move

Claude did roughly the same thing, just with more infrastructure tooling. It read FluxCD manifests. Checked the BGPAdvertisement nodeSelectors. Looked at the Canal CNI ConfigMap to understand how FLANNEL_MTU was being computed. Found /run/flannel/subnet.env and noticed FLANNEL_MTU=1350 across all three nodes — interesting for the root cause question, completely irrelevant for the immediate fix.
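
For anyone who has not met it, /run/flannel/subnet.env is a tiny key=value file. On these nodes it looked roughly like this; the network and subnet values are illustrative, the MTU line is the one that mattered:

# values other than FLANNEL_MTU are illustrative
FLANNEL_NETWORK=10.42.0.0/16
FLANNEL_SUBNET=10.42.3.1/24
FLANNEL_MTU=1350
FLANNEL_IPMASQ=true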

It then spent a meaningful amount of time wrestling with the Mikrotik MCP API to check whether DHCP option 26 (Interface MTU, RFC 2132) was being sent to the nodes. The API archaeology involved: discovering that the Mikrotik MCP server required a full SSE handshake before accepting messages, debugging why kubectl port-forward was not handling concurrent connections properly, running scratch pods inside kube-system to call the MCP service directly, and eventually confirming that the MTU option was not being sent by DHCP at all.

Ultimately useful information. Found ninety minutes into the session instead of five.

The One Genuinely Good Move

In the middle of all this configuration archaeology, Claude suggested a tcpdump on the physical interface during an active DHCP handshake. This was the right call.

tcpdump -i enP3p49s0 -n 'port 67 or port 68' -XX

Watching the raw DHCP ACK packet and seeing no option 26 in the bytes: that is how you conclusively prove a hypothesis false. Not by reading the systemd-networkd status page. Not by inferring from interface configuration. By watching the actual packet. When everything else is ambiguous, packet capture is the ground truth. It is something I do regularly myself — and it was the most efficient thing either AI did in the entire session.
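
For reference, if the DHCP server had been sending an MTU, the ACK's options field would have carried a four-byte sequence for option 26 (RFC 2132), and it was simply not there. In the hex dump it would look roughly like this:

1a 02 05 78    # option 26 (interface MTU), length 2, value 0x0578 = 1400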


The ECMP Red Herring

Here is the part that made the diagnosis genuinely tricky, and why it stayed tricky for longer than it should have.

The Mikrotik routing table showed this for the primary VIP:

10.1.1.230/32   gateway=10.1.1.24   distance=20   active=true   # node02
10.1.1.230/32   gateway=10.1.1.37   distance=20   active=true   # storage1
10.1.1.230/32   gateway=10.1.1.2    distance=20   active=true   # orange-pi-max-1

Three routes. All active. Equal cost. ECMP working exactly as designed.

This was misleading in a specific way: because all three paths looked healthy at the routing layer, the mental model that formed was “the cluster is routing correctly, the problem must be somewhere in Kubernetes.” Node-level failures didn’t fit the pattern. If a node were truly broken, its BGP session would drop, its routes would disappear from the table, and traffic would stop flowing to it. That’s how routing is supposed to work.

What was actually happening: the node was not broken at the BGP layer. It was broken one layer lower. BGP keepalives and TCP sessions between MetalLB and the Mikrotik worked fine — those are small packets, well under 1400 bytes. The node looked alive to the router. Only application-layer packets — TLS handshakes, HTTPS responses, anything with a reasonably sized payload — were being dropped at the NIC.

ECMP working correctly was the thing that made the bug hard to see. Because 2/3 of connections succeeded, the instinct was to look for an application-level or configuration-level cause of the 1/3 that failed. The routing table, which is normally your first stop for connectivity problems, was showing you a clean bill of health and actively misdirecting you.


The Approach That Would Have Been Faster

After watching both of them rummage through configs for two hours, here is the approach that would have surfaced the answer in ten minutes:

1. From a pod inside the cluster, curl the ClusterIP of the nginx service.

If this fails: the problem is inside Kubernetes. CNI, pod scheduling, the application itself. If this works: the application is fine. The entire internal Kubernetes stack is eliminated.

2. Curl the MetalLB LoadBalancer VIP from a node.

If this fails: MetalLB is not announcing correctly, or the local network path is broken. If this works: MetalLB, the VIP assignment, and the local routing are all fine.

3. Curl from outside — by IP first, then by hostname.

If IP works but hostname fails: DNS issue. ExternalDNS, split-horizon, TTL. Nothing to do with the network path. If both fail intermittently: the path to the cluster is broken for some requests, which — combined with ECMP being active — immediately points at a node-level problem. One of your ECMP paths is broken.

4. Check BGP: are all expected routes active?

/ip route print where dst-address~"10.1.1.230"

Three active routes, intermittent external failures: one of your three nodes is responding to BGP but dropping application traffic. Check MTU. Check firewall. Check if the node is actually reachable on port 443 from outside.

That’s it. Four steps, one hypothesis per step, each step eliminating half the remaining faces on the board. The Wie is Wie approach.
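
In concrete terms, with placeholder names where the repo has real ones, those four questions look something like this:

# 1. from a pod inside the cluster: pod -> ClusterIP
kubectl run netcheck --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -sv --max-time 5 http://<nginx-clusterip>

# 2. from a node: node -> LoadBalancer VIP
curl -skv --max-time 5 https://10.1.1.230/

# 3. from outside: by IP, then by hostname
# repeat a handful of times; with three-way ECMP, roughly one hang in three is the tell
curl -skv --max-time 5 https://<wan-ip>/
curl -sv --max-time 5 https://<hostname>/

# 4. on the Mikrotik: are all expected routes active?
/ip route print where dst-address~"10.1.1.230"

Each command either hangs or it does not, and the first one that hangs names the broken layer.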

Neither AI played it this way. Both started in the middle — inside the configuration, inside the node, inside the protocol stack — rather than starting at the application and working outward. The answer was found eventually, but by process of elimination over a long session rather than by asking the right first question.


What Was Actually Fixed

The fix, once found, was straightforward:

  1. Exclude orange-pi-max-1 from the BGP advertisement by adding a nodeSelector to the MetalLB BGPAdvertisement resource. Intermittent failures stop immediately.
  2. Set the br0 MTU to 1400 on the node to match the physical NIC limit.
  3. Configure DHCP option 26 (MTU=1400) on the Mikrotik so the correct MTU survives reboots. Both steps are sketched below.
  4. Re-add the node to BGP. All three routes active again, ECMP healthy, verified with a route print.
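
A hedged sketch of steps 2 and 3; the hex value is just 1400 written in two bytes, and how the option gets attached to the DHCP network is the part most likely to differ in your setup:

# step 2: on orange-pi-max-1, align the bridge MTU with the physical NIC
ip link set dev br0 mtu 1400

# step 3: on the Mikrotik, advertise MTU 1400 as DHCP option 26 (1400 = 0x0578)
/ip dhcp-server option add name=mtu-1400 code=26 value=0x0578
# attach it to the relevant DHCP network; the bare [find] here is illustrative
/ip dhcp-server network set [find] dhcp-option=mtu-1400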

Along the way, I noticed that the mikrotik-mcp-go server had a REST API layer bolted onto its MCP SSE server. It exposed two endpoints — /api/{router}/interfaces and /api/{router}/dhcp-leases — as a subset of what the native RouterOS REST API already provides on any RouterOS v7 device. There was no reason for it to exist. It was deleted. 129 lines gone, one ingress object removed, nothing lost.


The Useful Tools

Not everything was wasted time. A few things from this session are worth keeping:

The MCP API for Mikrotik is genuinely useful — not the REST wrapper, but the MCP tool layer itself. The list_routes tool gave a clean JSON representation of the routing table that was straightforward to parse and filter. No RouterOS API client library needed. No SSH session. A single tools/call request via the SSE protocol.
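
That single request is just the standard MCP framing around the tool name. Roughly this, with the SSE handshake and session plumbing omitted, and with no claims about what arguments the tool actually accepts:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": { "name": "list_routes", "arguments": {} }
}

The filtering for a particular VIP then happens client-side on the returned JSON.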

The tcpdump habit. When the configuration looks correct and something is still wrong, capture the actual packets. Not the interpreted output of a monitoring tool. The bytes on the wire. This tells you what is actually happening instead of what the software thinks is happening.

The inside-out script. After the session, I wrote a small netcheck.sh that runs the four-step bisection automatically: pod→ClusterIP, curl VIP, BGP route count via MCP, Hetzner→WAN by IP and hostname. Next time something is “off,” the first move is running that script. Four checks, clear pass/fail, points at the broken layer without any archaeology.


What I Would Tell Both AIs

Start inside. Not at the manifest. Not at the CNI configuration. Not at the DHCP option negotiation. At the application. Can a pod reach the service? Can a node reach the VIP? Does a request make it all the way out to a Hetzner box curling the WAN IP?

Each of these questions is a yes/no. Each yes eliminates an entire layer of the stack. The goal is not to understand the system in full — it is to find the broken link with the fewest questions.

A network is a chain. The debugging question is not “how does this chain work?” The question is “which link doesn’t.”

Start at one end, pull toward the other, stop when you feel resistance.

The packet capture is for when you can’t find the resistance by pulling.


The cluster is on GitLab. The netcheck.sh script is in scripts/. If you’re running a similar MetalLB BGP setup and experiencing intermittent HTTPS hangs, check your node MTUs before reading a single config file.