Gino Eising
Nerd by Nature
Apr 12, 2026 11 min read

€200 Claude.ai bill in one week — so I built a cheaper alternative


April 2026 — one week of intensive AI-assisted work, one surprising bill, and one decision to do something about it

The Claude.ai usage screen showed €169.51 spent in a single week. That number included €23 for a Pro subscription and four separate top-ups of €50 each in “extra usage.” One hundred and sixty-nine euros. In seven days. On a chat interface.

To be clear: that money went on a week of intensive AI-assisted development work — cluster setup, debugging, architecture sessions, writing. Not on building the thing described in this post. The LiteLLM setup described here costs a fraction of that per day to run. But the bill is what made the question impossible to ignore: if you are spending this much on one model via one interface, is that actually the right shape for this kind of work?

The honest reaction to the bill: not regret, exactly. The week was productive and the AI assistance was genuinely useful. More like the feeling of looking at a café receipt and realising you went back four times in one afternoon. You enjoyed each visit. The total is still a bit embarrassing.

The rational response: build something smarter.

This is the story of setting up a tiered model ladder behind a LiteLLM proxy, deployed on my home Kubernetes cluster, and then using it to orchestrate a headless Claude instance to research observability for four self-hosted applications. It took longer than expected. Nothing worked first time. The results were interesting.


The idea: a ladder, not a fire hose

The standard pattern for AI-assisted development is: pick one model, throw everything at it. Sonnet for everything. GPT-4o for everything. Whatever is newest and fastest.

The waste in this approach is not obvious until you look at what most requests actually are: “rename this variable,” “write the boilerplate for a new HelmRelease,” “summarise this log output.” Tasks where a free tier model would do the job perfectly well.

The escalation ladder concept fixes this by routing requests to the cheapest model that can handle the task, and only escalating when the cheaper option genuinely fails:

T0 FREE   Gemini 2.5 Flash       — boilerplate, HTML, scaffolding
T1 FREE   Ollama local           — YAML, HelmRelease, Kustomize (qwen2.5-coder:14b)
T2 FREE   OpenRouter free tier   — code review, debugging (gemma-3-27b:free)
T3 PAID   DeepSeek V3/R1         — complex code, architecture, hard reasoning
T4 PAID   OpenRouter paid OSS    — larger context, paid alternatives
T5 PAID   Anthropic Sonnet       — complex multi-step tasks, coordination
T6 PAID   Anthropic Opus         — architecture decisions, hardest reasoning

Each tier is progressively more expensive and (roughly) more capable. The cheapest model that can do the job wins. You only pay for Sonnet when the task actually needs it.

The simplest way to validate the whole thing worked end-to-end was to point Claude Code itself at the proxy — since it speaks the Anthropic API format, setting ANTHROPIC_BASE_URL to the LiteLLM endpoint means every Claude Code request routes through the ladder instead of hitting Anthropic directly. That was the test.
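
In practice that is two environment variables; a minimal sketch (the auth variable name can differ between Claude Code versions, so treat that part as an assumption):

# Send all Claude Code traffic through the ladder instead of api.anthropic.com
export ANTHROPIC_BASE_URL=https://llm-api.djieno.com
export ANTHROPIC_AUTH_TOKEN=sk-litellm-...   # LiteLLM key, not an Anthropic key

# --model now names a tier in the ladder rather than an Anthropic model
claude --print --model skill/infra -p "scaffold a HelmRelease for a new app"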

The infrastructure to make this work: LiteLLM, an OpenAI-compatible proxy that sits in front of all your model backends and handles routing, failover, cost tracking, and rate limit management.


The architecture

graph TD CC["Claude Code / headless claude"] -->|ANTHROPIC_BASE_URL| LIT["LiteLLM Proxy
llm-api.djieno.com"] LIT -->|skill/scaffold| G["Gemini 2.5 Flash
free tier"] LIT -->|skill/infra| OL["Ollama qwen2.5-coder:14b
node02 AMD GPU"] LIT -->|skill/code| DS["DeepSeek V3
$0.27/MTok"] LIT -->|skill/review| GR["gemma-3-27b:free
OpenRouter"] LIT -->|skill/reasoning| DR["DeepSeek R1
$0.55/MTok"] LIT -->|skill/coordinator| SO["Anthropic Sonnet
premium"] DS -->|fallback| GR OL -->|fallback| DS GR -->|fallback| DS

Each skill/ route is not a real model — it is a named router group in LiteLLM’s config. Multiple backends registered under the same name form a pool. LiteLLM’s usage-based-routing-v2 tracks success rates and latency per backend, routes to the healthiest one, and automatically puts failing backends into cooldown.

The key thing to understand, because I got confused about this initially: LiteLLM does not decide which skill to use. The caller decides. When you run claude --model skill/infra, Claude Code sends that model name to the proxy, and the proxy looks it up in its routing table. The intelligence about which tier to use for which task lives in your workflow or your CLAUDE.md, not in the proxy.

The proxy’s job is purely: given this model name, find the healthiest backend for it, enforce rate limits, track cost, retry on failure.
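
That contract is easy to see with a raw request; a sketch against LiteLLM's OpenAI-compatible endpoint (the Bearer key is the proxy's own master key, covered in the next section):

curl -s https://llm-api.djieno.com/v1/chat/completions \
  -H "Authorization: Bearer sk-litellm-..." \
  -H "Content-Type: application/json" \
  -d '{"model": "skill/code", "messages": [{"role": "user", "content": "summarise this log output: ..."}]}'
# The proxy resolves skill/code to the healthiest registered backend and
# returns a normal OpenAI-format chat completion.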


Deploying it to Kubernetes

The setup lives in apps/base/litellm/ in my FluxCD repo — a Deployment, a ConfigMap with the model routing config, a CNPG Postgres cluster for spend tracking, ExternalSecrets for API keys from Bitwarden, and two Ingress objects: one behind Authentik SSO for the web UI, and one for the API endpoint that requires a LiteLLM master key as a Bearer token. The API is not open — every request needs Authorization: Bearer sk-litellm-... — it is just not behind the SSO flow, which would break API clients.
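
The directory itself is a plain Kustomize overlay; the file names below are illustrative rather than exact:

# apps/base/litellm/kustomization.yaml (illustrative layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml        # LiteLLM proxy
  - configmap.yaml         # model routing config (excerpt below)
  - postgres-cluster.yaml  # CNPG cluster for spend tracking
  - externalsecrets.yaml   # provider API keys from Bitwarden
  - ingress-ui.yaml        # web UI, behind Authentik SSO
  - ingress-api.yaml       # API endpoint, master-key Bearer auth only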

# The config that routes skill/* to real backends
model_list:
  - model_name: skill/code
    litellm_params:
      model: deepseek/deepseek-chat
      api_key: os.environ/DEEPSEEK_API_KEY
    model_info:
      max_output_tokens: 8192
      description: "Code gen primary — T3 DeepSeek V3"

  - model_name: skill/code        # same name = router group
    litellm_params:
      model: openrouter/google/gemma-3-27b-it:free
      api_key: os.environ/OPENROUTER_API_KEY
    model_info:
      description: "Code gen fallback — T2 gemma-3-27b:free"

One complication: Flux manages the Deployment, and previously I was restarting LiteLLM manually when I changed the ConfigMap. Flux reconciles every 10 minutes and removes any manual restartedAt annotation it did not apply. You end up with a fight you cannot win.

The fix: deploy Stakater Reloader, which watches ConfigMaps and Secrets for changes and triggers Deployment rollouts automatically. The annotation goes on the Deployment metadata, not the pod template — if you put it on the pod template Reloader ignores it, which I discovered the hard way.
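
For the record, the working placement uses Reloader's auto annotation on the Deployment's own metadata, not under spec.template:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm
  annotations:
    reloader.stakater.com/auto: "true"   # correct: Reloader now rolls the Deployment on ConfigMap/Secret changes
spec:
  template:
    metadata:
      labels:
        app: litellm                     # putting the annotation down here does nothing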


Testing: where things actually break

Every tier needed verification. Not “does the API return 200” but “does it respond in a reasonable time under repeated requests, and does fallback actually trigger when the primary goes into cooldown.”

What was solid: Gemini 2.5 Flash — fast, generous free quota for single-user usage. gemma-3-27b:free on OpenRouter — consistently 1-2s response time, no rate limit issues at a single-user request rate. DeepSeek V3 and R1 — reliable, cheap, prompt caching reduces cost further on repeated context.

What was broken: llama-3.3-70b:free goes through the Venice upstream provider on OpenRouter. It is chronically congested. Most requests time out after 20 seconds. It is not a rate limit problem — it is a structural availability problem. Replaced with DeepSeek V3 as the skill/code primary.

What was a 404: qwen-2.5-72b-instruct:free disappeared from OpenRouter between the time I configured it and the time I tested it. OpenRouter free tier is a moving target — models appear and disappear without notice. Replaced with gemma-3-27b:free which I had already confirmed working.

The rate limit tuning: The default cooldown when a backend fails is 5 minutes. For free tier transient throttling that is too long — you want to retry after 2 minutes, not 5. Set cooldown_time: 120. With allowed_fails: 3 (three failures in 60 seconds trigger cooldown), a briefly overloaded free tier model recovers and re-enters rotation quickly.
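
Those knobs sit under router_settings in the same ConfigMap, next to the routing strategy; a sketch with the values described above:

router_settings:
  routing_strategy: usage-based-routing-v2   # route each group to its healthiest backend
  allowed_fails: 3        # failures within the window before a backend is benched
  cooldown_time: 120      # seconds a benched backend sits out before re-entering rotation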

The max_tokens saga: Claude Code, when talking to an Anthropic-compatible endpoint, sends max_tokens=65536 (Sonnet’s limit). DeepSeek’s hard cap is 8192. Three attempts to fix this:

  1. litellm_params.max_tokens: 8000 — this is a DEFAULT, not a cap. If the client sends a higher value, the client wins, so the request still fails.
  2. litellm_settings.max_completion_tokens: 8192 — caps the OpenAI max_completion_tokens field. DeepSeek uses max_tokens. No effect.
  3. litellm_settings.max_tokens: 8192 — this one actually applies min(client_value, setting_value) before forwarding. Works (config shown below).
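
In ConfigMap form, the fix from attempt three:

# The setting that finally clamps the request
litellm_settings:
  max_tokens: 8192   # forwards min(client max_tokens, 8192) to the backend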

Three commits, three Flux reconciliations, three Reloader-triggered restarts to get there. This is the kind of thing that looks trivial in a blog post and takes an afternoon in practice.


Cooldown and the free tier question

After the rate limit testing, the question was whether the free tier slots in the ladder were actually worth keeping. The cooldown mechanism means a congested free tier model can temporarily black-hole requests before the fallback kicks in.

The conclusion: it depends on the model. gemma-3-27b:free is worth keeping — it is genuinely fast and the cooldown almost never triggers at single-user request rates. llama-3.3-70b:free was not worth keeping — the underlying provider is structurally unreliable, not transiently overloaded. Removed from the routing groups.

The pattern: free tier models stay in the ladder if they pass a simple stress test (5-8 sequential requests, all respond, none time out). If they fail that test consistently, they are not a rate limit problem — they are a broken endpoint.
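
The stress test itself is nothing fancy; a sketch of the kind of loop used, hitting one route with the same endpoint and key as before:

# Eight sequential requests against one route; watch status codes and timing
for i in $(seq 1 8); do
  curl -s -o /dev/null -w "req $i: %{http_code} in %{time_total}s\n" \
    --max-time 20 \
    https://llm-api.djieno.com/v1/chat/completions \
    -H "Authorization: Bearer sk-litellm-..." \
    -H "Content-Type: application/json" \
    -d '{"model": "skill/review", "messages": [{"role": "user", "content": "ping"}]}'
done
# All eight back within a couple of seconds: the model stays in the ladder.
# Hung requests or repeated 000s at the 20-second cutoff: a broken endpoint, not a rate limit.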


The headless experiment: one Claude orchestrating another

Once the ladder was stable, the interesting question was whether you could use Claude Code itself — pointed at LiteLLM with ANTHROPIC_BASE_URL=https://llm-api.djieno.com — to run multi-model headless tasks.

# Research task: use the reasoning tier (DeepSeek R1)
claude --print \
  --model skill/reasoning \
  -p "$(cat /tmp/observability-research-task.md)" \
  > /tmp/observability-research-output.md

# Implementation task: feed research into infra/code tiers
claude --print --model skill/infra \
  -p "$(cat /tmp/authentik-metrics-task.md)" \
  > /tmp/result-infra.md

The research task ran. A single skill/reasoning call to DeepSeek R1 produced a 311-line observability architecture document covering all four target applications: Mailu, Nextcloud, Immich, Authentik. App-by-app breakdown of what metrics endpoints existed natively, what exporters to deploy as sidecars, OTel Collector configuration, alert definitions. Cost: €0.008.

The output was genuinely useful. Authentik has a Django metrics endpoint on port 9000 that just needs AUTHENTIK_METRICS_ENABLED=true — a one-line HelmRelease change that enables Prometheus scraping immediately. Nextcloud needs a php-fpm exporter as a sidecar. Mailu needs postfix and dovecot exporters. None of this required reading documentation for four separate applications — the reasoning model did that synthesis in one call for less than a cent.
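
For Authentik the eventual change is small enough to sketch by hand. Illustrative only: the values path depends on the chart version, and the variable name and port come straight from the research output rather than independent verification:

# apps/base/authentik/helmrelease.yaml (illustrative excerpt)
spec:
  values:
    global:
      env:
        - name: AUTHENTIK_METRICS_ENABLED   # per the research output
          value: "true"
    # with metrics enabled, Prometheus can scrape the Django endpoint on port 9000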

The implementation tasks — feeding that research into skill/infra to generate the actual HelmRelease patches — did not run. Every attempt hit the max_tokens wall and came back with a DeepSeek rejection. That is the problem the max_tokens fix described above resolves. The implementation phase is next.


What a “coordinator” actually does

There is a pattern in the config called skill/coordinator, which routes to Anthropic Sonnet with a DeepSeek V3 fallback. The name suggests some kind of intelligent dispatch — a model that decides which other models to call.

The reality is simpler: the coordinator is just a model you call manually when the task is “orchestrate multiple headless workers and merge their outputs”:

# Run three specialists in parallel
claude --print --model skill/scaffold -p "scaffold the React component" > scaffold.md
claude --print --model skill/code     -p "implement the logic"          > logic.md
claude --print --model skill/review   -p "review for issues"            > review.md

# Coordinator merges the results
claude --print --model skill/coordinator \
  -p "Merge these outputs coherently: $(cat scaffold.md logic.md review.md)"

LiteLLM does not see the relationship between these calls. From the proxy’s perspective, they are four independent requests to four different model groups. The coordination logic lives in the shell script, not the proxy. This is actually a feature — the proxy stays simple, the orchestration logic stays inspectable.

Currently, skill/coordinator falls back to DeepSeek V3 because the Anthropic API key on the cluster ran out of credits. This is worth understanding: the Claude.ai Pro subscription and the Anthropic API are completely separate billing systems. The €169 on the usage screen spent during this week bought chat interface access — not API credits. Those require a separate top-up at console.anthropic.com. The fallback means the coordinator tier keeps working, just with a less capable model until that is sorted.
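
In config terms the coordinator is just another router group, the same shape as skill/code earlier; a sketch (exact Sonnet model ID elided) showing the fallback ordering that keeps it running right now:

  - model_name: skill/coordinator
    litellm_params:
      model: anthropic/claude-sonnet-...      # primary (exact model ID elided)
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      description: "Coordinator primary - T5 Anthropic Sonnet"

  - model_name: skill/coordinator             # same name = fallback in the pool
    litellm_params:
      model: deepseek/deepseek-chat
      api_key: os.environ/DEEPSEEK_API_KEY
    model_info:
      description: "Coordinator fallback - T3 DeepSeek V3"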


What a week of €200 AI spend actually produces

A week of heavy Claude.ai use bought a lot of interactive sessions — some productive, some going in circles, some that turned into blog posts about the going-in-circles. The part that felt like a real shift was realising that the tool use pattern matters as much as the tool.

The observability research for €0.008 — not approximately, precisely €0.008 — would have taken me two to three hours to produce manually: reading Authentik’s documentation, checking Mailu’s GitHub issues for Prometheus support, cross-referencing Nextcloud’s monitoring guide. The reasoning model did that synthesis in one call.

The max_tokens debugging taught something real about how LiteLLM works internally — knowledge that is now in the config comments and will prevent the same confusion next time. The reloader annotation on the wrong YAML level is the kind of thing that costs two hours and teaches you something permanent.

Nothing worked in one go. The Ollama hostname was wrong. Two Gemini models were deprecated. One OpenRouter model vanished. The max_tokens fix took three iterations. Getting from “I have a LiteLLM config” to “every route is tested and the fallbacks trigger correctly” took a full day across multiple sessions.

That is not a failure of the tooling. That is what integration work looks like. The tooling is genuinely good — LiteLLM’s health endpoint, usage-based-routing-v2, Reloader’s automatic ConfigMap detection. Every bug was configuration, not fundamentals.

The week’s bill is what prompted the build. The build is what makes next week cheaper.


Infrastructure lives at apps/base/litellm/ in the FluxCD repo. The headless task files are in /tmp/ on the development machine — prompts, not manifests, so not committed. The observability research output feeds the next implementation session.