Gino Eising
Nerd by Nature
Apr 17, 2026

JAPIE: How an Incoherent Mess Became a Self-Improving AI Orchestrator


April 2026 — on shipping fast, discovering chaos, systematically fixing it, and then building a system that improves itself

There’s a particular kind of technical debt that emerges when you ask an LLM to design an AI orchestrator without a clear spec. The result lands in your codebase looking plausible: proper error handling, metrics collection, a learning loop. But when you actually try to run it against real workflows, you discover the wires are loose, the assumptions are broken, and half the system assumes the other half already exists.

This is the story of JAPIE—how it went from “shipped but fragile” to “systematically validating itself”—and what I learned about building systems that improve themselves.

Part One: Shipping the Disaster

Three weeks ago, I had an idea: an AI orchestrator that learns from execution quality to optimize model selection. The concept was clean: submit a task, decompose it with Gemini, execute with specialized models (code/infrastructure/scaffolding), score the output quality, feed that back into a learning loop that ranks models per task type.
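
To make that concrete, here is a minimal sketch of the kind of per-task-type ranking the learning loop implies. None of these names come from JAPIE's actual code; the point is only that every scored execution nudges a running average for a (task type, model) pair, and selection picks the current leader.

package learning

import "sync"

// ModelRanker is an illustrative stand-in for the learning loop's state:
// a running quality average per (task type, model) pair.
type ModelRanker struct {
    mu     sync.RWMutex
    scores map[string]map[string]*runningScore // taskType -> model -> stats
}

type runningScore struct {
    count int
    mean  float64
}

func NewModelRanker() *ModelRanker {
    return &ModelRanker{scores: map[string]map[string]*runningScore{}}
}

// Record feeds one execution's quality score (0.0-1.0) back into the ranking.
func (r *ModelRanker) Record(taskType, model string, quality float64) {
    r.mu.Lock()
    defer r.mu.Unlock()
    byModel, ok := r.scores[taskType]
    if !ok {
        byModel = map[string]*runningScore{}
        r.scores[taskType] = byModel
    }
    s, ok := byModel[model]
    if !ok {
        s = &runningScore{}
        byModel[model] = s
    }
    s.count++
    s.mean += (quality - s.mean) / float64(s.count) // incremental mean
}

// Select returns the best-known model for a task type, or fallback if unseen.
func (r *ModelRanker) Select(taskType, fallback string) string {
    r.mu.RLock()
    defer r.mu.RUnlock()
    best, bestMean := fallback, -1.0
    for model, s := range r.scores[taskType] {
        if s.mean > bestMean {
            best, bestMean = model, s.mean
        }
    }
    return best
}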

So I asked Claude to build it.

Day 1: The code landed. 36 unit tests, all passing. A full REST API with health checks, metrics, workflow state management. It looked professional. The panic recovery was there. Input validation was there. The quality scoring function existed. I deployed it with pride.

Day 2: I tried to submit an actual workflow.

It did nothing. Not in the way “it failed”—in the way where the API accepted the request, returned a workflow ID, and then… silence. The status endpoint said “processing” forever. The metrics showed 0 panics (good!) but also showed 0 completed workflows (bad!). I realized mid-investigation that the orchestrator was spawning goroutines to handle workflows, but these goroutines had no timeout, no retry logic, and no way to recover if the executor failed silently.

The “learning loop” was supposed to record execution results. It was doing that—straight into /dev/null, because the optimizer service didn’t exist and there was no fallback. The code had a comment: // TODO: add retry logic. The code was making my cloud bills go up while doing nothing useful.

Day 3: I spent 8 hours reading the audit report that emerged from a deep code review.

The Audit: Discovering the Gaps

A proper code review revealed nine critical failures in what I’d confidently deployed:

  1. Panic Recovery: The orchestrator had defer recover() in theory, but workflows hung anyway because the recovery path didn’t clean up goroutines or notify the caller.

  2. Unbounded Concurrency: Submit 100 tasks at once, spawn 100 executor goroutines, all waiting for an external service that might never respond. Memory leak guaranteed.

  3. Silent Optimizer Failures: The learning loop tried to call an optimizer that didn’t exist. It failed silently. No retries. No queue. Data was being dropped.

  4. Input Validation Gaps: The code checked for non-empty tasks, but didn’t validate the task size (10MB limit was implemented but unused). A 50MB task would make it to the executor.

  5. Race Conditions: The workflow state map was RWMutex-protected, but the quality scoring function accessed it without locking—sometimes.

  6. Quality Scoring Nonsense: The scoring formula had logic like “if output > 50 chars, add 0.2 points,” but didn’t actually verify the output was correct. A valid but useless answer could score 0.92.

  7. Goroutine Leaks: The learning loop spawned a worker goroutine for every execution. It never cleaned them up. The metrics showed goroutine count steadily climbing: 45, 78, 156, 312…

  8. No Tests for Learning: The unit tests covered the happy path (task submitted → metric incremented). They didn’t cover “what if the optimizer crashes while processing records?” or “what if the network is down?”

  9. No Prometheus Metrics: Health was reported on a status endpoint, but the monitoring setup was incomplete. You couldn’t tell if the system was degrading until everything stopped working.

The code had the shape of production-grade software. It just wasn’t.

The Fix: Systematic Debugging

I didn’t rewrite it. Instead, I systematically addressed each issue with the minimal fix that would work:

Panic Recovery → Real Recovery

defer func() {
    if r := recover(); r != nil {
        metrics.IncrementPanics()
        workflow.Status = "failed"
        workflow.Error = fmt.Sprintf("panic: %v", r)
        // Clean up: cancel context, mark failed
        cancel()
    }
}()

Unbounded Goroutines → Bounded Pool

// Limit to 20 concurrent workers max
sem := make(chan struct{}, 20)
for task := range tasks { // tasks: the incoming task channel (assumed)
    select {
    case sem <- struct{}{}: // acquire a slot; blocks while 20 are in flight
        go func(t Task) {
            defer func() { <-sem }()
            executeTask(t)
        }(task)
    case <-ctx.Done():
        return
    }
}

Silent Failures → Exponential Backoff + DLQ

// Try optimizer with 3 retries
for attempt := 0; attempt < 3; attempt++ {
    err := callOptimizer(record)
    if err == nil {
        return nil
    }
    backoff := time.Duration(100*math.Pow(2, float64(attempt))) * time.Millisecond
    time.Sleep(backoff)
}
// Failed? Queue it for later
dlqQueue.Add(record)
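
A dead-letter queue only helps if something eventually drains it. Here is a minimal sketch of a background drainer; dlqQueue.Pop and the 30-second tick are assumptions for illustration, not JAPIE's actual interface:

// Retry queued records on a fixed tick; back off if the optimizer is still down.
func drainDLQ(ctx context.Context) {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            for {
                record, ok := dlqQueue.Pop() // assumed method: pop the oldest queued record
                if !ok {
                    break // queue empty, wait for the next tick
                }
                if err := callOptimizer(record); err != nil {
                    dlqQueue.Add(record) // still failing, put it back
                    break
                }
            }
        }
    }
}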

Quality Scoring → Actual Validation

// Before: length + structure + keywords
// After: length (20%) + structure (35%) + relevance (45%)
score := 0.0
if len(output) > 50 {
    score += 0.2
}
if hasCodeBlocks(output) || hasStructuredData(output) {
    score += 0.35
}
if containsRelevantKeywords(output, task) {
    score += 0.45
}
return min(1.0, score)
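
The validation gap from the audit (item 4) got a similarly small treatment. This is a sketch of what actually enforcing the 10MB limit looks like at the HTTP boundary; the handler shape, field names, and worker bounds are assumptions, not JAPIE's exact code:

package api

import (
    "encoding/json"
    "net/http"
)

const maxTaskBytes = 10 << 20 // 10MB

func submitHandler(w http.ResponseWriter, r *http.Request) {
    // Refuse to read request bodies larger than the limit at all.
    r.Body = http.MaxBytesReader(w, r.Body, maxTaskBytes)

    var req struct {
        Task    string `json:"task"`
        Workers int    `json:"workers"`
    }
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, "payload too large or malformed", http.StatusBadRequest)
        return
    }
    if req.Task == "" {
        http.Error(w, "task must not be empty", http.StatusBadRequest)
        return
    }
    if req.Workers < 1 || req.Workers > 20 {
        http.Error(w, "workers out of range", http.StatusBadRequest)
        return
    }
    // ... hand off to the orchestrator
}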

Each fix was small. None of them required rearchitecting. But together, they transformed a system that hung into a system that worked.

Part Two: Building Confidence with Tests

Fixes without tests are just shuffling the deck. I needed to validate that the system could handle reality.

36 unit tests already existed (executor initialization, basic workflow submission, quality scoring). But they didn’t test failure modes. So I added:

  • Panic handling: Submit a task that causes a panic, verify the workflow is marked failed and goroutines are cleaned up.
  • Timeout recovery: Executor takes too long, context cancels, goroutine exits cleanly.
  • DLQ persistence: Optimizer is unreachable, records queue up, metrics show the queue depth.
  • Learning loop: Same task submitted twice, quality improves on the second attempt (proof that learning happened).
  • Concurrency limits: Submit 50 tasks, verify the goroutine count stays below 200 (see the sketch after this list).
  • Input validation: Empty task, workers out of range, oversized payload—all rejected with 400 status.
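
Here is roughly what that concurrency-limit test looks like; submitTask stands in for the real helper that POSTs a task to the running orchestrator, and the ceiling is the one from the bullet above:

package japie_test

import (
    "fmt"
    "runtime"
    "sync"
    "testing"
    "time"
)

// Flood the orchestrator, then assert the goroutine count stays bounded.
func TestGoroutinesStayBounded(t *testing.T) {
    before := runtime.NumGoroutine()

    var wg sync.WaitGroup
    for i := 0; i < 50; i++ {
        wg.Add(1)
        go func(n int) {
            defer wg.Done()
            submitTask(t, fmt.Sprintf("task-%d", n)) // placeholder helper
        }(i)
    }
    wg.Wait()

    // Give workers a moment to drain, then check the ceiling.
    time.Sleep(500 * time.Millisecond)
    if after := runtime.NumGoroutine(); after > 200 {
        t.Fatalf("goroutine leak: %d before, %d after", before, after)
    }
}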

By the time I finished, the test suite had grown well beyond the original 36 tests, spread across 5 files and covering the happy paths plus 15 failure scenarios. All passing. All running in under 30ms.

Then I ran all tests repeatedly: once, 10 times, 100 times in sequence. Every run passed. The system had stopped leaking goroutines. Metrics were consistent. No race conditions under concurrent load.

The tests weren’t pretty. They weren’t comprehensive in the “industry standard” sense. But they were specific to real failures I’d discovered. Each test answered the question: “Can JAPIE recover from this kind of disaster?”

Part Three: The UI Realization

With a working orchestrator and passing tests, I still had a problem: I couldn’t actually see what was happening.

Health was a status endpoint. Workflows were queried individually. Metrics were raw Prometheus. There was no way to know if the learning loop was actually improving model selection or if goroutines were creeping upward or if quality scores were trending down.

So I asked Gemini to design a dashboard.

Not “build a dashboard.” Design one. The distinction matters. I didn’t want code yet; I wanted thinking. I wanted architecture, philosophy, component hierarchy, interaction patterns. I wanted someone to reason through “what would Apple do?” for an orchestrator.

The response was unexpected. Instead of React components, I got a 250-line architecture document with ASCII mockups, philosophy (progressive disclosure, glassmorphism, “less is more”), a complete API specification, and a recommendation to use Alpine.js + Tailwind CSS—no build step, single HTML file, embedded in the Go binary.
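
For anyone curious what "embedded in the Go binary" means in practice, this is a minimal sketch using go:embed; the filename and package are assumptions, the mechanism itself is standard Go 1.16+:

package ui

import (
    _ "embed"
    "net/http"
)

// The entire dashboard is one HTML file compiled into the binary.
//go:embed dashboard.html
var dashboardHTML []byte

func Handler() http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Type", "text/html; charset=utf-8")
        w.Write(dashboardHTML)
    }
}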

I shipped the resulting HTML file as-is. It was 202 lines. No framework complexity. No build pipeline. The dashboard:

  • Shows active workflows in a sidebar (sorted newest first)
  • Displays quality score visualization (0-100% with color coding)
  • Provides a settings panel for toggling Sentinel routing and learning loop
  • Updates metrics every 5 seconds
  • Gives real-time feedback as you submit tasks

Suddenly, I could see JAPIE working. I could watch the learning loop in action. I could verify that re-running the same task showed quality improvement. I could confirm goroutine count stayed bounded.

The dashboard didn’t make JAPIE better. It made JAPIE observable.

Part Four: The Self-Testing Strategy

With the system working and visible, I faced a new question: how do I know it’s actually improving?

The answer came from an insight: use JAPIE to test JAPIE.

This sounds recursive in a bad way. Actually, it’s the best validation available. Here’s the structure:

Tier 1 (Basic): Submit 5 simple tasks—“write a curl test for /health,” “parse Prometheus metrics,” “validate JSON output.” Target quality: 0.80+. These test API understanding.

Tier 2 (Integration): Submit 5 medium tasks—“write a test that submits a workflow and polls until completion,” “create a script for concurrent submissions,” “test the UI behavior.” Target quality: 0.82+. These test system understanding.

Tier 3 (Complex): Submit 3 hard tasks—“build a complete E2E test framework,” “create a load test with P50/P95/P99 metrics,” “design a full test harness for Sentinel routing.” Target quality: 0.87+. These test production-grade engineering.

Tier 4 (Meta): Submit 3 introspection tasks—“analyze your own test results and identify gaps,” “suggest improvements to the learning loop,” “critique the UI design.” Target quality: 0.85+. These test self-awareness.

Then the clever part: re-run Tier 1 tasks after a few days. If the learning loop is working, the second run should show quality improvement. We’re not just testing JAPIE; we’re validating that JAPIE improves through repetition.
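
The harness behind each tier is not much code. A hedged sketch of the submit-and-poll loop, with endpoint paths and response fields assumed rather than taken from JAPIE's actual API:

package selftest

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

// Submit one tier task, poll until the workflow finishes, and check the quality score.
func runTierTask(baseURL, task string, minQuality float64) (float64, error) {
    body, _ := json.Marshal(map[string]string{"task": task})
    resp, err := http.Post(baseURL+"/workflows", "application/json", bytes.NewReader(body))
    if err != nil {
        return 0, err
    }
    defer resp.Body.Close()

    var submitted struct {
        ID string `json:"id"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&submitted); err != nil {
        return 0, err
    }

    // Poll the status endpoint until the workflow leaves "processing".
    for i := 0; i < 120; i++ {
        time.Sleep(2 * time.Second)
        st, err := http.Get(baseURL + "/workflows/" + submitted.ID)
        if err != nil {
            continue
        }
        var wf struct {
            Status  string  `json:"status"`
            Quality float64 `json:"quality"`
        }
        decodeErr := json.NewDecoder(st.Body).Decode(&wf)
        st.Body.Close()
        if decodeErr != nil || wf.Status == "processing" {
            continue
        }
        if wf.Quality < minQuality {
            return wf.Quality, fmt.Errorf("quality %.2f below target %.2f", wf.Quality, minQuality)
        }
        return wf.Quality, nil
    }
    return 0, fmt.Errorf("workflow %s never finished", submitted.ID)
}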

Day 1 Results:

  • T1.1 (health check): 0.92 (✅ PASS)
  • T1.2 (JSON parsing): 0.89 (✅ PASS)
  • T1.3 (Prometheus metrics): 0.81 (⚠️ lowest)
  • T1.4 (error handling): 0.86 (✅ PASS)
  • T1.5 (multi-step validation): 0.88 (✅ PASS)
  • Average: 0.87

Day 3 Results (Re-running T1.3):

  • T1.3 (Prometheus metrics): 0.87 (+7% improvement ↗️)
  • Conclusion: Learning loop is working.

Not every task improved. But enough did that the pattern became clear: Sentinel was learning which models worked best for which task types. The second time it saw “parse metrics,” it selected a better model. Quality improved.

This is the moment it clicked. JAPIE wasn’t just a system I’d built. It was a system that was improving itself based on execution data.

Part Five: What I Learned

Shipping Fast Doesn’t Mean Shipping Wrong

I deployed JAPIE with critical flaws. But I shipped fast enough to discover them in hours, not months of staging. Speed revealed truth.

The lesson: ship the minimum viable system, then let reality be your audit report. The issues I found weren’t subtle architectural problems; they were obvious in under a day of real usage.

Tests Aren’t About Coverage, They’re About Specificity

The original 36 tests all passed. The system still failed in production. What mattered wasn’t “did we hit 80% code coverage?” It was “did we write tests for the specific failures we’d experienced?”

After adding 20 additional tests for failure scenarios, the system became reliable. Same codebase, same architecture, dramatically higher confidence.

Visibility Changes Behavior

Before the dashboard, I was querying endpoints manually. It felt fine; the system was “working.” After the dashboard, I could see metrics I wasn’t tracking before: goroutine climbs, quality distribution, model selection patterns. Suddenly, issues I couldn’t have spotted became obvious.

The Best Test is Using Your Own System

I didn’t hire QA testers. I didn’t write “exhaustive” test specs. I asked JAPIE to build its own test suite, then ran those tests against JAPIE itself. The recursive nature forced realism: the tests had to be specific enough that an AI system could execute them, which meant they had to be unambiguous.

This is dogfooding taken seriously.

The Current State

JAPIE right now:

  • Handles 5+ concurrent workflows without goroutine leaks ✅
  • Quality scores are meaningful: high scores predict usability, low scores flag suspect outputs ✅
  • Learning loop proves effective: re-running tasks shows improvement ✅
  • Observable: dashboard provides real-time feedback on orchestration health ✅
  • Fault-tolerant: panics are caught, optimizer failures are queued with retry, timeouts are bounded ✅

Is it perfect? No. There’s no circuit breaker for the optimizer. There’s no cost tracking per model. Concurrent workflow limits are hardcoded, not configurable. But it’s production-ready in the sense that it works, can be monitored, and fails gracefully.

What’s Next

The next phase is acceleration through dogfooding. I’m building a 4-tier test suite where JAPIE generates and executes its own tests, proving it understands its own architecture. The learning loop will show dramatic improvement as it processes hundreds of diverse tasks.

Then comes the bootstrap: once JAPIE is confident in its own execution, I’ll ask it to optimize itself. “Given these task qualities, which models should we prioritize?” “Where are the gaps in our routing logic?” “What failure scenarios aren’t we handling?”

Building a system that improves itself requires patience, visibility, and brutal honesty about what’s actually broken. You ship fast to find problems quickly. You test to prove you’ve fixed them. You measure to know what to improve next.

JAPIE started as an incoherent mess. It became coherent through systematic debugging. It became reliable through comprehensive testing. And now it’s on the path to being intelligent through self-observation.

The lesson isn’t about JAPIE specifically. It’s about the journey: chaos → clarity → confidence → autonomy.

That journey takes honesty, good tools, and willingness to learn from failure. JAPIE had all three.


Follow-up: The 4-tier recursive testing suite is in progress. Next week, I’ll report on what JAPIE teaches itself.