docs(plans): encoder + contextual-bandit router architecture

Captures the architectural research surfaced during the 2026-05-25 SLM-failure diagnostic session: RouteLLM treats routing as classification, ModernBERT is well-suited to that classification, and FunctionGemma fits as an optional JSON-sanity layer rather than the primary classifier. The current decoder-SLM-as-classifier design is the wrong shape (100% failure rate observed across two model swaps).

Five-phase plan:
1. Embedding feature scaffold (near-term, additive, opt-in)
2. Contextual bandit (LinUCB / Thompson) over the feature set
3. Retire the decoder-SLM classifier once 2 outperforms
4. ModernBERT fine-tune on the accumulated labelled data
5. FunctionGemma JSON sanity layer (optional final stage)

Phase 1 is the only piece scoped for near-term implementation; the rest is multi-month and hinges on the strategic 'EMA vs SLM' question already tracked in TODO.

Cross-references the existing tool-router-specialization plan so a reader of either lands on both. Updates the TODO entry for the bandit selector to note the supersession path.
2026-05-25 01:22:18 +02:00
parent c0c2e4bff5
commit 24945b1eb2
3 changed files with 357 additions and 1 deletions
@@ -146,7 +146,10 @@ Active work, newest first.
     decision in #1.

  Surfaced from the r/coolgithubprojects v0.3.1 launch thread
-  (2026-05-24, `u/Ha_Deal_5079`).
+  (2026-05-24, `u/Ha_Deal_5079`). The encoder + contextual bandit
+  alternative is now sketched in
+  [`docs/superpowers/plans/2026-05-25-encoder-bandit-router.md`](docs/superpowers/plans/2026-05-25-encoder-bandit-router.md) —
+  that plan supersedes #1 above when it ships.

 - **Security boundary — egress controls + session audit log.** The
  current `Firewall` is a content boundary only (scans messages and
@@ -1,5 +1,14 @@
 # Tool-Router Specialization (functiongemma) — 2026-05-23

+> **Companion plan from 2026-05-25:**
+> [`2026-05-25-encoder-bandit-router.md`](2026-05-25-encoder-bandit-router.md)
+> sketches an alternative architecture (encoder + contextual bandit
+> instead of decoder-SLM-as-classifier). The two are complementary,
+> not competing — FunctionGemma fits as the optional Phase 5 "JSON
+> sanity layer" in that plan. Decide which track to invest in based
+> on the did-switch-rate telemetry (this plan) vs the bandit-data
+> accumulation (companion plan).
+
 Follow-up to
 [`2026-05-19-post-slm-unlock.md`](2026-05-19-post-slm-unlock.md)
 Phase A, which shipped two-stage tool routing: round 1 sends a single
@@ -0,0 +1,344 @@
+# Encoder + Contextual-Bandit Router — 2026-05-25
+
+Proposes a long-arc architectural rethink of gnoma's routing layer:
+**replace the decoder-SLM-as-classifier design with an encoder-only
+embedding model feeding a contextual bandit policy**, and treat a
+strict tiny SLM (FunctionGemma-270M-it) as the optional "emit a
+structured route decision" layer rather than the primary classifier.
+
+Surfaced from external research (RouteLLM, ModernBERT, Gemma 3
+270M, Qwen3-Embedding, BGE-M3) brought into the 2026-05-25
+diagnostic session where gnoma's current decoder-SLM classifier
+exhibited a 100% failure rate across two model swaps
+(`reecdev/tiny3.5:1.5b`, `qwen2.5-coder:1.5b`).
+
+This plan is **strategic / multi-month**. Phase 1 below is the only
+piece scoped for near-term implementation; everything else hinges on
+the bandit-vs-SLM strategic decision tracked in the existing
+`Bandit selector — design decisions deferred` TODO entry.
+
+Sibling plans:
+[`2026-05-23-tool-router-specialization.md`](2026-05-23-tool-router-specialization.md)
+already covers the **FunctionGemma fine-tune** track as the
+strict-SLM option; this plan adds the **encoder + bandit** track
+as the alternative (and arguably better-suited) architecture.
+
+---
+
+## Problem
+
+The current router has three coupled problems:
+
+1. **The classifier is a decoder LLM in a job an encoder would do
+   better.** Routing is a classification task with cost/quality
+   trade-offs, not a reasoning task. Asking a decoder model to emit
+   structured JSON for every classify call is high-latency, fragile
+   to chain-of-thought leakage, and non-deterministic.
+
+2. **The bandit can't actually learn quality** because the only
+   success signal is `err == nil` (per `internal/engine/loop.go:118`).
+   EMA scores converge to 1.00 for every arm — see the 2026-05-24
+   `router stats` snapshot where 22 of 25 arm/task pairs sit at
+   exactly 1.00.
+
+3. **The classifier and bandit live in adjacent code but were
+   designed in separate phases**, so the integration point (`Task`
+   built by SLM classifier → fed to `selectBest`) is just data
+   flow, not a learning loop. The SLM's wins/losses don't update
+   the SLM; the bandit's wins/losses don't change which arms the
+   classifier considers.
+
+The 100% SLM-failure incident on 2026-05-25 made (1) urgent. The
+zero-discrimination EMA on 2026-05-24 made (2) urgent. (3) is the
+underlying integration debt.
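
Problem (2) can be reproduced in a few lines. The sketch below is illustrative only: `updateEMA`, `binaryReward`, `simulate`, the smoothing factor `0.2` and the starting score `0.5` are assumptions, not gnoma's actual code or constants. It shows that when the only signal is `err == nil`, two arms of very different real quality end up with the same score.

```go
package main

import "fmt"

// updateEMA applies one exponential-moving-average step.
// alpha is a hypothetical smoothing factor, not gnoma's real value.
func updateEMA(prev, observation, alpha float64) float64 {
	return (1-alpha)*prev + alpha*observation
}

// binaryReward mirrors an `err == nil` success signal: any call that
// returns without error scores 1.0, regardless of answer quality.
func binaryReward(err error) float64 {
	if err == nil {
		return 1.0
	}
	return 0.0
}

// simulate feeds n error-free calls into an EMA that starts at `start`.
func simulate(start float64, n int, alpha float64) float64 {
	score := start
	for i := 0; i < n; i++ {
		score = updateEMA(score, binaryReward(nil), alpha)
	}
	return score
}

func main() {
	// Two arms with different real answer quality; neither ever errors,
	// so the signal cannot tell them apart.
	strong := simulate(0.5, 50, 0.2)
	weak := simulate(0.5, 50, 0.2)
	fmt.Printf("strong=%.4f weak=%.4f\n", strong, weak)
}
```

Both scores approach 1.00 at the same rate, which is consistent with the `router stats` snapshot described above: the EMA measures "did not crash", not "answered well".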
+
+---
+
+## Non-goals
+
+- **Killing the existing SLM classifier today.** Phase 1 of this
+  plan is purely additive (encoder feature extraction); the existing
+  classifier stays as a baseline until the new path is measurably
+  better.
+- **Reimplementing bandit math.** LinUCB and Thompson Sampling are
+  well-understood. The work is the feature pipeline and reward
+  function, not the policy core.
+- **Choosing a single embedding model permanently.** Phase 1 ships
+  with a default but exposes a `[slm.embedding].model` knob so
+  swapping is config-only.
+- **The strict-SLM track.** FunctionGemma fine-tuning is the sibling
+  `2026-05-23-tool-router-specialization.md` plan; this plan
+  references it but does not duplicate it.
+
+---
+
+## Background — research summary
+
+Citations follow the user-provided research thread (RouteLLM 2024,
+ModernBERT 2024, Google FunctionGemma 2025).
+
+- **RouteLLM** frames routing as a classification problem and
+  evaluates four router types: similarity-weighted ranking, matrix
+  factorization, a BERT classifier, and a causal-LLM classifier. The
+  BERT classifier was competitive with the causal-LLM classifier at
+  lower cost and latency. Routing is a classification task; treating
+  it like a generation task is paying generation cost for
+  classification value.
+- **ModernBERT** (Dec 2024) is an encoder-only model with 8k context,
+  trained partly on code, designed for fast classification and
+  retrieval. The 'base' size is ~150M parameters, the 'large' size
+  ~400M. Both are tiny compared to even small decoder LLMs.
+- **FunctionGemma-270M-it** (Dec 2025, built on the Aug 2025 Gemma 3
+  270M base) is Google's small model fine-tuned for natural-language
+  → function-call output. Google's own positioning materials list
+  **query routing** as a use case.
+- **Qwen3-Embedding-0.6B** and **BGE-M3** are strong multilingual
+  embedding models with long-context support; either can serve as
+  feature extractors for downstream classification or bandit
+  policies.
+
+The throughline: **encoder models are the right tool for the
+classification side of routing**; generative SLMs (FunctionGemma)
+are the right tool only when the *output* must be a structured
+decision blob with confidence + tags + fallback. For pure routing,
+encoder features + bandit policy is cheaper, faster, more
+deterministic.
+
+---
+
+## Approach overview
+
+Five phases. Phase 1 is near-term; Phases 2–3 are the actual
+architectural shift; Phase 4 is the long-arc fine-tune; Phase 5 is
+an optional structured-output layer.
+
+### Phase 1 — Embedding feature scaffold (near-term, additive)
+
+Add an embedding pipeline that runs alongside the existing
+classifier. Extract features for every prompt; log them to disk
+next to the existing quality-EMA. No routing decision changes yet.
+
+**Why first:** lets us build up a labelled dataset of (prompt,
+features, arm, outcome) tuples without disturbing today's routing
+behaviour. Phase 2 trains against this dataset.
+
+### Phase 2 — Contextual bandit over the feature set
+
+Once Phase 1 has ~500–1000 labelled observations, swap `selectBest`
+from heuristic quality + EMA score to a LinUCB-style contextual
+bandit that takes the embedding features + the existing arm metadata
+(MaxComplexity, CostWeight, Strengths). The existing EMA quality
+score becomes one feature among many.
+
+### Phase 3 — Retire the decoder-SLM classifier
+
+When Phase 2 routing is measurably better than today's heuristic +
+EMA blend, the decoder-SLM classifier (currently producing 0
+useful classifications on the user's setup) is no longer
+load-bearing. Deprecate it; keep the same `[slm]` config knobs for
+backwards compatibility but route them to a different runtime path.
+
+### Phase 4 — ModernBERT fine-tune
+
+The off-the-shelf embedding model from Phase 1 (BGE-M3 or
+Qwen3-Embedding-0.6B by default) gives general-purpose embeddings.
+Phase 4 fine-tunes a router-specific classification head on top of
+ModernBERT-base using the labelled dataset accumulated since Phase
+1. Pure performance win; falls back gracefully to off-the-shelf
+embeddings if the fine-tune isn't loaded.
+
+### Phase 5 — FunctionGemma JSON sanity layer (optional)
+
+For users who want a structured route decision (arm + confidence +
+fallback) alongside or instead of the bandit output, plug
+FunctionGemma-270M-it (fine-tuned per the
+`tool-router-specialization` plan) as a final-stage decision blob
+emitter. Sits *after* the encoder + bandit, not in front of them.
+
+---
+
+## Phase 1 — Embedding feature scaffold (detailed)
+
+This is the only phase scoped for near-term implementation. The
+others depend on Phase 1's data accumulation.
+
+### What lands
+
+- New package `internal/router/features` with:
+  - `Embedder` interface: `Embed(ctx, prompt string) ([]float32, error)`.
+  - Implementations: `OllamaEmbedder`, `BGE3Embedder`, `NoopEmbedder`
+    (default; returns nil features when no embedding model is
+    configured).
+- New config `[slm.embedding]` section:
+  ```toml
+  [slm.embedding]
+  enabled  = false                       # default off; opt-in
+  backend  = "ollama"                    # ollama | bge-m3 | noop
+  model    = "qwen3-embedding:0.6b"      # ollama model tag
+  base_url = ""                          # backend endpoint override
+  ```
+- Feature extraction hook in `internal/engine/loop.go`: after the
+  classifier runs but before `selectBest`, compute the embedding
+  for the prompt and attach to the routing `Task` as an opaque
+  `Features []float32` field.
+- New on-disk store at `~/.config/gnoma/router-features.jsonl`,
+  one record per observation: `{ts, prompt_hash, features,
+  task_type, arm_id, success, tokens, duration}`.
+  - `prompt_hash` is a SHA-256 of the prompt, never the prompt
+    itself, so the file stays local-only and holds no raw prompt text.
+  - Append-only, atomic-write, incognito-gated, same discipline as
+    the firewall audit log.
+- No selector change. `selectBest` continues to use today's
+  heuristic + EMA blend. Phase 1 just observes.
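
A minimal sketch of the pieces listed above, assuming nothing beyond what the plan states. The `Embedder` signature and `NoopEmbedder` name come from the plan; `failingEmbedder`, `featuresFor` and `promptHash` are hypothetical helpers added for illustration. It shows the two behaviours Phase 1 relies on: an unconfigured or failing embedder yields nil features without blocking routing, and only a SHA-256 digest of the prompt is kept.

```go
package main

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// Embedder matches the interface described in the plan.
type Embedder interface {
	Embed(ctx context.Context, prompt string) ([]float32, error)
}

// NoopEmbedder is the default backend: nil features, no error.
type NoopEmbedder struct{}

func (NoopEmbedder) Embed(context.Context, string) ([]float32, error) {
	return nil, nil
}

// failingEmbedder is a hypothetical stand-in for a backend that is down.
type failingEmbedder struct{}

func (failingEmbedder) Embed(context.Context, string) ([]float32, error) {
	return nil, errors.New("embedding backend unreachable")
}

// promptHash returns the hex-encoded SHA-256 digest of the prompt, so the
// feature store never has to hold the prompt text itself.
func promptHash(prompt string) string {
	sum := sha256.Sum256([]byte(prompt))
	return hex.EncodeToString(sum[:])
}

// featuresFor is a hypothetical version of the loop.go hook: an embedder
// error is swallowed (the plan logs it at Debug) and routing continues
// with nil features.
func featuresFor(e Embedder, prompt string) []float32 {
	feats, err := e.Embed(context.Background(), prompt)
	if err != nil {
		return nil
	}
	return feats
}

func main() {
	prompt := "refactor this function"
	fmt.Println(len(featuresFor(NoopEmbedder{}, prompt)))
	fmt.Println(featuresFor(failingEmbedder{}, prompt) == nil)
	fmt.Println(promptHash(prompt))
}
```

Keeping the error handling inside the hook means the selector never needs to know whether an embedding backend is configured.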
+
+### Why off by default
+
+Embedding inference adds 50–200ms per prompt depending on backend
+and model size. That latency is fine for ollama users running on
+a workstation, painful for users on slower setups. Opt-in keeps
+the regression risk at zero.
+
+### Phase 1 task list
+
+- **F1-1:** Define the `Embedder` interface and `NoopEmbedder` in
+  `internal/router/features/`.
+- **F1-2:** `OllamaEmbedder` wraps `provider/openaicompat` against
+  the ollama embedding endpoint (native `/api/embeddings`; the
+  OpenAI-compatible route is `/v1/embeddings`).
+- **F1-3:** Add the `[slm.embedding]` config section to
+  `internal/config/config.go` with the same defaults-via-zero
+  discipline as the rest of the config.
+- **F1-4:** Wire the embedder into `loop.go` between classifier and
+  selector. Failures log at Debug and don't block routing.
+- **F1-5:** Append-only feature store in
+  `~/.config/gnoma/router-features.jsonl` with atomic writes and an
+  incognito gate; written only when `[slm.embedding].enabled = true`.
+- **F1-6:** Tests covering: embedder mock + observation record;
+  noop embedder produces empty features; incognito skips the
+  store entirely.
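
Tasks F1-5 and F1-6 could be sketched as follows. This is an assumption-laden illustration, not the planned implementation: the `Observation` Go field names and types, the millisecond unit for `duration`, and the helpers `appendObservation` and `demoStore` are all hypothetical. It uses one `Write` call per record on a file opened with `O_APPEND` as a simple approximation of the "atomic writes" requirement; the real store may choose a different mechanism.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// Observation mirrors the record fields listed in the plan.
// Go names, types and the duration unit are assumptions.
type Observation struct {
	TS         time.Time `json:"ts"`
	PromptHash string    `json:"prompt_hash"`
	Features   []float32 `json:"features"`
	TaskType   string    `json:"task_type"`
	ArmID      string    `json:"arm_id"`
	Success    bool      `json:"success"`
	Tokens     int       `json:"tokens"`
	DurationMS int64     `json:"duration"`
}

// appendObservation writes one JSON line per record. Incognito sessions
// return before the file is even opened.
func appendObservation(path string, obs Observation, incognito bool) error {
	if incognito {
		return nil
	}
	line, err := json.Marshal(obs)
	if err != nil {
		return err
	}
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
	if err != nil {
		return err
	}
	if _, err := f.Write(append(line, '\n')); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}

// demoStore appends two normal records and one incognito record to a
// temporary file and returns the number of lines found on disk.
func demoStore() (int, error) {
	dir, err := os.MkdirTemp("", "router-features")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "router-features.jsonl")
	obs := Observation{
		TS: time.Now().UTC(), PromptHash: "abc123", // placeholder hash
		Features: []float32{0.1, 0.2}, TaskType: "code",
		ArmID: "local-small", Success: true, Tokens: 42, DurationMS: 180,
	}
	for _, incognito := range []bool{false, true, false} {
		if err := appendObservation(path, obs, incognito); err != nil {
			return 0, err
		}
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return bytes.Count(data, []byte("\n")), nil
}

func main() {
	n, err := demoStore()
	fmt.Println(n, err)
}
```

The file mode `0o600` is a choice made for this sketch, in keeping with the local-only intent of the store.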
+
+---
+
+## Phase 2+ — Bandit policy (sketch only; needs data first)
+
+Spelled out for context. Not for near-term implementation.
+
+### Feature set per the research
+
+```
+prompt_embedding          — 384-1024 dim depending on model
+token_count               — len of tokenized prompt
+language                  — ISO code from a small lang-detect
+has_code                  — fenced-block heuristic
+has_error_log             — pattern match for stack traces
+needs_tools               — from current heuristic
+needs_vision              — from [Image:...] markers
+estimated_complexity      — current heuristic score
+requested_latency         — turn-budget hint (future)
+arm_context_window        — from arm metadata
+arm_vram_cost             — from arm metadata
+arm_avg_latency           — from quality EMA
+arm_success_rate          — from quality EMA
+```
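
One way to turn the feature list above into the numeric context vector a linear bandit needs is sketched below. The struct names, field units, the `log1p` scaling of counts and the trailing bias term are assumptions made for illustration; `language` and `requested_latency` are left out because the plan does not yet say how they would be encoded.

```go
package main

import (
	"fmt"
	"math"
)

// PromptFeatures holds the per-prompt signals (hypothetical layout).
type PromptFeatures struct {
	Embedding           []float32
	TokenCount          int
	HasCode             bool
	HasErrorLog         bool
	NeedsTools          bool
	NeedsVision         bool
	EstimatedComplexity float64
}

// ArmFeatures holds the per-arm signals (hypothetical layout and units).
type ArmFeatures struct {
	ContextWindow int
	VRAMCostGB    float64
	AvgLatencySec float64
	SuccessRate   float64
}

// b2f encodes a boolean flag as 0 or 1.
func b2f(b bool) float64 {
	if b {
		return 1
	}
	return 0
}

// buildContext flattens prompt and arm features into one vector:
// the embedding first, then 10 scalars, then a constant bias of 1.
func buildContext(p PromptFeatures, a ArmFeatures) []float64 {
	x := make([]float64, 0, len(p.Embedding)+11)
	for _, v := range p.Embedding {
		x = append(x, float64(v))
	}
	x = append(x,
		math.Log1p(float64(p.TokenCount)), // compress large counts
		b2f(p.HasCode),
		b2f(p.HasErrorLog),
		b2f(p.NeedsTools),
		b2f(p.NeedsVision),
		p.EstimatedComplexity,
		math.Log1p(float64(a.ContextWindow)),
		a.VRAMCostGB,
		a.AvgLatencySec,
		a.SuccessRate,
		1.0, // bias term
	)
	return x
}

func main() {
	p := PromptFeatures{Embedding: []float32{0.5, -0.25}, TokenCount: 120, HasCode: true}
	a := ArmFeatures{ContextWindow: 8192, VRAMCostGB: 4, AvgLatencySec: 1.2, SuccessRate: 0.9}
	fmt.Println(len(buildContext(p, a)))
}
```

Log-scaling the counts keeps them in a range comparable to the embedding components; a real pipeline would likely need explicit normalisation as well.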
+
+### Reward function per the research
+
+```
+reward = quality_score
+       - latency_penalty
+       - vram_penalty
+       - failure_penalty
+       - escalation_penalty
+```
+
+- `quality_score`: 1.0 on success, 0.0 on hard error today; richer
+  signal (elf-mediated, user thumbs, tool-call success) once the
+  TODO `Bandit selector — design decisions deferred` resolves.
+- `latency_penalty`: monotone in observed seconds.
+- `vram_penalty`: monotone in declared VRAM cost.
+- `failure_penalty`: hard cost on explicit errors (sandbox
+  denied, parse failed).
+- `escalation_penalty`: cost when a downstream elf had to escalate
+  to a heavier arm because this arm failed.
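
The reward terms above can be combined as in the sketch below. The plan only requires the latency and VRAM penalties to be monotone; the linear form, the `Outcome` and `RewardWeights` types and every weight value here are assumptions chosen for illustration, not tuned numbers.

```go
package main

import "fmt"

// Outcome is a hypothetical record of what happened on one routed call.
type Outcome struct {
	Success     bool
	LatencySec  float64
	VRAMCostGB  float64
	HardFailure bool // explicit error: sandbox denied, parse failed
	Escalated   bool // a downstream elf had to move to a heavier arm
}

// RewardWeights holds illustrative penalty weights.
type RewardWeights struct {
	LatencyPerSec float64
	VRAMPerGB     float64
	Failure       float64
	Escalation    float64
}

// reward computes quality minus the four penalties. Linear penalties are
// the simplest monotone choice; any other monotone function would also
// fit the plan's description.
func reward(o Outcome, w RewardWeights) float64 {
	quality := 0.0
	if o.Success {
		quality = 1.0
	}
	r := quality - w.LatencyPerSec*o.LatencySec - w.VRAMPerGB*o.VRAMCostGB
	if o.HardFailure {
		r -= w.Failure
	}
	if o.Escalated {
		r -= w.Escalation
	}
	return r
}

func main() {
	w := RewardWeights{LatencyPerSec: 0.05, VRAMPerGB: 0.01, Failure: 0.5, Escalation: 0.3}
	fast := reward(Outcome{Success: true, LatencySec: 1, VRAMCostGB: 4}, w)
	slow := reward(Outcome{Success: true, LatencySec: 10, VRAMCostGB: 4}, w)
	failed := reward(Outcome{HardFailure: true, LatencySec: 1, VRAMCostGB: 4}, w)
	fmt.Printf("fast=%.2f slow=%.2f failed=%.2f\n", fast, slow, failed)
}
```

Unlike the binary `err == nil` signal, this reward separates a fast success from a slow one, which gives the bandit something to learn from even before a richer `quality_score` exists.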
+
+### Policy
+
+LinUCB (linear contextual bandit, deterministic exploration
+bounded by UCB) or Thompson Sampling (Bayesian, smoother
+exploration). LinUCB is the safer starting point — fewer
+hyperparameters, well-known behaviour, easier to debug.
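
For orientation, here is a compact sketch of disjoint LinUCB, where each arm keeps its own linear model. It is not gnoma code: the type and method names, the arm IDs, the two-dimensional context and the `alpha` value are assumptions. Each arm tracks the inverse of `A = I + Σ x xᵀ` directly and updates it with the Sherman-Morrison formula, which avoids a matrix inversion on every call.

```go
package main

import (
	"fmt"
	"math"
)

// armState holds one arm's model: aInv is the inverse of A = I + sum(x x^T),
// b is sum(reward * x).
type armState struct {
	aInv [][]float64
	b    []float64
}

// LinUCB is a disjoint linear UCB policy over a fixed, ordered arm list.
type LinUCB struct {
	alpha float64
	order []string
	arms  map[string]*armState
}

func NewLinUCB(dim int, alpha float64, ids []string) *LinUCB {
	l := &LinUCB{alpha: alpha, order: ids, arms: map[string]*armState{}}
	for _, id := range ids {
		inv := make([][]float64, dim)
		for i := range inv {
			inv[i] = make([]float64, dim)
			inv[i][i] = 1 // A starts as the identity matrix
		}
		l.arms[id] = &armState{aInv: inv, b: make([]float64, dim)}
	}
	return l
}

func matVec(m [][]float64, v []float64) []float64 {
	out := make([]float64, len(m))
	for i, row := range m {
		for j, val := range row {
			out[i] += val * v[j]
		}
	}
	return out
}

func dot(a, b []float64) float64 {
	s := 0.0
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// Score returns theta.x + alpha*sqrt(x^T A^-1 x), with theta = A^-1 b.
func (l *LinUCB) Score(id string, x []float64) float64 {
	a := l.arms[id]
	theta := matVec(a.aInv, a.b)
	return dot(theta, x) + l.alpha*math.Sqrt(dot(x, matVec(a.aInv, x)))
}

// Select picks the arm with the highest upper confidence bound;
// ties go to the earlier arm in the list.
func (l *LinUCB) Select(x []float64) string {
	best, bestScore := "", math.Inf(-1)
	for _, id := range l.order {
		if s := l.Score(id, x); s > bestScore {
			best, bestScore = id, s
		}
	}
	return best
}

// Update applies A += x x^T via Sherman-Morrison on the inverse, and b += r*x.
func (l *LinUCB) Update(id string, x []float64, r float64) {
	a := l.arms[id]
	ax := matVec(a.aInv, x)
	denom := 1 + dot(x, ax)
	for i := range a.aInv {
		for j := range a.aInv[i] {
			a.aInv[i][j] -= ax[i] * ax[j] / denom
		}
	}
	for i := range a.b {
		a.b[i] += r * x[i]
	}
}

// demo runs 50 rounds on a fixed context with made-up true rewards.
func demo() string {
	trueReward := map[string]float64{"cloud": 0.2, "local": 1.0}
	policy := NewLinUCB(2, 1.0, []string{"cloud", "local"})
	x := []float64{1, 0.5}
	for i := 0; i < 50; i++ {
		id := policy.Select(x)
		policy.Update(id, x, trueReward[id])
	}
	return policy.Select(x)
}

func main() {
	fmt.Println(demo())
}
```

The exploration bonus shrinks as an arm accumulates observations in a given direction of feature space, so the policy tries each arm early and then settles on the one with the higher estimated reward. With too small an `alpha` it can lock onto the first arm it tries, which is one of the few hyperparameter risks to watch for.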
+
+---
+
+## Risks
+
+- **Latency.** Embedding inference adds 50–200ms per prompt. Phase
+  1's opt-in default means users see no regression; Phase 2's
+  "make it default" decision requires latency benchmarks first.
+- **Data sparsity for fine-tuning (Phase 4).** ModernBERT
+  fine-tuning needs ~10k labelled observations to start being
+  useful. Phase 1 might run for months before Phase 4 is viable.
+  Plan B: synthesise labels from existing prompt logs + rule-based
+  pre-labels.
+- **Off-the-shelf embedding quality.** BGE-M3 / Qwen3-Embedding
+  weren't trained specifically for routing decisions. Phase 4
+  exists precisely to close this gap; Phase 1's data accumulation
+  is what makes Phase 4 possible.
+- **Architectural complexity.** This plan introduces an entire new
+  ML pipeline (embedder → feature store → bandit → reward loop).
+  Phase 1 keeps it side-by-side with the existing path; Phase 2's
+  "swap" decision is reversible because the existing path stays
+  in code.
+- **Privacy.** Prompt hashes (not raw prompts) in the feature
+  store. Still a local-only file; same opt-out plumbing as the
+  project registry from the config-migration plan.
+
+---
+
+## Open questions
+
+- **Should the feature store be per-project or global?** Per-project
+  is more privacy-respecting (one project's prompts don't influence
+  another's routing). Global is more data-efficient (more samples
+  → better bandit). Phase 1 chooses global by default; revisit
+  during Phase 2.
+- **How does this interact with `[router].prefer = local|cloud`?**
+  Easy answer: prefer policy stays as a hard tier-shift, applied
+  after bandit selection. Bandit picks the best feasible arm; the
+  prefer policy is consulted as a final filter / weight.
+- **What about CLI-agent subprocess arms?** They proxy to cloud but
+  run locally; today's `prefer` treats them as non-local. Bandit
+  features should include `is_subprocess` as a distinct feature
+  so the policy can learn the user's preferences for those arms
+  independent of local/cloud.
+- **Cold start.** With no observations, the bandit defaults to
+  pure exploration. Should we seed with the existing heuristic
+  defaults from `internal/router/defaults.go`? Probably yes —
+  warm-start with the curated Strengths as priors.
+
+---
+
+## Rollout
+
+- **Phase 1** ships as v0.5.0 (additive, opt-in, no behaviour
+  change by default). Schema-touching so warrants a minor bump.
+- **Phase 2** ships when Phase 1 has accumulated enough data
+  (~500–1000 observations per user) — opt-in via
+  `[router].bandit_policy = "linucb"` initially, becoming default
+  in a later release once measured better.
+- **Phase 3 (deprecation of decoder-SLM classifier)** is a v0.6.x
+  conversation, gated on Phase 2 measurably outperforming.
+- **Phase 4 (ModernBERT fine-tune)** is v0.7+ — requires the
+  fine-tuned model artifact distributed via Ollama or HF, plus
+  the auto-download story.
+- **Phase 5 (FunctionGemma sanity layer)** is independent of all
+  of the above; lands when the sibling `tool-router-specialization`
+  plan justifies it on did-switch-rate telemetry.
+
+---
+
+## Cross-references
+
+- TODO.md entry "Bandit selector — design decisions deferred" —
+  the strategic question this plan answers in the long run.
+- TODO.md entry "Tool-router specialization (functiongemma)" — the
+  sibling track; complementary, not competing.
+- [`2026-05-23-tool-router-specialization.md`](2026-05-23-tool-router-specialization.md) — FunctionGemma fine-tune plan.
+- [`2026-05-07-gnoma-roadmap.md`](2026-05-07-gnoma-roadmap.md) §Phase 4 — the original "re-evaluate bandit learning" entry.
+- 2026-05-25 diagnostic session (this conversation) — the trigger.