docs(plans): encoder + contextual-bandit router architecture

Captures the architectural research surfaced during the 2026-05-25
SLM-failure diagnostic session: RouteLLM treats routing as
classification, ModernBERT is well-suited to that classification, and
FunctionGemma fits as an optional JSON-sanity layer rather than the
primary classifier. The current decoder-SLM-as-classifier design is
the wrong shape (100% failure rate observed across two model swaps).

Five-phase plan:
  1. Embedding feature scaffold (near-term, additive, opt-in)
  2. Contextual bandit (LinUCB / Thompson) over the feature set
  3. Retire the decoder-SLM classifier once Phase 2 outperforms it
  4. ModernBERT fine-tune on the accumulated labelled data
  5. FunctionGemma JSON sanity layer (optional final stage)

Phase 1 is the only piece scoped for near-term implementation; the
rest is multi-month and hinges on the strategic 'EMA vs SLM'
question already tracked in TODO.

Cross-references the existing tool-router-specialization plan so a
reader of either lands on both. Updates the TODO entry for the
bandit selector to note the supersession path.
2026-05-25 01:22:18 +02:00
parent c0c2e4bff5
commit 24945b1eb2
3 changed files with 357 additions and 1 deletions
+4 -1
@@ -146,7 +146,10 @@ Active work, newest first.
decision in #1.
Surfaced from the r/coolgithubprojects v0.3.1 launch thread
(2026-05-24, `u/Ha_Deal_5079`). The encoder + contextual bandit
alternative is now sketched in
[`docs/superpowers/plans/2026-05-25-encoder-bandit-router.md`](docs/superpowers/plans/2026-05-25-encoder-bandit-router.md) —
that plan supersedes #1 above when it ships.
- **Security boundary — egress controls + session audit log.** The
current `Firewall` is a content boundary only (scans messages and
@@ -1,5 +1,14 @@
# Tool-Router Specialization (functiongemma) — 2026-05-23
> **Companion plan from 2026-05-25:**
> [`2026-05-25-encoder-bandit-router.md`](2026-05-25-encoder-bandit-router.md)
> sketches an alternative architecture (encoder + contextual bandit
> instead of decoder-SLM-as-classifier). The two are complementary,
> not competing — FunctionGemma fits as the optional Phase 5 "JSON
> sanity layer" in that plan. Decide which track to invest in based
> on the did-switch-rate telemetry (this plan) vs the bandit-data
> accumulation (companion plan).
Follow-up to
[`2026-05-19-post-slm-unlock.md`](2026-05-19-post-slm-unlock.md)
Phase A, which shipped two-stage tool routing: round 1 sends a single
@@ -0,0 +1,344 @@
# Encoder + Contextual-Bandit Router — 2026-05-25
Proposes a long-arc architectural rethink of gnoma's routing layer:
**replace the decoder-SLM-as-classifier design with an encoder-only
embedding model feeding a contextual bandit policy**, and treat a
strict tiny SLM (FunctionGemma-270M-it) as the optional "emit a
structured route decision" layer rather than the primary classifier.
Surfaced from external research (RouteLLM, ModernBERT, Gemma 3
270M, Qwen3-Embedding, BGE-M3) brought into the 2026-05-25
diagnostic session where gnoma's current decoder-SLM classifier
exhibited a 100% failure rate across two model swaps
(`reecdev/tiny3.5:1.5b`, `qwen2.5-coder:1.5b`).
This plan is **strategic / multi-month**. Phase 1 below is the only
piece scoped for near-term implementation; everything else hinges on
the bandit-vs-SLM strategic decision tracked in the existing
`Bandit selector — design decisions deferred` TODO entry.
Sibling plans:
[`2026-05-23-tool-router-specialization.md`](2026-05-23-tool-router-specialization.md)
already covers the **FunctionGemma fine-tune** track as the
strict-SLM option; this plan adds the **encoder + bandit** track
as the alternative (and arguably better-suited) architecture.
---
## Problem
The current router has three coupled problems:
1. **The classifier is a decoder LLM in a job an encoder would do
better.** Routing is a classification task with cost/quality
trade-offs, not a reasoning task. Asking a decoder model to emit
structured JSON for every classify call is high-latency, fragile
to chain-of-thought leakage, and non-deterministic.
2. **The bandit can't actually learn quality** because the only
success signal is `err == nil` (per `internal/engine/loop.go:118`).
EMA scores converge to 1.00 for every arm — see the 2026-05-24
`router stats` snapshot where 22 of 25 arm/task pairs sit at
exactly 1.00.
3. **The classifier and bandit live in adjacent code but were
designed in separate phases**, so the integration point (`Task`
built by SLM classifier → fed to `selectBest`) is just data
flow, not a learning loop. The SLM's wins/losses don't update
the SLM; the bandit's wins/losses don't change which arms the
classifier considers.
The 100% SLM-failure incident on 2026-05-25 made (1) urgent. The
zero-discrimination EMA on 2026-05-24 made (2) urgent. (3) is the
underlying integration debt.
---
## Non-goals
- **Killing the existing SLM classifier today.** Phase 1 of this
plan is purely additive (encoder feature extraction); the existing
classifier stays as a baseline until the new path is measurably
better.
- **Reimplementing bandit math.** LinUCB and Thompson Sampling are
well-understood. The work is the feature pipeline and reward
function, not the policy core.
- **Choosing a single embedding model permanently.** Phase 1 ships
with a default but exposes a `[slm.embedding].model` knob so
swapping is config-only.
- **The strict-SLM track.** FunctionGemma fine-tuning is the sibling
`2026-05-23-tool-router-specialization.md` plan; this plan
references it but does not duplicate it.
---
## Background — research summary
Citations follow the user-provided research thread (RouteLLM 2024,
ModernBERT 2024, Google FunctionGemma 2025).
- **RouteLLM** frames routing as a classification problem and
evaluates four router types: similarity-weighted ranking, matrix
factorization, a BERT classifier, and a causal-LLM classifier. The
BERT classifier was competitive with the causal-LLM classifier at
lower cost and latency. Routing is a classification task; treating
it as a generation task pays generation cost for classification
value.
- **ModernBERT** (Dec 2024) is an encoder-only model with 8k context,
trained partly on code, designed for fast classification and
retrieval. The 'base' size is ~150M parameters, the 'large' size
~400M. Both are tiny compared to even small decoder LLMs.
- **FunctionGemma-270M-it** (Dec 2025; built on Gemma 3 270M from Aug 2025) is Google's small model
fine-tuned for natural-language → function-call output. Google's
own positioning materials list **query routing** as a use case.
- **Qwen3-Embedding-0.6B** and **BGE-M3** are strong multilingual
embedding models with long-context support; either can serve as
feature extractors for downstream classification or bandit
policies.
The throughline: **encoder models are the right tool for the
classification side of routing**; generative SLMs (FunctionGemma)
are the right tool only when the *output* must be a structured
decision blob with confidence + tags + fallback. For pure routing,
encoder features + bandit policy is cheaper, faster, more
deterministic.
---
## Approach overview
Five phases. Phase 1 is near-term; Phases 2-4 are the actual
architectural shift; Phase 5 is the optional long-arc FunctionGemma fine-tune.
### Phase 1 — Embedding feature scaffold (near-term, additive)
Add an embedding pipeline that runs alongside the existing
classifier. Extract features for every prompt; log them to disk
next to the existing quality-EMA. No routing decision changes yet.
**Why first:** lets us build up a labelled dataset of (prompt,
features, arm, outcome) tuples without disturbing today's routing
behaviour. Phase 2 trains against this dataset.
### Phase 2 — Contextual bandit over the feature set
Once Phase 1 has ~500-1000 labelled observations, swap `selectBest`
from heuristic quality + EMA score to a LinUCB-style contextual
bandit that takes the embedding features + the existing arm metadata
(MaxComplexity, CostWeight, Strengths). The existing EMA quality
score becomes one feature among many.
### Phase 3 — Retire the decoder-SLM classifier
When Phase 2 routing is measurably better than today's heuristic +
EMA blend, the decoder-SLM classifier (currently producing 0
useful classifications on the user's setup) is no longer
load-bearing. Deprecate it; keep the same `[slm]` config knobs for
backwards compatibility but route them to a different runtime path.
### Phase 4 — ModernBERT fine-tune
The off-the-shelf embedding model from Phase 1 (BGE-M3 or
Qwen3-Embedding-0.6B by default) gives general-purpose embeddings.
Phase 4 fine-tunes a router-specific classification head on top of
ModernBERT-base using the labelled dataset accumulated since Phase
1. Pure performance win; falls back gracefully to off-the-shelf
embeddings if the fine-tune isn't loaded.
### Phase 5 — FunctionGemma JSON sanity layer (optional)
For users who want a structured route decision (arm + confidence +
fallback) alongside or instead of the bandit output, plug
FunctionGemma-270M-it (fine-tuned per the
`tool-router-specialization` plan) as a final-stage decision blob
emitter. Sits *after* the encoder + bandit, not in front of them.
---
## Phase 1 — Embedding feature scaffold (detailed)
This is the only phase scoped for near-term implementation. The
others depend on Phase 1's data accumulation.
### What lands
- New package `internal/router/features` with:
- `Embedder` interface: `Embed(ctx, prompt string) ([]float32, error)`.
- Implementations: `OllamaEmbedder`, `BGE3Embedder`, `NoopEmbedder`
(default; returns nil features when no embedding model is
configured).
- New config `[slm.embedding]` section:
```toml
[slm.embedding]
enabled = false # default off; opt-in
backend = "ollama" # ollama | bge-m3 | noop
model = "qwen3-embedding:0.6b" # ollama model tag
base_url = "" # backend endpoint override
```
- Feature extraction hook in `internal/engine/loop.go`: after the
classifier runs but before `selectBest`, compute the embedding
for the prompt and attach to the routing `Task` as an opaque
`Features []float32` field.
- New on-disk store at `~/.config/gnoma/router-features.jsonl`,
one record per observation: `{ts, prompt_hash, features,
task_type, arm_id, success, tokens, duration}`.
- `prompt_hash` is a SHA-256 of the prompt — never the prompt
itself — to keep the file local-only-but-not-secret-laden.
- Append-only, atomic-write, incognito-gated, same discipline as
the firewall audit log.
- No selector change. `selectBest` continues to use today's
heuristic + EMA blend. Phase 1 just observes.
### Why off by default
Embedding inference adds 50-200 ms per prompt depending on backend
and model size. That latency is fine for ollama users running on
a workstation, painful for users on slower setups. Opt-in keeps
the regression risk at zero.
### Phase 1 task list
- **F1-1:** Define the `Embedder` interface and `NoopEmbedder` in
`internal/router/features/`.
- **F1-2:** `OllamaEmbedder` wraps `provider/openaicompat` with the
ollama embedding endpoint (`/api/embeddings`).
- **F1-3:** Add the `[slm.embedding]` config section to
`internal/config/config.go` with the same defaults-via-zero
discipline as the rest of the config.
- **F1-4:** Wire the embedder into `loop.go` between classifier and
selector. Failures log at Debug and don't block routing.
- **F1-5:** Append-only feature store in
`~/.config/gnoma/router-features.jsonl` with atomic writes,
incognito gate; disabled whenever `[slm.embedding].enabled = false` (the default).
- **F1-6:** Tests covering: embedder mock + observation record;
noop embedder produces empty features; incognito skips the
store entirely.
---
## Phase 2+ — Bandit policy (sketch only; needs data first)
Spelled out for context. Not for near-term implementation.
### Feature set per the research
```
prompt_embedding — 384-1024 dim depending on model
token_count — len of tokenized prompt
language — ISO code from a small lang-detect
has_code — fenced-block heuristic
has_error_log — pattern match for stack traces
needs_tools — from current heuristic
needs_vision — from [Image:...] markers
estimated_complexity — current heuristic score
requested_latency — turn-budget hint (future)
arm_context_window — from arm metadata
arm_vram_cost — from arm metadata
arm_avg_latency — from quality EMA
arm_success_rate — from quality EMA
```
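Several of the non-embedding features above are cheap string heuristics. The sketch below shows one possible shape for them. The `PromptFeatures` struct, the whitespace-based token count, and the specific stack-trace patterns (Python, Go, and Java style) are illustrative assumptions, not the heuristics gnoma uses today.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// fence is the Markdown code-fence marker (three backticks).
var fence = strings.Repeat("`", 3)

// stackTracePattern matches a few common stack-trace openers. The pattern
// set is an assumption chosen for illustration and is far from exhaustive.
var stackTracePattern = regexp.MustCompile(
	`(?m)^(Traceback \(most recent call last\):|panic: |goroutine \d+ \[|\s+at \S+\(\S*:\d+\))`)

// PromptFeatures holds the cheap, non-embedding features of one prompt.
type PromptFeatures struct {
	TokenCount  int  // whitespace-split proxy, not a real tokenizer
	HasCode     bool // at least one opening and closing fence
	HasErrorLog bool // looks like it contains a stack trace
	NeedsVision bool // contains an [Image:...] marker
}

func extractPromptFeatures(prompt string) PromptFeatures {
	return PromptFeatures{
		TokenCount:  len(strings.Fields(prompt)),
		HasCode:     strings.Count(prompt, fence) >= 2,
		HasErrorLog: stackTracePattern.MatchString(prompt),
		NeedsVision: strings.Contains(prompt, "[Image:"),
	}
}

func main() {
	prompt := "Why does this crash?\n" + fence +
		"\npanic: runtime error: index out of range\ngoroutine 1 [running]:\n" + fence
	fmt.Printf("%+v\n", extractPromptFeatures(prompt))
}
```

These booleans would be concatenated with the embedding vector (as 0/1 values) before being handed to the policy. The heuristics are deliberately crude, since a linear policy can down-weight a noisy feature.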
### Reward function per the research
```
reward = quality_score
- latency_penalty
- vram_penalty
- failure_penalty
- escalation_penalty
```
- `quality_score`: 1.0 on success, 0.0 on hard error today; richer
signal (elf-mediated, user thumbs, tool-call success) once the
TODO `Bandit selector — design decisions deferred` resolves.
- `latency_penalty`: monotone in observed seconds.
- `vram_penalty`: monotone in declared VRAM cost.
- `failure_penalty`: hard cost on explicit errors (sandbox
denied, parse failed).
- `escalation_penalty`: cost when a downstream elf had to escalate
to a heavier arm because this arm failed.
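As a sketch of how the terms above might combine, the function below uses linear penalties, which is the simplest monotone choice. The `Outcome` and `RewardWeights` types and all weight values are hypothetical. This plan does not fix the functional form or the constants.

```go
package main

import "fmt"

// Outcome is a hypothetical summary of one routed call.
type Outcome struct {
	Success        bool
	LatencySeconds float64
	VRAMGiB        float64 // declared VRAM cost of the arm
	HardFailure    bool    // explicit error such as sandbox denied or parse failed
	Escalated      bool    // a downstream elf had to escalate to a heavier arm
}

// RewardWeights are assumed tuning constants, not values from this plan.
type RewardWeights struct {
	Latency    float64 // penalty per second
	VRAM       float64 // penalty per GiB
	Failure    float64 // flat penalty on a hard failure
	Escalation float64 // flat penalty on an escalation
}

// reward = quality_score - latency_penalty - vram_penalty
//          - failure_penalty - escalation_penalty
func reward(o Outcome, w RewardWeights) float64 {
	quality := 0.0
	if o.Success {
		quality = 1.0
	}
	r := quality - w.Latency*o.LatencySeconds - w.VRAM*o.VRAMGiB
	if o.HardFailure {
		r -= w.Failure
	}
	if o.Escalated {
		r -= w.Escalation
	}
	return r
}

func main() {
	w := RewardWeights{Latency: 0.05, VRAM: 0.01, Failure: 0.5, Escalation: 0.3}
	fast := Outcome{Success: true, LatencySeconds: 2, VRAMGiB: 4}
	slow := Outcome{Success: true, LatencySeconds: 10, VRAMGiB: 4}
	failed := Outcome{LatencySeconds: 2, VRAMGiB: 4, HardFailure: true, Escalated: true}
	fmt.Printf("fast=%.2f slow=%.2f failed=%.2f\n", reward(fast, w), reward(slow, w), reward(failed, w))
}
```

Unlike the binary `err == nil` signal, two successful calls now score differently when one is slower or heavier, which gives the policy something to discriminate on even before a richer `quality_score` exists.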
### Policy
LinUCB (linear contextual bandit, deterministic exploration
bounded by UCB) or Thompson Sampling (Bayesian, smoother
exploration). LinUCB is the safer starting point — fewer
hyperparameters, well-known behaviour, easier to debug.
---
## Risks
- **Latency.** Embedding inference adds 50-200 ms per prompt. Phase
1's opt-in default means users see no regression; Phase 2's
"make it default" decision requires latency benchmarks first.
- **Data sparsity for fine-tuning (Phase 4).** ModernBERT
fine-tuning needs ~10k labelled observations to start being
useful. Phase 1 might run for months before Phase 4 is viable.
Plan B: synthesise labels from existing prompt logs + rule-based
pre-labels.
- **Off-the-shelf embedding quality.** BGE-M3 / Qwen3-Embedding
weren't trained specifically for routing decisions. Phase 4
exists precisely to close this gap; Phase 1's data accumulation
is what makes Phase 4 possible.
- **Architectural complexity.** This plan introduces an entire new
ML pipeline (embedder → feature store → bandit → reward loop).
Phase 1 keeps it side-by-side with the existing path; Phase 2's
"swap" decision is reversible because the existing path stays
in code.
- **Privacy.** Prompt hashes (not raw prompts) in the feature
store. Still a local-only file; same opt-out plumbing as the
project registry from the config-migration plan.
---
## Open questions
- **Should the feature store be per-project or global?** Per-project
is more privacy-respecting (one project's prompts don't influence
another's routing). Global is more data-efficient (more samples
→ better bandit). Phase 1 chooses global by default; revisit
during Phase 2.
- **How does this interact with `[router].prefer = local|cloud`?**
Easy answer: prefer policy stays as a hard tier-shift, applied
after bandit selection. Bandit picks the best feasible arm; the
prefer policy is consulted as a final filter / weight.
- **What about CLI-agent subprocess arms?** They proxy to cloud but
run locally; today's `prefer` treats them as non-local. Bandit
features should include `is_subprocess` as a distinct feature
so the policy can learn the user's preferences for those arms
independent of local/cloud.
- **Cold start.** With no observations, the bandit defaults to
pure exploration. Should we seed with the existing heuristic
defaults from `internal/router/defaults.go`? Probably yes —
warm-start with the curated Strengths as priors.
---
## Rollout
- **Phase 1** ships as v0.5.0 (additive, opt-in, no behaviour
change by default). Schema-touching so warrants a minor bump.
- **Phase 2** ships when Phase 1 has accumulated enough data
(~500-1000 observations per user), opt-in via
`[router].bandit_policy = "linucb"` initially, becoming default
in a later release once measured better.
- **Phase 3 (deprecation of decoder-SLM classifier)** is a v0.6.x
conversation, gated on Phase 2 measurably outperforming.
- **Phase 4 (ModernBERT fine-tune)** is v0.7+ — requires the
fine-tuned model artifact distributed via Ollama or HF, plus
the auto-download story.
- **Phase 5 (FunctionGemma sanity layer)** is independent of all
of the above; lands when the sibling `tool-router-specialization`
plan justifies it on did-switch-rate telemetry.
---
## Cross-references
- TODO.md entry "Bandit selector — design decisions deferred" —
the strategic question this plan answers in the long run.
- TODO.md entry "Tool-router specialization (functiongemma)" — the
sibling track; complementary, not competing.
- [`2026-05-23-tool-router-specialization.md`](2026-05-23-tool-router-specialization.md) — FunctionGemma fine-tune plan.
- [`2026-05-07-gnoma-roadmap.md`](2026-05-07-gnoma-roadmap.md) §Phase 4 — the original "re-evaluate bandit learning" entry.
- 2026-05-25 diagnostic session (this conversation) — the trigger.