docs(plans): encoder + contextual-bandit router architecture
Captures the architectural research surfaced during the 2026-05-25 SLM-failure diagnostic session: RouteLLM treats routing as classification, ModernBERT is well-suited to that classification, and FunctionGemma fits as an optional JSON-sanity layer rather than the primary classifier. The current decoder-SLM-as-classifier design is the wrong shape (100% failure rate observed across two model swaps).

Five-phase plan:
1. Embedding feature scaffold (near-term, additive, opt-in)
2. Contextual bandit (LinUCB / Thompson) over the feature set
3. Retire the decoder-SLM classifier once Phase 2 outperforms it
4. ModernBERT fine-tune on the accumulated labelled data
5. FunctionGemma JSON sanity layer (optional final stage)

Phase 1 is the only piece scoped for near-term implementation; the rest is multi-month and hinges on the strategic 'EMA vs SLM' question already tracked in TODO.

Cross-references the existing tool-router-specialization plan so a reader of either lands on both. Updates the TODO entry for the bandit selector to note the supersession path.

@@ -146,7 +146,10 @@ Active work, newest first.
 decision in #1.

 Surfaced from the r/coolgithubprojects v0.3.1 launch thread
-(2026-05-24, `u/Ha_Deal_5079`).
+(2026-05-24, `u/Ha_Deal_5079`). The encoder + contextual bandit
+alternative is now sketched in
+[`docs/superpowers/plans/2026-05-25-encoder-bandit-router.md`](docs/superpowers/plans/2026-05-25-encoder-bandit-router.md) —
+that plan supersedes #1 above when it ships.

 - **Security boundary — egress controls + session audit log.** The
 current `Firewall` is a content boundary only (scans messages and

@@ -1,5 +1,14 @@
 # Tool-Router Specialization (functiongemma) — 2026-05-23

+> **Companion plan from 2026-05-25:**
+> [`2026-05-25-encoder-bandit-router.md`](2026-05-25-encoder-bandit-router.md)
+> sketches an alternative architecture (encoder + contextual bandit
+> instead of decoder-SLM-as-classifier). The two are complementary,
+> not competing — FunctionGemma fits as the optional Phase 5 "JSON
+> sanity layer" in that plan. Decide which track to invest in based
+> on the did-switch-rate telemetry (this plan) vs the bandit-data
+> accumulation (companion plan).
+
 Follow-up to
 [`2026-05-19-post-slm-unlock.md`](2026-05-19-post-slm-unlock.md)
 Phase A, which shipped two-stage tool routing: round 1 sends a single

@@ -0,0 +1,344 @@
# Encoder + Contextual-Bandit Router — 2026-05-25

Proposes a long-arc architectural rethink of gnoma's routing layer:
**replace the decoder-SLM-as-classifier design with an encoder-only
embedding model feeding a contextual bandit policy**, and treat a
strict tiny SLM (FunctionGemma-270M-it) as the optional "emit a
structured route decision" layer rather than the primary classifier.

Surfaced from external research (RouteLLM, ModernBERT, Gemma 3
270M, Qwen3-Embedding, BGE-M3) brought into the 2026-05-25
diagnostic session where gnoma's current decoder-SLM classifier
exhibited a 100% failure rate across two model swaps
(`reecdev/tiny3.5:1.5b`, `qwen2.5-coder:1.5b`).

This plan is **strategic / multi-month**. Phase 1 below is the only
piece scoped for near-term implementation; everything else hinges on
the bandit-vs-SLM strategic decision tracked in the existing
`Bandit selector — design decisions deferred` TODO entry.

Sibling plans:
[`2026-05-23-tool-router-specialization.md`](2026-05-23-tool-router-specialization.md)
already covers the **FunctionGemma fine-tune** track as the
strict-SLM option; this plan adds the **encoder + bandit** track
as the alternative (and arguably better-suited) architecture.

---

## Problem

The current router has three coupled problems:

1. **The classifier is a decoder LLM in a job an encoder would do
   better.** Routing is a classification task with cost/quality
   trade-offs, not a reasoning task. Asking a decoder model to emit
   structured JSON for every classify call is high-latency, fragile
   to chain-of-thought leakage, and nondeterministic.

2. **The bandit can't actually learn quality** because the only
   success signal is `err == nil` (per `internal/engine/loop.go:118`).
   EMA scores converge to 1.00 for every arm; see the 2026-05-24
   `router stats` snapshot where 22 of 25 arm/task pairs sit at
   exactly 1.00.

3. **The classifier and bandit live in adjacent code but were
   designed in separate phases**, so the integration point (`Task`
   built by SLM classifier → fed to `selectBest`) is just data
   flow, not a learning loop. The SLM's wins/losses don't update
   the SLM; the bandit's wins/losses don't change which arms the
   classifier considers.

The 100% SLM-failure incident on 2026-05-25 made (1) urgent. The
zero-discrimination EMA on 2026-05-24 made (2) urgent. (3) is the
underlying integration debt.
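
To make problem (2) concrete, here is a minimal sketch of an EMA fed only a binary `err == nil` signal. The smoothing factor, the neutral starting score of 0.5, and the names `updateEMA` and `scoreAfter` are illustrative assumptions, not gnoma's actual code.

```go
package main

import "fmt"

// updateEMA folds one observation into an exponential moving average.
// alpha is a hypothetical smoothing factor; gnoma's real value may differ.
func updateEMA(prev, obs, alpha float64) float64 {
	return (1-alpha)*prev + alpha*obs
}

// scoreAfter replays n turns for one arm. Every turn counts as a success
// (obs = 1) unless failEvery > 0 and the turn index is a multiple of it.
func scoreAfter(n, failEvery int) float64 {
	score := 0.5 // assumed neutral starting score
	for i := 1; i <= n; i++ {
		obs := 1.0 // err == nil
		if failEvery > 0 && i%failEvery == 0 {
			obs = 0.0 // hard error
		}
		score = updateEMA(score, obs, 0.1)
	}
	return score
}

func main() {
	// Any arm that never hard-errors ends up at the same score,
	// regardless of how good its answers were.
	fmt.Printf("never errors:        %.2f\n", scoreAfter(100, 0))
	fmt.Printf("errors every 10th:   %.2f\n", scoreAfter(100, 10))
}
```

Under these assumptions the score carries information only about hard-error frequency, which is why arms that differ in answer quality but rarely error become indistinguishable.
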

---

## Non-goals

- **Killing the existing SLM classifier today.** Phase 1 of this
  plan is purely additive (encoder feature extraction); the existing
  classifier stays as a baseline until the new path is measurably
  better.
- **Reimplementing bandit math.** LinUCB and Thompson Sampling are
  well-understood. The work is the feature pipeline and reward
  function, not the policy core.
- **Choosing a single embedding model permanently.** Phase 1 ships
  with a default but exposes a `[slm.embedding].model` knob so
  swapping is config-only.
- **The strict-SLM track.** FunctionGemma fine-tuning is the sibling
  `2026-05-23-tool-router-specialization.md` plan; this plan
  references it but does not duplicate it.

---

## Background — research summary

Citations follow the user-provided research thread (RouteLLM 2024,
ModernBERT 2024, Google FunctionGemma 2025).

- **RouteLLM** framed routing as a classification problem and
  compared four router types: similarity routing, matrix
  factorization, a BERT classifier, and a causal-LLM classifier. The
  BERT classifier was competitive with the causal-LLM classifier at
  lower cost and latency. Routing is a classification task; treating
  it like a generation task is paying generation cost for
  classification value.
- **ModernBERT** (Dec 2024) is an encoder-only model with 8k context,
  trained partly on code, designed for fast classification and
  retrieval. The 'base' size is ~150M parameters, the 'large' size
  ~400M. Both are tiny compared to even small decoder LLMs.
- **FunctionGemma-270M-it** (Aug 2025) is Google's small model
  fine-tuned for natural-language → function-call output. Google's
  own positioning materials list **query routing** as a use case.
- **Qwen3-Embedding-0.6B** and **BGE-M3** are strong multilingual
  embedding models with long-context support; either can serve as
  feature extractors for downstream classification or bandit
  policies.

The throughline: **encoder models are the right tool for the
classification side of routing**; generative SLMs (FunctionGemma)
are the right tool only when the *output* must be a structured
decision blob with confidence + tags + fallback. For pure routing,
encoder features + bandit policy is cheaper, faster, and more
deterministic.

---

## Approach overview

Five phases. Phase 1 is near-term; Phases 2–3 are the actual
architectural shift; Phase 4 is the long-arc fine-tune; Phase 5 is
an optional add-on.

### Phase 1 — Embedding feature scaffold (near-term, additive)

Add an embedding pipeline that runs alongside the existing
classifier. Extract features for every prompt; log them to disk
next to the existing quality-EMA. No routing decision changes yet.

**Why first:** lets us build up a labelled dataset of (prompt,
features, arm, outcome) tuples without disturbing today's routing
behaviour. Phase 2 trains against this dataset.

### Phase 2 — Contextual bandit over the feature set

Once Phase 1 has ~500–1000 labelled observations, swap `selectBest`
from heuristic quality + EMA score to a LinUCB-style contextual
bandit that takes the embedding features + the existing arm metadata
(MaxComplexity, CostWeight, Strengths). The existing EMA quality
score becomes one feature among many.

### Phase 3 — Retire the decoder-SLM classifier

When Phase 2 routing is measurably better than today's heuristic +
EMA blend, the decoder-SLM classifier (currently producing 0
useful classifications on the user's setup) is no longer
load-bearing. Deprecate it; keep the same `[slm]` config knobs for
backwards compatibility but route them to a different runtime path.

### Phase 4 — ModernBERT fine-tune

The off-the-shelf embedding model from Phase 1 (BGE-M3 or
Qwen3-Embedding-0.6B by default) gives general-purpose embeddings.
Phase 4 fine-tunes a router-specific classification head on top of
ModernBERT-base using the labelled dataset accumulated since Phase
1. Pure performance win; falls back gracefully to off-the-shelf
embeddings if the fine-tune isn't loaded.

### Phase 5 — FunctionGemma JSON sanity layer (optional)

For users who want a structured route decision (arm + confidence +
fallback) alongside or instead of the bandit output, plug in
FunctionGemma-270M-it (fine-tuned per the
`tool-router-specialization` plan) as a final-stage decision blob
emitter. Sits *after* the encoder + bandit, not in front of them.

---

## Phase 1 — Embedding feature scaffold (detailed)

This is the only phase scoped for near-term implementation. The
others depend on Phase 1's data accumulation.

### What lands

- New package `internal/router/features` with:
  - `Embedder` interface: `Embed(ctx, prompt string) ([]float32, error)`.
  - Implementations: `OllamaEmbedder`, `BGE3Embedder`, `NoopEmbedder`
    (default; returns nil features when no embedding model is
    configured).
- New config `[slm.embedding]` section:
  ```toml
  [slm.embedding]
  enabled  = false                    # default off; opt-in
  backend  = "ollama"                 # ollama | bge-m3 | noop
  model    = "qwen3-embedding:0.6b"   # ollama model tag
  base_url = ""                       # backend endpoint override
  ```
- Feature extraction hook in `internal/engine/loop.go`: after the
  classifier runs but before `selectBest`, compute the embedding
  for the prompt and attach it to the routing `Task` as an opaque
  `Features []float32` field.
- New on-disk store at `~/.config/gnoma/router-features.jsonl`,
  one record per observation: `{ts, prompt_hash, features,
  task_type, arm_id, success, tokens, duration}`.
  - `prompt_hash` is a SHA-256 of the prompt, never the prompt
    itself, so the file stays local-only and holds no raw prompt
    text.
  - Append-only, atomic-write, incognito-gated, same discipline as
    the firewall audit log.
- No selector change. `selectBest` continues to use today's
  heuristic + EMA blend. Phase 1 just observes.
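
The pieces listed above could fit together roughly as follows. This is a sketch only: the `Embedder` signature and the record field names come from the list, while the Go field types, the JSON tags, the millisecond unit for `duration`, the helper `hashPrompt`, and the sample values in `main` are assumptions made for illustration.

```go
package main

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"time"
)

// Embedder matches the interface named in the plan.
type Embedder interface {
	Embed(ctx context.Context, prompt string) ([]float32, error)
}

// NoopEmbedder is the default backend: nil features, no error.
type NoopEmbedder struct{}

func (NoopEmbedder) Embed(context.Context, string) ([]float32, error) { return nil, nil }

// Observation mirrors the record fields from the plan.
// Types and the duration unit (milliseconds) are assumptions.
type Observation struct {
	TS         time.Time `json:"ts"`
	PromptHash string    `json:"prompt_hash"`
	Features   []float32 `json:"features"`
	TaskType   string    `json:"task_type"`
	ArmID      string    `json:"arm_id"`
	Success    bool      `json:"success"`
	Tokens     int       `json:"tokens"`
	DurationMS int64     `json:"duration"`
}

// hashPrompt returns the hex SHA-256 of the prompt so the raw text
// never reaches disk. (Hypothetical helper name.)
func hashPrompt(prompt string) string {
	sum := sha256.Sum256([]byte(prompt))
	return hex.EncodeToString(sum[:])
}

func main() {
	var e Embedder = NoopEmbedder{}
	prompt := "refactor the parser"

	feats, err := e.Embed(context.Background(), prompt)
	if err != nil {
		feats = nil // an embedding failure must not block routing
	}

	obs := Observation{
		TS:         time.Now().UTC(),
		PromptHash: hashPrompt(prompt),
		Features:   feats,
		TaskType:   "refactor",    // illustrative value
		ArmID:      "local-small", // illustrative value
		Success:    true,
		Tokens:     412,
		DurationMS: 1830,
	}
	line, err := json.Marshal(obs)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(line))
}
```

With `NoopEmbedder` the `features` field serialises as `null`, which gives downstream consumers a simple way to skip records that carry no embedding.
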

### Why off by default

Embedding inference adds 50–200ms per prompt depending on backend
and model size. That latency is fine for ollama users running on
a workstation, painful for users on slower setups. Opt-in keeps
the regression risk at zero.

### Phase 1 task list

- **F1-1:** Define the `Embedder` interface and `NoopEmbedder` in
  `internal/router/features/`.
- **F1-2:** `OllamaEmbedder` wraps `provider/openaicompat` with the
  ollama embedding endpoint (`/api/embeddings`).
- **F1-3:** Add the `[slm.embedding]` config section to
  `internal/config/config.go` with the same defaults-via-zero
  discipline as the rest of the config.
- **F1-4:** Wire the embedder into `loop.go` between classifier and
  selector. Failures log at Debug and don't block routing.
- **F1-5:** Append-only feature store in
  `~/.config/gnoma/router-features.jsonl` with atomic writes and an
  incognito gate; nothing is written while
  `[slm.embedding].enabled = false`.
- **F1-6:** Tests covering: embedder mock + observation record;
  noop embedder produces empty features; incognito skips the
  store entirely.
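
A minimal sketch of the F1-5 store, assuming one JSON object per line and a single `Write` call per record. The `Store` type, its `Incognito` field, and the `demo` helper are hypothetical names. A single append-mode write usually avoids interleaved lines on local filesystems, but that approximates "atomic writes" rather than guaranteeing them.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Store appends observations to a JSONL file.
// Incognito is a hypothetical flag standing in for gnoma's incognito gate.
type Store struct {
	Path      string
	Incognito bool
}

// Append writes rec as one line. In incognito mode it does nothing.
func (s Store) Append(rec any) error {
	if s.Incognito {
		return nil
	}
	line, err := json.Marshal(rec)
	if err != nil {
		return err
	}
	f, err := os.OpenFile(s.Path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
	if err != nil {
		return err
	}
	// Emit the record and its newline in a single Write call.
	if _, err := f.Write(append(line, '\n')); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}

// demo appends two normal records and one incognito record to a temp file,
// then returns how many lines ended up on disk.
func demo() (int, error) {
	dir, err := os.MkdirTemp("", "gnoma-features")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "router-features.jsonl")
	rec := map[string]any{"arm_id": "local-small", "success": true}

	for _, s := range []Store{{Path: path}, {Path: path}, {Path: path, Incognito: true}} {
		if err := s.Append(rec); err != nil {
			return 0, err
		}
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return bytes.Count(data, []byte("\n")), nil
}

func main() {
	n, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("lines written:", n)
}
```

The file mode `0o600` is a choice made for this sketch so that only the owning user can read the store; the plan itself does not specify permissions.
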

---

## Phase 2+ — Bandit policy (sketch only; needs data first)

Spelled out for context. Not for near-term implementation.

### Feature set per the research

```
prompt_embedding       — 384-1024 dim depending on model
token_count            — len of tokenized prompt
language               — ISO code from a small lang-detect
has_code               — fenced-block heuristic
has_error_log          — pattern match for stack traces
needs_tools            — from current heuristic
needs_vision           — from [Image:...] markers
estimated_complexity   — current heuristic score
requested_latency      — turn-budget hint (future)
arm_context_window     — from arm metadata
arm_vram_cost          — from arm metadata
arm_avg_latency        — from quality EMA
arm_success_rate       — from quality EMA
```
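
Three of the cheap, model-free features above (`has_code`, `has_error_log`, `needs_vision`) could be computed along these lines. The regular expression, the `PromptFlags` type, and `extractFlags` are illustrative assumptions; the actual heuristics in gnoma may look quite different.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// fence is three backticks, built with Repeat so this example nests
// cleanly inside a Markdown code block.
var fence = strings.Repeat("`", 3)

// stackTrace matches a few common trace openers (Python, Go, Java-style
// frames). It is deliberately small and will miss many formats.
var stackTrace = regexp.MustCompile(
	`(?m)^(Traceback \(most recent call last\):|panic: |goroutine \d+ \[|\s+at \S+\(.*\)$)`)

// PromptFlags holds boolean features that need no model inference.
type PromptFlags struct {
	HasCode     bool // fenced-block heuristic
	HasErrorLog bool // pattern match for stack traces
	NeedsVision bool // [Image:...] marker present
}

func extractFlags(prompt string) PromptFlags {
	return PromptFlags{
		HasCode:     strings.Contains(prompt, fence),
		HasErrorLog: stackTrace.MatchString(prompt),
		NeedsVision: strings.Contains(prompt, "[Image:"),
	}
}

func main() {
	p := "why does this crash?\npanic: runtime error: index out of range\ngoroutine 1 [running]:"
	fmt.Printf("%+v\n", extractFlags(p))
}
```

Flags like these would be concatenated with the embedding and the arm metadata to form the context vector the bandit sees.
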

### Reward function per the research

```
reward = quality_score
       - latency_penalty
       - vram_penalty
       - failure_penalty
       - escalation_penalty
```

- `quality_score`: 1.0 on success, 0.0 on hard error today; richer
  signal (elf-mediated, user thumbs, tool-call success) once the
  TODO `Bandit selector — design decisions deferred` resolves.
- `latency_penalty`: monotone in observed seconds.
- `vram_penalty`: monotone in declared VRAM cost.
- `failure_penalty`: hard cost on explicit errors (sandbox
  denied, parse failed).
- `escalation_penalty`: cost when a downstream elf had to escalate
  to a heavier arm because this arm failed.
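
One way to realise the reward terms above is with linear penalties, which are the simplest monotone choice. The `Outcome` and `Weights` types, the field names, and every weight value below are assumptions for illustration; the plan does not fix any of them.

```go
package main

import "fmt"

// Outcome holds the raw signals for one routed turn.
type Outcome struct {
	Quality    float64 // 1.0 on success, 0.0 on hard error (today's signal)
	LatencySec float64 // observed wall-clock seconds
	VRAMGiB    float64 // declared VRAM cost of the arm
	Failed     bool    // explicit error such as sandbox denied or parse failed
	Escalated  bool    // a downstream elf had to escalate to a heavier arm
}

// Weights are hypothetical tuning knobs.
type Weights struct {
	Latency, VRAM, Failure, Escalation float64
}

// Reward computes quality minus the four penalties.
func Reward(o Outcome, w Weights) float64 {
	r := o.Quality
	r -= w.Latency * o.LatencySec // monotone in observed seconds
	r -= w.VRAM * o.VRAMGiB       // monotone in declared VRAM cost
	if o.Failed {
		r -= w.Failure
	}
	if o.Escalated {
		r -= w.Escalation
	}
	return r
}

func main() {
	w := Weights{Latency: 0.01, VRAM: 0.005, Failure: 1.0, Escalation: 0.5}

	fast := Outcome{Quality: 1, LatencySec: 2, VRAMGiB: 4}
	slow := Outcome{Quality: 1, LatencySec: 30, VRAMGiB: 4}
	failed := Outcome{Quality: 0, LatencySec: 2, VRAMGiB: 4, Failed: true, Escalated: true}

	fmt.Printf("fast success: %.3f\n", Reward(fast, w))
	fmt.Printf("slow success: %.3f\n", Reward(slow, w))
	fmt.Printf("failed turn:  %.3f\n", Reward(failed, w))
}
```

Unlike the `err == nil` signal, this reward separates two successful arms as soon as their latency or VRAM cost differs, even before a richer `quality_score` exists.
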

### Policy

LinUCB (linear contextual bandit, deterministic exploration
bounded by UCB) or Thompson Sampling (Bayesian, smoother
exploration). LinUCB is the safer starting point: fewer
hyperparameters, well-known behaviour, easier to debug.
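
For orientation, here is a compact sketch of disjoint LinUCB (one linear model per arm), keeping each arm's inverse design matrix current with the Sherman-Morrison rank-one update instead of re-inverting. The two-dimensional toy context, the rewards 0.2 and 0.9, `alpha = 0.5`, and all type and function names are assumptions for illustration, not a proposed gnoma API.

```go
package main

import (
	"fmt"
	"math"
)

// arm holds per-arm LinUCB state.
type arm struct {
	ainv [][]float64 // inverse of A = I + sum(x x^T)
	b    []float64   // sum(reward * x)
}

func newArm(d int) *arm {
	a := &arm{ainv: make([][]float64, d), b: make([]float64, d)}
	for i := range a.ainv {
		a.ainv[i] = make([]float64, d)
		a.ainv[i][i] = 1 // A starts as the identity
	}
	return a
}

func matVec(m [][]float64, x []float64) []float64 {
	out := make([]float64, len(m))
	for i, row := range m {
		for j, v := range row {
			out[i] += v * x[j]
		}
	}
	return out
}

func dot(a, b []float64) float64 {
	s := 0.0
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// ucb returns the estimated reward plus the exploration bonus for context x.
func (a *arm) ucb(x []float64, alpha float64) float64 {
	theta := matVec(a.ainv, a.b)
	return dot(theta, x) + alpha*math.Sqrt(dot(x, matVec(a.ainv, x)))
}

// update applies A += x x^T and b += r*x. A is symmetric, so the
// Sherman-Morrison correction is (A^-1 x)(A^-1 x)^T / (1 + x^T A^-1 x).
func (a *arm) update(x []float64, r float64) {
	ax := matVec(a.ainv, x)
	denom := 1 + dot(x, ax)
	for i := range a.ainv {
		for j := range a.ainv[i] {
			a.ainv[i][j] -= ax[i] * ax[j] / denom
		}
	}
	for i := range a.b {
		a.b[i] += r * x[i]
	}
}

// selectArm picks the arm with the highest upper confidence bound.
func selectArm(arms []*arm, x []float64, alpha float64) int {
	best, bestScore := 0, math.Inf(-1)
	for i, a := range arms {
		if s := a.ucb(x, alpha); s > bestScore {
			best, bestScore = i, s
		}
	}
	return best
}

// simulate runs 50 rounds on a fixed toy context where arm 1 pays more,
// then returns the arm the policy prefers.
func simulate() int {
	arms := []*arm{newArm(2), newArm(2)}
	x := []float64{1, 0.5}
	for i := 0; i < 50; i++ {
		k := selectArm(arms, x, 0.5)
		r := 0.2
		if k == 1 {
			r = 0.9
		}
		arms[k].update(x, r)
	}
	return selectArm(arms, x, 0.5)
}

func main() {
	fmt.Println("preferred arm:", simulate())
}
```

Each arm stores a d×d matrix, so feeding a raw 1024-dimensional embedding into this form costs roughly a million floats per arm; that is worth weighing when the Phase 2 feature vector is designed.
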

---

## Risks

- **Latency.** Embedding inference adds 50–200ms per prompt. Phase
  1's opt-in default means users see no regression; Phase 2's
  "make it default" decision requires latency benchmarks first.
- **Data sparsity for fine-tuning (Phase 4).** ModernBERT
  fine-tuning needs ~10k labelled observations to start being
  useful. Phase 1 might run for months before Phase 4 is viable.
  Plan B: synthesise labels from existing prompt logs + rule-based
  pre-labels.
- **Off-the-shelf embedding quality.** BGE-M3 / Qwen3-Embedding
  weren't trained specifically for routing decisions. Phase 4
  exists precisely to close this gap; Phase 1's data accumulation
  is what makes Phase 4 possible.
- **Architectural complexity.** This plan introduces an entire new
  ML pipeline (embedder → feature store → bandit → reward loop).
  Phase 1 keeps it side-by-side with the existing path; Phase 2's
  "swap" decision is reversible because the existing path stays
  in code.
- **Privacy.** The feature store holds prompt hashes, not raw
  prompts. It is still a local-only file, with the same opt-out
  plumbing as the project registry from the config-migration plan.

---

## Open questions

- **Should the feature store be per-project or global?** Per-project
  is more privacy-respecting (one project's prompts don't influence
  another's routing). Global is more data-efficient (more samples
  → better bandit). Phase 1 chooses global by default; revisit
  during Phase 2.
- **How does this interact with `[router].prefer = local|cloud`?**
  Easy answer: prefer policy stays as a hard tier-shift, applied
  after bandit selection. Bandit picks the best feasible arm; the
  prefer policy is consulted as a final filter / weight.
- **What about CLI-agent subprocess arms?** They proxy to cloud but
  run locally; today's `prefer` treats them as non-local. Bandit
  features should include `is_subprocess` as a distinct feature
  so the policy can learn the user's preferences for those arms
  independent of local/cloud.
- **Cold start.** With no observations, the bandit defaults to
  pure exploration. Should we seed with the existing heuristic
  defaults from `internal/router/defaults.go`? Probably yes:
  warm-start with the curated Strengths as priors.

---

## Rollout

- **Phase 1** ships as v0.5.0 (additive, opt-in, no behaviour
  change by default). Schema-touching so warrants a minor bump.
- **Phase 2** ships when Phase 1 has accumulated enough data
  (~500–1000 observations per user). It is opt-in via
  `[router].bandit_policy = "linucb"` initially, becoming default
  in a later release once it measures better.
- **Phase 3 (deprecation of decoder-SLM classifier)** is a v0.6.x
  conversation, gated on Phase 2 measurably outperforming the
  current path.
- **Phase 4 (ModernBERT fine-tune)** is v0.7+; it requires the
  fine-tuned model artifact distributed via Ollama or HF, plus
  the auto-download story.
- **Phase 5 (FunctionGemma sanity layer)** is independent of all
  of the above; lands when the sibling `tool-router-specialization`
  plan justifies it on did-switch-rate telemetry.

---

## Cross-references

- TODO.md entry "Bandit selector — design decisions deferred":
  the strategic question this plan answers in the long run.
- TODO.md entry "Tool-router specialization (functiongemma)": the
  sibling track; complementary, not competing.
- [`2026-05-23-tool-router-specialization.md`](2026-05-23-tool-router-specialization.md): FunctionGemma fine-tune plan.
- [`2026-05-07-gnoma-roadmap.md`](2026-05-07-gnoma-roadmap.md) §Phase 4: the original "re-evaluate bandit learning" entry.
- 2026-05-25 diagnostic session (this conversation): the trigger.