diff --git a/TODO.md b/TODO.md
index e50ed01..3704a1f 100644
--- a/TODO.md
+++ b/TODO.md
@@ -146,7 +146,10 @@ Active work, newest first.
   decision in #1. Surfaced from the r/coolgithubprojects v0.3.1 launch thread
-  (2026-05-24, `u/Ha_Deal_5079`).
+  (2026-05-24, `u/Ha_Deal_5079`). The encoder + contextual bandit
+  alternative is now sketched in
+  [`docs/superpowers/plans/2026-05-25-encoder-bandit-router.md`](docs/superpowers/plans/2026-05-25-encoder-bandit-router.md) —
+  that plan supersedes #1 above when it ships.
 
 - **Security boundary — egress controls + session audit log.** The
   current `Firewall` is a content boundary only (scans messages and
diff --git a/docs/superpowers/plans/2026-05-23-tool-router-specialization.md b/docs/superpowers/plans/2026-05-23-tool-router-specialization.md
index c205961..a334434 100644
--- a/docs/superpowers/plans/2026-05-23-tool-router-specialization.md
+++ b/docs/superpowers/plans/2026-05-23-tool-router-specialization.md
@@ -1,5 +1,14 @@
 # Tool-Router Specialization (functiongemma) — 2026-05-23
 
+> **Companion plan from 2026-05-25:**
+> [`2026-05-25-encoder-bandit-router.md`](2026-05-25-encoder-bandit-router.md)
+> sketches an alternative architecture (encoder + contextual bandit
+> instead of decoder-SLM-as-classifier). The two are complementary,
+> not competing — FunctionGemma fits as the optional Phase 5 "JSON
+> sanity layer" in that plan. Decide which track to invest in based
+> on the did-switch-rate telemetry (this plan) vs the bandit-data
+> accumulation (companion plan).
+
 Follow-up to [`2026-05-19-post-slm-unlock.md`](2026-05-19-post-slm-unlock.md)
 Phase A, which shipped two-stage tool routing: round 1 sends a single
diff --git a/docs/superpowers/plans/2026-05-25-encoder-bandit-router.md b/docs/superpowers/plans/2026-05-25-encoder-bandit-router.md
new file mode 100644
index 0000000..0b605e8
--- /dev/null
+++ b/docs/superpowers/plans/2026-05-25-encoder-bandit-router.md
@@ -0,0 +1,344 @@
+# Encoder + Contextual-Bandit Router — 2026-05-25
+
+Proposes a long-arc architectural rethink of gnoma's routing layer:
+**replace the decoder-SLM-as-classifier design with an encoder-only
+embedding model feeding a contextual bandit policy**, and treat a
+strict tiny SLM (FunctionGemma-270M-it) as the optional "emit a
+structured route decision" layer rather than the primary classifier.
+
+Surfaced from external research (RouteLLM, ModernBERT, Gemma 3
+270M, Qwen3-Embedding, BGE-M3) brought into the 2026-05-25
+diagnostic session where gnoma's current decoder-SLM classifier
+exhibited a 100% failure rate across two model swaps
+(`reecdev/tiny3.5:1.5b`, `qwen2.5-coder:1.5b`).
+
+This plan is **strategic / multi-month**. Phase 1 below is the only
+piece scoped for near-term implementation; everything else hinges on
+the bandit-vs-SLM strategic decision tracked in the existing
+`Bandit selector — design decisions deferred` TODO entry.
+
+Sibling plans:
+[`2026-05-23-tool-router-specialization.md`](2026-05-23-tool-router-specialization.md)
+already covers the **FunctionGemma fine-tune** track as the
+strict-SLM option; this plan adds the **encoder + bandit** track
+as the alternative (and arguably better-suited) architecture.
+
+---
+
+## Problem
+
+The current router has three coupled problems:
+
+1. **The classifier is a decoder LLM in a job an encoder would do
+   better.** Routing is a classification task with cost/quality
+   trade-offs, not a reasoning task.
+   Asking a decoder model to emit
+   structured JSON for every classify call is high-latency, fragile
+   to chain-of-thought leakage, and nondeterministic.
+
+2. **The bandit can't actually learn quality** because the only
+   success signal is `err == nil` (per `internal/engine/loop.go:118`).
+   EMA scores converge to 1.00 for every arm — see the 2026-05-24
+   `router stats` snapshot where 22 of 25 arm/task pairs sit at
+   exactly 1.00.
+
+3. **The classifier and bandit live in adjacent code but were
+   designed in separate phases**, so the integration point (`Task`
+   built by SLM classifier → fed to `selectBest`) is just data
+   flow, not a learning loop. The SLM's wins/losses don't update
+   the SLM; the bandit's wins/losses don't change which arms the
+   classifier considers.
+
+The 100% SLM-failure incident on 2026-05-25 made (1) urgent. The
+zero-discrimination EMA on 2026-05-24 made (2) urgent. (3) is the
+underlying integration debt.
+
+---
+
+## Non-goals
+
+- **Killing the existing SLM classifier today.** Phase 1 of this
+  plan is purely additive (encoder feature extraction); the existing
+  classifier stays as a baseline until the new path is measurably
+  better.
+- **Reimplementing bandit math.** LinUCB and Thompson Sampling are
+  well-understood. The work is the feature pipeline and reward
+  function, not the policy core.
+- **Choosing a single embedding model permanently.** Phase 1 ships
+  with a default but exposes a `[slm.embedding].model` knob so
+  swapping is config-only.
+- **The strict-SLM track.** FunctionGemma fine-tuning is the sibling
+  `2026-05-23-tool-router-specialization.md` plan; this plan
+  references it but does not duplicate it.
+
+---
+
+## Background — research summary
+
+Citations follow the user-provided research thread (RouteLLM 2024,
+ModernBERT 2024, Google FunctionGemma 2025).
+
+- **RouteLLM** tested router types as a classification problem:
+  similarity routing, matrix factorization, BERT classifier, causal
+  LLM classifier.
+  The BERT classifier was competitive with the
+  causal-LLM classifier at lower cost and latency. Routing is a
+  classification task; treating it like a generation task is paying
+  generation cost for classification value.
+- **ModernBERT** (Dec 2024) is an encoder-only model with 8k context,
+  trained partly on code, designed for fast classification and
+  retrieval. The 'base' size is ~150M parameters, the 'large' size
+  ~400M. Both are tiny compared to even small decoder LLMs.
+- **FunctionGemma-270M-it** (Aug 2025) is Google's small model
+  fine-tuned for natural-language → function-call output. Google's
+  own positioning materials list **query routing** as a use case.
+- **Qwen3-Embedding-0.6B** and **BGE-M3** are strong multilingual
+  embedding models with long-context support; either can serve as
+  feature extractors for downstream classification or bandit
+  policies.
+
+The throughline: **encoder models are the right tool for the
+classification side of routing**; generative SLMs (FunctionGemma)
+are the right tool only when the *output* must be a structured
+decision blob with confidence + tags + fallback. For pure routing,
+encoder features + bandit policy is cheaper, faster, more
+deterministic.
+
+---
+
+## Approach overview
+
+Five phases. Phase 1 is near-term; Phases 2–4 are the actual
+architectural shift; Phase 5 is the long-arc fine-tune.
+
+### Phase 1 — Embedding feature scaffold (near-term, additive)
+
+Add an embedding pipeline that runs alongside the existing
+classifier. Extract features for every prompt; log them to disk
+next to the existing quality-EMA. No routing decision changes yet.
+
+**Why first:** lets us build up a labelled dataset of (prompt,
+features, arm, outcome) tuples without disturbing today's routing
+behaviour. Phase 2 trains against this dataset.
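
As a rough illustration of the (prompt, features, arm, outcome) tuple described above, the Go sketch below pairs the `Embedder` interface and `NoopEmbedder` named in this plan with a hypothetical observation record. The `Observation` struct, its field types, the millisecond unit for `duration`, and the `HashPrompt` and `Observe` helpers are assumptions made for illustration, not code from the gnoma repository; only the field names follow the record layout listed under "What lands".

```go
package main

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"time"
)

// Embedder mirrors the interface signature given in the plan.
type Embedder interface {
	Embed(ctx context.Context, prompt string) ([]float32, error)
}

// NoopEmbedder is the default backend: it returns nil features.
type NoopEmbedder struct{}

func (NoopEmbedder) Embed(context.Context, string) ([]float32, error) { return nil, nil }

// Observation is a hypothetical shape for one router-features.jsonl record.
// Field types and the millisecond duration unit are assumptions.
type Observation struct {
	TS         time.Time `json:"ts"`
	PromptHash string    `json:"prompt_hash"`
	Features   []float32 `json:"features"`
	TaskType   string    `json:"task_type"`
	ArmID      string    `json:"arm_id"`
	Success    bool      `json:"success"`
	Tokens     int       `json:"tokens"`
	DurationMS int64     `json:"duration"`
}

// HashPrompt returns the hex SHA-256 of the prompt so the raw text never
// reaches the feature store.
func HashPrompt(prompt string) string {
	sum := sha256.Sum256([]byte(prompt))
	return hex.EncodeToString(sum[:])
}

// Observe embeds the prompt and builds a record. An embedding failure only
// drops the features; it never blocks the caller, matching task F1-4.
func Observe(ctx context.Context, e Embedder, prompt, taskType, armID string,
	success bool, tokens int, took time.Duration) Observation {
	feats, err := e.Embed(ctx, prompt)
	if err != nil {
		feats = nil
	}
	return Observation{
		TS:         time.Now().UTC(),
		PromptHash: HashPrompt(prompt),
		Features:   feats,
		TaskType:   taskType,
		ArmID:      armID,
		Success:    success,
		Tokens:     tokens,
		DurationMS: took.Milliseconds(),
	}
}

func main() {
	obs := Observe(context.Background(), NoopEmbedder{}, "fix the failing test",
		"debug", "local-small", true, 128, 900*time.Millisecond)
	line, err := json.Marshal(obs)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(line)) // one JSONL record, ready to append
}
```

Keeping the hash and the record construction in one helper would make it easy to gate the whole path behind the incognito check before anything is written.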
+
+### Phase 2 — Contextual bandit over the feature set
+
+Once Phase 1 has ~500–1000 labelled observations, swap `selectBest`
+from heuristic quality + EMA score to a LinUCB-style contextual
+bandit that takes the embedding features + the existing arm metadata
+(MaxComplexity, CostWeight, Strengths). The existing EMA quality
+score becomes one feature among many.
+
+### Phase 3 — Retire the decoder-SLM classifier
+
+When Phase 2 routing is measurably better than today's heuristic +
+EMA blend, the decoder-SLM classifier (currently producing 0
+useful classifications on the user's setup) is no longer
+load-bearing. Deprecate it; keep the same `[slm]` config knobs for
+backwards compatibility but route them to a different runtime path.
+
+### Phase 4 — ModernBERT fine-tune
+
+The off-the-shelf embedding model from Phase 1 (BGE-M3 or
+Qwen3-Embedding-0.6B by default) gives general-purpose embeddings.
+Phase 4 fine-tunes a router-specific classification head on top of
+ModernBERT-base using the labelled dataset accumulated since
+Phase 1. Pure performance win; falls back gracefully to
+off-the-shelf embeddings if the fine-tune isn't loaded.
+
+### Phase 5 — FunctionGemma JSON sanity layer (optional)
+
+For users who want a structured route decision (arm + confidence +
+fallback) alongside or instead of the bandit output, plug in
+FunctionGemma-270M-it (fine-tuned per the
+`tool-router-specialization` plan) as a final-stage decision blob
+emitter. Sits *after* the encoder + bandit, not in front of them.
+
+---
+
+## Phase 1 — Embedding feature scaffold (detailed)
+
+This is the only phase scoped for near-term implementation. The
+others depend on Phase 1's data accumulation.
+
+### What lands
+
+- New package `internal/router/features` with:
+  - `Embedder` interface: `Embed(ctx, prompt string) ([]float32, error)`.
+  - Implementations: `OllamaEmbedder`, `BGE3Embedder`, `NoopEmbedder`
+    (default; returns nil features when no embedding model is
+    configured).
+- New config `[slm.embedding]` section:
+  ```toml
+  [slm.embedding]
+  enabled = false                  # default off; opt-in
+  backend = "ollama"               # ollama | bge-m3 | noop
+  model = "qwen3-embedding:0.6b"   # ollama model tag
+  base_url = ""                    # backend endpoint override
+  ```
+- Feature extraction hook in `internal/engine/loop.go`: after the
+  classifier runs but before `selectBest`, compute the embedding
+  for the prompt and attach to the routing `Task` as an opaque
+  `Features []float32` field.
+- New on-disk store at `~/.config/gnoma/router-features.jsonl`,
+  one record per observation: `{ts, prompt_hash, features,
+  task_type, arm_id, success, tokens, duration}`.
+  - `prompt_hash` is a SHA-256 of the prompt — never the prompt
+    itself — to keep the file local-only-but-not-secret-laden.
+  - Append-only, atomic-write, incognito-gated, same discipline as
+    the firewall audit log.
+- No selector change. `selectBest` continues to use today's
+  heuristic + EMA blend. Phase 1 just observes.
+
+### Why off by default
+
+Embedding inference adds 50–200ms per prompt depending on backend
+and model size. That latency is fine for ollama users running on
+a workstation, painful for users on slower setups. Opt-in keeps
+the regression risk at zero.
+
+### Phase 1 task list
+
+- **F1-1:** Define the `Embedder` interface and `NoopEmbedder` in
+  `internal/router/features/`.
+- **F1-2:** `OllamaEmbedder` wraps `provider/openaicompat` with the
+  ollama embedding endpoint (`/api/embeddings`).
+- **F1-3:** Add the `[slm.embedding]` config section to
+  `internal/config/config.go` with the same defaults-via-zero
+  discipline as the rest of the config.
+- **F1-4:** Wire the embedder into `loop.go` between classifier and
+  selector. Failures log at Debug and don't block routing.
+- **F1-5:** Append-only feature store in
+  `~/.config/gnoma/router-features.jsonl` with atomic writes,
+  incognito gate, opt-out via `[slm.embedding].enabled = false`.
+- **F1-6:** Tests covering: embedder mock + observation record;
+  noop embedder produces empty features; incognito skips the
+  store entirely.
+
+---
+
+## Phase 2+ — Bandit policy (sketch only; needs data first)
+
+Spelled out for context. Not for near-term implementation.
+
+### Feature set per the research
+
+```
+prompt_embedding     — 384-1024 dim depending on model
+token_count          — len of tokenized prompt
+language             — ISO code from a small lang-detect
+has_code             — fenced-block heuristic
+has_error_log        — pattern match for stack traces
+needs_tools          — from current heuristic
+needs_vision         — from [Image:...] markers
+estimated_complexity — current heuristic score
+requested_latency    — turn-budget hint (future)
+arm_context_window   — from arm metadata
+arm_vram_cost        — from arm metadata
+arm_avg_latency      — from quality EMA
+arm_success_rate     — from quality EMA
+```
+
+### Reward function per the research
+
+```
+reward = quality_score
+       - latency_penalty
+       - vram_penalty
+       - failure_penalty
+       - escalation_penalty
+```
+
+- `quality_score`: 1.0 on success, 0.0 on hard error today; richer
+  signal (elf-mediated, user thumbs, tool-call success) once the
+  TODO `Bandit selector — design decisions deferred` resolves.
+- `latency_penalty`: monotone in observed seconds.
+- `vram_penalty`: monotone in declared VRAM cost.
+- `failure_penalty`: hard cost on explicit errors (sandbox
+  denied, parse failed).
+- `escalation_penalty`: cost when a downstream elf had to escalate
+  to a heavier arm because this arm failed.
+
+### Policy
+
+LinUCB (linear contextual bandit, deterministic exploration
+bounded by UCB) or Thompson Sampling (Bayesian, smoother
+exploration). LinUCB is the safer starting point — fewer
+hyperparameters, well-known behaviour, easier to debug.
+
+---
+
+## Risks
+
+- **Latency.** Embedding inference adds 50–200ms per prompt. Phase
+  1's opt-in default means users see no regression; Phase 2's
+  "make it default" decision requires latency benchmarks first.
+- **Data sparsity for fine-tuning (Phase 4).** ModernBERT
+  fine-tuning needs ~10k labelled observations to start being
+  useful. Phase 1 might run for months before Phase 4 is viable.
+  Plan B: synthesise labels from existing prompt logs + rule-based
+  pre-labels.
+- **Off-the-shelf embedding quality.** BGE-M3 / Qwen3-Embedding
+  weren't trained specifically for routing decisions. Phase 4
+  exists precisely to close this gap; Phase 1's data accumulation
+  is what makes Phase 4 possible.
+- **Architectural complexity.** This plan introduces an entire new
+  ML pipeline (embedder → feature store → bandit → reward loop).
+  Phase 1 keeps it side-by-side with the existing path; Phase 2's
+  "swap" decision is reversible because the existing path stays
+  in code.
+- **Privacy.** Prompt hashes (not raw prompts) in the feature
+  store. Still a local-only file; same opt-out plumbing as the
+  project registry from the config-migration plan.
+
+---
+
+## Open questions
+
+- **Should the feature store be per-project or global?** Per-project
+  is more privacy-respecting (one project's prompts don't influence
+  another's routing). Global is more data-efficient (more samples
+  → better bandit). Phase 1 chooses global by default; revisit
+  during Phase 2.
+- **How does this interact with `[router].prefer = local|cloud`?**
+  Easy answer: prefer policy stays as a hard tier-shift, applied
+  after bandit selection. Bandit picks the best feasible arm; the
+  prefer policy is consulted as a final filter / weight.
+- **What about CLI-agent subprocess arms?** They proxy to cloud but
+  run locally; today's `prefer` treats them as non-local. Bandit
+  features should include `is_subprocess` as a distinct feature
+  so the policy can learn the user's preferences for those arms
+  independent of local/cloud.
+- **Cold start.** With no observations, the bandit defaults to
+  pure exploration. Should we seed with the existing heuristic
+  defaults from `internal/router/defaults.go`?
+  Probably yes — warm-start with the curated Strengths as priors.
+
+---
+
+## Rollout
+
+- **Phase 1** ships as v0.5.0 (additive, opt-in, no behaviour
+  change by default). Schema-touching so warrants a minor bump.
+- **Phase 2** ships when Phase 1 has accumulated enough data
+  (~500–1000 observations per user) — opt-in via
+  `[router].bandit_policy = "linucb"` initially, becoming default
+  in a later release once measured better.
+- **Phase 3 (deprecation of decoder-SLM classifier)** is a v0.6.x
+  conversation, gated on Phase 2 measurably outperforming.
+- **Phase 4 (ModernBERT fine-tune)** is v0.7+ — requires the
+  fine-tuned model artifact distributed via Ollama or HF, plus
+  the auto-download story.
+- **Phase 5 (FunctionGemma sanity layer)** is independent of all
+  of the above; lands when the sibling `tool-router-specialization`
+  plan justifies it on did-switch-rate telemetry.
+
+---
+
+## Cross-references
+
+- TODO.md entry "Bandit selector — design decisions deferred" —
+  the strategic question this plan answers in the long run.
+- TODO.md entry "Tool-router specialization (functiongemma)" — the
+  sibling track; complementary, not competing.
+- [`2026-05-23-tool-router-specialization.md`](2026-05-23-tool-router-specialization.md) — FunctionGemma fine-tune plan.
+- [`2026-05-07-gnoma-roadmap.md`](2026-05-07-gnoma-roadmap.md) §Phase 4 — the original "re-evaluate bandit learning" entry.
+- 2026-05-25 diagnostic session (this conversation) — the trigger.
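
To make the "Phase 2+ — Bandit policy" sketch more concrete, here is a minimal per-arm (disjoint) LinUCB in Go together with the additive reward from that section. Everything in it is an assumption for illustration: the penalty weights in `Reward`, the `alpha` exploration parameter, the identity matrix as a ridge prior, and the type and function names are hypothetical and not taken from gnoma. It keeps the inverse design matrix up to date with a Sherman-Morrison rank-1 update, which is one common way to avoid inverting a matrix on every decision, not necessarily what the plan would adopt.

```go
package main

import (
	"fmt"
	"math"
)

// Reward combines the terms listed in the plan. The weights are placeholders.
func Reward(quality, latencySec, vramGB float64, failed, escalated bool) float64 {
	r := quality - 0.05*latencySec - 0.01*vramGB
	if failed {
		r -= 1.0 // failure_penalty
	}
	if escalated {
		r -= 0.5 // escalation_penalty
	}
	return r
}

// Arm holds per-arm LinUCB state: Ainv = (I + sum of x*x^T)^-1, B = sum of r*x.
type Arm struct {
	Ainv [][]float64
	B    []float64
}

// NewArm starts from the identity matrix, which acts as a ridge prior.
func NewArm(d int) *Arm {
	a := &Arm{Ainv: make([][]float64, d), B: make([]float64, d)}
	for i := range a.Ainv {
		a.Ainv[i] = make([]float64, d)
		a.Ainv[i][i] = 1
	}
	return a
}

func matVec(m [][]float64, x []float64) []float64 {
	out := make([]float64, len(m))
	for i, row := range m {
		for j, v := range row {
			out[i] += v * x[j]
		}
	}
	return out
}

func dot(a, b []float64) float64 {
	s := 0.0
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// Score is the upper confidence bound: theta^T x + alpha*sqrt(x^T Ainv x).
func (a *Arm) Score(x []float64, alpha float64) float64 {
	theta := matVec(a.Ainv, a.B)
	return dot(theta, x) + alpha*math.Sqrt(dot(x, matVec(a.Ainv, x)))
}

// Update folds in one observation via Sherman-Morrison:
// (A + x x^T)^-1 = Ainv - (Ainv x)(Ainv x)^T / (1 + x^T Ainv x).
func (a *Arm) Update(x []float64, r float64) {
	u := matVec(a.Ainv, x) // Ainv is symmetric, so x^T Ainv equals u^T
	denom := 1 + dot(x, u)
	for i := range a.Ainv {
		for j := range a.Ainv[i] {
			a.Ainv[i][j] -= u[i] * u[j] / denom
		}
	}
	for i := range a.B {
		a.B[i] += r * x[i]
	}
}

// Select returns the index of the arm with the highest score for context x.
func Select(arms []*Arm, x []float64, alpha float64) int {
	best, bestScore := 0, math.Inf(-1)
	for i, a := range arms {
		if s := a.Score(x, alpha); s > bestScore {
			best, bestScore = i, s
		}
	}
	return best
}

func main() {
	arms := []*Arm{NewArm(2), NewArm(2)}
	x := []float64{1, 0.5} // toy context; real features would be the embedding
	for i := 0; i < 20; i++ {
		arms[0].Update(x, Reward(1, 0.2, 0, false, false)) // arm 0 keeps succeeding
		arms[1].Update(x, Reward(0, 0.2, 0, true, false))  // arm 1 keeps failing
	}
	fmt.Println("selected arm:", Select(arms, x, 0.1))
}
```

With full embeddings of 384 to 1024 dimensions the per-arm matrix grows quadratically in the feature count, so a real implementation would probably project or subsample the embedding before feeding it to the policy.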