Compare commits: v0.3.1-rc2...v0.3.4 (17 commits)
| Author | SHA1 | Date | |
|---|---|---|---|
| | 7213a1e2fd | | |
| | fd327107df | | |
| | 0d3d190a8b | | |
| | c065a2dea7 | | |
| | 24945b1eb2 | | |
| | c0c2e4bff5 | | |
| | f3c70bd802 | | |
| | fa65a68728 | | |
| | 8b9bdc2978 | | |
| | eea26a262e | | |
| | 352cab4a94 | | |
| | 58f4001917 | | |
| | 6c5e969217 | | |
| | 74bd570438 | | |
| | d38d7daf25 | | |
| | 06d4069076 | | |
| | f641bd4971 | | |
@@ -61,3 +61,8 @@ jobs:
           args: release --clean
         env:
           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          # Force GoReleaser to use the triggering tag rather than fall
+          # back to `git describe` — which can resolve to an older tag
+          # (e.g., a vX.Y.Z-rc tag) when multiple tags point at the same
+          # commit. Surfaced as the v0.3.1 release failure on 2026-05-24.
+          GORELEASER_CURRENT_TAG: ${{ github.ref_name }}

@@ -364,9 +364,12 @@ gnoma can run a tiny local model alongside the main provider to:

 ```toml
 [slm]
-enabled = true
-backend = "auto" # ollama | llamacpp | llamafile | openaicompat | auto | disabled
-model   = "reecdev/tiny3.5:500m"
+enabled          = true
+backend          = "auto"   # ollama | llamacpp | llamafile | openaicompat | auto | disabled
+model            = "qwen3:0.6b"
+register_as_arm  = true     # default; set to false to make the SLM classifier-only
+                            # (e.g. for FunctionGemma, code-completion-tuned models)
+classify_timeout = "15s"    # default; bump higher for slow cold-loads
 ```

 Setup, presets, and verification: [docs/slm-backends.md](docs/slm-backends.md).

@@ -4,6 +4,153 @@ Active work, newest first.

 ## In flight

+- **Config write/merge — silent corruption of layered configs.**
+  `internal/config/write.go:setConfig` reads the existing TOML into a
+  zero-valued `Config` struct, sets one field, and writes the entire
+  struct back out — so every untouched field gets serialized at its
+  Go zero value (empty strings, zero ints, `false` bools). On the
+  next load, those explicit zeros overwrite higher-priority layers
+  via `toml.Decode`'s "present field beats absent field" semantics.
+
+  Concrete symptom (2026-05-24): user's `~/.config/gnoma/config.toml`
+  had `[router].prefer = "cloud"` but the project-level
+  `.gnoma/config.toml` had `prefer = ""` (generated by an earlier
+  `gnoma config set ...` call), which silently downgraded the
+  effective policy to `auto` — visible only via the new `/router`
+  TUI command, with no warning.
+
+  Same root cause is responsible for the zero-spammed global config
+  the same user has (`max_tokens = 0`, `permission.mode = ""`,
+  `bash_timeout = 0`, etc.) — all overwriting sensible defaults.
+
+  **Fix surface (multi-part, plan-worthy):**
+
+  1. **Stop generating zero-spam.** Two options:
+     - Tag struct fields with `,omitempty` so the BurntSushi encoder
+       skips zero values. Caveat: conflates "unset" with "explicitly
+       zero" for primitive types (a user who wants `max_keep = 0`
+       loses it). Safe for strings/maps/slices where empty is never
+       user-intent; lossy for numeric fields.
+     - Switch to `pelletier/go-toml/v2` and use its document model
+       to edit only the targeted key, preserving everything else
+       byte-for-byte. Cleaner semantics, bigger refactor.
+     - Hybrid: omitempty on string/map/slice fields, document-level
+       edit for numerics. Fastest path that doesn't lose intent.
+
+  2. **`gnoma doctor` — read-only diagnostic.** Scans both global
+     and project configs and reports:
+     - Zero-spam fields that would silently shadow defaults or
+       upstream layers.
+     - Invalid enum values (e.g. `permission.mode = ""`).
+     - Unknown / removed keys from older schema versions.
+     - Effective-merged values (so the user sees what gnoma will
+       actually use after layering). No writes. Exits non-zero on
+       findings so it's CI-friendly.
+
+  3. **`gnoma upgrade-config` — active migration.** For each config
+     file (global, profiles, project):
+     - Compute the cleaned form (only fields the user actually set,
+       dropping zeros that match defaults).
+     - Write the original to `<path>.bak` with timestamp suffix.
+     - Write the cleaned form to the original path.
+     - Print a diff of what changed so the user can verify.
+
+  4. **Project-level auto-migration on startup.** If gnoma detects
+     a zero-spammed project `.gnoma/config.toml` at launch:
+     - Auto-run the upgrade (project-only, never auto-touch the
+       global config).
+     - Write `.gnoma/config.toml.bak-YYYY-MM-DD-HHMMSS`.
+     - Surface a one-line notice in the startup safety banner:
+       `config: migrated .gnoma/config.toml (see .bak)`.
+     - The auto-migration is non-destructive (`.bak` preserves
+       original) but still gated behind a `[config].auto_migrate`
+       toggle, defaulting to `true`. Global configs require
+       explicit `gnoma upgrade-config`.
+
+  5. **Project registry** (`~/.config/gnoma/projects.json`). Today
+     there is no record of which directories gnoma has been launched
+     in — items #2 and #3 can work with a filesystem scan
+     (`find ~ -type d -name .gnoma`), but a registry makes them
+     significantly faster and unlocks cross-project features.
+     Sketch:
+
+     ```json
+     {
+       "projects": [
+         {
+           "path": "/home/.../my-repo",
+           "first_seen": "2026-04-15T10:30:00Z",
+           "last_seen": "2026-05-24T19:23:00Z",
+           "session_count": 47
+         }
+       ]
+     }
+     ```
+
+     Update on every successful startup (record project root,
+     bump `last_seen` + increment `session_count`). Enables:
+     - Fast `gnoma doctor --all-projects` without a filesystem walk.
+     - Cross-project session listing (`gnoma sessions --all`
+       picker; surface most-recent sessions across the registry).
+     - `gnoma upgrade-config` that can migrate every known project
+       in one invocation.
+     - Future local-only aggregate stats (`gnoma stats`) — still
+       no-phone-home, just a sum across the registry.
+
+     **Caveats and design constraints:**
+     - The registry file becomes another silent-corruption surface
+       — must use the same `omitempty` / atomic-write discipline
+       as the encoder fix in #1, or it'll exhibit the same class
+       of bug.
+     - Stale entries (deleted projects). `gnoma doctor` should
+       detect and offer to prune; do not auto-delete.
+     - Privacy: this is literally a log of directories the user
+       has worked in. Local-only, never sent off-machine (per the
+       no-phone-home positioning), but worth a one-line note in
+       the Security section of the README so users know it exists.
+     - Opt-out: `[config].project_registry = false` for users who
+       don't want this tracked. Default `true`.
+     - Atomic writes (temp file + rename) so a crash mid-write
+       doesn't corrupt the file.
+
+  Surfaced from the v0.3.1 launch wave (2026-05-24).
+  Plan:
+  [`docs/superpowers/plans/2026-05-24-config-migration.md`](docs/superpowers/plans/2026-05-24-config-migration.md).
+
+- **Bandit selector — design decisions deferred.** The current
+  selector (`internal/router/selector.go:scoreArm`) is greedy
+  quality-weighted: per-(arm × task-type) EMA scores blended 70/30
+  with heuristic defaults, divided by CostWeight-adjusted cost. It
+  is **not** a true multi-armed bandit — no UCB-style exploration
+  bonus, no Thompson sampling. Tracked as a design question rather
+  than a must-implement item because of two open dependencies:
+
+  1. **Whether to keep numeric EMA at all.** The 2026-05-07 roadmap
+     (Phase 4) puts re-evaluating bandit learning on hold until the
+     SLM-driven dispatcher is in production. Three options on the
+     table: keep bandit as feedback for the SLM, retire EMA in
+     favour of qualitative outcome summaries fed to the SLM, or
+     split responsibilities (SLM = intent routing, bandit =
+     cost/quality within a tier). See
+     [`docs/superpowers/plans/2026-05-07-gnoma-roadmap.md`](docs/superpowers/plans/2026-05-07-gnoma-roadmap.md)
+     §Phase 4.
+
+  2. **User-tunable selector knobs.** Several constants are
+     hardcoded today: `qualityAlpha` (EMA smoothing, ~3-sample
+     memory), the 70/30 observed/heuristic blend,
+     `strengthScoreBonus` for tagged task types, and the
+     `DefaultThresholds.Minimum` quality floor. Surfacing these as
+     `[router.bandit]` config keys would let users tune for their
+     workloads (faster alpha for shifting model performance, longer
+     memory for stable fleets) without waiting for the strategic
+     decision in #1.
+
+  Surfaced from the r/coolgithubprojects v0.3.1 launch thread
+  (2026-05-24, `u/Ha_Deal_5079`). The encoder + contextual bandit
+  alternative is now sketched in
+  [`docs/superpowers/plans/2026-05-25-encoder-bandit-router.md`](docs/superpowers/plans/2026-05-25-encoder-bandit-router.md) —
+  that plan supersedes #1 above when it ships.
+
 - **Security boundary — egress controls + session audit log.** The
   current `Firewall` is a content boundary only (scans messages and
   tool results for secrets via regex + Shannon entropy, redacts or
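The bandit selector entry in this hunk describes a greedy score built from an EMA, a 70/30 blend with a heuristic prior, and a cost divisor. A minimal Go sketch of that arithmetic follows. All function names are hypothetical, and the cost term `1 + costWeight*cost` is an assumption made for illustration; the diff does not show how `scoreArm` actually adjusts for cost.

```go
package main

import "fmt"

// updateEMA folds one observed quality sample into the running average.
// An alpha near 1 forgets quickly; an alpha near 0 remembers longer.
func updateEMA(prev, sample, alpha float64) float64 {
	return alpha*sample + (1-alpha)*prev
}

// blend mixes the observed EMA with a heuristic prior. The TODO entry
// describes a 70/30 split, which corresponds to observedWeight = 0.7.
func blend(observed, heuristic, observedWeight float64) float64 {
	return observedWeight*observed + (1-observedWeight)*heuristic
}

// score divides blended quality by a cost term. The divisor used here is
// an assumption for illustration, not gnoma's real formula.
func score(quality, cost, costWeight float64) float64 {
	return quality / (1 + costWeight*cost)
}

func main() {
	ema := 0.5
	for _, sample := range []float64{1.0, 1.0, 0.0} {
		ema = updateEMA(ema, sample, 0.5)
	}
	q := blend(ema, 0.6, 0.7)
	fmt.Printf("ema=%.4f blended=%.4f score=%.4f\n", ema, q, score(q, 2.0, 0.5))
}
```

Because the selector always takes the highest score, nothing in this arithmetic explores under-sampled arms, which is the gap the entry points out when it says the selector is not a true multi-armed bandit.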
@@ -57,7 +204,8 @@ Active work, newest first.
   warning when the content matches sensitive heuristics, a
   consent-gated review step, and consistent treatment across the
   three paths. Cross-cuts with Phase F entropy work and the
-  outgoing-scan firewall.
+  outgoing-scan firewall. Plan:
+  [`docs/superpowers/plans/2026-05-24-sensitive-content-policy.md`](docs/superpowers/plans/2026-05-24-sensitive-content-policy.md).
 - **Distribution — follow-ups.** v0.1.0 shipped (archives on
   github.com/VikingOwl91/gnoma/releases, multi-arch images on
   ghcr.io/vikingowl91/gnoma). Still optional: Homebrew tap,
@@ -76,7 +224,13 @@ Active work, newest first.
 - **Structured output** with JSON schema validation — M12.
 - **Native agy JSON output** — switch the subprocess provider to
   `--output-format stream-json` once the agy CLI supports it,
-  replacing the current prompt-augmentation fallback.
+  replacing the current prompt-augmentation fallback. Until then,
+  agy's `ToolUse` capability is set to `false` (see
+  `internal/provider/subprocess/agent.go` agy entry) — without
+  structured tool-call output, the router would otherwise dispatch
+  tool-needing tasks to agy and the turn would hang on prose
+  hallucinations of tool calls. Flip the capability back to `true`
+  in the same change that lands stream-json parsing.
 - **SQLite session persistence** + serve mode — M10.
 - **Task learning** (pattern recognition, persistent tasks) — M11.
 - **Web UI** (`gnoma web`) — M15.

Changes: +49 -11
@@ -180,7 +180,7 @@ func main() {
 	case "slm":
 		os.Exit(runSLMCommand(cliArgs[1:], cfg, logger))
 	case "router":
-		os.Exit(runRouterCommand(cliArgs[1:], profile))
+		os.Exit(runRouterCommand(cliArgs[1:], cfg, profile))
 	case "profile":
 		os.Exit(runProfileCommand(cliArgs[1:], cfg, profile))
 	}
@@ -397,7 +397,17 @@ func main() {

 	// Create router and register the provider as a single arm
 	// (M4 foundation: one provider from CLI. Multi-provider routing comes with config.)
-	rtr := router.New(router.Config{Logger: logger})
+	// BanditParams come from [router.bandit] config keys; zero values
+	// resolve to built-in defaults inside the router package.
+	rtr := router.New(router.Config{
+		Logger: logger,
+		Bandit: router.BanditParams{
+			QualityAlpha:    cfg.Router.Bandit.QualityAlpha,
+			MinObservations: cfg.Router.Bandit.MinObservations,
+			ObservedWeight:  cfg.Router.Bandit.ObservedWeight,
+			StrengthBonus:   cfg.Router.Bandit.StrengthBonus,
+		},
+	})

 	// Apply the prefer-routing-policy from config (default: auto).
 	// Invalid values are rejected here with an actionable error rather
@@ -672,6 +682,17 @@ func main() {
 	store := persist.New(sessionID, fw.Incognito())
 	logger.Debug("session store initialized", "dir", store.Dir())

+	// Per-session firewall audit log: append-only JSONL at
+	// <projectRoot>/.gnoma/sessions/<sessionID>/audit.jsonl. Honours
+	// incognito (writes skipped when active) and tolerates fs errors —
+	// scan pipeline never depends on the audit succeeding.
+	auditPath := filepath.Join(gnomacfg.ProjectRoot(), ".gnoma", "sessions", sessionID, "audit.jsonl")
+	fw.SetAudit(security.NewAuditLogger(security.AuditLoggerConfig{
+		Path:      auditPath,
+		Incognito: fw.Incognito(),
+		Logger:    logger,
+	}))
+
 	// Create elf manager and register agent tools.
 	// Must be created after fw and permChecker so elfs inherit security layers.
 	elfMgr := elf.NewManager(elf.ManagerConfig{
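The comment added in this hunk describes the audit log as append-only JSONL that skips writes in incognito mode and tolerates filesystem errors. The sketch below shows one way those three properties could be realised with the standard library. The `auditLogger` type and `countAuditLines` helper are hypothetical stand-ins; the real `security.NewAuditLogger` is not shown in the diff and may work differently.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// auditLogger is a hypothetical, simplified stand-in for the logger the
// diff constructs via security.NewAuditLogger.
type auditLogger struct {
	path      string
	incognito bool
}

// Log appends one JSON object per line. Incognito sessions write nothing.
// Callers are expected to log and ignore the returned error so that
// scanning never depends on the audit succeeding.
func (a *auditLogger) Log(event map[string]any) error {
	if a == nil || a.incognito {
		return nil
	}
	if err := os.MkdirAll(filepath.Dir(a.path), 0o700); err != nil {
		return err
	}
	f, err := os.OpenFile(a.path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()
	line, err := json.Marshal(event)
	if err != nil {
		return err
	}
	_, err = f.Write(append(line, '\n'))
	return err
}

// countAuditLines logs two events into a temporary session directory and
// reports how many lines ended up on disk.
func countAuditLines(incognito bool) (int, error) {
	root, err := os.MkdirTemp("", "audit-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(root)
	a := &auditLogger{
		path:      filepath.Join(root, ".gnoma", "sessions", "demo", "audit.jsonl"),
		incognito: incognito,
	}
	for _, action := range []string{"redact", "allow"} {
		if err := a.Log(map[string]any{"action": action}); err != nil {
			return 0, err
		}
	}
	f, err := os.Open(a.path)
	if os.IsNotExist(err) {
		return 0, nil // incognito: the file was never created
	}
	if err != nil {
		return 0, err
	}
	defer f.Close()
	n := 0
	for sc := bufio.NewScanner(f); sc.Scan(); {
		n++
	}
	return n, nil
}

func main() {
	normal, _ := countAuditLines(false)
	private, _ := countAuditLines(true)
	fmt.Println(normal, private) // prints "2 0"
}
```

Opening the file with `os.O_APPEND` on every write keeps the sketch simple; a long-lived handle would save syscalls at the cost of more lifecycle code.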
@@ -860,21 +881,38 @@ func main() {
 		// transport and as a router arm. Both paths route through the
 		// firewall after fwRef.Set fires above.
 		slmProvider := security.WrapProvider(boot.Provider, fwRef)
-		lazy.set(slm.NewClassifier(slmProvider, boot.Model, logger))
+		lazy.set(slm.NewClassifier(slmProvider, boot.Model, time.Duration(cfg.SLM.ClassifyTimeout), logger))
 		// ToolUse comes from the live probe of the actual model. For
 		// completion-only models (e.g. TinyLlama), the SLM arm only
 		// handles knowledge-only prompts where the trivial-prompt
 		// heuristic flipped RequiresTools=false. For tool-capable
 		// models, the SLM also covers simple file reads etc., gated
 		// by MaxComplexity=0.3.
-		rtr.RegisterArm(&router.Arm{
-			ID:            router.ArmID("slm/" + string(boot.Backend)),
-			Provider:      slmProvider,
-			ModelName:     boot.Model,
-			IsLocal:       true,
-			MaxComplexity: 0.3,
-			Capabilities:  provider.Capabilities{ToolUse: boot.ToolSupport},
-		})
+		//
+		// [slm].register_as_arm gates the dual-role registration.
+		// Default (nil) is true to preserve pre-config behaviour.
+		// Explicit false makes the SLM classifier-only, which is
+		// the correct setting for task-specialised models
+		// (FunctionGemma, code-completion-tuned models, etc.) that
+		// would mishandle a general prompt routed to them as the
+		// answer-producing arm.
+		registerAsArm := true
+		if cfg.SLM.RegisterAsArm != nil {
+			registerAsArm = *cfg.SLM.RegisterAsArm
+		}
+		if registerAsArm {
+			rtr.RegisterArm(&router.Arm{
+				ID:            router.ArmID("slm/" + string(boot.Backend)),
+				Provider:      slmProvider,
+				ModelName:     boot.Model,
+				IsLocal:       true,
+				MaxComplexity: 0.3,
+				Capabilities:  provider.Capabilities{ToolUse: boot.ToolSupport},
+			})
+		} else {
+			logger.Info("SLM registered as classifier only ([slm].register_as_arm=false)",
+				"model", boot.Model)
+		}
 		slmCleanup = boot.Close
 		slmInfo.Active = true
 		slmInfo.Backend = string(boot.Backend)

Changes: +31 -8
@@ -12,7 +12,7 @@ import (
 )

 // runRouterCommand handles `gnoma router <subcommand>`. Returns an exit code.
-func runRouterCommand(args []string, profile gnomacfg.Profile) int {
+func runRouterCommand(args []string, cfg *gnomacfg.Config, profile gnomacfg.Profile) int {
 	if len(args) == 0 {
 		fmt.Fprintln(os.Stderr, "usage: gnoma router <command>")
 		fmt.Fprintln(os.Stderr, "commands:")
@@ -21,14 +21,14 @@ func runRouterCommand(args []string, profile gnomacfg.Profile) int {
 	}
 	switch args[0] {
 	case "stats":
-		return runRouterStats(profile)
+		return runRouterStats(cfg, profile)
 	default:
 		fmt.Fprintf(os.Stderr, "unknown router command: %s\n", args[0])
 		return 1
 	}
 }

-func runRouterStats(profile gnomacfg.Profile) int {
+func runRouterStats(cfg *gnomacfg.Config, profile gnomacfg.Profile) int {
 	path := profile.QualityFile(gnomacfg.GlobalConfigDir())
 	data, err := os.ReadFile(path)
 	if err != nil {
@@ -52,7 +52,7 @@ func runRouterStats(profile gnomacfg.Profile) int {
 	}
 	printArmTable(snap)
 	fmt.Println()
-	printClassifierTable(snap)
+	printClassifierTable(snap, cfg)
 	return 0
 }

@@ -86,7 +86,7 @@ func printArmTable(snap router.QualitySnapshot) {
 	_ = tw.Flush()
 }

-func printClassifierTable(snap router.QualitySnapshot) {
+func printClassifierTable(snap router.QualitySnapshot, cfg *gnomacfg.Config) {
 	fmt.Println("Classifier source breakdown:")
 	counts := snap.ClassifierCounts
 	if len(counts) == 0 {
@@ -125,16 +125,39 @@ func printClassifierTable(snap router.QualitySnapshot) {
 	_ = tw.Flush()
 	fmt.Printf(" total observations: %d\n", total)

-	// Phase-4 trust hint.
+	// Effective heuristic share: both pure heuristic and slm_fallback
+	// observations were routed via the HeuristicClassifier — the only
+	// difference is whether the SLM was attempted first. Surfacing the
+	// combined share answers "how often did the SLM actually drive
+	// routing?" honestly.
+	effectiveHeuristic := counts["heuristic"] + counts["slm_fallback"]
+	if total > 0 {
+		fmt.Printf(" effective heuristic share: %.1f%% (%d fallbacks + %d pure heuristic)\n",
+			float64(effectiveHeuristic)/float64(total)*100,
+			counts["slm_fallback"], counts["heuristic"])
+	}
+
+	// Phase-4 trust hint. Distinguishes the three diagnostic cases —
+	// SLM never called, SLM called but every call failed, SLM working
+	// but minority share — and templates the actionable advice off
+	// the configured backend so the hint doesn't mention llamafile
+	// when the user is on ollama (or vice versa).
 	slmShare := 0.0
 	if total > 0 {
 		slmShare = float64(counts["slm"]) / float64(total) * 100
 	}
+	backend := "the SLM"
+	if cfg != nil && cfg.SLM.Backend != "" {
+		backend = cfg.SLM.Backend
+	}
 	switch {
 	case total < 50:
 		fmt.Println(" hint: < 50 observations — too sparse for Phase 4 trust signal yet.")
-	case counts["slm"] == 0:
-		fmt.Println(" hint: SLM has never classified — check that llamafile boots before short-lived runs end.")
+	case counts["slm"] == 0 && counts["slm_fallback"] == 0:
+		fmt.Printf(" hint: SLM never called — check [slm].enabled and that %s is reachable.\n", backend)
+	case counts["slm"] == 0 && counts["slm_fallback"] > 0:
+		fmt.Printf(" hint: SLM was called %d times but every call fell back — run with `--verbose` to see the underlying error (likely a timeout or parse failure for %s).\n",
+			counts["slm_fallback"], backend)
 	case slmShare < 50:
 		fmt.Printf(" hint: SLM share is %.0f%% — fallback is doing most of the work.\n", slmShare)
 	}

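The reworked trust hint in the hunk above is easier to reason about when the case ordering is pulled out into a pure function. The following Go sketch mirrors the switch from the diff. `trustHint` is a hypothetical name, the hint strings are abbreviated, and computing `total` as the sum of all counts is an assumption, since the diff does not show where `total` comes from.

```go
package main

import "fmt"

// trustHint mirrors the case ordering of the switch in the diff.
// Summing every count into total is an assumption for this sketch.
func trustHint(counts map[string]int, backend string) string {
	total := 0
	for _, n := range counts {
		total += n
	}
	slmShare := 0.0
	if total > 0 {
		slmShare = float64(counts["slm"]) / float64(total) * 100
	}
	switch {
	case total < 50:
		return "too sparse"
	case counts["slm"] == 0 && counts["slm_fallback"] == 0:
		return "never called: check [slm].enabled and " + backend
	case counts["slm"] == 0:
		return fmt.Sprintf("called %d times, always fell back", counts["slm_fallback"])
	case slmShare < 50:
		return fmt.Sprintf("SLM share %.0f%%", slmShare)
	}
	return "" // the SLM drives most routing; no hint needed
}

func main() {
	fmt.Println(trustHint(map[string]int{"heuristic": 60}, "ollama"))
	fmt.Println(trustHint(map[string]int{"heuristic": 10, "slm_fallback": 50}, "ollama"))
	fmt.Println(trustHint(map[string]int{"slm": 20, "heuristic": 80}, "ollama"))
}
```

Order matters here: the sparse-data check runs first so that a handful of observations never triggers a "never called" hint, and the two zero-SLM cases are split by whether any fallback was recorded.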
Changes: +24 -10
@@ -24,27 +24,41 @@ The "ollama" path is the easiest if you're already running a local model — it

 ## Presets

-Presets use `reecdev/tiny3.5:500m` as the default model — a 500 M-parameter Qwen3.5 distillation with tool support, available on Ollama. Pull it once with:
+Presets use `qwen3:0.6b` as the default model — a 600 M-parameter Qwen3 instruction-tuned model with native `/no_think` support, available on Ollama. Pull it once with:

 ```bash
-ollama pull reecdev/tiny3.5:500m # ~1 GB
-# or the 1.5 B variant for slightly better quality:
-ollama pull reecdev/tiny3.5:1.5b # ~3 GB
+ollama pull qwen3:0.6b # ~520 MB
 ```

+### Model choice notes
+
+Empirical testing (2026-05-25) across three candidate SLMs on identical prompts:
+
+| Model | Classifier success | Notes |
+|---|---|---|
+| `qwen3:0.6b` | consistent across trivial + knowledge prompts | recommended default; honours `/no_think` cleanly |
+| `functiongemma:270m` | works on trivial prompts, derails on knowledge ones | needs function-signature prompt rewrite or LoRA fine-tune to be reliable |
+| `gemma3:1b` | unusable | emits malformed JSON (just `{` or invented keys) |
+| `reecdev/tiny3.5:1.5b` | unusable | thinking-mode distillation; ignores `/no_think` and emits `<Thought Process>` blocks |
+| `qwen2.5-coder:1.5b` | unusable | code-completion-tuned; ignores the classifier prompt entirely and answers in prose |
+
 Substitute any small Ollama model you prefer. The probe at startup reads each model's actual capability — `tools` enables the SLM arm to handle simple file reads; without it, the SLM only handles knowledge-only prompts.

+If your SLM is task-specialised (function-call models like FunctionGemma; embedding-only models; code-completion-tuned models) and produces wrong-shape output when asked to answer a general prompt, set `register_as_arm = false` so the SLM stays classifier-only and execution routes to other local arms.
+
 ### Preset 1 — Ollama (recommended for most users)

 ```toml
 [slm]
-enabled = true
-backend = "ollama"
-model   = "reecdev/tiny3.5:500m"
+enabled          = true
+backend          = "ollama"
+model            = "qwen3:0.6b"
+register_as_arm  = true     # default; set false for classifier-only models
+classify_timeout = "15s"    # default; bump for slow cold-load
 # base_url defaults to http://localhost:11434
 ```

-Prereq: `ollama pull reecdev/tiny3.5:500m` (or any model you'd rather use).
+Prereq: `ollama pull qwen3:0.6b` (or any model you'd rather use).

 ### Preset 2 — llama.cpp server

@@ -150,10 +164,10 @@ Output looks like:
 ```
 slm enabled: true
 slm backend: ollama
-model: reecdev/tiny3.5:500m
+model: qwen3:0.6b

 live probe:
-✓ ollama ready (model=reecdev/tiny3.5:500m, boot=0s)
+✓ ollama ready (model=qwen3:0.6b, boot=0s)
 ```

 Run a few prompts, then check:

@@ -1,5 +1,14 @@
 # Tool-Router Specialization (functiongemma) — 2026-05-23

+> **Companion plan from 2026-05-25:**
+> [`2026-05-25-encoder-bandit-router.md`](2026-05-25-encoder-bandit-router.md)
+> sketches an alternative architecture (encoder + contextual bandit
+> instead of decoder-SLM-as-classifier). The two are complementary,
+> not competing — FunctionGemma fits as the optional Phase 5 "JSON
+> sanity layer" in that plan. Decide which track to invest in based
+> on the did-switch-rate telemetry (this plan) vs the bandit-data
+> accumulation (companion plan).
+
 Follow-up to
 [`2026-05-19-post-slm-unlock.md`](2026-05-19-post-slm-unlock.md)
 Phase A, which shipped two-stage tool routing: round 1 sends a single

@@ -0,0 +1,356 @@
+# Config Migration — 2026-05-24
+
+Fixes the silent-corruption pattern in `internal/config/write.go`
+that produces zero-spammed config files, adds reader-side telemetry
+to surface the resulting layering bugs (`gnoma doctor`), ships an
+active migration command (`gnoma upgrade-config`), wires automatic
+project-level migration on startup, and introduces a per-user
+project registry so all of the above can operate cross-project.
+
+Surfaces in TODO.md as "Config write/merge — silent corruption of
+layered configs" with five sub-items; this plan promotes that entry
+out of the bullet form into a phased design.
+
+---
+
+## Problem
+
+`setConfig()` in `internal/config/write.go` reads the existing TOML
+into a zero-valued `Config` struct, mutates one field, and writes
+the entire struct back out. The encoder doesn't skip zero values,
+so every untouched field gets serialized at its Go default — empty
+strings, zero ints, `false` bools, empty maps.
+
+The next layered load (`Load()` → `toml.Decode` over multiple
+files) then **does not** treat those present-but-zero fields as
+"unset" — TOML's "present field wins" semantics mean those zeros
+overwrite higher-priority layers. Concrete failure observed
+2026-05-24:
+
+- User's global `~/.config/gnoma/config.toml` has
+  `[router].prefer = "cloud"`.
+- An earlier `gnoma config set ...` call generated a project-level
+  `.gnoma/config.toml` containing `[router].prefer = ""`.
+- The merge collapses to `Prefer = ""`, which
+  `ParsePreferPolicy("")` maps to `PreferAuto`.
+- The TUI's `/router` command reads `auto` despite the global
+  config saying `cloud`. No warning, no error — purely silent.
+
+Same root cause produces zero-spammed global configs
+(`max_tokens = 0`, `permission.mode = ""`, etc.) that silently
+override sensible defaults in `internal/config/defaults.go`.
+
+This affects every layered field — provider, permission, tools,
+session, router, security, slm. Cannot be patched per-field;
+needs a structural fix.
+
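The layering failure described in the Problem section can be reproduced in miniature. The sketch below decodes two layers into the same struct, the way the plan says `Load()` does with `toml.Decode`. It uses `encoding/json` as a stand-in for the TOML decoder purely so the example runs with the standard library; that substitution, the `RouterConfig` type, and the `loadLayers` helper are all assumptions, not gnoma code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RouterConfig is a hypothetical stand-in for the [router] section.
type RouterConfig struct {
	Prefer string `json:"prefer"`
}

// loadLayers decodes each layer into the same struct, in order. A key
// that is present in a later layer overwrites the earlier value, even
// when the later value is the zero value.
func loadLayers(layers ...string) (RouterConfig, error) {
	var cfg RouterConfig
	for _, layer := range layers {
		if err := json.Unmarshal([]byte(layer), &cfg); err != nil {
			return cfg, err
		}
	}
	return cfg, nil
}

func main() {
	global := `{"prefer": "cloud"}`
	zeroSpam := `{"prefer": ""}` // what writing the whole zero-valued struct produces
	clean := `{}`                // what the project file should contain: key absent

	bad, _ := loadLayers(global, zeroSpam)
	good, _ := loadLayers(global, clean)
	fmt.Printf("zero-spammed project layer: prefer=%q\n", bad.Prefer)
	fmt.Printf("clean project layer:        prefer=%q\n", good.Prefer)
}
```

The decoder cannot tell an accidental empty string from a deliberate one, which is why the plan fixes the writer rather than the reader.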
+---
+
+## Non-goals
+
+- **Schema redesign.** The current `Config` struct stays as-is.
+  This plan addresses how it's written and read, not what fields
+  exist.
+- **Validation.** Future work; `gnoma doctor` will flag obviously
+  invalid values (empty enum strings, etc.) but a full validation
+  pass against the schema is out of scope here.
+- **Migration of the bandit-router quality JSON.** Unrelated file,
+  unrelated format, separate concerns.
+
+---
+
+## Approach overview
+
+Five phases, in dependency order:
+
+1. **Encoder fix** — stop generating zero-spam in the first place.
+2. **Project registry** — `~/.config/gnoma/projects.json` so later
+   phases can operate cross-project without filesystem walks.
+3. **`gnoma doctor`** — read-only diagnostic, scans global +
+   project configs (via registry), reports zero-spam, invalid
+   enums, removed keys, and the effective-merged view.
+4. **`gnoma upgrade-config`** — active migration with `.bak`
+   backup + diff output; targets one file or all known projects.
+5. **Auto-migration on startup** — when launch detects a
+   zero-spammed project config, run upgrade-config silently with
+   a banner-line notice.
+
+Phases 1 + 2 land first. 3 builds on 1 + 2. 4 builds on 3. 5
+builds on 4.
+
+---
+
+## Phase 1 — Encoder fix
+
+`setConfig()` is the bug generator. The TOML library
+(`BurntSushi/toml`) supports `omitempty` on struct tags but the
+project's `Config` struct doesn't use it. Three options:
+
+### Option A — `omitempty` on all fields
+
+Tag every field with `,omitempty`. The encoder skips fields at
+their Go zero value. **Caveat:** conflates "unset" with
+"explicitly zero" for primitive types — a user who actually
+wants `max_keep = 0` (no session retention) loses that setting on
+the next write.
+
+### Option B — `pelletier/go-toml/v2` document model
+
+Switch encoder to a TOML library that exposes a document AST.
+Edit only the targeted key, preserve everything else byte-for-byte.
+Cleaner semantics, bigger refactor — also affects the decoder side.
+
+### Option C (chosen) — hybrid
+
+Use `omitempty` for fields where the Go zero value is never
+user-intent (strings, maps, slices). For numeric fields where 0
+is a legitimate user choice, switch the field to a pointer
+(`*int`, `*float64`) so `nil` means "unset" and `*0` means
+"explicitly zero". On decode, fall back to defaults for nil
+pointers in the resolution layer.
+
+This keeps the existing BurntSushi library, preserves user intent
+across the full type space, and limits churn to the fields where
+the zero/unset ambiguity actually matters.
+
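Option C's distinction between "unset" and "explicitly zero" can be sketched with a pointer field and a resolver. As before, `encoding/json` stands in for the TOML encoder so the example runs with the standard library alone; whether `BurntSushi/toml` treats a non-nil pointer to zero the same way under `omitempty` should be verified against that library. `SessionConfig`, `defaultMaxKeep`, and `resolvedMaxKeep` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SessionConfig is a hypothetical section. MaxKeep is a pointer so that
// nil means "unset" and a pointer to 0 means "explicitly zero".
type SessionConfig struct {
	Dir     string `json:"dir,omitempty"`
	MaxKeep *int   `json:"max_keep,omitempty"`
}

// defaultMaxKeep is an illustrative default, not gnoma's real value.
const defaultMaxKeep = 20

// resolvedMaxKeep substitutes the default only when the field was never set.
func (c SessionConfig) resolvedMaxKeep() int {
	if c.MaxKeep == nil {
		return defaultMaxKeep
	}
	return *c.MaxKeep
}

// encode serializes the section; unset fields are omitted entirely.
func encode(c SessionConfig) string {
	out, err := json.Marshal(c)
	if err != nil {
		return ""
	}
	return string(out)
}

func main() {
	zero := 0
	unset := SessionConfig{}
	explicit := SessionConfig{MaxKeep: &zero}

	fmt.Println(encode(unset), unset.resolvedMaxKeep())       // prints "{} 20"
	fmt.Println(encode(explicit), explicit.resolvedMaxKeep()) // prints "{"max_keep":0} 0"
}
```

The resolver is the only place that knows the defaults, which matches the plan's intent that consumers read resolved values and never inspect raw pointers.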
+### Phase 1 task list
+
+- **P1-1:** Audit every `Config`-tree field. Tag string/map/slice
+  fields with `,omitempty`. List numeric/bool fields that need
+  pointer conversion.
+- **P1-2:** Convert numeric/bool fields requiring zero-vs-unset
+  distinction to pointers. Update construction sites and getters.
+- **P1-3:** Add a `Resolve()` method on `Config` that walks the
+  struct and substitutes default values for nil pointers, called
+  exactly once at the end of `Load()`. All consumer code reads
+  resolved values; raw layered structs are internal.
+- **P1-4:** Tests covering: (a) write-then-read roundtrip
+  preserves only user-set fields, (b) explicit zero (e.g.
+  `max_keep = 0`) survives the roundtrip, (c) field absent from
+  TOML resolves to default.
+- **P1-5:** Backwards-compat: when reading an existing zero-spammed
+  file, the resolver must treat all-zeros-in-a-section as the
+  default — see Phase 5 for the heuristic.
+
+---
+
+## Phase 2 — Project registry
+
+New file at `~/.config/gnoma/projects.json`:
+
+```json
+{
+  "projects": [
+    {
+      "path": "/home/user/git/foo",
+      "first_seen": "2026-04-15T10:30:00Z",
+      "last_seen": "2026-05-24T19:23:00Z",
+      "session_count": 47
+    }
+  ]
+}
+```
+
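The registry shape above maps directly onto a small Go type. The sketch below shows how a `Record` call might bump `last_seen` and `session_count` for a known project or append a new entry. The plan only names `Record(projectRoot)`; the extra `now` parameter, the field layout, and the demo timestamps are assumptions added to keep the example deterministic.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Project mirrors one entry of the projects.json sketch.
type Project struct {
	Path         string    `json:"path"`
	FirstSeen    time.Time `json:"first_seen"`
	LastSeen     time.Time `json:"last_seen"`
	SessionCount int       `json:"session_count"`
}

// Registry is the top-level document.
type Registry struct {
	Projects []Project `json:"projects"`
}

// Record notes one startup in root. Passing the clock in as a parameter
// is an assumption of this sketch; it makes the behaviour testable.
func (r *Registry) Record(root string, now time.Time) {
	for i := range r.Projects {
		if r.Projects[i].Path == root {
			r.Projects[i].LastSeen = now
			r.Projects[i].SessionCount++
			return
		}
	}
	r.Projects = append(r.Projects, Project{
		Path: root, FirstSeen: now, LastSeen: now, SessionCount: 1,
	})
}

// Fixed timestamps for the demo, taken from the JSON sketch.
var (
	demoFirst = time.Date(2026, time.April, 15, 10, 30, 0, 0, time.UTC)
	demoLater = time.Date(2026, time.May, 24, 19, 23, 0, 0, time.UTC)
)

func main() {
	var reg Registry
	reg.Record("/home/user/git/foo", demoFirst)
	reg.Record("/home/user/git/foo", demoLater)
	out, _ := json.MarshalIndent(reg, "", "  ")
	fmt.Println(string(out))
}
```

A linear scan is adequate for a per-user list of projects; a map keyed by path would only matter at sizes this file is unlikely to reach.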
+### Phase 2 task list
+
+- **P2-1:** Add `internal/config/registry.go` with `Registry`,
+  `Load`, `Save`, `Record(projectRoot)`, `Prune(staleAfter time.Duration)`.
+- **P2-2:** Save uses atomic-write (temp file + `os.Rename`) so a
+  crash mid-write doesn't corrupt the file.
+- **P2-3:** Call `Registry.Record(projectRoot)` from
+  `cmd/gnoma/main.go` right after the startup-safety banner
+  decides to proceed. Failure is logged at Warn level but never
+  blocks startup.
+- **P2-4:** Add `[config].project_registry` toggle in defaults.go
+  (bool, default `true`). When `false`, Record is a no-op.
+- **P2-5:** Document the file in README §Security as part of the
+  no-phone-home scope note: this is purely local, never sent.
+- **P2-6:** Tests: round-trip, atomic-write under fault injection,
+  toggle off path.
+
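P2-2 asks for an atomic write built from a temp file and `os.Rename`. One common shape for that pattern is sketched below using only the standard library. `writeFileAtomic` and `overwriteAndRead` are hypothetical names, and syncing the file before the rename is a choice made here rather than something the plan specifies.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temp file in the same directory as
// path, syncs it, and renames it over path. Keeping the temp file in the
// same directory keeps the rename on one filesystem.
func writeFileAtomic(path string, data []byte, perm os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	// Best-effort cleanup; after a successful rename the temp name is gone.
	defer os.Remove(tmpName)

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if err := os.Chmod(tmpName, perm); err != nil {
		return err
	}
	return os.Rename(tmpName, path)
}

// overwriteAndRead writes two versions in sequence to a temp directory
// and returns what is on disk afterwards.
func overwriteAndRead(first, second string) (string, error) {
	dir, err := os.MkdirTemp("", "atomic-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "projects.json")
	for _, content := range []string{first, second} {
		if err := writeFileAtomic(path, []byte(content), 0o600); err != nil {
			return "", err
		}
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	got, err := overwriteAndRead(`{"projects":[]}`, `{"projects":[{"path":"/p"}]}`)
	fmt.Println(got, err)
}
```

A crash before the rename leaves the previous file untouched plus a stray temp file, which is the failure mode the task list prefers over a half-written registry.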
---
|
||||
|
||||
## Phase 3 — `gnoma doctor`
|
||||
|
||||
New subcommand. Read-only. Scans:
|
||||
|
||||
- Global config at `GlobalConfigPath()`.
|
||||
- Every project in the registry (or filesystem-scan fallback when
|
||||
the registry is disabled or empty).
|
||||
- Active profile (when profile mode is on).
|
||||
|
||||
Reports per-file:
|
||||
|
||||
- **Zero-spam fields** — present-with-zero where higher layer or
|
||||
default has non-zero. The very thing this plan exists to fix.
|
||||
- **Invalid enum values** — `permission.mode = ""`,
|
||||
`router.prefer = "yes"`, etc. Use existing parsers to detect.
|
||||
- **Unknown keys** — fields in the TOML that don't map to any
|
||||
`Config` struct field. Decoder ignores these silently today;
|
||||
doctor surfaces them.
|
||||
- **Removed keys** — known-historical fields from older schema
|
||||
versions; suggest removal.
|
||||
|
||||
Reports per-stack:
|
||||
|
||||
- **Effective-merged values** — what gnoma will actually use after
|
||||
layering. Helps the user see whether a project file is masking
|
||||
a global setting.
|
||||
|
||||
### Phase 3 task list
|
||||
|
||||
- **P3-1:** Add `cmd/gnoma/doctor_cmd.go` with the subcommand
|
||||
scaffold.
|
||||
- **P3-2:** `internal/config/doctor.go` with the scan logic;
|
||||
exported `Diagnose(paths []string) []Finding`.
|
||||
- **P3-3:** Output: human format by default, `--json` for
|
||||
CI/script consumption.
|
||||
- **P3-4:** Exit non-zero when findings have severity ≥ Warn so
|
||||
doctor is CI-friendly.
|
||||
- **P3-5:** `--all-projects` flag (default off; uses registry).
|
||||
- **P3-6:** Tests covering each finding type.
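
As a rough illustration of P3-2 and P3-4, the finding type and the exit-code rule might be shaped like this. The `Severity` levels, the `Finding` fields, and `exitCode` are assumptions made for the sketch; the plan only fixes the `Diagnose(paths []string) []Finding` signature.

```go
package main

import "fmt"

// Severity is a hypothetical ordering; the plan only says "severity >= Warn".
type Severity int

const (
	Info Severity = iota
	Warn
	Error
)

// Finding is an assumed shape for one doctor result.
type Finding struct {
	Path     string   // config file the finding belongs to
	Kind     string   // e.g. "zero-spam", "invalid-enum", "unknown-key", "removed-key"
	Key      string   // TOML key the finding refers to
	Severity Severity
}

// exitCode returns 1 when any finding is at Warn or above, 0 otherwise,
// which is what makes the command usable as a CI gate.
func exitCode(findings []Finding) int {
	for _, f := range findings {
		if f.Severity >= Warn {
			return 1
		}
	}
	return 0
}

func main() {
	findings := []Finding{
		{Path: ".gnoma/config.toml", Kind: "invalid-enum", Key: "router.prefer", Severity: Warn},
	}
	for _, f := range findings {
		fmt.Printf("%s: %s (%s)\n", f.Path, f.Kind, f.Key)
	}
	fmt.Println("exit code:", exitCode(findings))
}
```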
|
||||
|
||||
---
|
||||
|
||||
## Phase 4 — `gnoma upgrade-config`
|
||||
|
||||
Active migration. Writes:
|
||||
|
||||
- Original file → `<path>.bak-YYYYMMDD-HHMMSS` (deterministic
|
||||
timestamp suffix).
|
||||
- Cleaned content → original path.
|
||||
- Stdout: unified diff of what changed.
|
||||
|
||||
### Phase 4 task list
|
||||
|
||||
- **P4-1:** Add `cmd/gnoma/upgrade_config_cmd.go`.
|
||||
- **P4-2:** `internal/config/upgrade.go` with `Upgrade(path string)`
|
||||
→ reads file, applies the Phase 1 cleaning (drop fields equal to
|
||||
their resolved default, keep explicit zeros that diverge from the
|
||||
default via the pointer semantics).
|
||||
- **P4-3:** Atomic two-step write: rename original to `.bak-...`,
  then atomic-write new content to original path. A crash midway
  leaves the original content intact in the `.bak` file (the
  original path may be missing until the upgrade is re-run), never
  a half-written config.
|
||||
- **P4-4:** `--all-projects` flag using the registry.
|
||||
- **P4-5:** `--dry-run` prints diffs without writing.
|
||||
- **P4-6:** Tests: round-trip of zero-spammed input → cleaned
|
||||
output → identical re-read; idempotency (running twice yields
|
||||
no second `.bak`).
|
||||
|
||||
---
|
||||
|
||||
## Phase 5 — Auto-migration on startup
|
||||
|
||||
When `Load()` parses a project `.gnoma/config.toml` and the
|
||||
heuristic flags it as zero-spammed (every field at the Go zero
|
||||
value, no user content), gnoma:
|
||||
|
||||
- Runs the Phase 4 upgrade in-process.
|
||||
- Writes `.gnoma/config.toml.bak-...`.
|
||||
- Emits a single line to the startup safety banner:
|
||||
`config: migrated .gnoma/config.toml (see .bak)`.
|
||||
- Continues startup with the cleaned config.
|
||||
|
||||
### Heuristic for "zero-spam"
|
||||
|
||||
A config file is zero-spam if **all** of these hold:

- Every primitive field present in the file is at its Go zero
  value.
- No `[[arms]]`, `[[mcp_servers]]`, or `[[hooks]]` blocks (those
  are always user content).
- The file was last modified at least 24h ago (so we don't migrate
  a config the user is actively editing).
|
||||
|
||||
If only some fields are zero and some are user-set, we don't touch
|
||||
it — the user's mix of explicit zeros and meaningful values takes
|
||||
precedence.
|
||||
|
||||
### Phase 5 task list
|
||||
|
||||
- **P5-1:** Add `isZeroSpam(*Config) bool` heuristic in
|
||||
`internal/config/upgrade.go`.
|
||||
- **P5-2:** Wire from `Load()` post-merge: if `isZeroSpam` flags the
  project layer, call `Upgrade` on the project file and log via banner.
|
||||
- **P5-3:** Add `[config].auto_migrate` toggle, default `true`.
|
||||
Global configs are never auto-migrated; only project-level.
|
||||
- **P5-4:** Banner integration: the existing safety banner gets
|
||||
a new optional line for "config notices" right under the
|
||||
cwd/sensitivity summary.
|
||||
- **P5-5:** Tests: zero-spam project file gets migrated; mixed
|
||||
project file is left alone; recently-modified file is left
|
||||
alone; auto_migrate=false disables.
|
||||
|
||||
---
|
||||
|
||||
## Cross-cutting: schemas and resolution
|
||||
|
||||
The pointer-field design (Phase 1) needs a clear resolution layer.
|
||||
Proposal: every Config section gets a `Resolved...Section` mirror
|
||||
that has plain (non-pointer) types. After Load, the resolver
|
||||
populates one from the other, substituting defaults for nils.
|
||||
|
||||
Examples already exist in the codebase: `ResolvedSafetySection`
|
||||
mirrors `SafetySection`. The pattern is established; we just need
|
||||
to extend it.
|
||||
|
||||
Consumer-side: code reads from `cfg.Resolved.X` not `cfg.X`.
|
||||
Loud renaming will catch any reader still using the raw layered
|
||||
struct.
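
A minimal sketch of the pointer-to-resolved mirror, under stated assumptions: `RouterSection`, its fields, and the `orDefault` helper are invented for illustration and are not the real gnoma types. The point it shows is that a nil pointer falls back to the default while an explicit zero survives resolution.

```go
package main

import "fmt"

// RouterSection is a hypothetical layered section: nil means "not set".
type RouterSection struct {
	Prefer     *string
	MaxRetries *int
}

// ResolvedRouterSection is the plain-typed mirror that consumers read.
type ResolvedRouterSection struct {
	Prefer     string
	MaxRetries int
}

// orDefault returns *p when the layer set the field, def otherwise.
func orDefault[T any](p *T, def T) T {
	if p == nil {
		return def
	}
	return *p
}

// resolveRouter substitutes defaults for every nil field.
func resolveRouter(s RouterSection, defaults ResolvedRouterSection) ResolvedRouterSection {
	return ResolvedRouterSection{
		Prefer:     orDefault(s.Prefer, defaults.Prefer),
		MaxRetries: orDefault(s.MaxRetries, defaults.MaxRetries),
	}
}

func main() {
	zero := 0
	layered := RouterSection{MaxRetries: &zero} // explicit zero, Prefer unset
	resolved := resolveRouter(layered, ResolvedRouterSection{Prefer: "local", MaxRetries: 3})
	fmt.Printf("%+v\n", resolved)
}
```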
|
||||
|
||||
---
|
||||
|
||||
## Risks
|
||||
|
||||
- **Pointer-field migration is wide-scope.** Every reader of the
|
||||
affected fields needs to change. Mitigated by the
|
||||
resolver-mirror pattern (`ResolvedXSection`) — readers move from
|
||||
one struct to another, but the call sites don't change shape.
|
||||
- **Auto-migration writes silently.** Users might be surprised
|
||||
even with the banner notice. Mitigated by `.bak` preservation
|
||||
and the heuristic only firing on files that are obviously
|
||||
zero-spam.
|
||||
- **Registry becomes the same class of bug.** Documented in the
|
||||
TODO entry already; Phase 2 explicitly requires atomic-write
|
||||
and `omitempty` discipline. If we get this wrong the fix is the
|
||||
same shape as Phase 1.
|
||||
- **Privacy.** The registry is a list of directories the user has
|
||||
worked in. Local-only, opt-out toggle, README note required.
|
||||
- **Backwards compatibility for tests.** Tests that construct
|
||||
`Config` by hand with explicit zeros may need updating.
|
||||
Approach: add a `MustResolve` helper for test construction so
|
||||
tests don't need to know about the pointer/resolver split.
|
||||
|
||||
---
|
||||
|
||||
## Rollout
|
||||
|
||||
Phases 1 + 2 ship together as a single release (encoder fix
|
||||
needs the resolver, registry is independent but small). Tag as
|
||||
`v0.4.0` — schema-touching changes warrant a minor bump per
|
||||
the project's pre-1.0 semver discipline.
|
||||
|
||||
Phase 3 (`gnoma doctor`) can ship in a `v0.4.x` patch — it's
|
||||
read-only and adds no surface compatibility risk.
|
||||
|
||||
Phase 4 (`gnoma upgrade-config`) ships in a follow-up `v0.4.x`.
|
||||
|
||||
Phase 5 (auto-migration) ships once Phase 4 has been in the wild
|
||||
for at least one release cycle, so users have a way to opt in /
|
||||
inspect before it becomes implicit.
|
||||
|
||||
---
|
||||
|
||||
## Open questions
|
||||
|
||||
- Should `gnoma doctor` also check that the `quality.json` file
|
||||
is well-formed? Same dir, different concern — probably belongs
|
||||
in doctor's scope as the umbrella "diagnose my gnoma install"
|
||||
command.
|
||||
- Registry size cap? After a year of usage on a busy machine
|
||||
the file could grow to a few thousand entries. Reasonable; no
|
||||
cap planned, but `Prune(staleAfter)` exposed for users who
|
||||
want manual cleanup.
|
||||
- Profiles: how do profile configs interact with the doctor /
|
||||
upgrade flow? Default: treat each profile file as its own
|
||||
upgradeable unit. Doctor lists findings per-profile.
|
||||
@@ -0,0 +1,278 @@
|
||||
# Sensitive Content — Unified Policy — 2026-05-24
|
||||
|
||||
Promotes the "sensitive-content handling — unified policy" TODO
|
||||
entry into a phased design. Three input paths can introduce
|
||||
sensitive content into the conversation context — pasted images,
|
||||
pasted text, and tool-read files. Today each path has different
|
||||
defences; this plan unifies them behind a single policy with a
|
||||
single consent UI.
|
||||
|
||||
Sibling concerns:
|
||||
[`2026-05-19-post-slm-unlock.md`](2026-05-19-post-slm-unlock.md)
|
||||
Phase F (entropy detection) and the outgoing-scan firewall
|
||||
already cover detection in some places; this plan unifies the
|
||||
*decision* layer that sits in front of them.
|
||||
|
||||
---
|
||||
|
||||
## Problem
|
||||
|
||||
Three input paths to the engine carry distinct sensitivity
|
||||
risks; each is handled differently today.
|
||||
|
||||
### Path 1 — Pasted images (Ctrl+V in the TUI)
|
||||
|
||||
Screenshot might contain API keys, terminal output with creds,
|
||||
private repo contents, family photos, etc. Today:
|
||||
|
||||
- Image bytes land in the user cache dir.
|
||||
- The router only sends to vision-capable arms.
|
||||
- Local arms are fine; cloud arms send full image content to
|
||||
the provider.
|
||||
- Incognito skips paste entirely (per the no-persistence
|
||||
contract).
|
||||
|
||||
What's missing: at-paste preview / warning. The user often does
|
||||
not realise what the screenshot contained until after it's been
|
||||
sent.
|
||||
|
||||
### Path 2 — Pasted text
|
||||
|
||||
User pastes a chunk into the input composer. Could be a log
|
||||
snippet with credentials, an `.env` file content, an SSH key,
|
||||
or just text. Today:
|
||||
|
||||
- Goes straight into the input buffer with no scanning.
|
||||
- Outgoing firewall scans the final composed message before
|
||||
send — *after* the user has already pressed Enter, often
|
||||
redacting silently in the background.
|
||||
- The user sees `[REDACTED]` in their own message after the
|
||||
fact, no consent step.
|
||||
|
||||
What's missing: at-paste detection so the user sees the warning
|
||||
*before* committing to send.
|
||||
|
||||
### Path 3 — Tool-read files
|
||||
|
||||
`fs_read`, `bash`, etc. surface file contents to the model. Today:
|
||||
|
||||
- Outgoing firewall scans tool *results* before they reach the
|
||||
next provider turn (`ScanToolResult`).
|
||||
- Format-aware entropy detection (Phase F-1) reduces false
|
||||
positives on UUIDs / SHA / ISO timestamps.
|
||||
- The audit log (just shipped) records what got blocked /
|
||||
redacted per session.
|
||||
|
||||
What's missing: nothing structurally on this path; it's the
|
||||
most-mature of the three. Listed here only for completeness so
|
||||
the unified policy can be honest about asymmetric coverage.
|
||||
|
||||
### The unification question
|
||||
|
||||
These three paths converge into "content that joins the context
|
||||
window." A consistent policy needs to answer, for each path:
|
||||
|
||||
1. **When** does detection run? (at paste / at send / at receive)
|
||||
2. **What** does the user see? (warning / preview / redacted
|
||||
placeholder / silent)
|
||||
3. **What** is their consent gate? (approve / deny /
   approve-with-redaction / skip)
|
||||
4. **Where** is the action recorded? (audit log, banner, slog)
|
||||
|
||||
Today the answers vary per path. This plan picks one set of
|
||||
answers and applies them everywhere.
|
||||
|
||||
---
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **New detectors.** This plan reuses the existing scanner
|
||||
(regex + entropy + unicode-sanitize). Phase F-2's SLM-assisted
|
||||
detector lands separately when telemetry warrants.
|
||||
- **Egress allowlist.** Tracked in the security-boundary TODO
|
||||
entry, separate plan.
|
||||
- **Provider-side redaction.** That's the provider's problem.
|
||||
This plan is about what leaves gnoma's process.
|
||||
|
||||
---
|
||||
|
||||
## Approach
|
||||
|
||||
Single policy module: `internal/security/sensitive_policy.go`.
|
||||
Exposes one decision function:
|
||||
|
||||
```go
type Decision int

const (
	DecisionAllow Decision = iota
	DecisionWarn // show warning, allow on confirm
	DecisionRedactAndAllow
	DecisionBlock
)

type Inspection struct {
	Path       string          // "paste_text", "paste_image", "tool_result"
	Content    string          // for text paths
	ImageBytes []byte          // for image paths; nil otherwise
	Matches    []scanner.Match // pre-scanned hits
}

func Decide(insp Inspection, mode IncognitoMode, prefs Preferences) Decision
```
|
||||
|
||||
All three paths route through `Decide` with their own
|
||||
`Inspection`. UI surface — the at-paste prompt, the at-send
|
||||
warning, the redacted-placeholder view — sits in the TUI and is
|
||||
driven by the Decision value.
|
||||
|
||||
### Path-specific wiring
|
||||
|
||||
| Path | When | UI | Default Decision rules |
|---|---|---|---|
| paste_text | Ctrl+V into composer | Inline warning under input box, with `Tab` to expand match details | Match in scanner → `Warn` (text stays, user dismisses); explicit block-tier match → `Block` (paste dropped) |
| paste_image | Ctrl+V image | Pre-paste OCR scan (small local model) + warning before insertion | OCR finds secret pattern → `Warn`; user can choose `Redact` (image kept, warning attached) or `Cancel`. Incognito → `Block` (already today). |
| tool_result | After tool runs | Banner: `firewall: redacted N items in this tool result` | Existing behaviour. `Decide` invoked just to keep the API surface consistent; matches go to audit log. |
|
||||
|
||||
### Preferences
|
||||
|
||||
New `[security.sensitive]` config section:
|
||||
|
||||
```toml
[security.sensitive]
warn_on_paste_text = true # default true
warn_on_paste_image = true # default true
ocr_image_paste = false # opt-in: requires local vision arm
auto_redact = false # default false: ask first, redact second
silent_tool_results = false # default false: show banner when redactions happen
```
|
||||
|
||||
### Incognito interaction
|
||||
|
||||
When incognito is active, **every** Decision is treated as either
|
||||
`Block` or `RedactAndAllow` — never `Warn`-then-`Allow`. Incognito
|
||||
implies "I don't trust this conversation to persist"; the
|
||||
sensible default is to be strict about what flows in.
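
One possible reading of the default rules and the incognito rule, as a sketch rather than the planned implementation. `Match.BlockTier` is an assumed field standing in for "explicit block-tier match", incognito is reduced to a bool, `Preferences` is omitted, and the sketch assumes content with no matches is still allowed in incognito.

```go
package main

import "fmt"

type Decision int

const (
	DecisionAllow Decision = iota
	DecisionWarn
	DecisionRedactAndAllow
	DecisionBlock
)

// Match is a simplified stand-in for scanner.Match; BlockTier is assumed.
type Match struct {
	Pattern   string
	BlockTier bool
}

// Inspection is trimmed to the fields this sketch needs.
type Inspection struct {
	Path    string // "paste_text", "paste_image", "tool_result"
	Matches []Match
}

// decide applies the default rules: block-tier hits block, tool results
// are redacted, pastes warn, and incognito never yields Warn.
func decide(insp Inspection, incognito bool) Decision {
	if insp.Path == "paste_image" && incognito {
		return DecisionBlock // incognito skips image paste entirely
	}
	if len(insp.Matches) == 0 {
		return DecisionAllow
	}
	for _, m := range insp.Matches {
		if m.BlockTier {
			return DecisionBlock
		}
	}
	if insp.Path == "tool_result" || incognito {
		return DecisionRedactAndAllow
	}
	return DecisionWarn
}

func main() {
	insp := Inspection{Path: "paste_text", Matches: []Match{{Pattern: "aws_access_key"}}}
	fmt.Println("normal:", decide(insp, false), "incognito:", decide(insp, true))
}
```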
|
||||
|
||||
---
|
||||
|
||||
## Phases
|
||||
|
||||
### Phase A — Policy module + config
|
||||
|
||||
- **A-1:** Add `[security.sensitive]` section to config.go with
  the five flags above.
|
||||
- **A-2:** Add `internal/security/sensitive_policy.go` with
|
||||
`Inspection`, `Decision`, `Decide`.
|
||||
- **A-3:** Unit tests for the decision matrix.
|
||||
|
||||
### Phase B — Path 2 (pasted text)
|
||||
|
||||
Highest user-visible payoff for the smallest surface.
|
||||
|
||||
- **B-1:** TUI input composer intercepts paste, runs
|
||||
`Decide(paste_text, ...)` before the bytes enter the buffer.
|
||||
- **B-2:** Decision = Warn → status-line warning, paste still
|
||||
goes in. `Tab` expands details.
|
||||
- **B-3:** Decision = Block → paste discarded, status line
|
||||
explains why; user can override with `Ctrl+Shift+V`
|
||||
(force-paste) which bypasses but writes to audit log.
|
||||
- **B-4:** Tests: paste-of-known-secret triggers warning;
|
||||
redacted variant shows what would have been sent.
|
||||
|
||||
### Phase C — Path 3 (tool-results) banner
|
||||
|
||||
- **C-1:** When `ScanToolResult` redacts ≥1 item, the engine
|
||||
emits a system message: `firewall: redacted 2 items in
|
||||
read-file output (see audit log)`.
|
||||
- **C-2:** Gated behind `silent_tool_results` (default `false`, so
  the banner shows). Users who already trust the firewall can set
  it to `true` to suppress the banner.
|
||||
- **C-3:** Tests: integration test asserting the system
|
||||
message appears.
|
||||
|
||||
### Phase D — Path 1 (pasted images)
|
||||
|
||||
Most complex. Image OCR requires a local vision model; without
|
||||
one the paste falls back to today's behaviour.
|
||||
|
||||
- **D-1:** Add OCR hook: when `ocr_image_paste = true` and a
|
||||
vision-capable local arm is available, run a small OCR pass
|
||||
over the image before insertion.
|
||||
- **D-2:** Feed OCR output through the regex/entropy scanner.
|
||||
Matches → `Decide(paste_image, ...)` with the original image
|
||||
attached.
|
||||
- **D-3:** TUI shows a preview thumbnail + warning before
|
||||
insertion confirmation.
|
||||
- **D-4:** Without a vision arm: feature degrades gracefully
|
||||
(no OCR, paste proceeds as today, banner notes "image paste
|
||||
scan unavailable — no local vision arm").
|
||||
|
||||
### Phase E — Audit log integration
|
||||
|
||||
All four Decision outcomes get an audit entry. The audit log
|
||||
already has the file format from the security-boundary work;
|
||||
just need to define new Action values:
|
||||
|
||||
- `paste_warn`, `paste_block`, `paste_force_override`
|
||||
- `image_paste_warn`, `image_paste_block`, `image_paste_ocr_skip`
|
||||
- `tool_result_banner` (when redactions surfaced to user)
|
||||
|
||||
---
|
||||
|
||||
## Risks
|
||||
|
||||
- **OCR adds latency to paste.** Bad UX if image OCR takes >300ms.
|
||||
Mitigation: hard-cap OCR time at 500ms, skip if exceeded, fall
|
||||
back to no-scan path with banner notice. Local vision models on
|
||||
consumer hardware should comfortably make this budget.
|
||||
- **False positives on text paste become annoying.** If
|
||||
`warn_on_paste_text = true` fires on every code snippet, users
|
||||
turn it off and the protection is gone. Use the same
|
||||
entropy_safelist Phase F-1 ships (uuid/sha/iso8601/url) — those
|
||||
are the high-FP categories.
|
||||
- **OCR introduces a new attack surface.** A malicious image could
|
||||
exploit the OCR model. Mitigation: only local-arm OCR (the
|
||||
attacker's input never leaves the machine); never call cloud
|
||||
vision models for OCR (would defeat the privacy purpose).
|
||||
- **Phase D depends on having a local vision model.** Users without
|
||||
one get degraded UX. Document this clearly; consider whether to
|
||||
ship a small bundled OCR-tuned model (probably no — adds 100MB+
|
||||
to install).
|
||||
|
||||
---
|
||||
|
||||
## Open questions
|
||||
|
||||
- Should there be a "trusted projects" list where the warnings
|
||||
are suppressed? Could live in the project registry (sibling
|
||||
plan). Useful for monorepos where the user explicitly trusts
|
||||
the local code.
|
||||
- The `Ctrl+Shift+V` force-paste override is a footgun. Do we
|
||||
want a confirm-second-time dialog, or just the keybind?
|
||||
- Should clipboard contents be cleared from the host clipboard
|
||||
after a sensitive paste? Cross-platform-tricky; defer.
|
||||
- Sensitive-pattern feedback loop: when a user dismisses a warning
|
||||
as "this isn't a secret", do we learn from that? Privacy concern
|
||||
— would need an explicit opt-in.
|
||||
|
||||
---
|
||||
|
||||
## Rollout
|
||||
|
||||
Phases A + B + C land together as one feature release. Phase D
|
||||
(image OCR) is opt-in (`ocr_image_paste = true`) and can land in
|
||||
a follow-up patch — its surface is large and benefits from
real-world UX feedback. Phase E threads through all four; it lands
|
||||
incrementally per phase, not as a single batch.
|
||||
|
||||
Realistic target: Phase A/B/C in v0.5.0; Phase D in v0.5.x. All
behaviour is gated behind the five config flags so existing users
who don't opt in see no behavioural change.
|
||||
|
||||
---
|
||||
|
||||
## Cross-references
|
||||
|
||||
- TODO.md entry "Sensitive-content handling — unified policy"
|
||||
- [`2026-05-19-post-slm-unlock.md`](2026-05-19-post-slm-unlock.md) — Phase F entropy detection
|
||||
- [`2026-05-19-security-wave2-incognito.md`](2026-05-19-security-wave2-incognito.md) — incognito-mode contract
|
||||
- TODO.md entry "Security boundary — egress controls + session audit log" — the audit log this plan piggybacks on
|
||||
@@ -0,0 +1,344 @@
|
||||
# Encoder + Contextual-Bandit Router — 2026-05-25
|
||||
|
||||
Proposes a long-arc architectural rethink of gnoma's routing layer:
|
||||
**replace the decoder-SLM-as-classifier design with an encoder-only
|
||||
embedding model feeding a contextual bandit policy**, and treat a
|
||||
strict tiny SLM (FunctionGemma-270M-it) as the optional "emit a
|
||||
structured route decision" layer rather than the primary classifier.
|
||||
|
||||
Surfaced from external research (RouteLLM, ModernBERT, Gemma 3
|
||||
270M, Qwen3-Embedding, BGE-M3) brought into the 2026-05-25
|
||||
diagnostic session where gnoma's current decoder-SLM classifier
|
||||
exhibited a 100% failure rate across two model swaps
|
||||
(`reecdev/tiny3.5:1.5b`, `qwen2.5-coder:1.5b`).
|
||||
|
||||
This plan is **strategic / multi-month**. Phase 1 below is the only
|
||||
piece scoped for near-term implementation; everything else hinges on
|
||||
the bandit-vs-SLM strategic decision tracked in the existing
|
||||
`Bandit selector — design decisions deferred` TODO entry.
|
||||
|
||||
Sibling plans:
|
||||
[`2026-05-23-tool-router-specialization.md`](2026-05-23-tool-router-specialization.md)
|
||||
already covers the **FunctionGemma fine-tune** track as the
|
||||
strict-SLM option; this plan adds the **encoder + bandit** track
|
||||
as the alternative (and arguably better-suited) architecture.
|
||||
|
||||
---
|
||||
|
||||
## Problem
|
||||
|
||||
The current router has three coupled problems:
|
||||
|
||||
1. **The classifier is a decoder LLM in a job an encoder would do
|
||||
better.** Routing is a classification task with cost/quality
|
||||
trade-offs, not a reasoning task. Asking a decoder model to emit
|
||||
structured JSON for every classify call is high-latency, fragile
to chain-of-thought leakage, and non-deterministic.
|
||||
|
||||
2. **The bandit can't actually learn quality** because the only
|
||||
success signal is `err == nil` (per `internal/engine/loop.go:118`).
|
||||
EMA scores converge to 1.00 for every arm — see the 2026-05-24
|
||||
`router stats` snapshot where 22 of 25 arm/task pairs sit at
|
||||
exactly 1.00.
|
||||
|
||||
3. **The classifier and bandit live in adjacent code but were
|
||||
designed in separate phases**, so the integration point (`Task`
|
||||
built by SLM classifier → fed to `selectBest`) is just data
|
||||
flow, not a learning loop. The SLM's wins/losses don't update
|
||||
the SLM; the bandit's wins/losses don't change which arms the
|
||||
classifier considers.
|
||||
|
||||
The 100% SLM-failure incident on 2026-05-25 made (1) urgent. The
|
||||
zero-discrimination EMA on 2026-05-24 made (2) urgent. (3) is the
|
||||
underlying integration debt.
|
||||
|
||||
---
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **Killing the existing SLM classifier today.** Phase 1 of this
|
||||
plan is purely additive (encoder feature extraction); the existing
|
||||
classifier stays as a baseline until the new path is measurably
|
||||
better.
|
||||
- **Reimplementing bandit math.** LinUCB and Thompson Sampling are
|
||||
well-understood. The work is the feature pipeline and reward
|
||||
function, not the policy core.
|
||||
- **Choosing a single embedding model permanently.** Phase 1 ships
|
||||
with a default but exposes a `[slm.embedding].model` knob so
|
||||
swapping is config-only.
|
||||
- **The strict-SLM track.** FunctionGemma fine-tuning is the sibling
|
||||
`2026-05-23-tool-router-specialization.md` plan; this plan
|
||||
references it but does not duplicate it.
|
||||
|
||||
---
|
||||
|
||||
## Background — research summary
|
||||
|
||||
Citations follow the user-provided research thread (RouteLLM 2024,
|
||||
ModernBERT 2024, Google FunctionGemma 2025).
|
||||
|
||||
- **RouteLLM** tested router types as a classification problem:
|
||||
similarity routing, matrix factorization, BERT classifier, causal
|
||||
LLM classifier. The BERT classifier was competitive with the
|
||||
causal-LLM classifier at lower cost and latency. Routing is a
|
||||
classification task; treating it like a generation task is paying
|
||||
generation cost for classification value.
|
||||
- **ModernBERT** (Dec 2024) is an encoder-only model with 8k context,
|
||||
trained partly on code, designed for fast classification and
|
||||
retrieval. The 'base' size is ~150M parameters, the 'large' size
|
||||
~400M. Both are tiny compared to even small decoder LLMs.
|
||||
- **FunctionGemma-270M-it** (Aug 2025) is Google's small model
|
||||
fine-tuned for natural-language → function-call output. Google's
|
||||
own positioning materials list **query routing** as a use case.
|
||||
- **Qwen3-Embedding-0.6B** and **BGE-M3** are strong multilingual
|
||||
embedding models with long-context support; either can serve as
|
||||
feature extractors for downstream classification or bandit
|
||||
policies.
|
||||
|
||||
The throughline: **encoder models are the right tool for the
|
||||
classification side of routing**; generative SLMs (FunctionGemma)
|
||||
are the right tool only when the *output* must be a structured
|
||||
decision blob with confidence + tags + fallback. For pure routing,
|
||||
encoder features + bandit policy is cheaper, faster, more
|
||||
deterministic.
|
||||
|
||||
---
|
||||
|
||||
## Approach overview
|
||||
|
||||
Five phases. Phase 1 is near-term; Phases 2–3 are the actual
architectural shift; Phase 4 is the long-arc fine-tune; Phase 5 is
an optional structured-output layer.
|
||||
|
||||
### Phase 1 — Embedding feature scaffold (near-term, additive)
|
||||
|
||||
Add an embedding pipeline that runs alongside the existing
|
||||
classifier. Extract features for every prompt; log them to disk
|
||||
next to the existing quality-EMA. No routing decision changes yet.
|
||||
|
||||
**Why first:** lets us build up a labelled dataset of (prompt,
|
||||
features, arm, outcome) tuples without disturbing today's routing
|
||||
behaviour. Phase 2 trains against this dataset.
|
||||
|
||||
### Phase 2 — Contextual bandit over the feature set
|
||||
|
||||
Once Phase 1 has ~500–1000 labelled observations, swap `selectBest`
|
||||
from heuristic quality + EMA score to a LinUCB-style contextual
|
||||
bandit that takes the embedding features + the existing arm metadata
|
||||
(MaxComplexity, CostWeight, Strengths). The existing EMA quality
|
||||
score becomes one feature among many.
|
||||
|
||||
### Phase 3 — Retire the decoder-SLM classifier
|
||||
|
||||
When Phase 2 routing is measurably better than today's heuristic +
|
||||
EMA blend, the decoder-SLM classifier (currently producing 0
|
||||
useful classifications on the user's setup) is no longer
|
||||
load-bearing. Deprecate it; keep the same `[slm]` config knobs for
backwards compatibility but point them at a different runtime path.
|
||||
|
||||
### Phase 4 — ModernBERT fine-tune
|
||||
|
||||
The off-the-shelf embedding model from Phase 1 (BGE-M3 or
|
||||
Qwen3-Embedding-0.6B by default) gives general-purpose embeddings.
|
||||
Phase 4 fine-tunes a router-specific classification head on top of
|
||||
ModernBERT-base using the labelled dataset accumulated since Phase
|
||||
1. Pure performance win; falls back gracefully to off-the-shelf
|
||||
embeddings if the fine-tune isn't loaded.
|
||||
|
||||
### Phase 5 — FunctionGemma JSON sanity layer (optional)
|
||||
|
||||
For users who want a structured route decision (arm + confidence +
|
||||
fallback) alongside or instead of the bandit output, plug
|
||||
FunctionGemma-270M-it (fine-tuned per the
|
||||
`tool-router-specialization` plan) as a final-stage decision blob
|
||||
emitter. Sits *after* the encoder + bandit, not in front of them.
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 — Embedding feature scaffold (detailed)
|
||||
|
||||
This is the only phase scoped for near-term implementation. The
|
||||
others depend on Phase 1's data accumulation.
|
||||
|
||||
### What lands
|
||||
|
||||
- New package `internal/router/features` with:
|
||||
- `Embedder` interface: `Embed(ctx, prompt string) ([]float32, error)`.
|
||||
- Implementations: `OllamaEmbedder`, `BGE3Embedder`, `NoopEmbedder`
|
||||
(default; returns nil features when no embedding model is
|
||||
configured).
|
||||
- New config `[slm.embedding]` section:
|
||||
  ```toml
  [slm.embedding]
  enabled = false # default off; opt-in
  backend = "ollama" # ollama | bge-m3 | noop
  model = "qwen3-embedding:0.6b" # ollama model tag
  base_url = "" # backend endpoint override
  ```
|
||||
- Feature extraction hook in `internal/engine/loop.go`: after the
|
||||
classifier runs but before `selectBest`, compute the embedding
|
||||
for the prompt and attach to the routing `Task` as an opaque
|
||||
`Features []float32` field.
|
||||
- New on-disk store at `~/.config/gnoma/router-features.jsonl`,
|
||||
one record per observation: `{ts, prompt_hash, features,
|
||||
task_type, arm_id, success, tokens, duration}`.
|
||||
- `prompt_hash` is a SHA-256 of the prompt — never the prompt
|
||||
itself — to keep the file local-only-but-not-secret-laden.
|
||||
- Append-only, atomic-write, incognito-gated, same discipline as
|
||||
the firewall audit log.
|
||||
- No selector change. `selectBest` continues to use today's
|
||||
heuristic + EMA blend. Phase 1 just observes.
|
||||
|
||||
### Why off by default
|
||||
|
||||
Embedding inference adds 50–200ms per prompt depending on backend
|
||||
and model size. That latency is fine for ollama users running on
|
||||
a workstation, painful for users on slower setups. Opt-in keeps
|
||||
the regression risk at zero.
|
||||
|
||||
### Phase 1 task list
|
||||
|
||||
- **F1-1:** Define the `Embedder` interface and `NoopEmbedder` in
|
||||
`internal/router/features/`.
|
||||
- **F1-2:** `OllamaEmbedder` wraps `provider/openaicompat` with the
|
||||
ollama embedding endpoint (`/api/embeddings`).
|
||||
- **F1-3:** Add the `[slm.embedding]` config section to
|
||||
`internal/config/config.go` with the same defaults-via-zero
|
||||
discipline as the rest of the config.
|
||||
- **F1-4:** Wire the embedder into `loop.go` between classifier and
|
||||
selector. Failures log at Debug and don't block routing.
|
||||
- **F1-5:** Append-only feature store in
  `~/.config/gnoma/router-features.jsonl` with atomic writes and
  an incognito gate; nothing is written unless
  `[slm.embedding].enabled = true`.
|
||||
- **F1-6:** Tests covering: embedder mock + observation record;
|
||||
noop embedder produces empty features; incognito skips the
|
||||
store entirely.
|
||||
|
||||
---
|
||||
|
||||
## Phase 2+ — Bandit policy (sketch only; needs data first)
|
||||
|
||||
Spelled out for context. Not for near-term implementation.
|
||||
|
||||
### Feature set per the research
|
||||
|
||||
```
prompt_embedding     — 384-1024 dim depending on model
token_count          — len of tokenized prompt
language             — ISO code from a small lang-detect
has_code             — fenced-block heuristic
has_error_log        — pattern match for stack traces
needs_tools          — from current heuristic
needs_vision         — from [Image:...] markers
estimated_complexity — current heuristic score
requested_latency    — turn-budget hint (future)
arm_context_window   — from arm metadata
arm_vram_cost        — from arm metadata
arm_avg_latency      — from quality EMA
arm_success_rate     — from quality EMA
```
|
||||
|
||||
### Reward function per the research
|
||||
|
||||
```
reward = quality_score
       - latency_penalty
       - vram_penalty
       - failure_penalty
       - escalation_penalty
```
|
||||
|
||||
- `quality_score`: 1.0 on success, 0.0 on hard error today; richer
|
||||
signal (elf-mediated, user thumbs, tool-call success) once the
|
||||
TODO `Bandit selector — design decisions deferred` resolves.
|
||||
- `latency_penalty`: monotone in observed seconds.
|
||||
- `vram_penalty`: monotone in declared VRAM cost.
|
||||
- `failure_penalty`: hard cost on explicit errors (sandbox
|
||||
denied, parse failed).
|
||||
- `escalation_penalty`: cost when a downstream elf had to escalate
|
||||
to a heavier arm because this arm failed.
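
The reward terms above can be combined as in this sketch. The linear penalty shapes and every weight are assumptions chosen only because they are monotone, as the list requires; the plan does not fix them.

```go
package main

import "fmt"

// Outcome is a hypothetical record of one routed turn.
type Outcome struct {
	Success        bool
	LatencySeconds float64
	VRAMCostGB     float64 // declared VRAM cost of the arm (assumed unit)
	HardFailure    bool    // explicit error such as sandbox denied or parse failed
	Escalated      bool    // a downstream elf had to move to a heavier arm
}

// Weights are illustrative tuning knobs, not planned config keys.
type Weights struct {
	Latency, VRAM, Failure, Escalation float64
}

// reward implements quality_score minus the four penalties, with
// quality_score being 1.0 on success and 0.0 otherwise.
func reward(o Outcome, w Weights) float64 {
	r := 0.0
	if o.Success {
		r = 1.0
	}
	r -= w.Latency * o.LatencySeconds // monotone in observed seconds
	r -= w.VRAM * o.VRAMCostGB        // monotone in declared VRAM cost
	if o.HardFailure {
		r -= w.Failure
	}
	if o.Escalated {
		r -= w.Escalation
	}
	return r
}

func main() {
	w := Weights{Latency: 0.05, VRAM: 0.01, Failure: 0.5, Escalation: 0.25}
	fmt.Println(reward(Outcome{Success: true, LatencySeconds: 2, VRAMCostGB: 4}, w))
}
```

Unlike the current `err == nil` signal, two successful arms now score differently when one is slower or heavier, which is what gives the bandit something to discriminate on.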
|
||||
|
||||
### Policy
|
||||
|
||||
LinUCB (linear contextual bandit, deterministic exploration
|
||||
bounded by UCB) or Thompson Sampling (Bayesian, smoother
|
||||
exploration). LinUCB is the safer starting point — fewer
|
||||
hyperparameters, well-known behaviour, easier to debug.
|
||||
|
||||
---

## Risks

- **Latency.** Embedding inference adds 50–200ms per prompt. Phase
  1's opt-in default means users see no regression; Phase 2's
  "make it default" decision requires latency benchmarks first.
- **Data sparsity for fine-tuning (Phase 4).** ModernBERT
  fine-tuning needs ~10k labelled observations to start being
  useful. Phase 1 might run for months before Phase 4 is viable.
  Plan B: synthesise labels from existing prompt logs + rule-based
  pre-labels.
- **Off-the-shelf embedding quality.** BGE-M3 / Qwen3-Embedding
  weren't trained specifically for routing decisions. Phase 4
  exists precisely to close this gap; Phase 1's data accumulation
  is what makes Phase 4 possible.
- **Architectural complexity.** This plan introduces an entire new
  ML pipeline (embedder → feature store → bandit → reward loop).
  Phase 1 keeps it side-by-side with the existing path; Phase 2's
  "swap" decision is reversible because the existing path stays
  in code.
- **Privacy.** The feature store records prompt hashes, not raw
  prompts. It is still a local-only file and uses the same opt-out
  plumbing as the project registry from the config-migration plan.

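A minimal sketch of the prompt-hash key mentioned under Privacy, assuming SHA-256 over the raw prompt bytes. The plan does not name a hash function, and `promptKey` is a hypothetical helper. An unsalted hash of a short or guessable prompt can still be matched by brute force, so this only shows the storage shape and is not a privacy guarantee.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// promptKey returns a hex-encoded SHA-256 digest of the prompt, suitable
// as a feature-store key that does not expose the raw text.
func promptKey(prompt string) string {
	sum := sha256.Sum256([]byte(prompt))
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(promptKey("refactor the config loader"))
}
```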
---

## Open questions

- **Should the feature store be per-project or global?** Per-project
  is more privacy-respecting (one project's prompts don't influence
  another's routing). Global is more data-efficient (more samples
  → better bandit). Phase 1 chooses global by default; revisit
  during Phase 2.
- **How does this interact with `[router].prefer = local|cloud`?**
  Easy answer: the prefer policy stays as a hard tier-shift, applied
  after bandit selection. The bandit picks the best feasible arm; the
  prefer policy is consulted as a final filter / weight.
- **What about CLI-agent subprocess arms?** They proxy to cloud but
  run locally; today's `prefer` treats them as non-local. Bandit
  features should include `is_subprocess` as a distinct feature
  so the policy can learn the user's preferences for those arms
  independent of local/cloud.
- **Cold start.** With no observations, the bandit defaults to
  pure exploration. Should we seed with the existing heuristic
  defaults from `internal/router/defaults.go`? Probably yes:
  warm-start with the curated Strengths as priors.

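One way to picture the warm-start idea from the cold-start question: seed each (arm, task type) estimate with a few pseudo-observations, with a bonus where the arm declares the task type as a strength. Everything in this sketch is an assumption made for illustration: the `estimate` type, the `seedEstimates` helper, the string task types, the base value of 0.5, and the reuse of 0.15 and 3 (the current strength-bonus and min-observations defaults) as the prior bonus and pseudo-count.

```go
package main

import "fmt"

// estimate is a running-mean reward estimate. count includes any
// pseudo-observations used to seed it.
type estimate struct {
	mean  float64
	count float64
}

// observe folds one real reward into the running mean.
func (e *estimate) observe(reward float64) {
	e.count++
	e.mean += (reward - e.mean) / e.count
}

// seedEstimates builds one estimate per (arm, task type). Arms that list
// the task type among their strengths start from base+bonus, others from
// base. pseudoCount controls how many real observations it takes to
// outweigh the prior.
func seedEstimates(strengths map[string][]string, taskTypes []string, base, bonus, pseudoCount float64) map[string]map[string]*estimate {
	out := make(map[string]map[string]*estimate, len(strengths))
	for arm, strong := range strengths {
		isStrong := make(map[string]bool, len(strong))
		for _, t := range strong {
			isStrong[t] = true
		}
		out[arm] = make(map[string]*estimate, len(taskTypes))
		for _, t := range taskTypes {
			prior := base
			if isStrong[t] {
				prior += bonus
			}
			out[arm][t] = &estimate{mean: prior, count: pseudoCount}
		}
	}
	return out
}

func main() {
	// Arm IDs and task-type names are illustrative.
	strengths := map[string][]string{
		"ollama/llama3":     {"explain"},
		"subprocess/claude": {"refactor", "debug"},
	}
	est := seedEstimates(strengths, []string{"explain", "refactor", "debug"}, 0.5, 0.15, 3)
	e := est["ollama/llama3"]["explain"]
	fmt.Printf("prior=%.2f\n", e.mean)
	e.observe(0) // one failure lowers the estimate without erasing the prior
	fmt.Printf("after one failure=%.4f\n", e.mean)
}
```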
---

## Rollout

- **Phase 1** ships as v0.5.0 (additive, opt-in, no behaviour
  change by default). It touches the schema, so it warrants a minor
  bump.
- **Phase 2** ships when Phase 1 has accumulated enough data
  (~500–1000 observations per user). It is opt-in via
  `[router].bandit_policy = "linucb"` initially, becoming the default
  in a later release once it measures better.
- **Phase 3 (deprecation of decoder-SLM classifier)** is a v0.6.x
  conversation, gated on Phase 2 measurably outperforming that
  classifier.
- **Phase 4 (ModernBERT fine-tune)** is v0.7+. It requires the
  fine-tuned model artifact distributed via Ollama or HF, plus
  the auto-download story.
- **Phase 5 (FunctionGemma sanity layer)** is independent of all
  of the above; it lands when the sibling `tool-router-specialization`
  plan justifies it on did-switch-rate telemetry.

---

## Cross-references

- TODO.md entry "Bandit selector — design decisions deferred":
  the strategic question this plan answers in the long run.
- TODO.md entry "Tool-router specialization (functiongemma)": the
  sibling track; complementary, not competing.
- [`2026-05-23-tool-router-specialization.md`](2026-05-23-tool-router-specialization.md): FunctionGemma fine-tune plan.
- [`2026-05-07-gnoma-roadmap.md`](2026-05-07-gnoma-roadmap.md) §Phase 4: the original "re-evaluate bandit learning" entry.
- 2026-05-25 diagnostic session (this conversation): the trigger.
|
||||
@@ -48,6 +48,27 @@ type SLMSection struct {
|
||||
DataDir string `toml:"data_dir"` // llamafile-only: where to put it (empty = XDG default)
|
||||
ExpectedSHA256 string `toml:"expected_sha256"` // llamafile-only: verify hash if non-empty
|
||||
StartupTimeout Duration `toml:"startup_timeout"` // llamafile-only: first-launch wait budget; 0 = default 5s
|
||||
|
||||
// ClassifyTimeout caps each task-classification call to the SLM.
|
||||
// 0 here means "use the built-in default" (15s). Cold-start model
|
||||
// loads + thinking-mode first-token latency can easily exceed 5s
|
||||
// on smaller hardware, so the default is generous. Tune down to
|
||||
// 2-3s on fast setups, or up to 30s for very slow ones.
|
||||
ClassifyTimeout Duration `toml:"classify_timeout"`
|
||||
|
||||
// RegisterAsArm controls whether the SLM model is registered as
|
||||
// a tier-0 execution arm in addition to its classifier role.
|
||||
// nil (absent) → true (preserve historical behaviour: SLM is
|
||||
// both classifier and an execution arm for trivial-complexity
|
||||
// prompts). Explicitly false → SLM is classifier-only; trivial
|
||||
// prompts route to other local arms instead.
|
||||
//
|
||||
// Set this to false when the SLM model is task-specialised
|
||||
// (FunctionGemma, embedding-only models, code-completion-tuned
|
||||
// models) and would produce wrong-shape output if asked to
|
||||
// answer a general prompt. Pointer type so the absent-value
|
||||
// case can be distinguished from explicit false.
|
||||
RegisterAsArm *bool `toml:"register_as_arm"`
|
||||
}
|
||||
|
||||
// ArmConfig tunes routing for a single registered arm. Multiple [[arms]]
|
||||
@@ -157,6 +178,40 @@ type RouterSection struct {
|
||||
// and incognito take priority over this knob. See
|
||||
// docs/superpowers/plans/2026-05-23-prefer-routing-policy.md.
|
||||
Prefer string `toml:"prefer"`
|
||||
|
||||
// Bandit exposes the selector's tuning knobs. Defaults preserve
|
||||
// previous hard-coded behaviour exactly; only set these when you
|
||||
// need to tune the EMA quality tracker for an unusual workload.
|
||||
Bandit BanditSection `toml:"bandit"`
|
||||
}
|
||||
|
||||
// BanditSection holds the scoring knobs for the EMA quality tracker
|
||||
// and the score blend used by the selector. Each field has a sentinel
|
||||
// zero value that means "use the built-in default" so an empty TOML
|
||||
// block is byte-identical to pre-config behaviour. See
|
||||
// internal/router/feedback.go and internal/router/selector.go for the
|
||||
// formulas these knobs feed into.
|
||||
type BanditSection struct {
|
||||
// QualityAlpha is the EMA smoothing factor for arm-quality
|
||||
// observations. Larger values weight recent observations more.
|
||||
// Default: 0.3 (~3-sample memory). 0.0 here means "use default".
|
||||
QualityAlpha float64 `toml:"quality_alpha"`
|
||||
|
||||
// MinObservations is the minimum number of samples required
|
||||
// before observed EMA overrides the heuristic fallback. Default:
|
||||
// 3. 0 here means "use default".
|
||||
MinObservations int `toml:"min_observations"`
|
||||
|
||||
// ObservedWeight is the weight of the observed EMA in the
|
||||
// observed/heuristic blend inside scoreArm: the final quality is
|
||||
// `observed*W + heuristic*(1-W)`. Default: 0.7. 0.0 here means
|
||||
// "use default".
|
||||
ObservedWeight float64 `toml:"observed_weight"`
|
||||
|
||||
// StrengthBonus is the quality bonus added when an arm declares
|
||||
// the current task type in its Strengths list. Default: 0.15.
|
||||
// 0.0 here means "use default".
|
||||
StrengthBonus float64 `toml:"strength_bonus"`
|
||||
}
|
||||
|
||||
// MCPServerConfig defines an MCP server to start and connect to.
|
||||
|
||||
@@ -5,6 +5,8 @@ import (
|
||||
"path/filepath"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"github.com/BurntSushi/toml"
|
||||
)
|
||||
|
||||
func TestDefaults(t *testing.T) {
|
||||
@@ -448,3 +450,50 @@ model = "claude-haiku"
|
||||
t.Errorf("MaxTokens = %d, want 4096 (from global)", cfg.Provider.MaxTokens)
|
||||
}
|
||||
}
|
||||
|
||||
func TestSLMSection_RegisterAsArm_AbsentDefaultsToTrue(t *testing.T) {
|
||||
// Absent field → nil pointer → caller treats as default true,
|
||||
// preserving pre-config behaviour where the SLM is always
|
||||
// registered as an execution arm.
|
||||
var cfg Config
|
||||
if _, err := toml.Decode(`[slm]
|
||||
enabled = true
|
||||
`, &cfg); err != nil {
|
||||
t.Fatalf("decode: %v", err)
|
||||
}
|
||||
if cfg.SLM.RegisterAsArm != nil {
|
||||
t.Errorf("expected nil pointer for absent register_as_arm, got %v", *cfg.SLM.RegisterAsArm)
|
||||
}
|
||||
}
|
||||
|
||||
func TestSLMSection_RegisterAsArm_ExplicitFalse(t *testing.T) {
|
||||
var cfg Config
|
||||
if _, err := toml.Decode(`[slm]
|
||||
enabled = true
|
||||
register_as_arm = false
|
||||
`, &cfg); err != nil {
|
||||
t.Fatalf("decode: %v", err)
|
||||
}
|
||||
if cfg.SLM.RegisterAsArm == nil {
|
||||
t.Fatal("expected non-nil pointer when register_as_arm is set")
|
||||
}
|
||||
if *cfg.SLM.RegisterAsArm {
|
||||
t.Errorf("expected register_as_arm=false to decode as *false, got *true")
|
||||
}
|
||||
}
|
||||
|
||||
func TestSLMSection_RegisterAsArm_ExplicitTrue(t *testing.T) {
|
||||
var cfg Config
|
||||
if _, err := toml.Decode(`[slm]
|
||||
enabled = true
|
||||
register_as_arm = true
|
||||
`, &cfg); err != nil {
|
||||
t.Fatalf("decode: %v", err)
|
||||
}
|
||||
if cfg.SLM.RegisterAsArm == nil {
|
||||
t.Fatal("expected non-nil pointer when register_as_arm is set")
|
||||
}
|
||||
if !*cfg.SLM.RegisterAsArm {
|
||||
t.Errorf("expected register_as_arm=true to decode as *true, got *false")
|
||||
}
|
||||
}
|
||||
|
||||
@@ -186,6 +186,26 @@ func translateRequest(req provider.Request) oai.ChatCompletionNewParams {
|
||||
params.ReasoningEffort = effortToReasoningEffort(req.Thinking.Level)
|
||||
}
|
||||
|
||||
// Honour ResponseFormat. ollama (via OpenAI-compatible endpoint) and
|
||||
// llama.cpp both translate response_format=json_object to a decoding-
|
||||
// time JSON constraint, which is the only reliable way to keep small
|
||||
// models from emitting prose where structured output is required.
|
||||
// Previously this field was silently dropped on the OpenAI path,
|
||||
// which is why the SLM classifier saw a 100% prose-failure rate even
|
||||
// after Move 1 wired ResponseFormat at the gnoma layer.
|
||||
if req.ResponseFormat != nil {
|
||||
switch req.ResponseFormat.Type {
|
||||
case provider.ResponseJSON:
|
||||
params.ResponseFormat = oai.ChatCompletionNewParamsResponseFormatUnion{
|
||||
OfJSONObject: &shared.ResponseFormatJSONObjectParam{},
|
||||
}
|
||||
case provider.ResponseText:
|
||||
params.ResponseFormat = oai.ChatCompletionNewParamsResponseFormatUnion{
|
||||
OfText: &shared.ResponseFormatTextParam{},
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if len(params.Tools) > 0 {
|
||||
choice := "auto"
|
||||
if req.ToolChoice != "" {
|
||||
|
||||
@@ -189,3 +189,47 @@ func TestTranslateRequest_ToolChoiceDefault(t *testing.T) {
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestTranslateRequest_ResponseFormatJSON(t *testing.T) {
|
||||
req := provider.Request{
|
||||
Model: "qwen2.5-coder:1.5b",
|
||||
Messages: []message.Message{
|
||||
{Role: message.RoleUser, Content: []message.Content{{Type: message.ContentText, Text: "hi"}}},
|
||||
},
|
||||
ResponseFormat: &provider.ResponseFormat{Type: provider.ResponseJSON},
|
||||
}
|
||||
params := translateRequest(req)
|
||||
if params.ResponseFormat.OfJSONObject == nil {
|
||||
t.Errorf("expected OfJSONObject set when ResponseFormat=ResponseJSON, got %+v", params.ResponseFormat)
|
||||
}
|
||||
if params.ResponseFormat.OfText != nil {
|
||||
t.Errorf("expected OfText nil when ResponseFormat=ResponseJSON")
|
||||
}
|
||||
}
|
||||
|
||||
func TestTranslateRequest_ResponseFormatText(t *testing.T) {
|
||||
req := provider.Request{
|
||||
Model: "qwen2.5-coder:1.5b",
|
||||
Messages: []message.Message{
|
||||
{Role: message.RoleUser, Content: []message.Content{{Type: message.ContentText, Text: "hi"}}},
|
||||
},
|
||||
ResponseFormat: &provider.ResponseFormat{Type: provider.ResponseText},
|
||||
}
|
||||
params := translateRequest(req)
|
||||
if params.ResponseFormat.OfText == nil {
|
||||
t.Errorf("expected OfText set when ResponseFormat=ResponseText, got %+v", params.ResponseFormat)
|
||||
}
|
||||
}
|
||||
|
||||
func TestTranslateRequest_ResponseFormatUnset(t *testing.T) {
|
||||
req := provider.Request{
|
||||
Model: "qwen2.5-coder:1.5b",
|
||||
Messages: []message.Message{
|
||||
{Role: message.RoleUser, Content: []message.Content{{Type: message.ContentText, Text: "hi"}}},
|
||||
},
|
||||
}
|
||||
params := translateRequest(req)
|
||||
if params.ResponseFormat.OfJSONObject != nil || params.ResponseFormat.OfText != nil {
|
||||
t.Errorf("expected zero-valued ResponseFormat when not set, got %+v", params.ResponseFormat)
|
||||
}
|
||||
}
|
||||
|
||||
@@ -109,8 +109,19 @@ var knownAgents = []CLIAgent{
 		// structured-output flag and no image-input mechanism. JSON support
 		// is faked via PromptResponseFormat (best-effort, model-dependent);
 		// see TODO.md for tracking native stream-json support.
+		//
+		// ToolUse is false on purpose. agy streams plain text and the
+		// agyParser turns every line into an EventTextDelta — there is
+		// no path for a structured ToolCall event to come back. With
+		// ToolUse=true the router would dispatch tool-needing tasks
+		// (security_review, spawn_elfs, file edit) to agy; the
+		// underlying Gemini model would describe calling the tool in
+		// prose (invented UUIDs and "I will pause now"-style stubs),
+		// the engine would receive only text, and the turn would hang
+		// waiting for a tool call that never arrives. Flip back to
+		// true when native stream-json lands.
 		Capabilities: provider.Capabilities{
-			ToolUse: true,
+			ToolUse: false,
 			ContextWindow: 200000,
 		},
 		PromptResponseFormat: true,
|
||||
|
||||
@@ -57,12 +57,12 @@ func benchTasks() []Task {
|
||||
func BenchmarkSelectBest(b *testing.B) {
|
||||
arms := benchArms()
|
||||
tasks := benchTasks()
|
||||
qt := NewQualityTracker()
|
||||
qt := NewQualityTracker(0, 0)
|
||||
|
||||
b.ResetTimer()
|
||||
for b.Loop() {
|
||||
for _, task := range tasks {
|
||||
selectBest(qt, arms, task, PreferAuto)
|
||||
selectBest(qt, BanditParams{}, arms, task, PreferAuto)
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -99,13 +99,13 @@ func BenchmarkRouterSelect(b *testing.B) {
|
||||
|
||||
func BenchmarkScoreArm(b *testing.B) {
|
||||
arms := benchArms()
|
||||
qt := NewQualityTracker()
|
||||
qt := NewQualityTracker(0, 0)
|
||||
task := Task{Type: TaskGeneration, Priority: PriorityNormal, EstimatedTokens: 2000, RequiresTools: true, ComplexityScore: 0.5}
|
||||
|
||||
b.ResetTimer()
|
||||
for b.Loop() {
|
||||
for _, arm := range arms {
|
||||
scoreArm(qt, arm, task)
|
||||
scoreArm(qt, BanditParams{}, arm, task)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -93,16 +93,27 @@ func DiscoverOllama(ctx context.Context, baseURL string, probeCache map[string]O
 			Size: m.Size,
 		}
 
+		// Always probe; the cache is optional. Previously nil-cache was
+		// treated as "skip probing entirely", which left SupportsTools
+		// at its zero value (false) for every model — every ollama-
+		// discovered arm then got marked as tool-unsupported and
+		// rejected by filterFeasible for any tool-requiring task. main.go
+		// passes nil from the synchronous discovery path; we still want
+		// real probe data there.
+		var result OllamaProbeResult
 		if probeCache != nil {
-			result, ok := probeCache[m.Name]
-			if !ok {
+			if cached, ok := probeCache[m.Name]; ok {
+				result = cached
+			} else {
 				result = probeOllamaModel(ctx, baseURL, m.Name)
 				probeCache[m.Name] = result
 			}
-			dm.SupportsTools = result.SupportsTools
-			dm.SupportsVision = result.SupportsVision
-			dm.ContextSize = result.ContextSize
+		} else {
+			result = probeOllamaModel(ctx, baseURL, m.Name)
 		}
+		dm.SupportsTools = result.SupportsTools
+		dm.SupportsVision = result.SupportsVision
+		dm.ContextSize = result.ContextSize
 
 		if dm.ContextSize == 0 {
 			dm.ContextSize = defaultOllamaContextSize
|
||||
|
||||
@@ -2,9 +2,15 @@ package router
 
 import "sync"
 
+// Built-in defaults for the bandit knobs. Surfaced via
+// [router.bandit] config keys; see BanditParams in router.go. Kept
+// here so the QualityTracker has a sensible fallback when constructed
+// without explicit parameters (tests, ad-hoc callers).
 const (
-	qualityAlpha = 0.3 // EMA smoothing factor (~3-sample memory)
-	minObservations = 3 // min samples before observed score overrides heuristic
+	defaultQualityAlpha = 0.3 // EMA smoothing factor (~3-sample memory)
+	defaultMinObservations = 3 // min samples before observed score overrides heuristic
+	defaultObservedWeight = 0.7 // weight of observed score in observed/heuristic blend
+	defaultStrengthBonus = 0.15
 )
 
 // EMAScore tracks an exponential moving average quality score.
@@ -19,13 +25,27 @@ type QualityTracker struct {
 	mu sync.RWMutex
 	scores map[ArmID]map[TaskType]*EMAScore
 	classifierCount map[ClassifierSource]int
+
+	// Configurable knobs — set via NewQualityTracker. Pass 0 for any
+	// argument to keep the built-in default.
+	alpha float64
+	minObservations int
 }
 
-// NewQualityTracker returns an empty QualityTracker.
-func NewQualityTracker() *QualityTracker {
+// NewQualityTracker returns an empty QualityTracker. Pass 0 for any
+// argument to keep the built-in default (alpha=0.3, minObs=3).
+func NewQualityTracker(alpha float64, minObs int) *QualityTracker {
+	if alpha == 0 {
+		alpha = defaultQualityAlpha
+	}
+	if minObs == 0 {
+		minObs = defaultMinObservations
+	}
 	return &QualityTracker{
 		scores: make(map[ArmID]map[TaskType]*EMAScore),
 		classifierCount: make(map[ClassifierSource]int),
+		alpha: alpha,
+		minObservations: minObs,
 	}
 }
@@ -71,7 +91,7 @@ func (qt *QualityTracker) Record(armID ArmID, taskType TaskType, success bool) {
 	if s.Count == 0 {
 		s.Value = observation
 	} else {
-		s.Value = qualityAlpha*observation + (1-qualityAlpha)*s.Value
+		s.Value = qt.alpha*observation + (1-qt.alpha)*s.Value
 	}
 	s.Count++
 }
@@ -86,7 +106,7 @@ func (qt *QualityTracker) Quality(armID ArmID, taskType TaskType) (score float64
 		return 0, false
 	}
 	s, ok := m[taskType]
-	if !ok || s.Count < minObservations {
+	if !ok || s.Count < qt.minObservations {
 		return 0, false
 	}
 	return s.Value, true
|
||||
|
||||
@@ -8,7 +8,7 @@ import (
|
||||
)
|
||||
|
||||
func TestQualityTracker_NoDataReturnsHeuristic(t *testing.T) {
|
||||
qt := router.NewQualityTracker()
|
||||
qt := router.NewQualityTracker(0, 0)
|
||||
_, hasData := qt.Quality("arm:model", router.TaskGeneration)
|
||||
if hasData {
|
||||
t.Error("expected no data for unobserved arm")
|
||||
@@ -16,7 +16,7 @@ func TestQualityTracker_NoDataReturnsHeuristic(t *testing.T) {
|
||||
}
|
||||
|
||||
func TestQualityTracker_RecordUpdatesEMA(t *testing.T) {
|
||||
qt := router.NewQualityTracker()
|
||||
qt := router.NewQualityTracker(0, 0)
|
||||
for i := 0; i < 3; i++ {
|
||||
qt.Record("arm:model", router.TaskGeneration, true)
|
||||
}
|
||||
@@ -30,7 +30,7 @@ func TestQualityTracker_RecordUpdatesEMA(t *testing.T) {
|
||||
}
|
||||
|
||||
func TestQualityTracker_AllFailuresLowScore(t *testing.T) {
|
||||
qt := router.NewQualityTracker()
|
||||
qt := router.NewQualityTracker(0, 0)
|
||||
for i := 0; i < 5; i++ {
|
||||
qt.Record("arm:model", router.TaskDebug, false)
|
||||
}
|
||||
@@ -41,7 +41,7 @@ func TestQualityTracker_AllFailuresLowScore(t *testing.T) {
|
||||
}
|
||||
|
||||
func TestQualityTracker_ConcurrentSafe(t *testing.T) {
|
||||
qt := router.NewQualityTracker()
|
||||
qt := router.NewQualityTracker(0, 0)
|
||||
done := make(chan struct{})
|
||||
for i := 0; i < 10; i++ {
|
||||
go func(success bool) {
|
||||
@@ -113,3 +113,45 @@ func TestQualityTracker_InsufficientDataFallsBackToHeuristic(t *testing.T) {
|
||||
}
|
||||
decision.Rollback()
|
||||
}
|
||||
|
||||
func TestQualityTracker_CustomAlphaShortensMemory(t *testing.T) {
|
||||
// alpha=0.9 weights the latest sample heavily; after a single
|
||||
// failure the score should drop further than with the default 0.3.
|
||||
fast := router.NewQualityTracker(0.9, 0)
|
||||
slow := router.NewQualityTracker(0.0, 0) // 0 → default 0.3
|
||||
|
||||
for _, qt := range []*router.QualityTracker{fast, slow} {
|
||||
// Build up history at the high end with 5 successes.
|
||||
for i := 0; i < 5; i++ {
|
||||
qt.Record("arm:m", router.TaskGeneration, true)
|
||||
}
|
||||
// One failure.
|
||||
qt.Record("arm:m", router.TaskGeneration, false)
|
||||
}
|
||||
|
||||
fastScore, _ := fast.Quality("arm:m", router.TaskGeneration)
|
||||
slowScore, _ := slow.Quality("arm:m", router.TaskGeneration)
|
||||
|
||||
if !(fastScore < slowScore) {
|
||||
t.Errorf("expected fast alpha (0.9) to drop quality faster than default (0.3): fast=%f slow=%f", fastScore, slowScore)
|
||||
}
|
||||
}
|
||||
|
||||
func TestQualityTracker_CustomMinObservationsGatesScore(t *testing.T) {
|
||||
// minObs=10 means Quality should return hasData=false until 10
|
||||
// observations are recorded, even though the default would say
|
||||
// "yes" after 3.
|
||||
qt := router.NewQualityTracker(0, 10)
|
||||
for i := 0; i < 5; i++ {
|
||||
qt.Record("arm:m", router.TaskGeneration, true)
|
||||
}
|
||||
if _, hasData := qt.Quality("arm:m", router.TaskGeneration); hasData {
|
||||
t.Error("expected hasData=false at 5 observations with minObs=10")
|
||||
}
|
||||
for i := 0; i < 5; i++ {
|
||||
qt.Record("arm:m", router.TaskGeneration, true)
|
||||
}
|
||||
if _, hasData := qt.Quality("arm:m", router.TaskGeneration); !hasData {
|
||||
t.Error("expected hasData=true after 10 observations with minObs=10")
|
||||
}
|
||||
}
|
||||
|
||||
@@ -8,7 +8,7 @@ import (
|
||||
)
|
||||
|
||||
func TestQualityTracker_SnapshotRestore_RoundTrip(t *testing.T) {
|
||||
qt := router.NewQualityTracker()
|
||||
qt := router.NewQualityTracker(0, 0)
|
||||
// Record some outcomes
|
||||
qt.Record("anthropic/claude-3-5-sonnet", router.TaskGeneration, true)
|
||||
qt.Record("anthropic/claude-3-5-sonnet", router.TaskGeneration, true)
|
||||
@@ -33,7 +33,7 @@ func TestQualityTracker_SnapshotRestore_RoundTrip(t *testing.T) {
|
||||
}
|
||||
|
||||
// Restore into a fresh tracker
|
||||
qt2 := router.NewQualityTracker()
|
||||
qt2 := router.NewQualityTracker(0, 0)
|
||||
qt2.Restore(restored)
|
||||
|
||||
// After restore, Quality() should return data (Count >= minObservations=3)
|
||||
@@ -47,7 +47,7 @@ func TestQualityTracker_SnapshotRestore_RoundTrip(t *testing.T) {
|
||||
}
|
||||
|
||||
func TestQualityTracker_Snapshot_Empty(t *testing.T) {
|
||||
qt := router.NewQualityTracker()
|
||||
qt := router.NewQualityTracker(0, 0)
|
||||
snap := qt.Snapshot()
|
||||
if snap.Scores == nil {
|
||||
t.Error("scores map should be initialized (not nil)")
|
||||
@@ -58,7 +58,7 @@ func TestQualityTracker_Snapshot_Empty(t *testing.T) {
|
||||
}
|
||||
|
||||
func TestQualityTracker_ClassifierCounts_RecordAndSnapshot(t *testing.T) {
|
||||
qt := router.NewQualityTracker()
|
||||
qt := router.NewQualityTracker(0, 0)
|
||||
qt.RecordClassifier(router.ClassifierHeuristic)
|
||||
qt.RecordClassifier(router.ClassifierSLM)
|
||||
qt.RecordClassifier(router.ClassifierSLM)
|
||||
@@ -92,7 +92,7 @@ func TestQualityTracker_ClassifierCounts_RecordAndSnapshot(t *testing.T) {
|
||||
if err := json.Unmarshal(data, &restored); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
qt2 := router.NewQualityTracker()
|
||||
qt2 := router.NewQualityTracker(0, 0)
|
||||
qt2.Restore(restored)
|
||||
if qt2.ClassifierCounts()[router.ClassifierSLM] != 2 {
|
||||
t.Errorf("restored slm count = %d, want 2", qt2.ClassifierCounts()[router.ClassifierSLM])
|
||||
@@ -107,7 +107,7 @@ func TestQualityTracker_Restore_BackCompat_NoClassifierCounts(t *testing.T) {
|
||||
if err := json.Unmarshal(legacy, &snap); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
qt := router.NewQualityTracker()
|
||||
qt := router.NewQualityTracker(0, 0)
|
||||
qt.Restore(snap)
|
||||
if qt.ClassifierCounts() == nil {
|
||||
t.Error("ClassifierCounts() must return a non-nil map after restoring old snapshot")
|
||||
@@ -122,7 +122,7 @@ func TestQualityTracker_Restore_BackCompat_NoClassifierCounts(t *testing.T) {
|
||||
}
|
||||
|
||||
func TestQualityTracker_Restore_Replaces(t *testing.T) {
|
||||
qt := router.NewQualityTracker()
|
||||
qt := router.NewQualityTracker(0, 0)
|
||||
qt.Record("arm-a", router.TaskDebug, true)
|
||||
qt.Record("arm-a", router.TaskDebug, true)
|
||||
qt.Record("arm-a", router.TaskDebug, true)
|
||||
|
||||
@@ -27,6 +27,7 @@ type Router struct {
|
||||
preferPolicy PreferPolicy
|
||||
|
||||
quality *QualityTracker
|
||||
bandit BanditParams
|
||||
}
|
||||
|
||||
// PreferPolicy biases the scoring step toward local or cloud arms.
|
||||
@@ -77,6 +78,41 @@ func (p PreferPolicy) String() string {
|
||||
|
||||
type Config struct {
|
||||
Logger *slog.Logger
|
||||
// Bandit tunes the selector's scoring knobs. Pass a zero value to
|
||||
// keep all pre-config behaviour byte-identical; set individual
|
||||
// fields to override the corresponding default.
|
||||
Bandit BanditParams
|
||||
}
|
||||
|
||||
// BanditParams controls the EMA quality tracker and score blend used
|
||||
// by the selector. Each field has a "use default" sentinel (0 for
|
||||
// floats and ints) so a zero-valued BanditParams is byte-identical to
|
||||
// the pre-config hardcoded constants. Defaults are defined in
|
||||
// resolveBanditParams below.
|
||||
type BanditParams struct {
|
||||
QualityAlpha float64
|
||||
MinObservations int
|
||||
ObservedWeight float64
|
||||
StrengthBonus float64
|
||||
}
|
||||
|
||||
// resolveBanditParams fills in the built-in defaults for any field
|
||||
// left at its zero value. Centralised so the same defaults apply
|
||||
// across NewQualityTracker, scoreArm, and any future caller.
|
||||
func resolveBanditParams(p BanditParams) BanditParams {
|
||||
if p.QualityAlpha == 0 {
|
||||
p.QualityAlpha = defaultQualityAlpha
|
||||
}
|
||||
if p.MinObservations == 0 {
|
||||
p.MinObservations = defaultMinObservations
|
||||
}
|
||||
if p.ObservedWeight == 0 {
|
||||
p.ObservedWeight = defaultObservedWeight
|
||||
}
|
||||
if p.StrengthBonus == 0 {
|
||||
p.StrengthBonus = defaultStrengthBonus
|
||||
}
|
||||
return p
|
||||
}
|
||||
|
||||
func New(cfg Config) *Router {
|
||||
@@ -84,10 +120,12 @@ func New(cfg Config) *Router {
|
||||
if logger == nil {
|
||||
logger = slog.Default()
|
||||
}
|
||||
params := resolveBanditParams(cfg.Bandit)
|
||||
return &Router{
|
||||
arms: make(map[ArmID]*Arm),
|
||||
logger: logger,
|
||||
quality: NewQualityTracker(),
|
||||
quality: NewQualityTracker(params.QualityAlpha, params.MinObservations),
|
||||
bandit: params,
|
||||
}
|
||||
}
|
||||
|
||||
@@ -172,7 +210,7 @@ func (r *Router) Select(task Task) RoutingDecision {
|
||||
}
|
||||
|
||||
// Select best
|
||||
best := selectBest(r.quality, feasible, task, r.preferPolicy)
|
||||
best := selectBest(r.quality, r.bandit, feasible, task, r.preferPolicy)
|
||||
if best == nil {
|
||||
return RoutingDecision{Error: fmt.Errorf("selection failed")}
|
||||
}
|
||||
|
||||
@@ -262,7 +262,7 @@ func TestSelectBest_PrefersToolSupport(t *testing.T) {
|
||||
}
|
||||
|
||||
task := Task{Type: TaskGeneration, RequiresTools: true, Priority: PriorityNormal}
|
||||
best := selectBest(nil, []*Arm{withoutTools, withTools}, task, PreferAuto)
|
||||
best := selectBest(nil, BanditParams{}, []*Arm{withoutTools, withTools}, task, PreferAuto)
|
||||
|
||||
if best.ID != "a/with-tools" {
|
||||
t.Errorf("should prefer arm with tool support, got %s", best.ID)
|
||||
@@ -282,7 +282,7 @@ func TestSelectBest_PrefersThinkingForPlanning(t *testing.T) {
|
||||
}
|
||||
|
||||
task := Task{Type: TaskPlanning, RequiresTools: true, Priority: PriorityNormal, EstimatedTokens: 5000}
|
||||
best := selectBest(nil, []*Arm{noThinking, thinking}, task, PreferAuto)
|
||||
best := selectBest(nil, BanditParams{}, []*Arm{noThinking, thinking}, task, PreferAuto)
|
||||
|
||||
if best.ID != "a/thinking" {
|
||||
t.Errorf("should prefer thinking model for planning, got %s", best.ID)
|
||||
@@ -625,7 +625,7 @@ func TestSelectBest_SmallArmWinsTrivialTask(t *testing.T) {
|
||||
Capabilities: provider.Capabilities{ToolUse: false},
|
||||
}
|
||||
task := Task{Type: TaskExplain, ComplexityScore: 0.05, RequiresTools: false}
|
||||
got := selectBest(nil, []*Arm{cliArm, smallArm}, task, PreferAuto)
|
||||
got := selectBest(nil, BanditParams{}, []*Arm{cliArm, smallArm}, task, PreferAuto)
|
||||
if got != smallArm {
|
||||
t.Errorf("selectBest = %v, want smallArm", got)
|
||||
}
|
||||
@@ -647,7 +647,7 @@ func TestSelectBest_CLIAgentWinsComplexTask(t *testing.T) {
|
||||
Capabilities: provider.Capabilities{ToolUse: false},
|
||||
}
|
||||
task := Task{Type: TaskRefactor, ComplexityScore: 0.7, RequiresTools: true}
|
||||
got := selectBest(nil, []*Arm{cliArm, smallArm}, task, PreferAuto)
|
||||
got := selectBest(nil, BanditParams{}, []*Arm{cliArm, smallArm}, task, PreferAuto)
|
||||
if got != cliArm {
|
||||
t.Errorf("selectBest = %v, want cliArm", got)
|
||||
}
|
||||
@@ -672,21 +672,21 @@ func TestSelectBest_TierPreference(t *testing.T) {
|
||||
task := Task{Type: TaskGeneration, Priority: PriorityNormal, EstimatedTokens: 1000}
|
||||
|
||||
t.Run("CLI beats local and API", func(t *testing.T) {
|
||||
best := selectBest(nil, []*Arm{apiArm, localArm, cliArm}, task, PreferAuto)
|
||||
best := selectBest(nil, BanditParams{}, []*Arm{apiArm, localArm, cliArm}, task, PreferAuto)
|
||||
if best.ID != "subprocess/claude" {
|
||||
t.Errorf("want subprocess/claude (tier 0), got %s", best.ID)
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("local beats API when no CLI", func(t *testing.T) {
|
||||
best := selectBest(nil, []*Arm{apiArm, localArm}, task, PreferAuto)
|
||||
best := selectBest(nil, BanditParams{}, []*Arm{apiArm, localArm}, task, PreferAuto)
|
||||
if best.ID != "ollama/llama3" {
|
||||
t.Errorf("want ollama/llama3 (tier 1), got %s", best.ID)
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("API selected when only option", func(t *testing.T) {
|
||||
best := selectBest(nil, []*Arm{apiArm}, task, PreferAuto)
|
||||
best := selectBest(nil, BanditParams{}, []*Arm{apiArm}, task, PreferAuto)
|
||||
if best == nil || best.ID != "mistral/mistral-large" {
|
||||
t.Errorf("want mistral/mistral-large (tier 2), got %v", best)
|
||||
}
|
||||
|
||||
+49
-13
@@ -1,6 +1,7 @@
|
||||
package router
|
||||
|
||||
import (
|
||||
"log/slog"
|
||||
"math"
|
||||
)
|
||||
|
||||
@@ -98,7 +99,7 @@ func armBaseTier(arm *Arm, task Task) int {
|
||||
//
|
||||
// Step 2 (fallback): walk tiers low→high. Within a tier, highest-scoring
|
||||
// arm wins.
|
||||
func selectBest(qt *QualityTracker, arms []*Arm, task Task, prefer PreferPolicy) *Arm {
|
||||
func selectBest(qt *QualityTracker, params BanditParams, arms []*Arm, task Task, prefer PreferPolicy) *Arm {
|
||||
if len(arms) == 0 {
|
||||
return nil
|
||||
}
|
||||
@@ -110,7 +111,7 @@ func selectBest(qt *QualityTracker, arms []*Arm, task Task, prefer PreferPolicy)
|
||||
}
|
||||
}
|
||||
if len(promoted) > 0 {
|
||||
return bestScored(qt, promoted, task, prefer)
|
||||
return bestScored(qt, params, promoted, task, prefer)
|
||||
}
|
||||
|
||||
// Walk tiers low→high. armTier returns up to 5 when prefer is set
|
||||
@@ -124,18 +125,18 @@ func selectBest(qt *QualityTracker, arms []*Arm, task Task, prefer PreferPolicy)
|
||||
}
|
||||
}
|
||||
if len(inTier) > 0 {
|
||||
return bestScored(qt, inTier, task, prefer)
|
||||
return bestScored(qt, params, inTier, task, prefer)
|
||||
}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// bestScored returns the highest-scoring arm within a set.
|
||||
func bestScored(qt *QualityTracker, arms []*Arm, task Task, prefer PreferPolicy) *Arm {
|
||||
func bestScored(qt *QualityTracker, params BanditParams, arms []*Arm, task Task, prefer PreferPolicy) *Arm {
|
||||
var best *Arm
|
||||
bestScore := math.Inf(-1)
|
||||
for _, arm := range arms {
|
||||
score := scoreArm(qt, arm, task) * policyMultiplier(arm, prefer)
|
||||
score := scoreArm(qt, params, arm, task) * policyMultiplier(arm, prefer)
|
||||
if score > bestScore {
|
||||
bestScore = score
|
||||
best = arm
|
||||
@@ -172,13 +173,12 @@ func policyMultiplier(arm *Arm, p PreferPolicy) float64 {
|
||||
}
|
||||
}
|
||||
|
||||
// strengthScoreBonus is added to quality when an arm's Strengths list
|
||||
// matches the incoming task type. Tunable in one place.
|
||||
const strengthScoreBonus = 0.15
|
||||
|
||||
// scoreArm computes a quality/cost score for an arm.
|
||||
// When the quality tracker has sufficient observations, blends observed EMA
|
||||
// (70%) with heuristic (30%). Falls back to pure heuristic otherwise.
|
||||
// (default 70%) with heuristic (default 30%). Falls back to pure heuristic
|
||||
// otherwise. The blend ratio and strength bonus are tunable via
|
||||
// BanditParams (config: [router.bandit]); a zero-valued params falls back
|
||||
// to the built-in defaults.
|
||||
//
|
||||
// Strengths add a fixed bonus to quality when matching task.Type. CostWeight
|
||||
// dampens the cost penalty linearly:
|
||||
@@ -189,16 +189,17 @@ const strengthScoreBonus = 0.15
|
||||
// the original effectiveCost == cost. With CostWeight=0 cost is fully
|
||||
// ignored (effectiveCost = 1.0). Local arms with sub-1 raw costs are not
|
||||
// amplified by fractional weights (the linear formula stays monotone).
|
||||
func scoreArm(qt *QualityTracker, arm *Arm, task Task) float64 {
|
||||
func scoreArm(qt *QualityTracker, params BanditParams, arm *Arm, task Task) float64 {
|
||||
params = resolveBanditParams(params)
|
||||
hq := heuristicQuality(arm, task)
|
||||
quality := hq
|
||||
if qt != nil {
|
||||
if observed, hasData := qt.Quality(arm.ID, task.Type); hasData {
|
||||
quality = 0.7*observed + 0.3*hq
|
||||
quality = params.ObservedWeight*observed + (1-params.ObservedWeight)*hq
|
||||
}
|
||||
}
|
||||
if arm.HasStrength(task.Type) {
|
||||
quality += strengthScoreBonus
|
||||
quality += params.StrengthBonus
|
||||
}
|
||||
value := task.ValueScore()
|
||||
rawCost := effectiveCost(arm, task)
|
||||
@@ -281,20 +282,39 @@ func effectiveCost(arm *Arm, task Task) float64 {
|
||||
// filterFeasible returns arms that can handle the task (tools, pool capacity, quality).
|
||||
// Arms that pass tool and pool checks but fall below the task's minimum quality threshold
|
||||
// are collected separately and used as a last resort if no arm meets the threshold.
|
||||
//
|
||||
// When the result is empty the caller surfaces a generic "no feasible arm"
|
||||
// error; rejection reasons are logged here at slog.Debug per-arm so users
|
||||
// debugging "why did the router reject everything?" with --verbose can see
|
||||
// the actual constraint each arm tripped instead of guessing.
|
||||
func filterFeasible(arms []*Arm, task Task) []*Arm {
|
||||
threshold := DefaultThresholds[task.Type]
|
||||
|
||||
var feasible []*Arm
|
||||
var belowQuality []*Arm // passed tool+pool but scored below minimum quality
|
||||
|
||||
reject := func(arm *Arm, reason string, fields ...any) {
|
||||
base := []any{
|
||||
"arm", arm.ID,
|
||||
"task", task.Type,
|
||||
"complexity", task.ComplexityScore,
|
||||
"reason", reason,
|
||||
}
|
||||
slog.Debug("filterFeasible: rejected", append(base, fields...)...)
|
||||
}
|
||||
|
||||
for _, arm := range arms {
|
||||
// Complexity ceiling: zero means no ceiling (preserves behavior for all existing arms).
|
||||
if arm.MaxComplexity > 0 && task.ComplexityScore > arm.MaxComplexity {
|
||||
reject(arm, "complexity_exceeds_max",
|
||||
"max_complexity", arm.MaxComplexity)
|
||||
continue
|
||||
}
|
||||
|
||||
// Must support tools if task requires them
|
||||
if task.RequiresTools && !arm.SupportsTools() {
|
||||
reject(arm, "tools_required_but_unsupported",
|
||||
"tool_use_capability", arm.Capabilities.ToolUse)
|
||||
continue
|
||||
}
|
||||
|
||||
@@ -303,11 +323,15 @@ func filterFeasible(arms []*Arm, task Task) []*Arm {
|
||||
// cannot consume the image bytes, so degrading to it would silently
|
||||
// drop the image and confuse the model.
|
||||
if task.RequiresVision && !arm.Capabilities.Vision {
|
||||
reject(arm, "vision_required_but_unsupported",
|
||||
"vision_capability", arm.Capabilities.Vision)
|
||||
continue
|
||||
}
|
||||
|
||||
// Must support the required effort level (EffortAuto always passes)
|
||||
if !arm.Capabilities.SupportsEffort(task.RequiredEffort) {
|
||||
reject(arm, "effort_level_unsupported",
|
||||
"required_effort", task.RequiredEffort)
|
||||
continue
|
||||
}
|
||||
|
||||
@@ -316,6 +340,8 @@ func filterFeasible(arms []*Arm, task Task) []*Arm {
|
||||
for _, pool := range arm.Pools {
|
||||
pool.CheckReset()
|
||||
if !pool.CanAfford(arm.ID, task.EstimatedTokens) {
|
||||
reject(arm, "pool_capacity_exceeded",
|
||||
"estimated_tokens", task.EstimatedTokens)
|
||||
poolsOK = false
|
||||
break
|
||||
}
|
||||
@@ -333,6 +359,16 @@ func filterFeasible(arms []*Arm, task Task) []*Arm {
|
||||
feasible = append(feasible, arm)
|
||||
}
|
||||
|
||||
if len(feasible) == 0 && len(belowQuality) == 0 {
|
||||
slog.Debug("filterFeasible: no arms feasible at any quality level",
|
||||
"task", task.Type,
|
||||
"complexity", task.ComplexityScore,
|
||||
"requires_tools", task.RequiresTools,
|
||||
"requires_vision", task.RequiresVision,
|
||||
"arms_considered", len(arms),
|
||||
)
|
||||
}
|
||||
|
||||
// Degrade gracefully: if no arm meets quality threshold, use below-quality ones
|
||||
if len(feasible) == 0 && len(belowQuality) > 0 {
|
||||
return belowQuality
|
||||
|
||||
@@ -65,17 +65,17 @@ func TestScoreArm_CostWeightAffectsArmComparison(t *testing.T) {
|
||||
|
||||
// CostWeight=1.0: cost dominates, cheap arm wins.
|
||||
cheap.CostWeight, expensive.CostWeight = 1.0, 1.0
|
||||
if scoreArm(nil, cheap, task) <= scoreArm(nil, expensive, task) {
|
||||
if scoreArm(nil, BanditParams{}, cheap, task) <= scoreArm(nil, BanditParams{}, expensive, task) {
|
||||
t.Errorf("CostWeight=1.0: cheap arm should beat expensive arm; cheap=%v expensive=%v",
|
||||
scoreArm(nil, cheap, task), scoreArm(nil, expensive, task))
|
||||
scoreArm(nil, BanditParams{}, cheap, task), scoreArm(nil, BanditParams{}, expensive, task))
|
||||
}
|
||||
|
||||
// CostWeight=0.0: cost ignored, quality alone decides → expensive (better
|
||||
// context window) wins.
|
||||
cheap.CostWeight, expensive.CostWeight = 0.001, 0.001
|
||||
if scoreArm(nil, expensive, task) <= scoreArm(nil, cheap, task) {
|
||||
if scoreArm(nil, BanditParams{}, expensive, task) <= scoreArm(nil, BanditParams{}, cheap, task) {
|
||||
t.Errorf("CostWeight~0: higher-quality expensive arm should beat cheap arm; expensive=%v cheap=%v",
|
||||
scoreArm(nil, expensive, task), scoreArm(nil, cheap, task))
|
||||
scoreArm(nil, BanditParams{}, expensive, task), scoreArm(nil, BanditParams{}, cheap, task))
|
||||
}
|
||||
}
|
||||
|
||||
@@ -140,8 +140,8 @@ func TestScoreArm_StrengthBonus(t *testing.T) {
|
||||
}
|
||||
task := Task{Type: TaskSecurityReview, EstimatedTokens: 5000, RequiresTools: true, Priority: PriorityNormal}
|
||||
|
||||
a := scoreArm(nil, withoutStrength, task)
|
||||
b := scoreArm(nil, withStrength, task)
|
||||
a := scoreArm(nil, BanditParams{}, withoutStrength, task)
|
||||
b := scoreArm(nil, BanditParams{}, withStrength, task)
|
||||
if !(b > a) {
|
||||
t.Errorf("strength-tagged arm score (%v) should exceed plain arm score (%v)", b, a)
|
||||
}
|
||||
@@ -160,8 +160,8 @@ func TestScoreArm_StrengthBonusDoesNotApplyToOtherTasks(t *testing.T) {
|
||||
}
|
||||
task := Task{Type: TaskDebug, EstimatedTokens: 5000, RequiresTools: true, Priority: PriorityNormal}
|
||||
|
||||
a := scoreArm(nil, plain, task)
|
||||
b := scoreArm(nil, tagged, task)
|
||||
a := scoreArm(nil, BanditParams{}, plain, task)
|
||||
b := scoreArm(nil, BanditParams{}, tagged, task)
|
||||
if math.Abs(a-b) > 1e-9 {
|
||||
t.Errorf("non-matching task should ignore Strengths: plain=%v tagged=%v", a, b)
|
||||
}
|
||||
@@ -184,7 +184,7 @@ func TestSelectBest_StrengthPromotedArmBeatsCLIAgent(t *testing.T) {
|
||||
}
|
||||
|
||||
task := Task{Type: TaskSecurityReview, EstimatedTokens: 5000, RequiresTools: true, Priority: PriorityNormal}
|
||||
got := selectBest(nil, []*Arm{cliAgent, opus}, task, PreferAuto)
|
||||
got := selectBest(nil, BanditParams{}, []*Arm{cliAgent, opus}, task, PreferAuto)
|
||||
if got == nil {
|
||||
t.Fatal("selectBest returned nil")
|
||||
}
|
||||
@@ -208,7 +208,7 @@ func TestSelectBest_EmptyStrengthsPreservesTierOrder(t *testing.T) {
|
||||
}
|
||||
|
||||
task := Task{Type: TaskSecurityReview, EstimatedTokens: 5000, RequiresTools: true, Priority: PriorityNormal}
|
||||
got := selectBest(nil, []*Arm{cliAgent, opus}, task, PreferAuto)
|
||||
got := selectBest(nil, BanditParams{}, []*Arm{cliAgent, opus}, task, PreferAuto)
|
||||
if got.ID != cliAgent.ID {
|
||||
t.Errorf("without Strengths, CLI-agent tier-1 should win; got %s", got.ID)
|
||||
}
|
||||
@@ -327,7 +327,7 @@ func TestSelectBest_MultiplePromotedArmsBestQualityWins(t *testing.T) {
|
||||
Strengths: []TaskType{TaskSecurityReview},
|
||||
}
|
||||
|
||||
qt := NewQualityTracker()
|
||||
qt := NewQualityTracker(0, 0)
|
||||
// armB has consistently succeeded — minObservations=3 is enough to flip
|
||||
// the score blend.
|
||||
for i := 0; i < 5; i++ {
|
||||
@@ -339,7 +339,7 @@ func TestSelectBest_MultiplePromotedArmsBestQualityWins(t *testing.T) {
|
||||
}
|
||||
|
||||
task := Task{Type: TaskSecurityReview, EstimatedTokens: 5000, RequiresTools: true, Priority: PriorityNormal}
|
||||
got := selectBest(qt, []*Arm{armA, armB}, task, PreferAuto)
|
||||
got := selectBest(qt, BanditParams{}, []*Arm{armA, armB}, task, PreferAuto)
|
||||
if got == nil {
|
||||
t.Fatal("selectBest returned nil")
|
||||
}
|
||||
|
||||
@@ -0,0 +1,121 @@
|
||||
package security
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"log/slog"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"sync"
|
||||
"time"
|
||||
)
|
||||
|
||||
// AuditEvent records a single firewall action (block / redact / sanitize)
|
||||
// in a structured form intended for per-session post-mortem grepping.
|
||||
//
|
||||
// Discipline: this struct must never carry the raw bytes of any matched
|
||||
// secret. The Pattern field names the matcher (e.g. "anthropic_api_key",
|
||||
// "high_entropy"); TokenLen carries the length of the offending token so
|
||||
// the user can recognise it in a transcript without re-leaking it.
|
||||
type AuditEvent struct {
|
||||
// Timestamp is the wall-clock time of the event in UTC.
|
||||
Timestamp time.Time `json:"ts"`
|
||||
// Action is one of: "block", "redact", "warn", "unicode_sanitize".
|
||||
Action string `json:"action"`
|
||||
// Pattern is the human-readable matcher name (regex tag or
|
||||
// "high_entropy" / "unicode"). Never the matched bytes themselves.
|
||||
Pattern string `json:"pattern,omitempty"`
|
||||
// Source describes where in the data flow the event fired —
|
||||
// "message_text", "tool_result", "tool_call_args",
|
||||
// "system_prompt", etc.
|
||||
Source string `json:"source,omitempty"`
|
||||
// TokenLen is the length of the offending token (or chars
|
||||
// changed for unicode_sanitize). Length only, never the bytes.
|
||||
TokenLen int `json:"token_len,omitempty"`
|
||||
}
|
||||
|
||||
// AuditLogger appends AuditEvent records to a per-session JSON Lines
|
||||
// file. Safe for concurrent use. Writes are skipped while incognito
|
||||
// mode is active so the no-persistence contract is honoured.
|
||||
//
|
||||
// A nil *AuditLogger is a valid no-op — callers can use the same
|
||||
// `audit.Record(...)` shape whether or not auditing is configured.
|
||||
type AuditLogger struct {
|
||||
path string
|
||||
incognito *IncognitoMode
|
||||
logger *slog.Logger
|
||||
mu sync.Mutex
|
||||
}
|
||||
|
||||
// AuditLoggerConfig controls how AuditLogger is constructed.
|
||||
type AuditLoggerConfig struct {
|
||||
// Path is the full filesystem path to write JSONL events to.
|
||||
// Parent directories are created lazily on first successful Record.
|
||||
Path string
|
||||
// Incognito gates writes; when active, Record is a no-op.
|
||||
// Optional — pass nil to always persist.
|
||||
Incognito *IncognitoMode
|
||||
// Logger receives one Warn per write failure so the user sees
|
||||
// disk-full / permission errors instead of silently losing
|
||||
// audit records. Defaults to slog.Default() when nil.
|
||||
Logger *slog.Logger
|
||||
}
|
||||
|
||||
// NewAuditLogger builds an AuditLogger. Pass a zero Path to disable
|
||||
// auditing (returns nil).
|
||||
func NewAuditLogger(cfg AuditLoggerConfig) *AuditLogger {
|
||||
if cfg.Path == "" {
|
||||
return nil
|
||||
}
|
||||
logger := cfg.Logger
|
||||
if logger == nil {
|
||||
logger = slog.Default()
|
||||
}
|
||||
return &AuditLogger{
|
||||
path: cfg.Path,
|
||||
incognito: cfg.Incognito,
|
||||
logger: logger,
|
||||
}
|
||||
}
|
||||
|
||||
// Record appends an event to the audit log. Safe to call on a nil
|
||||
// receiver (no-op). Skipped silently when incognito is active.
|
||||
// Write failures are logged at Warn level but do not propagate to
|
||||
// the caller — auditing is best-effort and must not crash the
|
||||
// scanner pipeline.
|
||||
func (a *AuditLogger) Record(ev AuditEvent) {
|
||||
if a == nil {
|
||||
return
|
||||
}
|
||||
if a.incognito != nil && a.incognito.Active() {
|
||||
return
|
||||
}
|
||||
if ev.Timestamp.IsZero() {
|
||||
ev.Timestamp = time.Now().UTC()
|
||||
}
|
||||
|
||||
a.mu.Lock()
|
||||
defer a.mu.Unlock()
|
||||
|
||||
if err := os.MkdirAll(filepath.Dir(a.path), 0o700); err != nil {
|
||||
a.logger.Warn("audit: mkdir failed", "path", a.path, "err", err)
|
||||
return
|
||||
}
|
||||
f, err := os.OpenFile(a.path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
|
||||
if err != nil {
|
||||
a.logger.Warn("audit: open failed", "path", a.path, "err", err)
|
||||
return
|
||||
}
|
||||
defer f.Close()
|
||||
if err := json.NewEncoder(f).Encode(ev); err != nil {
|
||||
a.logger.Warn("audit: encode failed", "path", a.path, "err", err)
|
||||
}
|
||||
}
|
||||
|
||||
// Path returns the file path the logger writes to. Empty when the
|
||||
// logger is disabled (nil receiver returns "").
|
||||
func (a *AuditLogger) Path() string {
|
||||
if a == nil {
|
||||
return ""
|
||||
}
|
||||
return a.path
|
||||
}
|
||||
@@ -0,0 +1,139 @@
|
||||
package security
|
||||
|
||||
import (
|
||||
"bufio"
|
||||
"encoding/json"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"strings"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func readAuditLines(t *testing.T, path string) []AuditEvent {
|
||||
t.Helper()
|
||||
f, err := os.Open(path)
|
||||
if err != nil {
|
||||
t.Fatalf("open audit log: %v", err)
|
||||
}
|
||||
defer f.Close()
|
||||
var events []AuditEvent
|
||||
sc := bufio.NewScanner(f)
|
||||
for sc.Scan() {
|
||||
var ev AuditEvent
|
||||
if err := json.Unmarshal(sc.Bytes(), &ev); err != nil {
|
||||
t.Fatalf("decode line %q: %v", sc.Text(), err)
|
||||
}
|
||||
events = append(events, ev)
|
||||
}
|
||||
if err := sc.Err(); err != nil {
|
||||
t.Fatalf("scan audit log: %v", err)
|
||||
}
|
||||
return events
|
||||
}
|
||||
|
||||
func TestAuditLogger_NilReceiverIsNoop(t *testing.T) {
|
||||
var a *AuditLogger
|
||||
// Must not panic.
|
||||
a.Record(AuditEvent{Action: "block"})
|
||||
}
|
||||
|
||||
func TestAuditLogger_DisabledWhenPathEmpty(t *testing.T) {
|
||||
a := NewAuditLogger(AuditLoggerConfig{})
|
||||
if a != nil {
|
||||
t.Errorf("expected nil logger for empty path, got %v", a)
|
||||
}
|
||||
}
|
||||
|
||||
func TestAuditLogger_AppendsJSONLines(t *testing.T) {
|
||||
dir := t.TempDir()
|
||||
path := filepath.Join(dir, "audit.jsonl")
|
||||
a := NewAuditLogger(AuditLoggerConfig{Path: path})
|
||||
if a == nil {
|
||||
t.Fatal("expected non-nil logger")
|
||||
}
|
||||
|
||||
a.Record(AuditEvent{Action: "block", Pattern: "anthropic_api_key", Source: "tool_result", TokenLen: 51})
|
||||
a.Record(AuditEvent{Action: "redact", Pattern: "high_entropy", Source: "message_text", TokenLen: 42})
|
||||
|
||||
events := readAuditLines(t, path)
|
||||
if len(events) != 2 {
|
||||
t.Fatalf("expected 2 events, got %d", len(events))
|
||||
}
|
||||
if events[0].Action != "block" || events[0].Pattern != "anthropic_api_key" {
|
||||
t.Errorf("event 0 = %+v", events[0])
|
||||
}
|
||||
if events[0].Timestamp.IsZero() {
|
||||
t.Error("event 0 missing timestamp")
|
||||
}
|
||||
if events[1].Action != "redact" || events[1].TokenLen != 42 {
|
||||
t.Errorf("event 1 = %+v", events[1])
|
||||
}
|
||||
}
|
||||
|
||||
func TestAuditLogger_SkipsUnderIncognito(t *testing.T) {
|
||||
dir := t.TempDir()
|
||||
path := filepath.Join(dir, "audit.jsonl")
|
||||
incog := NewIncognitoMode()
|
||||
a := NewAuditLogger(AuditLoggerConfig{Path: path, Incognito: incog})
|
||||
|
||||
incog.Activate()
|
||||
a.Record(AuditEvent{Action: "block", Pattern: "x"})
|
||||
|
||||
if _, err := os.Stat(path); !os.IsNotExist(err) {
|
||||
t.Errorf("expected audit file to not exist under incognito, got err=%v", err)
|
||||
}
|
||||
|
||||
incog.Deactivate()
|
||||
a.Record(AuditEvent{Action: "block", Pattern: "y"})
|
||||
|
||||
events := readAuditLines(t, path)
|
||||
if len(events) != 1 {
|
||||
t.Fatalf("expected 1 event after deactivate, got %d", len(events))
|
||||
}
|
||||
if events[0].Pattern != "y" {
|
||||
t.Errorf("expected pattern=y (incognito event dropped), got %q", events[0].Pattern)
|
||||
}
|
||||
}
|
||||
|
||||
func TestAuditLogger_CreatesParentDir(t *testing.T) {
|
||||
dir := t.TempDir()
|
||||
path := filepath.Join(dir, "deeply", "nested", "audit.jsonl")
|
||||
a := NewAuditLogger(AuditLoggerConfig{Path: path})
|
||||
a.Record(AuditEvent{Action: "block"})
|
||||
if _, err := os.Stat(path); err != nil {
|
||||
t.Errorf("expected audit file at %s, got err=%v", path, err)
|
||||
}
|
||||
}
|
||||
|
||||
func TestFirewall_RecordsRedactionToAudit(t *testing.T) {
|
||||
dir := t.TempDir()
|
||||
auditPath := filepath.Join(dir, "audit.jsonl")
|
||||
audit := NewAuditLogger(AuditLoggerConfig{Path: auditPath})
|
||||
|
||||
fw := NewFirewall(FirewallConfig{
|
||||
ScanOutgoing: true,
|
||||
ScanToolResults: true,
|
||||
Audit: audit,
|
||||
})
|
||||
|
||||
// Anthropic key prefix is a built-in redact pattern; emit it
|
||||
// through the tool-result scanning path.
|
||||
cleaned := fw.ScanToolResult("here is the key sk-ant-abcdef1234567890abcdef1234567890abcdef")
|
||||
if !strings.Contains(cleaned, "[REDACTED]") {
|
||||
t.Errorf("expected [REDACTED] in cleaned content, got %q", cleaned)
|
||||
}
|
||||
|
||||
events := readAuditLines(t, auditPath)
|
||||
var sawAnthropicRedact bool
|
||||
for _, ev := range events {
|
||||
if ev.Action == "redact" && ev.Pattern == "anthropic_api_key" && ev.Source == "tool_result" {
|
||||
sawAnthropicRedact = true
|
||||
if ev.TokenLen == 0 {
|
||||
t.Errorf("expected non-zero TokenLen on redact event, got %+v", ev)
|
||||
}
|
||||
}
|
||||
}
|
||||
if !sawAnthropicRedact {
|
||||
t.Errorf("expected an anthropic_api_key redact event in audit log, got %+v", events)
|
||||
}
|
||||
}
|
||||
@@ -14,6 +14,7 @@ type Firewall struct {
|
||||
scanner *Scanner
|
||||
incognito *IncognitoMode
|
||||
logger *slog.Logger
|
||||
audit *AuditLogger // optional; nil = no per-session audit log
|
||||
|
||||
// Config
|
||||
scanOutgoing bool
|
||||
@@ -27,6 +28,11 @@ type FirewallConfig struct {
|
||||
EntropyThreshold float64
|
||||
EntropySafelist []string
|
||||
Logger *slog.Logger
|
||||
// Audit is the optional per-session audit logger. Set via
|
||||
// SetAudit after the session ID is known — the firewall is
|
||||
// typically constructed before the session ID is generated.
|
||||
// nil is safe; auditing simply turns into a no-op.
|
||||
Audit *AuditLogger
|
||||
}
|
||||
|
||||
func NewFirewall(cfg FirewallConfig) *Firewall {
|
||||
@@ -50,11 +56,20 @@ func NewFirewall(cfg FirewallConfig) *Firewall {
|
||||
scanner: scanner,
|
||||
incognito: NewIncognitoMode(),
|
||||
logger: logger,
|
||||
audit: cfg.Audit,
|
||||
scanOutgoing: cfg.ScanOutgoing,
|
||||
scanToolResults: cfg.ScanToolResults,
|
||||
}
|
||||
}
|
||||
|
||||
// SetAudit attaches an AuditLogger after construction. The firewall
|
||||
// is typically built before the session ID exists, so callers usually
|
||||
// construct the AuditLogger later and inject it via this setter.
|
||||
// Pass nil to disable auditing.
|
||||
func (f *Firewall) SetAudit(a *AuditLogger) {
|
||||
f.audit = a
|
||||
}
|
||||
|
||||
// Incognito returns the incognito mode controller.
|
||||
func (f *Firewall) Incognito() *IncognitoMode {
|
||||
return f.incognito
|
||||
@@ -131,7 +146,16 @@ func (f *Firewall) scanMessage(m message.Message) message.Message {
|
||||
|
||||
func (f *Firewall) scanAndRedact(content, source string) string {
|
||||
// Unicode sanitization first
|
||||
originalLen := len(content)
|
||||
content = SanitizeUnicode(content)
|
||||
if delta := originalLen - len(content); delta != 0 {
|
||||
f.audit.Record(AuditEvent{
|
||||
Action: "unicode_sanitize",
|
||||
Pattern: "unicode",
|
||||
Source: source,
|
||||
TokenLen: delta,
|
||||
})
|
||||
}
|
||||
|
||||
// Secret scanning
|
||||
matches := f.scanner.Scan(content)
|
||||
@@ -146,6 +170,12 @@ func (f *Firewall) scanAndRedact(content, source string) string {
|
||||
"pattern", m.Pattern,
|
||||
"source", source,
|
||||
)
|
||||
f.audit.Record(AuditEvent{
|
||||
Action: "block",
|
||||
Pattern: m.Pattern,
|
||||
Source: source,
|
||||
TokenLen: m.End - m.Start,
|
||||
})
|
||||
return "[BLOCKED: content contained a secret]"
|
||||
default:
|
||||
f.logger.Debug("secret redacted",
|
||||
@@ -153,6 +183,12 @@ func (f *Firewall) scanAndRedact(content, source string) string {
|
||||
"action", m.Action,
|
||||
"source", source,
|
||||
)
|
||||
f.audit.Record(AuditEvent{
|
||||
Action: string(m.Action),
|
||||
Pattern: m.Pattern,
|
||||
Source: source,
|
||||
TokenLen: m.End - m.Start,
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -14,10 +14,13 @@ import (
|
||||
"somegit.dev/Owlibou/gnoma/internal/stream"
|
||||
)
|
||||
|
||||
// defaultClassifyTimeout — 5 s accommodates thinking-mode models like
|
||||
// Qwen3 distillations (Tiny3.5) that emit reasoning tokens before output.
|
||||
// Non-thinking models complete in well under 1 s.
|
||||
const defaultClassifyTimeout = 5 * time.Second
|
||||
// defaultClassifyTimeout — 15 s accommodates cold-start model loads
|
||||
// (ollama lazily loads on first call, ~2-8s for a 1.5B model on SSD)
|
||||
// combined with thinking-mode first-token latency (Qwen3 distillations
|
||||
// like Tiny3.5 sometimes emit <think> tokens before the JSON output
|
||||
// even with /no_think). Non-thinking warm models complete in well
|
||||
// under 1 s. Tune via [slm].classify_timeout in config.
|
||||
const defaultClassifyTimeout = 15 * time.Second
|
||||
|
||||
const classifySystemPrompt = `Classify the following coding request. /no_think
|
||||
Respond with JSON only, no other text, no reasoning, no thinking tags.
|
||||
@@ -47,14 +50,18 @@ type Classifier struct {
|
||||
|
||||
// NewClassifier creates a Classifier. model is the model name passed to the provider
|
||||
// (llamafile ignores it but openaicompat requires a non-empty value).
|
||||
func NewClassifier(p provider.Provider, model string, logger *slog.Logger) *Classifier {
|
||||
// Pass timeout=0 to use the built-in default (defaultClassifyTimeout).
|
||||
func NewClassifier(p provider.Provider, model string, timeout time.Duration, logger *slog.Logger) *Classifier {
|
||||
if logger == nil {
|
||||
logger = slog.Default()
|
||||
}
|
||||
if timeout <= 0 {
|
||||
timeout = defaultClassifyTimeout
|
||||
}
|
||||
return &Classifier{
|
||||
provider: p,
|
||||
model: model,
|
||||
timeout: defaultClassifyTimeout,
|
||||
timeout: timeout,
|
||||
logger: logger,
|
||||
}
|
||||
}
|
||||
@@ -68,7 +75,11 @@ func (c *Classifier) Classify(ctx context.Context, prompt string, history []mess
|
||||
|
||||
resp, err := c.callSLM(tctx, prompt)
|
||||
if err != nil {
|
||||
c.logger.Debug("slm classify fallback", "error", err)
|
||||
// Warn-level so a first-time misconfiguration (timeout too tight,
|
||||
// wrong endpoint, malformed JSON from the model) surfaces without
|
||||
// requiring --verbose. The fallback path itself is benign; the
|
||||
// signal is that the SLM isn't doing the work it was supposed to.
|
||||
c.logger.Warn("slm classify fallback", "error", err, "timeout", c.timeout)
|
||||
t, ferr := router.HeuristicClassifier{}.Classify(ctx, prompt, history)
|
||||
t.ClassifierSource = router.ClassifierSLMFallback
|
||||
return t, ferr
|
||||
@@ -91,9 +102,25 @@ func (c *Classifier) Classify(ctx context.Context, prompt string, history []mess
|
||||
}
|
||||
|
||||
func (c *Classifier) callSLM(ctx context.Context, prompt string) (*classifyResponse, error) {
|
||||
// Constrain the model toward valid, deterministic JSON output. Without
|
||||
// these settings small models routinely ignore the JSON-only system
|
||||
// prompt, emit reasoning blocks (<think>, <Thought Process>) or just
|
||||
// answer the user's prompt in prose. ResponseFormat=json_object asks
|
||||
// the provider to enforce JSON at decoding time where supported
|
||||
// (ollama 'format=json', llama.cpp grammar, OpenAI json_object). Even
|
||||
// when the provider can't enforce, the explicit signal nudges the
|
||||
// adapter to set the right backend flag.
|
||||
temp := 0.0
|
||||
topP := 1.0
|
||||
req := provider.Request{
|
||||
Model: c.model,
|
||||
SystemPrompt: classifySystemPrompt,
|
||||
Temperature: &temp,
|
||||
TopP: &topP,
|
||||
MaxTokens: 128, // classification output is ~50 tokens; cap to prevent runaway reasoning
|
||||
ResponseFormat: &provider.ResponseFormat{
|
||||
Type: provider.ResponseJSON,
|
||||
},
|
||||
Messages: []message.Message{
|
||||
{
|
||||
Role: message.RoleUser,
|
||||
@@ -127,10 +154,22 @@ func (c *Classifier) callSLM(ctx context.Context, prompt string) (*classifyRespo
|
||||
return &resp, nil
|
||||
}
|
||||
|
||||
// extractJSON pulls the first {...} substring from s, stripping markdown fences if present.
|
||||
// extractJSON pulls the first {...} substring from s, stripping markdown
|
||||
// fences and known thinking-block tags. Small models routinely violate
|
||||
// the JSON-only system prompt by emitting reasoning tokens first, so
|
||||
// the extractor must tolerate prefixes the model wasn't asked to emit.
|
||||
func extractJSON(s string) string {
|
||||
s = strings.TrimSpace(s)
|
||||
|
||||
// Strip known thinking-block tags. Order matters: longer/more-
|
||||
// specific names first so a partial match doesn't shadow a real
|
||||
// one. Seen in the wild on Qwen3 (<think>) and tiny3.5
|
||||
// (<Thought Process>); the others are defensive against similar
|
||||
// fine-tunes.
|
||||
for _, tag := range []string{"Thought Process", "thinking", "reasoning", "thoughts", "think"} {
|
||||
s = stripTagBlock(s, tag)
|
||||
}
|
||||
|
||||
// Strip ```json ... ``` fences.
|
||||
if strings.HasPrefix(s, "```") {
|
||||
end := strings.LastIndex(s, "```")
|
||||
@@ -160,3 +199,28 @@ func extractJSON(s string) string {
|
||||
}
|
||||
return s[start:]
|
||||
}
|
||||
|
||||
// stripTagBlock removes <tag>...</tag> blocks (case-insensitive on the
|
||||
// tag name) from the start of s. Returns the original string if the tag
|
||||
// is not at the start. Idempotent; safe to call repeatedly.
|
||||
func stripTagBlock(s, tag string) string {
|
||||
trimmed := strings.TrimSpace(s)
|
||||
open := "<" + tag
|
||||
lower := strings.ToLower(trimmed)
|
||||
if !strings.HasPrefix(lower, strings.ToLower(open)) {
|
||||
return s
|
||||
}
|
||||
// Find the matching closing tag, case-insensitive.
|
||||
close := "</" + tag + ">"
|
||||
closeIdx := strings.Index(strings.ToLower(trimmed), strings.ToLower(close))
|
||||
if closeIdx < 0 {
|
||||
// Unterminated thinking block — strip up to the first '{'
|
||||
// so we still have a shot at extracting JSON that follows.
|
||||
braceIdx := strings.IndexByte(trimmed, '{')
|
||||
if braceIdx > 0 {
|
||||
return strings.TrimSpace(trimmed[braceIdx:])
|
||||
}
|
||||
return s
|
||||
}
|
||||
return strings.TrimSpace(trimmed[closeIdx+len(close):])
|
||||
}
|
||||
|
||||
@@ -54,7 +54,7 @@ func TestClassifier_HappyPath(t *testing.T) {
|
||||
// SLM complexity 0.55 stays above the Debug floor (0.4), so the SLM
|
||||
// value is preserved verbatim.
|
||||
p := &mockProvider{text: `{"task_type":"Debug","complexity":0.55,"requires_tools":false}`}
|
||||
cls := NewClassifier(p, "default", nil)
|
||||
cls := NewClassifier(p, "default", 0, nil)
|
||||
|
||||
task, err := cls.Classify(context.Background(), "fix the failing test", nil)
|
||||
if err != nil {
|
||||
@@ -76,7 +76,7 @@ func TestClassifier_AppliesTaskTypeFloor(t *testing.T) {
|
||||
// bump ComplexityScore up to the floor so the SLM arm can't be picked
|
||||
// for its own kind of misclassification.
|
||||
p := &mockProvider{text: `{"task_type":"Debug","complexity":0.25,"requires_tools":false}`}
|
||||
cls := NewClassifier(p, "default", nil)
|
||||
cls := NewClassifier(p, "default", 0, nil)
|
||||
|
||||
task, err := cls.Classify(context.Background(), "fix the failing test", nil)
|
||||
if err != nil {
|
||||
@@ -91,7 +91,7 @@ func TestClassifier_AppliesTaskTypeFloor(t *testing.T) {
|
||||
func TestClassifier_BlendHeuristic(t *testing.T) {
|
||||
// SLM returns one type; other Task fields should come from heuristic.
|
||||
p := &mockProvider{text: `{"task_type":"Boilerplate","complexity":0.1,"requires_tools":false}`}
|
||||
cls := NewClassifier(p, "default", nil)
|
||||
cls := NewClassifier(p, "default", 0, nil)
|
||||
|
||||
task, err := cls.Classify(context.Background(), "scaffold a new HTTP handler", nil)
|
||||
if err != nil {
|
||||
@@ -108,7 +108,7 @@ func TestClassifier_BlendHeuristic(t *testing.T) {
|
||||
|
||||
func TestClassifier_FallbackOnBadJSON(t *testing.T) {
|
||||
p := &mockProvider{text: "I cannot classify that."}
|
||||
cls := NewClassifier(p, "default", nil)
|
||||
cls := NewClassifier(p, "default", 0, nil)
|
||||
|
||||
// Should not error — falls back to heuristic.
|
||||
task, err := cls.Classify(context.Background(), "write unit tests for the parser", nil)
|
||||
@@ -123,7 +123,7 @@ func TestClassifier_FallbackOnBadJSON(t *testing.T) {
|
||||
|
||||
func TestClassifier_FallbackOnProviderError(t *testing.T) {
|
||||
p := &mockProvider{err: errors.New("connection refused")}
|
||||
cls := NewClassifier(p, "default", nil)
|
||||
cls := NewClassifier(p, "default", 0, nil)
|
||||
|
||||
task, err := cls.Classify(context.Background(), "explain how generics work", nil)
|
||||
if err != nil {
|
||||
@@ -137,7 +137,7 @@ func TestClassifier_FallbackOnProviderError(t *testing.T) {
|
||||
|
||||
func TestClassifier_FallbackOnTimeout(t *testing.T) {
|
||||
p := &mockProvider{delay: 500 * time.Millisecond}
|
||||
cls := NewClassifier(p, "default", nil)
|
||||
cls := NewClassifier(p, "default", 0, nil)
|
||||
cls.timeout = 50 * time.Millisecond // force timeout
|
||||
|
||||
task, err := cls.Classify(context.Background(), "debug the failing test", nil)
|
||||
@@ -153,7 +153,7 @@ func TestClassifier_FallbackOnTimeout(t *testing.T) {
 func TestClassifier_FenceStripping(t *testing.T) {
 	fenced := "```json\n{\"task_type\":\"Refactor\",\"complexity\":0.5,\"requires_tools\":true}\n```"
 	p := &mockProvider{text: fenced}
-	cls := NewClassifier(p, "default", nil)
+	cls := NewClassifier(p, "default", 0, nil)

 	task, err := cls.Classify(context.Background(), "refactor the auth middleware", nil)
 	if err != nil {
@@ -166,7 +166,7 @@ func TestClassifier_FenceStripping(t *testing.T) {

 func TestClassifier_UnknownTaskType_FallsBackToHeuristic(t *testing.T) {
 	p := &mockProvider{text: `{"task_type":"FooBar","complexity":0.3,"requires_tools":false}`}
-	cls := NewClassifier(p, "default", nil)
+	cls := NewClassifier(p, "default", 0, nil)

 	task, err := cls.Classify(context.Background(), "implement a binary search function", nil)
 	if err != nil {
@@ -178,7 +178,7 @@ func TestClassifier_UnknownTaskType_FallsBackToHeuristic(t *testing.T) {

 func TestClassifier_SetsClassifierSource_OnSuccess(t *testing.T) {
 	p := &mockProvider{text: `{"task_type":"Debug","complexity":0.3,"requires_tools":true}`}
-	cls := NewClassifier(p, "default", nil)
+	cls := NewClassifier(p, "default", 0, nil)
 	task, err := cls.Classify(context.Background(), "fix the failing test", nil)
 	if err != nil {
 		t.Fatal(err)
@@ -190,7 +190,7 @@ func TestClassifier_SetsClassifierSource_OnSuccess(t *testing.T) {

 func TestClassifier_SetsClassifierSource_OnFallback(t *testing.T) {
 	p := &mockProvider{err: errors.New("backend unreachable")}
-	cls := NewClassifier(p, "default", nil)
+	cls := NewClassifier(p, "default", 0, nil)
 	task, err := cls.Classify(context.Background(), "fix the failing test", nil)
 	if err != nil {
 		t.Fatal(err)
@@ -202,7 +202,7 @@ func TestClassifier_SetsClassifierSource_OnFallback(t *testing.T) {

 func TestClassifier_ContextPassedToHistory(t *testing.T) {
 	p := &mockProvider{text: `{"task_type":"Explain","complexity":0.2,"requires_tools":false}`}
-	cls := NewClassifier(p, "default", nil)
+	cls := NewClassifier(p, "default", 0, nil)

 	history := []message.Message{
 		{Role: message.RoleUser, Content: []message.Content{{Type: message.ContentText, Text: "prior"}}},
@@ -215,3 +215,45 @@ func TestClassifier_ContextPassedToHistory(t *testing.T) {
 		t.Errorf("Type = %s, want Explain", task.Type)
 	}
 }
+
+func TestExtractJSON_StripsThinkingTags(t *testing.T) {
+	cases := []struct {
+		name string
+		in string
+		want string
+	}{
+		{
+			name: "qwen-think-block",
+			in: `<think>Let me decide</think>{"task_type":"Debug","complexity":0.5,"requires_tools":true}`,
+			want: `{"task_type":"Debug","complexity":0.5,"requires_tools":true}`,
+		},
+		{
+			name: "tiny3.5-thought-process",
+			in: "<Thought Process>\nUser wants debugging help.\n</Thought Process>\n{\"task_type\":\"Debug\",\"complexity\":0.4,\"requires_tools\":true}",
+			want: `{"task_type":"Debug","complexity":0.4,"requires_tools":true}`,
+		},
+		{
+			name: "unterminated-think-falls-back-to-brace",
+			in: `<think>incomplete reasoning {"task_type":"Explain","complexity":0.2,"requires_tools":false}`,
+			want: `{"task_type":"Explain","complexity":0.2,"requires_tools":false}`,
+		},
+		{
+			name: "no-tags-still-works",
+			in: `{"task_type":"Generation","complexity":0.6,"requires_tools":false}`,
+			want: `{"task_type":"Generation","complexity":0.6,"requires_tools":false}`,
+		},
+		{
+			name: "fenced-json-still-works",
+			in: "```json\n{\"task_type\":\"Refactor\",\"complexity\":0.5,\"requires_tools\":true}\n```",
+			want: `{"task_type":"Refactor","complexity":0.5,"requires_tools":true}`,
+		},
+	}
+	for _, tc := range cases {
+		t.Run(tc.name, func(t *testing.T) {
+			got := extractJSON(tc.in)
+			if got != tc.want {
+				t.Errorf("extractJSON(...)\n got: %q\n want: %q", got, tc.want)
+			}
+		})
+	}
+}
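Review note: the body of `extractJSON` is not part of this hunk, so the following is only a minimal sketch of one way to satisfy the five table cases above. `extractJSONSketch` and `thinkBlock` are hypothetical names. The approach assumed here is to drop any complete thinking block first, then slice from the first `{` to the last `}`, which also covers the fenced and unterminated-tag cases.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// thinkBlock matches a complete <think>...</think> or
// <Thought Process>...</Thought Process> block, newlines included.
var thinkBlock = regexp.MustCompile(
	`(?s)<think>.*?</think>|<Thought Process>.*?</Thought Process>`)

// extractJSONSketch is a hypothetical stand-in for extractJSON.
// It removes complete thinking blocks, then returns the text between
// the first '{' and the last '}' inclusive. If no braces are found it
// returns the trimmed input unchanged.
func extractJSONSketch(s string) string {
	s = thinkBlock.ReplaceAllString(s, "")
	start := strings.Index(s, "{")
	end := strings.LastIndex(s, "}")
	if start < 0 || end < start {
		return strings.TrimSpace(s)
	}
	return s[start : end+1]
}

func main() {
	in := "<Thought Process>\nUser wants debugging help.\n</Thought Process>\n" +
		`{"task_type":"Debug","complexity":0.4,"requires_tools":true}`
	fmt.Println(extractJSONSketch(in))
}
```

One limitation of this sketch: an unterminated thinking block whose reasoning text itself contains a `{` before the real payload would produce an invalid slice, so a caller would still need to validate the result with `json.Unmarshal`.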
+36
-1
@@ -1146,6 +1146,15 @@ func (m Model) submitInput(input string) (tea.Model, tea.Cmd) {
 	m.thinkingBuf.Reset()
 	m.streamFilterClose = ""

+	// Recover from a prior StateError before submitting a fresh user
+	// prompt. A transient routing or engine failure used to leave the
+	// session in error state, blocking every subsequent prompt with
+	// "session not idle (state: error)" until the user restarted gnoma.
+	// User-initiated sends always carry an intent-to-retry, so resetting
+	// here is the safe default; the /init retry path has its own explicit
+	// ResetError that we leave alone.
+	m.session.ResetError()
+
 	if err := m.session.Send(expandedInput); err != nil {
 		m.messages = append(m.messages, chatMessage{role: "error", content: formatError(err)})
 		m.streaming = false
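Review note: the comment in this hunk describes a session that stays stuck in an error state until it is explicitly reset. The sketch below models that behaviour with a toy state machine. `Session`, `State`, and `fail` are hypothetical stand-ins and do not reflect the real session type; only the error text is taken from the comment above. It assumes `ResetError` is a no-op unless the session is actually in the error state, which is what makes it safe to call unconditionally before every send.

```go
package main

import (
	"fmt"
	"sync"
)

// State is a hypothetical session state.
type State int

const (
	StateIdle State = iota
	StateStreaming
	StateError
)

func (s State) String() string {
	switch s {
	case StateIdle:
		return "idle"
	case StateStreaming:
		return "streaming"
	case StateError:
		return "error"
	}
	return "unknown"
}

// Session is a toy model of a session guarded by a mutex.
type Session struct {
	mu    sync.Mutex
	state State
}

// ResetError moves an errored session back to idle and leaves any
// other state (including an in-flight stream) untouched.
func (s *Session) ResetError() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.state == StateError {
		s.state = StateIdle
	}
}

// Send rejects input unless the session is idle.
func (s *Session) Send(input string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.state != StateIdle {
		return fmt.Errorf("session not idle (state: %s)", s.state)
	}
	s.state = StateStreaming
	return nil
}

// fail simulates a transient routing or engine failure.
func (s *Session) fail() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.state = StateError
}

func main() {
	s := &Session{}
	s.fail()
	fmt.Println(s.Send("hello")) // session not idle (state: error)
	s.ResetError()
	fmt.Println(s.Send("hello")) // <nil>
}
```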
@@ -1403,6 +1412,28 @@ func (m Model) handleCommand(cmd string) (tea.Model, tea.Cmd) {
 		m.injectSystemContext(msg)
 		return m, nil

+	case "/router":
+		if m.config.Router == nil {
+			m.messages = append(m.messages, chatMessage{role: "error", content: "router not configured"})
+			return m, nil
+		}
+		if args == "" || args == "help" {
+			current := m.config.Router.PreferPolicy().String()
+			m.messages = append(m.messages, chatMessage{role: "system",
+				content: fmt.Sprintf("router.prefer = %s\nUsage: /router <auto|local|cloud>\n auto — no bias; tier order + Strengths decide\n local — cloud arms demoted; locals win when feasible\n cloud — local arms demoted; cloud arms win (except tier-0 SLM)", current)})
+			return m, nil
+		}
+		policy, err := router.ParsePreferPolicy(args)
+		if err != nil {
+			m.messages = append(m.messages, chatMessage{role: "error", content: err.Error()})
+			return m, nil
+		}
+		m.config.Router.SetPreferPolicy(policy)
+		msg := fmt.Sprintf("router.prefer = %s (runtime override; not written to config)", policy.String())
+		m.messages = append(m.messages, chatMessage{role: "system", content: msg})
+		m.injectSystemContext(msg)
+		return m, nil
+
 	case "/profile":
 		if args == "" {
 			m = m.closeAllPickers()
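Review note: the `/router` handler relies on `router.ParsePreferPolicy` and a `String()` method on the returned policy, neither of which is shown in this range. The sketch below is one plausible shape for them, assuming a small integer enum. The type name `PreferPolicy`, the constant names, the case-insensitive parsing, and the error wording are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// PreferPolicy is a hypothetical enum for the routing preference.
type PreferPolicy int

const (
	PreferAuto PreferPolicy = iota
	PreferLocal
	PreferCloud
)

// String returns the user-facing name used in "/router <mode>".
func (p PreferPolicy) String() string {
	switch p {
	case PreferLocal:
		return "local"
	case PreferCloud:
		return "cloud"
	default:
		return "auto"
	}
}

// ParsePreferPolicy converts user input into a policy. Trimming and
// lowercasing the input is an assumption, not confirmed by the diff.
func ParsePreferPolicy(s string) (PreferPolicy, error) {
	switch strings.ToLower(strings.TrimSpace(s)) {
	case "auto":
		return PreferAuto, nil
	case "local":
		return PreferLocal, nil
	case "cloud":
		return PreferCloud, nil
	}
	return PreferAuto, fmt.Errorf("unknown router preference %q (want auto, local, or cloud)", s)
}

func main() {
	for _, in := range []string{"local", "Cloud", "edge"} {
		p, err := ParsePreferPolicy(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("router.prefer = %s\n", p)
	}
}
```

Returning the error to the caller, as the handler above does with `err.Error()`, keeps the list of valid modes in one place.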
@@ -1472,6 +1503,8 @@ func (m Model) handleCommand(cmd string) (tea.Model, tea.Cmd) {
 		m.initWriteNudged = false

 		opts := engine.TurnOptions{}
+		// Recover from prior StateError before /init can submit.
+		m.session.ResetError()
 		if err := m.session.SendWithOptions(prompt, opts); err != nil {
 			m.messages = append(m.messages, chatMessage{role: "error", content: formatError(err)})
 			m.streaming = false
@@ -1532,7 +1565,7 @@ func (m Model) handleCommand(cmd string) (tea.Model, tea.Cmd) {
 			return m, nil
 		}
 		m.messages = append(m.messages, chatMessage{role: "system",
-			content: "Commands:\n /init generate or update AGENTS.md project docs\n /clear, /new clear chat and start new conversation\n /config show current config\n /incognito toggle incognito (Ctrl+X)\n /keys show keyboard shortcuts\n /model [name] list/switch models\n /permission [mode] set permission mode (Shift+Tab to cycle)\n /plugins list installed plugins\n /profile [name] list profiles / switch (re-execs gnoma)\n /provider show current provider\n /replay scroll to top to re-read conversation\n /resume [id] list or restore saved sessions\n /shell [cmd] open interactive shell (or run cmd in shell)\n /skills list loaded skills\n /usage show token usage and cost\n /help show this help\n /quit exit gnoma\n\nSkills (use /<name> [args] to invoke):\n Add .md files with YAML front matter to .gnoma/skills/ or ~/.config/gnoma/skills/"})
+			content: "Commands:\n /init generate or update AGENTS.md project docs\n /clear, /new clear chat and start new conversation\n /config show current config\n /incognito toggle incognito (Ctrl+X)\n /keys show keyboard shortcuts\n /model [name] list/switch models\n /permission [mode] set permission mode (Shift+Tab to cycle)\n /plugins list installed plugins\n /profile [name] list profiles / switch (re-execs gnoma)\n /provider show current provider\n /replay scroll to top to re-read conversation\n /resume [id] list or restore saved sessions\n /router [mode] show or set routing preference (auto/local/cloud)\n /shell [cmd] open interactive shell (or run cmd in shell)\n /skills list loaded skills\n /usage show token usage and cost\n /help show this help\n /quit exit gnoma\n\nSkills (use /<name> [args] to invoke):\n Add .md files with YAML front matter to .gnoma/skills/ or ~/.config/gnoma/skills/"})
 		return m, nil

 	case "/keys":
@@ -1673,6 +1706,8 @@ func (m Model) handleCommand(cmd string) (tea.Model, tea.Cmd) {
 				AllowedTools: sk.Frontmatter.AllowedTools,
 				AllowedPaths: sk.Frontmatter.Paths,
 			}
+			// Recover from prior StateError before the skill submits.
+			m.session.ResetError()
 			if err := m.session.SendWithOptions(rendered, skillOpts); err != nil {
 				m.messages = append(m.messages, chatMessage{role: "error", content: formatError(err)})
 				m.streaming = false
@@ -22,7 +22,10 @@ var builtinCommands = []cmdEntry{
 	{"/exit", "exit gnoma"},
 	{"/help", "show available commands and shortcuts"},
 	{"/incognito", "toggle incognito mode (no persistence, local-only routing)"},
-	{"/init", "initialize project — create AGENTS.md"},
+	// /init is provided by the bundled skill at
+	// internal/skill/skills/init.md; do not duplicate it here. The dedup
+	// in completionSource() would skip a duplicate entry anyway, but
+	// omitting it keeps the source-of-truth single.
 	{"/keys", "show keyboard shortcuts"},
 	{"/model", "list or switch active model"},
 	{"/new", "start a new conversation"},
@@ -34,6 +37,7 @@ var builtinCommands = []cmdEntry{
 	{"/quit", "quit gnoma"},
 	{"/replay", "replay last assistant response"},
 	{"/resume", "browse and resume a saved session"},
+	{"/router", "show or set routing preference (auto/local/cloud)"},
 	{"/shell", "open interactive shell"},
 	{"/theme", "list themes or set active theme"},
 	{"/skills", "list available skills"},
@@ -46,11 +50,27 @@ var permissionModes = []string{
 	"auto", "default", "accept_edits", "bypass", "deny", "plan",
 }

-// completionSource builds a sorted command list from builtins + skills.
-func completionSource(skills *skill.Registry) []cmdEntry {
-	entries := make([]cmdEntry, len(builtinCommands))
-	copy(entries, builtinCommands)
+// routerPreferModes lists valid values for /router completion.
+var routerPreferModes = []string{"auto", "local", "cloud"}
+
+// completionSource builds a sorted command list from builtins + skills.
+// Skill names shadow builtin names so a skill (bundled or user-defined)
+// can replace a static entry without producing a duplicate in the picker.
+func completionSource(skills *skill.Registry) []cmdEntry {
+	skillNames := make(map[string]struct{})
+	if skills != nil {
+		for _, s := range skills.All() {
+			skillNames["/"+s.Frontmatter.Name] = struct{}{}
+		}
+	}
+
+	entries := make([]cmdEntry, 0, len(builtinCommands)+len(skillNames))
+	for _, c := range builtinCommands {
+		if _, shadowed := skillNames[c.name]; shadowed {
+			continue
+		}
+		entries = append(entries, c)
+	}
 	if skills != nil {
 		for _, s := range skills.All() {
 			desc := s.Frontmatter.Description
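Review note: the shadowing rule in `completionSource` can be exercised without the skill registry. The sketch below reproduces the same two-pass idea with plain slices. `skillInfo` and `mergeCompletions` are hypothetical names, and the final sort by command name is inferred from the "builds a sorted command list" comment rather than from code visible in this hunk.

```go
package main

import (
	"fmt"
	"sort"
)

// cmdEntry mirrors the two-field shape used by builtinCommands.
type cmdEntry struct{ name, desc string }

// skillInfo is a hypothetical stand-in for a loaded skill.
type skillInfo struct{ Name, Description string }

// mergeCompletions drops any builtin whose name is taken by a skill,
// appends the skills, and returns the result sorted by name.
func mergeCompletions(builtins []cmdEntry, skills []skillInfo) []cmdEntry {
	shadow := make(map[string]struct{}, len(skills))
	for _, s := range skills {
		shadow["/"+s.Name] = struct{}{}
	}

	entries := make([]cmdEntry, 0, len(builtins)+len(skills))
	for _, c := range builtins {
		if _, shadowed := shadow[c.name]; shadowed {
			continue // the skill entry replaces this builtin
		}
		entries = append(entries, c)
	}
	for _, s := range skills {
		entries = append(entries, cmdEntry{"/" + s.Name, s.Description})
	}

	sort.Slice(entries, func(i, j int) bool { return entries[i].name < entries[j].name })
	return entries
}

func main() {
	builtins := []cmdEntry{
		{"/help", "show available commands and shortcuts"},
		{"/init", "builtin description"},
	}
	skills := []skillInfo{{"init", "description from the bundled skill"}}

	for _, e := range mergeCompletions(builtins, skills) {
		fmt.Printf("%-6s %s\n", e.name, e.desc)
	}
}
```

Building the shadow set first keeps the merge linear in the number of entries, and the picker shows `/init` exactly once with the skill's description.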
@@ -150,6 +170,16 @@ func matchArgCompletion(input string, profileNames []string, providerNames []str
 				return cmd + " " + mode
 			}
 		}
+	case "/router":
+		if arg == "" {
+			return ""
+		}
+		lower := strings.ToLower(arg)
+		for _, mode := range routerPreferModes {
+			if strings.HasPrefix(mode, lower) && mode != arg {
+				return cmd + " " + mode
+			}
+		}
 	case "/profile":
 		if arg == "" || len(profileNames) == 0 {
 			return ""
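Review note: the `/router` branch of `matchArgCompletion` is a first-match prefix completion. The sketch below isolates that logic so its edge cases are easy to check. `completeRouterArg` is a hypothetical helper, and returning an empty string when nothing matches is an assumption about what the surrounding function does after the `switch`.

```go
package main

import (
	"fmt"
	"strings"
)

// routerPreferModes matches the list added in this diff.
var routerPreferModes = []string{"auto", "local", "cloud"}

// completeRouterArg suggests the first mode that starts with the
// typed argument (case-insensitive). It returns "" for an empty
// argument, for no match, and when the argument is already an exact
// mode, so the completer does not re-suggest what is already typed.
func completeRouterArg(cmd, arg string) string {
	if arg == "" {
		return ""
	}
	lower := strings.ToLower(arg)
	for _, mode := range routerPreferModes {
		if strings.HasPrefix(mode, lower) && mode != arg {
			return cmd + " " + mode
		}
	}
	return ""
}

func main() {
	for _, arg := range []string{"lo", "c", "local", "x", ""} {
		fmt.Printf("%q -> %q\n", arg, completeRouterArg("/router", arg))
	}
}
```

Because the exact-match check compares against the raw argument while the prefix check uses the lowercased one, an input such as `LOCAL` is still completed to `/router local`.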