Three polish items surfaced during the maintainer's manual smoke
of the previous safety commit.
env-template precision (false-positive fix):
The "env file" rule matched .env.* universally, which flagged
conventional templates like .env.example / .env.sample /
.env.template / .env.dist / .env.default — these hold variable
NAMES, no values, and are commonly committed. Now skipped.
Real env files (.env, .env.local, .env.production) still match.
New envTemplateSuffixes table + isEnvTemplate helper; check runs
only inside the env-file rule so the suffix denylist is scoped.
Tests added for both directions: 6 templates that must NOT flag,
6 real env files that must.
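A minimal sketch of the template check described above. The names envTemplateSuffixes and isEnvTemplate come from this commit; the exact matching logic and the isSensitiveEnvFile wrapper are assumptions for illustration, not the shipped code.

```go
package main

import (
	"fmt"
	"strings"
)

// envTemplateSuffixes lists the template suffixes named in the commit message.
var envTemplateSuffixes = []string{".example", ".sample", ".template", ".dist", ".default"}

// isEnvTemplate reports whether name is a conventional env template such as
// ".env.example". Assumption: the suffix must follow ".env" directly.
func isEnvTemplate(name string) bool {
	if !strings.HasPrefix(name, ".env") {
		return false
	}
	rest := strings.TrimPrefix(name, ".env")
	for _, suffix := range envTemplateSuffixes {
		if rest == suffix {
			return true
		}
	}
	return false
}

// isSensitiveEnvFile is a hypothetical stand-in for the "env file" rule:
// it matches .env and .env.* but skips templates. The template check runs
// only here, so the suffix list cannot affect other rules.
func isSensitiveEnvFile(name string) bool {
	if name != ".env" && !strings.HasPrefix(name, ".env.") {
		return false
	}
	return !isEnvTemplate(name)
}

func main() {
	for _, name := range []string{".env", ".env.local", ".env.example", ".envrc"} {
		fmt.Printf("%-14s flagged=%v\n", name, isSensitiveEnvFile(name))
	}
}
```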
Banner label alignment:
Field labels were padded to 8 chars except "sensitive" at 9,
producing visible misalignment in the rendered banner:
  cwd      : /...
  provider : ollama / ...
  sensitive : 0 matches in cwd   <- one extra space
Padded all labels to 9 chars so the ":" separators line up.
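The padding fix can be sketched with a fixed-width format verb. The labelWidth constant, the bannerLine helper, and the " : " separator are assumptions for illustration; the real render functions may be shaped differently.

```go
package main

import (
	"fmt"
	"strings"
)

// labelWidth is the length of the longest label ("sensitive").
const labelWidth = 9

// bannerLine left-justifies label in a labelWidth-wide column so every
// ":" separator lands in the same column.
func bannerLine(label, value string) string {
	return fmt.Sprintf("%-*s : %s", labelWidth, label, value)
}

// colonColumn returns the index of the first ":" in a rendered line.
func colonColumn(line string) int {
	return strings.IndexByte(line, ':')
}

func main() {
	fmt.Println(bannerLine("cwd", "/..."))
	fmt.Println(bannerLine("provider", "ollama / ..."))
	fmt.Println(bannerLine("sensitive", "0 matches in cwd"))
}
```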
Context banner on bypass:
--dangerously-allow-anywhere previously suppressed the entire
safety block, including the informational context banner.
Bypassing the GATE is not the same as opting out of the info —
the user still wants to see cwd / git state / sensitive files
nearby. Restructured the safety block so classification + banner
always run; the bypass only skips the refuse/warn FLOW. The
bypass warning log now also includes the classified tier and
cwd path for diagnostics.
Implements S-1 through S-7 of the startup-safety-banner plan.
Adds a pre-launch safety check that classifies the current working
directory into three tiers and gates the launch:
  TierRefuse  /, /etc, /sys, /proc, /usr, /var, /bin, /sbin, /boot,
              /root, /dev (Linux) and /System, /Library, /private,
              /Applications (macOS). Refuses with exit 2 unless
              --dangerously-allow-anywhere is passed.
  TierWarn    $HOME, ~/Desktop, ~/Downloads, ~/Documents, ~/.config,
              ~/.local, ~/.cache, /tmp, and similar dumping grounds.
              Prints a banner and reads a single y/Y from stdin to
              confirm; any other input (or EOF, including piped/
              scripted invocation) aborts with exit 1.
  TierOK      Anywhere with a recognized project marker (.gnoma/,
              go.mod, package.json, pyproject.toml, Cargo.toml,
              Makefile, Dockerfile, build.gradle*, pom.xml) or
              inside a git repo. No prompt; banner only.
Project markers and git-repo presence override the TierWarn check —
a project dir inside $HOME stays TierOK. The require_project_marker
config knob can flip that for strict users.
Container detection: when /.dockerenv or /run/.containerenv exists,
TierRefuse downgrades to TierWarn (devcontainers often chroot to /
or similar). Best-effort; false positives only soften the gate.
The context banner is always rendered (TierOK, TierWarn, TierRefuse
alike) and summarizes: cwd, git branch + dirty state, project type,
provider/model, modes (permission, incognito, prefer), and a
top-level sensitive-file inventory. Inventory matches .env,
.env.*, env.local; private-key extensions (.pem, .key, .crt, .p12,
.pfx); SSH key names (id_rsa, id_ed25519, ...); credentials files;
.netrc / .pgpass; KeePass vaults; and .ssh/ .aws/ .kube/ .gcloud/
.azure/ .docker/ directories. Precision-tested: .envrc and
secret_handler.go do NOT match. Bounded at 1000 entries.
Architecture:
- internal/safety/cwd.go — Classification + symlink-resolving tier
classifier with platform-specific roots and container detection.
- internal/safety/sensitive.go — pattern-based top-level scanner,
deterministic ordering, scanLimit guard against pathological dirs.
- internal/safety/banner.go — pure render functions for the warn
prefix, refuse message, and context banner. Safe for golden-string
testing.
- internal/config/config.go — new [safety] section with three
config keys, defaults applied via ResolvedSafety() helper. Pointer
fields distinguish "user omitted" from "user set to false."
- cmd/gnoma/main.go — gate runs after subcommand dispatch (so
`gnoma providers / profile / slm / router` skip the prompt) and
before provider creation. --dangerously-allow-anywhere bypasses
the gate with an explicit log warning.
The runtime keypress reads up to 8 bytes from os.Stdin and accepts
only "y" / "Y" after trimming; EOF returns false (piped invocations
without the flag will abort). Documented in the readYesConfirmation
helper.
Manual smoke (per plan):
- `cd / && gnoma -p test` → refuses
- `cd ~ && gnoma` → warns + keypress
- `cd ~/git/some-repo && gnoma` → banner only
- subcommands skip the gate entirely
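A sketch of the keypress confirmation described above, assuming the helper takes an io.Reader so it can be exercised without a terminal. The real readYesConfirmation signature is not shown in this commit message; confirmFromString is a hypothetical test convenience.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// readYesConfirmation reads up to 8 bytes from r and returns true only when
// the trimmed input is exactly "y" or "Y". EOF with no data, or any other
// read error, returns false, so piped or scripted invocations abort.
func readYesConfirmation(r io.Reader) bool {
	buf := make([]byte, 8)
	n, err := r.Read(buf)
	if n == 0 || (err != nil && err != io.EOF) {
		return false
	}
	answer := strings.TrimSpace(string(buf[:n]))
	return answer == "y" || answer == "Y"
}

// confirmFromString feeds a fixed string to readYesConfirmation.
func confirmFromString(s string) bool {
	return readYesConfirmation(strings.NewReader(s))
}

func main() {
	fmt.Println(confirmFromString("y\n"))
	fmt.Println(confirmFromString("yes\n"))
	fmt.Println(confirmFromString(""))
}
```

In the launch path the reader would be os.Stdin, with a false result mapped to exit 1.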
Linux + macOS classification; Windows path handling deferred per
plan (treated as TierOK there until follow-up).
Refs: docs/superpowers/plans/2026-05-23-startup-safety-banner.md
Implements P-1 through P-6 of the prefer-routing-policy plan.
Adds a config knob that biases routing toward local arms, cloud
arms, or leaves selection unchanged. Default "auto" is
byte-identical to pre-change behavior (the new armTier path with
PreferAuto returns the same value as the old single-arg function).
Mechanism diverged from the plan after empirical testing:
The plan called for a score multiplier applied in bestScored.
Tests revealed the existing cost-floor math (scoreArm divides by
weighted cost which collapses to ~0.001 for free local arms) gives
local arms a ~280x raw-score advantage that a 0.3-0.5 multiplier
can't overcome. A tier-shift in armTier turned out cleaner:
  PreferLocal: cloud arms (true API, IsLocal=false && !IsCLIAgent)
               get +2 tier shift, landing behind locals.
  PreferCloud: IsLocal arms get +2 tier shift, landing behind
               cloud. SLM tier-0 arms shift to tier 2, still
               below cloud's tier 3, so the SLM-protection
               semantic (small stuff stays on the small model)
               survives PreferCloud. This resolves the open
               question in the plan: yes, SLMs keep winning
               under PreferCloud by design.
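The tier shift can be sketched as below. The Arm struct, its BaseTier field, and the two-argument armTier signature are assumptions for illustration; only the +2 shift, the IsLocal / IsCLIAgent conditions, and the tier-0 SLM vs tier-3 cloud figures come from this message.

```go
package main

import "fmt"

type PreferPolicy int

const (
	PreferAuto PreferPolicy = iota
	PreferLocal
	PreferCloud
)

// Arm is a reduced, hypothetical view of a router arm. BaseTier stands in
// for whatever tier the pre-change classifier would have returned.
type Arm struct {
	Name       string
	IsLocal    bool
	IsCLIAgent bool
	BaseTier   int
}

// armTier applies the policy shift on top of the base tier. Lower tiers are
// tried first. PreferAuto returns the base tier unchanged.
func armTier(a Arm, policy PreferPolicy) int {
	tier := a.BaseTier
	isCloudAPI := !a.IsLocal && !a.IsCLIAgent
	switch policy {
	case PreferLocal:
		if isCloudAPI {
			tier += 2
		}
	case PreferCloud:
		if a.IsLocal {
			tier += 2
		}
	}
	return tier
}

func main() {
	slm := Arm{Name: "slm", IsLocal: true, BaseTier: 0}
	cloud := Arm{Name: "cloud", BaseTier: 3}
	for _, p := range []PreferPolicy{PreferAuto, PreferLocal, PreferCloud} {
		fmt.Printf("policy=%d slm=%d cloud=%d\n", p, armTier(slm, p), armTier(cloud, p))
	}
}
```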
The policyMultiplier was kept in bestScored as a within-tier
nudge (mostly cosmetic in practice given the cost-floor dynamics
described above; could matter when costs are calibrated). Worth
revisiting once router-wide cost calibration lands.
Strengths cross-tier promotion is unaffected: the promoted-set
path in selectBest bypasses armTier entirely, so a strongly-tagged
cloud arm still wins SecurityReview tasks under PreferLocal
(validated by TestPreferPolicy_StrengthsBeatsMultiplier).
CLI-agent subprocess arms count as "local" for PreferLocal
purposes — they proxy to cloud but the user-visible behavior is
local. Users who want to exclude them can use --provider X.
Forced arms (--provider X) and incognito take priority over the
policy: the forced-arm test pins the former, and the
incognito-still-wins test pins the LocalOnly hard filter dominating
PreferCloud.
Test coverage (prefer_test.go): ParsePreferPolicy / String round
trips; policyMultiplier table; acceptance scenarios across all
three policies with adjacent-tier arms; SLM-still-wins under
PreferCloud; Strengths beats multiplier; forced-arm bypass;
incognito beats prefer; lone cloud arm wins when no local feasible.
Refs: docs/superpowers/plans/2026-05-23-prefer-routing-policy.md
Closes R-4 and R-5 of the routing-defaults plan.
R-4: Strengths + CostWeight defaults for closed frontier models.
Cloud entries land in the same knownFamilyDefaults table as local
ones, with MaxComplexity intentionally left zero (cloud arms get
no complexity ceiling). CostWeight tuned per the plan's rationale:
  claude-opus-4-7    → Planning/SecurityReview/Debug/Refactor, 0.3
  claude-sonnet-4-6  → Generation/Refactor/Review, 0.7
  gpt-5.5            → Planning/SecurityReview/Generation, 0.3
  gpt-5.3-codex      → Generation/Refactor/Debug/UnitTest, 0.6
  gpt-5.2            → Orchestration/Review, 0.8
  gemini-3.1-pro     → Planning/Review/Orchestration, 0.5
  gemini-3.5-flash   → Boilerplate/Explain/Orchestration, 1.2
The 0.3 weight on frontier arms keeps them competitive on
SecurityReview / Planning despite $4+/Mtok; 1.2 on Gemini Flash
penalizes cost more so it only wins when cost is genuinely
decisive (boilerplate, explain).
Mechanism: extracted applyFamilyDefaults into defaults.go and call
it from Router.RegisterArm. Single source of truth — both local
discovery and the primary-provider path in cmd/gnoma/main.go now
flow through the same defaults application. Removed the duplicate
apply block from RegisterDiscoveredModels.
Legacy model IDs (claude-opus-4-20250514, gpt-4o, o3, gemini-2.5-pro,
etc.) intentionally do not match any table entry — keeps users on
pinned older models safe from imposed 2026 Strengths.
R-5: gpt-5.3-codex registration.
- internal/provider/openai/provider.go: added to fallbackModels
and inferOpenAIModelCapabilities (400K context, 32K output).
- internal/provider/ratelimits.go: gpt-5.3-codex and its dated
alias gpt-5.3-codex-2026-02-15 added with the same Tier 1
quotas as gpt-5.2.
Gemini 3.x (3.1-pro-preview, 3.5-flash, 3.1-flash-lite) was already
registered in both google/provider.go and ratelimits.go — no change
needed for that part of R-5.
Test coverage:
- ResolveFamilyDefaults table-driven across all 7 cloud entries
including prefix-sharing (gpt-5.5-pro → gpt-5.5 defaults,
gemini-3.1-pro-preview → gemini-3.1-pro defaults).
- Legacy IDs return !ok.
- RegisterArm applies cloud defaults end-to-end.
- User-supplied Strengths and CostWeight are not overridden.
- ID.Model() fallback works when ModelName is empty (test code
often constructs arms this way).
Refs: docs/superpowers/plans/2026-05-23-routing-defaults-refresh.md
Expands the family-defaults scaffold to 23 entries covering the local
models that currently appear in real Ollama fleets: coder specialists
(qwen3-coder, devstral, qwen2.5-coder, yi-coder, deepseek-coder,
starcoder), reasoners (phi-4, phi-4-mini), Gemma 2/3/4 (including the
"edge" e2b/e4b variants under both Ollama and GGUF naming), Qwen
2.5/3/3.5 with a catch-all qwen entry, Mistral/Ministral (incl. the
24B mistral-small-3), Llama 3.2/4, tiny3.5 (reec's distill family),
Granite, GLM (incl. glm-ocr specialist), and MiniCPM-V.
Five families that span wide parameter ranges (qwen3.5, qwen3,
qwen2.5, ministral-3, tiny3.5) now use SizeCap ladders instead of a
flat MaxComplexity. A new parseSizeFromModelID helper splits the
model ID on ':', '/', '-', and '_' and matches pure <N>b/<N>m tokens,
correctly ignoring qwen3.5 version strings, e2b edge tags, a3b MoE
active params, and v0.3 version suffixes.
ResolveMaxComplexity wraps ResolveFamilyDefaults plus the SizeCap
traversal, falling back to the smallest cap when size parsing fails
(conservative). Discovery's apply path now goes through it so
SizeCap entries actually take effect.
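A sketch of the size parser described above. The helper name comes from this commit; the regex, the return type (size in billions of parameters), and the sample model IDs are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// sizeTokenRe matches a token that is purely a number followed by b or m,
// e.g. "30b", "1.5b", "270m". Tokens like "e2b", "a3b", "qwen3.5", or
// "v0.3" do not match because they start with a letter or lack the unit.
var sizeTokenRe = regexp.MustCompile(`^(\d+(?:\.\d+)?)([bm])$`)

// parseSizeFromModelID returns the parameter count in billions, or false
// when no pure size token is present.
func parseSizeFromModelID(id string) (float64, bool) {
	tokens := strings.FieldsFunc(strings.ToLower(id), func(r rune) bool {
		return r == ':' || r == '/' || r == '-' || r == '_'
	})
	for _, tok := range tokens {
		m := sizeTokenRe.FindStringSubmatch(tok)
		if m == nil {
			continue
		}
		n, err := strconv.ParseFloat(m[1], 64)
		if err != nil {
			continue
		}
		if m[2] == "m" {
			n /= 1000 // millions to billions
		}
		return n, true
	}
	return 0, false
}

func main() {
	// Sample IDs are illustrative, not taken from the test table.
	for _, id := range []string{"qwen3.5:9b", "qwen3-coder:30b-a3b", "gemma3n:e2b", "mistral:7b-instruct-v0.3"} {
		size, ok := parseSizeFromModelID(id)
		fmt.Println(id, size, ok)
	}
}
```

When parsing fails, the caller described above falls back to the smallest cap in the ladder, which keeps an unrecognized ID on the conservative side.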
Test coverage:
- parseSizeFromModelID (11 cases)
- ResolveFamilyDefaults longest-prefix discipline (19 cases)
- Unknown-family fallback returns !ok
- ResolveMaxComplexity size-keyed ladder (13 cases)
- Size-parse-failure fallback
- knownFamilyDefaults invariants: SizeCaps ordered largest-first,
SizeCaps and MaxComplexity mutually exclusive per entry
- Routing-payoff integration: 3 arms (tiny3.5:1.5b, phi-4:14b,
qwen3-coder:30b) get picked for TaskBoilerplate / TaskPlanning /
TaskGeneration respectively, without any [[arms]] config
- Local fleet visibility: the maintainer's actual `ollama ls`
inventory registers correctly with expected MaxComplexity and
Strengths; embeddinggemma stays filtered out
The Planning sub-case surfaced a separate issue worth flagging:
heuristicQuality floors out at 0.55 for a generic 14B local model
without ThinkingModes, below TaskPlanning's 0.60 threshold. The test
mutates phi-4's capabilities post-registration to reflect reality
(phi-4 is reasoning-tuned). A discovery-side thinking-capability
detection is out of scope for this plan but flagged in the test
comment for follow-up.
Refs: docs/superpowers/plans/2026-05-23-routing-defaults-refresh.md
Discovery previously registered every model returned by Ollama as a
chat arm, including embeddings, ASR, TTS, audio realtime, and
rerankers — which then failed at inference time when the router
selected them. Local arms also shipped with all-zero defaults, so
selection between e.g. tiny3.5:1.5b, phi-4:14b, and qwen3-coder:30b
was effectively random.
This change covers tasks R-1, R-2, R-6 from the routing-defaults plan.
- nonChatModelPatterns + isNonChatModel substring matcher; matched
IDs are skipped during RegisterDiscoveredModels. Covers whisper,
moonshine, kokoros, vibevoice, -asr, -tts, -audio, -embedding,
embeddinggemma, -reranker, lfm2.
- knownVisionModelPrefixes gains gemma4, gemma-4, glm-ocr. gemma3
and minicpm-v entries stay for regression coverage.
- New internal/router/defaults.go with FamilyDefaults struct,
knownFamilyDefaults map, and ResolveFamilyDefaults longest-prefix
lookup (with org/-namespace stripping so reecdev/tiny3.5:1.5b
resolves to "tiny3.5"). Single entry for now: functiongemma is
registered with Disabled=true and MaxComplexity=0.40, reserved for
the future ArmRoleToolRouter path. Table will grow in R-3.
- RegisterDiscoveredModels consults ResolveFamilyDefaults and only
populates fields that are still zero on the arm, so user [[arms]]
overrides keep priority.
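The longest-prefix lookup with namespace stripping might look roughly like this. The struct fields beyond Disabled and MaxComplexity, the three-value return, and every table entry other than functiongemma are assumptions added for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// FamilyDefaults is a reduced version of the struct named in this commit.
type FamilyDefaults struct {
	MaxComplexity float64
	Disabled      bool
}

// knownFamilyDefaults: only the functiongemma entry is from the commit.
// The other keys and values are illustrative placeholders.
var knownFamilyDefaults = map[string]FamilyDefaults{
	"functiongemma": {Disabled: true, MaxComplexity: 0.40},
	"tiny3.5":       {MaxComplexity: 0.30},
	"qwen":          {MaxComplexity: 0.50},
	"qwen3":         {MaxComplexity: 0.60},
}

// ResolveFamilyDefaults strips any org/ namespace, then returns the entry
// whose key is the longest prefix of the remaining model ID.
func ResolveFamilyDefaults(modelID string) (string, FamilyDefaults, bool) {
	id := strings.ToLower(modelID)
	if i := strings.LastIndex(id, "/"); i >= 0 {
		id = id[i+1:]
	}
	best := ""
	for prefix := range knownFamilyDefaults {
		if strings.HasPrefix(id, prefix) && len(prefix) > len(best) {
			best = prefix
		}
	}
	if best == "" {
		return "", FamilyDefaults{}, false
	}
	return best, knownFamilyDefaults[best], true
}

func main() {
	for _, id := range []string{"reecdev/tiny3.5:1.5b", "qwen3:8b", "qwen2.5:7b", "gpt-4o"} {
		family, d, ok := ResolveFamilyDefaults(id)
		fmt.Printf("%-22s family=%q defaults=%+v ok=%v\n", id, family, d, ok)
	}
}
```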
Plans:
- docs/superpowers/plans/2026-05-23-routing-defaults-refresh.md
- docs/superpowers/plans/2026-05-23-tool-router-specialization.md
TODO.md surfaces both as in-flight items.
codex 0.133.0 emits two token-accounting fields at top level that
we previously dropped:
  cached_input_tokens      subset of input_tokens that hit the prompt
                           cache (cheaper, but still counted in
                           input_tokens per OpenAI Responses API
                           semantics)
  reasoning_output_tokens  separately reported billable thinking
                           tokens on reasoning-capable models
Map cached_input_tokens to message.Usage.CacheReadTokens and subtract
it from InputTokens. message.Usage.Add() sums InputTokens and
CacheReadTokens as peers, so the uncached residual goes in
InputTokens — matches the anthropic provider's convention and keeps
cumulative usage tracking arithmetically correct.
Fold reasoning_output_tokens into OutputTokens for accurate cost
tracking. The top-level peer positioning (vs nested in
output_tokens_details) implies a separately counted billable
quantity, not a subset of output_tokens.
Defensive clamp at zero in case a future codex build reports
cached > input due to schema drift. Includes a verbatim regression
guard against the live 2026-05-22 codex 0.133.0 output to catch
schema changes early.
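The mapping above, sketched with stand-in types. The codexUsage and Usage structs and the toUsage helper are hypothetical; only the field names, the subtraction, the zero clamp, and the reasoning fold-in are taken from this message.

```go
package main

import "fmt"

// codexUsage mirrors the top-level token fields named in this commit.
type codexUsage struct {
	InputTokens           int64
	CachedInputTokens     int64
	OutputTokens          int64
	ReasoningOutputTokens int64
}

// Usage is a reduced stand-in for message.Usage.
type Usage struct {
	InputTokens     int64
	OutputTokens    int64
	CacheReadTokens int64
}

// toUsage stores the uncached residual in InputTokens (clamped at zero in
// case cached > input) and folds reasoning tokens into OutputTokens.
func toUsage(u codexUsage) Usage {
	uncached := u.InputTokens - u.CachedInputTokens
	if uncached < 0 {
		uncached = 0
	}
	return Usage{
		InputTokens:     uncached,
		CacheReadTokens: u.CachedInputTokens,
		OutputTokens:    u.OutputTokens + u.ReasoningOutputTokens,
	}
}

func main() {
	// Made-up numbers, not the recorded codex 0.133.0 output.
	fmt.Printf("%+v\n", toUsage(codexUsage{
		InputTokens: 100, CachedInputTokens: 40,
		OutputTokens: 20, ReasoningOutputTokens: 5,
	}))
}
```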
Add a deterministic pre-extractor that skips known-safe token shapes
before they reach the entropy scorer. Targets the false-positive
regime that bites under lowered entropy_threshold or
redact_high_entropy = true — UUIDs (~3.4 bits), SHA hex digests
(~3.9 bits), ISO-8601 timestamps, and HTTP(S) URLs.
Config knob lives under the existing security section to match
entropy_threshold / redact_high_entropy convention:
  [security]
  entropy_safelist = ["uuid", "sha_hex", "iso8601", "url"]
Empty / unset preserves pre-F-1 behaviour exactly — users opt in.
Per-pattern Debug telemetry fires on every skip (pattern name +
token length, never the token bytes). This is the data F-2's
go/no-go gate depends on; the plan literally specifies it.
NewFirewall validates names at the config boundary and emits a
Warn for unknown entries so a typo like "uid" instead of "uuid"
surfaces loudly instead of silently disabling FP reduction.
Tests cover: UUID/SHA-1/SHA-256 skipped at lowered threshold,
mixed payload (safe shape + real secret) preserves the secret,
secret-adjacent-to-UUID regression guard, empty safelist preserves
pre-F-1 behaviour, unknown name silently dropped at scanner level
but warned at firewall level, end-to-end FirewallConfig wiring,
and the skip-telemetry log line.
F-2 remains gated on real-workload FP-rate observations.
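A rough sketch of the safelist pre-extractor and the config-boundary validation. The four pattern names come from the config example above; the regexes themselves, and the validateSafelist / matchSafeShape helpers, are assumptions and may be looser or stricter than the shipped patterns.

```go
package main

import (
	"fmt"
	"regexp"
)

// safeShapes maps a safelist name to an anchored pattern for that shape.
var safeShapes = map[string]*regexp.Regexp{
	"uuid":    regexp.MustCompile(`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`),
	"sha_hex": regexp.MustCompile(`^(?:[0-9a-fA-F]{40}|[0-9a-fA-F]{64})$`),
	"iso8601": regexp.MustCompile(`^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})$`),
	"url":     regexp.MustCompile(`^https?://\S+$`),
}

// validateSafelist splits configured names into known and unknown, so the
// caller can warn about typos such as "uid".
func validateSafelist(names []string) (known, unknown []string) {
	for _, n := range names {
		if _, ok := safeShapes[n]; ok {
			known = append(known, n)
		} else {
			unknown = append(unknown, n)
		}
	}
	return known, unknown
}

// matchSafeShape returns the first enabled pattern that matches token.
// A match means the token is skipped before entropy scoring.
func matchSafeShape(token string, enabled []string) (string, bool) {
	for _, n := range enabled {
		if re, ok := safeShapes[n]; ok && re.MatchString(token) {
			return n, true
		}
	}
	return "", false
}

func main() {
	known, unknown := validateSafelist([]string{"uid", "uuid", "sha_hex"})
	fmt.Println("known:", known, "unknown:", unknown)
	name, ok := matchSafeShape("123e4567-e89b-12d3-a456-426614174000", known)
	// Log the pattern name and token length only, never the token bytes.
	fmt.Println(name, ok)
}
```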
The worktree commit 12a6b83 dropped the "Native agy JSON output"
backlog item alongside removing the agy agent. Since we restored
agy in this branch, the TODO is relevant again — agy v1.0.0 still
emits plain text and the prompt-augmentation fallback should be
replaced by --output-format stream-json once the CLI supports it.
Switch TestTryLoadOAuthCredentials_Formats to t.TempDir() to drop
the unchecked os.RemoveAll defer that golangci-lint's errcheck
caught after the merge.
Brings in the Google auth precedence work (agy > gemini > ADC
credential walk, fileTokenProvider expiry handling, slog-backed
error reporting), the Codex CLI integration as a new subprocess
agent, and the restoration of the agy subprocess agent that was
accidentally removed by the initial codex commit. Sandbox-bypass
flags on both agy and codex are now opt-out via env vars
(GNOMA_AGY_BYPASS_PERMISSIONS, GNOMA_CODEX_BYPASS_SANDBOX).
Includes review-driven fixes:
- ADC fallback now uses real DetectOptions (cloud-platform scope)
- fileTokenProvider returns an error on expired tokens instead
of shipping a known-dead bearer
- TestNew_Precedence asserts which credential was actually picked
- codex parser tolerates non-JSON banner / debug lines on stdout
- codex usage takes max(input_tokens, prompt_tokens) so accounting
can't silently undercount
No conflicts expected with the dev image-content feature: the
worktree branch only touches the google and subprocess provider
families.
The original commit on this branch replaced the agy subprocess agent
with codex (overwriting the slot in knownAgents, deleting agy_test.go
and the agyParser). That was unintentional — agy (antigravity) is a
distinct CLI from codex (OpenAI's). Antigravity will replace gemini
when gemini retires on 2026-06-16, so it needs to keep its own slot.
Restored: FormatAgyText constant, agyParser with newAgyParser and
the line-delimited text parser, the agy CLIAgent entry in
knownAgents with PromptResponseFormat:true, agy_test.go, and the
agy case in newParser. Sourced from the parent commit so behavior
matches what shipped before the codex change.
Sandbox bypass: both agy (--dangerously-skip-permissions) and codex
(--dangerously-bypass-approvals-and-sandbox) need a flag to run
non-interactively (their stdin is closed; without it they block on
approval prompts nobody can answer). Both default to ON for
out-of-box behavior; operators with pre-approved trust config can
opt out via GNOMA_AGY_BYPASS_PERMISSIONS=0 or
GNOMA_CODEX_BYPASS_SANDBOX=0. Tests cover the on / opt-out / unknown
value branches.
TestKnownAgents_ValidFormats updated to accept the restored
FormatAgyText.
Codex emits banner / debug / "starting turn" lines to stdout
interleaved with the JSON event stream. The parser previously
returned an error on any line that wasn't a JSON object, which
subprocessStream.Next treats as terminal — one stray banner
aborted the whole turn. Skip lines that don't start with `{`
after whitespace trim, and downgrade unparseable JSON-looking
lines to a slog.Debug so they don't kill the stream either.
Token accounting: usage payloads from newer codex builds
occasionally carry both input_tokens and prompt_tokens (and
likewise output / completion) with slightly different values.
Always use the larger of the two so we can't silently undercount.
Tests cover non-JSON banner skipping, malformed-JSON
non-fatal-skip, and the max() behavior with both token
fields populated.
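A sketch of the tolerant line handling and the max() accounting. The flat usagePayload shape and the parseUsageLine helper are hypothetical simplifications; the real codex event envelope is not shown in this message.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log/slog"
	"strings"
)

// usagePayload is an assumed flat shape carrying both naming variants.
type usagePayload struct {
	InputTokens      int64 `json:"input_tokens"`
	PromptTokens     int64 `json:"prompt_tokens"`
	OutputTokens     int64 `json:"output_tokens"`
	CompletionTokens int64 `json:"completion_tokens"`
}

// parseUsageLine returns ok=false for lines that should be skipped:
// banner or debug text, and JSON-looking lines that fail to parse.
func parseUsageLine(line string) (usagePayload, bool) {
	trimmed := strings.TrimSpace(line)
	if !strings.HasPrefix(trimmed, "{") {
		return usagePayload{}, false // banner or debug line, not an event
	}
	var u usagePayload
	if err := json.Unmarshal([]byte(trimmed), &u); err != nil {
		slog.Debug("skipping unparseable codex line", "err", err)
		return usagePayload{}, false
	}
	return u, true
}

// inputTokens and outputTokens take the larger of the two variants so the
// accounting cannot silently undercount.
func inputTokens(u usagePayload) int64  { return max(u.InputTokens, u.PromptTokens) }
func outputTokens(u usagePayload) int64 { return max(u.OutputTokens, u.CompletionTokens) }

func main() {
	lines := []string{
		"starting turn",
		"{not json",
		`{"input_tokens": 10, "prompt_tokens": 12, "output_tokens": 7, "completion_tokens": 5}`,
	}
	for _, l := range lines {
		if u, ok := parseUsageLine(l); ok {
			fmt.Println(inputTokens(u), outputTokens(u))
		}
	}
}
```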
credentials.DetectDefault(nil) always returns "options must be
provided", which made the ADC branch unreachable. Pass an explicit
DetectOptions with the cloud-platform scope so users with
GOOGLE_APPLICATION_CREDENTIALS or `gcloud auth application-default
login` actually flow through ADC instead of falling out as
"no credentials found".
fileTokenProvider.Token used to return expired tokens unchanged.
We don't perform an OAuth refresh exchange (the upstream CLI does
that out-of-band into the file we read), so when the file isn't
fresh the only safe move is to fail loudly with an actionable
message rather than ship a known-dead bearer that genai forwards
to Vertex AI, only to get back a confusing 401.
tryLoadOAuthCredentials previously swallowed all errors equally,
so the precedence walker silently skipped past misconfigured files
(chmod 0600 on the wrong user, half-written JSON, etc.). Now
os.IsNotExist is silent (normal walking), everything else gets a
slog.Warn with the path so an unreadable file is visible.
selectOAuthCredentials extracts the precedence chain into a
testable helper that also returns a CredentialSource tag
identifying which path was chosen. The previous precedence test
only asserted err == nil; the new test verifies that the agy file
wins when both are present and that the fallback to gemini
actually loads the gemini token.
Ctrl+V image paste used to write the file to .gnoma/pasted_image_*.png
under the project root, which polluted the workdir and risked
committing screenshots that may contain sensitive content.
Now writes to os.UserCacheDir() / gnoma / pasted-images/ (XDG cache
on Linux, ~/Library/Caches on macOS, %LocalAppData% on Windows).
The directory is created at 0700 and files at 0600 since pasted
content can be sensitive.
Each paste prunes entries older than 2 hours best-effort, so the
cache doesn't accumulate across sessions. The 2h window safely
covers any single turn including provider retries and slow
subprocess CLIs that need the file to still exist on disk when
they ingest the path.
.gitignore: cover the legacy `.gnoma/pasted_image_*` location for
old checkouts; add log.txt and codex_out.jsonl which were tracked
as runtime artifacts during the recent work.
Tests cover cache-path placement, restrictive perms on both the
directory and the file, the no-pollution-of-cwd invariant, and the
prune behavior (stale removed, fresh kept, missing dir no-op).
When the user message has at least one ImageContent block, build a
ChatCompletionContentPartUnionParam array with text + image_url
parts instead of the string content path. Image bytes are inlined
as a base64 data URL (data:<media-type>;base64,...). Adjacent text
blocks are merged into a single TextContentPart. Pure-text user
messages stay on the existing string fast path.
This covers OpenAI direct + every openaicompat backend (Ollama,
llama.cpp, llamafile) since they all share the same provider.
Tests: pure text uses OfString; image present emits 2 content parts
(text + image_url with the expected base64 payload); nil-Image
blocks are dropped and adjacent text merges correctly.
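The data URL construction is simple enough to sketch directly. The imageDataURL helper name is hypothetical; the data:<media-type>;base64,... shape is the one described above.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// imageDataURL inlines raw image bytes as a base64 data URL suitable for
// an image_url content part.
func imageDataURL(mediaType string, data []byte) string {
	return "data:" + mediaType + ";base64," + base64.StdEncoding.EncodeToString(data)
}

func main() {
	// "hi" stands in for real image bytes.
	fmt.Println(imageDataURL("image/png", []byte("hi")))
}
```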
buildUserMessage replaces the unconditional NewUserText wrap inside
SubmitWithOptions. When the active model advertises Vision and the
input contains [Image: /path] markers, the markers are inlined as
ImageContent blocks carrying the file bytes; otherwise the input is
passed through as a single text block (legacy behavior preserved
for subprocess CLIs that auto-ingest paths, e.g. gemini-cli).
image_input.go:
- imageMarkerRe extracts each [Image: ...] occurrence.
- Per marker: validates absolute path, file (not dir), size cap of
10 MiB, image/* media type via http.DetectContentType.
- On any validation failure, the marker is left as literal text and
a warning is recorded — the turn still proceeds.
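A sketch of the marker extraction and per-marker validation listed above. The regex, the helper names, and the error wording are assumptions; the checks themselves (absolute path, regular file, 10 MiB cap, image/* via http.DetectContentType) follow the list.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"os"
	"path/filepath"
	"regexp"
	"strings"
)

// imageMarkerRe is an assumed pattern for the [Image: /path] marker.
var imageMarkerRe = regexp.MustCompile(`\[Image: ([^\]]+)\]`)

const maxImageBytes = 10 << 20 // 10 MiB

// extractImagePaths returns the path inside each marker, in input order.
func extractImagePaths(input string) []string {
	var paths []string
	for _, m := range imageMarkerRe.FindAllStringSubmatch(input, -1) {
		paths = append(paths, strings.TrimSpace(m[1]))
	}
	return paths
}

// validateImage applies the checks in order. On error the caller would
// leave the marker as literal text and record a warning.
func validateImage(path string) error {
	if !filepath.IsAbs(path) {
		return errors.New("image path must be absolute")
	}
	info, err := os.Stat(path)
	if err != nil {
		return err
	}
	if info.IsDir() {
		return errors.New("image path is a directory")
	}
	if info.Size() > maxImageBytes {
		return errors.New("image exceeds 10 MiB")
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	if mt := http.DetectContentType(data); !strings.HasPrefix(mt, "image/") {
		return fmt.Errorf("not an image: %s", mt)
	}
	return nil
}

func main() {
	for _, p := range extractImagePaths("compare [Image: /tmp/a.png] with [Image: b.jpg]") {
		fmt.Println(p, validateImage(p))
	}
}
```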
Routing: latestUserHasImages drives task.RequiresVision in both the
primary stream attempt and the retryOnTransient path, so failover
arms also respect the vision requirement.
Tests cover: no markers (single text block), single image
(bytes captured into Image.Data, MediaType set), missing file
(literal fallback + warning), relative path rejection, oversized
rejection, non-image file rejection, multiple images interleaved
with text.
Task gains a RequiresVision bool; filterFeasible enforces it on
both the primary feasibility pass and the last-resort fallback
(no degradation to a non-vision arm — the model literally cannot
consume image bytes).
Ollama discovery now probes /api/show for vision capability:
- details.families containing "clip" / "mllama" / "*vl"
- capabilities array containing "vision" (newer Ollama)
- name-prefix fallback for releases that predate either
(llava, qwen2.5-vl, llama3.2-vision, moondream, pixtral, etc.)
OllamaProbeResult replaces the map[string]bool tool cache so the
single /api/show call can populate tools + vision + ctx-size in
one probe. DiscoverOllama / DiscoverLocalModels signatures updated;
nil-cache callers in cmd/gnoma keep working unchanged.
RegisterDiscoveredModels propagates SupportsVision into the arm's
Capabilities.Vision.
Tests cover RequiresVision filtering in both the happy path
(vision-only arm chosen when image present) and the fallback path
(non-vision arm rejected even as last resort).
Extends the Content discriminated union with a fifth variant for
inline image payloads. Image carries the raw bytes (captured at
user-input time so the message snapshot is self-contained and
survives source-file deletion), the IANA media type for the
provider's image part, and the original path for logging.
HasImages() lets providers decide whether to fall back to a
text-only representation; providers that don't know about
ContentImage will simply skip those blocks via TextContent().
Bundles the pending TUI work into a coherent batch. Bug fixes from
external review:
* expandPlaceholders: single-pass alternation regex over the original
input prevents `#p\d+` / `#img\d+` tokens inside pasted content from
being re-expanded after the bracket form is inlined.
* /incognito: gate savePromptHistory and the Ctrl+V image-write branch
on `!m.incognito` so the no-persistence contract holds.
* history.txt: write at mode 0600 (chmod existing 0644 files), create
parent dir at 0700, truncate to 500 entries on every save, slog.Warn
on errors instead of swallowing.
* triggerPickerAction: guard m.config.Engine before SetModel, matching
the /model handler.
* Picker key handler: navigation/enter/q consume, escape/ctrl+c close
the picker AND fall through to global handlers (so streaming cancel
and double-tap quit work with an overlay open), default swallows
stray input.
* Paste line count: report total non-empty lines instead of newline
count, ignoring trailing newlines (no more "+0 lines" for "abc").
* Ctrl+O restored to expand-output; Ctrl+Y is the new copy-response
bind. /keys help text updated; picker help entries reordered.
* Tighter perms on .gnoma/pasted_image_*.png (0600).
Race-safety refactor: ApplyTheme used to mutate ~25 package-level
lipgloss styles in place. Replaced with an immutable themeStyles
snapshot and atomic.Pointer[themeStyles] swap. Readers go through a
theme() helper (one atomic load) instead of touching package vars
directly. No locks, no nested-RLock risk if rendering ever moves
off-thread.
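The snapshot-and-swap pattern, sketched with a two-field themeStyles. The field names and the ApplyTheme signature are assumptions; the real snapshot holds roughly 25 lipgloss styles.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// themeStyles is an immutable snapshot. It is never mutated after being
// stored, so readers need no lock.
type themeStyles struct {
	Name   string
	Accent string
}

var currentTheme atomic.Pointer[themeStyles]

func init() {
	currentTheme.Store(&themeStyles{Name: "default", Accent: "#7aa2f7"})
}

// theme returns the current snapshot with a single atomic load.
func theme() *themeStyles {
	return currentTheme.Load()
}

// ApplyTheme builds a fresh snapshot and swaps it in. Readers that already
// hold the old pointer keep a consistent view.
func ApplyTheme(name, accent string) {
	currentTheme.Store(&themeStyles{Name: name, Accent: accent})
}

func main() {
	old := theme()
	ApplyTheme("dark", "#bb9af7")
	fmt.Println(old.Name, theme().Name)
}
```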
Includes pre-existing in-flight work: TUISection in config with
persistent theme/vim settings; /copy /theme /vim slash commands;
provider-name completion; session.SetProvider for the provider picker.
Tests: placeholder_test.go (6 regression + happy-path cases including
the pasted-content collision), history_test.go (5 cases covering perms
on new and existing files, on-disk truncation, blank-input, newline
flattening), provider_test.go (provider switching + picker transitions
+ SLM gating).
`internal/mcp/transport.go` used syscall.Setpgid and syscall.Kill
unconditionally, both Unix-only. Split the platform bits into
`transport_unix.go` (build tag `!windows`) keeping the existing
process-group semantics, and `transport_windows.go` (build tag
`windows`) falling back to `os.Process.Kill` (kills only the
immediate process — full process-tree kill on Windows would need
golang.org/x/sys/windows + job objects, deferred).
Caught by `goreleaser release --snapshot` cross-compiling for
windows/amd64 and windows/arm64.
Bump hard-coded provider defaults to the May 2026 lineup:
- Anthropic: claude-sonnet-4-6 (default); Opus 4.7 and Haiku 4.5 in
the fallback list. 4.6/4.7 generation has 1M context standard.
- OpenAI: gpt-5.5 (default); 5.5-pro / 5.2 / 5.2-chat-latest in
fallback. ThinkingModes now baseline on GPT-5.x.
- Google: gemini-3.5-flash (default); 3.1 Pro / Flash Lite in fallback.
- Mistral: mistral-large-latest unchanged (Mistral Large 3); add
mistral-medium-3.5, mistral-medium-2511, mistral-large-2512 to the
rate-limit map.
Legacy dated IDs retained in fallback lists and ratelimits maps so
configs pinned to claude-sonnet-4-20250514 / gpt-4o / gemini-2.5-flash
keep resolving. Capability tables (ContextWindow, MaxOutput,
ThinkingModes) updated to match each generation. CLI help text in
cmd/gnoma/main.go also updated.
Apply gofmt -w across the codebase (struct field comment realignment
only — no semantic changes) and silence two errcheck warnings on
fmt.Sscanf / fmt.Fprintf return values in internal/router/discovery
with explicit `_, _ =` discards. Required so `make check` is green
before tagging v0.1.0.
When a stream errors out before producing any user-visible content
(text, thinking, or tool calls), the engine now transparently retries
on the next-best arm instead of bubbling the error to the TUI. Covers
the case from the post-SLM screenshot: subprocess CLI agents that
exit non-zero on auth/config failures, network drops mid-stream,
rate-limited arms whose error surfaces after Stream() already returned.
Mechanism: the stream-create + consume blocks are wrapped in a labeled
streamLoop. On s.Err() != nil with empty accumulator, the engine emits
a new EventFailover ("↻ <failed_arm> failed (<reason>) — retrying on
another arm"), excludes the failed arm via task.ExcludedArms, and
re-enters the loop. Cap of 4 failovers per round.
Guards:
- !acc.HasContent() — if text/tool calls already streamed, fail loud
rather than duplicate visible output on retry.
- isFailoverable(err), a deny-list approach: context.Canceled,
context.DeadlineExceeded, and HTTP 400/413 are fatal; everything else
(auth, rate limit, 5xx, subprocess exit, network) is failoverable.
- Router.ForcedArm() == "" — when the user pinned an arm via --provider,
failover is disabled by design.
- failoverAttempt < maxFailovers — bounded retry budget.
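The deny-list classifier from the guards above could be sketched as follows. The httpStatusError type is hypothetical, standing in for however provider errors actually expose an HTTP status; errUserCancel exists only to exercise the wrapped-cancel case.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// httpStatusError is a hypothetical error type carrying an HTTP status.
type httpStatusError struct{ Code int }

func (e *httpStatusError) Error() string { return fmt.Sprintf("http status %d", e.Code) }

// errUserCancel wraps context.Canceled the way a provider layer might.
var errUserCancel = fmt.Errorf("turn aborted: %w", context.Canceled)

// isFailoverable denies a short list of fatal errors and allows the rest.
func isFailoverable(err error) bool {
	if err == nil {
		return false
	}
	if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
		return false // the caller gave up, so retrying elsewhere is wrong
	}
	var he *httpStatusError
	if errors.As(err, &he) && (he.Code == 400 || he.Code == 413) {
		return false // the request itself is bad and would fail on any arm
	}
	return true // auth, rate limit, 5xx, subprocess exit, network
}

func main() {
	fmt.Println(isFailoverable(errUserCancel))
	fmt.Println(isFailoverable(&httpStatusError{Code: 413}))
	fmt.Println(isFailoverable(&httpStatusError{Code: 429}))
	fmt.Println(isFailoverable(errors.New("subprocess: exit status 1")))
}
```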
TUI renders EventFailover under the existing "cost" role styling.
shortFailReason strips the subprocess wrapper envelope so the user sees
"Invalid API key. Try again." instead of
"subprocess: exit status 1: Error: Invalid API key. Try again.".
Tests cover the classifier (isFailoverable, shortFailReason), end-to-end
auth-error failover, content-already-streamed guard, and context-cancel
guard. Deterministic across 10x -race runs by giving the failing arm
IsCLIAgent=true to anchor it in tier 0 ahead of the API-tier backup.
The router.SecureProvider interface previously required a public
IsSecure() bool method. Any test mock — or future production type —
could satisfy it by returning true, defeating the W1 "only wrapped
providers may flow past the boundary" contract through convention
rather than at the type level.
Replaces IsSecure() bool with an unexported security.Marker interface
that has a single secured() method. Go's method-set semantics key
unexported methods by their defining package, so only types declared in
internal/security can satisfy Marker. *SafeProvider gets the lone
secured() implementation; router.SecureProvider embeds Marker.
The seal forces every test mock that previously implemented IsSecure()
to either (a) be wrapped with security.WrapProvider(mp, nil) at the use
site, or (b) drop the method entirely if the mock never flows through
SecureProvider. 93 use sites across 11 test files were updated via a
per-package secureMock helper. WrapProvider with a nil firewall ref is
a no-op pass-through, so test behavior is unchanged.
Empirically: a type from outside internal/security can declare
`secured()` but the compiler will reject assigning it to
router.SecureProvider because the unexported method belongs to the
other package's namespace. Convention → compile-time guarantee.
3c87527 added engine/paths.go:resolveCanonical, duplicating the
ancestor-walk + EvalSymlinks algorithm that already lived in
fs/guard.go:ResolveWrite. Having two implementations of the same TOCTOU
defense is exactly the wrong shape for security code: a bug fix in one
would silently miss the other.
Extracts the shared algorithm to security.CanonicalizePath. Both call
sites become thin wrappers that pre-anchor relative paths against the
appropriate root (cwd for engine, workspace root for guard). The
"hit-root" defensive branch in engine's version (commented "highly
unlikely") is tightened to match guard's error behavior.
Adds focused unit tests for the helper covering existing path,
non-existent leaf, non-existent mid-component, symlinked ancestor, and
relative-path rejection.
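A minimal sketch of the ancestor-walk + EvalSymlinks idea, assuming the
helper takes an absolute path and returns an error for anything else.
canonicalizePath and demo are hypothetical names, not the signature of
security.CanonicalizePath. The sketch gives a dangling symlink leaf no
special treatment; a production helper would need to decide that case.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// canonicalizePath resolves symlinks in the deepest existing ancestor of an
// absolute path and re-attaches the components that do not exist yet.
func canonicalizePath(p string) (string, error) {
	if !filepath.IsAbs(p) {
		return "", fmt.Errorf("path %q is not absolute", p)
	}
	cur := filepath.Clean(p)
	var missing []string // non-existent trailing components, leaf first
	for {
		resolved, err := filepath.EvalSymlinks(cur)
		if err == nil {
			// Re-append the missing components in their original order.
			for i := len(missing) - 1; i >= 0; i-- {
				resolved = filepath.Join(resolved, missing[i])
			}
			return resolved, nil
		}
		if !errors.Is(err, os.ErrNotExist) {
			return "", err
		}
		parent := filepath.Dir(cur)
		if parent == cur {
			// Reached the filesystem root without finding anything.
			return "", fmt.Errorf("no existing ancestor for %q", p)
		}
		missing = append(missing, filepath.Base(cur))
		cur = parent
	}
}

// demo builds link -> target in a temp dir and canonicalizes a path whose
// last two components do not exist. It returns the result and the expectation.
func demo() (got, want string, err error) {
	root, err := os.MkdirTemp("", "canon")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(root)
	target := filepath.Join(root, "target")
	link := filepath.Join(root, "link")
	if err = os.Mkdir(target, 0o700); err != nil {
		return "", "", err
	}
	if err = os.Symlink(target, link); err != nil {
		return "", "", err
	}
	resolvedTarget, err := filepath.EvalSymlinks(target)
	if err != nil {
		return "", "", err
	}
	want = filepath.Join(resolvedTarget, "new", "file.txt")
	got, err = canonicalizePath(filepath.Join(link, "new", "file.txt"))
	return got, want, err
}

func main() {
	got, want, err := demo()
	fmt.Println(got == want, err)
}
```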
3c87527 rewrote DiscoverLlamaCPP to hit /props and emit a single hardcoded
"default" entry. That breaks two cases:
1. Multi-model llama.cpp deployments (llama-swap, model-routing proxies)
are collapsed to a single arm with a placeholder ID.
2. Single-model deployments lose the real model name — arms are
registered as llamacpp/default instead of llamacpp/<actual-id>.
Restores enumeration via /v1/models (the OpenAI-compatible endpoint
llama-server exposes) while keeping the concrete n_ctx read from /props.
/props is now best-effort: failure or missing n_ctx falls back to the
documented default rather than aborting discovery.
Adds three tests: multi-model enumeration with shared context, /props
unreachable, and the empty-/v1/models error path.
3c87527 refactored DiscoverOllama and DiscoverLlamaCPP and dropped two
behaviors:
1. The Ollama toolCache prune loop. Without it, the cache grows
unbounded across reconcile cycles and stale entries linger; a
model that disappears and reappears replays an out-of-date
tool-support verdict because the cache hit skips re-probing.
2. Sensible context-size defaults. Both probes can yield
ContextSize=0 (Ollama: no num_ctx in /api/show parameters;
llama.cpp: /props default_generation_settings without n_ctx).
Registering an arm with ContextWindow=0 misroutes — the post-SLM
two-stage path treats it as a tiny model.
Restores the prune loop, applies 32768 (ollama) / 8192 (llama.cpp) as
fallbacks at discovery time, and adds three tests covering each path.
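The two restored behaviors might look roughly like this. The fallback
values come from this message; the function names, constant names and the
map[string]bool cache shape are assumptions made for the sketch.

```go
package main

import "fmt"

const (
	defaultOllamaContext   = 32768
	defaultLlamaCPPContext = 8192
)

// contextOrDefault returns the probed size, or the fallback when the probe
// yielded zero (or a nonsensical negative value).
func contextOrDefault(probed, fallback int) int {
	if probed <= 0 {
		return fallback
	}
	return probed
}

// pruneToolCache drops entries for models missing from the latest listing, so
// a model that disappears and returns is probed again instead of replaying a
// stale verdict.
func pruneToolCache(cache map[string]bool, live []string) {
	keep := make(map[string]struct{}, len(live))
	for _, id := range live {
		keep[id] = struct{}{}
	}
	for id := range cache {
		if _, ok := keep[id]; !ok {
			delete(cache, id) // deleting while ranging over a map is safe in Go
		}
	}
}

func main() {
	cache := map[string]bool{"qwen": true, "gone-model": false}
	pruneToolCache(cache, []string{"qwen"})
	fmt.Println(len(cache), contextOrDefault(0, defaultOllamaContext))
}
```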
The full-block private_key regex (BEGIN…END span) added in 3c87527 fails
to match when the END marker is missing — log slices, buffered streams,
or partial dumps that contain only the header and key body would leak
the body. Adds a private_key_header pattern that matches the header
plus the trailing base64 body. Redact merges the overlapping spans into
a single placeholder when both fire on a complete block.
Covered by TestScanner_DetectsTruncatedPrivateKey (no END marker) and
TestRedact_PrivateKeyOverlap_SinglePlaceholder (overlap merge).
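A sketch of the header pattern plus span merge, under assumed regexes and
an assumed placeholder string; the scanner's real patterns are not shown
in this message. The body character class also matches ordinary
alphanumeric text, so this version over-redacts whatever directly follows
a truncated key.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

var (
	// fullBlock needs both markers; headerOnly covers the truncated case.
	fullBlock  = regexp.MustCompile(`(?s)-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----`)
	headerOnly = regexp.MustCompile(`-----BEGIN [A-Z ]*PRIVATE KEY-----[A-Za-z0-9+/=\s]*`)
)

const placeholder = "[REDACTED:private_key]"

// redact replaces every matched span with one placeholder, merging spans that
// overlap so a complete block does not produce two placeholders.
func redact(s string) string {
	spans := append(fullBlock.FindAllStringIndex(s, -1), headerOnly.FindAllStringIndex(s, -1)...)
	if len(spans) == 0 {
		return s
	}
	sort.Slice(spans, func(i, j int) bool { return spans[i][0] < spans[j][0] })
	merged := [][]int{spans[0]}
	for _, sp := range spans[1:] {
		last := merged[len(merged)-1]
		if sp[0] <= last[1] { // overlapping or touching: extend the previous span
			if sp[1] > last[1] {
				last[1] = sp[1]
			}
			continue
		}
		merged = append(merged, sp)
	}
	var b strings.Builder
	prev := 0
	for _, sp := range merged {
		b.WriteString(s[prev:sp[0]])
		b.WriteString(placeholder)
		prev = sp[1]
	}
	b.WriteString(s[prev:])
	return b.String()
}

func main() {
	complete := "-----BEGIN RSA PRIVATE KEY-----\nMIIEowIBAAKCAQEA\n-----END RSA PRIVATE KEY-----"
	truncated := "-----BEGIN RSA PRIVATE KEY-----\nMIIEowIBAAKCAQEA"
	fmt.Println(redact("key=" + complete))
	fmt.Println(redact(truncated))
}
```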
- Drop unverified JSONOutput/Vision capability claims on agy (no native
stream-json, no image-input path on v1.0.0).
- Replace agent.Name == "agy" check with PromptResponseFormat flag on
CLIAgent so the prompt-augmented JSON fallback scales to future agents.
- Pass --dangerously-skip-permissions in agy PromptArgs, mirroring
gemini --yolo / vibe --trust; required for non-interactive runs.
- Nil-guard JSONSchema and Schema bytes in buildPrompt (previously
panicked when ResponseJSON was requested without a schema).
- Rename misleading TestAgyProvider_StreamAugmentation to
TestAgyParser_EmitsLineDeltas; add coverage for nil-schema path and
non-augmenting agents.
Closes the last remaining 2026-05-19 audit finding by documenting the
existing transitive guarantee rather than restructuring the hook
contract.
The audit observed that PostToolUse hooks receive raw tool output
before the firewall scan runs, and proposed reordering or splitting
the event into raw-local-only and redacted-for-LLM variants. After
Wave 1 (SafeProvider boundary at every router arm + non-engine
provider consumer), the audit's threat model is closed transitively:
- Shell hooks see raw output but never reach an LLM.
- Prompt hooks route Stream calls through routerStreamer → router →
arm.Provider, every arm.Provider is now *SafeProvider, outgoing
messages are scanned at the boundary.
- Agent hooks spawn an elf whose engine has Firewall set;
buildRequest scans inline.
Reordering would regress legitimate shell-hook use cases (audit,
forensic, local alert) that need raw access. Splitting the contract
forces every existing hook config to migrate and introduces a
wrong-variant footgun. Neither is justified by the residual risk.
Three changes ship with the ADR:
- ADR-004 records the decision and the conditions for re-opening it.
- Doc comments on hook.PostToolUse and the dispatcher call site in
the engine point at the ADR.
- internal/hook/posttooluse_redaction_test.go locks in the invariant:
a prompt PostToolUse hook firing on a secret-bearing tool result
produces a redacted prompt at the inner provider. If this test
fails, ADR-004's Position A is no longer correct and the audit
finding re-opens.
Closes the cluster of audit findings where gnoma's incognito promise
('no persistence, no learning, local-only routing') silently broke
because state was duplicated across the CLI flag, the firewall's
IncognitoMode, the router's localOnly flag, and the TUI's local
m.incognito field. Wave 2 makes security.IncognitoMode the canonical
source of truth.
W2-1 Router.Select rejects forced non-local arms when localOnly is on
rather than short-circuiting and silently routing to cloud. Main
fails fast when --incognito + --provider <cloud> are combined; the
TUI toggle (Ctrl+X, /incognito, config panel) refuses with an
actionable message when a non-local arm is pinned. Factored the
three duplicated toggle sites into Model.attemptIncognitoToggle.
W2-2 persist.Store.Save consults an IncognitoGate (local interface,
*security.IncognitoMode satisfies it). nil gate = always persist
(legacy behaviour for tests); non-nil gate is consulted on every
Save so TUI runtime toggles take effect without reconstructing the
store. File mode 0o600, dir mode 0o700.
W2-3 tui.New seeds m.incognito from cfg.Firewall.Incognito().Active().
Fixes the Ctrl+X-on-launch-with-incognito case where the first
toggle silently turned the firewall OFF because the local flag
started false out of sync with the firewall.
W2-4 saveQuality gates on both *incognito (defensive, covers the
window before fwRef.Set fires) and fw.Incognito().ShouldLearn() (so
TUI Ctrl+X suppresses the snapshot on exit). Quality restore skipped
under --incognito. Quality file written 0o600 in dir 0o700.
engine.reportOutcome and elf.Manager.ReportResult both gate on
fw.Incognito().ShouldLearn() — bandit signal no longer leaks out of
incognito sessions.
W2-5 session files written 0o600 in dirs 0o700 (was 0o644 / 0o755).
W2-6 IncognitoMode.LocalOnly dropped — dead field with no readers;
routing local-only state lives on the router, not the firewall.
Also wires rtr.SetLocalOnly(true) when --incognito at launch — main
previously activated the firewall's flag but never told the router to
filter, so even without the forced-arm bug, launching with
--incognito alone gave you 'incognito badge but full arm pool'.
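The W2-2 gate could be sketched as below. ShouldPersist is a hypothetical
method name (the message only says the gate is a local interface that
*security.IncognitoMode satisfies), and incognitoMode, Store and demo are
simplified stand-ins. The file and directory modes are the ones stated
above.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync/atomic"
)

// IncognitoGate is the narrow view of incognito state that the store needs.
type IncognitoGate interface {
	ShouldPersist() bool
}

// incognitoMode is a toy stand-in for the canonical incognito state.
type incognitoMode struct{ active atomic.Bool }

func (m *incognitoMode) ShouldPersist() bool { return !m.active.Load() }

type Store struct {
	dir  string
	gate IncognitoGate // nil means always persist
}

// Save reports whether anything was written. The gate is consulted on every
// call, so a runtime toggle takes effect without rebuilding the store.
func (s *Store) Save(name string, data []byte) (bool, error) {
	if s.gate != nil && !s.gate.ShouldPersist() {
		return false, nil
	}
	if err := os.MkdirAll(s.dir, 0o700); err != nil {
		return false, err
	}
	if err := os.WriteFile(filepath.Join(s.dir, name), data, 0o600); err != nil {
		return false, err
	}
	return true, nil
}

// demo saves once normally, toggles incognito on, and saves again.
func demo() (first, second bool, perm os.FileMode, err error) {
	dir, err := os.MkdirTemp("", "store")
	if err != nil {
		return
	}
	defer os.RemoveAll(dir)
	mode := &incognitoMode{}
	st := &Store{dir: filepath.Join(dir, "sessions"), gate: mode}
	if first, err = st.Save("a.json", []byte("{}")); err != nil {
		return
	}
	info, err := os.Stat(filepath.Join(st.dir, "a.json"))
	if err != nil {
		return
	}
	perm = info.Mode().Perm()
	mode.active.Store(true)
	second, err = st.Save("b.json", []byte("{}"))
	return
}

func main() {
	first, second, perm, err := demo()
	fmt.Println(first, second, perm, err)
}
```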
Advisor flagged that engine.Config.Provider stayed raw, so the safety
property was 'every call goes through buildRequest' instead of the
stronger 'every Stream call routes through a SafeProvider.' Wrap it
even though buildRequest still scans inline — at worst this costs one
extra idempotent scan pass; it removes the 'someone adds a fifth engine
Stream site that skips buildRequest' failure mode.
Engine.SetProvider gets a doc comment establishing the wrap contract
for callers. No active callers today, but documenting it now prevents
the future bypass.
Confirmed elf engines inherit the wrap automatically:
- elf.Manager.Spawn passes arm.Provider (already *SafeProvider after
W1-3a)
- elf.Manager.SpawnWithProvider has no callers — dead code path
Added the Wave 1 plan to TODO.md under active plans.
Construct security.FirewallRef early in main() and Set it immediately
after security.NewFirewall returns. Wrap every provider that may be
called outside engine.buildRequest():
- primary provider arm (limitedProvider)
- discovered local models (RegisterDiscoveredModels factory)
- CLI agent arms (subprocprov.New)
- background-discovery factory (StartDiscoveryLoop)
- SLM arm + classifier transport
- summarizer (gnomactx.NewSummarizeStrategy)
routerStreamer and hook PromptExecutor inherit redaction automatically
once every router arm is wrapped — they dispatch through router.Stream
→ arm.Provider.Stream.
engine.Config.Provider stays raw because the engine still scans inline
at buildRequest(); per the Wave 1 plan, removing that scan is deferred
one release as belt-and-suspenders.
Integration tests in internal/security/integration_test.go verify the
boundary end-to-end: a router arm wrapped with WrapProvider redacts an
'sk-ant-...' literal before the inner provider sees it, and the
pre-Set / post-Set transition works as documented (pass-through until
the FirewallRef has a Firewall installed).
Introduces internal/security/SafeProvider — a provider.Provider decorator
that scans outgoing messages and the system prompt through the firewall
before delegating to the inner provider. Tool-result redaction stays in
the engine because it needs per-tool context the boundary lacks.
FirewallRef provides a late-binding atomic.Pointer[Firewall] so the
wrapper can be installed before NewFirewall runs in main. A nil or
unset ref makes SafeProvider a pass-through — preserves the current
init order without lock contention or panics.
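The late-binding arrangement can be sketched with the standard library's
atomic.Pointer. Firewall.Scan, the echo provider and the one-method
Provider interface are toy assumptions; only the FirewallRef /
SafeProvider / WrapProvider names and the pass-through-until-Set behavior
come from this message.

```go
package main

import (
	"fmt"
	"strings"
	"sync/atomic"
)

// Firewall is a toy scanner that redacts one literal.
type Firewall struct{}

func (f *Firewall) Scan(s string) string {
	return strings.ReplaceAll(s, "sk-ant-secret", "[REDACTED]")
}

// FirewallRef is a late-binding handle: wrappers can hold it before the
// firewall exists, and Set installs the firewall atomically later.
type FirewallRef struct{ p atomic.Pointer[Firewall] }

func (r *FirewallRef) Set(f *Firewall) { r.p.Store(f) }

// Get tolerates a nil receiver so a nil ref behaves like an unset one.
func (r *FirewallRef) Get() *Firewall {
	if r == nil {
		return nil
	}
	return r.p.Load()
}

// Provider is a simplified, hypothetical provider interface.
type Provider interface {
	Stream(prompt string) string
}

// echo returns exactly what the inner provider would have received.
type echo struct{}

func (echo) Stream(prompt string) string { return prompt }

// SafeProvider scans outgoing text when a firewall is installed and passes
// it through unchanged otherwise.
type SafeProvider struct {
	inner Provider
	ref   *FirewallRef
}

func WrapProvider(p Provider, ref *FirewallRef) *SafeProvider {
	return &SafeProvider{inner: p, ref: ref}
}

func (s *SafeProvider) Stream(prompt string) string {
	if fw := s.ref.Get(); fw != nil {
		prompt = fw.Scan(prompt)
	}
	return s.inner.Stream(prompt)
}

func main() {
	ref := &FirewallRef{}
	sp := WrapProvider(echo{}, ref)
	fmt.Println(sp.Stream("token sk-ant-secret")) // before Set: pass-through
	ref.Set(&Firewall{})
	fmt.Println(sp.Stream("token sk-ant-secret")) // after Set: redacted
}
```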
Wave 1 of the post-audit hardening plan
(docs/superpowers/plans/2026-05-19-security-wave1-safeprovider.md).
Closes the architectural critique that secret scanning only ran inside
engine.buildRequest(), leaving SLM/summarizer/hook/routerStreamer paths
to send raw payloads. This commit only ships the wrapper; W1-2 and W1-3
will wire it through main and the four bypass sites.
Adds the in-TUI surface for the profile system:
- Status bar carries " · profile: <name>" next to the SLM badge when
profile mode is engaged (renders nothing in legacy single-config
installations).
- /profile (no args) shows the active profile and lists available ones.
- /profile <name> switches by re-executing gnoma via syscall.Exec under
--profile <name>. Critical cleanups (quality.json snapshot, SLM
backend Close, session.Close) fire explicitly before exec since
defers don't run after exec replaces the process image. Using
syscall.Exec rather than a child process avoids stacking a process
level on every switch and propagates the new gnoma's exit code
directly to the shell.
- Autocomplete after "/profile " offers configured profile names; the
completion source is threaded from main.go via tui.Config.
Conversation history is not preserved across a switch — profile change
implies different context, different keys, different permission mode,
so a clean reset is the correct semantic.
`profile list` enumerates configured profiles and marks default + active.
`profile show <name>` prints the merged effective config the profile
would produce — sections, configured key names (values never), CLI
agent overrides, arms, hooks, MCP servers, per-profile quality and
session paths.
Both commands work as a recovery affordance when profile resolution
is broken: list flags a missing-default explicitly with
"<name> (default, missing)", and the dispatcher falls back to a
base-only load (new gnomacfg.LoadBase) so the diagnostics still run.
API key values are filtered out of `profile show` — the output is safe
to paste in a help channel or attach to a bug report.
Adds opt-in user profiles for swapping API keys, CLI binaries, and
permission modes between contexts (work/private/experiment/...).
Profile mode engages only when ~/.config/gnoma/profiles/ exists, so
existing single-config installations are untouched. Selection order:
--profile flag → default_profile in base config → fatal error.
Layering: defaults → ~/.config/gnoma/config.toml → profiles/<name>.toml
→ <projectRoot>/.gnoma/config.toml → env. Map sections merge per-key;
[[arms]] and [[mcp_servers]] merge by id/name; [[hooks]] appends.
Per-profile data: quality-<name>.json and sessions/<name>/ keep the
bandit and session list from cross-contaminating between profiles.
Profile names restricted to [A-Za-z0-9_-] to block --profile=../foo
path traversal into derived paths.
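The name restriction can be sketched with an anchored regexp. The
character class and the quality-<name>.json pattern are from this message;
validateProfileName and qualityFile are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
)

// profileName allows only the characters named above, anchored at both ends,
// so separators and dots can never reach a derived path.
var profileName = regexp.MustCompile(`^[A-Za-z0-9_-]+$`)

func validateProfileName(name string) error {
	if !profileName.MatchString(name) {
		return fmt.Errorf("invalid profile name %q: only [A-Za-z0-9_-] allowed", name)
	}
	return nil
}

// qualityFile derives the per-profile quality file name after validation.
func qualityFile(name string) (string, error) {
	if err := validateProfileName(name); err != nil {
		return "", err
	}
	return "quality-" + name + ".json", nil
}

func main() {
	fmt.Println(qualityFile("work"))
	fmt.Println(qualityFile("../foo"))
}
```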
Plan D from docs/superpowers/plans/2026-05-19-post-slm-unlock.md
(static portion; dynamic bandit-driven promotion deferred to D-2).
Routing previously let tier ordering (CLI > local > API) dominate
selection — Opus, in tier 3, would lose to a tier-1 CLI agent for
SecurityReview even though Opus is empirically stronger at that task.
This change introduces explicit per-arm overrides:
    [[arms]]
    id = "anthropic/claude-opus-4-7"
    strengths = ["security_review", "planning"]
    cost_weight = 0.3
Strengths gate cross-tier promotion: arms matching task.Type bypass
the tier loop and compete with each other directly. Promotion is a
preference, not a pin — if no strength-tagged arm is feasible
(backoff, pool capacity, tool support), selection falls through to
the default tier order.
CostWeight linearly dampens the cost penalty in scoreArm via
    effectiveCost = 1 + CostWeight * (cost - 1)
CostWeight=1.0 (or unset) preserves current behavior; lower values
trade cheapness for quality. The earlier draft used cost^CostWeight
which inverts direction for sub-1 local-arm costs (raising a
fraction <1 to a fractional power makes it bigger, not smaller); a
monotonicity regression test prevents that drift.
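A small worked sketch of the linear formula, assuming the zero → 1.0
resolution described for ResolvedCostWeight(). For any positive weight the
result stays increasing in cost, and a weight below 1 pulls it toward 1
from either side. Function names are hypothetical.

```go
package main

import "fmt"

// resolvedCostWeight treats an unset (zero) weight as 1.0.
func resolvedCostWeight(w float64) float64 {
	if w == 0 {
		return 1.0
	}
	return w
}

// effectiveCost applies effectiveCost = 1 + CostWeight * (cost - 1).
func effectiveCost(cost, weight float64) float64 {
	return 1 + resolvedCostWeight(weight)*(cost-1)
}

func main() {
	// Weight unset: cost passes through unchanged.
	fmt.Println(effectiveCost(3, 0))
	// Weight 0.5 halves the distance from 1 on both sides.
	fmt.Println(effectiveCost(3, 0.5), effectiveCost(0.5, 0.5))
}
```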
- internal/router/arm.go: Strengths []TaskType, CostWeight float64,
HasStrength(), ResolvedCostWeight() (zero → 1.0).
- internal/router/selector.go: scoreArm strength bonus const
(strengthScoreBonus = 0.15) + linear cost dampening; selectBest
cross-tier promotion before tier loop.
- internal/router/router.go: ArmOverride type + ApplyArmOverrides()
returns unknown IDs; unknown strength names skipped with per-name
warning via slog.
- internal/router/task.go: ParseTaskTypeStrict() returns ok bool;
ParseTaskType now delegates so the two switches stay in sync.
- internal/config/config.go: ArmConfig + [[arms]] TOML wiring.
- cmd/gnoma/main.go: applies overrides after all initial arms
register; logs a warning when an [[arms]] id has no matching
registered arm.
Tests cover: predicate helpers, scoring direction across two arms,
linear-formula monotonicity on both sides of cost=1, cross-tier
promotion, empty-Strengths preserves tier order, promoted arm in
backoff falls through via full Router.Select path, observed-quality
tiebreak between two strength-tagged arms, ApplyArmOverrides happy
path + unknown-ID reporting + unknown-strength skipping.
Plan B from docs/superpowers/plans/2026-05-19-post-slm-unlock.md.
Users with aliased CLI binaries (claude-priv, claude-work,
gemini-personal) can now point gnoma's auto-discovery at them
without renaming. The override flows through to the actual subprocess
spawn at internal/provider/subprocess/provider.go:56, so routing
through the alias is functional, not cosmetic.
Config:
    [cli_agents]
    claude = "claude-priv"   # discovery uses claude-priv instead of claude
    gemini = ""              # empty value = no override (fall back to canonical)
    # vibe is absent = canonical name used
- internal/config/config.go: CLIAgentsSection map[string]string;
TOML [cli_agents] key.
- internal/provider/subprocess/agent.go:
- Package-level lookPath = exec.LookPath for test injection.
- resolveAgentBinary(canonical, override) → (path, binName, err).
Override='' falls back to canonical. Override set but missing from
PATH returns an error (no silent fallback, which would mask user typos).
- DiscoveredAgent.OverrideBinary records the override binary name
when one was used; empty otherwise.
- DiscoverCLIAgents(ctx, overrides) signature; warning logged when
an override is configured but the binary isn't on PATH.
- cmd/gnoma/main.go: both call sites pass cfg.CLIAgents. The
`gnoma providers` listing renders `claude-priv (via [cli_agents].claude)`
when an override is in effect.
Tests cover: 5 resolver cases (no override, override set, empty
override falls back, override missing, canonical missing); 4
discovery cases (no overrides, override resolves alias, empty value
falls back, override missing skips agent); 2 config round-trip cases.
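A sketch of the resolver and its injection seam, following the signature
given above. The error wording and the fakeLookPath helper (with its
/fake/bin prefix) are assumptions for illustration.

```go
package main

import (
	"fmt"
	"os/exec"
)

// lookPath is a package-level seam so tests can replace exec.LookPath.
var lookPath = exec.LookPath

// resolveAgentBinary picks the override when set, the canonical name
// otherwise. A configured but missing override is an error, never a silent
// fallback to the canonical binary.
func resolveAgentBinary(canonical, override string) (path, binName string, err error) {
	binName = canonical
	if override != "" {
		binName = override
	}
	path, err = lookPath(binName)
	if err != nil {
		return "", binName, fmt.Errorf("agent %q: binary %q not on PATH: %w", canonical, binName, err)
	}
	return path, binName, nil
}

// fakeLookPath is a hypothetical test double that knows a fixed set of names.
func fakeLookPath(known ...string) func(string) (string, error) {
	return func(name string) (string, error) {
		for _, k := range known {
			if k == name {
				return "/fake/bin/" + name, nil
			}
		}
		return "", exec.ErrNotFound
	}
}

func main() {
	lookPath = fakeLookPath("claude", "claude-priv")
	fmt.Println(resolveAgentBinary("claude", "claude-priv"))
	fmt.Println(resolveAgentBinary("claude", "claude-typo"))
}
```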
The Debug floor (0.4) added in eb0583f was bumping the SLM-returned
0.25 up, breaking the HappyPath assertion. Bump the SLM value to 0.55
so the test still verifies "SLM value preserved" (its original
intent), and add a dedicated TestClassifier_AppliesTaskTypeFloor that
exercises the under-reporting case the floor was added to handle.
Plan A from docs/superpowers/plans/2026-05-19-post-slm-unlock.md.
Small local SLMs (<=16k context) waste ~1500 tokens per turn on the
full tool catalogue. Two-stage routing replaces round-1 tools with a
single synthetic select_category schema; round-2+ sends only the
selected category's real tool schemas plus select_category for
re-selection.
- internal/tool/category.go: Category type, optional Categorized
interface, CategoryOf() with meta fallback. fs.read/fs.ls -> read,
fs.write/fs.edit -> write, fs.glob/fs.grep -> search, bash -> exec.
- internal/engine/twostage.go: synthetic select_category tool,
intercept helper, per-turn selectedCategory state under e.mu.
- Engine round 1 forces ToolChoiceRequired so SLMs don't fall back to
prose. State resets at the top and end of every runLoop.
- Activates automatically on a forced local arm with ContextWindow
<=16384, or via [router].force_two_stage TOML key.
- Integration test drives a 3-round trip and asserts: round 1 emits
exactly one schema (synthetic) with ToolChoiceRequired, round 2
contains only write-category schemas + select_category, real
fs.write executes. Invalid-category fallback round-trips back to
round-1 mode.
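A rough sketch of the category lookup and per-round schema selection.
It reads "meta fallback" as "tools that do not implement Categorized land
in a meta category", which is an assumption, as are the Tool interface,
the concrete tool types, and schemasForRound. Schemas are reduced to tool
names here.

```go
package main

import "fmt"

type Category string

const (
	CategoryRead   Category = "read"
	CategoryWrite  Category = "write"
	CategorySearch Category = "search"
	CategoryExec   Category = "exec"
	CategoryMeta   Category = "meta"
)

const selectCategory = "select_category"

type Tool interface{ Name() string }

// Categorized is optional; tools without it fall back to CategoryMeta.
type Categorized interface{ Category() Category }

type plainTool struct{ name string }

func (t plainTool) Name() string { return t.name }

type catTool struct {
	plainTool
	cat Category
}

func (t catTool) Category() Category { return t.cat }

func CategoryOf(t Tool) Category {
	if c, ok := t.(Categorized); ok {
		return c.Category()
	}
	return CategoryMeta
}

// schemasForRound returns the tool names to send. With no (or an unknown)
// selection it returns round-1 mode: only the synthetic selector.
func schemasForRound(tools []Tool, selected Category) []string {
	var names []string
	for _, t := range tools {
		if selected != "" && CategoryOf(t) == selected {
			names = append(names, t.Name())
		}
	}
	if len(names) == 0 {
		return []string{selectCategory}
	}
	return append(names, selectCategory)
}

func main() {
	tools := []Tool{
		catTool{plainTool{"fs.read"}, CategoryRead},
		catTool{plainTool{"fs.write"}, CategoryWrite},
		catTool{plainTool{"fs.edit"}, CategoryWrite},
		catTool{plainTool{"bash"}, CategoryExec},
		plainTool{"todo"},
	}
	fmt.Println(schemasForRound(tools, ""))
	fmt.Println(schemasForRound(tools, CategoryWrite))
}
```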
Two routing bugs were keeping the SLM out of every real prompt and,
once it was eligible, pulling complex tasks into it as well.
Bug 1: ForceArm was called unconditionally when a primary provider was
configured (cmd/gnoma/main.go:378). That short-circuited the entire
router — every prompt went straight to whatever was set as
[provider].default, regardless of tier, score, or feasibility. The SLM
arm appeared in `gnoma router stats` registration logs but had zero
observations after dozens of prompts.
Fix: only pin when the user passed --provider on the command line.
Config defaults register the arm but don't force it; the router picks
freely. Verified end-to-end — trivial prompts now reach slm/ollama
via the tier-0 priority.
Bug 2: A short prompt like "refactor the SLM module" classifies as
TaskRefactor with complexity 0.015 — well under the SLM arm's 0.3
ceiling. The arm became eligible despite the task being inherently
non-trivial. Once eligible, tier-0 priority then pulled it in over
the CLI agents.
Fix: add MinComplexityForType, applied in both ClassifyTask
(heuristic path) and slm.Classifier.Classify (SLM-overlay path). The
floor is per-task-type:
- TaskSecurityReview, TaskOrchestration → 0.60
- TaskRefactor, TaskPlanning, TaskDebug → 0.40
- TaskUnitTest, TaskReview → 0.35
Tasks like Explain/Generation/Boilerplate keep their organic
complexity score so trivial knowledge prompts (≤0.15) still fall to
the SLM. Tasks that imply existing code or multi-step reasoning are
clamped above the SLM's MaxComplexity, naturally routing them to a
bigger arm.
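The floor amounts to a per-type lower clamp. The thresholds below are the
ones listed above; the TaskType string values and the applyFloor helper
are assumptions for the sketch.

```go
package main

import "fmt"

type TaskType string

const (
	TaskSecurityReview TaskType = "security_review"
	TaskOrchestration  TaskType = "orchestration"
	TaskRefactor       TaskType = "refactor"
	TaskPlanning       TaskType = "planning"
	TaskDebug          TaskType = "debug"
	TaskUnitTest       TaskType = "unit_test"
	TaskReview         TaskType = "review"
	TaskExplain        TaskType = "explain"
)

// MinComplexityForType returns the floor for a task type, or 0 when the type
// keeps its organic score.
func MinComplexityForType(t TaskType) float64 {
	switch t {
	case TaskSecurityReview, TaskOrchestration:
		return 0.60
	case TaskRefactor, TaskPlanning, TaskDebug:
		return 0.40
	case TaskUnitTest, TaskReview:
		return 0.35
	default:
		return 0
	}
}

// applyFloor raises a score to the floor but never lowers it.
func applyFloor(t TaskType, complexity float64) float64 {
	if floor := MinComplexityForType(t); complexity < floor {
		return floor
	}
	return complexity
}

func main() {
	fmt.Println(applyFloor(TaskRefactor, 0.015)) // clamped above the SLM ceiling
	fmt.Println(applyFloor(TaskExplain, 0.015))  // left alone, still SLM-eligible
}
```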
After both fixes, observed routing in a clean run:
    What is 2+2?               → slm/ollama         (complexity 0.015)
    Define a closure           → slm/ollama         (complexity 0.015)
    What is HTTP?              → slm/ollama         (complexity 0.015)
    Refactor the SLM module    → subprocess/gemini  (complexity 0.40)
    Audit for race conditions  → subprocess/gemini  (complexity 0.35)
    Plan a migration           → subprocess/gemini  (complexity 0.40)
The TUI gave no indication that an SLM was configured or active.
You'd see the primary provider on the status line and nothing else,
even with [slm].enabled=true and a successfully booted backend.
Two surfaces added:
1. Status-bar SLM badge. The left side of the status line gains a
dim " · slm: <model> ⚙" suffix when the backend booted, " · slm: ✗"
when it failed, and nothing when SLM is disabled. The ⚙ marker
indicates the model advertises tool support.
2. Per-turn classifier visibility. The existing routing event already
produced "routed → <arm> (task: <type>)" lines in the chat history;
it now also reports which classifier made the decision, e.g.
"routed → ollama/ministral-3:3b (task: explain, by: slm_fallback)".
Lets you tell in real time whether the SLM is actually classifying
or falling back to the keyword heuristic.
Plumbing:
- new tui.SLMInfo struct on tui.Config
- main.go populates it after StartBackend returns
- stream.Event gains RoutingClassifier; engine.runLoop fills it from
task.ClassifierSource on the first round
The SLM had two intended jobs — classify every prompt and execute the
small ones itself — but in practice three independent gates kept it
out of nearly all real work:
1. llamafile cold start kept the SLM out of pipe-mode runs, which
always finish before the 15 s health check does
2. ClassifyTask defaulted RequiresTools=true, excluding the SLM arm
(ToolUse=false) from 9/10 task types
3. armTier hard-coded CLI agents > local > API, so even when the SLM
arm was feasible a CLI agent won
Each gate is addressed below. The result is an SLM that actually does
its job — small stuff stays local, complex stuff routes up — gated by
arm capability rather than by accidents of the boot order.
Backend layer (the bigger change)
The original implementation hard-coded llamafile. That's fine if you
have nothing else, but most users with a local model setup already run
Ollama or llama.cpp. The new factory at internal/slm/backend.go picks
between:
- ollama (any local Ollama daemon)
- llamacpp (any llama.cpp server)
- llamafile (gnoma-managed, current behaviour)
- openaicompat (LM Studio, vLLM, remote API)
- auto (probes in order, picks first reachable)
- disabled
[slm].backend in config.toml selects which. Documented in
docs/slm-backends.md with copy-paste presets for each. The factory
probes the underlying model's actual capabilities (Ollama /api/show,
llama.cpp /props) and sets the SLM arm's ToolUse accordingly — so the
arm picks up simple file-read style tasks on tool-capable models and
stays knowledge-only on completion-only models.
Trivial-prompt heuristic (Gate 2)
ClassifyTask now flips RequiresTools=false for short, low-complexity
prompts whose task type doesn't imply existing code (Explain,
Generation, Boilerplate). Tool-needing tokens (read, write, run, test,
file, …) keep RequiresTools=true even when the prompt is brief.
Complexity-aware tier ordering (Gate 3)
armTier takes a Task and returns tier 0 for arms whose MaxComplexity
ceiling fits the task. CLI agents drop to tier 1, local to 2, API to 3.
For trivial tasks the SLM arm wins; for complex tasks the SLM falls
out of the feasible set (MaxComplexity exclusion) and the original
ordering reasserts.
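The tiering rule might be sketched as follows. The Arm and Task shapes are
simplified, and treating MaxComplexity == 0 as "no ceiling" is an
assumption of this sketch, not something stated above.

```go
package main

import "fmt"

type Arm struct {
	ID            string
	IsCLIAgent    bool
	IsLocal       bool
	MaxComplexity float64 // 0 means no ceiling (assumption)
}

type Task struct {
	Complexity float64
}

// feasible excludes arms whose ceiling is below the task's complexity.
func feasible(a Arm, t Task) bool {
	return a.MaxComplexity == 0 || t.Complexity <= a.MaxComplexity
}

// armTier puts a specialised small arm first when the task fits under its
// ceiling; everything else keeps the CLI > local > API ordering.
func armTier(a Arm, t Task) int {
	if a.MaxComplexity > 0 && t.Complexity <= a.MaxComplexity {
		return 0
	}
	switch {
	case a.IsCLIAgent:
		return 1
	case a.IsLocal:
		return 2
	default:
		return 3
	}
}

func main() {
	slm := Arm{ID: "slm/ollama", IsLocal: true, MaxComplexity: 0.3}
	cli := Arm{ID: "subprocess/gemini", IsCLIAgent: true}
	trivial := Task{Complexity: 0.015}
	complex := Task{Complexity: 0.40}
	fmt.Println(armTier(slm, trivial), armTier(cli, trivial))
	fmt.Println(feasible(slm, complex), armTier(cli, complex))
}
```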
Eager boot with user-facing wait (Gate 1)
Removed the original goroutine-only path. SLM startup now blocks
synchronously inside the factory; for llamafile that means up to
[slm].startup_timeout (default 5 s) of waiting on the first
invocation, with "Starting SLM…" → "SLM ready (backend, model, tools,
boot=N)" / "SLM unavailable: …" messages on stderr. Ollama / llamacpp
backends boot instantly because the daemon is already running.
waitHealthy() now respects the caller's context deadline instead of
its old hardcoded 15 s ceiling.
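A sketch of a deadline-respecting wait loop, assuming the health check is
passed in as a probe function. The signature, the polling interval and
the demo helpers are hypothetical; only the "caller's context decides when
to give up" behavior is taken from this message.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitHealthy polls probe until it succeeds or ctx is done. There is no
// built-in ceiling: the caller's deadline is the only limit.
func waitHealthy(ctx context.Context, interval time.Duration, probe func(context.Context) error) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if err := probe(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("SLM unavailable: %w", ctx.Err())
		case <-ticker.C:
			// try again on the next tick
		}
	}
}

// demoHealthy simulates a backend that becomes ready on the third probe.
func demoHealthy() error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	calls := 0
	return waitHealthy(ctx, 5*time.Millisecond, func(context.Context) error {
		calls++
		if calls < 3 {
			return fmt.Errorf("not ready (attempt %d)", calls)
		}
		return nil
	})
}

// demoTimeout simulates a backend that never comes up within the deadline.
func demoTimeout() error {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Millisecond)
	defer cancel()
	return waitHealthy(ctx, 5*time.Millisecond, func(context.Context) error {
		return fmt.Errorf("connection refused")
	})
}

func main() {
	fmt.Println(demoHealthy())
	fmt.Println(demoTimeout())
}
```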
Classifier reliability
Classifier timeout bumped 2 s → 5 s for thinking-mode models like
Qwen3-distilled Tiny3.5. System prompt includes /no_think directive
for the same family. These help but don't eliminate small-model
JSON-contract failures — see the docs section on picking a model.
Probe + telemetry surfaces
`gnoma slm status` now prints the configured backend + model + a live
probe result (✓/✗) instead of just the llamafile manifest state.
`gnoma router stats` already (from the previous commit) shows the
classifier-source mix; with this change you can finally see slm /
slm_fallback / heuristic share rise from "always heuristic" to
something reflecting real SLM activity.
Tests
- 9 new backend-factory tests (httptest-backed Ollama probe, error
paths, auto-detection, capability flags)
- Tier-ordering tests cover the new "specialised small arm wins
trivial task" path
- Trivial-prompt heuristic tested for both halves (knowledge-only
flips RequiresTools=false; debug/file/run keeps it true)
Deletes the dead SLMManager field from the TUI Config — it was
declared but never read.