gnoma

Author	SHA1	Message	Date
vikingowl	8ba77c1685	fix(safety): env-template precision, label alignment, banner on bypass Three polish items surfaced during the maintainer's manual smoke of the previous safety commit. env-template precision (false-positive fix): The "env file" rule matched .env.* universally, which flagged conventional templates like .env.example / .env.sample / .env.template / .env.dist / .env.default — these hold variable NAMES, no values, and are commonly committed. Now skipped. Real env files (.env, .env.local, .env.production) still match. New envTemplateSuffixes table + isEnvTemplate helper; check runs only inside the env-file rule so the suffix denylist is scoped. Tests added for both directions: 6 templates that must NOT flag, 6 real env files that must. Banner label alignment: Field labels were padded to 8 chars except "sensitive" at 9, producing visible misalignment in the rendered banner: cwd : /... provider : ollama / ... sensitive : 0 matches in cwd <- one extra space Padded all labels to 9 chars so the ":" separators line up. Context banner on bypass: --dangerously-allow-anywhere previously suppressed the entire safety block, including the informational context banner. Bypassing the GATE is not the same as opting out of the info — the user still wants to see cwd / git state / sensitive files nearby. Restructured the safety block so classification + banner always run; the bypass only skips the refuse/warn FLOW. The bypass warning log now also includes the classified tier and cwd path for diagnostics.	2026-05-23 22:32:26 +02:00
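The env-template exclusion described in 8ba77c1685 might look roughly like the sketch below. The names `envTemplateSuffixes` and `isEnvTemplate` come from the commit message; `isEnvFile`, the exact contents of the table, and the matching logic are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// envTemplateSuffixes mirrors the template names listed in the commit
// message; the real table may differ.
var envTemplateSuffixes = []string{".example", ".sample", ".template", ".dist", ".default"}

// isEnvTemplate reports whether name ends in a known template suffix.
func isEnvTemplate(name string) bool {
	for _, suffix := range envTemplateSuffixes {
		if strings.HasSuffix(name, suffix) {
			return true
		}
	}
	return false
}

// isEnvFile (hypothetical name) matches .env and .env.<suffix>. The template
// check runs only inside this rule, so it cannot affect other rules.
func isEnvFile(name string) bool {
	if name == ".env" {
		return true
	}
	return strings.HasPrefix(name, ".env.") && !isEnvTemplate(name)
}

func main() {
	for _, name := range []string{".env", ".env.local", ".env.example", ".envrc"} {
		fmt.Printf("%-14s flagged=%v\n", name, isEnvFile(name))
	}
}
```

Keeping the suffix check inside the env-file rule is what scopes the denylist: a file such as `id_rsa.example` would still be judged by its own rule.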
vikingowl	3eeb5b46d7	feat(safety): pre-launch cwd classifier + context banner Implements S-1 through S-7 of the startup-safety-banner plan. Adds a pre-launch safety check that classifies the current working directory into three tiers and gates the launch: TierRefuse /, /etc, /sys, /proc, /usr, /var, /bin, /sbin, /boot, /root, /dev (Linux) and /System, /Library, /private, /Applications (macOS). Refuses with exit 2 unless --dangerously-allow-anywhere is passed. TierWarn $HOME, ~/Desktop, ~/Downloads, ~/Documents, ~/.config, ~/.local, ~/.cache, /tmp, and similar dumping grounds. Prints a banner and reads a single y/Y from stdin to confirm; any other input (or EOF, including piped/ scripted invocation) aborts with exit 1. TierOK Anywhere with a recognized project marker (.gnoma/, go.mod, package.json, pyproject.toml, Cargo.toml, Makefile, Dockerfile, build.gradle, pom.xml) or inside a git repo. No prompt; banner only. Project markers and git-repo presence override the TierWarn check — a project dir inside $HOME stays TierOK. The require_project_marker config knob can flip that for strict users. Container detection: when /.dockerenv or /run/.containerenv exists, TierRefuse downgrades to TierWarn (devcontainers often chroot to / or similar). Best-effort; false positives only soften the gate. The context banner is always rendered (TierOK, TierWarn, TierRefuse alike) and summarizes: cwd, git branch + dirty state, project type, provider/model, modes (permission, incognito, prefer), and a top-level sensitive-file inventory. Inventory matches .env, .env., env.local; private-key extensions (.pem, .key, .crt, .p12, .pfx); SSH key names (id_rsa, id_ed25519, ...); credentials files; .netrc / .pgpass; KeePass vaults; and .ssh/ .aws/ .kube/ .gcloud/ .azure/ .docker/ directories. Precision-tested: .envrc and secret_handler.go do NOT match. Bounded at 1000 entries. 
Architecture: - internal/safety/cwd.go — Classification + symlink-resolving tier classifier with platform-specific roots and container detection. - internal/safety/sensitive.go — pattern-based top-level scanner, deterministic ordering, scanLimit guard against pathological dirs. - internal/safety/banner.go — pure render functions for the warn prefix, refuse message, and context banner. Safe for golden-string testing. - internal/config/config.go — new [safety] section with three config keys, defaults applied via ResolvedSafety() helper. Pointer fields distinguish "user omitted" from "user set to false." - cmd/gnoma/main.go — gate runs after subcommand dispatch (so `gnoma providers / profile / slm / router` skip the prompt) and before provider creation. --dangerously-allow-anywhere bypasses the gate with an explicit log warning. The runtime keypress reads up to 8 bytes from os.Stdin and accepts only "y" / "Y" trimmed; EOF returns false (piped invocations without the flag will abort). Documented in the readYesConfirmation helper. Manual smoke (per plan): - `cd / && gnoma -p test` → refuses - `cd ~ && gnoma` → warns + keypress - `cd ~/git/some-repo && gnoma` → banner only - subcommands skip the gate entirely Linux + macOS classification; Windows path handling deferred per plan (treated as TierOK there until follow-up). Refs: docs/superpowers/plans/2026-05-23-startup-safety-banner.md	2026-05-23 22:19:39 +02:00
vikingowl	f9094f68f3	feat(router): [router].prefer = local | cloud | auto Implements P-1 through P-6 of the prefer-routing-policy plan. Adds a config knob that biases routing toward local arms, cloud arms, or leaves selection unchanged. Default "auto" is byte-identical to pre-change behavior (the new armTier path with PreferAuto returns the same value as the old single-arg function). Mechanism diverged from the plan after empirical testing: The plan called for a score multiplier applied in bestScored. Tests revealed the existing cost-floor math (scoreArm divides by weighted cost, which collapses to ~0.001 for free local arms) gives local arms a ~280x raw-score advantage that a 0.3-0.5 multiplier can't overcome. A tier-shift in armTier turned out cleaner: PreferLocal: cloud arms (true API, IsLocal=false && !IsCLIAgent) get +2 tier shift, landing behind locals. PreferCloud: IsLocal arms get +2 tier shift, landing behind cloud. SLM tier-0 arms shift to tier 2 (still below cloud's tier 3), so the SLM-protection semantic (small stuff stays on the small model) survives PreferCloud. This matches the open question in the plan, now resolved as: yes, SLMs keep winning under PreferCloud by design. The policyMultiplier was kept in bestScored as a within-tier nudge (mostly cosmetic in practice given the cost-floor dynamics described above; could matter when costs are calibrated). Worth revisiting once router-wide cost calibration lands. Strengths cross-tier promotion is unaffected: the promoted-set path in selectBest bypasses armTier entirely, so a strongly-tagged cloud arm still wins SecurityReview tasks under PreferLocal (validated by TestPreferPolicy_StrengthsBeatsMultiplier). CLI-agent subprocess arms count as "local" for PreferLocal purposes: they proxy to cloud, but the user-visible behavior is local. Users who want to exclude them can use --provider X. 
Forced arms (--provider X) and incognito take priority over the policy: forced arm test pins this, incognito-still-wins test pins the LocalOnly hard filter dominating PreferCloud. Test coverage (prefer_test.go): ParsePreferPolicy / String round trips; policyMultiplier table; acceptance scenarios across all three policies with adjacent-tier arms; SLM-still-wins under PreferCloud; Strengths beats multiplier; forced-arm bypass; incognito beats prefer; lone cloud arm wins when no local feasible. Refs: docs/superpowers/plans/2026-05-23-prefer-routing-policy.md	2026-05-23 22:13:26 +02:00
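The tier shift from f9094f68f3 can be illustrated with a small sketch. `armTier`, `PreferAuto`, `PreferLocal`, `PreferCloud`, `IsLocal` and `IsCLIAgent` are named in the commit message; the `Arm` struct layout and the `BaseTier` field are assumptions standing in for however the router computes the unshifted tier.

```go
package main

import "fmt"

type PreferPolicy int

const (
	PreferAuto PreferPolicy = iota
	PreferLocal
	PreferCloud
)

// Arm is a reduced, hypothetical stand-in for the router's arm type.
type Arm struct {
	Name       string
	BaseTier   int // assumed: the tier before any policy shift
	IsLocal    bool
	IsCLIAgent bool
}

// armTier applies the +2 shift described in the commit: true-API cloud arms
// move back under PreferLocal, IsLocal arms move back under PreferCloud.
func armTier(a Arm, p PreferPolicy) int {
	tier := a.BaseTier
	switch p {
	case PreferLocal:
		if !a.IsLocal && !a.IsCLIAgent {
			tier += 2
		}
	case PreferCloud:
		if a.IsLocal {
			tier += 2
		}
	}
	return tier
}

func main() {
	slm := Arm{Name: "slm", BaseTier: 0, IsLocal: true}
	cloud := Arm{Name: "cloud", BaseTier: 3}
	for _, p := range []PreferPolicy{PreferAuto, PreferLocal, PreferCloud} {
		fmt.Printf("policy=%d slm=%d cloud=%d\n", p, armTier(slm, p), armTier(cloud, p))
	}
}
```

With the tier numbers quoted in the message, an SLM at tier 0 lands on tier 2 under PreferCloud, still ahead of a cloud arm at tier 3, which is the SLM-protection behavior the commit describes.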
vikingowl	2f8d4c412f	feat(router): cloud-arm defaults, gpt-5.3-codex registration Closes R-4 and R-5 of the routing-defaults plan. R-4: Strengths + CostWeight defaults for closed frontier models. Cloud entries land in the same knownFamilyDefaults table as local ones, with MaxComplexity intentionally left zero (cloud arms get no complexity ceiling). CostWeight tuned per the plan's rationale: claude-opus-4-7 → Planning/SecurityReview/Debug/Refactor, 0.3 claude-sonnet-4-6 → Generation/Refactor/Review, 0.7 gpt-5.5 → Planning/SecurityReview/Generation, 0.3 gpt-5.3-codex → Generation/Refactor/Debug/UnitTest, 0.6 gpt-5.2 → Orchestration/Review, 0.8 gemini-3.1-pro → Planning/Review/Orchestration, 0.5 gemini-3.5-flash → Boilerplate/Explain/Orchestration, 1.2 The 0.3 weight on frontier arms keeps them competitive on SecurityReview / Planning despite $4+/Mtok; 1.2 on Gemini Flash penalizes cost more so it only wins when cost is genuinely decisive (boilerplate, explain). Mechanism: extracted applyFamilyDefaults into defaults.go and call it from Router.RegisterArm. Single source of truth — both local discovery and the primary-provider path in cmd/gnoma/main.go now flow through the same defaults application. Removed the duplicate apply block from RegisterDiscoveredModels. Legacy model IDs (claude-opus-4-20250514, gpt-4o, o3, gemini-2.5-pro, etc.) intentionally do not match any table entry — keeps users on pinned older models safe from imposed 2026 Strengths. R-5: gpt-5.3-codex registration. - internal/provider/openai/provider.go: added to fallbackModels and inferOpenAIModelCapabilities (400K context, 32K output). - internal/provider/ratelimits.go: gpt-5.3-codex and its dated alias gpt-5.3-codex-2026-02-15 added with the same Tier 1 quotas as gpt-5.2. Gemini 3.x (3.1-pro-preview, 3.5-flash, 3.1-flash-lite) was already registered in both google/provider.go and ratelimits.go — no change needed for that part of R-5. 
Test coverage: - ResolveFamilyDefaults table-driven across all 7 cloud entries including prefix-sharing (gpt-5.5-pro → gpt-5.5 defaults, gemini-3.1-pro-preview → gemini-3.1-pro defaults). - Legacy IDs return !ok. - RegisterArm applies cloud defaults end-to-end. - User-supplied Strengths and CostWeight are not overridden. - ID.Model() fallback works when ModelName is empty (test code often constructs arms this way). Refs: docs/superpowers/plans/2026-05-23-routing-defaults-refresh.md	2026-05-23 21:39:48 +02:00
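A longest-prefix lookup of the kind 2f8d4c412f and a79e99199d describe could be sketched as follows. `FamilyDefaults`, `knownFamilyDefaults` and `ResolveFamilyDefaults` are names from the commit messages, and the two CostWeight values are the ones quoted above. The `Family` field, the reduced table, and the lookup body are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// FamilyDefaults is reduced to two fields here; Family is a hypothetical
// label added so the sketch can show which entry matched.
type FamilyDefaults struct {
	Family     string
	CostWeight float64
}

// A small illustrative subset, not the project's real table.
var knownFamilyDefaults = map[string]FamilyDefaults{
	"gpt-5.5":        {Family: "gpt-5.5", CostWeight: 0.3},
	"gemini-3.1-pro": {Family: "gemini-3.1-pro", CostWeight: 0.5},
	"qwen":           {Family: "qwen"},
	"qwen3-coder":    {Family: "qwen3-coder"},
}

// ResolveFamilyDefaults strips an org/ namespace, then returns the entry
// whose key is the longest prefix of the model ID.
func ResolveFamilyDefaults(modelID string) (FamilyDefaults, bool) {
	if i := strings.LastIndex(modelID, "/"); i >= 0 {
		modelID = modelID[i+1:]
	}
	best := ""
	for prefix := range knownFamilyDefaults {
		if strings.HasPrefix(modelID, prefix) && len(prefix) > len(best) {
			best = prefix
		}
	}
	if best == "" {
		return FamilyDefaults{}, false
	}
	return knownFamilyDefaults[best], true
}

func main() {
	for _, id := range []string{"gpt-5.5-pro", "someorg/qwen3-coder:30b", "qwen2.5:7b", "gpt-4o"} {
		d, ok := ResolveFamilyDefaults(id)
		fmt.Printf("%-26s ok=%v family=%q\n", id, ok, d.Family)
	}
}
```

Because no table key is a prefix of a legacy ID such as `gpt-4o`, pinned older models fall through with `ok == false` and keep their user-supplied settings.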
vikingowl	9bb775a4aa	feat(router): full local family defaults table with size-keyed ceilings Expands the family-defaults scaffold to 23 entries covering the local models that currently appear in real Ollama fleets: coder specialists (qwen3-coder, devstral, qwen2.5-coder, yi-coder, deepseek-coder, starcoder), reasoners (phi-4, phi-4-mini), Gemma 2/3/4 (including the "edge" e2b/e4b variants under both Ollama and GGUF naming), Qwen 2.5/3/3.5 with a catch-all qwen entry, Mistral/Ministral (incl. the 24B mistral-small-3), Llama 3.2/4, tiny3.5 (reec's distill family), Granite, GLM (incl. glm-ocr specialist), and MiniCPM-V. Five families that span wide parameter ranges (qwen3.5, qwen3, qwen2.5, ministral-3, tiny3.5) now use SizeCap ladders instead of a flat MaxComplexity. A new parseSizeFromModelID helper splits the model ID on :/-_/ and matches pure <N>b/<N>m tokens, correctly ignoring qwen3.5 version strings, e2b edge tags, a3b MoE active params, and v0.3 version suffixes. ResolveMaxComplexity wraps ResolveFamilyDefaults plus the SizeCap traversal, falling back to the smallest cap when size parsing fails (conservative). Discovery's apply path now goes through it so SizeCap entries actually take effect. 
Test coverage: - parseSizeFromModelID (11 cases) - ResolveFamilyDefaults longest-prefix discipline (19 cases) - Unknown-family fallback returns !ok - ResolveMaxComplexity size-keyed ladder (13 cases) - Size-parse-failure fallback - knownFamilyDefaults invariants: SizeCaps ordered largest-first, SizeCaps and MaxComplexity mutually exclusive per entry - Routing-payoff integration: 3 arms (tiny3.5:1.5b, phi-4:14b, qwen3-coder:30b) get picked for TaskGeneration / TaskPlanning / TaskBoilerplate respectively, without any [[arms]] config - Local fleet visibility: the maintainer's actual `ollama ls` inventory registers correctly with expected MaxComplexity and Strengths; embeddinggemma stays filtered out The Planning sub-case surfaced a separate issue worth flagging: heuristicQuality floors out at 0.55 for a generic 14B local model without ThinkingModes, below TaskPlanning's 0.60 threshold. The test mutates phi-4's capabilities post-registration to reflect reality (phi-4 is reasoning-tuned). A discovery-side thinking-capability detection is out of scope for this plan but flagged in the test comment for follow-up. Refs: docs/superpowers/plans/2026-05-23-routing-defaults-refresh.md	2026-05-23 21:34:09 +02:00
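The size parsing described in 9bb775a4aa (split on separators, accept only pure `<N>b` / `<N>m` tokens) might be sketched like this. `parseSizeFromModelID` is the helper named in the commit; the return type (billions of parameters plus an ok flag), the regular expression, and the exact separator set are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// sizeTokenRe matches a whole token such as "30b", "1.5b" or "135m".
// Tokens like "qwen3.5", "e2b", "a3b" or "v0.3" do not match.
var sizeTokenRe = regexp.MustCompile(`^(\d+(?:\.\d+)?)([bm])$`)

// parseSizeFromModelID returns the first size token in billions of
// parameters. "." is deliberately not a separator, so "1.5b" stays whole.
func parseSizeFromModelID(id string) (float64, bool) {
	tokens := strings.FieldsFunc(strings.ToLower(id), func(r rune) bool {
		return r == ':' || r == '/' || r == '-' || r == '_'
	})
	for _, tok := range tokens {
		m := sizeTokenRe.FindStringSubmatch(tok)
		if m == nil {
			continue
		}
		n, err := strconv.ParseFloat(m[1], 64)
		if err != nil {
			continue
		}
		if m[2] == "m" {
			n /= 1000 // millions to billions
		}
		return n, true
	}
	return 0, false
}

func main() {
	for _, id := range []string{"tiny3.5:1.5b", "qwen3-coder:30b-a3b", "gemma3n:e2b", "mistral:v0.3"} {
		size, ok := parseSizeFromModelID(id)
		fmt.Printf("%-22s size=%v ok=%v\n", id, size, ok)
	}
}
```

When parsing fails, the commit says the caller falls back to the smallest cap in the ladder, so a false `ok` here errs toward a conservative MaxComplexity.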
vikingowl	a79e99199d	feat(router): non-chat exclude, vision prefixes, family-defaults scaffold Discovery previously registered every model returned by Ollama as a chat arm, including embeddings, ASR, TTS, audio realtime, and rerankers — which then failed at inference time when the router selected them. Local arms also shipped with all-zero defaults, so selection between e.g. tiny3.5:1.5b, phi-4:14b, and qwen3-coder:30b was effectively random. This change covers tasks R-1, R-2, R-6 from the routing-defaults plan. - nonChatModelPatterns + isNonChatModel substring matcher; matched IDs are skipped during RegisterDiscoveredModels. Covers whisper, moonshine, kokoros, vibevoice, -asr, -tts, -audio, -embedding, embeddinggemma, -reranker, lfm2. - knownVisionModelPrefixes gains gemma4, gemma-4, glm-ocr. gemma3 and minicpm-v entries stay for regression coverage. - New internal/router/defaults.go with FamilyDefaults struct, knownFamilyDefaults map, and ResolveFamilyDefaults longest-prefix lookup (with org/-namespace stripping so reecdev/tiny3.5:1.5b resolves to "tiny3.5"). Single entry for now: functiongemma is registered with Disabled=true and MaxComplexity=0.40, reserved for the future ArmRoleToolRouter path. Table will grow in R-3. - RegisterDiscoveredModels consults ResolveFamilyDefaults and only populates fields that are still zero on the arm, so user [[arms]] overrides keep priority. Plans: - docs/superpowers/plans/2026-05-23-routing-defaults-refresh.md - docs/superpowers/plans/2026-05-23-tool-router-specialization.md TODO.md surfaces both as in-flight items.	2026-05-23 21:24:59 +02:00
vikingowl	1606d19366	feat(subprocess/codex): account for cached and reasoning tokens codex 0.133.0 emits two token-accounting fields at top level that we previously dropped: cached_input_tokens — subset of input_tokens that hit the prompt cache (cheaper, but still counted in input_tokens per OpenAI Responses API semantics) reasoning_output_tokens — separately reported billable thinking tokens on reasoning-capable models Map cached_input_tokens to message.Usage.CacheReadTokens and subtract it from InputTokens. message.Usage.Add() sums InputTokens and CacheReadTokens as peers, so the uncached residual goes in InputTokens — matches the anthropic provider's convention and keeps cumulative usage tracking arithmetically correct. Fold reasoning_output_tokens into OutputTokens for accurate cost tracking. The top-level peer positioning (vs nested in output_tokens_details) implies a separately counted billable quantity, not a subset of output_tokens. Defensive clamp at zero in case a future codex build reports cached > input due to schema drift. Includes a verbatim regression guard against the live 2026-05-22 codex 0.133.0 output to catch schema changes early.	2026-05-22 13:35:57 +02:00
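The token arithmetic in 1606d19366 can be worked through with a short sketch. The field names `InputTokens`, `OutputTokens` and `CacheReadTokens` follow the commit message; the `Usage` struct shape, the `mapCodexUsage` helper, and the sample numbers are hypothetical.

```go
package main

import "fmt"

// Usage is a reduced stand-in for message.Usage.
type Usage struct {
	InputTokens     int64
	OutputTokens    int64
	CacheReadTokens int64
}

// mapCodexUsage (hypothetical name) splits input into cached and uncached
// parts and folds reasoning tokens into output, clamping at zero in case a
// future build reports cached > input.
func mapCodexUsage(input, cachedInput, output, reasoningOutput int64) Usage {
	uncached := input - cachedInput
	if uncached < 0 {
		uncached = 0
	}
	return Usage{
		InputTokens:     uncached,
		CacheReadTokens: cachedInput,
		OutputTokens:    output + reasoningOutput,
	}
}

func main() {
	// Illustrative numbers only, not taken from real codex output.
	u := mapCodexUsage(100, 30, 50, 20)
	fmt.Printf("input=%d cache_read=%d output=%d\n", u.InputTokens, u.CacheReadTokens, u.OutputTokens)
}
```

With 100 input tokens of which 30 were cached, the uncached residual is 70; summing `InputTokens` and `CacheReadTokens` as peers then recovers the original 100 rather than counting the cached part twice.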
vikingowl	49d80cf847	feat(security): format-aware entropy safelist (Phase F-1) Add a deterministic pre-extractor that skips known-safe token shapes before they reach the entropy scorer. Targets the false-positive regime that bites under lowered entropy_threshold or redact_high_entropy = true — UUIDs (~3.4 bits), SHA hex digests (~3.9 bits), ISO-8601 timestamps, and HTTP(S) URLs. Config knob lives under the existing security section to match entropy_threshold / redact_high_entropy convention: [security] entropy_safelist = ["uuid", "sha_hex", "iso8601", "url"] Empty / unset preserves pre-F-1 behaviour exactly — users opt in. Per-pattern Debug telemetry fires on every skip (pattern name + token length, never the token bytes). This is the data F-2's go/no-go gate depends on; the plan literally specifies it. NewFirewall validates names at the config boundary and emits a Warn for unknown entries so a typo like "uid" instead of "uuid" surfaces loudly instead of silently disabling FP reduction. Tests cover: UUID/SHA-1/SHA-256 skipped at lowered threshold, mixed payload (safe shape + real secret) preserves the secret, secret-adjacent-to-UUID regression guard, empty safelist preserves pre-F-1 behaviour, unknown name silently dropped at scanner level but warned at firewall level, end-to-end FirewallConfig wiring, and the skip-telemetry log line. F-2 remains gated on real-workload FP-rate observations.	2026-05-22 12:39:10 +02:00
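A format-aware safelist of the kind 49d80cf847 describes could be sketched as below. The four pattern names match the `entropy_safelist` values in the commit message; the regular expressions, the `safelistPatterns` table and the `matchSafelist` helper are assumptions, and the real patterns may be stricter or looser.

```go
package main

import (
	"fmt"
	"regexp"
)

// safelistPatterns maps config names to whole-token shapes (assumed regexes).
var safelistPatterns = map[string]*regexp.Regexp{
	"uuid":    regexp.MustCompile(`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`),
	"sha_hex": regexp.MustCompile(`^(?:[0-9a-fA-F]{40}|[0-9a-fA-F]{64})$`),
	"iso8601": regexp.MustCompile(`^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})$`),
	"url":     regexp.MustCompile(`^https?://\S+$`),
}

// matchSafelist returns the name of the first enabled pattern that matches
// the token. Unknown names are skipped here; validating them loudly is left
// to the config boundary, as the commit describes.
func matchSafelist(token string, enabled []string) (string, bool) {
	for _, name := range enabled {
		re, ok := safelistPatterns[name]
		if !ok {
			continue
		}
		if re.MatchString(token) {
			return name, true
		}
	}
	return "", false
}

func main() {
	enabled := []string{"uuid", "sha_hex", "iso8601", "url"}
	for _, tok := range []string{
		"123e4567-e89b-12d3-a456-426614174000",
		"da39a3ee5e6b4b0d3255bfef95601890afd80709",
		"2026-05-22T12:39:10+02:00",
		"not-a-safe-shape",
	} {
		name, ok := matchSafelist(tok, enabled)
		// Log the pattern name and token length only, never the token bytes.
		fmt.Printf("len=%d skipped=%v pattern=%q\n", len(tok), ok, name)
	}
}
```

Anchoring each pattern to the whole token is what keeps a real secret sitting next to a UUID from being skipped along with it.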
vikingowl	ea1a5361e2	chore: restore agy JSON-output TODO; idiomatic t.TempDir() in google test The worktree commit `12a6b83` dropped the "Native agy JSON output" backlog item alongside removing the agy agent. Since we restored agy in this branch, the TODO is relevant again — agy v1.0.0 still emits plain text and the prompt-augmentation fallback should be replaced by --output-format stream-json once the CLI supports it. Switch TestTryLoadOAuthCredentials_Formats to t.TempDir() to drop the unchecked os.RemoveAll defer that golangci-lint's errcheck caught after the merge.	2026-05-22 12:17:10 +02:00
vikingowl	246997c4be	Merge branch 'feat/agy-sdk-integration' into dev Brings in the Google auth precedence work (agy > gemini > ADC credential walk, fileTokenProvider expiry handling, slog-backed error reporting), the Codex CLI integration as a new subprocess agent, and the restoration of the agy subprocess agent that was accidentally removed by the initial codex commit. Sandbox-bypass flags on both agy and codex are now opt-out via env vars (GNOMA_AGY_BYPASS_PERMISSIONS, GNOMA_CODEX_BYPASS_SANDBOX). Includes review-driven fixes: - ADC fallback now uses real DetectOptions (cloud-platform scope) - fileTokenProvider returns an error on expired tokens instead of shipping a known-dead bearer - TestNew_Precedence asserts which credential was actually picked - codex parser tolerates non-JSON banner / debug lines on stdout - codex usage takes max(input_tokens, prompt_tokens) so accounting can't silently undercount No conflicts expected with the dev image-content feature: the worktree branch only touches the google and subprocess provider families.	2026-05-22 12:15:32 +02:00
vikingowl	afc31b0af4	fix(subprocess): restore agy alongside codex; env-gate sandbox bypass The original commit on this branch replaced the agy subprocess agent with codex (overwriting the slot in knownAgents, deleting agy_test.go and the agyParser). That was unintentional — agy (antigravity) is a distinct CLI from codex (OpenAI's). Antigravity will replace gemini when gemini retires on 2026-06-16, so it needs to keep its own slot. Restored: FormatAgyText constant, agyParser with newAgyParser and the line-delimited text parser, the agy CLIAgent entry in knownAgents with PromptResponseFormat:true, agy_test.go, and the agy case in newParser. Sourced from the parent commit so behavior matches what shipped before the codex change. Sandbox bypass: both agy (--dangerously-skip-permissions) and codex (--dangerously-bypass-approvals-and-sandbox) need a flag to run non-interactively (their stdin is closed; without it they block on approval prompts nobody can answer). Both default to ON for out-of-box behavior; operators with pre-approved trust config can opt out via GNOMA_AGY_BYPASS_PERMISSIONS=0 or GNOMA_CODEX_BYPASS_SANDBOX=0. Tests cover the on / opt-out / unknown value branches. TestKnownAgents_ValidFormats updated to accept the restored FormatAgyText.	2026-05-22 12:14:54 +02:00
vikingowl	1717f9f567	fix(subprocess/codex): tolerate non-JSON stdout, max-of-token-paths Codex emits banner / debug / "starting turn" lines to stdout interleaved with the JSON event stream. The parser previously returned an error on any line that wasn't a JSON object, which subprocessStream.Next treats as terminal — one stray banner aborted the whole turn. Skip lines that don't start with `{` after whitespace trim, and downgrade unparseable JSON-looking lines to a slog.Debug so they don't kill the stream either. Token accounting: usage payloads from newer codex builds occasionally carry both input_tokens and prompt_tokens (and likewise output / completion) with slightly different values. Always use the larger of the two so we can't silently undercount. Tests cover non-JSON banner skipping, malformed-JSON non-fatal-skip, and the max() behavior with both token fields populated.	2026-05-22 12:08:32 +02:00
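The tolerant stdout handling from 1717f9f567 can be sketched as a line filter in front of the JSON decoder. The function names and the event representation (`map[string]any`) are assumptions; the real parser also logs unparseable lines via slog.Debug, which is omitted here.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// parseEvents (hypothetical) keeps only lines that decode as JSON objects.
// Banner or debug lines are skipped, and malformed JSON-looking lines are
// dropped instead of aborting the stream.
func parseEvents(stdout string) []map[string]any {
	var events []map[string]any
	sc := bufio.NewScanner(strings.NewReader(stdout))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, "{") {
			continue // not JSON: banner, debug output, blank line
		}
		var ev map[string]any
		if err := json.Unmarshal([]byte(line), &ev); err != nil {
			continue // looks like JSON but is not; non-fatal
		}
		events = append(events, ev)
	}
	return events
}

// maxTokens (hypothetical) takes the larger of two reported counts so usage
// is never undercounted when both fields are present.
func maxTokens(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}

func main() {
	out := "starting turn\n{\"type\":\"a\"}\n{broken\n  {\"type\":\"b\"}\n"
	fmt.Println(len(parseEvents(out)), maxTokens(120, 118))
}
```

Note that `bufio.Scanner` has a default line-length limit, so a production parser handling very large events would need a bigger buffer or a different reader.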
vikingowl	f83ace7ad6	fix(google): real ADC scopes, expired-token rejection, error reporting credentials.DetectDefault(nil) always returns "options must be provided", which made the ADC branch unreachable. Pass an explicit DetectOptions with the cloud-platform scope so users with GOOGLE_APPLICATION_CREDENTIALS or `gcloud auth application-default login` actually flow through ADC instead of falling out as "no credentials found". fileTokenProvider.Token used to return expired tokens unchanged. We don't perform an OAuth refresh exchange (the upstream CLI does that out-of-band into the file we read), so when the file isn't fresh the only safe move is to fail loudly with an actionable message rather than ship a known-dead bearer that genai forwards to Vertex AI and gets back a confusing 401. tryLoadOAuthCredentials previously swallowed all errors equally, so the precedence walker silently skipped past misconfigured files (chmod 0600 on the wrong user, half-written JSON, etc.). Now os.IsNotExist is silent (normal walking), everything else gets a slog.Warn with the path so an unreadable file is visible. selectOAuthCredentials extracts the precedence chain into a testable helper that also returns a CredentialSource tag identifying which path was chosen. The previous precedence test only asserted err == nil; the new test verifies that the agy file wins when both are present and that the fallback to gemini actually loads the gemini token.	2026-05-22 12:08:22 +02:00
vikingowl	bd41d76e32	refactor(tui): store pasted images in user cache, not project workdir Ctrl+V image paste used to write the file to .gnoma/pasted_image_.png under the project root, which polluted the workdir and risked committing screenshots that may contain sensitive content. Now writes to os.UserCacheDir() / gnoma / pasted-images/ (XDG cache on Linux, ~/Library/Caches on macOS, %LocalAppData% on Windows). The directory is created at 0700 and files at 0600 since pasted content can be sensitive. Each paste prunes entries older than 2 hours best-effort, so the cache doesn't accumulate across sessions. The 2h window safely covers any single turn including provider retries and slow subprocess CLIs that need the file to still exist on disk when they ingest the path. .gitignore: cover the legacy `.gnoma/pasted_image_` location for old checkouts; add log.txt and codex_out.jsonl which were tracked as runtime artifacts during the recent work. Tests cover cache-path placement, restrictive perms on both the directory and the file, the no-pollution-of-cwd invariant, and the prune behavior (stale removed, fresh kept, missing dir no-op).	2026-05-22 11:56:04 +02:00
vikingowl	c5cc98ed8a	feat(provider/openai): translate user image content to image_url parts When the user message has at least one ImageContent block, build a ChatCompletionContentPartUnionParam array with text + image_url parts instead of the string content path. Image bytes are inlined as a base64 data URL (data:<media-type>;base64,...). Adjacent text blocks are merged into a single TextContentPart. Pure-text user messages stay on the existing string fast path. This covers OpenAI direct + every openaicompat backend (Ollama, llama.cpp, llamafile) since they all share the same provider. Tests: pure text uses OfString; image present emits 2 content parts (text + image_url with the expected base64 payload); nil-Image blocks are dropped and adjacent text merges correctly.	2026-05-22 11:50:55 +02:00
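The base64 data URL form mentioned in c5cc98ed8a (`data:<media-type>;base64,...`) is easy to show in isolation. The helper name `imageDataURL` is hypothetical; the provider presumably builds this string inline while constructing the image_url part.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// imageDataURL inlines raw image bytes as a base64 data URL.
func imageDataURL(mediaType string, data []byte) string {
	return "data:" + mediaType + ";base64," + base64.StdEncoding.EncodeToString(data)
}

func main() {
	// Two placeholder bytes stand in for real image data.
	fmt.Println(imageDataURL("image/png", []byte("hi")))
}
```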
vikingowl	bc137182d4	feat(engine): parse [Image: /path] markers, gate on Vision capability buildUserMessage replaces the unconditional NewUserText wrap inside SubmitWithOptions. When the active model advertises Vision and the input contains [Image: /path] markers, the markers are inlined as ImageContent blocks carrying the file bytes; otherwise the input is passed through as a single text block (legacy behavior preserved for subprocess CLIs that auto-ingest paths, e.g. gemini-cli). image_input.go: - imageMarkerRe extracts each [Image: ...] occurrence. - Per marker: validates absolute path, file (not dir), size cap of 10 MiB, image/* media type via http.DetectContentType. - On any validation failure, the marker is left as literal text and a warning is recorded — the turn still proceeds. Routing: latestUserHasImages drives task.RequiresVision in both the primary stream attempt and the retryOnTransient path, so failover arms also respect the vision requirement. Tests cover: no markers (single text block), single image (bytes captured into Image.Data, MediaType set), missing file (literal fallback + warning), relative path rejection, oversized rejection, non-image file rejection, multiple images interleaved with text.	2026-05-22 11:50:45 +02:00
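Marker extraction as described in bc137182d4 might be sketched as follows. `imageMarkerRe` is named in the commit; the pattern itself and the `extractImagePaths` helper are assumptions, and only the absolute-path check is shown (the file, size and media-type checks are left out).

```go
package main

import (
	"fmt"
	"path"
	"regexp"
	"strings"
)

// imageMarkerRe captures the path inside each [Image: ...] marker
// (assumed pattern).
var imageMarkerRe = regexp.MustCompile(`\[Image: ([^\]]+)\]`)

// extractImagePaths returns accepted absolute paths and the markers that
// were rejected. Rejected markers would stay in the message as literal text.
// path.IsAbs keeps the sketch platform-independent; real code would more
// likely use filepath.IsAbs.
func extractImagePaths(input string) (accepted, rejected []string) {
	for _, m := range imageMarkerRe.FindAllStringSubmatch(input, -1) {
		p := strings.TrimSpace(m[1])
		if path.IsAbs(p) {
			accepted = append(accepted, p)
		} else {
			rejected = append(rejected, p)
		}
	}
	return accepted, rejected
}

func main() {
	ok, bad := extractImagePaths("compare [Image: /tmp/a.png] with [Image: shots/b.png]")
	fmt.Println(ok, bad)
}
```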
vikingowl	a2b7f8eb3f	feat(router): vision capability gating and Ollama vision detection Task gains a RequiresVision bool; filterFeasible enforces it on both the primary feasibility pass and the last-resort fallback (no degradation to a non-vision arm — the model literally cannot consume image bytes). Ollama discovery now probes /api/show for vision capability: - details.families containing "clip" / "mllama" / "*vl" - capabilities array containing "vision" (newer Ollama) - name-prefix fallback for releases that predate either (llava, qwen2.5-vl, llama3.2-vision, moondream, pixtral, etc.) OllamaProbeResult replaces the map[string]bool tool cache so the single /api/show call can populate tools + vision + ctx-size in one probe. DiscoverOllama / DiscoverLocalModels signatures updated; nil-cache callers in cmd/gnoma keep working unchanged. RegisterDiscoveredModels propagates SupportsVision into the arm's Capabilities.Vision. Tests cover RequiresVision filtering in both the happy path (vision-only arm chosen when image present) and the fallback path (non-vision arm rejected even as last resort).	2026-05-22 11:50:33 +02:00
vikingowl	d37cc2dad3	feat(message): add ContentImage type for inline image bytes Extends the Content discriminated union with a fifth variant for inline image payloads. Image carries the raw bytes (captured at user-input time so the message snapshot is self-contained and survives source-file deletion), the IANA media type for the provider's image part, and the original path for logging. HasImages() lets providers decide whether to fall back to a text-only representation; providers that don't know about ContentImage will simply skip those blocks via TextContent().	2026-05-22 11:50:20 +02:00
vikingowl	e38cce5f1f	fix(tui): security hardening, race-safety, and event handling fixes Bundles the pending TUI work into a coherent batch. Bug fixes from external review: * expandPlaceholders: single-pass alternation regex over the original input prevents `#p\d+` / `#img\d+` tokens inside pasted content from being re-expanded after the bracket form is inlined. * /incognito: gate savePromptHistory and the Ctrl+V image-write branch on `!m.incognito` so the no-persistence contract holds. * history.txt: write at mode 0600 (chmod existing 0644 files), create parent dir at 0700, truncate to 500 entries on every save, slog.Warn on errors instead of swallowing. * triggerPickerAction: guard m.config.Engine before SetModel, matching the /model handler. * Picker key handler: navigation/enter/q consume, escape/ctrl+c close the picker AND fall through to global handlers (so streaming cancel and double-tap quit work with an overlay open), default swallows stray input. * Paste line count: report total non-empty lines instead of newline count, ignoring trailing newlines (no more "+0 lines" for "abc"). * Ctrl+O restored to expand-output; Ctrl+Y is the new copy-response bind. /keys help text updated; picker help entries reordered. * Tighter perms on .gnoma/pasted_image_*.png (0600). Race-safety refactor: ApplyTheme used to mutate ~25 package-level lipgloss styles in place. Replaced with an immutable themeStyles snapshot and atomic.Pointer[themeStyles] swap. Readers go through a theme() helper (one atomic load) instead of touching package vars directly. No locks, no nested-RLock risk if rendering ever moves off-thread. Includes pre-existing in-flight work: TUISection in config with persistent theme/vim settings; /copy /theme /vim slash commands; provider-name completion; session.SetProvider for the provider picker. 
Tests: placeholder_test.go (6 regression + happy-path cases including the pasted-content collision), history_test.go (5 cases covering perms on new and existing files, on-disk truncation, blank-input, newline flattening), provider_test.go (provider switching + picker transitions + SLM gating).	2026-05-22 11:50:12 +02:00
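The snapshot-and-swap approach from e38cce5f1f can be sketched with `atomic.Pointer` (Go 1.19 or later). `themeStyles`, `ApplyTheme` and `theme()` are names from the commit message; the struct fields and the `ApplyTheme` signature are assumptions, since the real snapshot holds lipgloss styles.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// themeStyles is an immutable snapshot; fields here are placeholders.
type themeStyles struct {
	Name   string
	Accent string
}

var current atomic.Pointer[themeStyles]

func init() {
	current.Store(&themeStyles{Name: "default", Accent: "#ffffff"})
}

// ApplyTheme builds a fresh snapshot and swaps it in atomically. Existing
// snapshots are never mutated, so readers holding one stay consistent.
func ApplyTheme(name, accent string) {
	current.Store(&themeStyles{Name: name, Accent: accent})
}

// theme returns the current snapshot with a single atomic load.
func theme() *themeStyles {
	return current.Load()
}

func main() {
	before := theme()
	ApplyTheme("dark", "#000000")
	fmt.Println(before.Name, theme().Name)
}
```

A renderer that calls `theme()` once per frame sees either the old styles or the new ones in full, never a half-updated mix.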
vikingowl	12a6b83cc9	feat: implement Google auth precedence and Codex integration	2026-05-22 00:21:32 +02:00
vikingowl	244ecd97e5	fix: security hardening (bash redirection, unicode sanitization, edit tool resolver)	2026-05-21 23:29:48 +02:00
vikingowl	67948df8cb	fix(mcp): make transport cross-compile on Windows `internal/mcp/transport.go` used syscall.Setpgid and syscall.Kill unconditionally, both Unix-only. Split the platform bits into `transport_unix.go` (build tag `!windows`) keeping the existing process-group semantics, and `transport_windows.go` (build tag `windows`) falling back to `os.Process.Kill` (kills only the immediate process — full process-tree kill on Windows would need golang.org/x/sys/windows + job objects, deferred). Caught by `goreleaser release --snapshot` cross-compiling for windows/amd64 and windows/arm64.	2026-05-20 03:34:00 +02:00
vikingowl	99fa0ff08e	refactor(providers): refresh defaults to current 2026 model lineup Bump hard-coded provider defaults to the May 2026 lineup: - Anthropic: claude-sonnet-4-6 (default); Opus 4.7 and Haiku 4.5 in the fallback list. 4.6/4.7 generation has 1M context standard. - OpenAI: gpt-5.5 (default); 5.5-pro / 5.2 / 5.2-chat-latest in fallback. ThinkingModes now baseline on GPT-5.x. - Google: gemini-3.5-flash (default); 3.1 Pro / Flash Lite in fallback. - Mistral: mistral-large-latest unchanged (Mistral Large 3); add mistral-medium-3.5, mistral-medium-2511, mistral-large-2512 to the rate-limit map. Legacy dated IDs retained in fallback lists and ratelimits maps so configs pinned to claude-sonnet-4-20250514 / gpt-4o / gemini-2.5-flash keep resolving. Capability tables (ContextWindow, MaxOutput, ThinkingModes) updated to match each generation. CLI help text in cmd/gnoma/main.go also updated.	2026-05-20 03:13:21 +02:00
vikingowl	c4fde583f5	chore(lint): gofmt sweep + errcheck cleanups in router discovery Apply gofmt -w across the codebase (struct field comment realignment only — no semantic changes) and silence two errcheck warnings on fmt.Sscanf / fmt.Fprintf return values in internal/router/discovery with explicit `_, _ =` discards. Required so `make check` is green before tagging v0.1.0.	2026-05-20 03:13:05 +02:00
vikingowl	aca830e7db	feat(engine): consumption-time stream-error failover When a stream errors out before producing any user-visible content (text, thinking, or tool calls), the engine now transparently retries on the next-best arm instead of bubbling the error to the TUI. Covers the case from the post-SLM screenshot: subprocess CLI agents that exit non-zero on auth/config failures, network drops mid-stream, rate-limited arms whose error surfaces after Stream() already returned. Mechanism: the stream-create + consume blocks are wrapped in a labeled streamLoop. On s.Err() != nil with empty accumulator, the engine emits a new EventFailover ("↻ <failed_arm> failed (<reason>) — retrying on another arm"), excludes the failed arm via task.ExcludedArms, and re-enters the loop. Cap of 4 failovers per round. Guards: - !acc.HasContent() — if text/tool calls already streamed, fail loud rather than duplicate visible output on retry. - isFailoverable(err) — deny-list approach: context.Canceled/Deadline and HTTP 400/413 are fatal; everything else (auth, rate limit, 5xx, subprocess exit, network) is failoverable. - Router.ForcedArm() == "" — when the user pinned an arm via --provider, failover is disabled by design. - failoverAttempt < maxFailovers — bounded retry budget. TUI renders EventFailover under the existing "cost" role styling. shortFailReason strips the subprocess wrapper envelope so the user sees "Invalid API key. Try again." instead of "subprocess: exit status 1: Error: Invalid API key. Try again.". Tests cover the classifier (isFailoverable, shortFailReason), end-to-end auth-error failover, content-already-streamed guard, and context-cancel guard. Deterministic across 10x -race runs by giving the failing arm IsCLIAgent=true to anchor it in tier 0 ahead of the API-tier backup.	2026-05-20 02:20:00 +02:00
vikingowl	fb42202834	refactor(security): seal SecureProvider via unexported marker method The router.SecureProvider interface previously required a public IsSecure() bool method. Any test mock — or future production type — could satisfy it by returning true, defeating the W1 "only wrapped providers may flow past the boundary" contract through convention rather than at the type level. Replaces IsSecure() bool with an unexported security.Marker interface that has a single secured() method. Go's method-set semantics key unexported methods by their defining package, so only types declared in internal/security can satisfy Marker. *SafeProvider gets the lone secured() implementation; router.SecureProvider embeds Marker. The seal forces every test mock that previously implemented IsSecure() to either (a) be wrapped with security.WrapProvider(mp, nil) at the use site, or (b) drop the method entirely if the mock never flows through SecureProvider. 93 use sites across 11 test files were updated via a per-package secureMock helper. WrapProvider with a nil firewall ref is a no-op pass-through, so test behavior is unchanged. Empirically: a type from outside internal/security can declare `secured()` but the compiler will reject assigning it to router.SecureProvider because the unexported method belongs to the other package's namespace. Convention → compile-time guarantee.	2026-05-20 02:04:07 +02:00
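The shape of the sealed-interface pattern from the commit above might look like the sketch below. It is simplified and partly hypothetical: everything lives in one package, so the cross-package seal itself cannot be shown here, and `wrapProvider` and `Provider` are illustrative stand-ins rather than the real `security.WrapProvider` signature. What the sketch does show is that a mock without the unexported `secured()` method fails to satisfy `SecureProvider` until it is wrapped.

```go
package main

import "fmt"

// Marker is sealed: secured is unexported, so in a multi-package layout only
// types declared in the defining package can implement it.
type Marker interface{ secured() }

// Provider is a hypothetical, minimal provider interface.
type Provider interface{ Name() string }

// SecureProvider embeds Marker, so only wrapped providers satisfy it.
type SecureProvider interface {
	Provider
	Marker
}

// SafeProvider is the lone type carrying the marker method.
type SafeProvider struct{ inner Provider }

func (s *SafeProvider) Name() string { return s.inner.Name() }
func (s *SafeProvider) secured()     {}

// wrapProvider is a simplified stand-in for a wrapping constructor.
func wrapProvider(p Provider) *SafeProvider { return &SafeProvider{inner: p} }

// mockProvider has no secured method, so it is not a SecureProvider.
type mockProvider struct{}

func (mockProvider) Name() string { return "mock" }

// Compile-time check that the wrapper satisfies the sealed interface.
var _ SecureProvider = (*SafeProvider)(nil)

func main() {
	var sp SecureProvider = wrapProvider(mockProvider{})
	fmt.Println(sp.Name())
}
```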
vikingowl	9853a522e6	refactor(security): consolidate TOCTOU-safe path canonicalization `3c87527` added engine/paths.go:resolveCanonical, duplicating the ancestor-walk + EvalSymlinks algorithm that already lived in fs/guard.go:ResolveWrite. Two implementations of the same TOCTOU defense is exactly the wrong shape for security code — a bug fix in one would silently miss the other. Extracts the shared algorithm to security.CanonicalizePath. Both call sites become thin wrappers that pre-anchor relative paths against the appropriate root (cwd for engine, workspace root for guard). The "hit-root" defensive branch in engine's version (commented "highly unlikely") is tightened to match guard's error behavior. Adds focused unit tests for the helper covering existing path, non-existent leaf, non-existent mid-component, symlinked ancestor, and relative-path rejection.	2026-05-20 01:50:38 +02:00
vikingowl	f6f8801040	fix(router): restore llama.cpp model enumeration; keep /props for n_ctx `3c87527` rewrote DiscoverLlamaCPP to hit /props and emit a single hardcoded "default" entry. That breaks two cases: 1. Multi-model llama.cpp deployments (llama-swap, model-routing proxies) are collapsed to a single arm with a placeholder ID. 2. Single-model deployments lose the real model name — arms are registered as llamacpp/default instead of llamacpp/<actual-id>. Restores enumeration via /v1/models (the OpenAI-compatible endpoint llama-server exposes) while keeping the concrete n_ctx read from /props. /props is now best-effort: failure or missing n_ctx falls back to the documented default rather than aborting discovery. Adds three tests: multi-model enumeration with shared context, /props unreachable, and the empty-/v1/models error path.	2026-05-20 01:45:54 +02:00
vikingowl	8539426a46	fix(router): restore Ollama cache prune + provider-specific context defaults `3c87527` refactored DiscoverOllama and DiscoverLlamaCPP and dropped two behaviors: 1. The Ollama toolCache prune loop. Without it, the cache grows unbounded across reconcile cycles and stale entries linger; a model that disappears and reappears replays an out-of-date tool-support verdict because the cache hit skips re-probing. 2. Sensible context-size defaults. Both probes can yield ContextSize=0 (Ollama: no num_ctx in /api/show parameters; llama.cpp: /props default_generation_settings without n_ctx). Registering an arm with ContextWindow=0 misroutes — the post-SLM two-stage path treats it as a tiny model. Restores the prune loop, applies 32768 (ollama) / 8192 (llama.cpp) as fallbacks at discovery time, and adds three tests covering each path.	2026-05-20 01:42:14 +02:00
vikingowl	8f13ed78a9	fix(security): redact truncated private keys via header-fallback pattern The full-block private_key regex (BEGIN…END span) added in `3c87527` fails to match when the END marker is missing — log slices, buffered streams, or partial dumps that contain only the header and key body would leak the body. Adds a private_key_header pattern that matches the header plus the trailing base64 body. Redact merges the overlapping spans into a single placeholder when both fire on a complete block. Covered by TestScanner_DetectsTruncatedPrivateKey (no END marker) and TestRedact_PrivateKeyOverlap_SinglePlaceholder (overlap merge).	2026-05-20 01:37:16 +02:00
vikingowl	c8813768d5	fix(subprocess): harden agy CLI integration - Drop unverified JSONOutput/Vision capability claims on agy (no native stream-json, no image-input path on v1.0.0). - Replace agent.Name == "agy" check with PromptResponseFormat flag on CLIAgent so the prompt-augmented JSON fallback scales to future agents. - Pass --dangerously-skip-permissions in agy PromptArgs to parallel gemini --yolo / vibe --trust; required for non-interactive runs. - Nil-guard JSONSchema and Schema bytes in buildPrompt (previously panicked when ResponseJSON was requested without a schema). - Rename misleading TestAgyProvider_StreamAugmentation to TestAgyParser_EmitsLineDeltas; add coverage for nil-schema path and non-augmenting agents.	2026-05-20 01:29:05 +02:00
vikingowl	6322d10686	test: fix compilation errors in main and mcp tests after secure provider refactor	2026-05-20 01:21:52 +02:00
vikingowl	3c875276c9	feat(security): implement multi-wave audit remediation and agy provider support Implemented full security remediation following Universal Security Pilot protocol: - W1: Enforced SecureProvider at router and engine boundaries to prevent bypasses. - W1: Implemented path-sensitive policy for MCP tools. - W2: Added SHA256 hash verification for SLM downloads (llamafile). - W3: Enhanced secret redaction for private keys (full body) and high-entropy strings. - W4: Fixed symlink-based filesystem sandbox escapes in paths and grep. - W4: Documented CLI agent trust boundaries. Also added 'agy' (Antigravity) as a subprocess CLI provider with plain-text JSON schema support.	2026-05-20 01:13:13 +02:00
vikingowl	129d4f1ea6	chore: remove TinyLlama and set tiny3.5 (Qwen2.5 0.5B) as default SLM	2026-05-20 00:26:58 +02:00
vikingowl	17d83f2e2a	feat: add agy CLI provider and support structured output via prompt augmentation	2026-05-20 00:21:03 +02:00
vikingowl	f8c85a26e9	docs(security): ADR-004 PostToolUse hook ordering + invariant test Closes the last remaining 2026-05-19 audit finding by documenting the existing transitive guarantee rather than restructuring the hook contract. The audit observed that PostToolUse hooks receive raw tool output before the firewall scan runs, and proposed reordering or splitting the event into raw-local-only and redacted-for-LLM variants. After Wave 1 (SafeProvider boundary at every router arm + non-engine provider consumer), the audit's threat model is closed transitively: - Shell hooks see raw output but never reach an LLM. - Prompt hooks route Stream calls through routerStreamer → router → arm.Provider, every arm.Provider is now *SafeProvider, outgoing messages are scanned at the boundary. - Agent hooks spawn an elf whose engine has Firewall set; buildRequest scans inline. Reordering would regress legitimate shell-hook use cases (audit, forensic, local alert) that need raw access. Splitting the contract forces every existing hook config to migrate and introduces a wrong-variant footgun. Neither is justified by the residual risk. Three changes ship with the ADR: - ADR-004 records the decision and the conditions for re-opening it. - Doc comments on hook.PostToolUse and the dispatcher call site in the engine point at the ADR. - internal/hook/posttooluse_redaction_test.go locks in the invariant: a prompt PostToolUse hook firing on a secret-bearing tool result produces a redacted prompt at the inner provider. If this test fails, ADR-004's Position A is no longer correct and the audit finding re-opens.	2026-05-19 23:28:25 +02:00
vikingowl	34f6f1c786	feat(security): incognito coherence across firewall/router/persist (Wave 2) Closes the cluster of audit findings where gnoma's incognito promise ('no persistence, no learning, local-only routing') silently broke because state was duplicated across the CLI flag, the firewall's IncognitoMode, the router's localOnly flag, and the TUI's local m.incognito field. Wave 2 makes security.IncognitoMode the canonical source of truth. W2-1 Router.Select rejects forced non-local arms when localOnly is on rather than short-circuiting and silently routing to cloud. Main fails fast when --incognito + --provider <cloud> are combined; the TUI toggle (Ctrl+X, /incognito, config panel) refuses with an actionable message when a non-local arm is pinned. Factored the three duplicated toggle sites into Model.attemptIncognitoToggle. W2-2 persist.Store.Save consults an IncognitoGate (local interface, security.IncognitoMode satisfies it). nil gate = always persist (legacy behaviour for tests); non-nil gate is consulted on every Save so TUI runtime toggles take effect without reconstructing the store. File mode 0o600, dir mode 0o700. W2-3 tui.New seeds m.incognito from cfg.Firewall.Incognito().Active(). Fixes the Ctrl+X-on-launch-with-incognito case where the first toggle silently turned the firewall OFF because the local flag started false out of sync with the firewall. W2-4 saveQuality gates on both incognito (defensive, covers the window before fwRef.Set fires) and fw.Incognito().ShouldLearn() (so TUI Ctrl+X suppresses the snapshot on exit). Quality restore skipped under --incognito. Quality file written 0o600 in dir 0o700. engine.reportOutcome and elf.Manager.ReportResult both gate on fw.Incognito().ShouldLearn() — bandit signal no longer leaks out of incognito sessions. W2-5 session files written 0o600 in dirs 0o700 (was 0o644 / 0o755). W2-6 IncognitoMode.LocalOnly dropped — dead field with no readers; routing local-only state lives on the router, not the firewall. 
Also wires rtr.SetLocalOnly(true) when --incognito at launch — main previously activated the firewall's flag but never told the router to filter, so even without the forced-arm bug, launching with --incognito alone gave you 'incognito badge but full arm pool'.	2026-05-19 22:57:36 +02:00
vikingowl	d6614545a9	feat(security): wrap engine.Config.Provider + SetProvider doc (W1 follow-up) Advisor flagged that engine.Config.Provider stayed raw, so the safety property was 'every call goes through buildRequest' instead of the stronger 'every Stream call routes through a SafeProvider.' Wrap it even though buildRequest still scans inline — at worst this costs one extra idempotent scan pass; it removes the 'someone adds a fifth engine Stream site that skips buildRequest' failure mode. Engine.SetProvider gets a doc comment establishing the wrap contract for callers. No active callers today, but documenting it now prevents the future bypass. Confirmed elf engines inherit the wrap automatically: - elf.Manager.Spawn passes arm.Provider (already *SafeProvider after W1-3a) - elf.Manager.SpawnWithProvider has no callers — dead code path Added the Wave 1 plan to TODO.md under active plans.	2026-05-19 22:37:24 +02:00
vikingowl	dc084d5a82	feat(security): wire SafeProvider into all provider sites (W1-2/3/4) Construct security.FirewallRef early in main() and Set it immediately after security.NewFirewall returns. Wrap every provider that may be called outside engine.buildRequest(): - primary provider arm (limitedProvider) - discovered local models (RegisterDiscoveredModels factory) - CLI agent arms (subprocprov.New) - background-discovery factory (StartDiscoveryLoop) - SLM arm + classifier transport - summarizer (gnomactx.NewSummarizeStrategy) routerStreamer and hook PromptExecutor inherit redaction automatically once every router arm is wrapped — they dispatch through router.Stream → arm.Provider.Stream. engine.Config.Provider stays raw because the engine still scans inline at buildRequest(); per the Wave 1 plan, removing that scan is deferred one release as belt-and-suspenders. Integration tests in internal/security/integration_test.go verify the boundary end-to-end: a router arm wrapped with WrapProvider redacts an 'sk-ant-...' literal before the inner provider sees it, and the pre-Set / post-Set transition works as documented (pass-through until the FirewallRef has a Firewall installed).	2026-05-19 22:33:24 +02:00
vikingowl	8dcca64e41	feat(security): add SafeProvider boundary wrapper (W1-1) Introduces internal/security/SafeProvider — a provider.Provider decorator that scans outgoing messages and the system prompt through the firewall before delegating to the inner provider. Tool-result redaction stays in the engine because it needs per-tool context the boundary lacks. FirewallRef provides a late-binding atomic.Pointer[Firewall] so the wrapper can be installed before NewFirewall runs in main. A nil or unset ref makes SafeProvider a pass-through — preserves the current init order without lock contention or panics. Wave 1 of the post-audit hardening plan (docs/superpowers/plans/2026-05-19-security-wave1-safeprovider.md). Closes the architectural critique that secret scanning only ran inside engine.buildRequest(), leaving SLM/summarizer/hook/routerStreamer paths to send raw payloads. This commit only ships the wrapper; W1-2 and W1-3 will wire it through main and the four bypass sites.	2026-05-19 22:28:46 +02:00
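The late-binding behaviour described in the commit above (a wrapper installed before the firewall exists, passing input through until one is set) can be illustrated with `sync/atomic`'s generic `atomic.Pointer`. The sketch is an assumption-laden miniature: the `Firewall` type, its `Scan` method, and the `scan` helper are hypothetical, and the real scanner is certainly more involved than a string replace.

```go
package main

import (
	"fmt"
	"strings"
	"sync/atomic"
)

// Firewall is a hypothetical scanner that redacts one known secret.
type Firewall struct{ secret string }

func (f *Firewall) Scan(s string) string {
	return strings.ReplaceAll(s, f.secret, "[REDACTED]")
}

// FirewallRef holds a late-bound firewall; readers never block on writers.
type FirewallRef struct{ p atomic.Pointer[Firewall] }

// Set installs the firewall once it has been constructed.
func (r *FirewallRef) Set(f *Firewall) { r.p.Store(f) }

// scan is a pass-through when the ref is nil or no firewall is installed yet.
func (r *FirewallRef) scan(s string) string {
	if r == nil {
		return s
	}
	fw := r.p.Load()
	if fw == nil {
		return s
	}
	return fw.Scan(s)
}

func main() {
	ref := &FirewallRef{}
	fmt.Println(ref.scan("key sk-ant-123")) // unset: passes through
	ref.Set(&Firewall{secret: "sk-ant-123"})
	fmt.Println(ref.scan("key sk-ant-123")) // set: redacted
}
```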
vikingowl	d84b295da2	feat(tui): /profile slash command + status-bar profile badge (Phase C-3) Adds the in-TUI surface for the profile system: - Status bar carries " · profile: <name>" next to the SLM badge when profile mode is engaged (renders nothing in legacy single-config installations). - /profile (no args) shows the active profile and lists available ones. - /profile <name> switches by re-executing gnoma via syscall.Exec under --profile <name>. Critical cleanups (quality.json snapshot, SLM backend Close, session.Close) fire explicitly before exec since defers don't run after exec replaces the process image. Using syscall.Exec rather than a child process avoids stacking a process level on every switch and propagates the new gnoma's exit code directly to the shell. - Autocomplete after "/profile " offers configured profile names; the completion source is threaded from main.go via tui.Config. Conversation history is not preserved across a switch — profile change implies different context, different keys, different permission mode, so a clean reset is the correct semantic.	2026-05-19 21:59:11 +02:00
vikingowl	8450005b31	feat(cli): gnoma profile list/show subcommands (Phase C-2) `profile list` enumerates configured profiles and marks default + active. `profile show <name>` prints the merged effective config the profile would produce — sections, configured key names (values never), CLI agent overrides, arms, hooks, MCP servers, per-profile quality and session paths. Both commands work as a recovery affordance when profile resolution is broken: list flags a missing-default explicitly with "<name> (default, missing)", and the dispatcher falls back to a base-only load (new gnomacfg.LoadBase) so the diagnostics still run. API key values are filtered out of `profile show` — the output is safe to paste in a help channel or attach to a bug report.	2026-05-19 21:44:50 +02:00
vikingowl	635dad660c	feat(config): per-profile config layering with --profile flag (Phase C-1) Adds opt-in user profiles for swapping API keys, CLI binaries, and permission modes between contexts (work/private/experiment/...). Profile mode engages only when ~/.config/gnoma/profiles/ exists, so existing single-config installations are untouched. Selection order: --profile flag → default_profile in base config → fatal error. Layering: defaults → ~/.config/gnoma/config.toml → profiles/<name>.toml → <projectRoot>/.gnoma/config.toml → env. Map sections merge per-key; [[arms]] and [[mcp_servers]] merge by id/name; [[hooks]] appends. Per-profile data: quality-<name>.json and sessions/<name>/ keep the bandit and session list from cross-contaminating between profiles. Profile names restricted to [A-Za-z0-9_-] to block --profile=../foo path traversal into derived paths.	2026-05-19 21:35:33 +02:00
vikingowl	0aabd19906	feat(router): per-arm strengths + cost weight (Phase D) Plan D from docs/superpowers/plans/2026-05-19-post-slm-unlock.md (static portion; dynamic bandit-driven promotion deferred to D-2). Routing previously let tier ordering (CLI > local > API) dominate selection — Opus, in tier 3, would lose to a tier-1 CLI agent for SecurityReview even though Opus is empirically stronger at that task. This change introduces explicit per-arm overrides: [[arms]] id = "anthropic/claude-opus-4-7" strengths = ["security_review", "planning"] cost_weight = 0.3 Strengths gate cross-tier promotion: arms matching task.Type bypass the tier loop and compete with each other directly. Promotion is a preference, not a pin — if no strength-tagged arm is feasible (backoff, pool capacity, tool support), selection falls through to the default tier order. CostWeight linearly dampens the cost penalty in scoreArm via effectiveCost = 1 + CostWeight * (cost - 1) CostWeight=1.0 (or unset) preserves current behavior; lower values trade cheapness for quality. The earlier draft used cost^CostWeight which inverts direction for sub-1 local-arm costs (raising a fraction <1 to a fractional power makes it bigger, not smaller); a monotonicity regression test prevents that drift. - internal/router/arm.go: Strengths []TaskType, CostWeight float64, HasStrength(), ResolvedCostWeight() (zero → 1.0). - internal/router/selector.go: scoreArm strength bonus const (strengthScoreBonus = 0.15) + linear cost dampening; selectBest cross-tier promotion before tier loop. - internal/router/router.go: ArmOverride type + ApplyArmOverrides() returns unknown IDs; unknown strength names skipped with per-name warning via slog. - internal/router/task.go: ParseTaskTypeStrict() returns ok bool; ParseTaskType now delegates so the two switches stay in sync. - internal/config/config.go: ArmConfig + [[arms]] TOML wiring. 
- cmd/gnoma/main.go: applies overrides after all initial arms register; logs a warning when an [[arms]] id has no matching registered arm. Tests cover: predicate helpers, scoring direction across two arms, linear-formula monotonicity on both sides of cost=1, cross-tier promotion, empty-Strengths preserves tier order, promoted arm in backoff falls through via full Router.Select path, observed-quality tiebreak between two strength-tagged arms, ApplyArmOverrides happy path + unknown-ID reporting + unknown-strength skipping.	2026-05-19 21:14:45 +02:00
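The linear cost dampening from the commit above, `effectiveCost = 1 + CostWeight * (cost - 1)`, can be written out as a small sketch. The formula and the zero-means-1.0 rule are taken from the commit message; the function names and signatures are illustrative, not the real `scoreArm` internals.

```go
package main

import "fmt"

// resolvedCostWeight treats an unset (zero) weight as 1.0, which
// preserves the undampened behaviour.
func resolvedCostWeight(w float64) float64 {
	if w == 0 {
		return 1.0
	}
	return w
}

// effectiveCost pulls the cost linearly toward 1 as the weight shrinks.
// With weight 1.0 it returns the cost unchanged.
func effectiveCost(cost, costWeight float64) float64 {
	return 1 + resolvedCostWeight(costWeight)*(cost-1)
}

func main() {
	for _, cost := range []float64{0.5, 1, 3} {
		fmt.Printf("cost=%.2f weight=0.3 effective=%.2f\n", cost, effectiveCost(cost, 0.3))
	}
}
```

For any positive weight the mapping stays monotonic in `cost` on both sides of `cost = 1`: a cheaper arm never ends up with a higher effective cost than a pricier one.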
vikingowl	b331dcd61a	feat(subprocess): per-agent binary override via [cli_agents] config Plan B from docs/superpowers/plans/2026-05-19-post-slm-unlock.md. Users with aliased CLI binaries (claude-priv, claude-work, gemini-personal) can now point gnoma's auto-discovery at them without renaming. The override flows through to the actual subprocess spawn at internal/provider/subprocess/provider.go:56, so routing through the alias is functional, not cosmetic. Config: [cli_agents] claude = "claude-priv" # discovery uses claude-priv instead of claude gemini = "" # empty value = no override (fall back to canonical) # vibe is absent = canonical name used - internal/config/config.go: CLIAgentsSection map[string]string; TOML [cli_agents] key. - internal/provider/subprocess/agent.go: - Package-level lookPath = exec.LookPath for test injection. - resolveAgentBinary(canonical, override) → (path, binName, err). Override='' falls back to canonical. Override set but missing from PATH returns an error (no silent fallback — masks user typos). - DiscoveredAgent.OverrideBinary records the override binary name when one was used; empty otherwise. - DiscoverCLIAgents(ctx, overrides) signature; warning logged when an override is configured but the binary isn't on PATH. - cmd/gnoma/main.go: both call sites pass cfg.CLIAgents. The `gnoma providers` listing renders `claude-priv (via [cli_agents].claude)` when an override is in effect. Tests cover: 5 resolver cases (no override, override set, empty override falls back, override missing, canonical missing); 4 discovery cases (no overrides, override resolves alias, empty value falls back, override missing skips agent); 2 config round-trip cases.	2026-05-19 21:02:16 +02:00
vikingowl	342b3903e1	test(slm): align HappyPath with task-type complexity floor The Debug floor (0.4) added in `eb0583f` was bumping the SLM-returned 0.25 up, breaking the HappyPath assertion. Bump the SLM value to 0.55 so the test still verifies "SLM value preserved" (its original intent), and add a dedicated TestClassifier_AppliesTaskTypeFloor that exercises the under-reporting case the floor was added to handle.	2026-05-19 20:54:27 +02:00
vikingowl	43ea2e562d	feat(engine): two-stage tool routing for small local arms Plan A from docs/superpowers/plans/2026-05-19-post-slm-unlock.md. Small local SLMs (<=16k context) waste ~1500 tokens per turn on the full tool catalogue. Two-stage routing replaces round-1 tools with a single synthetic select_category schema; round-2+ sends only the selected category's real tool schemas plus select_category for re-selection. - internal/tool/category.go: Category type, optional Categorized interface, CategoryOf() with meta fallback. fs.read/fs.ls -> read, fs.write/fs.edit -> write, fs.glob/fs.grep -> search, bash -> exec. - internal/engine/twostage.go: synthetic select_category tool, intercept helper, per-turn selectedCategory state under e.mu. - Engine round 1 forces ToolChoiceRequired so SLMs don't fall back to prose. State resets at the top and end of every runLoop. - Activates automatically on a forced local arm with ContextWindow <=16384, or via [router].force_two_stage TOML key. - Integration test drives a 3-round trip and asserts: round 1 emits exactly one schema (synthetic) with ToolChoiceRequired, round 2 contains only write-category schemas + select_category, real fs.write executes. Invalid-category fallback round-trips back to round-1 mode.	2026-05-19 20:53:21 +02:00
vikingowl	eb0583f606	fix(router): unpin config-default provider + complexity floor by task type Two routing bugs were keeping the SLM out of every real prompt and, once it was eligible, pulling complex tasks into it as well. Bug 1: ForceArm was called unconditionally when a primary provider was configured (cmd/gnoma/main.go:378). That short-circuited the entire router — every prompt went straight to whatever was set as [provider].default, regardless of tier, score, or feasibility. The SLM arm appeared in `gnoma router stats` registration logs but had zero observations after dozens of prompts. Fix: only pin when the user passed --provider on the command line. Config defaults register the arm but don't force it; the router picks freely. Verified end-to-end — trivial prompts now reach slm/ollama via the tier-0 priority. Bug 2: A short prompt like "refactor the SLM module" classifies as TaskRefactor with complexity 0.015 — well under the SLM arm's 0.3 ceiling. The arm became eligible despite the task being inherently non-trivial. Once eligible, tier-0 priority then pulled it in over the CLI agents. Fix: add MinComplexityForType, applied in both ClassifyTask (heuristic path) and slm.Classifier.Classify (SLM-overlay path). The floor is per-task-type: - TaskSecurityReview, TaskOrchestration → 0.60 - TaskRefactor, TaskPlanning, TaskDebug → 0.40 - TaskUnitTest, TaskReview → 0.35 Tasks like Explain/Generation/Boilerplate keep their organic complexity score so trivial knowledge prompts (≤0.15) still fall to the SLM. Tasks that imply existing code or multi-step reasoning are clamped above the SLM's MaxComplexity, naturally routing them to a bigger arm. After both fixes, observed routing in a clean run: What is 2+2? → slm/ollama (complexity 0.015) Define a closure → slm/ollama (complexity 0.015) What is HTTP? 
→ slm/ollama (complexity 0.015) Refactor the SLM module → subprocess/gemini (complexity 0.40) Audit for race conditions → subprocess/gemini (complexity 0.35) Plan a migration → subprocess/gemini (complexity 0.40)	2026-05-19 19:22:16 +02:00
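The per-task-type complexity floor from the commit above amounts to clamping the classifier's score from below. In the sketch, the floor values are the ones listed in the commit message; the string-backed `TaskType` and the `applyFloor` helper are illustrative assumptions and may not match the real `MinComplexityForType` signature.

```go
package main

import "fmt"

// TaskType is modelled as a string here for brevity (an assumption).
type TaskType string

const (
	TaskSecurityReview TaskType = "security_review"
	TaskOrchestration  TaskType = "orchestration"
	TaskRefactor       TaskType = "refactor"
	TaskPlanning       TaskType = "planning"
	TaskDebug          TaskType = "debug"
	TaskUnitTest       TaskType = "unit_test"
	TaskReview         TaskType = "review"
	TaskExplain        TaskType = "explain"
)

// minComplexity holds the floors from the commit message. Types that are
// absent (Explain, Generation, Boilerplate) keep their organic score.
var minComplexity = map[TaskType]float64{
	TaskSecurityReview: 0.60, TaskOrchestration: 0.60,
	TaskRefactor: 0.40, TaskPlanning: 0.40, TaskDebug: 0.40,
	TaskUnitTest: 0.35, TaskReview: 0.35,
}

// applyFloor raises the score to the task type's floor when it is lower.
func applyFloor(t TaskType, complexity float64) float64 {
	if floor, ok := minComplexity[t]; ok && complexity < floor {
		return floor
	}
	return complexity
}

func main() {
	fmt.Println(applyFloor(TaskRefactor, 0.015)) // clamped above the SLM's 0.3 ceiling
	fmt.Println(applyFloor(TaskExplain, 0.015))  // no floor: stays SLM-eligible
}
```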
vikingowl	0b4de6054d	feat(tui): surface SLM backend + per-turn classifier in status bar The TUI gave no indication that an SLM was configured or active. You'd see the primary provider on the status line and nothing else, even with [slm].enabled=true and a successfully booted backend. Two surfaces added: 1. Status-bar SLM badge. The left side of the status line gains a dim " · slm: <model> ⚙" suffix when the backend booted, " · slm: ✗" when it failed, and nothing when SLM is disabled. The ⚙ marker indicates the model advertises tool support. 2. Per-turn classifier visibility. The existing routing event already produced "routed → <arm> (task: <type>)" lines in the chat history; it now also reports which classifier made the decision, e.g. "routed → ollama/ministral-3:3b (task: explain, by: slm_fallback)". Lets you tell in real time whether the SLM is actually classifying or falling back to the keyword heuristic. Plumbing: - new tui.SLMInfo struct on tui.Config - main.go populates it after StartBackend returns - stream.Event gains RoutingClassifier; engine.runLoop fills it from task.ClassifierSource on the first round	2026-05-19 19:06:26 +02:00
vikingowl	a14fe8b504	feat(slm): pluggable backends + trivial-prompt routing The SLM had two intended jobs — classify every prompt and execute the small ones itself — but in practice three independent gates kept it out of nearly all real work: 1. llamafile cold-start blocked pipe-mode runs (always faster than the 15 s health check) 2. ClassifyTask defaulted RequiresTools=true, excluding the SLM arm (ToolUse=false) from 9/10 task types 3. armTier hard-coded CLI agents > local > API, so even when the SLM arm was feasible a CLI agent won Each gate is addressed below. The result is an SLM that actually does its job — small stuff stays local, complex stuff routes up — gated by arm capability rather than by accidents of the boot order. Backend layer (the bigger change) The original implementation hard-coded llamafile. That's fine if you have nothing else, but most users with a local model setup already run Ollama or llama.cpp. The new factory at internal/slm/backend.go picks between: - ollama (any local Ollama daemon) - llamacpp (any llama.cpp server) - llamafile (gnoma-managed, current behaviour) - openaicompat (LM Studio, vLLM, remote API) - auto (probes in order, picks first reachable) - disabled [slm].backend in config.toml selects which. Documented in docs/slm-backends.md with copy-paste presets for each. The factory probes the underlying model's actual capabilities (Ollama /api/show, llama.cpp /props) and sets the SLM arm's ToolUse accordingly — so the arm picks up simple file-read style tasks on tool-capable models and stays knowledge-only on completion-only models. Trivial-prompt heuristic (Gate 2) ClassifyTask now flips RequiresTools=false for short, low-complexity prompts whose task type doesn't imply existing code (Explain, Generation, Boilerplate). Tool-needing tokens (read, write, run, test, file, …) keep RequiresTools=true even when the prompt is brief. 
Complexity-aware tier ordering (Gate 3) armTier takes a Task and returns tier 0 for arms whose MaxComplexity ceiling fits the task. CLI agents drop to tier 1, local to 2, API to 3. For trivial tasks the SLM arm wins; for complex tasks the SLM falls out of the feasible set (MaxComplexity exclusion) and the original ordering reasserts. Eager boot with user-facing wait (Gate 1) Removed the original goroutine-only path. SLM startup now blocks synchronously inside the factory; for llamafile that means up to [slm].startup_timeout (default 5 s) of waiting on the first invocation, with "Starting SLM…" → "SLM ready (backend, model, tools, boot=N)" / "SLM unavailable: …" messages on stderr. Ollama / llamacpp backends boot instantly because the daemon is already running. waitHealthy() now respects the caller's context deadline instead of its old hardcoded 15 s ceiling. Classifier reliability Classifier timeout bumped 2 s → 5 s for thinking-mode models like Qwen3-distilled Tiny3.5. System prompt includes /no_think directive for the same family. These help but don't eliminate small-model JSON-contract failures — see the docs section on picking a model. Probe + telemetry surfaces gnoma slm status now prints the configured backend + model + a live probe result (✓/✗) instead of just the llamafile manifest state. `gnoma router stats` already (from the previous commit) shows the classifier-source mix; with this change you can finally see slm / slm_fallback / heuristic share rise from "always heuristic" to something reflecting real SLM activity. Tests - 9 new backend-factory tests (httptest-backed Ollama probe, error paths, auto-detection, capability flags) - Tier-ordering tests cover the new "specialised small arm wins trivial task" path - Trivial-prompt heuristic tested for both halves (knowledge-only flips RequiresTools=false; debug/file/run keeps it true) Deletes the dead SLMManager field from the TUI Config — it was declared but never read.	2026-05-19 18:53:32 +02:00

177 Commits