gnoma

Author	SHA1	Message	Date
vikingowl	fd327107df	fix(router/discovery): always probe ollama capabilities, cache is optional DiscoverOllama() interpreted a nil probeCache as 'skip probing entirely' rather than 'probe but don't cache.' cmd/gnoma/main.go's synchronous discovery path passes nil, so every ollama-discovered model got SupportsTools=false (the Go zero value), regardless of what ollama actually reported in its capabilities field. The symptom: filterFeasible rejected every ollama arm for any tool-requiring task with reason=tools_required_but_unsupported, even when ollama itself reported the model as tool-capable. Verified via curl: qwen3:14b advertises capabilities=[completion, tools, thinking] and has 'tools' in its template, but the gnoma arm shipped with tool_use_capability=false. Fix: always run probeOllamaModel; treat probeCache as an optional memoisation aid only. nil cache now means 'no caching across calls' not 'no probing.' For users with many models, passing a real cache still avoids redundant HTTP calls — semantics for that path are unchanged. Surfaced via the new filterFeasible Debug logging from the previous commit, which made the per-arm rejection reasons visible.	2026-05-25 02:28:05 +02:00
vikingowl	0d3d190a8b	fix(slm,session,router): classifier-only SLMs + session error recovery + feasibility diagnostics Three coupled fixes that surfaced from a single FunctionGemma test session where the SLM-as-execution-arm assumption broke down and every subsequent prompt failed with 'session not idle (state: error)'. (A) [slm].register_as_arm config. The SLM has always been unconditionally registered as both classifier AND tier-0 execution arm. Fine for general-purpose models (ministral, qwen3-chat); breaks for task-specialised models (FunctionGemma emits function-call syntax instead of prose; embedding models can't generate). New pointer-bool config: nil/absent preserves the historical default (true), explicit false makes the SLM classifier-only and the execution path skips the slm/* arm. Three table tests cover absent / explicit-false / explicit-true decode paths. (B) Session error recovery. After any routing or engine error, the session moved to StateError and stayed there until restart — every new user prompt got rejected with 'session not idle (state: error)'. ResetError() was already wired for the /init retry path, but the general user-input and slash-command paths didn't call it. Added ResetError() before every user-initiated Send in the TUI so a fresh prompt always represents intent-to-retry. The /init internal retry already had its own ResetError; left alone. (C) filterFeasible per-arm rejection logging. Today's 'no feasible arm for task X' error tells you THAT every arm was rejected but nothing about WHY. Added slog.Debug per rejection (arm, task, complexity, reason, the specific violated constraint) plus a summary line when zero arms are feasible at any quality. Visible with --verbose; quiet otherwise. Surface area expansion only — no behaviour change for users not chasing a bug.	2026-05-25 01:57:16 +02:00
vikingowl	eea26a262e	feat(router): surface bandit knobs as [router.bandit] config Four hardcoded constants in the selector and feedback tracker are now user-tunable via [router.bandit]: - quality_alpha (EMA smoothing, default 0.3) - min_observations (samples before observed overrides heuristic, default 3) - observed_weight (observed/heuristic blend ratio, default 0.7) - strength_bonus (quality bonus for Strengths-tagged arms, default 0.15) Each field treats 0 as 'use default', so an empty TOML block is byte-identical to pre-config behaviour. BanditParams is plumbed via router.Config{Bandit: ...} and resolveBanditParams() centralises the fallback so every call site shares the same defaults. QualityTracker, scoreArm, bestScored, and selectBest signatures now take the configured values directly rather than reaching for package- level constants. Tests updated to pass BanditParams{} (defaults) or explicit overrides where they validate the new tuning paths. Tracks item #3 from the 'Bandit selector — design decisions deferred' TODO entry — ships independently of the EMA vs SLM strategic decision.	2026-05-24 22:42:34 +02:00
vikingowl	a23eb6b92c	style: gofmt drift from prior commits Pure whitespace cleanup surfaced when 'make check' ran gofmt over the tree. Mostly struct-field column alignment in internal/safety/banner.go (SessionInfo) and the var(...) flag block in cmd/gnoma/main.go after --dangerously-allow-anywhere was added without realignment. Verified zero substantive changes via 'git diff --ignore-all-space --ignore-blank-lines'.	2026-05-24 16:33:17 +02:00
vikingowl	f9094f68f3	feat(router): [router].prefer = local | cloud | auto Implements P-1 through P-6 of the prefer-routing-policy plan. Adds a config knob that biases routing toward local arms, cloud arms, or leaves selection unchanged. Default "auto" is byte-identical to pre-change behavior (the new armTier path with PreferAuto returns the same value as the old single-arg function). Mechanism diverged from the plan after empirical testing: The plan called for a score multiplier applied in bestScored. Tests revealed the existing cost-floor math (scoreArm divides by weighted cost which collapses to ~0.001 for free local arms) gives local arms a ~280x raw-score advantage that a 0.3-0.5 multiplier can't overcome. A tier-shift in armTier turned out cleaner: PreferLocal: cloud arms (true API, IsLocal=false && !IsCLIAgent) get +2 tier shift, landing behind locals. PreferCloud: IsLocal arms get +2 tier shift, landing behind cloud. SLM tier-0 arms shift to tier 2 — still below cloud's tier 3 — so the SLM-protection semantic (small stuff stays on the small model) survives PreferCloud. This matches the open question in the plan, now resolved as: yes, SLMs keep winning under PreferCloud by design. The policyMultiplier was kept in bestScored as a within-tier nudge (mostly cosmetic in practice given the cost-floor dynamics described above; could matter when costs are calibrated). Worth revisiting once router-wide cost calibration lands. Strengths cross-tier promotion is unaffected: the promoted-set path in selectBest bypasses armTier entirely, so a strongly-tagged cloud arm still wins SecurityReview tasks under PreferLocal (validated by TestPreferPolicy_StrengthsBeatsMultiplier). CLI-agent subprocess arms count as "local" for PreferLocal purposes — they proxy to cloud but the user-visible behavior is local. Users who want to exclude them can use --provider X.
Forced arms (--provider X) and incognito take priority over the policy: forced arm test pins this, incognito-still-wins test pins the LocalOnly hard filter dominating PreferCloud. Test coverage (prefer_test.go): ParsePreferPolicy / String round trips; policyMultiplier table; acceptance scenarios across all three policies with adjacent-tier arms; SLM-still-wins under PreferCloud; Strengths beats multiplier; forced-arm bypass; incognito beats prefer; lone cloud arm wins when no local feasible. Refs: docs/superpowers/plans/2026-05-23-prefer-routing-policy.md	2026-05-23 22:13:26 +02:00
vikingowl	2f8d4c412f	feat(router): cloud-arm defaults, gpt-5.3-codex registration Closes R-4 and R-5 of the routing-defaults plan. R-4: Strengths + CostWeight defaults for closed frontier models. Cloud entries land in the same knownFamilyDefaults table as local ones, with MaxComplexity intentionally left zero (cloud arms get no complexity ceiling). CostWeight tuned per the plan's rationale: claude-opus-4-7 → Planning/SecurityReview/Debug/Refactor, 0.3 claude-sonnet-4-6 → Generation/Refactor/Review, 0.7 gpt-5.5 → Planning/SecurityReview/Generation, 0.3 gpt-5.3-codex → Generation/Refactor/Debug/UnitTest, 0.6 gpt-5.2 → Orchestration/Review, 0.8 gemini-3.1-pro → Planning/Review/Orchestration, 0.5 gemini-3.5-flash → Boilerplate/Explain/Orchestration, 1.2 The 0.3 weight on frontier arms keeps them competitive on SecurityReview / Planning despite $4+/Mtok; 1.2 on Gemini Flash penalizes cost more so it only wins when cost is genuinely decisive (boilerplate, explain). Mechanism: extracted applyFamilyDefaults into defaults.go and call it from Router.RegisterArm. Single source of truth — both local discovery and the primary-provider path in cmd/gnoma/main.go now flow through the same defaults application. Removed the duplicate apply block from RegisterDiscoveredModels. Legacy model IDs (claude-opus-4-20250514, gpt-4o, o3, gemini-2.5-pro, etc.) intentionally do not match any table entry — keeps users on pinned older models safe from imposed 2026 Strengths. R-5: gpt-5.3-codex registration. - internal/provider/openai/provider.go: added to fallbackModels and inferOpenAIModelCapabilities (400K context, 32K output). - internal/provider/ratelimits.go: gpt-5.3-codex and its dated alias gpt-5.3-codex-2026-02-15 added with the same Tier 1 quotas as gpt-5.2. Gemini 3.x (3.1-pro-preview, 3.5-flash, 3.1-flash-lite) was already registered in both google/provider.go and ratelimits.go — no change needed for that part of R-5. 
Test coverage: - ResolveFamilyDefaults table-driven across all 7 cloud entries including prefix-sharing (gpt-5.5-pro → gpt-5.5 defaults, gemini-3.1-pro-preview → gemini-3.1-pro defaults). - Legacy IDs return !ok. - RegisterArm applies cloud defaults end-to-end. - User-supplied Strengths and CostWeight are not overridden. - ID.Model() fallback works when ModelName is empty (test code often constructs arms this way). Refs: docs/superpowers/plans/2026-05-23-routing-defaults-refresh.md	2026-05-23 21:39:48 +02:00
vikingowl	9bb775a4aa	feat(router): full local family defaults table with size-keyed ceilings Expands the family-defaults scaffold to 23 entries covering the local models that currently appear in real Ollama fleets: coder specialists (qwen3-coder, devstral, qwen2.5-coder, yi-coder, deepseek-coder, starcoder), reasoners (phi-4, phi-4-mini), Gemma 2/3/4 (including the "edge" e2b/e4b variants under both Ollama and GGUF naming), Qwen 2.5/3/3.5 with a catch-all qwen entry, Mistral/Ministral (incl. the 24B mistral-small-3), Llama 3.2/4, tiny3.5 (reec's distill family), Granite, GLM (incl. glm-ocr specialist), and MiniCPM-V. Five families that span wide parameter ranges (qwen3.5, qwen3, qwen2.5, ministral-3, tiny3.5) now use SizeCap ladders instead of a flat MaxComplexity. A new parseSizeFromModelID helper splits the model ID on :/-_/ and matches pure <N>b/<N>m tokens, correctly ignoring qwen3.5 version strings, e2b edge tags, a3b MoE active params, and v0.3 version suffixes. ResolveMaxComplexity wraps ResolveFamilyDefaults plus the SizeCap traversal, falling back to the smallest cap when size parsing fails (conservative). Discovery's apply path now goes through it so SizeCap entries actually take effect. 
Test coverage: - parseSizeFromModelID (11 cases) - ResolveFamilyDefaults longest-prefix discipline (19 cases) - Unknown-family fallback returns !ok - ResolveMaxComplexity size-keyed ladder (13 cases) - Size-parse-failure fallback - knownFamilyDefaults invariants: SizeCaps ordered largest-first, SizeCaps and MaxComplexity mutually exclusive per entry - Routing-payoff integration: 3 arms (tiny3.5:1.5b, phi-4:14b, qwen3-coder:30b) get picked for TaskGeneration / TaskPlanning / TaskBoilerplate respectively, without any [[arms]] config - Local fleet visibility: the maintainer's actual `ollama ls` inventory registers correctly with expected MaxComplexity and Strengths; embeddinggemma stays filtered out The Planning sub-case surfaced a separate issue worth flagging: heuristicQuality floors out at 0.55 for a generic 14B local model without ThinkingModes, below TaskPlanning's 0.60 threshold. The test mutates phi-4's capabilities post-registration to reflect reality (phi-4 is reasoning-tuned). A discovery-side thinking-capability detection is out of scope for this plan but flagged in the test comment for follow-up. Refs: docs/superpowers/plans/2026-05-23-routing-defaults-refresh.md	2026-05-23 21:34:09 +02:00
vikingowl	a79e99199d	feat(router): non-chat exclude, vision prefixes, family-defaults scaffold Discovery previously registered every model returned by Ollama as a chat arm, including embeddings, ASR, TTS, audio realtime, and rerankers — which then failed at inference time when the router selected them. Local arms also shipped with all-zero defaults, so selection between e.g. tiny3.5:1.5b, phi-4:14b, and qwen3-coder:30b was effectively random. This change covers tasks R-1, R-2, R-6 from the routing-defaults plan. - nonChatModelPatterns + isNonChatModel substring matcher; matched IDs are skipped during RegisterDiscoveredModels. Covers whisper, moonshine, kokoros, vibevoice, -asr, -tts, -audio, -embedding, embeddinggemma, -reranker, lfm2. - knownVisionModelPrefixes gains gemma4, gemma-4, glm-ocr. gemma3 and minicpm-v entries stay for regression coverage. - New internal/router/defaults.go with FamilyDefaults struct, knownFamilyDefaults map, and ResolveFamilyDefaults longest-prefix lookup (with org/-namespace stripping so reecdev/tiny3.5:1.5b resolves to "tiny3.5"). Single entry for now: functiongemma is registered with Disabled=true and MaxComplexity=0.40, reserved for the future ArmRoleToolRouter path. Table will grow in R-3. - RegisterDiscoveredModels consults ResolveFamilyDefaults and only populates fields that are still zero on the arm, so user [[arms]] overrides keep priority. Plans: - docs/superpowers/plans/2026-05-23-routing-defaults-refresh.md - docs/superpowers/plans/2026-05-23-tool-router-specialization.md TODO.md surfaces both as in-flight items.	2026-05-23 21:24:59 +02:00
vikingowl	a2b7f8eb3f	feat(router): vision capability gating and Ollama vision detection Task gains a RequiresVision bool; filterFeasible enforces it on both the primary feasibility pass and the last-resort fallback (no degradation to a non-vision arm — the model literally cannot consume image bytes). Ollama discovery now probes /api/show for vision capability: - details.families containing "clip" / "mllama" / "*vl" - capabilities array containing "vision" (newer Ollama) - name-prefix fallback for releases that predate either (llava, qwen2.5-vl, llama3.2-vision, moondream, pixtral, etc.) OllamaProbeResult replaces the map[string]bool tool cache so the single /api/show call can populate tools + vision + ctx-size in one probe. DiscoverOllama / DiscoverLocalModels signatures updated; nil-cache callers in cmd/gnoma keep working unchanged. RegisterDiscoveredModels propagates SupportsVision into the arm's Capabilities.Vision. Tests cover RequiresVision filtering in both the happy path (vision-only arm chosen when image present) and the fallback path (non-vision arm rejected even as last resort).	2026-05-22 11:50:33 +02:00
vikingowl	c4fde583f5	chore(lint): gofmt sweep + errcheck cleanups in router discovery Apply gofmt -w across the codebase (struct field comment realignment only — no semantic changes) and silence two errcheck warnings on fmt.Sscanf / fmt.Fprintf return values in internal/router/discovery with explicit `_, _ =` discards. Required so `make check` is green before tagging v0.1.0.	2026-05-20 03:13:05 +02:00
vikingowl	fb42202834	refactor(security): seal SecureProvider via unexported marker method The router.SecureProvider interface previously required a public IsSecure() bool method. Any test mock — or future production type — could satisfy it by returning true, defeating the W1 "only wrapped providers may flow past the boundary" contract through convention rather than at the type level. Replaces IsSecure() bool with an unexported security.Marker interface that has a single secured() method. Go's method-set semantics key unexported methods by their defining package, so only types declared in internal/security can satisfy Marker. *SafeProvider gets the lone secured() implementation; router.SecureProvider embeds Marker. The seal forces every test mock that previously implemented IsSecure() to either (a) be wrapped with security.WrapProvider(mp, nil) at the use site, or (b) drop the method entirely if the mock never flows through SecureProvider. 93 use sites across 11 test files were updated via a per-package secureMock helper. WrapProvider with a nil firewall ref is a no-op pass-through, so test behavior is unchanged. Empirically: a type from outside internal/security can declare `secured()` but the compiler will reject assigning it to router.SecureProvider because the unexported method belongs to the other package's namespace. Convention → compile-time guarantee.	2026-05-20 02:04:07 +02:00
vikingowl	f6f8801040	fix(router): restore llama.cpp model enumeration; keep /props for n_ctx `3c87527` rewrote DiscoverLlamaCPP to hit /props and emit a single hardcoded "default" entry. That breaks two cases: 1. Multi-model llama.cpp deployments (llama-swap, model-routing proxies) are collapsed to a single arm with a placeholder ID. 2. Single-model deployments lose the real model name — arms are registered as llamacpp/default instead of llamacpp/<actual-id>. Restores enumeration via /v1/models (the OpenAI-compatible endpoint llama-server exposes) while keeping the concrete n_ctx read from /props. /props is now best-effort: failure or missing n_ctx falls back to the documented default rather than aborting discovery. Adds three tests: multi-model enumeration with shared context, /props unreachable, and the empty-/v1/models error path.	2026-05-20 01:45:54 +02:00
vikingowl	8539426a46	fix(router): restore Ollama cache prune + provider-specific context defaults `3c87527` refactored DiscoverOllama and DiscoverLlamaCPP and dropped two behaviors: 1. The Ollama toolCache prune loop. Without it, the cache grows unbounded across reconcile cycles and stale entries linger; a model that disappears and reappears replays an out-of-date tool-support verdict because the cache hit skips re-probing. 2. Sensible context-size defaults. Both probes can yield ContextSize=0 (Ollama: no num_ctx in /api/show parameters; llama.cpp: /props default_generation_settings without n_ctx). Registering an arm with ContextWindow=0 misroutes — the post-SLM two-stage path treats it as a tiny model. Restores the prune loop, applies 32768 (ollama) / 8192 (llama.cpp) as fallbacks at discovery time, and adds three tests covering each path.	2026-05-20 01:42:14 +02:00
vikingowl	3c875276c9	feat(security): implement multi-wave audit remediation and agy provider support Implemented full security remediation following Universal Security Pilot protocol: - W1: Enforced SecureProvider at router and engine boundaries to prevent bypasses. - W1: Implemented path-sensitive policy for MCP tools. - W2: Added SHA256 hash verification for SLM downloads (llamafile). - W3: Enhanced secret redaction for private keys (full body) and high-entropy strings. - W4: Fixed symlink-based filesystem sandbox escapes in paths and grep. - W4: Documented CLI agent trust boundaries. Also added 'agy' (Antigravity) as a subprocess CLI provider with plain-text JSON schema support.	2026-05-20 01:13:13 +02:00
vikingowl	129d4f1ea6	chore: remove TinyLlama and set tiny3.5 (Qwen2.5 0.5B) as default SLM	2026-05-20 00:26:58 +02:00
vikingowl	34f6f1c786	feat(security): incognito coherence across firewall/router/persist (Wave 2) Closes the cluster of audit findings where gnoma's incognito promise ('no persistence, no learning, local-only routing') silently broke because state was duplicated across the CLI flag, the firewall's IncognitoMode, the router's localOnly flag, and the TUI's local m.incognito field. Wave 2 makes security.IncognitoMode the canonical source of truth. W2-1 Router.Select rejects forced non-local arms when localOnly is on rather than short-circuiting and silently routing to cloud. Main fails fast when --incognito + --provider <cloud> are combined; the TUI toggle (Ctrl+X, /incognito, config panel) refuses with an actionable message when a non-local arm is pinned. Factored the three duplicated toggle sites into Model.attemptIncognitoToggle. W2-2 persist.Store.Save consults an IncognitoGate (local interface, security.IncognitoMode satisfies it). nil gate = always persist (legacy behaviour for tests); non-nil gate is consulted on every Save so TUI runtime toggles take effect without reconstructing the store. File mode 0o600, dir mode 0o700. W2-3 tui.New seeds m.incognito from cfg.Firewall.Incognito().Active(). Fixes the Ctrl+X-on-launch-with-incognito case where the first toggle silently turned the firewall OFF because the local flag started false out of sync with the firewall. W2-4 saveQuality gates on both incognito (defensive, covers the window before fwRef.Set fires) and fw.Incognito().ShouldLearn() (so TUI Ctrl+X suppresses the snapshot on exit). Quality restore skipped under --incognito. Quality file written 0o600 in dir 0o700. engine.reportOutcome and elf.Manager.ReportResult both gate on fw.Incognito().ShouldLearn() — bandit signal no longer leaks out of incognito sessions. W2-5 session files written 0o600 in dirs 0o700 (was 0o644 / 0o755). W2-6 IncognitoMode.LocalOnly dropped — dead field with no readers; routing local-only state lives on the router, not the firewall. 
Also wires rtr.SetLocalOnly(true) when --incognito at launch — main previously activated the firewall's flag but never told the router to filter, so even without the forced-arm bug, launching with --incognito alone gave you 'incognito badge but full arm pool'.	2026-05-19 22:57:36 +02:00
vikingowl	0aabd19906	feat(router): per-arm strengths + cost weight (Phase D) Plan D from docs/superpowers/plans/2026-05-19-post-slm-unlock.md (static portion; dynamic bandit-driven promotion deferred to D-2). Routing previously let tier ordering (CLI > local > API) dominate selection — Opus, in tier 3, would lose to a tier-1 CLI agent for SecurityReview even though Opus is empirically stronger at that task. This change introduces explicit per-arm overrides: [[arms]] id = "anthropic/claude-opus-4-7" strengths = ["security_review", "planning"] cost_weight = 0.3 Strengths gate cross-tier promotion: arms matching task.Type bypass the tier loop and compete with each other directly. Promotion is a preference, not a pin — if no strength-tagged arm is feasible (backoff, pool capacity, tool support), selection falls through to the default tier order. CostWeight linearly dampens the cost penalty in scoreArm via effectiveCost = 1 + CostWeight * (cost - 1) CostWeight=1.0 (or unset) preserves current behavior; lower values trade cheapness for quality. The earlier draft used cost^CostWeight which inverts direction for sub-1 local-arm costs (raising a fraction <1 to a fractional power makes it bigger, not smaller); a monotonicity regression test prevents that drift. - internal/router/arm.go: Strengths []TaskType, CostWeight float64, HasStrength(), ResolvedCostWeight() (zero → 1.0). - internal/router/selector.go: scoreArm strength bonus const (strengthScoreBonus = 0.15) + linear cost dampening; selectBest cross-tier promotion before tier loop. - internal/router/router.go: ArmOverride type + ApplyArmOverrides() returns unknown IDs; unknown strength names skipped with per-name warning via slog. - internal/router/task.go: ParseTaskTypeStrict() returns ok bool; ParseTaskType now delegates so the two switches stay in sync. - internal/config/config.go: ArmConfig + [[arms]] TOML wiring. 
- cmd/gnoma/main.go: applies overrides after all initial arms register; logs a warning when an [[arms]] id has no matching registered arm. Tests cover: predicate helpers, scoring direction across two arms, linear-formula monotonicity on both sides of cost=1, cross-tier promotion, empty-Strengths preserves tier order, promoted arm in backoff falls through via full Router.Select path, observed-quality tiebreak between two strength-tagged arms, ApplyArmOverrides happy path + unknown-ID reporting + unknown-strength skipping.	2026-05-19 21:14:45 +02:00
vikingowl	eb0583f606	fix(router): unpin config-default provider + complexity floor by task type Two routing bugs were keeping the SLM out of every real prompt and, once it was eligible, pulling complex tasks into it as well. Bug 1: ForceArm was called unconditionally when a primary provider was configured (cmd/gnoma/main.go:378). That short-circuited the entire router — every prompt went straight to whatever was set as [provider].default, regardless of tier, score, or feasibility. The SLM arm appeared in `gnoma router stats` registration logs but had zero observations after dozens of prompts. Fix: only pin when the user passed --provider on the command line. Config defaults register the arm but don't force it; the router picks freely. Verified end-to-end — trivial prompts now reach slm/ollama via the tier-0 priority. Bug 2: A short prompt like "refactor the SLM module" classifies as TaskRefactor with complexity 0.015 — well under the SLM arm's 0.3 ceiling. The arm became eligible despite the task being inherently non-trivial. Once eligible, tier-0 priority then pulled it in over the CLI agents. Fix: add MinComplexityForType, applied in both ClassifyTask (heuristic path) and slm.Classifier.Classify (SLM-overlay path). The floor is per-task-type: - TaskSecurityReview, TaskOrchestration → 0.60 - TaskRefactor, TaskPlanning, TaskDebug → 0.40 - TaskUnitTest, TaskReview → 0.35 Tasks like Explain/Generation/Boilerplate keep their organic complexity score so trivial knowledge prompts (≤0.15) still fall to the SLM. Tasks that imply existing code or multi-step reasoning are clamped above the SLM's MaxComplexity, naturally routing them to a bigger arm. After both fixes, observed routing in a clean run: What is 2+2? → slm/ollama (complexity 0.015) Define a closure → slm/ollama (complexity 0.015) What is HTTP? 
→ slm/ollama (complexity 0.015) Refactor the SLM module → subprocess/gemini (complexity 0.40) Audit for race conditions → subprocess/gemini (complexity 0.35) Plan a migration → subprocess/gemini (complexity 0.40)	2026-05-19 19:22:16 +02:00
vikingowl	a14fe8b504	feat(slm): pluggable backends + trivial-prompt routing The SLM had two intended jobs — classify every prompt and execute the small ones itself — but in practice three independent gates kept it out of nearly all real work: 1. llamafile cold-start blocked pipe-mode runs (always faster than the 15 s health check) 2. ClassifyTask defaulted RequiresTools=true, excluding the SLM arm (ToolUse=false) from 9/10 task types 3. armTier hard-coded CLI agents > local > API, so even when the SLM arm was feasible a CLI agent won Each gate is addressed below. The result is an SLM that actually does its job — small stuff stays local, complex stuff routes up — gated by arm capability rather than by accidents of the boot order. Backend layer (the bigger change) The original implementation hard-coded llamafile. That's fine if you have nothing else, but most users with a local model setup already run Ollama or llama.cpp. The new factory at internal/slm/backend.go picks between: - ollama (any local Ollama daemon) - llamacpp (any llama.cpp server) - llamafile (gnoma-managed, current behaviour) - openaicompat (LM Studio, vLLM, remote API) - auto (probes in order, picks first reachable) - disabled [slm].backend in config.toml selects which. Documented in docs/slm-backends.md with copy-paste presets for each. The factory probes the underlying model's actual capabilities (Ollama /api/show, llama.cpp /props) and sets the SLM arm's ToolUse accordingly — so the arm picks up simple file-read style tasks on tool-capable models and stays knowledge-only on completion-only models. Trivial-prompt heuristic (Gate 2) ClassifyTask now flips RequiresTools=false for short, low-complexity prompts whose task type doesn't imply existing code (Explain, Generation, Boilerplate). Tool-needing tokens (read, write, run, test, file, …) keep RequiresTools=true even when the prompt is brief. 
Complexity-aware tier ordering (Gate 3) armTier takes a Task and returns tier 0 for arms whose MaxComplexity ceiling fits the task. CLI agents drop to tier 1, local to 2, API to 3. For trivial tasks the SLM arm wins; for complex tasks the SLM falls out of the feasible set (MaxComplexity exclusion) and the original ordering reasserts. Eager boot with user-facing wait (Gate 1) Removed the original goroutine-only path. SLM startup now blocks synchronously inside the factory; for llamafile that means up to [slm].startup_timeout (default 5 s) of waiting on the first invocation, with "Starting SLM…" → "SLM ready (backend, model, tools, boot=N)" / "SLM unavailable: …" messages on stderr. Ollama / llamacpp backends boot instantly because the daemon is already running. waitHealthy() now respects the caller's context deadline instead of its old hardcoded 15 s ceiling. Classifier reliability Classifier timeout bumped 2 s → 5 s for thinking-mode models like Qwen3-distilled Tiny3.5. System prompt includes /no_think directive for the same family. These help but don't eliminate small-model JSON-contract failures — see the docs section on picking a model. Probe + telemetry surfaces gnoma slm status now prints the configured backend + model + a live probe result (✓/✗) instead of just the llamafile manifest state. `gnoma router stats` already (from the previous commit) shows the classifier-source mix; with this change you can finally see slm / slm_fallback / heuristic share rise from "always heuristic" to something reflecting real SLM activity. Tests - 9 new backend-factory tests (httptest-backed Ollama probe, error paths, auto-detection, capability flags) - Tier-ordering tests cover the new "specialised small arm wins trivial task" path - Trivial-prompt heuristic tested for both halves (knowledge-only flips RequiresTools=false; debug/file/run keeps it true) Deletes the dead SLMManager field from the TUI Config — it was declared but never read.	2026-05-19 18:53:32 +02:00
vikingowl	58beb7ce3c	feat(router): classifier-source telemetry + router stats command Phase 4 routing decisions depend on knowing whether the SLM classifier is actually firing or whether the heuristic is silently doing all the work. Adds the instrumentation to make that observable. router.ClassifierSource enum (heuristic / slm / slm_fallback) is set on Task by every classifier: - HeuristicClassifier → ClassifierHeuristic - slm.Classifier → ClassifierSLM on success, ClassifierSLMFallback when the SLM call fails or returns unparseable output The source is plumbed through router.Outcome to QualityTracker, which now maintains per-source counters alongside the existing per-arm × task EMA scores. QualitySnapshot serializes both (classifier_counts is omitempty for back-compat with pre-feature quality.json files). lazyClassifier logs at INFO the first time it falls back to heuristic because the SLM hasn't booted yet — distinguishes operational fallback from an unconfigured-SLM run. slm.Manager.Start() now records elapsed-to-healthy and the main.go goroutine logs it as part of the "SLM ready" event. Confirms whether short-lived runs are racing the boot cycle. New `gnoma router stats` subcommand prints both tables (arm × task quality, classifier source breakdown) from quality.json with a Phase 4 trust hint when the data is too sparse or the SLM share is low. 6 new tests cover ClassifierSource string/enum, heuristic + SLM source propagation, QualityTracker counter round-trip, and back-compat restore from a legacy quality.json without classifier_counts.	2026-05-19 18:18:22 +02:00
vikingowl	ec9433d783	chore(lint): clear remaining errcheck and staticcheck findings Brings the project to a clean `make lint` baseline (0 issues). Mechanical: - Wrap deferred resp.Body.Close() in closures (router/discovery.go, router/probe.go) so the unchecked return surfaces as `_ = ...`. - Apply `_ = ...` (single or multi-return blank) to test-file calls that intentionally ignore errors: os.MkdirAll / os.WriteFile / os.Chdir in setup paths, Close / Shutdown in teardown, Submit / Spawn / Send / LoadDir in tests that assert on side effects. Structural: - engine.handleRequestTooLarge drops the unused req parameter and rebuilds the request from compacted history (SA4009 — argument was overwritten before first use). - provider.ClassifyHTTPStatus and google.applyCapabilityOverrides switch to tagged switches over the discriminator (QF1002). - tui.app.go MouseWheel + inputMode and cmd/gnoma main slm-status use tagged switches in place of equality chains (QF1003). - cmd/gnoma main.go merges a var decl with its immediate assignment (S1021). - Three empty-branch sites (dispatcher_test, loader_test, coordinator_test) become real assertions or get the dead `if` removed (SA9003).	2026-05-19 17:53:42 +02:00
vikingowl	135c8afe80	feat: various improvements to engine, router, and TUI - engine/loop: enhanced loop handling - router: dynamic model discovery and task improvements - tui: suggestion box, input mode indicator, completions enhancements	2026-05-07 22:51:50 +02:00
vikingowl	a9213ec382	feat(slm): Wave C — SLM classifier, MaxComplexity routing, CLI subcommands, TUI status - slm.Classifier: openaicompat → llamafile, 2s timeout + heuristic fallback, heuristic baseline blended so Priority/RequiredEffort are never zeroed, extractJSON strips markdown fences from small-model responses - router.ParseTaskType: case-insensitive string → TaskType, unknown → TaskGeneration - router.Arm.MaxComplexity: zero = no ceiling (preserves existing arm behavior); filterFeasible excludes arms when task.ComplexityScore > MaxComplexity - config.SLMSection: [slm] enabled / model_url / data_dir - openaicompat.NewLlamafile: no API key, model = "default", no retries - slm.Manager: DefaultDataDir() (XDG), Manifest() accessor - cmd/gnoma: `gnoma slm setup` / `gnoma slm status` subcommands; SLM arm registered with MaxComplexity=0.3 when enabled + set up - tui: /config shows slm status (ready/missing/not set up + base URL if running) - docs: roadmap updated to reflect llamafile pivot from Ollama	2026-05-07 16:44:32 +02:00
vikingowl	8b2202e8ec	feat(classifier): Wave A — TaskClassifier interface + HeuristicClassifier - internal/router/classifier.go: TaskClassifier interface with Classify(ctx, prompt, history) signature. HeuristicClassifier wraps the existing ClassifyTask() with zero behavior change. - engine.Config.Classifier: injectable TaskClassifier; nil defaults to HeuristicClassifier. Engine.classify() helper handles nil + error fallback transparently. - loop.go: all four router.ClassifyTask() call sites replaced with e.classify(ctx, prompt). SLMClassifier slots in without further changes to the engine.	2026-05-07 16:11:20 +02:00
vikingowl	6883c2a041	feat(router): tier-based routing — CLI > local > API, disabled arms Adds explicit tier preference to arm selection so the router deterministically prefers lower-cost arms before falling back: tier 0: CLI agents (IsCLIAgent=true, subprocess/claude|gemini|vibe) tier 1: local models (IsLocal=true, ollama/llamacpp) tier 2: API providers (everything else) Within a tier, quality/cost scoring still applies. filterFeasible still gates on quality thresholds, so a low-quality local arm won't beat a high-quality API arm when the task's minimum threshold rules it out. Also adds Arm.Disabled: arms with Disabled=true are excluded from auto-routing but remain selectable via ForceArm. Implementation: armTier helper + selectBest refactored to try tiers in order, bestScored picks within a tier. router.Select skips disabled arms in allArms collection (forced arm bypasses disable check).	2026-05-07 14:36:36 +02:00
vikingowl	7fbb5454ee	feat(router): normalize effort/thinking abstraction across providers Add EffortLevel (auto/low/medium/high) as a provider-agnostic reasoning control, replacing the Capabilities.Thinking bool. Each provider maps the level to its native parameter: Anthropic budget tokens (1K/8K/16K), OpenAI reasoning_effort (low/medium/high), Google thinking budget (1K/8K/16K). Task classification auto-infers effort from TaskType and complexity; filterFeasible excludes arms that lack the required level.	2026-05-07 14:08:50 +02:00
vikingowl	d71bd942c4	feat: local model reliability — SDK retries, capability probing, init skill, context compaction Three compounding bugs prevented tool calling with llama.cpp: - Stream parser set argsComplete on partial JSON (e.g. "{"), dropping subsequent argument deltas — fix: use json.Valid to detect completeness - Missing tool_choice default — llama.cpp needs explicit "auto" to activate its GBNF grammar constraint; now set when tools are present - Tool names in history used internal format (fs.ls) while definitions used API format (fs_ls) — now re-sanitized in translateMessage Additional changes: - Disable SDK retries for local providers (500s are deterministic) - Dynamic capability probing via /props (llama.cpp) and /api/show (Ollama), replacing hardcoded model prefix list - Engine respects forced arm ToolUse capability when router is active - Bundled /init skill with Go template blocks, context-aware for local vs cloud models, deduplication rules against CLAUDE.md - Tool result compaction for local models — previous round results replaced with size markers to stay within small context windows - Text-only fallback when tool-parse errors occur on local models - "text-only" TUI indicator when model lacks tool support - Session ResetError for retry after stream failures - AllowedTools per-turn filtering in engine buildRequest	2026-04-13 02:01:01 +02:00
vikingowl	0caab0fed1	fix(router): discovery loop removes forced arm, breaking routing The discovery loop's reconcileArms removed the CLI-forced arm (llamacpp/default) because the llama.cpp server reports the real model name (e.g. gemma-26b), creating a mismatch. After 30s the forced arm disappeared and all subsequent requests failed. Three-layer fix: - Eager: query the specific provider at startup to resolve the real model name before registering the forced arm - Lazy: reconcileArms detects placeholder "default" arm names and atomically renames them when discovery reveals the real identity, with an onReconcile callback to update the session and TUI - Guard: the forced arm is never garbage-collected by the removal loop Also fixes misleading /init error messaging — failed inits now show "loaded from disk (init failed)" instead of "AGENTS.md written to".	2026-04-12 17:51:30 +02:00
vikingowl	6bb9c33d04	fix(m8): replace_default map, error UX, benchmarks, and launch prep - Fix replace_default positional bug: []string → map[string]string for explicit MCP tool → built-in name mapping - Improve error messages for missing API keys (3 actionable options) and unknown providers (early validation with available list) - Remove python3 dependency from MCP tests (pure bash grep/sed parsing) - Add router benchmark scaffold (6 benchmarks in bench_test.go + docs) - Add .goreleaser.yml for cross-platform binary releases with ldflags - Add launch-ready README with quickstart, extensibility docs, GIF placeholder - Add CONTRIBUTING.md and Gitea issue templates (bug report, feature request)	2026-04-12 03:34:58 +02:00
vikingowl	8d86bc75fd	test: M7 audit — quality feedback, coordinator, agent tool coverage Quality feedback integration: TestQualityTracker_InfluencesArmSelection verifies that 5 successes vs 5 failures tips Router.Select() to the high-quality arm once EMA has enough observations. Companion test confirms heuristic fallback below minObservations. Coordinator tests expanded from 2 → 5: added guidance content check (parallel/serial/synthesize present), false-positive table extended with 7 cases including the reordered keywords from the previous fix. Agent tool suite: tool interface contracts for all four tools (Name, Description, Parameters validity, IsReadOnly). Extracted duplicated 2000-char truncation into truncateOutput() helper (format.go), removing the inline copies in agent.go and batch.go. Four boundary tests cover empty, short, exact-max, and over-max cases.	2026-04-06 00:59:12 +02:00
vikingowl	07a976c32a	fix: ClassifyTask priority ordering — orchestration below operational types Operational task types (debug, review, refactor, test, explain) now gate before orchestration in the keyword cascade. Previously, prompts like "review the orchestration layer" or "refactor the pipeline dispatch" matched "orchestrat"/"dispatch" and misclassified as TaskOrchestration. Planning is also moved below the operational types. Expanded orchestration keywords to cover common intent that the original four keywords missed: "fan out", "subtask", "delegate to", "spawn elf". Adds regression tests for false-positive cases and positive tests for new keywords.	2026-04-06 00:58:54 +02:00
vikingowl	39181168b6	feat: QualityTracker.Snapshot/Restore + Router.QualityTracker() for cross-session persistence	2026-04-05 23:40:19 +02:00
vikingowl	64ee385039	feat: QualityTracker — EMA router feedback from elf outcomes, ResultFilePaths tracking	2026-04-05 22:08:08 +02:00
vikingowl	4f1e0cf567	feat: Ollama/gemma4 compat — /init flow, stream filter, safety fixes provider/openai: - Fix doubled tool call args (argsComplete flag): Ollama sends complete args in the first streaming chunk then repeats them as delta, causing doubled JSON and 400 errors in elfs - Handle fs: prefix (gemma4 uses fs:grep instead of fs.grep) - Add Reasoning field support for Ollama thinking output cmd/gnoma: - Early TTY detection so logger is created with correct destination before any component gets a reference to it (fixes slog WARN bleed into TUI textarea) permission: - Exempt spawn_elfs and agent tools from safety scanner: elf prompt text may legitimately mention .env/.ssh/credentials patterns and should not be blocked tui/app: - /init retry chain: no-tool-calls → spawn_elfs nudge → write nudge (ask for plain text output) → TUI fallback write from streamBuf - looksLikeAgentsMD + extractMarkdownDoc: validate and clean fallback content before writing (reject refusals, strip narrative preambles) - Collapse thinking output to 3 lines; ctrl+o to expand (live stream and committed messages) - Stream-level filter for model pseudo-tool-call blocks: suppresses <<tool_code>>...</tool_code>> and <<function_call>>...<tool_call|> from entering streamBuf across chunk boundaries - sanitizeAssistantText regex covers both block formats - Reset streamFilterClose at every turn start	2026-04-05 19:24:51 +02:00
vikingowl	11363f3b97	feat: M1-M7 gap audit phase 2 — security, TUI, context, router feedback Gap 6 (M3): 7 new bash security checks (8-14) - JQ injection, obfuscated flags (Unicode lookalike hyphens), /proc/environ access, brace expansion, Unicode whitespace, zsh dangerous constructs, comment-quote desync - Total: 14 checks (was 7) Gap 7 (M5): Model picker numbered selection - /model shows numbered sorted list, /model 3 picks by number Gap 8 (M5): /config set command - /config set provider.default mistral writes to .gnoma/config.toml - Whitelisted keys: provider.default, provider.model, permission.mode - New config/write.go with TOML round-trip via BurntSushi/toml Gap 9 (M6): Simple token estimator - EstimateTokens (len/4 heuristic), EstimateMessages (content + overhead) - PreEstimate on Tracker for proactive compaction triggering Gap 10 (M7): Router quality feedback from elfs - Router.Outcome + ReportOutcome (logs for now, M9 bandit uses later) - Manager tracks armID/taskType per elf via elfMeta map - Manager.ReportResult called after elf completion in both agent + batch tools	2026-04-04 11:07:08 +02:00
vikingowl	de1798ff5c	fix: M1-M7 gap audit phase 1 — bug fix + 5 quick wins Bug fix: - window.go: token ratio after compaction used len(w.messages) after reassignment, always producing ratio ~1.0. Fixed by saving original length before assignment. Gap 1 (M3): Scanner patterns 13 → 47 - Added 34 new patterns: Azure, DigitalOcean, HuggingFace, Grafana, GitHub extended (app/oauth/refresh), Shopify, Twilio, SendGrid, NPM, PyPI, Databricks, Pulumi, Postman, Sentry, Anthropic admin, OpenAI extended, Vault, Supabase, Telegram, Discord, JWT, Heroku, Mailgun, Figma Gap 2 (M3): Config security section - SecuritySection with EntropyThreshold + custom PatternConfig - Wire custom patterns from TOML into scanner at startup Gap 3 (M4): Polling discovery loop - StartDiscoveryLoop with 30s ticker, reconciles arms vs discovered - Router.RemoveArm for disappeared local models Gap 4 (M5): Incognito LocalOnly enforcement - Router.SetLocalOnly filters non-local arms in Select() - TUI incognito toggle (Ctrl+X, /incognito) sets local-only routing Gap 5 (M6): Reactive 413 compaction - Window.ForceCompact() bypasses ShouldCompact threshold - Engine handles 413 with emergency compact + retry	2026-04-03 23:11:08 +02:00
vikingowl	e1a47a7620	feat: rate limit pools, elf tree view, permission prompts, dep updates Rate limits: - Add PoolRPS/PoolTPM/PoolTokensMonth/PoolCostMonth pool kinds - Provider defaults for Mistral/Anthropic/OpenAI/Google (tier-aware) - Config override via [rate_limits.<provider>] TOML section - Pools auto-attached to arms on registration Elf tree view (CC-style): - Structured elf.Progress type replaces flat string channel - Tree with ├─/└─ branches, per-elf stats (tool uses, tokens) - Live activity updates: tool calls, "generating… (N chars)" - Completed elfs stay in tree with "Done (duration)" until turn ends - Suppress raw elf output from chat (tree + LLM summary instead) - Remove background elf mode (wait: false) — always wait - Truncate elf results to 2000 chars for parent context - Parallel hint in system prompt and tool description Permission prompts: - Show actual command in prompt: "bash wants to execute: find . -name '*.go'" - Compact hint in separator bar: "⚠ bash: find . | wc -l [y/n]" - PermReqMsg carries tool name + args Other: - Fix /model not updating status bar (session.Local.SetModel) - Add make targets: run, check, install - Update deps: BurntSushi/toml v1.6.0, chroma v2.23.1, x/text v0.35.0, cloud.google.com/go v0.123.0	2026-04-03 20:54:48 +02:00
vikingowl	76916846aa	feat: auto-discover local models from ollama + llama.cpp At startup, polls ollama (/api/tags) and llama.cpp (/v1/models) for available models. Registers each as an arm in the router alongside the CLI-specified provider. Discovered: 7 ollama models + 1 llama.cpp model = 9 total arms. Router can now select from multiple local models based on task type. Discovery is non-blocking — failures logged and skipped.	2026-04-03 17:53:11 +02:00
vikingowl	847735a9f7	feat: add router foundation with task classification and arm selection internal/router/ — core routing layer: - Task classification: 10 types (boilerplate, generation, refactor, review, unit_test, planning, orchestration, security_review, debug, explain) with keyword heuristics and complexity scoring - Arm registry: provider+model pairs with capabilities and cost - Limit pools: shared resource budgets with scarcity multipliers, optimistic reservation, use-it-or-lose-it discounting - Heuristic selector: score = (quality × value) / effective_cost Prefers tools, thinking for planning, penalizes small models on complex tasks - Router: Select() picks best feasible arm, ForceArm() for CLI override Engine now routes through router.Select() when configured. Wired into CLI — arm registered per --provider/--model flags. 20 router tests. 173 tests total across 13 packages.	2026-04-03 14:23:15 +02:00
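
The tier-based selection described in commit 6883c2a041 (CLI agents first, then local models, then API providers, with disabled arms skipped and a quality threshold still gating each arm) can be illustrated with a minimal Go sketch. This is not the project's code. The `Arm` struct here is a hypothetical, trimmed-down stand-in, the `Quality` and `Cost` fields and the `minQuality` parameter are assumptions, and the score is simplified to quality divided by cost rather than the fuller (quality × value) / effective_cost formula from commit 847735a9f7. Only the names `armTier` and `selectBest` and the flags `IsCLIAgent`, `IsLocal`, and `Disabled` are taken from the commit messages.

```go
package main

import "fmt"

// Arm is a hypothetical, simplified stand-in for the router's arm type.
type Arm struct {
	ID         string
	IsCLIAgent bool
	IsLocal    bool
	Disabled   bool
	Quality    float64 // assumed quality estimate in (0, 1]
	Cost       float64 // assumed effective cost, must be > 0
}

// armTier maps an arm to its preference tier:
// 0 = CLI agent, 1 = local model, 2 = API provider.
func armTier(a Arm) int {
	switch {
	case a.IsCLIAgent:
		return 0
	case a.IsLocal:
		return 1
	default:
		return 2
	}
}

// selectBest walks the tiers in order and returns the best-scored feasible
// arm from the first tier that has one. Disabled arms and arms below the
// quality threshold are never considered.
func selectBest(arms []Arm, minQuality float64) (Arm, bool) {
	for tier := 0; tier <= 2; tier++ {
		var best Arm
		found := false
		for _, a := range arms {
			if a.Disabled || a.Quality < minQuality || armTier(a) != tier {
				continue
			}
			// Simplified score: quality per unit of cost.
			if !found || a.Quality/a.Cost > best.Quality/best.Cost {
				best, found = a, true
			}
		}
		if found {
			return best, true
		}
	}
	return Arm{}, false
}

func main() {
	arms := []Arm{
		{ID: "api/large", Quality: 0.9, Cost: 3.0},
		{ID: "ollama/small", IsLocal: true, Quality: 0.4, Cost: 0.1},
		{ID: "subprocess/cli", IsCLIAgent: true, Disabled: true, Quality: 0.8, Cost: 0.5},
	}
	// Low threshold: the local arm is feasible, so its tier wins.
	if a, ok := selectBest(arms, 0.3); ok {
		fmt.Println(a.ID) // ollama/small
	}
	// Higher threshold: the local arm is filtered out, so the API arm is chosen.
	if a, ok := selectBest(arms, 0.6); ok {
		fmt.Println(a.ID) // api/large
	}
}
```

In this sketch the disabled CLI arm is never picked even though it sits in the most preferred tier, and raising the threshold moves the choice from the local arm to the API arm, which mirrors the commit's note that a low-quality local arm does not beat a high-quality API arm once the task's minimum quality rules it out.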

39 Commits