39 Commits

Author SHA1 Message Date
vikingowl fd327107df fix(router/discovery): always probe ollama capabilities, cache is optional
DiscoverOllama() interpreted a nil probeCache as 'skip probing
entirely' rather than 'probe but don't cache.' cmd/gnoma/main.go's
synchronous discovery path passes nil, so every ollama-discovered
model got SupportsTools=false (the Go zero value), regardless of
what ollama actually reported in its capabilities field.

The symptom: filterFeasible rejected every ollama arm for any
tool-requiring task with reason=tools_required_but_unsupported,
even when ollama itself reported the model as tool-capable. Verified
via curl: qwen3:14b advertises capabilities=[completion, tools,
thinking] and has 'tools' in its template, but the gnoma arm shipped
with tool_use_capability=false.

Fix: always run probeOllamaModel; treat probeCache as an optional
memoisation aid only. nil cache now means 'no caching across calls'
not 'no probing.' For users with many models, passing a real cache
still avoids redundant HTTP calls — semantics for that path are
unchanged.
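
A minimal sketch of the "probe always, cache optionally" rule described
above. ProbeResult, probeModel and the injected probe function are
hypothetical stand-ins, not the actual DiscoverOllama code.

```go
package main

import "fmt"

// ProbeResult is a hypothetical stand-in for what a capability probe returns.
type ProbeResult struct {
	SupportsTools bool
}

// probeModel always probes on a cache miss. The cache only memoises results:
// a nil cache means "do not cache across calls", never "do not probe".
func probeModel(cache map[string]ProbeResult, model string, probe func(string) ProbeResult) ProbeResult {
	if cache != nil {
		if hit, ok := cache[model]; ok {
			return hit
		}
	}
	res := probe(model)
	if cache != nil {
		cache[model] = res
	}
	return res
}

func main() {
	calls := 0
	probe := func(string) ProbeResult {
		calls++
		return ProbeResult{SupportsTools: true}
	}

	// nil cache: the probe still runs and the capability is reported.
	fmt.Println(probeModel(nil, "qwen3:14b", probe).SupportsTools)

	// Real cache: the second lookup is served from the cache.
	cache := map[string]ProbeResult{}
	probeModel(cache, "qwen3:14b", probe)
	probeModel(cache, "qwen3:14b", probe)
	fmt.Println(calls) // one nil-cache probe plus one cached probe
}
```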

Surfaced via the new filterFeasible Debug logging from the previous
commit, which made the per-arm rejection reasons visible.
2026-05-25 02:28:05 +02:00
vikingowl 0d3d190a8b fix(slm,session,router): classifier-only SLMs + session error recovery + feasibility diagnostics
Three coupled fixes that surfaced from a single FunctionGemma test
session where the SLM-as-execution-arm assumption broke down and
every subsequent prompt failed with 'session not idle (state: error)'.

(A) [slm].register_as_arm config. The SLM has always been
unconditionally registered as both classifier AND tier-0 execution
arm. Fine for general-purpose models (ministral, qwen3-chat); breaks
for task-specialised models (FunctionGemma emits function-call
syntax instead of prose; embedding models can't generate). New
pointer-bool config: nil/absent preserves the historical default
(true), explicit false makes the SLM classifier-only and the
execution path skips the slm/* arm. Three table tests cover absent
/ explicit-false / explicit-true decode paths.
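
A sketch of the pointer-bool pattern from (A), assuming a hypothetical
SLMConfig struct and ShouldRegisterAsArm helper. The real field and
method names may differ; only the nil/false/true semantics come from
the description above.

```go
package main

import "fmt"

// SLMConfig is a hypothetical config struct. RegisterAsArm is a pointer so
// that "key absent" (nil) can be told apart from an explicit false.
type SLMConfig struct {
	RegisterAsArm *bool
}

// ShouldRegisterAsArm resolves the tri-state: nil keeps the historical
// default (true), otherwise the explicit value wins.
func (c SLMConfig) ShouldRegisterAsArm() bool {
	if c.RegisterAsArm == nil {
		return true
	}
	return *c.RegisterAsArm
}

func main() {
	off, on := false, true
	fmt.Println(SLMConfig{}.ShouldRegisterAsArm())                    // absent
	fmt.Println(SLMConfig{RegisterAsArm: &off}.ShouldRegisterAsArm()) // classifier-only
	fmt.Println(SLMConfig{RegisterAsArm: &on}.ShouldRegisterAsArm())  // explicit true
}
```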

(B) Session error recovery. After any routing or engine error, the
session moved to StateError and stayed there until restart — every
new user prompt got rejected with 'session not idle (state: error)'.
ResetError() was already wired for the /init retry path, but the
general user-input and slash-command paths didn't call it. Added
ResetError() before every user-initiated Send in the TUI so a fresh
prompt always represents intent-to-retry. The /init internal retry
already had its own ResetError; left alone.

(C) filterFeasible per-arm rejection logging. Today's 'no feasible
arm for task X' error tells you THAT every arm was rejected but
nothing about WHY. Added slog.Debug per rejection (arm, task,
complexity, reason, the specific violated constraint) plus a
summary line when zero arms are feasible at any quality. Visible
with --verbose; quiet otherwise. Surface area expansion only — no
behaviour change for users not chasing a bug.
2026-05-25 01:57:16 +02:00
vikingowl eea26a262e feat(router): surface bandit knobs as [router.bandit] config
Four hardcoded constants in the selector and feedback tracker are now
user-tunable via [router.bandit]:

- quality_alpha    (EMA smoothing, default 0.3)
- min_observations (samples before observed overrides heuristic, default 3)
- observed_weight  (observed/heuristic blend ratio, default 0.7)
- strength_bonus   (quality bonus for Strengths-tagged arms, default 0.15)

Each field treats 0 as 'use default', so an empty TOML block is
byte-identical to pre-config behaviour. BanditParams is plumbed via
router.Config{Bandit: ...} and resolveBanditParams() centralises the
fallback so every call site shares the same defaults.
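
The "0 means use default" fallback can be sketched as follows. The Go
field names are assumptions derived from the TOML keys; the default
values are the ones listed above.

```go
package main

import "fmt"

// BanditParams mirrors the four [router.bandit] keys. Field names are
// assumed from the TOML key names.
type BanditParams struct {
	QualityAlpha    float64
	MinObservations int
	ObservedWeight  float64
	StrengthBonus   float64
}

// resolveBanditParams replaces every zero field with its documented default,
// so an empty config block behaves like the old hardcoded constants.
func resolveBanditParams(p BanditParams) BanditParams {
	if p.QualityAlpha == 0 {
		p.QualityAlpha = 0.3
	}
	if p.MinObservations == 0 {
		p.MinObservations = 3
	}
	if p.ObservedWeight == 0 {
		p.ObservedWeight = 0.7
	}
	if p.StrengthBonus == 0 {
		p.StrengthBonus = 0.15
	}
	return p
}

func main() {
	fmt.Printf("%+v\n", resolveBanditParams(BanditParams{}))
	fmt.Printf("%+v\n", resolveBanditParams(BanditParams{QualityAlpha: 0.5}))
}
```

One consequence of this design is that a user cannot configure a
literal 0 for any of the four knobs.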

QualityTracker, scoreArm, bestScored, and selectBest signatures now
take the configured values directly rather than reaching for package-
level constants. Tests updated to pass BanditParams{} (defaults) or
explicit overrides where they validate the new tuning paths.

Tracks item #3 from the 'Bandit selector — design decisions deferred'
TODO entry — ships independently of the EMA vs SLM strategic decision.
2026-05-24 22:42:34 +02:00
vikingowl a23eb6b92c style: gofmt drift from prior commits
Pure whitespace cleanup surfaced when 'make check' ran gofmt over the
tree. Mostly struct-field column alignment in internal/safety/banner.go
(SessionInfo) and the var(...) flag block in cmd/gnoma/main.go after
--dangerously-allow-anywhere was added without realignment. Verified
zero substantive changes via 'git diff --ignore-all-space
--ignore-blank-lines'.
2026-05-24 16:33:17 +02:00
vikingowl f9094f68f3 feat(router): [router].prefer = local | cloud | auto
Implements P-1 through P-6 of the prefer-routing-policy plan.

Adds a config knob that biases routing toward local arms, cloud
arms, or leaves selection unchanged. Default "auto" is
byte-identical to pre-change behavior (the new armTier path with
PreferAuto returns the same value as the old single-arg function).

Mechanism diverged from the plan after empirical testing:

The plan called for a score multiplier applied in bestScored.
Tests revealed the existing cost-floor math (scoreArm divides by
weighted cost which collapses to ~0.001 for free local arms) gives
local arms a ~280x raw-score advantage that a 0.3-0.5 multiplier
can't overcome. A tier-shift in armTier turned out cleaner:

  PreferLocal: cloud arms (true API, IsLocal=false && !IsCLIAgent)
               get +2 tier shift, landing behind locals.
  PreferCloud: IsLocal arms get +2 tier shift, landing behind
               cloud. SLM tier-0 arms shift to tier 2 — still
               below cloud's tier 3 — so the SLM-protection
               semantic (small stuff stays on the small model)
               survives PreferCloud. This matches the open
               question in the plan, now resolved as: yes, SLMs
               keep winning under PreferCloud by design.
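
A rough sketch of the tier shift, under stated assumptions: base tiers
are taken as SLM 0, CLI agent 1, local 2, cloud API 3 (the real armTier
also depends on the task), and the Arm fields and IsSLM flag are
hypothetical. Only the two +2 conditions come from the text above.

```go
package main

import "fmt"

type PreferPolicy int

const (
	PreferAuto PreferPolicy = iota
	PreferLocal
	PreferCloud
)

// Arm is a hypothetical, reduced arm description.
type Arm struct {
	IsLocal    bool
	IsCLIAgent bool
	IsSLM      bool // assumed flag; marks the tier-0 small model
}

// baseTier is an assumed simplification of the pre-policy ordering.
func baseTier(a Arm) int {
	switch {
	case a.IsSLM:
		return 0
	case a.IsCLIAgent:
		return 1
	case a.IsLocal:
		return 2
	default:
		return 3
	}
}

// armTier applies the +2 shift described in the commit message.
func armTier(a Arm, p PreferPolicy) int {
	t := baseTier(a)
	switch p {
	case PreferLocal:
		if !a.IsLocal && !a.IsCLIAgent { // true cloud API arm
			t += 2
		}
	case PreferCloud:
		if a.IsLocal {
			t += 2
		}
	}
	return t
}

func main() {
	slm := Arm{IsLocal: true, IsSLM: true}
	local := Arm{IsLocal: true}
	cloud := Arm{}
	fmt.Println(armTier(slm, PreferCloud), armTier(cloud, PreferCloud), armTier(local, PreferCloud))
	fmt.Println(armTier(local, PreferLocal), armTier(cloud, PreferLocal))
}
```

With these assumed base tiers, PreferCloud yields SLM 2, cloud 3,
local 4, which is the ordering the message describes.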

The policyMultiplier was kept in bestScored as a within-tier
nudge (mostly cosmetic in practice given the cost-floor dynamics
described above; could matter when costs are calibrated). Worth
revisiting once router-wide cost calibration lands.

Strengths cross-tier promotion is unaffected: the promoted-set
path in selectBest bypasses armTier entirely, so a strongly-tagged
cloud arm still wins SecurityReview tasks under PreferLocal
(validated by TestPreferPolicy_StrengthsBeatsMultiplier).

CLI-agent subprocess arms count as "local" for PreferLocal
purposes — they proxy to cloud but the user-visible behavior is
local. Users who want to exclude them can use --provider X.

Forced arms (--provider X) and incognito take priority over the
policy: a forced-arm test pins this, and an incognito-still-wins test
pins the LocalOnly hard filter dominating PreferCloud.

Test coverage (prefer_test.go): ParsePreferPolicy / String round
trips; policyMultiplier table; acceptance scenarios across all
three policies with adjacent-tier arms; SLM-still-wins under
PreferCloud; Strengths beats multiplier; forced-arm bypass;
incognito beats prefer; lone cloud arm wins when no local feasible.

Refs: docs/superpowers/plans/2026-05-23-prefer-routing-policy.md
2026-05-23 22:13:26 +02:00
vikingowl 2f8d4c412f feat(router): cloud-arm defaults, gpt-5.3-codex registration
Closes R-4 and R-5 of the routing-defaults plan.

R-4: Strengths + CostWeight defaults for closed frontier models.
Cloud entries land in the same knownFamilyDefaults table as local
ones, with MaxComplexity intentionally left zero (cloud arms get
no complexity ceiling). CostWeight tuned per the plan's rationale:

  claude-opus-4-7    → Planning/SecurityReview/Debug/Refactor, 0.3
  claude-sonnet-4-6  → Generation/Refactor/Review,             0.7
  gpt-5.5            → Planning/SecurityReview/Generation,     0.3
  gpt-5.3-codex      → Generation/Refactor/Debug/UnitTest,     0.6
  gpt-5.2            → Orchestration/Review,                   0.8
  gemini-3.1-pro     → Planning/Review/Orchestration,          0.5
  gemini-3.5-flash   → Boilerplate/Explain/Orchestration,      1.2

The 0.3 weight on frontier arms keeps them competitive on
SecurityReview / Planning despite $4+/Mtok; 1.2 on Gemini Flash
penalizes cost more so it only wins when cost is genuinely
decisive (boilerplate, explain).

Mechanism: extracted applyFamilyDefaults into defaults.go and call
it from Router.RegisterArm. Single source of truth — both local
discovery and the primary-provider path in cmd/gnoma/main.go now
flow through the same defaults application. Removed the duplicate
apply block from RegisterDiscoveredModels.

Legacy model IDs (claude-opus-4-20250514, gpt-4o, o3, gemini-2.5-pro,
etc.) intentionally do not match any table entry — keeps users on
pinned older models safe from imposed 2026 Strengths.

R-5: gpt-5.3-codex registration.

  - internal/provider/openai/provider.go: added to fallbackModels
    and inferOpenAIModelCapabilities (400K context, 32K output).
  - internal/provider/ratelimits.go: gpt-5.3-codex and its dated
    alias gpt-5.3-codex-2026-02-15 added with the same Tier 1
    quotas as gpt-5.2.

Gemini 3.x (3.1-pro-preview, 3.5-flash, 3.1-flash-lite) was already
registered in both google/provider.go and ratelimits.go — no change
needed for that part of R-5.

Test coverage:
- ResolveFamilyDefaults table-driven across all 7 cloud entries
  including prefix-sharing (gpt-5.5-pro → gpt-5.5 defaults,
  gemini-3.1-pro-preview → gemini-3.1-pro defaults).
- Legacy IDs return !ok.
- RegisterArm applies cloud defaults end-to-end.
- User-supplied Strengths and CostWeight are not overridden.
- ID.Model() fallback works when ModelName is empty (test code
  often constructs arms this way).

Refs: docs/superpowers/plans/2026-05-23-routing-defaults-refresh.md
2026-05-23 21:39:48 +02:00
vikingowl 9bb775a4aa feat(router): full local family defaults table with size-keyed ceilings
Expands the family-defaults scaffold to 23 entries covering the local
models that currently appear in real Ollama fleets: coder specialists
(qwen3-coder, devstral, qwen2.5-coder, yi-coder, deepseek-coder,
starcoder), reasoners (phi-4, phi-4-mini), Gemma 2/3/4 (including the
"edge" e2b/e4b variants under both Ollama and GGUF naming), Qwen
2.5/3/3.5 with a catch-all qwen entry, Mistral/Ministral (incl. the
24B mistral-small-3), Llama 3.2/4, tiny3.5 (reec's distill family),
Granite, GLM (incl. glm-ocr specialist), and MiniCPM-V.

Five families that span wide parameter ranges (qwen3.5, qwen3,
qwen2.5, ministral-3, tiny3.5) now use SizeCap ladders instead of a
flat MaxComplexity. A new parseSizeFromModelID helper splits the
model ID on :/-_/ and matches pure <N>b/<N>m tokens, correctly
ignoring qwen3.5 version strings, e2b edge tags, a3b MoE active
params, and v0.3 version suffixes.
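
One way such a parser could look, as a sketch only: the delimiter set
(: / - _), the regular expression, and the choice to return the size in
billions are assumptions, and parseSizeB is a hypothetical name rather
than the real parseSizeFromModelID.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// sizeToken matches a token that is purely a number followed by b or m,
// e.g. "30b", "1.5b", "270m". Tokens such as "qwen3.5", "e2b", "a3b" or
// "v0.3" start with a letter and therefore do not match.
var sizeToken = regexp.MustCompile(`^(\d+(?:\.\d+)?)([bm])$`)

// parseSizeB returns the parameter count in billions, if one is present.
func parseSizeB(modelID string) (float64, bool) {
	tokens := strings.FieldsFunc(strings.ToLower(modelID), func(r rune) bool {
		return r == ':' || r == '/' || r == '-' || r == '_'
	})
	for _, tok := range tokens {
		m := sizeToken.FindStringSubmatch(tok)
		if m == nil {
			continue
		}
		n, err := strconv.ParseFloat(m[1], 64)
		if err != nil {
			continue
		}
		if m[2] == "m" {
			n /= 1000 // millions to billions
		}
		return n, true
	}
	return 0, false
}

func main() {
	for _, id := range []string{"qwen3-coder:30b-a3b", "tiny3.5:1.5b", "gemma3n:e2b", "mistral:v0.3"} {
		size, ok := parseSizeB(id)
		fmt.Println(id, size, ok)
	}
}
```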

ResolveMaxComplexity wraps ResolveFamilyDefaults plus the SizeCap
traversal, falling back to the smallest cap when size parsing fails
(conservative). Discovery's apply path now goes through it so
SizeCap entries actually take effect.
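
The ladder traversal can be sketched like this. The SizeCap field names
and the threshold values are hypothetical; the largest-first ordering
and the conservative fallback to the smallest cap follow the
description above.

```go
package main

import "fmt"

// SizeCap is a hypothetical ladder rung: models with at least MinSizeB
// billion parameters get MaxComplexity as their ceiling.
type SizeCap struct {
	MinSizeB      float64
	MaxComplexity float64
}

// capFor walks a ladder ordered largest-first and returns the first rung the
// model qualifies for. When the size is unknown, or smaller than every rung,
// it falls back to the last (smallest) cap.
func capFor(ladder []SizeCap, sizeB float64, sizeKnown bool) float64 {
	if len(ladder) == 0 {
		return 0
	}
	if sizeKnown {
		for _, rung := range ladder {
			if sizeB >= rung.MinSizeB {
				return rung.MaxComplexity
			}
		}
	}
	return ladder[len(ladder)-1].MaxComplexity
}

func main() {
	// Illustrative values only; not the real table.
	ladder := []SizeCap{{30, 0.8}, {7, 0.6}, {0, 0.3}}
	fmt.Println(capFor(ladder, 32, true))
	fmt.Println(capFor(ladder, 14, true))
	fmt.Println(capFor(ladder, 0, false))
}
```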

Test coverage:
- parseSizeFromModelID (11 cases)
- ResolveFamilyDefaults longest-prefix discipline (19 cases)
- Unknown-family fallback returns !ok
- ResolveMaxComplexity size-keyed ladder (13 cases)
- Size-parse-failure fallback
- knownFamilyDefaults invariants: SizeCaps ordered largest-first,
  SizeCaps and MaxComplexity mutually exclusive per entry
- Routing-payoff integration: 3 arms (tiny3.5:1.5b, phi-4:14b,
  qwen3-coder:30b) get picked for TaskGeneration / TaskPlanning /
  TaskBoilerplate respectively, without any [[arms]] config
- Local fleet visibility: the maintainer's actual `ollama ls`
  inventory registers correctly with expected MaxComplexity and
  Strengths; embeddinggemma stays filtered out

The Planning sub-case surfaced a separate issue worth flagging:
heuristicQuality floors out at 0.55 for a generic 14B local model
without ThinkingModes, below TaskPlanning's 0.60 threshold. The test
mutates phi-4's capabilities post-registration to reflect reality
(phi-4 is reasoning-tuned). A discovery-side thinking-capability
detection is out of scope for this plan but flagged in the test
comment for follow-up.

Refs: docs/superpowers/plans/2026-05-23-routing-defaults-refresh.md
2026-05-23 21:34:09 +02:00
vikingowl a79e99199d feat(router): non-chat exclude, vision prefixes, family-defaults scaffold
Discovery previously registered every model returned by Ollama as a
chat arm, including embeddings, ASR, TTS, audio realtime, and
rerankers — which then failed at inference time when the router
selected them. Local arms also shipped with all-zero defaults, so
selection between e.g. tiny3.5:1.5b, phi-4:14b, and qwen3-coder:30b
was effectively random.

This change covers tasks R-1, R-2, R-6 from the routing-defaults plan.

- nonChatModelPatterns + isNonChatModel substring matcher; matched
  IDs are skipped during RegisterDiscoveredModels. Covers whisper,
  moonshine, kokoros, vibevoice, -asr, -tts, -audio, -embedding,
  embeddinggemma, -reranker, lfm2.
- knownVisionModelPrefixes gains gemma4, gemma-4, glm-ocr. gemma3
  and minicpm-v entries stay for regression coverage.
- New internal/router/defaults.go with FamilyDefaults struct,
  knownFamilyDefaults map, and ResolveFamilyDefaults longest-prefix
  lookup (with org/-namespace stripping so reecdev/tiny3.5:1.5b
  resolves to "tiny3.5"). Single entry for now: functiongemma is
  registered with Disabled=true and MaxComplexity=0.40, reserved for
  the future ArmRoleToolRouter path. Table will grow in R-3.
- RegisterDiscoveredModels consults ResolveFamilyDefaults and only
  populates fields that are still zero on the arm, so user [[arms]]
  overrides keep priority.
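
A sketch of the longest-prefix lookup and the zero-field fill described
in the last two bullets. The functiongemma values come from this
commit; every other table entry, the reduced struct shapes, and the
lower-cased function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

type FamilyDefaults struct {
	MaxComplexity float64
	Disabled      bool
}

var knownFamilyDefaults = map[string]FamilyDefaults{
	"functiongemma": {MaxComplexity: 0.40, Disabled: true}, // from the commit
	"qwen3":         {MaxComplexity: 0.5},                  // hypothetical
	"qwen3.5":       {MaxComplexity: 0.6},                  // hypothetical
	"tiny3.5":       {MaxComplexity: 0.3},                  // hypothetical
}

// resolveFamilyDefaults strips any org/ namespace, then picks the longest
// table key that prefixes the remaining model ID.
func resolveFamilyDefaults(modelID string) (string, FamilyDefaults, bool) {
	id := strings.ToLower(modelID)
	if i := strings.LastIndex(id, "/"); i >= 0 {
		id = id[i+1:]
	}
	best := ""
	for prefix := range knownFamilyDefaults {
		if strings.HasPrefix(id, prefix) && len(prefix) > len(best) {
			best = prefix
		}
	}
	if best == "" {
		return "", FamilyDefaults{}, false
	}
	return best, knownFamilyDefaults[best], true
}

// Arm is a hypothetical, reduced arm.
type Arm struct {
	MaxComplexity float64
}

// applyDefaults only fills fields that are still zero, so values supplied
// by the user keep priority.
func applyDefaults(a *Arm, d FamilyDefaults) {
	if a.MaxComplexity == 0 {
		a.MaxComplexity = d.MaxComplexity
	}
}

func main() {
	key, d, ok := resolveFamilyDefaults("reecdev/tiny3.5:1.5b")
	fmt.Println(key, d.MaxComplexity, ok)

	user := Arm{MaxComplexity: 0.9}
	applyDefaults(&user, d)
	fmt.Println(user.MaxComplexity) // user value is kept
}
```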

Plans:
- docs/superpowers/plans/2026-05-23-routing-defaults-refresh.md
- docs/superpowers/plans/2026-05-23-tool-router-specialization.md

TODO.md surfaces both as in-flight items.
2026-05-23 21:24:59 +02:00
vikingowl a2b7f8eb3f feat(router): vision capability gating and Ollama vision detection
Task gains a RequiresVision bool; filterFeasible enforces it on
both the primary feasibility pass and the last-resort fallback
(no degradation to a non-vision arm — the model literally cannot
consume image bytes).

Ollama discovery now probes /api/show for vision capability:
- details.families containing "clip" / "mllama" / "*vl"
- capabilities array containing "vision" (newer Ollama)
- name-prefix fallback for releases that predate either
  (llava, qwen2.5-vl, llama3.2-vision, moondream, pixtral, etc.)
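
The three detection signals could be combined roughly as below. This is
a sketch: supportsVision is a hypothetical helper, "*vl" is read as "a
family name ending in vl", and the prefix list contains only the names
mentioned above.

```go
package main

import (
	"fmt"
	"strings"
)

// visionPrefixes lists only the name prefixes named in the commit message.
var visionPrefixes = []string{"llava", "qwen2.5-vl", "llama3.2-vision", "moondream", "pixtral"}

// supportsVision checks, in order: details.families, the capabilities
// array, and finally the model-name prefix fallback.
func supportsVision(name string, families, capabilities []string) bool {
	for _, f := range families {
		f = strings.ToLower(f)
		if f == "clip" || f == "mllama" || strings.HasSuffix(f, "vl") {
			return true
		}
	}
	for _, c := range capabilities {
		if strings.EqualFold(c, "vision") {
			return true
		}
	}
	lower := strings.ToLower(name)
	for _, p := range visionPrefixes {
		if strings.HasPrefix(lower, p) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(supportsVision("custom:latest", []string{"llama", "clip"}, nil))
	fmt.Println(supportsVision("newmodel:8b", nil, []string{"completion", "vision"}))
	fmt.Println(supportsVision("llava:7b", nil, nil))
	fmt.Println(supportsVision("qwen3:14b", []string{"qwen3"}, []string{"completion", "tools"}))
}
```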

OllamaProbeResult replaces the map[string]bool tool cache so the
single /api/show call can populate tools + vision + ctx-size in
one probe. DiscoverOllama / DiscoverLocalModels signatures updated;
nil-cache callers in cmd/gnoma keep working unchanged.
RegisterDiscoveredModels propagates SupportsVision into the arm's
Capabilities.Vision.

Tests cover RequiresVision filtering in both the happy path
(vision-only arm chosen when image present) and the fallback path
(non-vision arm rejected even as last resort).
2026-05-22 11:50:33 +02:00
vikingowl c4fde583f5 chore(lint): gofmt sweep + errcheck cleanups in router discovery
Apply gofmt -w across the codebase (struct field comment realignment
only — no semantic changes) and silence two errcheck warnings on
fmt.Sscanf / fmt.Fprintf return values in internal/router/discovery
with explicit `_, _ =` discards. Required so `make check` is green
before tagging v0.1.0.
2026-05-20 03:13:05 +02:00
vikingowl fb42202834 refactor(security): seal SecureProvider via unexported marker method
The router.SecureProvider interface previously required a public
IsSecure() bool method. Any test mock — or future production type —
could satisfy it by returning true, defeating the W1 "only wrapped
providers may flow past the boundary" contract through convention
rather than at the type level.

Replaces IsSecure() bool with an unexported security.Marker interface
that has a single secured() method. Go's method-set semantics key
unexported methods by their defining package, so only types declared in
internal/security can satisfy Marker. *SafeProvider gets the lone
secured() implementation; router.SecureProvider embeds Marker.

The seal forces every test mock that previously implemented IsSecure()
to either (a) be wrapped with security.WrapProvider(mp, nil) at the use
site, or (b) drop the method entirely if the mock never flows through
SecureProvider. 93 use sites across 11 test files were updated via a
per-package secureMock helper. WrapProvider with a nil firewall ref is
a no-op pass-through, so test behavior is unchanged.

Empirically: a type from outside internal/security can declare
`secured()` but the compiler will reject assigning it to
router.SecureProvider because the unexported method belongs to the
other package's namespace. Convention → compile-time guarantee.
2026-05-20 02:04:07 +02:00
vikingowl f6f8801040 fix(router): restore llama.cpp model enumeration; keep /props for n_ctx
3c87527 rewrote DiscoverLlamaCPP to hit /props and emit a single hardcoded
"default" entry. That breaks two cases:

  1. Multi-model llama.cpp deployments (llama-swap, model-routing proxies)
     are collapsed to a single arm with a placeholder ID.
  2. Single-model deployments lose the real model name — arms are
     registered as llamacpp/default instead of llamacpp/<actual-id>.

Restores enumeration via /v1/models (the OpenAI-compatible endpoint
llama-server exposes) while keeping the concrete n_ctx read from /props.
/props is now best-effort: failure or missing n_ctx falls back to the
documented default rather than aborting discovery.
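
A sketch of the combined decode step, kept free of network calls. It
assumes the usual OpenAI-style list shape for /v1/models
({"data":[{"id":...}]}) and default_generation_settings.n_ctx in
/props; buildArms, the Arm struct, and the 8192 fallback (taken from
the adjacent context-defaults commit) are illustrative assumptions.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

type modelsResp struct {
	Data []struct {
		ID string `json:"id"`
	} `json:"data"`
}

type propsResp struct {
	DefaultGenerationSettings struct {
		NCtx int `json:"n_ctx"`
	} `json:"default_generation_settings"`
}

// fallbackCtx is an assumed default for llama.cpp arms.
const fallbackCtx = 8192

// Arm is a hypothetical, reduced arm.
type Arm struct {
	ID            string
	ContextWindow int
}

// buildArms enumerates models from the /v1/models body. The /props body is
// best-effort: if it is missing, malformed, or lacks n_ctx, the fallback
// context size is used instead of failing discovery.
func buildArms(modelsBody, propsBody []byte) ([]Arm, error) {
	var mr modelsResp
	if err := json.Unmarshal(modelsBody, &mr); err != nil {
		return nil, fmt.Errorf("decode /v1/models: %w", err)
	}
	if len(mr.Data) == 0 {
		return nil, errors.New("llama.cpp returned no models")
	}
	ctx := fallbackCtx
	var pr propsResp
	if err := json.Unmarshal(propsBody, &pr); err == nil && pr.DefaultGenerationSettings.NCtx > 0 {
		ctx = pr.DefaultGenerationSettings.NCtx
	}
	arms := make([]Arm, 0, len(mr.Data))
	for _, m := range mr.Data {
		arms = append(arms, Arm{ID: "llamacpp/" + m.ID, ContextWindow: ctx})
	}
	return arms, nil
}

func main() {
	models := []byte(`{"data":[{"id":"model-a"},{"id":"model-b"}]}`)
	props := []byte(`{"default_generation_settings":{"n_ctx":4096}}`)
	fmt.Println(buildArms(models, props))
	fmt.Println(buildArms(models, nil)) // /props unreachable
}
```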

Adds three tests: multi-model enumeration with shared context, /props
unreachable, and the empty-/v1/models error path.
2026-05-20 01:45:54 +02:00
vikingowl 8539426a46 fix(router): restore Ollama cache prune + provider-specific context defaults
3c87527 refactored DiscoverOllama and DiscoverLlamaCPP and dropped two
behaviors:

  1. The Ollama toolCache prune loop. Without it, the cache grows
     unbounded across reconcile cycles and stale entries linger; a
     model that disappears and reappears replays an out-of-date
     tool-support verdict because the cache hit skips re-probing.

  2. Sensible context-size defaults. Both probes can yield
     ContextSize=0 (Ollama: no num_ctx in /api/show parameters;
     llama.cpp: /props default_generation_settings without n_ctx).
     Registering an arm with ContextWindow=0 misroutes — the post-SLM
     two-stage path treats it as a tiny model.
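
The prune behaviour from item 1 can be sketched as below. ProbeResult
and pruneCache are hypothetical names; the idea is only that entries
for models no longer reported by the daemon are dropped each cycle.

```go
package main

import "fmt"

// ProbeResult is a hypothetical cached probe verdict.
type ProbeResult struct {
	SupportsTools bool
}

// pruneCache removes every cache entry whose model is not in the current
// listing, so a model that disappears and later reappears is re-probed.
func pruneCache(cache map[string]ProbeResult, live []string) {
	keep := make(map[string]struct{}, len(live))
	for _, name := range live {
		keep[name] = struct{}{}
	}
	for name := range cache {
		if _, ok := keep[name]; !ok {
			delete(cache, name) // deleting during range is safe in Go
		}
	}
}

func main() {
	cache := map[string]ProbeResult{
		"qwen3:14b": {SupportsTools: true},
		"old:7b":    {SupportsTools: false},
	}
	pruneCache(cache, []string{"qwen3:14b"})
	fmt.Println(len(cache))
}
```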

Restores the prune loop, applies 32768 (ollama) / 8192 (llama.cpp) as
fallbacks at discovery time, and adds three tests covering each path.
2026-05-20 01:42:14 +02:00
vikingowl 3c875276c9 feat(security): implement multi-wave audit remediation and agy provider support
Implemented full security remediation following the Universal Security Pilot protocol:
- W1: Enforced SecureProvider at router and engine boundaries to prevent bypasses.
- W1: Implemented path-sensitive policy for MCP tools.
- W2: Added SHA256 hash verification for SLM downloads (llamafile).
- W3: Enhanced secret redaction for private keys (full body) and high-entropy strings.
- W4: Fixed symlink-based filesystem sandbox escapes in paths and grep.
- W4: Documented CLI agent trust boundaries.

Also added 'agy' (Antigravity) as a subprocess CLI provider with plain-text JSON schema support.
2026-05-20 01:13:13 +02:00
vikingowl 129d4f1ea6 chore: remove TinyLlama and set tiny3.5 (Qwen2.5 0.5B) as default SLM
2026-05-20 00:26:58 +02:00
vikingowl 34f6f1c786 feat(security): incognito coherence across firewall/router/persist (Wave 2)
Closes the cluster of audit findings where gnoma's incognito promise
('no persistence, no learning, local-only routing') silently broke
because state was duplicated across the CLI flag, the firewall's
IncognitoMode, the router's localOnly flag, and the TUI's local
m.incognito field. Wave 2 makes security.IncognitoMode the canonical
source of truth.

W2-1 Router.Select rejects forced non-local arms when localOnly is on
  rather than short-circuiting and silently routing to cloud. Main
  fails fast when --incognito + --provider <cloud> are combined; the
  TUI toggle (Ctrl+X, /incognito, config panel) refuses with an
  actionable message when a non-local arm is pinned. Factored the
  three duplicated toggle sites into Model.attemptIncognitoToggle.

W2-2 persist.Store.Save consults an IncognitoGate (local interface,
  *security.IncognitoMode satisfies it). nil gate = always persist
  (legacy behaviour for tests); non-nil gate is consulted on every
  Save so TUI runtime toggles take effect without reconstructing the
  store. File mode 0o600, dir mode 0o700.

W2-3 tui.New seeds m.incognito from cfg.Firewall.Incognito().Active().
  Fixes the Ctrl+X-on-launch-with-incognito case where the first
  toggle silently turned the firewall OFF because the local flag
  started false, out of sync with the firewall.

W2-4 saveQuality gates on both *incognito (defensive, covers the
  window before fwRef.Set fires) and fw.Incognito().ShouldLearn() (so
  TUI Ctrl+X suppresses the snapshot on exit). Quality restore skipped
  under --incognito. Quality file written 0o600 in dir 0o700.
  engine.reportOutcome and elf.Manager.ReportResult both gate on
  fw.Incognito().ShouldLearn() — bandit signal no longer leaks out of
  incognito sessions.

W2-5 session files written 0o600 in dirs 0o700 (was 0o644 / 0o755).

W2-6 IncognitoMode.LocalOnly dropped — dead field with no readers;
  routing local-only state lives on the router, not the firewall.

Also wires rtr.SetLocalOnly(true) when --incognito at launch — main
previously activated the firewall's flag but never told the router to
filter, so even without the forced-arm bug, launching with
--incognito alone gave you 'incognito badge but full arm pool'.
2026-05-19 22:57:36 +02:00
vikingowl 0aabd19906 feat(router): per-arm strengths + cost weight (Phase D)
Plan D from docs/superpowers/plans/2026-05-19-post-slm-unlock.md
(static portion; dynamic bandit-driven promotion deferred to D-2).

Routing previously let tier ordering (CLI > local > API) dominate
selection — Opus, in tier 3, would lose to a tier-1 CLI agent for
SecurityReview even though Opus is empirically stronger at that task.
This change introduces explicit per-arm overrides:

  [[arms]]
  id = "anthropic/claude-opus-4-7"
  strengths = ["security_review", "planning"]
  cost_weight = 0.3

Strengths gate cross-tier promotion: arms matching task.Type bypass
the tier loop and compete with each other directly. Promotion is a
preference, not a pin — if no strength-tagged arm is feasible
(backoff, pool capacity, tool support), selection falls through to
the default tier order.

CostWeight linearly dampens the cost penalty in scoreArm via
  effectiveCost = 1 + CostWeight * (cost - 1)
CostWeight=1.0 (or unset) preserves current behavior; lower values
trade cheapness for quality. The earlier draft used cost^CostWeight
which inverts direction for sub-1 local-arm costs (raising a
fraction <1 to a fractional power makes it bigger, not smaller); a
monotonicity regression test prevents that drift.
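
The linear formula is small enough to sketch directly. The function
names below are lower-cased stand-ins for ResolvedCostWeight and the
scoreArm internals; the sample costs are arbitrary.

```go
package main

import "fmt"

// resolvedCostWeight treats an unset (zero) weight as 1.0.
func resolvedCostWeight(w float64) float64 {
	if w == 0 {
		return 1.0
	}
	return w
}

// effectiveCost applies effectiveCost = 1 + CostWeight * (cost - 1).
// With weight 1.0 it returns cost unchanged; smaller weights pull the
// result toward 1 while keeping it increasing in cost.
func effectiveCost(cost, weight float64) float64 {
	return 1 + resolvedCostWeight(weight)*(cost-1)
}

func main() {
	for _, cost := range []float64{0.2, 0.5, 2, 4} {
		fmt.Printf("cost=%.2f  w=1.0 -> %.3f  w=0.3 -> %.3f\n",
			cost, effectiveCost(cost, 1.0), effectiveCost(cost, 0.3))
	}
}
```

For any fixed positive weight the result grows with cost on both sides
of cost=1, which is the property the monotonicity test guards.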

- internal/router/arm.go: Strengths []TaskType, CostWeight float64,
  HasStrength(), ResolvedCostWeight() (zero → 1.0).
- internal/router/selector.go: scoreArm strength bonus const
  (strengthScoreBonus = 0.15) + linear cost dampening; selectBest
  cross-tier promotion before tier loop.
- internal/router/router.go: ArmOverride type + ApplyArmOverrides()
  returns unknown IDs; unknown strength names skipped with per-name
  warning via slog.
- internal/router/task.go: ParseTaskTypeStrict() returns ok bool;
  ParseTaskType now delegates so the two switches stay in sync.
- internal/config/config.go: ArmConfig + [[arms]] TOML wiring.
- cmd/gnoma/main.go: applies overrides after all initial arms
  register; logs a warning when an [[arms]] id has no matching
  registered arm.

Tests cover: predicate helpers, scoring direction across two arms,
linear-formula monotonicity on both sides of cost=1, cross-tier
promotion, empty-Strengths preserves tier order, promoted arm in
backoff falls through via full Router.Select path, observed-quality
tiebreak between two strength-tagged arms, ApplyArmOverrides happy
path + unknown-ID reporting + unknown-strength skipping.
2026-05-19 21:14:45 +02:00
vikingowl eb0583f606 fix(router): unpin config-default provider + complexity floor by task type
Two routing bugs were keeping the SLM out of every real prompt and,
once it was eligible, pulling complex tasks into it as well.

Bug 1: ForceArm was called unconditionally when a primary provider was
configured (cmd/gnoma/main.go:378). That short-circuited the entire
router — every prompt went straight to whatever was set as
[provider].default, regardless of tier, score, or feasibility. The SLM
arm appeared in `gnoma router stats` registration logs but had zero
observations after dozens of prompts.

Fix: only pin when the user passed --provider on the command line.
Config defaults register the arm but don't force it; the router picks
freely. Verified end-to-end — trivial prompts now reach slm/ollama
via the tier-0 priority.

Bug 2: A short prompt like "refactor the SLM module" classifies as
TaskRefactor with complexity 0.015 — well under the SLM arm's 0.3
ceiling. The arm became eligible despite the task being inherently
non-trivial. Once eligible, tier-0 priority then pulled it in over
the CLI agents.

Fix: add MinComplexityForType, applied in both ClassifyTask
(heuristic path) and slm.Classifier.Classify (SLM-overlay path). The
floor is per-task-type:

  - TaskSecurityReview, TaskOrchestration  → 0.60
  - TaskRefactor, TaskPlanning, TaskDebug  → 0.40
  - TaskUnitTest, TaskReview               → 0.35

Tasks like Explain/Generation/Boilerplate keep their organic
complexity score so trivial knowledge prompts (≤0.15) still fall to
the SLM. Tasks that imply existing code or multi-step reasoning are
clamped above the SLM's MaxComplexity, naturally routing them to a
bigger arm.
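
The per-type floor can be sketched as a lookup plus a clamp. The floor
values are the ones listed above; the TaskType representation and the
function signatures are assumptions, not the real MinComplexityForType.

```go
package main

import "fmt"

// TaskType is represented as a string here purely for illustration.
type TaskType string

const (
	TaskExplain        TaskType = "explain"
	TaskRefactor       TaskType = "refactor"
	TaskPlanning       TaskType = "planning"
	TaskDebug          TaskType = "debug"
	TaskUnitTest       TaskType = "unit_test"
	TaskReview         TaskType = "review"
	TaskSecurityReview TaskType = "security_review"
	TaskOrchestration  TaskType = "orchestration"
)

// minComplexityForType returns the floor for a task type, or 0 when the
// type keeps its organic score (Explain, Generation, Boilerplate).
func minComplexityForType(t TaskType) float64 {
	switch t {
	case TaskSecurityReview, TaskOrchestration:
		return 0.60
	case TaskRefactor, TaskPlanning, TaskDebug:
		return 0.40
	case TaskUnitTest, TaskReview:
		return 0.35
	default:
		return 0
	}
}

// clampComplexity raises a score to the floor but never lowers it.
func clampComplexity(t TaskType, score float64) float64 {
	if floor := minComplexityForType(t); score < floor {
		return floor
	}
	return score
}

func main() {
	fmt.Println(clampComplexity(TaskRefactor, 0.015)) // lifted to the floor
	fmt.Println(clampComplexity(TaskExplain, 0.015))  // unchanged
	fmt.Println(clampComplexity(TaskDebug, 0.7))      // already above the floor
}
```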

After both fixes, observed routing in a clean run:

  What is 2+2?              → slm/ollama (complexity 0.015)
  Define a closure          → slm/ollama (complexity 0.015)
  What is HTTP?             → slm/ollama (complexity 0.015)
  Refactor the SLM module   → subprocess/gemini (complexity 0.40)
  Audit for race conditions → subprocess/gemini (complexity 0.35)
  Plan a migration          → subprocess/gemini (complexity 0.40)
2026-05-19 19:22:16 +02:00
vikingowl a14fe8b504 feat(slm): pluggable backends + trivial-prompt routing
The SLM had two intended jobs — classify every prompt and execute the
small ones itself — but in practice three independent gates kept it
out of nearly all real work:

  1. llamafile cold-start blocked pipe-mode runs (which always finish
     before the 15 s health check passes)
  2. ClassifyTask defaulted RequiresTools=true, excluding the SLM arm
     (ToolUse=false) from 9/10 task types
  3. armTier hard-coded CLI agents > local > API, so even when the SLM
     arm was feasible a CLI agent won

Each gate is addressed below. The result is an SLM that actually does
its job — small stuff stays local, complex stuff routes up — gated by
arm capability rather than by accidents of the boot order.

Backend layer (the bigger change)

The original implementation hard-coded llamafile. That's fine if you
have nothing else, but most users with a local model setup already run
Ollama or llama.cpp. The new factory at internal/slm/backend.go picks
between:

  - ollama (any local Ollama daemon)
  - llamacpp (any llama.cpp server)
  - llamafile (gnoma-managed, current behaviour)
  - openaicompat (LM Studio, vLLM, remote API)
  - auto (probes in order, picks first reachable)
  - disabled

[slm].backend in config.toml selects which. Documented in
docs/slm-backends.md with copy-paste presets for each. The factory
probes the underlying model's actual capabilities (Ollama /api/show,
llama.cpp /props) and sets the SLM arm's ToolUse accordingly — so the
arm picks up simple file-read style tasks on tool-capable models and
stays knowledge-only on completion-only models.

Trivial-prompt heuristic (Gate 2)

ClassifyTask now flips RequiresTools=false for short, low-complexity
prompts whose task type doesn't imply existing code (Explain,
Generation, Boilerplate). Tool-needing tokens (read, write, run, test,
file, …) keep RequiresTools=true even when the prompt is brief.

Complexity-aware tier ordering (Gate 3)

armTier takes a Task and returns tier 0 for arms whose MaxComplexity
ceiling fits the task. CLI agents drop to tier 1, local to 2, API to 3.
For trivial tasks the SLM arm wins; for complex tasks the SLM falls
out of the feasible set (MaxComplexity exclusion) and the original
ordering reasserts.

Eager boot with user-facing wait (Gate 1)

Removed the original goroutine-only path. SLM startup now blocks
synchronously inside the factory; for llamafile that means up to
[slm].startup_timeout (default 5 s) of waiting on the first
invocation, with "Starting SLM…" → "SLM ready (backend, model, tools,
boot=N)" / "SLM unavailable: …" messages on stderr. Ollama / llamacpp
backends boot instantly because the daemon is already running.

waitHealthy() now respects the caller's context deadline instead of
its old hardcoded 15 s ceiling.

Classifier reliability

Classifier timeout bumped 2 s → 5 s for thinking-mode models like
Qwen3-distilled Tiny3.5. System prompt includes /no_think directive
for the same family. These help but don't eliminate small-model
JSON-contract failures — see the docs section on picking a model.

Probe + telemetry surfaces

gnoma slm status now prints the configured backend + model + a live
probe result (✓/✗) instead of just the llamafile manifest state.

`gnoma router stats` already (from the previous commit) shows the
classifier-source mix; with this change you can finally see slm /
slm_fallback / heuristic share rise from "always heuristic" to
something reflecting real SLM activity.

Tests

  - 9 new backend-factory tests (httptest-backed Ollama probe, error
    paths, auto-detection, capability flags)
  - Tier-ordering tests cover the new "specialised small arm wins
    trivial task" path
  - Trivial-prompt heuristic tested for both halves (knowledge-only
    flips RequiresTools=false; debug/file/run keeps it true)

Deletes the dead SLMManager field from the TUI Config — it was
declared but never read.
2026-05-19 18:53:32 +02:00
vikingowl 58beb7ce3c feat(router): classifier-source telemetry + router stats command
Phase 4 routing decisions depend on knowing whether the SLM classifier
is actually firing or whether the heuristic is silently doing all the
work. Adds the instrumentation to make that observable.

router.ClassifierSource enum (heuristic / slm / slm_fallback) is set
on Task by every classifier:
- HeuristicClassifier → ClassifierHeuristic
- slm.Classifier → ClassifierSLM on success, ClassifierSLMFallback when
  the SLM call fails or returns unparseable output

The source is plumbed through router.Outcome to QualityTracker, which
now maintains per-source counters alongside the existing per-arm × task
EMA scores. QualitySnapshot serializes both (classifier_counts is
omitempty for back-compat with pre-feature quality.json files).
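
How omitempty gives back-compat in both directions, as a sketch. Only
the classifier_counts key comes from the text above; the struct shape,
the scores key, and the encode/decode helpers are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// QualitySnapshot is a reduced, hypothetical version of the persisted
// quality file. ClassifierCounts is omitted when empty, so snapshots
// without counters look exactly like pre-feature files.
type QualitySnapshot struct {
	Scores           map[string]float64 `json:"scores"`
	ClassifierCounts map[string]int     `json:"classifier_counts,omitempty"`
}

func encode(s QualitySnapshot) (string, error) {
	b, err := json.Marshal(s)
	return string(b), err
}

func decode(data string) (QualitySnapshot, error) {
	var s QualitySnapshot
	err := json.Unmarshal([]byte(data), &s)
	return s, err
}

func main() {
	// Writing without counters produces no classifier_counts key.
	out, _ := encode(QualitySnapshot{Scores: map[string]float64{"a": 0.5}})
	fmt.Println(out)

	// Reading a legacy file simply leaves the counters nil.
	legacy, err := decode(`{"scores":{"a":0.5}}`)
	fmt.Println(legacy.ClassifierCounts == nil, err)
}
```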

lazyClassifier logs at INFO the first time it falls back to heuristic
because the SLM hasn't booted yet — distinguishes operational fallback
from an unconfigured-SLM run.

slm.Manager.Start() now records elapsed-to-healthy and the main.go
goroutine logs it as part of the "SLM ready" event. Confirms whether
short-lived runs are racing the boot cycle.

New `gnoma router stats` subcommand prints both tables (arm × task
quality, classifier source breakdown) from quality.json with a Phase 4
trust hint when the data is too sparse or the SLM share is low.

6 new tests cover ClassifierSource string/enum, heuristic + SLM source
propagation, QualityTracker counter round-trip, and back-compat
restore from a legacy quality.json without classifier_counts.
2026-05-19 18:18:22 +02:00
vikingowl ec9433d783 chore(lint): clear remaining errcheck and staticcheck findings
Brings the project to a clean `make lint` baseline (0 issues).

Mechanical:
- Wrap deferred resp.Body.Close() in closures (router/discovery.go,
  router/probe.go) so the unchecked return surfaces as `_ = ...`.
- Apply `_ = ...` (single or multi-return blank) to test-file calls
  that intentionally ignore errors: os.MkdirAll / os.WriteFile / os.Chdir
  in setup paths, Close / Shutdown in teardown, Submit / Spawn / Send /
  LoadDir in tests that assert on side effects.
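
For the deferred-close change, a small sketch of the pattern, assuming a helper that drains a response body. `readBody`, `closeTracker`, and `readTracked` are hypothetical names used only to make the example checkable without a network call.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// readBody drains the response body. The deferred closure makes the ignored
// Close error explicit (`_ = ...`) instead of an unchecked call that
// errcheck would flag.
func readBody(resp *http.Response) (string, error) {
	defer func() { _ = resp.Body.Close() }()

	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(data), nil
}

// closeTracker is a test double that records whether Close was called.
type closeTracker struct {
	io.Reader
	closed bool
}

func (c *closeTracker) Close() error {
	c.closed = true
	return nil
}

// readTracked runs readBody against an in-memory response.
func readTracked(payload string) (string, bool, error) {
	body := &closeTracker{Reader: strings.NewReader(payload)}
	text, err := readBody(&http.Response{Body: body})
	return text, body.closed, err
}

func main() {
	text, closed, err := readTracked("ok")
	fmt.Println(text, closed, err)
}
```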

Structural:
- engine.handleRequestTooLarge drops the unused req parameter and
  rebuilds the request from compacted history (SA4009 — argument was
  overwritten before first use).
- provider.ClassifyHTTPStatus and google.applyCapabilityOverrides switch
  to tagged switches over the discriminator (QF1002).
- tui.app.go MouseWheel + inputMode and cmd/gnoma main slm-status use
  tagged switches in place of equality chains (QF1003).
- cmd/gnoma main.go merges a var decl with its immediate assignment
  (S1021).
- Three empty-branch sites (dispatcher_test, loader_test,
  coordinator_test) become real assertions or get the dead `if` removed
  (SA9003).
2026-05-19 17:53:42 +02:00
vikingowl 135c8afe80 feat: various improvements to engine, router, and TUI
- engine/loop: enhanced loop handling
- router: dynamic model discovery and task improvements
- tui: suggestion box, input mode indicator, completions enhancements
2026-05-07 22:51:50 +02:00
vikingowl a9213ec382 feat(slm): Wave C — SLM classifier, MaxComplexity routing, CLI subcommands, TUI status
- slm.Classifier: openaicompat → llamafile, 2s timeout + heuristic fallback,
  heuristic baseline blended so Priority/RequiredEffort are never zeroed,
  extractJSON strips markdown fences from small-model responses
- router.ParseTaskType: case-insensitive string → TaskType, unknown → TaskGeneration
- router.Arm.MaxComplexity: zero = no ceiling (preserves existing arm behavior);
  filterFeasible excludes arms when task.ComplexityScore > MaxComplexity
- config.SLMSection: [slm] enabled / model_url / data_dir
- openaicompat.NewLlamafile: no API key, model = "default", no retries
- slm.Manager: DefaultDataDir() (XDG), Manifest() accessor
- cmd/gnoma: `gnoma slm setup` / `gnoma slm status` subcommands; SLM arm
  registered with MaxComplexity=0.3 when enabled + set up
- tui: /config shows slm status (ready/missing/not set up + base URL if running)
- docs: roadmap updated to reflect llamafile pivot from Ollama
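
The MaxComplexity rule above can be sketched like this. The `Arm` and `Task` structs are trimmed to the two fields the rule needs, and `withinComplexity` is a hypothetical helper; the real filterFeasible applies further checks that are not shown.

```go
package main

import "fmt"

// Arm is reduced to the fields relevant to the complexity ceiling.
type Arm struct {
	ID            string
	MaxComplexity float64 // zero means "no ceiling"
}

// Task carries only the complexity score here.
type Task struct {
	ComplexityScore float64
}

// withinComplexity reports whether the arm may take the task.
func withinComplexity(a Arm, t Task) bool {
	if a.MaxComplexity == 0 {
		return true // preserves behaviour for arms that never set a ceiling
	}
	return t.ComplexityScore <= a.MaxComplexity
}

// filterFeasible keeps the arms whose ceiling admits the task.
func filterFeasible(arms []Arm, t Task) []Arm {
	var out []Arm
	for _, a := range arms {
		if withinComplexity(a, t) {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	arms := []Arm{
		{ID: "slm", MaxComplexity: 0.3},
		{ID: "api"}, // no ceiling
	}
	fmt.Println(len(filterFeasible(arms, Task{ComplexityScore: 0.2})))
	fmt.Println(len(filterFeasible(arms, Task{ComplexityScore: 0.5})))
}
```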
2026-05-07 16:44:32 +02:00
vikingowl 8b2202e8ec feat(classifier): Wave A — TaskClassifier interface + HeuristicClassifier
- internal/router/classifier.go: TaskClassifier interface with
  Classify(ctx, prompt, history) signature. HeuristicClassifier wraps
  the existing ClassifyTask() with zero behavior change.

- engine.Config.Classifier: injectable TaskClassifier; nil defaults
  to HeuristicClassifier. Engine.classify() helper handles nil + error
  fallback transparently.

- loop.go: all four router.ClassifyTask() call sites replaced with
  e.classify(ctx, prompt). SLMClassifier slots in without further
  changes to the engine.
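
A rough sketch of the injection and fallback described above. The return type `(Task, error)`, the `[]string` history type, and the stand-in heuristic result are assumptions; only the method name, the argument order, and the nil and error fallback behaviour come from this message.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Task is reduced to a single field for the example.
type Task struct{ Type string }

// TaskClassifier mirrors the Classify(ctx, prompt, history) signature.
// The history and return types are assumptions.
type TaskClassifier interface {
	Classify(ctx context.Context, prompt string, history []string) (Task, error)
}

// HeuristicClassifier stands in for the wrapper around ClassifyTask().
type HeuristicClassifier struct{}

func (HeuristicClassifier) Classify(_ context.Context, _ string, _ []string) (Task, error) {
	return Task{Type: "generation"}, nil // placeholder result
}

// failingClassifier simulates an injected classifier that errors out.
type failingClassifier struct{}

func (failingClassifier) Classify(_ context.Context, _ string, _ []string) (Task, error) {
	return Task{}, errors.New("classifier unavailable")
}

// Engine holds the injectable classifier; nil means "use the heuristic".
type Engine struct{ Classifier TaskClassifier }

// classify hides both the nil default and the error fallback from callers.
func (e *Engine) classify(ctx context.Context, prompt string) Task {
	fallback := HeuristicClassifier{}
	if e.Classifier == nil {
		t, _ := fallback.Classify(ctx, prompt, nil)
		return t
	}
	t, err := e.Classifier.Classify(ctx, prompt, nil)
	if err != nil {
		t, _ = fallback.Classify(ctx, prompt, nil)
	}
	return t
}

// classifyPrompt is a convenience wrapper for the demo below.
func classifyPrompt(c TaskClassifier, prompt string) Task {
	e := &Engine{Classifier: c}
	return e.classify(context.Background(), prompt)
}

func main() {
	fmt.Println(classifyPrompt(nil, "write a parser").Type)
	fmt.Println(classifyPrompt(failingClassifier{}, "write a parser").Type)
}
```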
2026-05-07 16:11:20 +02:00
vikingowl 6883c2a041 feat(router): tier-based routing — CLI > local > API, disabled arms
Adds explicit tier preference to arm selection so the router
deterministically prefers lower-cost arms before falling back:

  tier 0: CLI agents (IsCLIAgent=true, subprocess/claude|gemini|vibe)
  tier 1: local models (IsLocal=true, ollama/llamacpp)
  tier 2: API providers (everything else)

Within a tier, quality/cost scoring still applies. filterFeasible still
gates on quality thresholds, so a low-quality local arm won't beat a
high-quality API arm when the task's minimum threshold rules it out.

Also adds Arm.Disabled: arms with Disabled=true are excluded from
auto-routing but remain selectable via ForceArm.

Implementation: armTier helper + selectBest refactored to try tiers in
order, bestScored picks within a tier. router.Select skips disabled arms
in allArms collection (forced arm bypasses disable check).
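
The tier walk can be sketched roughly as below. The `Score` field is a stand-in for the quality/cost scoring, the feasibility filter is omitted, and disabled arms are skipped inside `selectBest` for brevity, whereas the commit does that in router.Select; treat the layout as an assumption.

```go
package main

import "fmt"

// Arm carries only the flags the tier logic needs.
type Arm struct {
	ID         string
	IsCLIAgent bool
	IsLocal    bool
	Disabled   bool
	Score      float64 // stand-in for the quality/cost score
}

// armTier maps an arm to its tier: 0 = CLI agent, 1 = local, 2 = API.
func armTier(a Arm) int {
	switch {
	case a.IsCLIAgent:
		return 0
	case a.IsLocal:
		return 1
	default:
		return 2
	}
}

// selectBest tries tiers in order and returns the best-scored arm of the
// first tier that has any enabled candidate.
func selectBest(arms []Arm) (Arm, bool) {
	for tier := 0; tier <= 2; tier++ {
		var best Arm
		found := false
		for _, a := range arms {
			if a.Disabled || armTier(a) != tier {
				continue
			}
			if !found || a.Score > best.Score {
				best, found = a, true
			}
		}
		if found {
			return best, true
		}
	}
	return Arm{}, false
}

func main() {
	arms := []Arm{
		{ID: "api/large", Score: 0.9},
		{ID: "ollama/small", IsLocal: true, Score: 0.5},
		{ID: "subprocess/cli", IsCLIAgent: true, Disabled: true, Score: 0.7},
	}
	best, ok := selectBest(arms)
	fmt.Println(best.ID, ok)
}
```

With the CLI arm disabled, the local arm is chosen even though the API arm scores higher, which is the deterministic preference the tiers are meant to express.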
2026-05-07 14:36:36 +02:00
vikingowl 7fbb5454ee feat(router): normalize effort/thinking abstraction across providers
Add EffortLevel (auto/low/medium/high) as a provider-agnostic reasoning
control, replacing the Capabilities.Thinking bool. Each provider maps
the level to its native parameter: Anthropic budget tokens (1K/8K/16K),
OpenAI reasoning_effort (low/medium/high), Google thinking budget
(1K/8K/16K). Task classification auto-infers effort from TaskType and
complexity; filterFeasible excludes arms that lack the required level.
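
A sketch of the per-provider mapping, under stated assumptions: "1K/8K/16K" is read as 1024/8192/16384 tokens, `auto` is treated as "leave the provider parameter unset", and the function names are hypothetical.

```go
package main

import "fmt"

// EffortLevel is the provider-agnostic reasoning control.
type EffortLevel int

const (
	EffortAuto EffortLevel = iota
	EffortLow
	EffortMedium
	EffortHigh
)

// anthropicBudgetTokens maps a level to a thinking budget.
// ok=false means the parameter should be left unset (assumed for auto).
func anthropicBudgetTokens(l EffortLevel) (tokens int, ok bool) {
	switch l {
	case EffortLow:
		return 1024, true // "1K", assumed to mean 1024
	case EffortMedium:
		return 8192, true
	case EffortHigh:
		return 16384, true
	default:
		return 0, false
	}
}

// openAIReasoningEffort maps a level to the reasoning_effort string.
func openAIReasoningEffort(l EffortLevel) (effort string, ok bool) {
	switch l {
	case EffortLow:
		return "low", true
	case EffortMedium:
		return "medium", true
	case EffortHigh:
		return "high", true
	default:
		return "", false
	}
}

func main() {
	for _, l := range []EffortLevel{EffortAuto, EffortLow, EffortMedium, EffortHigh} {
		tokens, okA := anthropicBudgetTokens(l)
		effort, okO := openAIReasoningEffort(l)
		fmt.Println(tokens, okA, effort, okO)
	}
}
```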
2026-05-07 14:08:50 +02:00
vikingowl d71bd942c4 feat: local model reliability — SDK retries, capability probing, init skill, context compaction
Three compounding bugs prevented tool calling with llama.cpp:
- Stream parser set argsComplete on partial JSON (e.g. "{"), dropping
  subsequent argument deltas — fix: use json.Valid to detect completeness
- Missing tool_choice default — llama.cpp needs explicit "auto" to
  activate its GBNF grammar constraint; now set when tools are present
- Tool names in history used internal format (fs.ls) while definitions
  used API format (fs_ls) — now re-sanitized in translateMessage
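
The first fix can be illustrated with a small accumulator that only reports completeness once the buffered arguments parse as JSON. `toolCallBuffer` and `completeness` are hypothetical names; the use of `json.Valid` is what the message above describes.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// toolCallBuffer accumulates streamed argument fragments for one tool call.
type toolCallBuffer struct {
	args strings.Builder
}

// appendDelta adds a fragment and reports whether the accumulated text is
// now a complete JSON document. A lone "{" is not valid, so later deltas
// are still accepted instead of being dropped.
func (b *toolCallBuffer) appendDelta(delta string) bool {
	b.args.WriteString(delta)
	return json.Valid([]byte(b.args.String()))
}

// completeness feeds the deltas in order and records the result per step.
func completeness(deltas ...string) []bool {
	var b toolCallBuffer
	out := make([]bool, 0, len(deltas))
	for _, d := range deltas {
		out = append(out, b.appendDelta(d))
	}
	return out
}

func main() {
	fmt.Println(completeness("{", `"path":`, `"."}`))
}
```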

Additional changes:
- Disable SDK retries for local providers (500s are deterministic)
- Dynamic capability probing via /props (llama.cpp) and /api/show
  (Ollama), replacing hardcoded model prefix list
- Engine respects forced arm ToolUse capability when router is active
- Bundled /init skill with Go template blocks, context-aware for local
  vs cloud models, deduplication rules against CLAUDE.md
- Tool result compaction for local models — previous round results
  replaced with size markers to stay within small context windows
- Text-only fallback when tool-parse errors occur on local models
- "text-only" TUI indicator when model lacks tool support
- Session ResetError for retry after stream failures
- AllowedTools per-turn filtering in engine buildRequest
2026-04-13 02:01:01 +02:00
vikingowl 0caab0fed1 fix(router): discovery loop removes forced arm, breaking routing
The discovery loop's reconcileArms removed the CLI-forced arm
(llamacpp/default) because the llama.cpp server reports the real model
name (e.g. gemma-26b), creating a mismatch. After 30s the forced arm
disappeared and all subsequent requests failed.

Three-layer fix:
- Eager: query the specific provider at startup to resolve the real
  model name before registering the forced arm
- Lazy: reconcileArms detects placeholder "default" arm names and
  atomically renames them when discovery reveals the real identity,
  with an onReconcile callback to update the session and TUI
- Guard: the forced arm is never garbage-collected by the removal loop

Also fixes misleading /init error messaging — failed inits now show
"loaded from disk (init failed)" instead of "AGENTS.md written to".
2026-04-12 17:51:30 +02:00
vikingowl 6bb9c33d04 fix(m8): replace_default map, error UX, benchmarks, and launch prep
- Fix replace_default positional bug: []string → map[string]string for
  explicit MCP tool → built-in name mapping
- Improve error messages for missing API keys (3 actionable options) and
  unknown providers (early validation with available list)
- Remove python3 dependency from MCP tests (pure bash grep/sed parsing)
- Add router benchmark scaffold (6 benchmarks in bench_test.go + docs)
- Add .goreleaser.yml for cross-platform binary releases with ldflags
- Add launch-ready README with quickstart, extensibility docs, GIF placeholder
- Add CONTRIBUTING.md and Gitea issue templates (bug report, feature request)
2026-04-12 03:34:58 +02:00
vikingowl 8d86bc75fd test: M7 audit — quality feedback, coordinator, agent tool coverage
Quality feedback integration: TestQualityTracker_InfluencesArmSelection
verifies that 5 successes vs 5 failures tips Router.Select() to the
high-quality arm once EMA has enough observations. Companion test
confirms heuristic fallback below minObservations.
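
A simplified sketch of the mechanism under test: an exponential moving average per arm that is only trusted once enough observations exist. The real tracker keys scores by arm and task type; the single-key map, the alpha value, and the method names here are assumptions.

```go
package main

import "fmt"

// armStats holds the running average and how many outcomes fed it.
type armStats struct {
	ema          float64
	observations int
}

// QualityTracker is a per-arm EMA; the real one is keyed by arm and task.
type QualityTracker struct {
	alpha           float64 // smoothing factor, assumed value
	minObservations int
	stats           map[string]*armStats
}

func NewQualityTracker(alpha float64, minObservations int) *QualityTracker {
	return &QualityTracker{
		alpha:           alpha,
		minObservations: minObservations,
		stats:           make(map[string]*armStats),
	}
}

// Record folds one outcome (1 for success, 0 for failure) into the EMA.
func (q *QualityTracker) Record(arm string, success bool) {
	x := 0.0
	if success {
		x = 1.0
	}
	s, ok := q.stats[arm]
	if !ok {
		q.stats[arm] = &armStats{ema: x, observations: 1}
		return
	}
	s.ema = q.alpha*x + (1-q.alpha)*s.ema
	s.observations++
}

// Score returns the EMA, or ok=false while the arm has too few
// observations, in which case the caller keeps its heuristic score.
func (q *QualityTracker) Score(arm string) (float64, bool) {
	s, ok := q.stats[arm]
	if !ok || s.observations < q.minObservations {
		return 0, false
	}
	return s.ema, true
}

func main() {
	q := NewQualityTracker(0.5, 3)
	for i := 0; i < 5; i++ {
		q.Record("good", true)
		q.Record("bad", false)
	}
	q.Record("new", true)
	fmt.Println(q.Score("good"))
	fmt.Println(q.Score("bad"))
	fmt.Println(q.Score("new"))
}
```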

Coordinator tests expanded from 2 to 5: added guidance content check
(parallel/serial/synthesize present), false-positive table extended with
7 cases including the reordered keywords from the previous fix.

Agent tool suite: tool interface contracts for all four tools (Name,
Description, Parameters validity, IsReadOnly). Extracted duplicated
2000-char truncation into truncateOutput() helper (format.go), removing
the inline copies in agent.go and batch.go. Four boundary tests cover
empty, short, exact-max, and over-max cases.
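
A possible shape for the extracted helper, with the four boundary cases in mind. The 2000 limit comes from the message; counting in runes rather than bytes and the marker text are assumptions, and `repeat` exists only to build test input.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	maxOutputChars   = 2000
	truncationMarker = "\n[truncated]" // assumed marker text
)

// truncateOutput caps s at maxOutputChars runes and appends a marker when
// anything was cut. Inputs at or below the limit are returned unchanged.
func truncateOutput(s string) string {
	r := []rune(s)
	if len(r) <= maxOutputChars {
		return s
	}
	return string(r[:maxOutputChars]) + truncationMarker
}

// repeat builds an n-character ASCII string for the boundary cases.
func repeat(n int) string {
	return strings.Repeat("x", n)
}

func main() {
	for _, n := range []int{0, 10, maxOutputChars, maxOutputChars + 1} {
		fmt.Println(n, len(truncateOutput(repeat(n))))
	}
}
```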
2026-04-06 00:59:12 +02:00
vikingowl 07a976c32a fix: ClassifyTask priority ordering — orchestration below operational types
Operational task types (debug, review, refactor, test, explain) now gate
before orchestration in the keyword cascade. Previously, prompts like
"review the orchestration layer" or "refactor the pipeline dispatch"
matched "orchestrat"/"dispatch" and misclassified as TaskOrchestration.
Planning is also moved below the operational types.

Expanded orchestration keywords to cover common intent that the original
four keywords missed: "fan out", "subtask", "delegate to", "spawn elf".
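
The reordered cascade could look roughly like this. The orchestration keywords are the ones named in this message; the single-keyword matches for the operational types, the relative order of orchestration and planning, and the `generation` default are simplifying assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// rule pairs a task type with the substrings that trigger it.
type rule struct {
	taskType string
	keywords []string
}

// cascade is checked top to bottom; operational types come first so that
// "review the orchestration layer" is a review, not an orchestration task.
var cascade = []rule{
	{"debug", []string{"debug"}},
	{"review", []string{"review"}},
	{"refactor", []string{"refactor"}},
	{"unit_test", []string{"test"}},
	{"explain", []string{"explain"}},
	{"orchestration", []string{"orchestrat", "dispatch", "fan out", "subtask", "delegate to", "spawn elf"}},
	{"planning", []string{"plan"}},
}

// classifyTask returns the first matching type, or "generation" if none match.
func classifyTask(prompt string) string {
	p := strings.ToLower(prompt)
	for _, r := range cascade {
		for _, kw := range r.keywords {
			if strings.Contains(p, kw) {
				return r.taskType
			}
		}
	}
	return "generation"
}

func main() {
	fmt.Println(classifyTask("review the orchestration layer"))
	fmt.Println(classifyTask("refactor the pipeline dispatch"))
	fmt.Println(classifyTask("fan out the work across elfs"))
}
```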

Adds regression tests for false-positive cases and positive tests for new
keywords.
2026-04-06 00:58:54 +02:00
vikingowl 39181168b6 feat: QualityTracker.Snapshot/Restore + Router.QualityTracker() for cross-session persistence
2026-04-05 23:40:19 +02:00
vikingowl 64ee385039 feat: QualityTracker — EMA router feedback from elf outcomes, ResultFilePaths tracking
2026-04-05 22:08:08 +02:00
vikingowl 4f1e0cf567 feat: Ollama/gemma4 compat — /init flow, stream filter, safety fixes
provider/openai:
- Fix doubled tool call args (argsComplete flag): Ollama sends complete
  args in the first streaming chunk then repeats them as delta, causing
  doubled JSON and 400 errors in elfs
- Handle fs: prefix (gemma4 uses fs:grep instead of fs.grep)
- Add Reasoning field support for Ollama thinking output

cmd/gnoma:
- Early TTY detection so logger is created with correct destination
  before any component gets a reference to it (fixes slog WARN bleed
  into TUI textarea)

permission:
- Exempt spawn_elfs and agent tools from safety scanner: elf prompt
  text may legitimately mention .env/.ssh/credentials patterns and
  should not be blocked

tui/app:
- /init retry chain: no-tool-calls → spawn_elfs nudge → write nudge
  (ask for plain text output) → TUI fallback write from streamBuf
- looksLikeAgentsMD + extractMarkdownDoc: validate and clean fallback
  content before writing (reject refusals, strip narrative preambles)
- Collapse thinking output to 3 lines; ctrl+o to expand (live stream
  and committed messages)
- Stream-level filter for model pseudo-tool-call blocks: suppresses
  <<tool_code>>...</tool_code>> and <<function_call>>...<tool_call|>
  from entering streamBuf across chunk boundaries
- sanitizeAssistantText regex covers both block formats
- Reset streamFilterClose at every turn start
2026-04-05 19:24:51 +02:00
vikingowl 11363f3b97 feat: M1-M7 gap audit phase 2 — security, TUI, context, router feedback
Gap 6 (M3): 7 new bash security checks (8-14)
- JQ injection, obfuscated flags (Unicode lookalike hyphens),
  /proc/environ access, brace expansion, Unicode whitespace,
  zsh dangerous constructs, comment-quote desync
- Total: 14 checks (was 7)

Gap 7 (M5): Model picker numbered selection
- /model shows numbered sorted list, /model 3 picks by number

Gap 8 (M5): /config set command
- /config set provider.default mistral writes to .gnoma/config.toml
- Whitelisted keys: provider.default, provider.model, permission.mode
- New config/write.go with TOML round-trip via BurntSushi/toml

Gap 9 (M6): Simple token estimator
- EstimateTokens (len/4 heuristic), EstimateMessages (content + overhead)
- PreEstimate on Tracker for proactive compaction triggering
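
The estimator is simple enough to sketch directly. The len/4 rule is from the message; the `Message` struct and the per-message overhead of 4 tokens are assumptions.

```go
package main

import "fmt"

// Message is a minimal stand-in for a chat message.
type Message struct {
	Role    string
	Content string
}

// perMessageOverhead approximates role and framing tokens (assumed value).
const perMessageOverhead = 4

// EstimateTokens applies the len/4 heuristic to the byte length of s.
func EstimateTokens(s string) int {
	return len(s) / 4
}

// EstimateMessages sums content estimates plus a fixed overhead per message.
func EstimateMessages(msgs []Message) int {
	total := 0
	for _, m := range msgs {
		total += EstimateTokens(m.Content) + perMessageOverhead
	}
	return total
}

func main() {
	msgs := []Message{
		{Role: "user", Content: "12345678"},
		{Role: "assistant", Content: "1234"},
	}
	fmt.Println(EstimateTokens("12345678"), EstimateMessages(msgs))
}
```

Because the estimate needs no tokenizer, it is cheap enough to run before every request, which is what makes proactive compaction practical.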

Gap 10 (M7): Router quality feedback from elfs
- Router.Outcome + ReportOutcome (logs for now, M9 bandit uses later)
- Manager tracks armID/taskType per elf via elfMeta map
- Manager.ReportResult called after elf completion in both agent + batch tools
2026-04-04 11:07:08 +02:00
vikingowl de1798ff5c fix: M1-M7 gap audit phase 1 — bug fix + 5 quick wins
Bug fix:
- window.go: token ratio after compaction used len(w.messages) after
  reassignment, always producing ratio ~1.0. Fixed by saving original
  length before assignment.
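
The shape of the fix, as a sketch: capture the message count before the slice is reassigned, then scale the token estimate by kept/original. The `Window` fields and the integer scaling are assumptions; the real window.go may compute the ratio differently.

```go
package main

import "fmt"

// Window is a minimal stand-in for the context window.
type Window struct {
	messages []string
	tokens   int
}

// compact keeps the last `keep` messages and rescales the token estimate.
// Reading len(w.messages) after the reassignment would compare the new
// length with itself and leave the estimate effectively unchanged.
func (w *Window) compact(keep int) {
	if keep <= 0 || keep >= len(w.messages) {
		return
	}
	originalLen := len(w.messages) // saved before reassignment
	w.messages = w.messages[originalLen-keep:]
	w.tokens = w.tokens * len(w.messages) / originalLen
}

func main() {
	w := &Window{messages: make([]string, 10), tokens: 1000}
	w.compact(4)
	fmt.Println(len(w.messages), w.tokens)
}
```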

Gap 1 (M3): Scanner patterns 13 → 47
- Added 34 new patterns: Azure, DigitalOcean, HuggingFace, Grafana,
  GitHub extended (app/oauth/refresh), Shopify, Twilio, SendGrid,
  NPM, PyPI, Databricks, Pulumi, Postman, Sentry, Anthropic admin,
  OpenAI extended, Vault, Supabase, Telegram, Discord, JWT, Heroku,
  Mailgun, Figma

Gap 2 (M3): Config security section
- SecuritySection with EntropyThreshold + custom PatternConfig
- Wire custom patterns from TOML into scanner at startup

Gap 3 (M4): Polling discovery loop
- StartDiscoveryLoop with 30s ticker, reconciles arms vs discovered
- Router.RemoveArm for disappeared local models

Gap 4 (M5): Incognito LocalOnly enforcement
- Router.SetLocalOnly filters non-local arms in Select()
- TUI incognito toggle (Ctrl+X, /incognito) sets local-only routing

Gap 5 (M6): Reactive 413 compaction
- Window.ForceCompact() bypasses ShouldCompact threshold
- Engine handles 413 with emergency compact + retry
2026-04-03 23:11:08 +02:00
vikingowl e1a47a7620 feat: rate limit pools, elf tree view, permission prompts, dep updates
Rate limits:
- Add PoolRPS/PoolTPM/PoolTokensMonth/PoolCostMonth pool kinds
- Provider defaults for Mistral/Anthropic/OpenAI/Google (tier-aware)
- Config override via [rate_limits.<provider>] TOML section
- Pools auto-attached to arms on registration

Elf tree view (CC-style):
- Structured elf.Progress type replaces flat string channel
- Tree with ├─/└─ branches, per-elf stats (tool uses, tokens)
- Live activity updates: tool calls, "generating… (N chars)"
- Completed elfs stay in tree with "Done (duration)" until turn ends
- Suppress raw elf output from chat (tree + LLM summary instead)
- Remove background elf mode (wait: false) — always wait
- Truncate elf results to 2000 chars for parent context
- Parallel hint in system prompt and tool description

Permission prompts:
- Show actual command in prompt: "bash wants to execute: find . -name '*.go'"
- Compact hint in separator bar: "⚠ bash: find . | wc -l [y/n]"
- PermReqMsg carries tool name + args

Other:
- Fix /model not updating status bar (session.Local.SetModel)
- Add make targets: run, check, install
- Update deps: BurntSushi/toml v1.6.0, chroma v2.23.1, x/text v0.35.0, cloud.google.com/go v0.123.0
2026-04-03 20:54:48 +02:00
vikingowl 76916846aa feat: auto-discover local models from ollama + llama.cpp
At startup, polls ollama (/api/tags) and llama.cpp (/v1/models) for
available models. Registers each as an arm in the router alongside
the CLI-specified provider.

Discovered: 7 ollama models + 1 llama.cpp model, plus the CLI-specified arm = 9 total arms.
Router can now select from multiple local models based on task type.
Discovery is non-blocking; failures are logged and skipped.
2026-04-03 17:53:11 +02:00
vikingowl 847735a9f7 feat: add router foundation with task classification and arm selection
internal/router/ — core routing layer:
- Task classification: 10 types (boilerplate, generation, refactor,
  review, unit_test, planning, orchestration, security_review, debug,
  explain) with keyword heuristics and complexity scoring
- Arm registry: provider+model pairs with capabilities and cost
- Limit pools: shared resource budgets with scarcity multipliers,
  optimistic reservation, use-it-or-lose-it discounting
- Heuristic selector: score = (quality × value) / effective_cost
  Prefers tools, thinking for planning, penalizes small models on
  complex tasks
- Router: Select() picks best feasible arm, ForceArm() for CLI override
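
The selector formula can be written out as a small sketch. Only `score = (quality × value) / effective_cost` comes from the message; modelling effective cost as base cost times a scarcity multiplier, and the small floor that avoids division by zero for free arms, are assumptions.

```go
package main

import "fmt"

// minCost is an assumed floor so zero-cost arms do not divide by zero.
const minCost = 1e-6

// effectiveCost scales the base cost by a pool scarcity multiplier.
func effectiveCost(baseCost, scarcity float64) float64 {
	c := baseCost * scarcity
	if c < minCost {
		return minCost
	}
	return c
}

// score implements (quality × value) / effective_cost.
func score(quality, value, effCost float64) float64 {
	return quality * value / effCost
}

func main() {
	cheap := score(0.8, 1.0, effectiveCost(1.0, 1.0))
	pricey := score(0.8, 1.0, effectiveCost(2.0, 1.0))
	scarce := score(0.8, 1.0, effectiveCost(1.0, 3.0))
	fmt.Println(cheap > pricey, cheap > scarce)
}
```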

Engine now routes through router.Select() when configured.
Wired into CLI — arm registered per --provider/--model flags.

20 router tests. 173 tests total across 13 packages.
2026-04-03 14:23:15 +02:00