256 Commits

Author SHA1 Message Date
vikingowl 7213a1e2fd docs: switch recommended SLM from reecdev/tiny3.5:500m to qwen3:0.6b
Release / release (push) Has been cancelled
Empirical comparison on 2026-05-25 across three candidate SLMs on
identical prompts (two prompts: trivial 'what is 2+2' + knowledge
'explain a multi-armed bandit'):

  qwen3:0.6b           consistent across both prompts
  functiongemma:270m   works trivial, derails on knowledge prompts
  gemma3:1b            unusable (emits just '{' or invented keys)
  reecdev/tiny3.5:1.5b unusable (ignores /no_think, leaks <Thought Process> blocks)
  qwen2.5-coder:1.5b   unusable (ignores classifier prompt, answers in prose)

qwen3:0.6b honours Qwen3's native /no_think flag (the distillation in
the old default did not), is smaller than the previous recommendation
(520 MB vs 1 GB), and was the only candidate to classify both test
prompts successfully without falling back to heuristic.

README quickstart block + slm-backends.md presets + status output
sample all switched. Also documents register_as_arm (default true,
set false for task-specialised models like FunctionGemma) and
classify_timeout (default 15s) in the example configs since both
landed in v0.3.3+.

Code defaults for the tiny3.5 family in internal/router/defaults.go
are unchanged — that table still applies when users have tiny3.5
registered as a routing arm independent of the SLM role.
v0.3.4
2026-05-25 02:43:11 +02:00
vikingowl fd327107df fix(router/discovery): always probe ollama capabilities, cache is optional
DiscoverOllama() interpreted a nil probeCache as 'skip probing
entirely' rather than 'probe but don't cache.' cmd/gnoma/main.go's
synchronous discovery path passes nil, so every ollama-discovered
model got SupportsTools=false (the Go zero value), regardless of
what ollama actually reported in its capabilities field.

The symptom: filterFeasible rejected every ollama arm for any
tool-requiring task with reason=tools_required_but_unsupported,
even when ollama itself reported the model as tool-capable. Verified
via curl: qwen3:14b advertises capabilities=[completion, tools,
thinking] and has 'tools' in its template, but the gnoma arm shipped
with tool_use_capability=false.

Fix: always run probeOllamaModel; treat probeCache as an optional
memoisation aid only. nil cache now means 'no caching across calls'
not 'no probing.' For users with many models, passing a real cache
still avoids redundant HTTP calls — semantics for that path are
unchanged.

Surfaced via the new filterFeasible Debug logging from the previous
commit, which made the per-arm rejection reasons visible.
2026-05-25 02:28:05 +02:00
vikingowl 0d3d190a8b fix(slm,session,router): classifier-only SLMs + session error recovery + feasibility diagnostics
Three coupled fixes that surfaced from a single FunctionGemma test
session where the SLM-as-execution-arm assumption broke down and
every subsequent prompt failed with 'session not idle (state: error)'.

(A) [slm].register_as_arm config. The SLM has always been
unconditionally registered as both classifier AND tier-0 execution
arm. Fine for general-purpose models (ministral, qwen3-chat); breaks
for task-specialised models (FunctionGemma emits function-call
syntax instead of prose; embedding models can't generate). New
pointer-bool config: nil/absent preserves the historical default
(true), explicit false makes the SLM classifier-only and the
execution path skips the slm/* arm. Three table tests cover absent
/ explicit-false / explicit-true decode paths.

(B) Session error recovery. After any routing or engine error, the
session moved to StateError and stayed there until restart — every
new user prompt got rejected with 'session not idle (state: error)'.
ResetError() was already wired for the /init retry path, but the
general user-input and slash-command paths didn't call it. Added
ResetError() before every user-initiated Send in the TUI so a fresh
prompt always represents intent-to-retry. The /init internal retry
already had its own ResetError; left alone.

(C) filterFeasible per-arm rejection logging. Today's 'no feasible
arm for task X' error tells you THAT every arm was rejected but
nothing about WHY. Added slog.Debug per rejection (arm, task,
complexity, reason, the specific violated constraint) plus a
summary line when zero arms are feasible at any quality. Visible
with --verbose; quiet otherwise. Surface area expansion only — no
behaviour change for users not chasing a bug.
2026-05-25 01:57:16 +02:00
vikingowl c065a2dea7 fix(provider/openai): wire ResponseFormat into OpenAI request params
provider.Request.ResponseFormat was being silently dropped by the
openai translation layer (translate.go:translateRequest). The
upstream provider type and the openai-go SDK both supported it; the
adapter just never propagated it.

This is why Move 1 (set ResponseFormat=ResponseJSON in the SLM
classifier) produced zero observable change: the field made it from
the classifier into provider.Request but stopped at the OpenAI
translation step. The ollama backend (used via the OpenAI-compatible
endpoint) therefore never received format=json_object and kept
emitting free-form prose, which the classifier's downstream JSON
parser duly rejected — 50 fallbacks in a row across two model swaps.

Translate provider.ResponseJSON to oai.ResponseFormatJSONObjectParam
and provider.ResponseText to oai.ResponseFormatTextParam; leave the
union zero-valued when the caller didn't set ResponseFormat so the
SDK omits the field per its omitzero tag. Three table cases cover
the json / text / unset paths.

Affects ollama, llama.cpp, llamafile, and any other backend reached
via openaicompat — all run through openai.translateRequest.
2026-05-25 01:26:38 +02:00
vikingowl 24945b1eb2 docs(plans): encoder + contextual-bandit router architecture
Captures the architectural research surfaced during the 2026-05-25
SLM-failure diagnostic session: RouteLLM treats routing as
classification, ModernBERT is well-suited to that classification, and
FunctionGemma fits as an optional JSON-sanity layer rather than the
primary classifier. The current decoder-SLM-as-classifier design is
the wrong shape (100% failure rate observed across two model swaps).

Five-phase plan:
  1. Embedding feature scaffold (near-term, additive, opt-in)
  2. Contextual bandit (LinUCB / Thompson) over the feature set
  3. Retire the decoder-SLM classifier once 2 outperforms
  4. ModernBERT fine-tune on the accumulated labelled data
  5. FunctionGemma JSON sanity layer (optional final stage)

Phase 1 is the only piece scoped for near-term implementation; the
rest is multi-month and hinges on the strategic 'EMA vs SLM'
question already tracked in TODO.

Cross-references the existing tool-router-specialization plan so a
reader of either lands on both. Updates the TODO entry for the
bandit selector to note the supersession path.
2026-05-25 01:22:18 +02:00
vikingowl c0c2e4bff5 fix(slm): enforce JSON output + strip thinking-block prefixes
Two structural fixes for the SLM classifier's 100% failure rate:

(1) Pass ResponseFormat=json_object + Temperature=0 + TopP=1 +
MaxTokens=128 in the classifier Request. The provider type already
supports these but callSLM was leaving them unset, which meant ollama
(and any other backend) ran with default sampling and free-form text
output. format=json mode in particular makes ollama emit only valid
JSON at decoding time — eliminates the majority of parse failures.

(2) Harden extractJSON to strip common thinking-block tags before
hunting for the brace. Seen in the wild: <think>…</think> (Qwen3
distillations) and <Thought Process>…</Thought Process> (tiny3.5).
Defensive list also covers <reasoning>, <thoughts>. Unterminated
thinking blocks fall back to brace-search so we still have a shot.
Table-driven tests cover all variants plus the no-tag and
fenced-json paths to confirm no regression.

Even with format=json on a capable provider, the extractor is the
safety net for backends that don't enforce format strictly — same
defence-in-depth shape as the existing fence stripping.

Doesn't fix the deeper architecture question (encoder + bandit
preferred over decoder-SLM as classifier — see plan doc landing in
the same PR); fixes the immediate bug.
2026-05-25 01:19:51 +02:00
vikingowl f3c70bd802 fix(slm,router): honest classifier diagnostics + 15s default timeout
Five fixes folded into one commit because they all answer the same
question: 'why does my router stats output lie to me?'

Issue 1 (timeout). Default classify timeout was 5s — too short for
cold-start ollama loads on small models. Bumped to 15s and surfaced
as [slm].classify_timeout (0 = built-in default). Empirically caught
when a user's reecdev/tiny3.5:1.5b hit 'stream error: context
deadline exceeded' on every single classify call.

Issue 2 (Warn-level error). The SLM-fallback path logged the
underlying error at Debug, invisible without --verbose. Promoted to
Warn so a first-time misconfiguration surfaces immediately. The
fallback itself is benign; the signal is that the SLM isn't doing
the work it was supposed to.

Issue 3 (stats hint). Hard-coded 'check that llamafile boots' even
when the user is on ollama. Replaced with backend-templated advice
read from cfg.SLM.Backend. Also distinguishes three diagnostic
cases that were collapsed before:
- SLM never called (zero attempts)
- SLM called N times but every call fell back (timeout/parse)
- SLM working but minority share

Issue 4 (effective heuristic share). The classifier breakdown
shows 'heuristic' and 'slm_fallback' as separate sources, but both
routed through HeuristicClassifier — only the source tag differs.
New line under 'total observations' surfaces the combined share
honestly: 'effective heuristic share: 100% (44 fallbacks + 10
pure heuristic)'.

Issue 5 (config schema). [slm].classify_timeout joins the existing
[slm] knobs alongside startup_timeout. Documented inline with the
cold-start-load rationale.
2026-05-25 01:05:57 +02:00
vikingowl fa65a68728 docs(plans): config-migration and sensitive-content-policy
Release / release (push) Has been cancelled
Promotes two TODO entries into phased plan docs and links them
from the TODO bullets.

config-migration plan covers the silent layered-config corruption
chain (encoder zero-spam -> reader overwrite -> wrong effective
values) and its remediation across five phases: encoder fix
(omitempty + pointer-numeric hybrid), project registry, gnoma
doctor, gnoma upgrade-config, and auto-migration on startup with
banner notice.

sensitive-content-policy plan unifies three input paths (pasted
text, pasted images, tool-read files) behind one decision API
with consistent UI surface and audit-log integration. Phases A-E
sequence the work from highest-leverage (text paste) to most
complex (image OCR with local vision arm).

Neither plan starts implementation in this commit — they exist to
make the design decisions explicit so the eventual code can be
reviewed against a written intent rather than a TODO bullet.
v0.3.3
2026-05-24 22:51:33 +02:00
vikingowl 8b9bdc2978 feat(security): per-session firewall audit log
New AuditLogger writes one JSON line per firewall action to
<projectRoot>/.gnoma/sessions/<sessionID>/audit.jsonl so a user can
grep 'what did the firewall do this session?' after the fact.

Records 'block', 'redact', 'warn', and 'unicode_sanitize' events with
the matcher name, source (tool_result / message_text / etc.), and
token length. Discipline: never the bytes themselves — only the
matcher name and the length, matching the README's scope-note
promise about audit data.

Plumbing:
- Firewall gains an audit *AuditLogger field plus SetAudit setter.
  The firewall is constructed before the session ID exists, so the
  audit logger is wired post-hoc once main.go has the sessionID.
- Honours incognito: Record is a silent no-op when the firewall's
  IncognitoMode is active, preserving the no-persistence contract.
- Tolerant of fs errors: mkdir / open / encode failures log a Warn
  but never propagate; the scan pipeline must not depend on audit
  succeeding.
- Nil receiver is a valid no-op so callers don't need nil-guards
  around every Record.

Tracks 'Security boundary — per-session audit log' from the
v0.3.0 r/SideProject launch thread (u/Secret_Theme3192,
2026-05-24). Per-host egress allowlist remains separately tracked
pending the commenter's reply on host-level vs per-tool semantics.
2026-05-24 22:47:28 +02:00
vikingowl eea26a262e feat(router): surface bandit knobs as [router.bandit] config
Four hardcoded constants in the selector and feedback tracker are now
user-tunable via [router.bandit]:

- quality_alpha    (EMA smoothing, default 0.3)
- min_observations (samples before observed overrides heuristic, default 3)
- observed_weight  (observed/heuristic blend ratio, default 0.7)
- strength_bonus   (quality bonus for Strengths-tagged arms, default 0.15)

Each field treats 0 as 'use default', so an empty TOML block is
byte-identical to pre-config behaviour. BanditParams is plumbed via
router.Config{Bandit: ...} and resolveBanditParams() centralises the
fallback so every call site shares the same defaults.

QualityTracker, scoreArm, bestScored, and selectBest signatures now
take the configured values directly rather than reaching for package-
level constants. Tests updated to pass BanditParams{} (defaults) or
explicit overrides where they validate the new tuning paths.

Tracks item #3 from the 'Bandit selector — design decisions deferred'
TODO entry — ships independently of the EMA vs SLM strategic decision.
2026-05-24 22:42:34 +02:00
vikingowl 352cab4a94 docs(todo): extend config-migration plan with project registry
Release / release (push) Has been cancelled
Adds item #5 to the config write/merge corruption entry:
~/.config/gnoma/projects.json tracking which directories gnoma has
been launched in. Enables doctor --all-projects, cross-project
session listing, and one-shot upgrade-config across all known
projects.

Documents the design constraints: must use the same omitempty /
atomic-write discipline as the encoder fix to avoid recreating the
class of bug it exists to help solve. Privacy footprint flagged
(local-only directory log; opt-out toggle). Stale-entry handling
gated through doctor, not auto-prune.
v0.3.2
2026-05-24 22:29:56 +02:00
vikingowl 58f4001917 docs(todo): track config write/merge corruption + doctor/upgrade design
setConfig() serializes the entire Config struct on every key change,
which writes zero-valued fields into the file. On the next load those
explicit zeros override higher-priority layers via toml.Decode's
present-beats-absent semantics. Concrete symptom today: a global
prefer = 'cloud' was silently shadowed by a project prefer = ''.

Captures the multi-part fix surface so it doesn't get half-done:
- Stop generating zero-spam (omitempty hybrid or pelletier swap).
- gnoma doctor: read-only diagnostic (zero-spam, invalid enums,
  removed keys, effective-merged values).
- gnoma upgrade-config: active migration with .bak backup + diff.
- Auto-migrate project-level on startup with TUI banner notice;
  global stays explicit.
2026-05-24 22:24:59 +02:00
vikingowl 6c5e969217 feat(tui): add /router command for runtime routing-preference switch
Mirrors the pattern of /permission: bare command shows the current
value plus a help line; with an argument (auto/local/cloud) it calls
Router.SetPreferPolicy and emits a system message. Session-only — does
not write back to config.toml, matching /permission and Ctrl+X
incognito-toggle conventions.

Tab completion on the value via routerPreferModes alongside the
existing permissionModes pattern. Help text updated. Status-bar
indicator deferred (separate concern if it turns out to be wanted).
2026-05-24 22:13:27 +02:00
vikingowl 74bd570438 fix(tui): de-dupe /init in command picker; skill names shadow builtins
/init appeared twice in the completion picker — once from the static
builtinCommands list and once from the bundled init skill at
internal/skill/skills/init.md (registered via skills.All()).

Two changes:

- Remove /init from builtinCommands. The skill provides the canonical
  entry, and its description ('Generate or update AGENTS.md project
  documentation') is more accurate than the static one ('initialize
  project — create AGENTS.md') because the skill handles both create
  and update.
- Refactor completionSource() so a skill name silently shadows any
  builtin with the same name. Prevents this from recurring if a
  future builtin migrates to a skill, and lets users override a
  builtin's description by dropping a skill of the same name into
  .gnoma/skills/.
2026-05-24 22:08:46 +02:00
vikingowl d38d7daf25 fix(subprocess/agy): disable ToolUse until stream-json lands
agy is registered with FormatAgyText and the agyParser emits every
stdout line as a plain EventTextDelta. There is no path for a
structured ToolCall event to come back. With ToolUse=true the router
would dispatch tool-needing tasks (security_review, spawn_elfs, file
edit) to agy; the underlying Gemini model would describe calling the
tool in prose — invented UUIDs and 'I will pause now'-style stubs —
the engine would receive only text, and the turn would hang waiting
for a tool call that never arrives.

Surfaced when /init routed to agy for a security_review task and
elf spawning visibly hallucinated in the TUI. Capability flag
flipped to false; agy stays usable for tool-free prompts (explain,
summarize, simple chat). TODO entry for native stream-json updated
to flag that the capability flip is part of that same change.
2026-05-24 21:58:22 +02:00
vikingowl 06d4069076 ci: pin GoReleaser to the triggering tag, fix tag-collision regression
Release / release (push) Has been cancelled
When v0.3.1 was tagged on the same commit as v0.3.1-rc2, the release
workflow built and tried to publish rc2 artifacts instead of v0.3.1,
failing with 'already_exists' on every asset upload.

Root cause: goreleaser-action@v6 + 'version: latest' (locked to v2.x)
falls back to 'git describe --tags' for the current tag, which picked
v0.3.1-rc2 over v0.3.1 when both refs pointed at HEAD. Explicitly
setting GORELEASER_CURRENT_TAG = github.ref_name forces the workflow
to use the tag that triggered it, regardless of other refs at the same
commit.
v0.3.1
2026-05-24 17:36:01 +02:00
vikingowl f641bd4971 docs(todo): track bandit selector design questions
Two related items surfaced from the r/coolgithubprojects v0.3.1
launch thread. Bundled because they share the selector code:

1. Whether to keep numeric EMA at all post-SLM dispatcher (open
   strategic question from the 2026-05-07 roadmap — not a
   must-implement).
2. Surfacing hardcoded selector knobs (qualityAlpha, blend ratio,
   strength bonus, quality floor) as [router.bandit] config keys —
   ships independently of #1.
2026-05-24 17:34:13 +02:00
vikingowl 798f2ab3c3 fix(release): prerelease auto-detect; changelog excludes scoped conventional commits
Release / release (push) Has been cancelled
Two polish issues surfaced by the v0.3.1-rc1 pipeline test:

- The release was tagged v0.3.1-rc1 but published without the
  prerelease flag, so it appeared alongside stable releases. Add
  'prerelease: auto' to release.github so GoReleaser marks any tag
  with a semver prerelease suffix (-rc, -beta, -alpha, -pre)
  appropriately.

- The changelog filters used '^docs:' patterns that only match bare
  conventional commits. Scoped variants like 'docs(readme):' and
  'chore(make):' slipped through into the published changelog.
  Switch to '^docs[:(]' style patterns to match both forms, and add
  '^style[:(]' so gofmt-drift commits are excluded too.
v0.3.1-rc2
2026-05-24 17:05:49 +02:00
vikingowl 9814795b3c ci: migrate release pipeline from Woodpecker to GitHub Actions
Release / release (push) Has been cancelled
Drop the broken .woodpecker/release.yml (top-level when: triggered an
'error' status on every dev push instead of skipping non-tag events)
and replace with .github/workflows/release.yml driving the same
GoReleaser flow.

Rationale:
- Release artifacts already land on GitHub (releases + ghcr.io), so
  running the pipeline on GitHub eliminates a build hop.
- GH Actions auto-provides GITHUB_TOKEN with packages:write via the
  workflow permissions block — no PAT plumbing or login secrets.
- docker/setup-qemu-action and docker/setup-buildx-action handle the
  multi-arch cross-build setup that Woodpecker would require manual
  host configuration for.

Trigger: any tag matching refs/tags/v*. Mirror sync from somegit.dev
propagates tags to GitHub, so 'git push origin v0.3.1' on the canonical
remote still drives the GitHub-side release.
v0.3.1-rc1
2026-05-24 16:45:17 +02:00
vikingowl 047924da2b ci(woodpecker): release pipeline on vX.Y.Z tag
Runs 'go test ./...' then 'goreleaser release --clean' inside the
official goreleaser image when a tag matching refs/tags/v* is pushed.
GITHUB_TOKEN comes from the 'github_token' repo secret (needs repo +
write:packages scopes) and is reused for ghcr.io docker login so the
multi-arch image build can push.

Runner requirements documented inline: docker socket access plus QEMU
registered on the host (tonistiigi/binfmt --install all) for arm64
cross-builds. Directory form chosen so a non-release CI pipeline can
land later under .woodpecker/ci.yml without restructuring.
2026-05-24 16:38:24 +02:00
vikingowl a23eb6b92c style: gofmt drift from prior commits
Pure whitespace cleanup surfaced when 'make check' ran gofmt over the
tree. Mostly struct-field column alignment in internal/safety/banner.go
(SessionInfo) and the var(...) flag block in cmd/gnoma/main.go after
--dangerously-allow-anywhere was added without realignment. Verified
zero substantive changes via 'git diff --ignore-all-space
--ignore-blank-lines'.
2026-05-24 16:33:17 +02:00
vikingowl 0981fb82d6 chore(make): add govulncheck and semgrep to 'make check'
Both checks already passed locally on the current dev tip; wiring them
into the canonical pre-commit gate so security regressions fail fast
instead of leaking into a release.

- 'make vuln' runs govulncheck with reachability analysis against the
  Go vuln DB.
- 'make sec' runs semgrep with p/golang + p/security-audit, metrics
  off, --error so findings exit non-zero.

Tools must be installed locally (commands in Makefile comments). If
upstream Woodpecker CI runs 'make check', it will need both binaries
on the runner image.
2026-05-24 16:30:54 +02:00
vikingowl 3888966e68 fix(deps): bump golang.org/x/net to v0.55.0 to clear reachable CVEs
govulncheck flagged two reachable vulnerabilities in
golang.org/x/net@v0.52.0:

- GO-2026-5026 (idna fails to reject ASCII-only Punycode labels),
  reached via router.DiscoverOllama -> http.Client.Do -> idna.ToASCII.
- GO-2026-4918 (HTTP/2 transport infinite loop on bad
  SETTINGS_MAX_FRAME_SIZE), same call path -> http2.Transport.*.

Bumping to v0.55.0 covers both. Transitive bumps to x/crypto v0.51.0,
x/sys v0.45.0, x/text v0.37.0. Post-bump govulncheck reports 0
reachable vulnerabilities and 0 in directly imported packages.
2026-05-24 16:27:28 +02:00
vikingowl 847cd5fe0c fix(security): use crypto/rand for session-ID suffix
Semgrep flagged math/rand for the /tmp artifact-directory session-ID
generation. Modern Go (1.20+) auto-seeds the global math/rand source
so this wasn't exploitable in practice, but crypto/rand is the
idiomatic choice for any security-adjacent identifier and removes the
finding from future security audits.

Drops the mrand alias entirely; reads 8 random bytes once and masks
to 24 bits to preserve the existing %06x suffix format.
2026-05-24 16:22:50 +02:00
vikingowl 001865f069 fix(env): correct ANTHROPIC_API_KEY typo, add missing vars
The placeholder ANTHROPICS_API_KEY (with trailing S) silently failed:
the auth layer reads ANTHROPIC_API_KEY, so anyone copying .env.example
to .env and pasting their key would see gnoma never pick it up, with
no clear error.

Also surfaces vars that already work but weren't templated:
GOOGLE_API_KEY (alternative to GEMINI_API_KEY), GNOMA_PROVIDER and
GNOMA_MODEL (config overrides), and the two subprocess sandbox bypass
footguns (GNOMA_AGY_BYPASS_PERMISSIONS, GNOMA_CODEX_BYPASS_SANDBOX),
left commented out so they don't accidentally turn on.
2026-05-24 16:16:39 +02:00
vikingowl c1c52f139d docs(readme): add 'no phone-home' bullet and data-flow scope note
Clarify that gnoma itself emits no telemetry to external services
while being explicit that cloud-provider arms send data to those
providers by design. Adds:
- 'No phone-home' bullet to the differentiator list, naming the
  on-device path (Ollama/llama.cpp + --incognito).
- 'Data flow' paragraph to the Security scope-note blockquote so
  the framing is consistent between the hero bullets and the
  Security section.
2026-05-24 16:00:40 +02:00
vikingowl 7040041f13 docs(readme): correct firewall scope; track egress controls in TODO
The 'What makes gnoma different' bullet and Security section both
implied a network-egress firewall. Today the Firewall only enforces a
content boundary (secret scan, Unicode sanitize, redact/block). Reword
both spots and add a Scope note. Surface the gap as a top-of-TODO
entry covering per-session audit log and per-host egress allowlist,
with the open design question (host-level vs per-tool) called out.
Raised via r/SideProject v0.3.0 launch thread.
2026-05-24 15:50:35 +02:00
vikingowl 1828151162 docs(claude): big-picture architecture and expanded test commands
Add a 'Big picture' section summarising the request flow (cmd →
session → engine → router → security/permission → extensibility) so
future Claude Code instances can orient without reading INDEX.md plus
five package directories first. Note that internal/safety and
internal/slm aren't in INDEX.md yet. Document the somegit.dev /
GitHub mirror split and the ruleset that blocks force-push and
deletion on main/dev. Expand build/test section with make check, make
test-integration, single-test, and benchmark commands.
2026-05-24 15:39:23 +02:00
vikingowl b5062d59e9 docs(readme): hero screenshot, differentiators, status, TOC
Add docs/img/gnoma-tui.png as a hero image so visitors see the TUI
above the fold instead of a wall of text. Pull the bandit router,
prefer-policy, SLM, and built-in firewall out of buried sections into
a 'What makes gnoma different' bullet list. Add a Status block flagging
pre-1.0 and a table of contents. Move the pygmy-owl naming note and
upstream/mirror URLs into a footer About section.
2026-05-24 15:39:14 +02:00
vikingowl b13a6a2801 docs(plans): mark v0.3.0 plans shipped
Three plans shipped end-to-end in v0.3.0; removing them from
TODO.md In-flight and adding a Status: shipped header to each
plan doc with the commit references.

Shipped:
- 2026-05-23-routing-defaults-refresh.md
- 2026-05-23-prefer-routing-policy.md
- 2026-05-23-startup-safety-banner.md

Still in flight (telemetry-gated, fires only if measurements
support it):
- 2026-05-23-tool-router-specialization.md
2026-05-23 22:45:05 +02:00
vikingowl 8ba77c1685 fix(safety): env-template precision, label alignment, banner on bypass
Three polish items surfaced during the maintainer's manual smoke
of the previous safety commit.

env-template precision (false-positive fix):
  The "env file" rule matched .env.* universally, which flagged
  conventional templates like .env.example / .env.sample /
  .env.template / .env.dist / .env.default — these hold variable
  NAMES, no values, and are commonly committed. Now skipped.
  Real env files (.env, .env.local, .env.production) still match.
  New envTemplateSuffixes table + isEnvTemplate helper; check runs
  only inside the env-file rule so the suffix denylist is scoped.
  Tests added for both directions: 6 templates that must NOT flag,
  6 real env files that must.

Banner label alignment:
  Field labels were padded to 8 chars except "sensitive" at 9,
  producing visible misalignment in the rendered banner:
      cwd      : /...
      provider : ollama / ...
      sensitive : 0 matches in cwd     <- one extra space
  Padded all labels to 9 chars so the ":" separators line up.

Context banner on bypass:
  --dangerously-allow-anywhere previously suppressed the entire
  safety block, including the informational context banner.
  Bypassing the GATE is not the same as opting out of the info —
  the user still wants to see cwd / git state / sensitive files
  nearby. Restructured the safety block so classification + banner
  always run; the bypass only skips the refuse/warn FLOW. The
  bypass warning log now also includes the classified tier and
  cwd path for diagnostics.
v0.3.0
2026-05-23 22:32:26 +02:00
vikingowl c483656681 docs(plans): fix gnoma one-shot invocation in safety-banner plan
gnoma takes the prompt as a positional argument, not via -p (that's
Claude Code's syntax). Surfaced when the maintainer tried the
manual smoke from the plan's "Definition of done" section and hit
the "flag provided but not defined: -p" error.

  before: gnoma -p "test"
  after:  gnoma "test"

The same wrong syntax appears in the f9094f6 / 3eeb5b4 commit
messages but those are immutable. This commit also serves as the
public record of the typo so future readers don't repeat it.
2026-05-23 22:26:56 +02:00
vikingowl d206b3cf09 docs: routing-prefer + startup-safety user docs, plan tier-shift note
README:
- New "Preferring local vs cloud" subsection under "Routing
  defaults" — table of the three [router].prefer values, priority
  order against forced arm / incognito / Strengths, and the
  CLI-agent-counts-as-local clarification.
- New "Startup safety check" subsection under "Security" — tier
  table, [safety] config block, --dangerously-allow-anywhere flag,
  container detection note, link to the plan doc.

Plan doc (prefer-routing-policy):
- Approach section updated to describe the tier-shift mechanism
  that actually shipped, with a clear "Implementation note"
  explaining why the original score-multiplier approach was
  abandoned (cost-floor math gives local arms a ~280x raw-score
  advantage that any reasonable multiplier can't overcome).
- CLI-agent placement flipped from "non-local" to "local" with
  rationale — implementation chose user-facing behavior axis over
  the privacy axis the original draft used.
- Tier-shift rationale table replacing the multiplier rationale.
- P-3 task rewritten to reflect the actual implementation (checked
  off and pointing at the right code), with the policyMultiplier
  helper noted as a within-tier nudge of limited present effect.

The implementation-vs-plan deviation is now documented in both the
plan doc and the original feature commit message (f9094f6). Future
readers reach the same understanding via either path.
2026-05-23 22:23:57 +02:00
vikingowl 3eeb5b46d7 feat(safety): pre-launch cwd classifier + context banner
Implements S-1 through S-7 of the startup-safety-banner plan.

Adds a pre-launch safety check that classifies the current working
directory into three tiers and gates the launch:

  TierRefuse  /, /etc, /sys, /proc, /usr, /var, /bin, /sbin, /boot,
              /root, /dev (Linux) and /System, /Library, /private,
              /Applications (macOS). Refuses with exit 2 unless
              --dangerously-allow-anywhere is passed.

  TierWarn    $HOME, ~/Desktop, ~/Downloads, ~/Documents, ~/.config,
              ~/.local, ~/.cache, /tmp, and similar dumping grounds.
              Prints a banner and reads a single y/Y from stdin to
              confirm; any other input (or EOF, including piped/
              scripted invocation) aborts with exit 1.

  TierOK      Anywhere with a recognized project marker (.gnoma/,
              go.mod, package.json, pyproject.toml, Cargo.toml,
              Makefile, Dockerfile, build.gradle*, pom.xml) or
              inside a git repo. No prompt; banner only.

Project markers and git-repo presence override the TierWarn check —
a project dir inside $HOME stays TierOK. The require_project_marker
config knob can flip that for strict users.

Container detection: when /.dockerenv or /run/.containerenv exists,
TierRefuse downgrades to TierWarn (devcontainers often chroot to /
or similar). Best-effort; false positives only soften the gate.

The context banner is always rendered (TierOK, TierWarn, TierRefuse
alike) and summarizes: cwd, git branch + dirty state, project type,
provider/model, modes (permission, incognito, prefer), and a
top-level sensitive-file inventory. Inventory matches .env,
.env.*, env.local; private-key extensions (.pem, .key, .crt, .p12,
.pfx); SSH key names (id_rsa, id_ed25519, ...); credentials files;
.netrc / .pgpass; KeePass vaults; and .ssh/ .aws/ .kube/ .gcloud/
.azure/ .docker/ directories. Precision-tested: .envrc and
secret_handler.go do NOT match. Bounded at 1000 entries.

Architecture:
- internal/safety/cwd.go — Classification + symlink-resolving tier
  classifier with platform-specific roots and container detection.
- internal/safety/sensitive.go — pattern-based top-level scanner,
  deterministic ordering, scanLimit guard against pathological dirs.
- internal/safety/banner.go — pure render functions for the warn
  prefix, refuse message, and context banner. Safe for golden-string
  testing.
- internal/config/config.go — new [safety] section with three
  config keys, defaults applied via ResolvedSafety() helper. Pointer
  fields distinguish "user omitted" from "user set to false."
- cmd/gnoma/main.go — gate runs after subcommand dispatch (so
  `gnoma providers / profile / slm / router` skip the prompt) and
  before provider creation. --dangerously-allow-anywhere bypasses
  the gate with an explicit log warning.

The runtime keypress reads up to 8 bytes from os.Stdin and accepts
only "y" / "Y" trimmed; EOF returns false (piped invocations
without the flag will abort). Documented in the readYesConfirmation
helper. Manual smoke (per plan):
  - `cd / && gnoma -p test` → refuses
  - `cd ~ && gnoma` → warns + keypress
  - `cd ~/git/some-repo && gnoma` → banner only
  - subcommands skip the gate entirely

Linux + macOS classification; Windows path handling deferred per
plan (treated as TierOK there until follow-up).

Refs: docs/superpowers/plans/2026-05-23-startup-safety-banner.md
2026-05-23 22:19:39 +02:00
vikingowl f9094f68f3 feat(router): [router].prefer = local | cloud | auto
Implements P-1 through P-6 of the prefer-routing-policy plan.

Adds a config knob that biases routing toward local arms, cloud
arms, or leaves selection unchanged. Default "auto" is
byte-identical to pre-change behavior (the new armTier path with
PreferAuto returns the same value as the old single-arg function).

Mechanism diverged from the plan after empirical testing:

The plan called for a score multiplier applied in bestScored.
Tests revealed the existing cost-floor math (scoreArm divides by
weighted cost which collapses to ~0.001 for free local arms) gives
local arms a ~280x raw-score advantage that a 0.3-0.5 multiplier
can't overcome. A tier-shift in armTier turned out cleaner:

  PreferLocal: cloud arms (true API, IsLocal=false && !IsCLIAgent)
               get +2 tier shift, landing behind locals.
  PreferCloud: IsLocal arms get +2 tier shift, landing behind
               cloud. SLM tier-0 arms shift to tier 2 — still
               below cloud's tier 3 — so the SLM-protection
               semantic (small stuff stays on the small model)
               survives PreferCloud. This matches the open
               question in the plan, now resolved as: yes, SLMs
               keep winning under PreferCloud by design.

The policyMultiplier was kept in bestScored as a within-tier
nudge (mostly cosmetic in practice given the cost-floor dynamics
described above; could matter when costs are calibrated). Worth
revisiting once router-wide cost calibration lands.

Strengths cross-tier promotion is unaffected: the promoted-set
path in selectBest bypasses armTier entirely, so a strongly-tagged
cloud arm still wins SecurityReview tasks under PreferLocal
(validated by TestPreferPolicy_StrengthsBeatsMultiplier).

CLI-agent subprocess arms count as "local" for PreferLocal
purposes — they proxy to cloud but the user-visible behavior is
local. Users who want to exclude them can use --provider X.

Forced arms (--provider X) and incognito take priority over the
policy: forced arm test pins this, incognito-still-wins test pins
the LocalOnly hard filter dominating PreferCloud.

Test coverage (prefer_test.go): ParsePreferPolicy / String round
trips; policyMultiplier table; acceptance scenarios across all
three policies with adjacent-tier arms; SLM-still-wins under
PreferCloud; Strengths beats multiplier; forced-arm bypass;
incognito beats prefer; lone cloud arm wins when no local feasible.

Refs: docs/superpowers/plans/2026-05-23-prefer-routing-policy.md
2026-05-23 22:13:26 +02:00
vikingowl 162c8b1017 docs(plans): prefer-routing-policy and startup-safety-banner
Two parallel pre-flight plans surfaced in the 2026-05-23 session,
both deferred while the routing-defaults-refresh implementation
landed. Drafted as separate plans because they're independent:
the prefer-policy is a router scoring change; the safety banner is
a launch-time check that never touches the router.

prefer-routing-policy
  [router].prefer = "local" | "cloud" | "auto" — soft score
  multiplier (0.3 / 0.5 / 1.0) biasing toward local or cloud arms
  while preserving Strengths cross-tier promotion and bandit
  learning. Default "auto" is byte-identical to current behavior.
  Forced arms and incognito retain priority. CLI-agent subprocess
  arms count as non-local for this knob (they proxy to cloud).

startup-safety-banner
  Three-tier cwd classification at launch — refuse in /etc /sys
  and other system roots; warn+keypress in $HOME, /tmp, ~/Desktop,
  ~/Downloads; OK inside any git repo or directory with a project
  marker (.gnoma/, go.mod, package.json, etc.). Always shows a
  context banner with cwd, git state, model, modes, and a
  top-level sensitive-file inventory (.env, id_rsa, *.pem, .ssh/,
  etc. — informational only, no recursion, capped at 1000 entries).
  Bypass via --dangerously-allow-anywhere. Complements the in-flight
  sensitive-content unified-policy TODO item: this is the pre-flight
  layer, that is the runtime input-path layer.

Both plans default-on with safe defaults; both have explicit
out-of-scope sections to prevent scope creep during implementation.
Linux + macOS first; Windows path classification deferred.

TODO.md surfaces both as in-flight.
2026-05-23 22:00:21 +02:00
vikingowl c99b2c64ad docs(readme): document routing defaults table and [[arms]] overrides
Closes R-8 of the routing-defaults plan. Adds a new "Routing
defaults" section between Config and SLM that documents what arms
ship with out-of-the-box — the family-keyed Strengths /
MaxComplexity / CostWeight matrix plus the non-chat exclude list.

Also introduces the [[arms]] override block in the README for the
first time (previously undocumented), showing how users keep
priority over the defaults.

Links back to the plan doc for the benchmark sources and per-entry
rationale.
2026-05-23 21:42:05 +02:00
vikingowl 2f8d4c412f feat(router): cloud-arm defaults, gpt-5.3-codex registration
Closes R-4 and R-5 of the routing-defaults plan.

R-4: Strengths + CostWeight defaults for closed frontier models.
Cloud entries land in the same knownFamilyDefaults table as local
ones, with MaxComplexity intentionally left zero (cloud arms get
no complexity ceiling). CostWeight tuned per the plan's rationale:

  claude-opus-4-7    → Planning/SecurityReview/Debug/Refactor, 0.3
  claude-sonnet-4-6  → Generation/Refactor/Review,             0.7
  gpt-5.5            → Planning/SecurityReview/Generation,     0.3
  gpt-5.3-codex      → Generation/Refactor/Debug/UnitTest,     0.6
  gpt-5.2            → Orchestration/Review,                   0.8
  gemini-3.1-pro     → Planning/Review/Orchestration,          0.5
  gemini-3.5-flash   → Boilerplate/Explain/Orchestration,      1.2

The 0.3 weight on frontier arms keeps them competitive on
SecurityReview / Planning despite $4+/Mtok; 1.2 on Gemini Flash
penalizes cost more so it only wins when cost is genuinely
decisive (boilerplate, explain).

Mechanism: extracted applyFamilyDefaults into defaults.go and call
it from Router.RegisterArm. Single source of truth — both local
discovery and the primary-provider path in cmd/gnoma/main.go now
flow through the same defaults application. Removed the duplicate
apply block from RegisterDiscoveredModels.

Legacy model IDs (claude-opus-4-20250514, gpt-4o, o3, gemini-2.5-pro,
etc.) intentionally do not match any table entry — keeps users on
pinned older models safe from imposed 2026 Strengths.

R-5: gpt-5.3-codex registration.

  - internal/provider/openai/provider.go: added to fallbackModels
    and inferOpenAIModelCapabilities (400K context, 32K output).
  - internal/provider/ratelimits.go: gpt-5.3-codex and its dated
    alias gpt-5.3-codex-2026-02-15 added with the same Tier 1
    quotas as gpt-5.2.

Gemini 3.x (3.1-pro-preview, 3.5-flash, 3.1-flash-lite) was already
registered in both google/provider.go and ratelimits.go — no change
needed for that part of R-5.

Test coverage:
- ResolveFamilyDefaults table-driven across all 7 cloud entries
  including prefix-sharing (gpt-5.5-pro → gpt-5.5 defaults,
  gemini-3.1-pro-preview → gemini-3.1-pro defaults).
- Legacy IDs return !ok.
- RegisterArm applies cloud defaults end-to-end.
- User-supplied Strengths and CostWeight are not overridden.
- ID.Model() fallback works when ModelName is empty (test code
  often constructs arms this way).

Refs: docs/superpowers/plans/2026-05-23-routing-defaults-refresh.md
2026-05-23 21:39:48 +02:00
vikingowl 9bb775a4aa feat(router): full local family defaults table with size-keyed ceilings
Expands the family-defaults scaffold to 23 entries covering the local
models that currently appear in real Ollama fleets: coder specialists
(qwen3-coder, devstral, qwen2.5-coder, yi-coder, deepseek-coder,
starcoder), reasoners (phi-4, phi-4-mini), Gemma 2/3/4 (including the
"edge" e2b/e4b variants under both Ollama and GGUF naming), Qwen
2.5/3/3.5 with a catch-all qwen entry, Mistral/Ministral (incl. the
24B mistral-small-3), Llama 3.2/4, tiny3.5 (reec's distill family),
Granite, GLM (incl. glm-ocr specialist), and MiniCPM-V.

Five families that span wide parameter ranges (qwen3.5, qwen3,
qwen2.5, ministral-3, tiny3.5) now use SizeCap ladders instead of a
flat MaxComplexity. A new parseSizeFromModelID helper splits the
model ID on :/-_/ and matches pure <N>b/<N>m tokens, correctly
ignoring qwen3.5 version strings, e2b edge tags, a3b MoE active
params, and v0.3 version suffixes.

ResolveMaxComplexity wraps ResolveFamilyDefaults plus the SizeCap
traversal, falling back to the smallest cap when size parsing fails
(conservative). Discovery's apply path now goes through it so
SizeCap entries actually take effect.

Test coverage:
- parseSizeFromModelID (11 cases)
- ResolveFamilyDefaults longest-prefix discipline (19 cases)
- Unknown-family fallback returns !ok
- ResolveMaxComplexity size-keyed ladder (13 cases)
- Size-parse-failure fallback
- knownFamilyDefaults invariants: SizeCaps ordered largest-first,
  SizeCaps and MaxComplexity mutually exclusive per entry
- Routing-payoff integration: 3 arms (tiny3.5:1.5b, phi-4:14b,
  qwen3-coder:30b) get picked for TaskGeneration / TaskPlanning /
  TaskBoilerplate respectively, without any [[arms]] config
- Local fleet visibility: the maintainer's actual `ollama ls`
  inventory registers correctly with expected MaxComplexity and
  Strengths; embeddinggemma stays filtered out

The Planning sub-case surfaced a separate issue worth flagging:
heuristicQuality floors out at 0.55 for a generic 14B local model
without ThinkingModes, below TaskPlanning's 0.60 threshold. The test
mutates phi-4's capabilities post-registration to reflect reality
(phi-4 is reasoning-tuned). A discovery-side thinking-capability
detection is out of scope for this plan but flagged in the test
comment for follow-up.

Refs: docs/superpowers/plans/2026-05-23-routing-defaults-refresh.md
2026-05-23 21:34:09 +02:00
vikingowl a79e99199d feat(router): non-chat exclude, vision prefixes, family-defaults scaffold
Discovery previously registered every model returned by Ollama as a
chat arm, including embeddings, ASR, TTS, audio realtime, and
rerankers — which then failed at inference time when the router
selected them. Local arms also shipped with all-zero defaults, so
selection between e.g. tiny3.5:1.5b, phi-4:14b, and qwen3-coder:30b
was effectively random.

This change covers tasks R-1, R-2, R-6 from the routing-defaults plan.

- nonChatModelPatterns + isNonChatModel substring matcher; matched
  IDs are skipped during RegisterDiscoveredModels. Covers whisper,
  moonshine, kokoros, vibevoice, -asr, -tts, -audio, -embedding,
  embeddinggemma, -reranker, lfm2.
- knownVisionModelPrefixes gains gemma4, gemma-4, glm-ocr. gemma3
  and minicpm-v entries stay for regression coverage.
- New internal/router/defaults.go with FamilyDefaults struct,
  knownFamilyDefaults map, and ResolveFamilyDefaults longest-prefix
  lookup (with org/-namespace stripping so reecdev/tiny3.5:1.5b
  resolves to "tiny3.5"). Single entry for now: functiongemma is
  registered with Disabled=true and MaxComplexity=0.40, reserved for
  the future ArmRoleToolRouter path. Table will grow in R-3.
- RegisterDiscoveredModels consults ResolveFamilyDefaults and only
  populates fields that are still zero on the arm, so user [[arms]]
  overrides keep priority.

Plans:
- docs/superpowers/plans/2026-05-23-routing-defaults-refresh.md
- docs/superpowers/plans/2026-05-23-tool-router-specialization.md

TODO.md surfaces both as in-flight items.
2026-05-23 21:24:59 +02:00
vikingowl 1606d19366 feat(subprocess/codex): account for cached and reasoning tokens
codex 0.133.0 emits two token-accounting fields at top level that
we previously dropped:

  cached_input_tokens   — subset of input_tokens that hit the prompt
                          cache (cheaper, but still counted in
                          input_tokens per OpenAI Responses API
                          semantics)
  reasoning_output_tokens — separately reported billable thinking
                          tokens on reasoning-capable models

Map cached_input_tokens to message.Usage.CacheReadTokens and subtract
it from InputTokens. message.Usage.Add() sums InputTokens and
CacheReadTokens as peers, so the uncached residual goes in
InputTokens — matches the anthropic provider's convention and keeps
cumulative usage tracking arithmetically correct.

Fold reasoning_output_tokens into OutputTokens for accurate cost
tracking. The top-level peer positioning (vs nested in
output_tokens_details) implies a separately counted billable
quantity, not a subset of output_tokens.

Defensive clamp at zero in case a future codex build reports
cached > input due to schema drift. Includes a verbatim regression
guard against the live 2026-05-22 codex 0.133.0 output to catch
schema changes early.
2026-05-22 13:35:57 +02:00
vikingowl fe24907ce5 docs(readme): refresh post-v0.2.1 with badges and v0.2.x features
- Add for-the-badge style shields (release, license, Go 1.26+, GHCR)
- Drop the "until the first tag is cut" line that's been stale since
  v0.1.0 shipped on 2026-05-20
- Add a Vision / image input section covering Ctrl+V paste, literal
  [Image: /path] markers, the 10 MiB cap, the incognito carve-out,
  and the router's Vision capability gating
- Add a Subprocess sandbox bypass subsection under Providers
  documenting GNOMA_AGY_BYPASS_PERMISSIONS and
  GNOMA_CODEX_BYPASS_SANDBOX as deliberate footguns
- Add an Entropy false-positive reduction subsection under Security
  showing the [security].entropy_safelist opt-in (Phase F-1) and
  noting the per-pattern Debug telemetry that feeds F-2 gating
2026-05-22 13:21:31 +02:00
vikingowl 847ec159d7 chore(deps): promote cloud.google.com/go/auth and atotto/clipboard to direct
go mod tidy (triggered by GoReleaser's before hook) correctly
promoted both modules from indirect to direct: cloud.google.com/go/auth
is imported by internal/provider/google for the ADC credential
walk, and github.com/atotto/clipboard is imported by internal/tui
for image-paste handling. Listing them as direct reflects actual
usage and prevents tooling from suggesting their removal.
2026-05-22 13:06:36 +02:00
vikingowl 9ceddd39c1 chore(todo): track dockers_v2 migration under distribution follow-ups
GoReleaser is phasing out the dockers + docker_manifests pair in
favour of dockers_v2, which collapses our four-block setup into
one. The migration also touches Dockerfile (per-platform binary
layout in the build context), so it's worth scheduling as its own
commit rather than a release-time rush.
2026-05-22 13:06:24 +02:00
vikingowl 3f74b6e362 fix(release): point GHCR image.source at GitHub mirror
GHCR's package page auto-links to a GitHub repo via the
org.opencontainers.image.source label. The previous value
pointed at the Gitea canonical (somegit.dev/Owlibou/gnoma),
which GHCR can't resolve — so the package page just showed a
"Link this package to a repository" prompt and contributors,
Readme, and discussions never auto-populated.

Swap the two URL labels: source now points at the GitHub
mirror, url keeps the Gitea canonical reference. Both arch
build blocks updated.

Takes effect on the next release (v0.2.0 images already shipped
with the old labels and stay as-is).
v0.2.1
2026-05-22 13:03:36 +02:00
vikingowl 49d80cf847 feat(security): format-aware entropy safelist (Phase F-1)
Add a deterministic pre-extractor that skips known-safe token shapes
before they reach the entropy scorer. Targets the false-positive
regime that bites under lowered entropy_threshold or
redact_high_entropy = true — UUIDs (~3.4 bits), SHA hex digests
(~3.9 bits), ISO-8601 timestamps, and HTTP(S) URLs.

Config knob lives under the existing security section to match
entropy_threshold / redact_high_entropy convention:

  [security]
  entropy_safelist = ["uuid", "sha_hex", "iso8601", "url"]

Empty / unset preserves pre-F-1 behaviour exactly — users opt in.

Per-pattern Debug telemetry fires on every skip (pattern name +
token length, never the token bytes). This is the data F-2's
go/no-go gate depends on; the plan literally specifies it.

NewFirewall validates names at the config boundary and emits a
Warn for unknown entries so a typo like "uid" instead of "uuid"
surfaces loudly instead of silently disabling FP reduction.

Tests cover: UUID/SHA-1/SHA-256 skipped at lowered threshold,
mixed payload (safe shape + real secret) preserves the secret,
secret-adjacent-to-UUID regression guard, empty safelist preserves
pre-F-1 behaviour, unknown name silently dropped at scanner level
but warned at firewall level, end-to-end FirewallConfig wiring,
and the skip-telemetry log line.

F-2 remains gated on real-workload FP-rate observations.
v0.2.0
2026-05-22 12:39:10 +02:00
vikingowl ea1a5361e2 chore: restore agy JSON-output TODO; idiomatic t.TempDir() in google test
The worktree commit 12a6b83 dropped the "Native agy JSON output"
backlog item alongside removing the agy agent. Since we restored
agy in this branch, the TODO is relevant again — agy v1.0.0 still
emits plain text and the prompt-augmentation fallback should be
replaced by --output-format stream-json once the CLI supports it.

Switch TestTryLoadOAuthCredentials_Formats to t.TempDir() to drop
the unchecked os.RemoveAll defer that golangci-lint's errcheck
caught after the merge.
2026-05-22 12:17:10 +02:00
vikingowl 246997c4be Merge branch 'feat/agy-sdk-integration' into dev
Brings in the Google auth precedence work (agy > gemini > ADC
credential walk, fileTokenProvider expiry handling, slog-backed
error reporting), the Codex CLI integration as a new subprocess
agent, and the restoration of the agy subprocess agent that was
accidentally removed by the initial codex commit. Sandbox-bypass
flags on both agy and codex are now opt-out via env vars
(GNOMA_AGY_BYPASS_PERMISSIONS, GNOMA_CODEX_BYPASS_SANDBOX).

Includes review-driven fixes:
- ADC fallback now uses real DetectOptions (cloud-platform scope)
- fileTokenProvider returns an error on expired tokens instead
  of shipping a known-dead bearer
- TestNew_Precedence asserts which credential was actually picked
- codex parser tolerates non-JSON banner / debug lines on stdout
- codex usage takes max(input_tokens, prompt_tokens) so accounting
  can't silently undercount

No conflicts expected with the dev image-content feature: the
worktree branch only touches the google and subprocess provider
families.
2026-05-22 12:15:32 +02:00
vikingowl 0975bf7118 docs(readme): list codex and vibe alongside claude/gemini/agy
The subprocess CLI table only mentioned three agents; the full set
now is claude, gemini, agy, codex, and vibe (Mistral). Bring the
documentation in line with knownAgents.
2026-05-22 12:15:01 +02:00
vikingowl afc31b0af4 fix(subprocess): restore agy alongside codex; env-gate sandbox bypass
The original commit on this branch replaced the agy subprocess agent
with codex (overwriting the slot in knownAgents, deleting agy_test.go
and the agyParser). That was unintentional — agy (antigravity) is a
distinct CLI from codex (OpenAI's). Antigravity will replace gemini
when gemini retires on 2026-06-16, so it needs to keep its own slot.

Restored: FormatAgyText constant, agyParser with newAgyParser and
the line-delimited text parser, the agy CLIAgent entry in
knownAgents with PromptResponseFormat:true, agy_test.go, and the
agy case in newParser. Sourced from the parent commit so behavior
matches what shipped before the codex change.

Sandbox bypass: both agy (--dangerously-skip-permissions) and codex
(--dangerously-bypass-approvals-and-sandbox) need a flag to run
non-interactively (their stdin is closed; without it they block on
approval prompts nobody can answer). Both default to ON for
out-of-box behavior; operators with pre-approved trust config can
opt out via GNOMA_AGY_BYPASS_PERMISSIONS=0 or
GNOMA_CODEX_BYPASS_SANDBOX=0. Tests cover the on / opt-out / unknown
value branches.

TestKnownAgents_ValidFormats updated to accept the restored
FormatAgyText.
2026-05-22 12:14:54 +02:00