docs: switch recommended SLM from reecdev/tiny3.5:500m to qwen3:0.6b

Empirical comparison on 2026-05-25 across five candidate SLMs on two identical prompts (trivial: 'what is 2+2'; knowledge: 'explain a multi-armed bandit'):

  qwen3:0.6b              consistent across both prompts
  functiongemma:270m      works on the trivial prompt, derails on knowledge prompts
  gemma3:1b               unusable (emits just '{' or invented keys)
  reecdev/tiny3.5:1.5b    unusable (ignores /no_think, leaks <Thought Process> blocks)
  qwen2.5-coder:1.5b      unusable (ignores classifier prompt, answers in prose)

qwen3:0.6b honours Qwen3's native /no_think flag (the distillation in the old default did not), is smaller than the previous recommendation (520 MB vs 1 GB), and was the only candidate to classify both test prompts successfully without falling back to the heuristic.

The README quickstart block, the slm-backends.md presets, and the status output sample are all switched.

Also documents register_as_arm (default true, set false for task-specialised models like FunctionGemma) and classify_timeout (default 15s) in the example configs, since both landed in v0.3.3+.

Code defaults for the tiny3.5 family in internal/router/defaults.go are unchanged; that table still applies when users have tiny3.5 registered as a routing arm independent of the SLM role.
2026-05-25 02:43:11 +02:00
parent fd327107df
commit 7213a1e2fd
2 changed files with 30 additions and 13 deletions
@@ -364,9 +364,12 @@ gnoma can run a tiny local model alongside the main provider to:

 ```toml
 [slm]
-enabled = true
-backend = "auto"           # ollama | llamacpp | llamafile | openaicompat | auto | disabled
-model   = "reecdev/tiny3.5:500m"
+enabled         = true
+backend         = "auto"      # ollama | llamacpp | llamafile | openaicompat | auto | disabled
+model           = "qwen3:0.6b"
+register_as_arm = true        # default; set to false to make the SLM classifier-only
+                              # (e.g. for FunctionGemma, code-completion-tuned models)
+classify_timeout = "15s"      # default; bump higher for slow cold-loads
 ```

 Setup, presets, and verification: [docs/slm-backends.md](docs/slm-backends.md).
@@ -24,27 +24,41 @@ The "ollama" path is the easiest if you're already running a local model — it

 ## Presets

-Presets use `reecdev/tiny3.5:500m` as the default model — a 500 M-parameter Qwen3.5 distillation with tool support, available on Ollama. Pull it once with:
+Presets use `qwen3:0.6b` as the default model — a 600 M-parameter Qwen3 instruction-tuned model with native `/no_think` support, available on Ollama. Pull it once with:

 ```bash
-ollama pull reecdev/tiny3.5:500m   # ~1 GB
-# or the 1.5 B variant for slightly better quality:
-ollama pull reecdev/tiny3.5:1.5b   # ~3 GB
+ollama pull qwen3:0.6b           # ~520 MB
 ```

+### Model choice notes
+
+Empirical testing (2026-05-25) across five candidate SLMs on identical prompts:
+
+| Model | Classifier success | Notes |
+|---|---|---|
+| `qwen3:0.6b` | consistent across trivial + knowledge prompts | recommended default; honours `/no_think` cleanly |
+| `functiongemma:270m` | works on trivial prompts, derails on knowledge ones | needs function-signature prompt rewrite or LoRA fine-tune to be reliable |
+| `gemma3:1b` | unusable | emits malformed JSON (just `{` or invented keys) |
+| `reecdev/tiny3.5:1.5b` | unusable | thinking-mode distillation; ignores `/no_think` and emits `<Thought Process>` blocks |
+| `qwen2.5-coder:1.5b` | unusable | code-completion-tuned; ignores the classifier prompt entirely and answers in prose |
+
 Substitute any small Ollama model you prefer. The probe at startup reads each model's actual capability — `tools` enables the SLM arm to handle simple file reads; without it, the SLM only handles knowledge-only prompts.

+If your SLM is task-specialised (function-call models like FunctionGemma; embedding-only models; code-completion-tuned models) and produces wrong-shape output when asked to answer a general prompt, set `register_as_arm = false` so the SLM stays classifier-only and execution routes to other local arms.
+
 ### Preset 1 — Ollama (recommended for most users)

 ```toml
 [slm]
-enabled = true
-backend = "ollama"
-model   = "reecdev/tiny3.5:500m"
+enabled         = true
+backend         = "ollama"
+model           = "qwen3:0.6b"
+register_as_arm = true              # default; set false for classifier-only models
+classify_timeout = "15s"            # default; bump for slow cold-load
 # base_url defaults to http://localhost:11434
 ```

-Prereq: `ollama pull reecdev/tiny3.5:500m` (or any model you'd rather use).
+Prereq: `ollama pull qwen3:0.6b` (or any model you'd rather use).

 ### Preset 2 — llama.cpp server

@@ -150,10 +164,10 @@ Output looks like:
 ```
 slm enabled: true
 slm backend: ollama
-  model:   reecdev/tiny3.5:500m
+  model:   qwen3:0.6b

 live probe:
-  ✓ ollama ready (model=reecdev/tiny3.5:500m, boot=0s)
+  ✓ ollama ready (model=qwen3:0.6b, boot=0s)
 ```

 Run a few prompts, then check: