From 7213a1e2fd76e05b3ff4e08269ba261d9e31470c Mon Sep 17 00:00:00 2001
From: vikingowl <christian@nachtigall.dev>
Date: Mon, 25 May 2026 02:43:11 +0200
Subject: [PATCH] docs: switch recommended SLM from reecdev/tiny3.5:500m to
 qwen3:0.6b
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Empirical comparison on 2026-05-25 across five candidate SLMs on two
identical prompts (trivial: 'what is 2+2'; knowledge: 'explain a
multi-armed bandit'):

  qwen3:0.6b           consistent across both prompts
  functiongemma:270m   works trivial, derails on knowledge prompts
  gemma3:1b            unusable (emits just '{' or invented keys)
  reecdev/tiny3.5:1.5b unusable (ignores /no_think, leaks <Thought Process> blocks)
  qwen2.5-coder:1.5b   unusable (ignores classifier prompt, answers in prose)

qwen3:0.6b honours Qwen3's native /no_think flag (the distillation in
the old default did not), is smaller than the previous recommendation
(520 MB vs 1 GB), and was the only candidate to classify both test
prompts successfully without falling back to the heuristic.

The README quickstart block, the slm-backends.md presets, and the
status output sample are all switched. The example configs now also
document register_as_arm (default true; set false for task-specialised
models like FunctionGemma) and classify_timeout (default 15s), since
both landed in v0.3.3+.

Code defaults for the tiny3.5 family in internal/router/defaults.go
are unchanged — that table still applies when users have tiny3.5
registered as a routing arm independent of the SLM role.
---
 README.md            |  9 ++++++---
 docs/slm-backends.md | 34 ++++++++++++++++++++++++----------
 2 files changed, 30 insertions(+), 13 deletions(-)
diff --git a/README.md b/README.md
index 24077c0..e637a24 100644
--- a/README.md
+++ b/README.md
@@ -364,9 +364,12 @@ gnoma can run a tiny local model alongside the main provider to:
 
 ```toml
 [slm]
-enabled = true
-backend = "auto"           # ollama | llamacpp | llamafile | openaicompat | auto | disabled
-model   = "reecdev/tiny3.5:500m"
+enabled          = true
+backend          = "auto"     # ollama | llamacpp | llamafile | openaicompat | auto | disabled
+model            = "qwen3:0.6b"
+register_as_arm  = true       # default; set to false to make the SLM classifier-only
+                              # (e.g. for FunctionGemma, code-completion-tuned models)
+classify_timeout = "15s"      # default; bump higher for slow cold-loads
 ```
 
 Setup, presets, and verification: [docs/slm-backends.md](docs/slm-backends.md).
diff --git a/docs/slm-backends.md b/docs/slm-backends.md
index ae901a7..59854a6 100644
--- a/docs/slm-backends.md
+++ b/docs/slm-backends.md
@@ -24,27 +24,41 @@ The "ollama" path is the easiest if you're already running a local model — it
 
 ## Presets
 
-Presets use `reecdev/tiny3.5:500m` as the default model — a 500 M-parameter Qwen3.5 distillation with tool support, available on Ollama. Pull it once with:
+Presets use `qwen3:0.6b` as the default model — a 600 M-parameter Qwen3 instruction-tuned model with native `/no_think` support, available on Ollama. Pull it once with:
 
 ```bash
-ollama pull reecdev/tiny3.5:500m   # ~1 GB
-# or the 1.5 B variant for slightly better quality:
-ollama pull reecdev/tiny3.5:1.5b   # ~3 GB
+ollama pull qwen3:0.6b           # ~520 MB
 ```
 
+### Model choice notes
+
+Empirical testing (2026-05-25) across five candidate SLMs on two identical prompts (one trivial, one knowledge):
+
+| Model | Classifier success | Notes |
+|---|---|---|
+| `qwen3:0.6b` | consistent across trivial + knowledge prompts | recommended default; honours `/no_think` cleanly |
+| `functiongemma:270m` | works on trivial prompts, derails on knowledge ones | needs function-signature prompt rewrite or LoRA fine-tune to be reliable |
+| `gemma3:1b` | unusable | emits malformed JSON (just `{` or invented keys) |
+| `reecdev/tiny3.5:1.5b` | unusable | thinking-mode distillation; ignores `/no_think` and emits `<Thought Process>` blocks |
+| `qwen2.5-coder:1.5b` | unusable | code-completion-tuned; ignores the classifier prompt entirely and answers in prose |
+
 Substitute any small Ollama model you prefer. The probe at startup reads each model's actual capability — `tools` enables the SLM arm to handle simple file reads; without it, the SLM only handles knowledge-only prompts.
 
+If your SLM is task-specialised (function-call models like FunctionGemma; embedding-only models; code-completion-tuned models) and produces wrong-shape output when asked to answer a general prompt, set `register_as_arm = false` so the SLM stays classifier-only and execution routes to other local arms.
+
 ### Preset 1 — Ollama (recommended for most users)
 
 ```toml
 [slm]
-enabled = true
-backend = "ollama"
-model   = "reecdev/tiny3.5:500m"
+enabled          = true
+backend          = "ollama"
+model            = "qwen3:0.6b"
+register_as_arm  = true             # default; set false for classifier-only models
+classify_timeout = "15s"            # default; bump for slow cold-load
 # base_url defaults to http://localhost:11434
 ```
 
-Prereq: `ollama pull reecdev/tiny3.5:500m` (or any model you'd rather use).
+Prereq: `ollama pull qwen3:0.6b` (or any model you'd rather use).
 
 ### Preset 2 — llama.cpp server
 
@@ -150,10 +164,10 @@ Output looks like:
 ```
 slm enabled: true
 slm backend: ollama
-  model:   reecdev/tiny3.5:500m
+  model:   qwen3:0.6b
 
 live probe:
-  ✓ ollama ready (model=reecdev/tiny3.5:500m, boot=0s)
+  ✓ ollama ready (model=qwen3:0.6b, boot=0s)
 ```
 
 Run a few prompts, then check: