From 7213a1e2fd76e05b3ff4e08269ba261d9e31470c Mon Sep 17 00:00:00 2001
From: vikingowl
Date: Mon, 25 May 2026 02:43:11 +0200
Subject: [PATCH] docs: switch recommended SLM from reecdev/tiny3.5:500m to qwen3:0.6b
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Empirical comparison on 2026-05-25 across five candidate SLMs on
identical prompts (two prompts: trivial 'what is 2+2' + knowledge
'explain a multi-armed bandit'):

  qwen3:0.6b            consistent across both prompts
  functiongemma:270m    works on trivial, derails on knowledge prompts
  gemma3:1b             unusable (emits just '{' or invented keys)
  reecdev/tiny3.5:1.5b  unusable (ignores /no_think, leaks <think> blocks)
  qwen2.5-coder:1.5b    unusable (ignores classifier prompt, answers in prose)

qwen3:0.6b honours Qwen3's native /no_think flag (the distillation in
the old default did not), is smaller than the previous recommendation
(520 MB vs 1 GB), and was the only candidate to classify both test
prompts successfully without falling back to the heuristic.

README quickstart block + slm-backends.md presets + status output
sample all switched. Also documents register_as_arm (default true,
set false for task-specialised models like FunctionGemma) and
classify_timeout (default 15s) in the example configs since both
landed in v0.3.3+.

Code defaults for the tiny3.5 family in internal/router/defaults.go
are unchanged; that table still applies when users have tiny3.5
registered as a routing arm independent of the SLM role.
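For reference, a minimal sketch of the classifier-only setup that the
register_as_arm note describes. It assumes FunctionGemma is pulled under
the functiongemma:270m tag used in the comparison; the 30s timeout is an
illustrative value, not a measured recommendation:

```toml
[slm]
enabled          = true
backend          = "ollama"
model            = "functiongemma:270m"   # assumed tag, taken from the comparison
register_as_arm  = false                  # classifier-only: not offered as a routing arm
classify_timeout = "30s"                  # illustrative; the default is 15s
```

With register_as_arm = false the SLM should stay classifier-only, and
execution routes to the other local arms.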
---
 README.md            |  9 ++++++---
 docs/slm-backends.md | 34 ++++++++++++++++++++++++----------
 2 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/README.md b/README.md
index 24077c0..e637a24 100644
--- a/README.md
+++ b/README.md
@@ -364,9 +364,12 @@ gnoma can run a tiny local model alongside the main provider to:
 
 ```toml
 [slm]
-enabled = true
-backend = "auto" # ollama | llamacpp | llamafile | openaicompat | auto | disabled
-model = "reecdev/tiny3.5:500m"
+enabled          = true
+backend          = "auto"   # ollama | llamacpp | llamafile | openaicompat | auto | disabled
+model            = "qwen3:0.6b"
+register_as_arm  = true     # default; set to false to make the SLM classifier-only
+                            # (e.g. for FunctionGemma, code-completion-tuned models)
+classify_timeout = "15s"    # default; bump higher for slow cold-loads
 ```
 
 Setup, presets, and verification: [docs/slm-backends.md](docs/slm-backends.md).
diff --git a/docs/slm-backends.md b/docs/slm-backends.md
index ae901a7..59854a6 100644
--- a/docs/slm-backends.md
+++ b/docs/slm-backends.md
@@ -24,27 +24,41 @@ The "ollama" path is the easiest if you're already running a local model — it
 
 ## Presets
 
-Presets use `reecdev/tiny3.5:500m` as the default model — a 500 M-parameter Qwen3.5 distillation with tool support, available on Ollama. Pull it once with:
+Presets use `qwen3:0.6b` as the default model, a 600 M-parameter Qwen3 instruction-tuned model with native `/no_think` support, available on Ollama. Pull it once with:
 
 ```bash
-ollama pull reecdev/tiny3.5:500m # ~1 GB
-# or the 1.5 B variant for slightly better quality:
-ollama pull reecdev/tiny3.5:1.5b # ~3 GB
+ollama pull qwen3:0.6b # ~520 MB
 ```
 
+### Model choice notes
+
+Empirical testing (2026-05-25) across five candidate SLMs on identical prompts:
+
+| Model | Classifier success | Notes |
+|---|---|---|
+| `qwen3:0.6b` | consistent across trivial + knowledge prompts | recommended default; honours `/no_think` cleanly |
+| `functiongemma:270m` | works on trivial prompts, derails on knowledge ones | needs function-signature prompt rewrite or LoRA fine-tune to be reliable |
+| `gemma3:1b` | unusable | emits malformed JSON (just `{` or invented keys) |
+| `reecdev/tiny3.5:1.5b` | unusable | thinking-mode distillation; ignores `/no_think` and emits `<think>` blocks |
+| `qwen2.5-coder:1.5b` | unusable | code-completion-tuned; ignores the classifier prompt entirely and answers in prose |
+
 Substitute any small Ollama model you prefer. The probe at startup reads each model's actual capability — `tools` enables the SLM arm to handle simple file reads; without it, the SLM only handles knowledge-only prompts.
 
+If your SLM is task-specialised (function-call models like FunctionGemma; embedding-only models; code-completion-tuned models) and produces wrong-shape output when asked to answer a general prompt, set `register_as_arm = false` so the SLM stays classifier-only and execution routes to other local arms.
+
 ### Preset 1 — Ollama (recommended for most users)
 
 ```toml
 [slm]
-enabled = true
-backend = "ollama"
-model = "reecdev/tiny3.5:500m"
+enabled          = true
+backend          = "ollama"
+model            = "qwen3:0.6b"
+register_as_arm  = true     # default; set false for classifier-only models
+classify_timeout = "15s"    # default; bump for slow cold-load
 # base_url defaults to http://localhost:11434
 ```
 
-Prereq: `ollama pull reecdev/tiny3.5:500m` (or any model you'd rather use).
+Prereq: `ollama pull qwen3:0.6b` (or any model you'd rather use).
 
 ### Preset 2 — llama.cpp server
 
@@ -150,10 +164,10 @@ Output looks like:
 ```
 slm enabled: true
 slm backend: ollama
-  model: reecdev/tiny3.5:500m
+  model: qwen3:0.6b
 
 live probe:
-  ✓ ollama ready (model=reecdev/tiny3.5:500m, boot=0s)
+  ✓ ollama ready (model=qwen3:0.6b, boot=0s)
 ```
 
 Run a few prompts, then check: