docs: switch recommended SLM from reecdev/tiny3.5:500m to qwen3:0.6b

Empirical comparison on 2026-05-25 across five candidate SLMs on
identical prompts (two prompts: trivial 'what is 2+2' and knowledge
'explain a multi-armed bandit'):

  qwen3:0.6b           consistent across both prompts
  functiongemma:270m   works trivial, derails on knowledge prompts
  gemma3:1b            unusable (emits just '{' or invented keys)
  reecdev/tiny3.5:1.5b unusable (ignores /no_think, leaks <Thought Process> blocks)
  qwen2.5-coder:1.5b   unusable (ignores classifier prompt, answers in prose)

qwen3:0.6b honours Qwen3's native /no_think flag (the distillation in
the old default did not), is smaller than the previous recommendation
(520 MB vs 1 GB), and was the only candidate to classify both test
prompts successfully without falling back to the heuristic.

README quickstart block + slm-backends.md presets + status output
sample all switched. Also documents register_as_arm (default true,
set false for task-specialised models like FunctionGemma) and
classify_timeout (default 15s) in the example configs since both
landed in v0.3.3+.

Code defaults for the tiny3.5 family in internal/router/defaults.go
are unchanged — that table still applies when users have tiny3.5
registered as a routing arm independent of the SLM role.
2026-05-25 02:43:11 +02:00
parent fd327107df
commit 7213a1e2fd
2 changed files with 30 additions and 13 deletions
+6 -3
@@ -364,9 +364,12 @@ gnoma can run a tiny local model alongside the main provider to:
```toml
[slm]
-enabled = true
-backend = "auto" # ollama | llamacpp | llamafile | openaicompat | auto | disabled
-model = "reecdev/tiny3.5:500m"
+enabled = true
+backend = "auto" # ollama | llamacpp | llamafile | openaicompat | auto | disabled
+model = "qwen3:0.6b"
+register_as_arm = true # default; set to false to make the SLM classifier-only
+# (e.g. for FunctionGemma, code-completion-tuned models)
+classify_timeout = "15s" # default; bump higher for slow cold-loads
```
Setup, presets, and verification: [docs/slm-backends.md](docs/slm-backends.md).
+24 -10
@@ -24,27 +24,41 @@ The "ollama" path is the easiest if you're already running a local model — it
## Presets
-Presets use `reecdev/tiny3.5:500m` as the default model, a 500 M-parameter Qwen3.5 distillation with tool support, available on Ollama. Pull it once with:
+Presets use `qwen3:0.6b` as the default model, a 600 M-parameter Qwen3 instruction-tuned model with native `/no_think` support, available on Ollama. Pull it once with:
```bash
-ollama pull reecdev/tiny3.5:500m # ~1 GB
-# or the 1.5 B variant for slightly better quality:
-ollama pull reecdev/tiny3.5:1.5b # ~3 GB
+ollama pull qwen3:0.6b # ~520 MB
```
+### Model choice notes
+Empirical testing (2026-05-25) across five candidate SLMs on identical prompts:
+| Model | Classifier success | Notes |
+|---|---|---|
+| `qwen3:0.6b` | consistent across trivial + knowledge prompts | recommended default; honours `/no_think` cleanly |
+| `functiongemma:270m` | works on trivial prompts, derails on knowledge ones | needs function-signature prompt rewrite or LoRA fine-tune to be reliable |
+| `gemma3:1b` | unusable | emits malformed JSON (just `{` or invented keys) |
+| `reecdev/tiny3.5:1.5b` | unusable | thinking-mode distillation; ignores `/no_think` and emits `<Thought Process>` blocks |
+| `qwen2.5-coder:1.5b` | unusable | code-completion-tuned; ignores the classifier prompt entirely and answers in prose |
+Substitute any small Ollama model you prefer. The probe at startup reads each model's actual capability: `tools` enables the SLM arm to handle simple file reads; without it, the SLM only handles knowledge-only prompts.
+If your SLM is task-specialised (function-call models like FunctionGemma; embedding-only models; code-completion-tuned models) and produces wrong-shape output when asked to answer a general prompt, set `register_as_arm = false` so the SLM stays classifier-only and execution routes to other local arms.
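When trialling a substitute model, it can help to sort raw replies into the failure modes listed in the table above. The following is a minimal sketch, not gnoma's actual classifier check: the function name `triage_reply` and its four labels are hypothetical, and it only inspects the shape of the reply, so it cannot detect invented keys inside otherwise well-formed JSON.

```bash
#!/usr/bin/env bash
# Hypothetical helper: crude shape check on a raw SLM reply.
# Prints one of: json-shaped | truncated-json | leaked-thinking | prose
triage_reply() {
  local reply="$1"
  # Strip leading and trailing whitespace so the patterns below anchor cleanly.
  reply="${reply#"${reply%%[![:space:]]*}"}"
  reply="${reply%"${reply##*[![:space:]]}"}"
  case "$reply" in
    *"<Thought Process>"*) echo "leaked-thinking" ;;  # thinking block leaked despite /no_think
    "{"*"}")               echo "json-shaped" ;;      # starts with { and ends with }
    "{"*)                  echo "truncated-json" ;;   # opened an object but never closed it
    *)                     echo "prose" ;;            # ignored the classifier prompt
  esac
}
```

A reply labelled `json-shaped` still needs real JSON parsing and key validation before it can be trusted; the other three labels roughly correspond to the "unusable" rows in the table.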
### Preset 1 — Ollama (recommended for most users)
```toml
[slm]
-enabled = true
-backend = "ollama"
-model = "reecdev/tiny3.5:500m"
+enabled = true
+backend = "ollama"
+model = "qwen3:0.6b"
+register_as_arm = true # default; set false for classifier-only models
+classify_timeout = "15s" # default; bump for slow cold-load
# base_url defaults to http://localhost:11434
```
-Prereq: `ollama pull reecdev/tiny3.5:500m` (or any model you'd rather use).
+Prereq: `ollama pull qwen3:0.6b` (or any model you'd rather use).
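To avoid re-pulling a model that is already present, the prerequisite can be guarded with a check against `ollama list`. This is a sketch under one assumption: that `ollama list` prints a header row followed by rows whose first column is the model name. The helper name `has_model` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `ollama list` output on stdin and succeeds
# if the first column of any non-header row equals the requested model.
has_model() {
  local want="$1"
  awk -v want="$want" '
    NR > 1 && $1 == want { found = 1 }   # skip the header row, match column 1 exactly
    END { exit found ? 0 : 1 }
  '
}

# Usage (requires a local Ollama install):
#   ollama list | has_model qwen3:0.6b || ollama pull qwen3:0.6b
```

The exact-match comparison is deliberate: a substring match would treat `qwen3:0.6b` as present when only a differently tagged variant is installed.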
### Preset 2 — llama.cpp server
@@ -150,10 +164,10 @@ Output looks like:
```
slm enabled: true
slm backend: ollama
-model: reecdev/tiny3.5:500m
+model: qwen3:0.6b
live probe:
-✓ ollama ready (model=reecdev/tiny3.5:500m, boot=0s)
+✓ ollama ready (model=qwen3:0.6b, boot=0s)
```
Run a few prompts, then check: