docs: switch recommended SLM from reecdev/tiny3.5:500m to qwen3:0.6b
Empirical comparison on 2026-05-25 across five candidate SLMs on
identical prompts (two prompts: the trivial 'what is 2+2' and the
knowledge prompt 'explain a multi-armed bandit'):
  qwen3:0.6b            consistent across both prompts
  functiongemma:270m    works trivial, derails on knowledge prompts
  gemma3:1b             unusable (emits just '{' or invented keys)
  reecdev/tiny3.5:1.5b  unusable (ignores /no_think, leaks <Thought Process> blocks)
  qwen2.5-coder:1.5b    unusable (ignores classifier prompt, answers in prose)
qwen3:0.6b honours Qwen3's native /no_think flag (the distillation in
the old default did not), is smaller than the previous recommendation
(520 MB vs 1 GB), and was the only candidate to classify both test
prompts successfully without falling back to the heuristic.
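
The failure modes listed above (a bare '{', invented keys, leaked
<Thought Process> blocks, prose answers) can be checked mechanically. A
minimal sketch in Python, assuming the classifier is expected to return a
single JSON object; the key names `complexity` and `needs_tools` and the
function `check_classifier_output` are hypothetical and not taken from
gnoma's actual schema or code:

```python
import json

# Hypothetical schema: these key names are illustrative, not gnoma's real ones.
EXPECTED_KEYS = {"complexity", "needs_tools"}


def check_classifier_output(raw: str, expected_keys=EXPECTED_KEYS) -> str:
    """Return "ok" or a short label naming the failure mode."""
    text = raw.strip()
    # Thinking-mode models may leak reasoning blocks despite /no_think.
    if "<Thought Process>" in text:
        return "leaked-thinking"
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # A lone "{" is truncated JSON; anything else is treated as prose.
        return "malformed-json" if text.startswith("{") else "prose"
    if not isinstance(parsed, dict):
        return "not-an-object"
    if set(parsed) != set(expected_keys):
        return "unexpected-keys"
    return "ok"
```

Running each candidate's raw reply through a check like this gives one
label per model and prompt, which is enough to fill in a comparison of the
kind summarised above.
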
The README quickstart block, the slm-backends.md presets, and the status
output sample all switch to the new model. The example configs also
document register_as_arm (default true; set false for task-specialised
models like FunctionGemma) and classify_timeout (default 15s), since both
landed in v0.3.3+.
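
To make the two defaults concrete, here is a small sketch of how a
consumer of the `[slm]` table might apply them. It assumes the table has
already been parsed into a dict; `SLMSettings`, `resolve_slm_settings`,
and `parse_duration` are hypothetical names for illustration, and the
duration parser handles only the simple `ms`/`s`/`m` forms rather than
every format gnoma may accept:

```python
import re
from dataclasses import dataclass

# Seconds per unit for the simple duration forms handled here.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0}


def parse_duration(value: str) -> float:
    """Convert a simple duration such as "15s" or "2m" to seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m)", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


@dataclass
class SLMSettings:
    model: str
    register_as_arm: bool
    classify_timeout_s: float


def resolve_slm_settings(table: dict) -> SLMSettings:
    """Apply the documented defaults to an already-parsed [slm] table."""
    return SLMSettings(
        model=table["model"],
        # Documented default: the SLM is also registered as a routing arm.
        register_as_arm=table.get("register_as_arm", True),
        # Documented default: 15s classification timeout.
        classify_timeout_s=parse_duration(table.get("classify_timeout", "15s")),
    )
```

With only `model = "qwen3:0.6b"` set, this yields `register_as_arm=True`
and a 15-second timeout; a FunctionGemma setup would pass
`register_as_arm = false` explicitly.
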
Code defaults for the tiny3.5 family in internal/router/defaults.go
are unchanged — that table still applies when users have tiny3.5
registered as a routing arm independent of the SLM role.
@@ -364,9 +364,12 @@ gnoma can run a tiny local model alongside the main provider to:
 
 ```toml
 [slm]
-enabled = true
-backend = "auto" # ollama | llamacpp | llamafile | openaicompat | auto | disabled
-model = "reecdev/tiny3.5:500m"
+enabled = true
+backend = "auto" # ollama | llamacpp | llamafile | openaicompat | auto | disabled
+model = "qwen3:0.6b"
+register_as_arm = true # default; set to false to make the SLM classifier-only
+# (e.g. for FunctionGemma, code-completion-tuned models)
+classify_timeout = "15s" # default; bump higher for slow cold-loads
 ```
 
 Setup, presets, and verification: [docs/slm-backends.md](docs/slm-backends.md).
+24
-10
@@ -24,27 +24,41 @@ The "ollama" path is the easiest if you're already running a local model — it
 
 ## Presets
 
-Presets use `reecdev/tiny3.5:500m` as the default model — a 500 M-parameter Qwen3.5 distillation with tool support, available on Ollama. Pull it once with:
+Presets use `qwen3:0.6b` as the default model — a 600 M-parameter Qwen3 instruction-tuned model with native `/no_think` support, available on Ollama. Pull it once with:
 
 ```bash
-ollama pull reecdev/tiny3.5:500m # ~1 GB
-# or the 1.5 B variant for slightly better quality:
-ollama pull reecdev/tiny3.5:1.5b # ~3 GB
+ollama pull qwen3:0.6b # ~520 MB
 ```
 
+### Model choice notes
+
+Empirical testing (2026-05-25) across five candidate SLMs on identical prompts:
+
+| Model | Classifier success | Notes |
+|---|---|---|
+| `qwen3:0.6b` | consistent across trivial + knowledge prompts | recommended default; honours `/no_think` cleanly |
+| `functiongemma:270m` | works on trivial prompts, derails on knowledge ones | needs function-signature prompt rewrite or LoRA fine-tune to be reliable |
+| `gemma3:1b` | unusable | emits malformed JSON (just `{` or invented keys) |
+| `reecdev/tiny3.5:1.5b` | unusable | thinking-mode distillation; ignores `/no_think` and emits `<Thought Process>` blocks |
+| `qwen2.5-coder:1.5b` | unusable | code-completion-tuned; ignores the classifier prompt entirely and answers in prose |
+
 Substitute any small Ollama model you prefer. The probe at startup reads each model's actual capability — `tools` enables the SLM arm to handle simple file reads; without it, the SLM only handles knowledge-only prompts.
 
+If your SLM is task-specialised (function-call models like FunctionGemma; embedding-only models; code-completion-tuned models) and produces wrong-shape output when asked to answer a general prompt, set `register_as_arm = false` so the SLM stays classifier-only and execution routes to other local arms.
+
 ### Preset 1 — Ollama (recommended for most users)
 
 ```toml
 [slm]
-enabled = true
-backend = "ollama"
-model = "reecdev/tiny3.5:500m"
+enabled = true
+backend = "ollama"
+model = "qwen3:0.6b"
+register_as_arm = true # default; set false for classifier-only models
+classify_timeout = "15s" # default; bump for slow cold-load
 # base_url defaults to http://localhost:11434
 ```
 
-Prereq: `ollama pull reecdev/tiny3.5:500m` (or any model you'd rather use).
+Prereq: `ollama pull qwen3:0.6b` (or any model you'd rather use).
 
 ### Preset 2 — llama.cpp server
 
@@ -150,10 +164,10 @@ Output looks like:
 ```
 slm enabled: true
 slm backend: ollama
-model: reecdev/tiny3.5:500m
+model: qwen3:0.6b
 
 live probe:
-✓ ollama ready (model=reecdev/tiny3.5:500m, boot=0s)
+✓ ollama ready (model=qwen3:0.6b, boot=0s)
 ```
 
 Run a few prompts, then check: