Benchmark Results — Full Rankings

Winner

llama3.1:8b and gemma2:9b tied at 83% (10/12) — both fully fit in VRAM, both reliable. The current team Analyst (llama3.1:8b) is already the best general model we have. Gemma2:9b is an equal alternative worth keeping available.

Full Rankings

| # | Model            | Size   | Score       | Speed      | VRAM      |
|---|------------------|--------|-------------|------------|-----------|
| 1 | llama3.1:8b      | 4.9 GB | 10/12 (83%) | 38.6 tok/s | fits      |
| 1 | gemma2:9b        | 5.4 GB | 10/12 (83%) | 23.5 tok/s | fits      |
| 3 | qwen2.5:14b      | 9.0 GB | 9/12 (75%)  | 19.8 tok/s | fits      |
| 4 | llama3.2:3b      | 2.0 GB | 8/12 (67%)  | 45.3 tok/s | fits      |
| 4 | qwen2.5-coder:7b | 4.7 GB | 8/12 (67%)  | 32.8 tok/s | fits      |
| 4 | qwen2.5:32b      | 19 GB  | 8/12 (67%)  | 0.6 tok/s  | offloaded |
| 7 | deepseek-r1:7b   | 4.7 GB | 4/12 (33%)  | 45.1 tok/s | fits      |
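
Two models share rank 1 and three share rank 4 because ties use standard competition ranking: tied models take the same rank and the following rank is skipped. A minimal sketch of that ranking logic, with the scores from the table above hard-coded purely for illustration:

```python
# Standard competition ranking: tied scores share a rank, and the
# following rank is skipped (1, 1, 3, ...). Scores are the 12-task
# results from the table above.
results = {
    "llama3.1:8b": 10,
    "gemma2:9b": 10,
    "qwen2.5:14b": 9,
    "llama3.2:3b": 8,
    "qwen2.5-coder:7b": 8,
    "qwen2.5:32b": 8,
    "deepseek-r1:7b": 4,
}

ordered = sorted(results.items(), key=lambda kv: kv[1], reverse=True)
rank = 0
prev_score = None
for position, (model, score) in enumerate(ordered, start=1):
    if score != prev_score:   # new score: rank jumps to the current position
        rank = position
        prev_score = score
    print(f"{rank}  {model}  {score}/12 ({score / 12:.0%})")
```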

Note: phi4 was present on Hercules but was missed because of a model ID mismatch in the benchmark runner: it is registered as phi4:14b, not phi4:latest. It will be re-run separately.
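
To catch this class of mismatch before a run, the runner could resolve each requested model ID against the tags Ollama actually reports. A minimal sketch, assuming a Python runner talking to the default local Ollama endpoint; the URL, helper name, and fallback rule are illustrative, not the benchmark runner's actual code:

```python
import requests

OLLAMA_URL = "http://localhost:11434"  # assumed default local endpoint

def resolve_model(requested: str) -> str:
    """Match a requested model ID against the tags Ollama actually has,
    falling back to a base-name match (e.g. 'phi4:latest' -> 'phi4:14b')."""
    tags = requests.get(f"{OLLAMA_URL}/api/tags", timeout=10).json()
    available = [m["name"] for m in tags.get("models", [])]

    if requested in available:
        return requested

    base = requested.split(":")[0]
    candidates = [name for name in available if name.split(":")[0] == base]
    if candidates:
        return candidates[0]  # take the first tag with the same base name

    raise ValueError(f"{requested!r} not found; available: {available}")

# Example: the config asked for phi4:latest, but Hercules registers phi4:14b.
# print(resolve_model("phi4:latest"))
```

Running the resolver once over the whole model list at startup would have flagged the phi4 entry before any benchmarks ran, instead of silently skipping it.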

Surprises

Team Roster Implications

Next Step

Goal 6 picks up directly from here: find or fine-tune a model that acts as a capable deputy. The benchmark shows our ceiling on current hardware is around 83% on this task suite with models that fit in VRAM. To meaningfully exceed that, we need either a stronger model in the same size class or a different approach (fine-tuning, specialised models).
