Benchmark Results — Full Rankings
Winner
llama3.1:8b and gemma2:9b tied at 83% (10/12): both fit fully in VRAM and both were reliable across the suite. The current team Analyst (llama3.1:8b) is already the best general model we have; gemma2:9b matches it on accuracy but is slower (23.5 vs 38.6 tok/s), so it is worth keeping available as an alternative.
Full Rankings
| Rank | Model | Size | Score | Speed | VRAM fit |
|---|---|---|---|---|---|
| 1 | llama3.1:8b | 4.9 GB | 10/12 (83%) | 38.6 tok/s | fits |
| 1 | gemma2:9b | 5.4 GB | 10/12 (83%) | 23.5 tok/s | fits |
| 3 | qwen2.5:14b | 9.0 GB | 9/12 (75%) | 19.8 tok/s | fits |
| 4 | llama3.2:3b | 2.0 GB | 8/12 (67%) | 45.3 tok/s | fits |
| 4 | qwen2.5-coder:7b | 4.7 GB | 8/12 (67%) | 32.8 tok/s | fits |
| 4 | qwen2.5:32b | 19 GB | 8/12 (67%) | 0.6 tok/s | offloaded |
| 7 | deepseek-r1:7b | 4.7 GB | 4/12 (33%) | 45.1 tok/s | fits |
Note: phi4:latest was present on Hercules but was skipped due to a model ID mismatch in the benchmark runner (the model is registered as phi4:14b). It will be re-run separately. A preflight check like the sketch below would catch this class of failure before a run starts.
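A minimal sketch of such a preflight check, assuming the runner talks to a local Ollama daemon on the default port; `OLLAMA_HOST` and `CONFIGURED_MODELS` are illustrative placeholders, not the runner's actual config:

```python
# Preflight check: verify every configured model ID exists on the Ollama host
# before the benchmark starts, so a mismatch like phi4:latest vs phi4:14b
# fails loudly instead of silently skipping a model.
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"          # assumed default Ollama port
CONFIGURED_MODELS = ["llama3.1:8b", "gemma2:9b", "phi4:latest"]  # example list

def installed_models(host: str = OLLAMA_HOST) -> set[str]:
    """Return the exact model tags the Ollama daemon reports via /api/tags."""
    with urllib.request.urlopen(f"{host}/api/tags") as resp:
        data = json.load(resp)
    return {m["name"] for m in data.get("models", [])}

def check_roster(configured: list[str]) -> None:
    """Abort with a clear message if any configured tag is not installed."""
    available = installed_models()
    missing = [m for m in configured if m not in available]
    if missing:
        raise SystemExit(
            f"Model ID mismatch, fix config before running: {missing} "
            f"(installed: {sorted(available)})"
        )

if __name__ == "__main__":
    check_roster(CONFIGURED_MODELS)
```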
Surprises
- DeepSeek-R1 7B bombed (33%): its chain-of-thought style produces a verbose internal monologue before the answer, and the exact-match grader penalised that heavily. It likely answered many tasks correctly in spirit but not in the required format. Not suitable for structured-output tasks as graded here (see the grader sketch after this list).
- 32B offloaded didn't outperform 8B: same score as llama3.2:3b despite roughly 10× the size. The PCIe bottleneck capped it at 0.6 tok/s, and one task timed out entirely. More parameters ≠ better results when the RAM offload path is this slow (the tok/s figures can be reproduced with the throughput sketch after this list).
- 3B model at 67% is respectable: llama3.2:3b runs at 45 tok/s, matched qwen2.5-coder:7b, and beat deepseek-r1:7b. Good for fast routing tasks.
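On the R1 result: a sketch of a more forgiving grading step, assuming the model's output wraps its reasoning in `<think>...</think>` tags (as the Ollama deepseek-r1 builds do). This is a hypothetical helper, not the grader used in this benchmark:

```python
# Strip <think>...</think> reasoning blocks and grade only the final answer
# line, so a chain-of-thought model is scored on its answer, not its monologue.
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

def extract_final_answer(raw: str) -> str:
    """Drop chain-of-thought, keep the last non-empty line as the answer."""
    visible = THINK_BLOCK.sub("", raw).strip()
    lines = [ln.strip() for ln in visible.splitlines() if ln.strip()]
    return lines[-1] if lines else ""

def exact_match(raw_output: str, expected: str) -> bool:
    """Case-insensitive exact match against the extracted answer."""
    return extract_final_answer(raw_output).casefold() == expected.strip().casefold()
```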
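The tok/s figures are straightforward to reproduce from Ollama's own timing fields: a non-streamed `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds). A minimal sketch, assuming a local daemon:

```python
# Measure decode throughput for one model: tokens generated divided by
# generation time, both taken from the /api/generate response itself.
import json
import urllib.request

def generation_speed(model: str, prompt: str,
                     host: str = "http://localhost:11434") -> float:
    """Return decode throughput in tokens/sec for one non-streamed generation."""
    payload = json.dumps({"model": model, "prompt": prompt,
                          "stream": False}).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["eval_count"] / (data["eval_duration"] / 1e9)

print(f"{generation_speed('llama3.1:8b', 'Explain PCIe lanes briefly.'):.1f} tok/s")
```

For the offloaded 32B case, `ollama ps` also reports the CPU/GPU split for a running model, which makes the offload visible directly.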
Team Roster Implications
- Scout: Keep llama3.2:3b. Fast, good enough for routing/summarization (illustrated in the sketch after this list).
- Engineer: Keep qwen2.5-coder:7b for code tasks. Consider qwen2.5:14b as an upgrade (75%, still fits in VRAM).
- Analyst: llama3.1:8b confirmed as the best general model. gemma2:9b is an equal-scoring alternative, though slower (23.5 vs 38.6 tok/s).
- Commander (Goal 6): The 32B offload result is a warning — simply going bigger doesn't help if hardware can't serve it. The real upgrade path requires either a bigger GPU or a more targeted model.
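As an illustration of the Scout role, a routing call against llama3.2:3b via `/api/chat` might look like the following; the lane names and prompt are hypothetical, not part of the benchmark suite:

```python
# Classify an incoming task into one of the team lanes using the fast 3B
# model, falling back to a default lane if the reply doesn't name one.
import json
import urllib.request

LANES = ["code", "analysis", "summarize"]  # hypothetical lane set

def route(task: str, host: str = "http://localhost:11434") -> str:
    """Ask llama3.2:3b which lane a task belongs to; return a lane name."""
    payload = json.dumps({
        "model": "llama3.2:3b",
        "messages": [{"role": "user",
                      "content": f"Classify this task as exactly one of {LANES}: {task}"}],
        "stream": False,
    }).encode()
    req = urllib.request.Request(f"{host}/api/chat", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["message"]["content"].strip().lower()
    return next((lane for lane in LANES if lane in reply), "analysis")

print(route("Refactor the benchmark runner to retry failed requests."))
```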
Next Step
Goal 6 picks up directly from here: find or fine-tune a model that acts as a capable deputy. The benchmark shows our ceiling on current hardware is around 83% on this task suite with models that fit in VRAM. To meaningfully exceed that, we need either a stronger model in the same size class or a different approach (fine-tuning, specialised models).