Benchmark Results — Full Rankings

Winner

llama3.1:8b and gemma2:9b tied at 83% (10/12) — both fully fit in VRAM, both reliable. The current team Analyst (llama3.1:8b) is already the best general model we have. Gemma2:9b is an equal alternative worth keeping available.

Full Rankings

| # | Model            | Size   | Score       | Speed      | VRAM      |
|---|------------------|--------|-------------|------------|-----------|
| 1 | llama3.1:8b      | 4.9 GB | 10/12 (83%) | 38.6 tok/s | fits      |
| 1 | gemma2:9b        | 5.4 GB | 10/12 (83%) | 23.5 tok/s | fits      |
| 3 | qwen2.5:14b      | 9.0 GB | 9/12 (75%)  | 19.8 tok/s | fits      |
| 4 | llama3.2:3b      | 2.0 GB | 8/12 (67%)  | 45.3 tok/s | fits      |
| 4 | qwen2.5-coder:7b | 4.7 GB | 8/12 (67%)  | 32.8 tok/s | fits      |
| 4 | qwen2.5:32b      | 19 GB  | 8/12 (67%)  | 0.6 tok/s  | offloaded |
| 7 | deepseek-r1:7b   | 4.7 GB | 4/12 (33%)  | 45.1 tok/s | fits      |
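
Two models share rank 1 and three share rank 4 because ties use standard competition ranking: tied models take the same rank and the following rank is skipped. A minimal sketch of that ranking logic, with the scores from the table above hard-coded purely for illustration:

```python
# Standard competition ranking: tied scores share a rank, and the
# following rank is skipped (1, 1, 3, ...). Scores are the 12-task
# results from the table above.
results = {
    "llama3.1:8b": 10,
    "gemma2:9b": 10,
    "qwen2.5:14b": 9,
    "llama3.2:3b": 8,
    "qwen2.5-coder:7b": 8,
    "qwen2.5:32b": 8,
    "deepseek-r1:7b": 4,
}

ordered = sorted(results.items(), key=lambda kv: kv[1], reverse=True)
rank = 0
prev_score = None
for position, (model, score) in enumerate(ordered, start=1):
    if score != prev_score:   # new score: rank jumps to the current position
        rank = position
        prev_score = score
    print(f"{rank}  {model}  {score}/12 ({score / 12:.0%})")
```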

Note: phi4 was present on Hercules but was missed because of a model ID mismatch in the benchmark runner: it is registered as phi4:14b, not phi4:latest. It will be re-run separately.
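
To catch this class of mismatch before a run, the runner could resolve each requested model ID against the tags Ollama actually reports. A minimal sketch, assuming a Python runner talking to the default local Ollama endpoint; the URL, helper name, and fallback rule are illustrative, not the benchmark runner's actual code:

```python
import requests

OLLAMA_URL = "http://localhost:11434"  # assumed default local endpoint

def resolve_model(requested: str) -> str:
    """Match a requested model ID against the tags Ollama actually has,
    falling back to a base-name match (e.g. 'phi4:latest' -> 'phi4:14b')."""
    tags = requests.get(f"{OLLAMA_URL}/api/tags", timeout=10).json()
    available = [m["name"] for m in tags.get("models", [])]

    if requested in available:
        return requested

    base = requested.split(":")[0]
    candidates = [name for name in available if name.split(":")[0] == base]
    if candidates:
        return candidates[0]  # take the first tag with the same base name

    raise ValueError(f"{requested!r} not found; available: {available}")

# Example: the config asked for phi4:latest, but Hercules registers phi4:14b.
# print(resolve_model("phi4:latest"))
```

Running the resolver once over the whole model list at startup would have flagged the phi4 entry before any benchmarks ran, instead of silently skipping it.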

Surprises

Team Roster Implications

Next Step

Goal 6 picks up directly from here: find or fine-tune a model that acts as a capable deputy. The benchmark shows our ceiling on current hardware is around 83% on this task suite with models that fit in VRAM. To meaningfully exceed that, we need either a stronger model in the same size class or a different approach (fine-tuning, specialised models).
