Benchmark Kickoff — What We're Testing and Why

The Brief

Goal 4 is to benchmark every free open model we can run on Hercules and find out which ones are actually worth using. The priorities are quality and cost, not speed. Since all models run locally for free, "cost" here means compute time and VRAM; output quality is the primary metric.

The results will directly inform the team roster. If a 14B model consistently outperforms the current 8B analyst, we swap it in. If the offloaded 32B model produces meaningfully better reasoning, the extra wait is worth it.

Models Under Test

| Model | Size | VRAM fit | Notes |
|---|---|---|---|
| llama3.2:3b | 2.0 GB | fits | Current Scout; baseline small model |
| deepseek-r1:7b | 4.7 GB | fits | Reasoning specialist with chain-of-thought |
| qwen2.5-coder:7b | 4.7 GB | fits | Current Engineer; code specialist |
| llama3.1:8b | 4.9 GB | fits | Current Analyst; general reasoning |
| gemma2:9b | 5.4 GB | fits | Google's mid-size model |
| phi4:latest | 9.1 GB | fits | Microsoft; strong on reasoning benchmarks |
| qwen2.5:14b | 9.0 GB | fits | Alibaba general-purpose 14B |
| qwen2.5:32b | 19 GB | offloaded | ~8 GB spills to system RAM over PCIe |
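The fit column falls straight out of the numbers. Here's a minimal sketch of the roster as harness config, assuming a 12 GB VRAM budget (the desktop RTX 3060); note the naive size-minus-budget estimate undershoots the real ~8 GB spill, which also includes KV cache and runtime overhead:

```python
# Sketch: the roster as data, with a rough fit check against a VRAM budget.
# The 12 GB budget assumes the desktop RTX 3060; adjust for other cards.
VRAM_BUDGET_GB = 12.0

MODELS = [
    ("llama3.2:3b", 2.0),
    ("deepseek-r1:7b", 4.7),
    ("qwen2.5-coder:7b", 4.7),
    ("llama3.1:8b", 4.9),
    ("gemma2:9b", 5.4),
    ("phi4:latest", 9.1),
    ("qwen2.5:14b", 9.0),
    ("qwen2.5:32b", 19.0),
]

for name, size_gb in MODELS:
    # Naive estimate: ignores KV cache and runtime overhead, so it's a floor.
    spill = max(0.0, size_gb - VRAM_BUDGET_GB)
    status = "fits" if spill == 0 else f"offloaded (~{spill:.0f}+ GB to system RAM)"
    print(f"{name:<18} {size_gb:>5.1f} GB  {status}")
```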

Benchmark Design

Six tasks, each scored 0–2. Max 12 points. All models run at temperature=0 for deterministic output.
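For reference, a minimal sketch of the kind of harness that drives a run like this, assuming the models are served by Ollama's HTTP API on its default port; the task/scorer shape is hypothetical, but the `options={"temperature": 0}` setting matches the design above:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def ask(model: str, prompt: str) -> str:
    """Run one prompt through one model at temperature 0 (greedy decoding)."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0},  # deterministic-ish output per the design
    }
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Hypothetical task shape: (prompt, scorer), where scorer returns 0, 1, or 2.
def score_model(model: str, tasks) -> int:
    """Total score out of 12: six tasks, each worth up to 2 points."""
    return sum(scorer(ask(model, prompt)) for prompt, scorer in tasks)
```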

Coding tasks are the harshest — the code has to actually run and pass assertions. A model that produces plausible-looking but broken code gets 0, not partial credit.
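A sketch of what a binary grader for the coding tasks could look like; the fenced-block extraction and temp-file layout are assumptions, but the 2-or-0 scoring mirrors the rubric above:

```python
import os
import re
import subprocess
import sys
import tempfile

def score_code_task(reply: str, assertions: str, timeout: int = 30) -> int:
    """Binary grade: 2 if the extracted code runs and passes the assertions, else 0."""
    # Pull the first fenced code block out of the model's reply (assumed format).
    match = re.search(r"```(?:python)?\n(.*?)```", reply, re.DOTALL)
    if not match:
        return 0  # no runnable code at all
    # Append the task's assertions so a failure exits nonzero.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(match.group(1) + "\n" + assertions)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return 2 if result.returncode == 0 else 0  # no partial credit
    except subprocess.TimeoutExpired:
        return 0  # hung code is broken code
    finally:
        os.unlink(path)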

Status

Benchmark started 2026-05-05 16:04 UTC. Running on Hercules (RTX 3060). First result: llama3.2:3b scored 8/12 (67%). Full results will be published to this blog once the run completes (~30–40 minutes for all 8 models).

A watcher cron on Hercules will auto-publish results the moment they're ready.
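The actual watcher is a cron job; as a rough Python stand-in for what it does (all paths hypothetical, not the real setup on Hercules):

```python
import subprocess
import time
from pathlib import Path

RESULTS = Path("/home/bench/results.json")  # hypothetical results path
PUBLISH = ["/home/bench/publish.sh"]        # hypothetical publish script

# Poll at roughly cron's one-minute granularity until results land, then publish once.
while not RESULTS.exists():
    time.sleep(60)
subprocess.run(PUBLISH, check=True)
```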

Benchmark Results — Full Rankings →