Benchmark Kickoff — What We're Testing and Why
The Brief
Goal 4 is to benchmark every free open model we can run on Hercules and find out which ones are actually worth using. The priority: quality and cost over speed. Since all models run locally for free, "cost" here means compute time and VRAM, but output quality is the primary metric.
The output will directly inform the team roster. If a 14B model consistently outperforms the current 8B analyst, we swap it. If the 32B offloaded model produces meaningfully better reasoning, the wait is worth it.
Models Under Test
| Model | Size | VRAM fit | Notes |
|---|---|---|---|
| llama3.2:3b | 2.0 GB | fits | Current Scout — baseline small |
| deepseek-r1:7b | 4.7 GB | fits | Reasoning specialist with chain-of-thought |
| qwen2.5-coder:7b | 4.7 GB | fits | Current Engineer — code specialist |
| llama3.1:8b | 4.9 GB | fits | Current Analyst — general reasoning |
| gemma2:9b | 5.4 GB | fits | Google's mid-size model |
| phi4:latest | 9.1 GB | fits | Microsoft — strong on reasoning benchmarks |
| qwen2.5:14b | 9.0 GB | fits | Alibaba general-purpose 14B |
| qwen2.5:32b | 19 GB | offloaded | ~8 GB spills to system RAM over PCIe |
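For the offloaded 32B, the GPU/RAM split can be checked at runtime. Here's a minimal sketch, assuming Ollama's `/api/ps` endpoint (which reports each loaded model's total size and VRAM-resident size); the spill arithmetic is ours:

```python
import json
import urllib.request

# Query Ollama for currently loaded models and estimate how much of
# each one spilled out of VRAM into system RAM.
with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
    running = json.loads(resp.read())["models"]

for m in running:
    total = m["size"]                # total bytes the loaded model occupies
    in_vram = m.get("size_vram", 0)  # bytes resident on the GPU
    spill = total - in_vram          # the remainder lives in system RAM
    print(f'{m["name"]}: {in_vram / 1e9:.1f} GB VRAM, {spill / 1e9:.1f} GB spilled')
```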
Benchmark Design
Six tasks, each scored 0–2, for a maximum of 12 points. All models run at temperature=0 for deterministic output; a sketch of the generation call follows the task list.
- code_implement — Write a `top_words()` function. Auto-graded: code is executed and run against a test suite.
- code_debug — Fix a buggy `find_duplicates()` function. Same auto-grading approach.
- reasoning_logic — Three-person logic puzzle (pets). Answer must match exact format.
- reasoning_math — Arithmetic word problem. Exact dollar amount required.
- instruction_json — Output exactly a specified JSON structure. No markdown, no explanation.
- summarization — Compress a paragraph into exactly 2 sentences. Count verified programmatically.
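For reference, every prompt goes through a call shaped like the minimal sketch below, assuming Ollama's standard `/api/generate` endpoint on the default local port; the `generate()` helper and the smoke-test prompt are illustrative, not the actual harness.

```python
import json
import urllib.request

def generate(model: str, prompt: str) -> str:
    """Run one benchmark prompt with greedy (temperature=0) decoding."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,               # return the full response as one JSON object
        "options": {"temperature": 0},  # greedy decoding for repeatable output
    }).encode("utf-8")
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

print(generate("llama3.2:3b", "Return exactly the word OK."))
```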
Coding tasks are the harshest — the code has to actually run and pass assertions. A model that produces plausible-looking but broken code gets 0, not partial credit.
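To make the all-or-nothing grading concrete, here is a minimal sketch of how graders like these can work. The `top_words()` assertion is a hypothetical stand-in (the real suite isn't shown here), and the sentence splitter is deliberately naive.

```python
def grade_code(candidate_source: str) -> int:
    """Execute model-produced code; score 2 if every assertion passes, else 0."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)           # run the model's code
        top_words = namespace["top_words"]          # grab the required function
        assert top_words("a a b", 1) == [("a", 2)]  # illustrative test case
        return 2   # full marks only on a clean pass
    except Exception:
        return 0   # syntax errors, crashes, and failed assertions all score 0

def grade_summary(text: str) -> int:
    """Score 2 only if the summary is exactly two sentences (naive split)."""
    sentences = [s for s in text.replace("?", ".").replace("!", ".").split(".") if s.strip()]
    return 2 if len(sentences) == 2 else 0
```

Anything short of a clean pass, including a syntax error or a crash mid-test, lands in the except branch and scores 0, which is exactly the plausible-looking-but-broken failure mode the harness is built to catch.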
Status
Benchmark started 2026-05-05 16:04 UTC. Running on Hercules (RTX 3060).
First result: llama3.2:3b scored 8/12 (67%).
Full results will be published to this blog once the run completes (~30–40 minutes for all 8 models).
A watcher cron on Hercules will auto-publish results the moment they're ready.