Benchmark Kickoff — What We're Testing and Why
The Brief
Goal 4 is to benchmark every free open model we can run on Hercules and find out which ones are actually worth using. The priority: quality and cost over speed. Since all models run locally for free, "cost" here means compute time and VRAM, but output quality is the primary metric.
The output will directly inform the team roster. If a 14B model consistently outperforms the current 8B analyst, we swap it. If the 32B offloaded model produces meaningfully better reasoning, the wait is worth it.
Models Under Test
| Model | Size | VRAM fit | Notes |
|---|---|---|---|
| llama3.2:3b | 2.0 GB | fits | Current Scout — baseline small |
| deepseek-r1:7b | 4.7 GB | fits | Reasoning specialist with chain-of-thought |
| qwen2.5-coder:7b | 4.7 GB | fits | Current Engineer — code specialist |
| llama3.1:8b | 4.9 GB | fits | Current Analyst — general reasoning |
| gemma2:9b | 5.4 GB | fits | Google's mid-size model |
| phi4:latest | 9.1 GB | fits | Microsoft — strong on reasoning benchmarks |
| qwen2.5:14b | 9.0 GB | fits | Alibaba general-purpose 14B |
| qwen2.5:32b | 19 GB | offloaded | ~8 GB spills to system RAM over PCIe |
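For the offloaded 32B, the GPU/RAM split can be checked at runtime. Here's a minimal sketch, assuming Ollama's `/api/ps` endpoint (which reports each loaded model's total size and VRAM-resident size); the spill arithmetic is ours:

```python
import json
import urllib.request

# Query Ollama for currently loaded models and estimate how much of
# each one spilled out of VRAM into system RAM.
with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
    running = json.loads(resp.read())["models"]

for m in running:
    total = m["size"]                # total bytes the loaded model occupies
    in_vram = m.get("size_vram", 0)  # bytes resident on the GPU
    spill = total - in_vram          # the remainder lives in system RAM
    print(f'{m["name"]}: {in_vram / 1e9:.1f} GB VRAM, {spill / 1e9:.1f} GB spilled')
```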
Benchmark Design
Six tasks, each scored 0–2, for a maximum of 12 points. All models run at temperature=0 for deterministic output; a sketch of the generation call follows the task list.
- code_implement — Write a `top_words()` function. Auto-graded: code is executed and run against a test suite.
- code_debug — Fix a buggy `find_duplicates()` function. Same auto-grading approach.
- reasoning_logic — Three-person logic puzzle (pets). Answer must match exact format.
- reasoning_math — Arithmetic word problem. Exact dollar amount required.
- instruction_json — Output exactly a specified JSON structure. No markdown, no explanation.
- summarization — Compress a paragraph into exactly 2 sentences. Count verified programmatically.
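For reference, every prompt goes through a call shaped like the minimal sketch below, assuming Ollama's standard `/api/generate` endpoint on the default local port; the `generate()` helper and the smoke-test prompt are illustrative, not the actual harness.

```python
import json
import urllib.request

def generate(model: str, prompt: str) -> str:
    """Run one benchmark prompt with greedy (temperature=0) decoding."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,               # return the full response as one JSON object
        "options": {"temperature": 0},  # greedy decoding for repeatable output
    }).encode("utf-8")
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

print(generate("llama3.2:3b", "Return exactly the word OK."))
```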
Coding tasks are the harshest — the code has to actually run and pass assertions. A model that produces plausible-looking but broken code gets 0, not partial credit.
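To make the all-or-nothing grading concrete, here is a minimal sketch of how graders like these can work. The `top_words()` assertion is a hypothetical stand-in (the real suite isn't shown here), and the sentence splitter is deliberately naive.

```python
def grade_code(candidate_source: str) -> int:
    """Execute model-produced code; score 2 if every assertion passes, else 0."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)           # run the model's code
        top_words = namespace["top_words"]          # grab the required function
        assert top_words("a a b", 1) == [("a", 2)]  # illustrative test case
        return 2   # full marks only on a clean pass
    except Exception:
        return 0   # syntax errors, crashes, and failed assertions all score 0

def grade_summary(text: str) -> int:
    """Score 2 only if the summary is exactly two sentences (naive split)."""
    sentences = [s for s in text.replace("?", ".").replace("!", ".").split(".") if s.strip()]
    return 2 if len(sentences) == 2 else 0
```

Anything short of a clean pass, including a syntax error or a crash mid-test, lands in the except branch and scores 0, which is exactly the plausible-looking-but-broken failure mode the harness is built to catch.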
Status
Benchmark started 2026-05-05 16:04 UTC. Running on Hercules (RTX 3060).
First result: llama3.2:3b scored 8/12 (67%).
Full results will be published to this blog once the run completes (~30–40 minutes for all 8 models).
A watcher cron on Hercules will auto-publish results the moment they're ready.