Planning the Second in Command
The Brief
Goal 6 is to deploy a large, capable local model that acts as my deputy — handling complex reasoning, long-form analysis, and multi-step planning that currently costs cloud tokens. It should be:
- Reliable — produces consistent, structured output
- High quality — meaningfully better than the 8B Analyst
- Slow is fine — async tasks, not interactive chat
- Free to run — local inference only
What the Benchmark Taught Us
The Goal 4 benchmark revealed an important constraint: qwen2.5:32b scored the same as llama3.2:3b despite being 16× larger, because PCIe RAM offloading bottlenecked it to 0.6 tok/s and one task timed out. Simply going bigger doesn't help when the hardware can't serve it fast enough.
Hercules hardware ceiling:
- VRAM: 12 GB (RTX 3060)
- System RAM: 53 GB
- Total addressable: ~65 GB
A 70B model at Q4 quantization is ~40 GB — with 12 GB of VRAM, roughly 28 GB spills into system RAM via PCIe offload. Extrapolating from the 32B result (0.6 tok/s), a 70B would run at roughly 0.2–0.3 tok/s, so even a short response takes ~3–5 minutes. That's acceptable for background tasks, not interactive use.
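The sizing math above can be sketched as a back-of-envelope check. The bytes-per-parameter figure (~0.57 for Q4-class quants, including quantization overhead) and the 60-token "short reply" are assumptions, not measurements:

```python
# Back-of-envelope check for the 70B plan. The Q4 bytes/param figure
# and the response length are assumptions, not measured values.
BYTES_PER_PARAM_Q4 = 0.57  # ~4.5 bits/weight incl. quantization overhead

def q4_size_gb(params_b: float) -> float:
    """Approximate in-memory size of a Q4 model, in GB."""
    return params_b * BYTES_PER_PARAM_Q4

def ram_spill_gb(model_gb: float, vram_gb: float = 12.0) -> float:
    """How much of the model overflows VRAM into system RAM."""
    return max(0.0, model_gb - vram_gb)

def response_minutes(tokens: int, tok_per_s: float) -> float:
    """Wall-clock minutes to generate a response of the given length."""
    return tokens / tok_per_s / 60

size = q4_size_gb(70)             # ~40 GB
spill = ram_spill_gb(size)        # ~28 GB offloaded over PCIe
eta = response_minutes(60, 0.25)  # short ~60-token reply at 0.25 tok/s
print(f"{size:.0f} GB model, {spill:.0f} GB in RAM, ~{eta:.0f} min per short reply")
```

At 0.25 tok/s, the "~3–5 minutes per response" estimate holds only for short replies; long-form analysis would take proportionally longer, which is fine for async work.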
Candidate Models
| Model | Size Q4 | Est. Speed | Strengths |
|---|---|---|---|
| llama3.3:70b | ~40 GB | ~0.3 tok/s | Strong general reasoning, instruction following |
| qwen2.5:72b | ~41 GB | ~0.3 tok/s | Strong coding + reasoning, same family as current team |
| deepseek-r1:70b | ~40 GB | ~0.3 tok/s | Chain-of-thought reasoning, but structured output risk (see benchmark) |
405B is not feasible — Llama 3.1 405B at Q4 is ~230 GB, more than 3× the ~65 GB of total addressable memory. Even Q2 (~115 GB) doesn't fit. If 405B-class capability is needed, it would require a machine with significantly more RAM or a cloud fallback — worth discussing.
Recommended Approach
Start with llama3.3:70b — it's Meta's latest 70B, strong on instruction following, and it avoids the structured-output verbosity issue seen with DeepSeek-R1. Once it's downloaded, run a targeted benchmark comparing it to llama3.1:8b on harder tasks.
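A minimal sketch of what that targeted benchmark could look like, assuming a local Ollama server on the default port; the task prompts here are placeholders, and the `eval_count`/`eval_duration` fields come from Ollama's `/api/generate` response:

```python
# Hypothetical harness for the 70B-vs-8B comparison. Assumes a local
# Ollama server on the default port; the task prompts are placeholders.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Ollama reports eval_duration in nanoseconds; convert to tok/s."""
    return eval_count / (eval_duration_ns / 1e9)

def run_task(model: str, prompt: str, timeout_s: int = 600) -> dict:
    """Send one non-streaming generation request and report its speed."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        OLLAMA_URL, data=body.encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        data = json.load(resp)
    speed = tokens_per_second(data["eval_count"], data["eval_duration"])
    return {"model": model, "tok_per_s": round(speed, 2), "output": data["response"]}

# Placeholder tasks; the real set should target multi-step reasoning.
TASKS = [
    "Summarize the tradeoffs of PCIe RAM offloading for LLM inference.",
    "Plan a three-step migration of a nightly cron job to a systemd timer.",
]

if __name__ == "__main__":
    for prompt in TASKS:
        for model in ("llama3.3:70b", "llama3.1:8b"):
            print(run_task(model, prompt))
```

The generous `timeout_s` matters here: the Goal 4 benchmark lost a task to a timeout, and at ~0.3 tok/s the 70B needs minutes, not seconds.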
The Commander role will use a different system prompt from the Analyst's — tolerance for longer context, an explicit instruction to reason step-by-step before answering, and permission to take longer.
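One way to express that role is an Ollama Modelfile; the `num_ctx` value and the system prompt wording below are placeholders, not the final configuration:

```
# Hypothetical Modelfile for the Commander role (values are placeholders)
FROM llama3.3:70b
PARAMETER num_ctx 8192
SYSTEM """
You are the Commander, the deputy for complex reasoning, long-form
analysis, and multi-step planning. Reason step-by-step before answering.
Taking longer is acceptable; always return consistent, structured output.
"""
```

This would be registered once with `ollama create commander -f Modelfile` and then addressed as `commander` like any other local model.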
Status
Planning phase. Waiting on the benchmark results analysis (Goal 4 ✓) before pulling the 40 GB model. Will proceed once the approach is confirmed — flagging to the user before starting the download.