Hardware guide — what to build
The single binding constraint for local inference is memory — GPU VRAM, or a Mac's unified memory. This page tells you how much you need for the model you want, and what to actually buy.
The memory math
A model needs room for its weights plus a KV cache for the context it's processing.
Weights scale with parameter count and quantization (bits per weight):
| Quant | Bits/weight | ≈ GB per 1B params | Quality |
|---|---|---|---|
| FP16 | 16 | ~2.0 | reference, rarely needed locally |
| Q8 | 8 | ~1.1 | near-lossless |
| Q4_K_M | ~4.5 | ~0.55 | the sweet spot for local serving |
| Q3 / Q2 | 3 / 2 | ~0.4 / ~0.3 | last resort to fit a bigger model |
So at Q4_K_M: a 7B ≈ 4–5 GB, a 14B ≈ 8–9 GB, a 32B ≈ 18–20 GB, a 70B dense ≈ 40 GB, an 80B MoE (e.g. Qwen3-Coder-Next) ≈ 50 GB.
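The table's per-billion figures can be turned into a quick estimator. This is a minimal sketch: the function name `weights_gb` is made up for illustration, and the multipliers are copied straight from the table. The raw products land a little under the rounded figures quoted above (80B comes out near 44 GB rather than 50 GB), so treat the result as a lower bound rather than an exact file size.

```python
# Approximate GB of weights per 1B parameters, copied from the table above.
GB_PER_BILLION = {
    "FP16": 2.0,
    "Q8": 1.1,
    "Q4_K_M": 0.55,
    "Q3": 0.4,
    "Q2": 0.3,
}


def weights_gb(params_billion: float, quant: str = "Q4_K_M") -> float:
    """Estimate resident weight size in GB for a model of the given size."""
    return params_billion * GB_PER_BILLION[quant]


if __name__ == "__main__":
    for size in (7, 14, 32, 70, 80):
        print(f"{size}B at Q4_K_M: ~{weights_gb(size):.1f} GB")
```

For a MoE model, pass the total parameter count, since all experts stay resident.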
MoE models (mixture-of-experts) hold all expert weights in memory (so memory tracks total params) but only activate a few per token (so speed tracks active params). That's why an 80B MoE can be fast yet still needs ~50 GB resident.
KV cache grows with context length. Budget a few GB on top at typical context (8–32k); at very long context (100k+) the KV cache can rival the weights, so size memory for the context you actually use — not just the weights.
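To see why long context changes the picture, the usual estimate for a standard transformer KV cache is one key and one value vector per layer, per KV head, per token. The sketch below assumes an unquantized FP16 cache and a single sequence. The layer count, head count, and head dimension in the usage lines are an illustrative configuration, not the published specs of any model named on this page. Results are in GiB (1024³ bytes).

```python
def kv_cache_gib(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # assumption: FP16 cache, unquantized
) -> float:
    """Size of the KV cache in GiB for one sequence at the given context."""
    # Factor of 2: one key and one value vector per layer, per KV head, per token.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1024**3


if __name__ == "__main__":
    # Illustrative configuration (assumed): 64 layers, 8 KV heads, head dim 128.
    for ctx in (8_192, 32_768, 131_072):
        print(f"{ctx:>7} tokens: {kv_cache_gib(64, 8, 128, ctx):.1f} GiB")
```

With these assumed dimensions the cache is 2 GiB at 8k tokens, 8 GiB at 32k, and 32 GiB at 128k, which is more than the Q4 weights of a 32B model.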
GPU tiers — what fits
| Memory | Example hardware | Runs well at Q4 |
|---|---|---|
| 8–12 GB | RTX 3060 / 4060, base Macs | 7–8B instruct, small coding models |
| 16 GB | RTX 4060 Ti 16GB, 4070 Ti Super | 13–14B, 7B with long context |
| 24 GB | RTX 3090 / 4090 | 32B coding (e.g. Qwen2.5-Coder-32B) ✓, 14B comfortably |
| 32 GB | RTX 5090 | 32B with more context, 34B |
| 48 GB | 2× RTX 3090, RTX 6000 Ada, A6000 | 70B dense, 32B with big context |
| 64–96 GB | Apple Silicon Mac (unified) | Qwen3-Coder-Next (80B MoE ~50 GB) ✓ |
| 128 GB | Mac Studio / high-mem | 80B MoE with headroom, 70B at higher quant |
| 256 GB+ | 2–6 Mac Thunderbolt cluster | frontier MoEs — see the cluster guide |
The RTX 3090 (24 GB, used, cheap) is the best value-per-VRAM for a single-box node. Two of them (48 GB) is the classic home-lab build for 70B.
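Reading the table amounts to adding weights, KV cache, and some slack, then picking the first tier that covers the sum. A minimal sketch, assuming the tier sizes from the table and a 1 GB `overhead_gb` default that is a guess rather than a measured figure; `smallest_tier` is a hypothetical helper name.

```python
from typing import Optional

# Memory sizes (GB) drawn from the tier table above.
TIERS_GB = (8, 12, 16, 24, 32, 48, 64, 96, 128, 256)


def smallest_tier(
    weights_gb: float,
    kv_cache_gb: float,
    overhead_gb: float = 1.0,  # assumption: runtime buffers, not a measured value
) -> Optional[int]:
    """Return the smallest tier that holds weights + KV cache + overhead, or None."""
    needed = weights_gb + kv_cache_gb + overhead_gb
    for tier in TIERS_GB:
        if tier >= needed:
            return tier
    return None


if __name__ == "__main__":
    print(smallest_tier(20, 2))   # 32B at Q4 with a modest context
    print(smallest_tier(40, 4))   # 70B dense
    print(smallest_tier(50, 4))   # 80B MoE
```

On a Mac, compare against the memory the GPU is actually allowed to use, which is less than the installed total by default.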
Apple Silicon notes
Macs are uniquely good for large local models because unified memory is shared between CPU and GPU — a 128 GB Mac can dedicate most of it to a model. Caveats:
- By default macOS reserves a chunk for the system; you can raise the GPU's share with sudo sysctl iogpu.wired_limit_mb=….
- Bandwidth, not just capacity, sets token speed. Studio (Max/Ultra) chips have far more memory bandwidth than the base chips, so prefer them for big models.
- A single 64–128 GB Mac is the cheapest way to run an 80B MoE coding model without a cluster.
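The bandwidth point can be made concrete with a common rule of thumb: during decoding, each generated token has to stream the active weights from memory once, so memory bandwidth divided by bytes read per token gives a ceiling on tokens per second. The sketch below assumes that simplification and ignores KV cache reads, compute time, and runtime overhead, so real throughput will be lower. The bandwidth values in the usage lines are placeholders, not specs of any particular chip.

```python
def decode_ceiling_tok_s(
    bandwidth_gb_s: float,
    active_params_billion: float,
    gb_per_billion: float = 0.55,  # Q4_K_M figure from the quantization table
) -> float:
    """Upper bound on tokens/s when every token streams the active weights once."""
    gb_read_per_token = active_params_billion * gb_per_billion
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    # Placeholder bandwidths (assumed), dense 32B model at Q4_K_M.
    for bw in (400, 800):
        print(f"{bw} GB/s: at most ~{decode_ceiling_tok_s(bw, 32):.0f} tok/s")
```

For a MoE model, pass the active parameter count rather than the total: memory is sized by total parameters, but the per-token read is sized by the active ones, which is why a large MoE can decode faster than a smaller dense model.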
Building a flagship-competitive coding agent
The realistic goal for a self-hosted coding agent that gets close to closed flagships:
- Best single-box pick: Qwen3-Coder-Next (80B MoE, ~50 GB at Q4). It runs on a 64 GB+ Mac (comfortably on 96–128 GB), or on 2× 24 GB GPUs with partial offload, since 48 GB of VRAM falls just short of the ~50 GB of weights. This is the strongest open coding model you can self-host on one machine.
- Excellent and cheaper: Qwen2.5-Coder-32B (~20 GB) on a single RTX 3090/4090 or a 32–64 GB Mac. A great daily-driver coding model.
- Frontier / 1M context: beyond a single box — pool Macs over Thunderbolt 5. See Build a combined machine.
Reality check (mid-2026): the best open models (e.g. DeepSeek V4-class MoEs ~80% SWE-bench) are close to but still below the best closed flagships (~88% SWE-bench), and the very largest only run on multi-machine clusters. A single 64–128 GB Mac running Qwen3-Coder-Next gets you a genuinely useful, private coding agent today; a Thunderbolt cluster gets you to the open frontier.
Throughput expectations
- A 24 GB GPU on a 32B Q4 model: comfortably interactive (tens of tok/s).
- A single Mac on an 80B MoE: usable but slower than a discrete GPU on a smaller model — bandwidth-bound.
- A Thunderbolt cluster on a huge MoE: tens of tok/s with RDMA, single digits without. It trades speed for the ability to hold the model at all.
How this maps to a node
Set PORTEN_VRAM_BUDGET_MB on the agent to the memory you want to allow for models, and PORTEN_AUTO_DOWNLOAD_MB to the disk you'll allow for cached weights. The Hub packs as many demanded models as fit your budget and unloads idle ones. See Run a node.
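Both variables take megabytes, so the sizing figures from this page need converting first. A minimal sketch: the helper `node_env` is hypothetical, and the 1 GB = 1024 MB conversion is an assumption, since this page does not say whether the agent counts in binary or decimal megabytes.

```python
def node_env(model_memory_gb: float, disk_cache_gb: float) -> dict[str, str]:
    """Build the two budget variables, converting GB to whole MB."""
    return {
        # Assumption: 1 GB = 1024 MB; check how the agent interprets the value.
        "PORTEN_VRAM_BUDGET_MB": str(int(model_memory_gb * 1024)),
        "PORTEN_AUTO_DOWNLOAD_MB": str(int(disk_cache_gb * 1024)),
    }


if __name__ == "__main__":
    # Example: allow 20 GB of memory for models and 200 GB of disk for weights.
    for name, value in node_env(20, 200).items():
        print(f"export {name}={value}")
```

Leave the memory budget below the physical total so the KV cache and the operating system keep some room.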