# Hardware guide — what to build

# Hardware guide — what to build

The single binding constraint for local inference is **memory** — GPU VRAM, or a Mac's unified memory. This page tells you how much you need for the model you want, and what to actually buy.

## The memory math

A model needs room for its **weights** plus a **KV cache** for the context it's processing.

**Weights** scale with parameter count and quantization (bits per weight):

| Quant | Bits/weight | ≈ GB per 1B params | Quality |
|---|---|---|---|
| FP16 | 16 | ~2.0 | reference, rarely needed locally |
| Q8 | 8 | ~1.1 | near-lossless |
| **Q4_K_M** | ~4.5 | **~0.55** | the sweet spot for local serving |
| Q3 / Q2 | 3 / 2 | ~0.4 / ~0.3 | last resort to fit a bigger model |

So at Q4_K_M: a **7B** ≈ 4–5 GB, a **14B** ≈ 8–9 GB, a **32B** ≈ 18–20 GB, a **70B dense** ≈ ~40 GB, an **80B MoE** (e.g. Qwen3-Coder-Next) ≈ ~50 GB.

> **MoE models** (mixture-of-experts) hold *all* expert weights in memory (so memory tracks **total** params) but only activate a few per token (so speed tracks **active** params). That's why an 80B MoE can be fast yet still needs ~50 GB resident.

**KV cache** grows with context length. Budget a few GB on top at typical context (8–32k); at very long context (100k+) the KV cache can rival the weights, so size memory for the context you actually use — not just the weights.

## GPU tiers — what fits

| Memory | Example hardware | Runs well at Q4 |
|---|---|---|
| 8–12 GB | RTX 3060 / 4060, base Macs | 7–8B instruct, small coding models |
| 16 GB | RTX 4060 Ti 16GB, 4070 Ti Super | 13–14B, 7B with long context |
| **24 GB** | **RTX 3090 / 4090** | **32B coding (e.g. Qwen2.5-Coder-32B)** ✓, 14B comfortably |
| 32 GB | RTX 5090 | 32B with more context, 34B |
| 48 GB | 2× RTX 3090, RTX 6000 Ada, A6000 | 70B dense, 32B with big context |
| 64–96 GB | Apple Silicon Mac (unified) | **Qwen3-Coder-Next (80B MoE ~50 GB)** ✓ |
| 128 GB | Mac Studio / high-mem | 80B MoE with headroom, 70B at higher quant |
| 256 GB+ | 2–6 Mac Thunderbolt cluster | frontier MoEs — see the [cluster guide](/docs/cluster-thunderbolt) |

The RTX 3090 (24 GB, used, cheap) is the best value-per-VRAM for a single-box node. Two of them (48 GB) is the classic home-lab build for 70B.

## Apple Silicon notes

Macs are uniquely good for **large** local models because unified memory is shared between CPU and GPU — a 128 GB Mac can dedicate most of it to a model. Caveats:

- By default macOS reserves a chunk for the system; you can raise the GPU's share with `sudo sysctl iogpu.wired_limit_mb=…`.
- Bandwidth, not just capacity, sets token speed. Studio (Max/Ultra) chips have far more memory bandwidth than the base chips — prefer them for big models.
- A single 64–128 GB Mac is the cheapest way to run an 80B MoE coding model without a cluster.

## Building a flagship-competitive coding agent

The realistic goal for a self-hosted coding agent that gets close to closed flagships:

- **Best single-box pick:** **Qwen3-Coder-Next** (80B MoE, ~50 GB at Q4). Runs on a **64 GB+ Mac**, a 96–128 GB Mac comfortably, or a 2× 24 GB + offload setup. This is the strongest open coding model you can self-host on one machine.
- **Excellent and cheaper:** **Qwen2.5-Coder-32B** (~20 GB) on a single **RTX 3090/4090 or a 32–64 GB Mac**. A great daily-driver coding model.
- **Frontier / 1M context:** beyond a single box — pool Macs over Thunderbolt 5. See [Build a combined machine](/docs/cluster-thunderbolt).

> **Reality check (mid-2026):** the best *open* models (e.g. DeepSeek V4-class MoEs ~80% SWE-bench) are close to but still below the best closed flagships (~88% SWE-bench), and the very largest only run on multi-machine clusters. A single 64–128 GB Mac running Qwen3-Coder-Next gets you a genuinely useful, private coding agent today; a Thunderbolt cluster gets you to the open frontier.

## Throughput expectations

- A 24 GB GPU on a 32B Q4 model: comfortably interactive (tens of tok/s).
- A single Mac on an 80B MoE: usable but slower than a discrete GPU on a smaller model — bandwidth-bound.
- A Thunderbolt cluster on a huge MoE: tens of tok/s *with* RDMA, single digits without. It trades speed for the ability to hold the model at all.

## How this maps to a node

Set `PORTEN_VRAM_BUDGET_MB` on the agent to the memory you want to allow for models, and `PORTEN_AUTO_DOWNLOAD_MB` to the disk you'll allow for cached weights. The Hub packs as many demanded models as fit your budget and unloads idle ones. See [Run a node](/docs/running-a-node).
