Hardware guide — what to build

The single binding constraint for local inference is memory — GPU VRAM, or a Mac's unified memory. This page tells you how much you need for the model you want, and what to actually buy.

The memory math

A model needs room for its weights plus a KV cache for the context it's processing.

Weights scale with parameter count and quantization (bits per weight):

| Quant | Bits/weight | ≈ GB per 1B params | Quality |
| --- | --- | --- | --- |
| FP16 | 16 | ~2.0 | reference, rarely needed locally |
| Q8 | 8 | ~1.1 | near-lossless |
| Q4_K_M | ~4.5 | ~0.55 | the sweet spot for local serving |
| Q3 / Q2 | 3 / 2 | ~0.4 / ~0.3 | last resort to fit a bigger model |

So at Q4_K_M: a 7B model ≈ 4–5 GB, a 14B ≈ 8–9 GB, a 32B ≈ 18–20 GB, a 70B dense ≈ 40 GB, and an 80B MoE (e.g. Qwen3-Coder-Next) ≈ 50 GB.
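The arithmetic behind these figures can be sketched as a lookup plus one multiplication. This is a minimal illustration that reuses the approximate GB-per-1B values from the quantization table; the function name `weights_gb` is made up for this example. The raw products come out a little below the rounded figures quoted above, so treat the result as a floor rather than an exact size.

```python
# Approximate GB of weights per 1B parameters, copied from the quantization table.
GB_PER_BILLION = {
    "FP16": 2.0,
    "Q8": 1.1,
    "Q4_K_M": 0.55,
    "Q3": 0.4,
    "Q2": 0.3,
}


def weights_gb(params_billion, quant="Q4_K_M"):
    """Estimate the size of a model's weights in GB (weights only, no KV cache)."""
    return params_billion * GB_PER_BILLION[quant]


for size in (7, 14, 32, 70, 80):
    print(f"{size}B at Q4_K_M: roughly {weights_gb(size):.1f} GB of weights")
```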

MoE (mixture-of-experts) models hold all expert weights in memory, so memory tracks total params, but activate only a few experts per token, so speed tracks active params. That's why an 80B MoE can be fast yet still needs ~50 GB resident.
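A rough way to see the split between what an MoE keeps resident and what it touches per token is sketched below. The 3B active-parameter count is an assumed figure chosen for illustration, not a number stated on this page, and `moe_footprint` is a hypothetical helper name.

```python
GB_PER_BILLION_Q4 = 0.55  # approximate Q4_K_M figure from the quantization table


def moe_footprint(total_params_b, active_params_b):
    """Return (resident_gb, read_per_token_gb) for an MoE at Q4_K_M.

    resident_gb tracks total parameters: every expert must sit in memory.
    read_per_token_gb tracks active parameters: only those are read per token.
    """
    resident_gb = total_params_b * GB_PER_BILLION_Q4
    read_per_token_gb = active_params_b * GB_PER_BILLION_Q4
    return resident_gb, read_per_token_gb


# Assumption: 80B total, 3B active per token (illustrative values only).
resident, per_token = moe_footprint(total_params_b=80, active_params_b=3)
print(f"resident: ~{resident:.0f} GB, read per token: ~{per_token:.2f} GB")
```

The large gap between the two numbers is the whole point: capacity is sized for the first, while speed is governed by the second.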

KV cache grows with context length. Budget a few GB on top at typical context (8–32k); at very long context (100k+) the KV cache can rival the weights, so size memory for the context you actually use — not just the weights.
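To put numbers on this, the standard estimate for a transformer's KV cache is two tensors (keys and values) per layer, each of size KV heads × head dimension × tokens. The sketch below applies that formula; the model shape used in the example (64 layers, 8 KV heads, head dimension 128, 16-bit cache) is an assumption chosen to resemble a 32B-class model, so substitute the values from your model's config.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Estimate KV cache size in GiB for a single sequence.

    The leading factor of 2 covers the separate key and value tensors.
    bytes_per_elem=2 assumes an unquantized 16-bit cache.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens
    return total_bytes / (1024 ** 3)


# Assumed shape for illustration: 64 layers, 8 KV heads, head_dim 128.
for context in (8_192, 32_768, 131_072):
    size = kv_cache_gib(64, 8, 128, context)
    print(f"{context:>7} tokens: {size:.0f} GiB of KV cache")
```

With this assumed shape the cache is a few GiB at 8–32k tokens, and at 128k tokens it exceeds the 18–20 GB that a 32B model's Q4 weights occupy.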

GPU tiers — what fits

| Memory | Example hardware | Runs well at Q4 |
| --- | --- | --- |
| 8–12 GB | RTX 3060 / 4060, base Macs | 7–8B instruct, small coding models |
| 16 GB | RTX 4060 Ti 16GB, 4070 Ti Super | 13–14B, 7B with long context |
| 24 GB | RTX 3090 / 4090 | 32B coding (e.g. Qwen2.5-Coder-32B) ✓, 14B comfortably |
| 32 GB | RTX 5090 | 32B with more context, 34B |
| 48 GB | 2× RTX 3090, RTX 6000 Ada, A6000 | 70B dense, 32B with big context |
| 64–96 GB | Apple Silicon Mac (unified) | Qwen3-Coder-Next (80B MoE ~50 GB) |
| 128 GB | Mac Studio / high-mem | 80B MoE with headroom, 70B at higher quant |
| 256 GB+ | 2–6 Mac Thunderbolt cluster | frontier MoEs (see the cluster guide) |

The RTX 3090 (24 GB, cheap on the used market) is the best value per GB of VRAM for a single-box node. Two of them (48 GB total) are the classic home-lab build for 70B.
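Picking a tier comes down to adding weights, KV cache, and a little slack, then taking the smallest capacity that covers the sum. The sketch below does that against the capacities in the tier table; the 1 GB `overhead_gb` default is an assumed allowance for runtime buffers, not a measured value, and the function name is hypothetical.

```python
# Memory capacities (GB) taken from the tier table, smallest first.
CAPACITIES_GB = [8, 12, 16, 24, 32, 48, 64, 96, 128]


def smallest_tier_gb(weights_gb, kv_cache_gb=0.0, overhead_gb=1.0):
    """Return the smallest listed capacity that fits the model, or None.

    overhead_gb is an assumed allowance for runtime buffers.
    None means the model needs more than a single 128 GB machine.
    """
    needed = weights_gb + kv_cache_gb + overhead_gb
    for capacity in CAPACITIES_GB:
        if capacity >= needed:
            return capacity
    return None


print(smallest_tier_gb(20, kv_cache_gb=2))  # 32B at Q4 with modest context
print(smallest_tier_gb(40, kv_cache_gb=4))  # 70B dense at Q4
print(smallest_tier_gb(50, kv_cache_gb=4))  # 80B MoE at Q4
```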

Apple Silicon notes

Macs are uniquely good for large local models because unified memory is shared between CPU and GPU — a 128 GB Mac can dedicate most of it to a model. Caveats:

  • By default macOS reserves a chunk for the system; you can raise the GPU's share with sudo sysctl iogpu.wired_limit_mb=….
  • Bandwidth, not just capacity, sets token speed. Studio (Max/Ultra) chips have far more memory bandwidth than the base chips — prefer them for big models.
  • A single 64–128 GB Mac is the cheapest way to run an 80B MoE coding model without a cluster.

Building a flagship-competitive coding agent

The realistic goal for a self-hosted coding agent that gets close to closed flagships:

  • Best single-box pick: Qwen3-Coder-Next (80B MoE, ~50 GB at Q4). It runs on a Mac with 64 GB or more (comfortably with 96–128 GB) or on a 2× 24 GB GPU setup with offload. This is the strongest open coding model you can self-host on one machine.
  • Excellent and cheaper: Qwen2.5-Coder-32B (~20 GB) on a single RTX 3090/4090 or a 32–64 GB Mac. A great daily-driver coding model.
  • Frontier / 1M context: beyond a single box — pool Macs over Thunderbolt 5. See Build a combined machine.

Reality check (mid-2026): the best open models (e.g. DeepSeek V4-class MoEs ~80% SWE-bench) are close to but still below the best closed flagships (~88% SWE-bench), and the very largest only run on multi-machine clusters. A single 64–128 GB Mac running Qwen3-Coder-Next gets you a genuinely useful, private coding agent today; a Thunderbolt cluster gets you to the open frontier.

Throughput expectations

  • A 24 GB GPU on a 32B Q4 model: comfortably interactive (tens of tok/s).
  • A single Mac on an 80B MoE: usable but slower than a discrete GPU on a smaller model — bandwidth-bound.
  • A Thunderbolt cluster on a huge MoE: tens of tok/s with RDMA, single digits without. It trades speed for the ability to hold the model at all.
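Because decoding is bandwidth-bound, a common rule of thumb gives an upper limit on speed: each generated token requires reading the active weights once, so tokens per second cannot exceed memory bandwidth divided by the GB read per token. The sketch below applies that rule. The 936 GB/s figure is the published bandwidth of an RTX 3090; the 18 GB is the low end of the 32B Q4 estimate from this page. Real throughput lands below this ceiling because KV cache reads, compute, and software overhead are ignored.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, gb_read_per_token):
    """Upper bound on decode speed for a bandwidth-bound model.

    bandwidth_gb_s: memory bandwidth of the device in GB/s.
    gb_read_per_token: weights read per token in GB
        (all weights for a dense model, only active experts for an MoE).
    """
    if gb_read_per_token <= 0:
        raise ValueError("gb_read_per_token must be positive")
    return bandwidth_gb_s / gb_read_per_token


# Dense 32B at Q4 (~18 GB of weights) on a 936 GB/s GPU.
print(f"ceiling: {decode_ceiling_tok_s(936, 18):.0f} tok/s")
```

For an MoE, pass the size of the active parameters rather than the full resident size, which is why a large MoE can decode faster than its memory footprint suggests.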

How this maps to a node

Set PORTEN_VRAM_BUDGET_MB on the agent to the memory you want to allow for models, and PORTEN_AUTO_DOWNLOAD_MB to the disk space you'll allow for cached weights. The Hub packs as many demanded models as fit within your budget and unloads idle ones. See Run a node.
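Since both settings are expressed in MB while the rest of this page reasons in GB, a small conversion helper can avoid slips. The sketch below is illustrative only: `node_env` is a hypothetical helper, and it assumes 1 GB = 1024 MB, which may not match how the agent interprets the values. Only the two variable names come from this page.

```python
from typing import Dict


def node_env(model_memory_gb: float, disk_cache_gb: float) -> Dict[str, str]:
    """Convert GB budgets into the MB values the two settings expect.

    Assumption: 1 GB = 1024 MB. Check the agent's docs for the exact unit.
    """
    return {
        "PORTEN_VRAM_BUDGET_MB": str(int(model_memory_gb * 1024)),
        "PORTEN_AUTO_DOWNLOAD_MB": str(int(disk_cache_gb * 1024)),
    }


# Example: allow 20 GB of memory for models and 100 GB of disk for weights.
for name, value in node_env(20, 100).items():
    print(f"export {name}={value}")
```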
