Build a combined machine (Thunderbolt 5)
Pool 2–6 Apple-Silicon Macs over Thunderbolt into a single logical node with their combined unified memory, so Porten can route a model no single box could hold — large MoEs, 1M-context models, the open frontier. To the Hub it's one node: one agent, one tunnel, advertising the aggregate VRAM. No Hub changes — this is purely host setup.
How it fits together
[ Mac 1 ]──TB5──[ Mac 2 ]──TB5──[ Mac 3 ]      exo pools unified memory and
    └────────── exo (MLX distributed) ──────┐  shards ONE model across the boxes,
                                            ▼  exposing ONE OpenAI endpoint.
                            head Mac: porten (engine=openai)
                                            │  PORTEN_VRAM_BUDGET_MB = pooled total
                                            ▼
                            Porten Hub ── one node, big VRAM
The cluster handles the cross-box tensor/pipeline parallelism internally, and tokens stream back through the single tunnel. The agent's openai engine proxies and health-checks the endpoint, and PORTEN_VRAM_BUDGET_MB overrides detected VRAM, so the fleet treats the whole cluster as one fat node.
Hardware
- 2–6 Apple-Silicon Macs (Studio/Max preferred — more memory bandwidth). More unified memory per box = fewer boxes for a given model.
- Thunderbolt 5 between them. TB5 gives ~80 Gb/s, which is what makes sharding across boxes practical. Use RDMA over TB5: it's the difference between ~5 tok/s and ~25 tok/s on a big MoE.
- Daisy-chain or hub the Macs over TB5; exo auto-discovers them.
Memory math (≈128 GB/Mac; Studios can have far more):
| Macs | Pooled memory | Unlocks |
|---|---|---|
| 2 | ~256 GB | DeepSeek V4-Flash Q4 (~110 GB), 70B dense, mid MoEs |
| 3–4 | ~384–512 GB | MiniMax M3 (1M context), 235B-class MoE |
| 6 | ~768 GB | DeepSeek V4-Pro (1.6T) at low quant — the open frontier |
V4-Pro is ~400 GB even at Q2 → ~6 high-memory Macs. V4-Flash (284B total / 13B active) is the realistic frontier target for a 2–4 Mac cluster. 1M context is only practical via MiniMax M3 on a 3–4 Mac pool — KV cache for 1M tokens is huge, so size memory for context, not just weights.
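To make "size memory for context, not just weights" concrete, here is a rough sketch of the usual KV-cache estimate for standard (MHA/GQA) attention. The function name `kv_cache_gib` and the dimensions in the example call are hypothetical and do not describe MiniMax M3 or any other real model; architectures with other attention schemes can need much less.

```shell
#!/bin/sh
# Rough KV-cache size for standard (MHA/GQA) attention, in whole GiB.
# bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value * tokens
kv_cache_gib() {
  layers=$1; kv_heads=$2; head_dim=$3; bytes_per_value=$4; tokens=$5
  total_bytes=$(( 2 * layers * kv_heads * head_dim * bytes_per_value * tokens ))
  echo $(( total_bytes / 1073741824 ))
}

# Hypothetical dimensions (NOT a real model's): 60 layers, 8 KV heads,
# head_dim 128, fp16 values (2 bytes), 1,000,000 tokens of context.
kv_cache_gib 60 8 128 2 1000000   # prints 228
```

With these made-up dimensions the cache alone would take about 228 GiB on top of the weights, which is why a 1M-context target pushes you toward the larger pools.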
Step 1 — pool the Macs with exo
On every Mac (head + workers), install exo and start it. They auto-discover over Thunderbolt and elect a head that serves an OpenAI-compatible API (default :8000). Pull/serve the model you want — exo fetches the weights and shards them across the pool. Verify from the head:
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"deepseek-v4-flash","messages":[{"role":"user","content":"hi"}]}'
(MLX distributed, llama.cpp rpc-server, or GPUStack are alternatives — any runtime that exposes one OpenAI endpoint across the boxes works.)
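If you want to script the verification, a small helper along these lines can confirm that a given id appears in the `/v1/models` response. It is only a sketch: `has_model` is a hypothetical name, it assumes the OpenAI-style `{"data":[{"id":...}]}` response shape, and it does a plain-text match rather than real JSON parsing (so characters such as `.` in an id match loosely).

```shell
#!/bin/sh
# has_model ID: read a /v1/models JSON response on stdin and succeed
# if some entry has "id": "ID". A plain-text match, not a JSON parser.
has_model() {
  grep -q "\"id\"[[:space:]]*:[[:space:]]*\"$1\""
}

# Usage against the head Mac (port 8000 as in the curl commands above):
#   curl -s http://localhost:8000/v1/models | has_model deepseek-v4-flash \
#     && echo "served" || echo "missing"
```

The same id is the one you will later put in PORTEN_ENGINE_MODELS, so checking it here catches typos early.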
Step 2 — run porten on the head Mac
Point the agent's openai engine at exo and tell it the aggregate usable VRAM:
export PORTEN_HUB_URL=wss://porten.ai
export PORTEN_ENGINE=openai
export PORTEN_ENGINE_URL=http://localhost:8000 # exo's endpoint
export PORTEN_VRAM_BUDGET_MB=262144 # pooled total across the Macs
export PORTEN_ENGINE_MODELS=deepseek-v4-flash # what exo is serving
porten login # approve in your browser (no token)
porten service install # run as a service
It registers as one node advertising PORTEN_VRAM_BUDGET_MB, so the fleet can place a model that needs that much memory on it.
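The budget value is simple arithmetic, sketched below. `vram_budget_mb` is a hypothetical helper, and the 85% figure is an assumed safety margin for macOS and other processes, not a Porten recommendation; the 100% case reproduces the 262144 used above for two 128 GB Macs.

```shell
#!/bin/sh
# vram_budget_mb MACS GB_PER_MAC USABLE_PERCENT
# Pooled memory in MB (1 GB counted as 1024 MB, matching 262144 for 256 GB),
# scaled by the share you are willing to hand to the model.
vram_budget_mb() {
  echo $(( $1 * $2 * 1024 * $3 / 100 ))
}

vram_budget_mb 2 128 100   # 262144, the value used above
vram_budget_mb 2 128 85    # 222822, leaving ~15% for macOS (assumed margin)
```

You could then set the variable from it, for example `export PORTEN_VRAM_BUDGET_MB=$(vram_budget_mb 2 128 100)`.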
Step 3 — make the model available
Two things have to line up for the cluster to serve a frontier model:
- It's offered by the Hub. The Hub serves a curated catalog, so a model only routes if it's been enabled there. If you operate the Hub, enable it; if you don't, ask whoever does to offer the model you're serving.
- Your node advertises it. The cluster node advertises whatever exo serves, via PORTEN_ENGINE_MODELS, so the canonical id must match exactly what exo serves.
Once both hold, point any client at https://porten.ai/v1 with model set to that id.
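As an illustration, a client request could look like the sketch below. It assumes the Hub exposes the usual OpenAI-style `/chat/completions` path under `https://porten.ai/v1` and accepts a bearer key; the `PORTEN_API_KEY` variable, the Authorization header, and both function names are assumptions, so substitute whatever credentials your Hub actually issues.

```shell
#!/bin/sh
# chat_payload MODEL PROMPT: print a minimal chat-completions request body.
# No JSON escaping is done, so keep quotes and backslashes out of PROMPT.
chat_payload() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}\n' "$1" "$2"
}

# porten_chat MODEL PROMPT: POST the body to the Hub's OpenAI-style endpoint.
# PORTEN_API_KEY and the Bearer header are assumptions; use whatever
# credentials your Hub actually issues.
porten_chat() {
  chat_payload "$1" "$2" | curl -s https://porten.ai/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer ${PORTEN_API_KEY:-}" \
    -d @-
}

# Usage: porten_chat deepseek-v4-flash "hi"
```

The model argument must be the same id that the Hub offers and that the node advertises.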
Caveats
- Throughput is modest — a Mac cluster trades speed for memory. Expect tens of tok/s on big MoEs with RDMA, single digits without.
- Quality — the best open frontier (V4-Pro ~80.6% SWE-bench) is close to but below the closed flagships (~88.7%).
- Keep the runtime current — frontier MoE architectures need recent exo / MLX / llama.cpp.
- If you just want a strong coding agent on one machine, you don't need a cluster — a 64–128 GB Mac runs Qwen3-Coder-Next. See the Hardware guide.