
Build a combined machine (Thunderbolt 5)

Pool 2–6 Apple-Silicon Macs over Thunderbolt into a single logical node with their combined unified memory, so Porten can route a model no single box could hold — large MoEs, 1M-context models, the open frontier. To the Hub it's one node: one agent, one tunnel, advertising the aggregate VRAM. No Hub changes — this is purely host setup.

How it fits together

[ Mac 1 ]──TB5──[ Mac 2 ]──TB5──[ Mac 3 ]     exo pools unified memory and
   └────────── exo (MLX distributed) ──────┐   shards ONE model across the boxes,
                                            ▼   exposing ONE OpenAI endpoint.
                          head Mac: porten (engine=openai)
                                            │  PORTEN_VRAM_BUDGET_MB = pooled total
                                            ▼
                                   Porten Hub  ── one node, big VRAM

The cluster handles the cross-box tensor/pipeline parallelism internally; tokens stream back through the single tunnel. The agent's openai engine proxies and health-checks the endpoint, and PORTEN_VRAM_BUDGET_MB overrides detected VRAM, so the fleet treats the whole cluster as one large node.

Hardware

  • 2–6 Apple-Silicon Macs (Studio/Max preferred — more memory bandwidth). More unified memory per box = fewer boxes for a given model.
  • Thunderbolt 5 between them. Use RDMA over TB5 — it's the difference between ~5 tok/s and ~25 tok/s on a big MoE. TB5 gives ~80 Gb/s, which is what makes sharding across boxes practical.
  • Daisy-chain or hub the Macs over TB5; exo auto-discovers them.

Memory math (≈128 GB/Mac; Studios can have far more):

Macs   Pooled memory   Unlocks
2      ~256 GB         DeepSeek V4-Flash Q4 (~110 GB), 70B dense, mid MoEs
3–4    ~384–512 GB     MiniMax M3 (1M context), 235B-class MoE
6      ~768 GB         DeepSeek V4-Pro (1.6T) at low quant, the open frontier

V4-Pro is ~400 GB even at Q2 → ~6 high-memory Macs. V4-Flash (284B total / 13B active) is the realistic frontier target for a 2–4 Mac cluster. 1M context is only practical via MiniMax M3 on a 3–4 Mac pool — KV cache for 1M tokens is huge, so size memory for context, not just weights.
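The arithmetic behind these figures can be sketched in a few lines of shell. The helper names pooled_mb and weights_gb are illustrative and not part of any tool mentioned here. The weight estimate (parameters × bits per weight ÷ 8) ignores KV cache, activations, and runtime overhead, so treat it as a rough lower bound rather than a sizing guarantee.

```shell
#!/bin/sh
# Hypothetical helpers for the memory math above; not part of porten or exo.

# Pooled memory in MB for $1 Macs with $2 GB each (1 GB = 1024 MB).
pooled_mb() {
  echo $(( $1 * $2 * 1024 ))
}

# Rough weight footprint in GB: parameters (in billions) x bits per weight / 8.
# Excludes KV cache and runtime overhead, so real usage will be higher.
weights_gb() {
  echo $(( $1 * $2 / 8 ))
}

pooled_mb 2 128     # two 128 GB Macs -> 262144
weights_gb 1600 2   # 1.6T parameters at 2 bits per weight -> 400
```

The first result is the MB value of a two-Mac pool, and the second matches the ~400 GB quoted for V4-Pro at Q2.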

Step 1 — pool the Macs with exo

On every Mac (head + workers), install exo and start it. They auto-discover over Thunderbolt and elect a head that serves an OpenAI-compatible API (default :8000). Pull/serve the model you want — exo fetches the weights and shards them across the pool. Verify from the head:

curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"deepseek-v4-flash","messages":[{"role":"user","content":"hi"}]}'

(MLX distributed, llama.cpp rpc-server, or GPUStack are alternatives — any runtime that exposes one OpenAI endpoint across the boxes works.)
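Because the runtime may still be fetching and sharding weights when you first check, it can help to poll the endpoint rather than test it once. The following is a minimal sketch of a generic retry helper; the name retry, the attempt count, and the delay are illustrative choices, not anything exo or porten provides.

```shell
#!/bin/sh
# Hypothetical helper: run a command until it succeeds.
# $1 = maximum attempts, $2 = seconds to sleep between attempts, rest = command.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0                       # stop as soon as the command succeeds
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1                                 # every attempt failed
}

# Example (not run here): wait up to about 5 minutes for the endpoint to answer.
#   retry 30 10 curl -fsS http://localhost:8000/v1/models
```

curl's -f flag makes HTTP error responses count as failures, so the loop keeps waiting until the endpoint returns a successful response.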

Step 2 — run porten on the head Mac

Point the agent's openai engine at exo and tell it the aggregate usable VRAM:

export PORTEN_HUB_URL=wss://porten.ai
export PORTEN_ENGINE=openai
export PORTEN_ENGINE_URL=http://localhost:8000     # exo's endpoint
export PORTEN_VRAM_BUDGET_MB=262144                # pooled total across the Macs
export PORTEN_ENGINE_MODELS=deepseek-v4-flash      # what exo is serving

porten login                                 # approve in your browser (no token)
porten service install                       # run as a service

It registers as one node advertising PORTEN_VRAM_BUDGET_MB, so the fleet can place a model that needs that much memory on it.
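Before installing the service, a quick sanity check of the environment can catch a missing variable or a malformed budget. This is a sketch only: porten_preflight is a hypothetical helper, not a porten subcommand, and the requirement that PORTEN_VRAM_BUDGET_MB be a plain integer is an assumption based on the variable's name and the example value above.

```shell
#!/bin/sh
# Hypothetical preflight check for the Step 2 variables; not part of the porten CLI.
porten_preflight() {
  for name in PORTEN_HUB_URL PORTEN_ENGINE PORTEN_ENGINE_URL \
              PORTEN_VRAM_BUDGET_MB PORTEN_ENGINE_MODELS; do
    eval "value=\${$name:-}"               # read the variable named by $name
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      return 1
    fi
  done
  # Assumption: the budget is a whole number of megabytes (digits only).
  case "$PORTEN_VRAM_BUDGET_MB" in
    *[!0-9]*)
      echo "PORTEN_VRAM_BUDGET_MB must be an integer number of MB" >&2
      return 1
      ;;
  esac
  return 0
}
```

Run porten_preflight in the same shell where you exported the variables; it returns non-zero and names the first problem it finds.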

Step 3 — make the model available

Two things have to line up for the cluster to serve a frontier model:

  1. It's offered by the Hub. The Hub serves a curated catalog, so a model only routes if it's been enabled there. If you operate the Hub, enable it; if you don't, ask whoever does to offer the model you're serving.
  2. Your node advertises it. The cluster node advertises whatever exo serves, via PORTEN_ENGINE_MODELS — so the canonical id must match exactly what exo serves.

Once both hold, point any client at https://porten.ai/v1 with model set to that id.
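Since the id has to match exactly, one way to check it is to compare the id you advertise against what the runtime lists. The sketch below assumes the endpoint returns the usual OpenAI-style list ({"data":[{"id":"..."}]}) and that PORTEN_ENGINE_MODELS holds a single id; has_model is an illustrative helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the /v1/models JSON on stdin lists a model
# whose id equals $1 exactly. Whitespace is stripped first so the match does
# not depend on JSON formatting (ids containing spaces would not work here).
has_model() {
  tr -d '[:space:]' | grep -qF "\"id\":\"$1\""
}

# Example (not run here): compare the advertised id with what the runtime serves.
#   curl -fsS "$PORTEN_ENGINE_URL/v1/models" | has_model "$PORTEN_ENGINE_MODELS" \
#     || echo "advertised id not served by the runtime" >&2
```

The closing quote in the pattern is what makes the match exact: an advertised id of deepseek-v4 will not match a served deepseek-v4-flash.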

Caveats

  • Throughput is modest — a Mac cluster trades speed for memory. Expect tens of tok/s on big MoEs with RDMA, single digits without.
  • Quality — the best open frontier (V4-Pro ~80.6% SWE-bench) is close to but below the closed flagships (~88.7%).
  • Keep the runtime current — frontier MoE architectures need recent exo / MLX / llama.cpp.
  • If you just want a strong coding agent on one machine, you don't need a cluster — a 64–128 GB Mac runs Qwen3-Coder-Next. See the Hardware guide.