# Build a combined machine (Thunderbolt 5)

Pool **2–6 Apple-Silicon Macs** over Thunderbolt into a **single logical node** with their combined unified memory, so Porten can route a model no single box could hold — large MoEs, 1M-context models, the open frontier. To the Hub it's **one node**: one agent, one tunnel, advertising the *aggregate* VRAM. No Hub changes — this is purely host setup.

## How it fits together

```
[ Mac 1 ]──TB5──[ Mac 2 ]──TB5──[ Mac 3 ]     exo pools unified memory and
   └────────── exo (MLX distributed) ──────┐   shards ONE model across the boxes,
                                            ▼   exposing ONE OpenAI endpoint.
                          head Mac: porten (engine=openai)
                                            │  PORTEN_VRAM_BUDGET_MB = pooled total
                                            ▼
                                   Porten Hub  ── one node, big VRAM
```

The cluster handles the cross-box tensor/pipeline parallelism internally, and tokens stream back through the single tunnel. The agent's `openai` engine (porten on the head Mac) proxies and health-checks the exo endpoint, and `PORTEN_VRAM_BUDGET_MB` overrides detected VRAM, so the fleet treats the whole cluster as one fat node.

## Hardware

- **2–6 Apple-Silicon Macs** (Studio/Max preferred — more memory bandwidth). More unified memory per box = fewer boxes for a given model.
- **Thunderbolt 5** between them. Use **RDMA over TB5** — it's the difference between ~5 tok/s and ~25 tok/s on a big MoE. TB5 gives ~80 Gb/s, which is what makes sharding across boxes practical.
- Daisy-chain or hub the Macs over TB5; exo auto-discovers them.

Memory math (≈128 GB/Mac; Studios can have far more):

| Macs | Pooled memory | Unlocks |
|---|---|---|
| 2 | ~256 GB | DeepSeek **V4-Flash** Q4 (~110 GB), 70B dense, mid MoEs |
| 3–4 | ~384–512 GB | **MiniMax M3** (1M context), 235B-class MoE |
| 6 | ~768 GB | DeepSeek **V4-Pro** (1.6T) at low quant — the open frontier |

> V4-Pro is ~400 GB even at Q2 → ~6 high-memory Macs. V4-Flash (284B total / 13B active) is the realistic frontier target for a 2–4 Mac cluster. **1M context** is only practical via MiniMax M3 on a 3–4 Mac pool — KV cache for 1M tokens is huge, so size memory for context, not just weights.
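
The sizing rule above (weights plus context must fit in the pool) can be sketched as a small shell check. `fits_in_pool` is a hypothetical helper, not part of Porten or exo, and the 40 GB context allowance in the first call is an assumed figure for illustration only. It also treats all pooled memory as usable, which is optimistic because macOS and the runtime need some headroom.

```bash
# Hypothetical helper: does a model fit in the pooled unified memory?
# Args: macs, gb_per_mac, weights_gb, extra_gb (your own estimate for
#       KV cache and runtime overhead; defaults to 0)
fits_in_pool() {
  local macs=$1 gb_per_mac=$2 weights_gb=$3 extra_gb=${4:-0}
  local pooled=$(( macs * gb_per_mac ))
  local needed=$(( weights_gb + extra_gb ))
  if [ "$needed" -le "$pooled" ]; then
    echo "fits: need ${needed} GB of ${pooled} GB pooled"
  else
    echo "too big: need ${needed} GB, only ${pooled} GB pooled"
    return 1
  fi
}

# V4-Flash Q4 (~110 GB) plus an ASSUMED 40 GB for context on 2 x 128 GB Macs
fits_in_pool 2 128 110 40
# V4-Pro at Q2 (~400 GB) on 3 x 128 GB Macs does not fit
fits_in_pool 3 128 400 || true
```

Swap in your own per-Mac memory and context estimate; the point is to budget for the KV cache alongside the weights before buying or cabling hardware.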

## Step 1 — pool the Macs with exo

On every Mac (head + workers), install [exo](https://github.com/exo-explore/exo) and start it. They auto-discover over Thunderbolt and elect a head that serves an OpenAI-compatible API (default `:8000`). Pull/serve the model you want — exo fetches the weights and shards them across the pool. Verify from the head:

```bash
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"deepseek-v4-flash","messages":[{"role":"user","content":"hi"}]}'
```

(MLX distributed, llama.cpp `rpc-server`, or GPUStack are alternatives — any runtime that exposes one OpenAI endpoint across the boxes works.)

## Step 2 — run porten on the head Mac

Point the agent's `openai` engine at exo and tell it the **aggregate** usable VRAM:

```bash
export PORTEN_HUB_URL=wss://porten.ai
export PORTEN_ENGINE=openai
export PORTEN_ENGINE_URL=http://localhost:8000     # exo's endpoint
export PORTEN_VRAM_BUDGET_MB=262144                # pooled total across the Macs
export PORTEN_ENGINE_MODELS=deepseek-v4-flash      # what exo is serving

porten login                                 # approve in your browser (no token)
porten service install                       # run as a service
```

It registers as **one node** advertising `PORTEN_VRAM_BUDGET_MB`, so the fleet can place a model that needs that much memory on it.
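
As a rough illustration of where the `262144` above comes from, the budget is the number of Macs times memory per Mac, converted to MB. `pooled_budget_mb` is a hypothetical helper, and the optional per-Mac reserve is an assumption (headroom for macOS and the runtime), not a documented Porten recommendation.

```bash
# Hypothetical helper: pooled VRAM budget in MB.
# Args: macs, gb_per_mac, reserve_gb_per_mac (ASSUMED headroom; defaults to 0)
pooled_budget_mb() {
  local macs=$1 gb_per_mac=$2 reserve_gb=${3:-0}
  echo $(( macs * (gb_per_mac - reserve_gb) * 1024 ))
}

# 2 Macs x 128 GB, no reserve: 2 * 128 * 1024 = 262144
export PORTEN_VRAM_BUDGET_MB="$(pooled_budget_mb 2 128)"
echo "$PORTEN_VRAM_BUDGET_MB"
```

Passing a reserve, for example `pooled_budget_mb 2 128 16`, advertises a smaller budget so the fleet does not place a model that would leave the Macs with no headroom.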

## Step 3 — make the model available

Two things have to line up for the cluster to serve a frontier model:

1. **It's offered by the Hub.** The Hub serves a curated catalog, so a model only routes if it's been enabled there. If you operate the Hub, enable it; if you don't, ask whoever does to offer the model you're serving.
2. **Your node advertises it.** The cluster node advertises the models named in `PORTEN_ENGINE_MODELS`, so the canonical id there must match exactly what exo serves.
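
One way to catch an id mismatch before it causes routing failures is to compare the ids you plan to advertise with the endpoint's model listing. The sketch below is a hypothetical helper, not a Porten command. It assumes the listing follows the OpenAI `/v1/models` shape (objects with an `"id"` field) and that `PORTEN_ENGINE_MODELS` is a comma-separated list, which the guide does not state.

```bash
# Hypothetical helper: check that every id in a comma-separated list
# appears as an "id" in an OpenAI-style /v1/models response on stdin.
check_model_ids() {
  local wanted=$1 listing id missing=0
  # Strip whitespace so '"id": "x"' and '"id":"x"' compare the same.
  listing=$(tr -d '[:space:]')
  local IFS=','
  for id in $wanted; do
    if printf '%s' "$listing" | grep -Fq "\"id\":\"${id}\""; then
      echo "ok: ${id}"
    else
      echo "missing: ${id}"
      missing=1
    fi
  done
  return "$missing"
}

# Usage against a live exo on the head Mac (not run here):
#   curl -s http://localhost:8000/v1/models | check_model_ids "$PORTEN_ENGINE_MODELS"
```

The function exits non-zero if any id is missing, so it can gate a startup script.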

Once both hold, point any client at `https://porten.ai/v1` with `model` set to that id.
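
A minimal client sketch, assuming the Hub accepts OpenAI-style chat requests under `/v1` with bearer-token auth. The `PORTEN_API_KEY` variable, the `Authorization` header, and both function names are assumptions for illustration; check your Hub's client documentation for the actual credentials.

```bash
# Hypothetical helper: build an OpenAI-style chat request body.
# Note: no JSON escaping, so keep the prompt free of quotes and backslashes.
chat_body() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' "$1" "$2"
}

# Hypothetical wrapper: PORTEN_API_KEY and the bearer header are ASSUMED.
ask_hub() {
  curl -sS https://porten.ai/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer ${PORTEN_API_KEY}" \
    -d "$(chat_body "$1" "$2")"
}

# Example (not run here):
#   ask_hub deepseek-v4-flash "hi"
```

The `model` value is the same id the node advertises, which is why the exact-match requirement in item 2 matters.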

## Caveats

- **Throughput is modest** — a Mac cluster trades speed for memory. Expect tens of tok/s on big MoEs with RDMA, single digits without.
- **Quality** — the best open frontier (V4-Pro ~80.6% SWE-bench) is close to but below the closed flagships (~88.7%).
- **Keep the runtime current** — frontier MoE architectures need recent exo / MLX / llama.cpp.
- If you just want a strong coding agent on one machine, you don't need a cluster — a 64–128 GB Mac runs Qwen3-Coder-Next. See the [Hardware guide](/docs/hardware).
