# Porten — developer documentation

> OpenAI-compatible LLM inference, routed across a fleet of community and self-hosted GPU nodes. Point any OpenAI SDK at one base URL, pick a model, and requests route to a node that can serve it — loading the model on demand if needed. EU-sovereign by design.

Source: https://porten.ai/docs

---

# Overview

**Porten is OpenAI-compatible LLM inference, routed across a fleet of GPU nodes.** You point any OpenAI SDK at one base URL, pick a model, and your request is routed to a node that can serve it. If no node has that model loaded yet, the fleet loads it on demand and your request completes once it's ready.

It is built to be **EU-sovereign**: the Hub and its nodes run on European-owned infrastructure, and API keys can be pinned to a region so traffic never leaves it.

## The shape of it

```
your app ──OpenAI SDK──▶ Porten Hub ──▶ a node that serves the model
(base_url + API key)     (routing, billing,    (your Mac, a 3090 box,
                          on-demand loading)    a Thunderbolt cluster…)
```

There are two sides to the marketplace:

- **Build** — you're a developer. Get an API key, call the API, ship. See the [Quickstart](/docs/quickstart).
- **Earn** — you have a GPU. Enroll a node, serve models, get paid for the tokens it produces. See [Run a node](/docs/running-a-node).

## What makes it different

- **One API, many models.** A single endpoint exposes every model the fleet offers — small instruct models, coding models, reasoning models, and frontier MoEs running on multi-machine clusters.
- **Models load on demand.** You can select any *offered* model even if nothing is serving it this second. The first request triggers the fleet to load it (the [playground](/build/playground) shows a real download/progress bar; API clients just see a slightly longer first request). Idle models are unloaded automatically to free VRAM. See [Models & on-demand loading](/docs/models).
- **Demand-driven capacity.** The Hub measures what's being requested and places models on nodes that fit them — evicting lower-demand models to make room when something is urgently needed.
- **Self-hostable end to end.** The Hub is a single Go binary (Postgres + Redis). Nodes are a thin agent. You can run the whole thing yourself.

## Compatibility at a glance

| Endpoint | Status |
|---|---|
| `POST /v1/chat/completions` (streaming + non-streaming) | ✅ |
| `POST /v1/embeddings` | ✅ |
| `GET /v1/models` | ✅ |

Tool/function calling, JSON mode (`response_format`), vision (inline `data:` images), and separated reasoning output (`reasoning_content`) are supported where the underlying model can do them. Errors follow OpenAI's `{"error":{...}}` shape. Full details in the [API reference](/docs/api-reference).

## Next steps

- **[Quickstart](/docs/quickstart)** — your first request in a few minutes.
- **[Use it from your tools](/docs/integrations)** — OpenCode, Cursor, the OpenAI SDKs, LangChain.
- **[Hardware guide](/docs/hardware)** — what to build to run good models locally.
- **[Build a combined machine](/docs/cluster-thunderbolt)** — pool Macs over Thunderbolt 5 for frontier-size models.

> **For machines:** every page here is available as raw Markdown — append `.md` to any docs URL (e.g. `/docs/overview.md`). There's also an [`/llms.txt`](/llms.txt) index and a single-file [`/llms-full.txt`](/llms-full.txt).

---

# Quickstart

You'll get an API key, list the models on offer, and stream your first chat completion.

## 1. Get an API key

Sign in to the portal, open **API keys**, and create one. It looks like `sk-porten-…`. Treat it like a password.

- Portal: [`/build/keys`](/build/keys)
- Keys can be restricted to a region (e.g. EU-only) — see [Regions & data sovereignty](/docs/regions).

Set it in your shell so the examples below work as-is:

```bash
export PORTEN_API_KEY="sk-porten-…"
export PORTEN_BASE_URL="https://porten.ai/v1"
```

## 2. List the models on offer

```bash
curl "$PORTEN_BASE_URL/models" \
  -H "Authorization: Bearer $PORTEN_API_KEY"
```

Every **offered** model is listed, whether or not it's loaded this instant. Each carries an `x_porten` block telling you whether a node is serving it right now (`ready`) or whether it will load on first use.

## 3. Stream a chat completion

```bash
curl "$PORTEN_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $PORTEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder-32b",
    "messages": [
      {"role": "system", "content": "You are a concise coding assistant."},
      {"role": "user", "content": "Write a Go function that reverses a string rune-safely."}
    ],
    "stream": true
  }'
```

The response is OpenAI-style Server-Sent Events (`data: {…}` chunks ending in `data: [DONE]`).

> **First call to a cold model takes longer.** If you pick a model that isn't loaded yet, the fleet loads it on demand and your request blocks until it's ready (a big model's first load can take a few minutes while weights download). Subsequent calls are fast. There's nothing special to handle — the request just takes longer. See [Models & on-demand loading](/docs/models).

## 4. Use it from an SDK

Any OpenAI SDK works — just override the base URL.

Python:

```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-porten-…",
    base_url="https://porten.ai/v1",
)

stream = client.chat.completions.create(
    model="qwen2.5-coder-32b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```

JavaScript / TypeScript:

```ts
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.PORTEN_API_KEY,
  baseURL: "https://porten.ai/v1",
});

const stream = await client.chat.completions.create({
  model: "qwen2.5-coder-32b",
  messages: [{ role: "user", content: "Hello!" }],
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```

## 5. Try it without writing code

The [playground](/build/playground) lets you chat with any model in the browser, watch a cold model's load progress as a real progress bar, and copy the request back out as a `curl` command.

## Next

- **[API reference](/docs/api-reference)** — every endpoint, parameter, and error code.
- **[Use it from your tools](/docs/integrations)** — wire it into OpenCode, Cursor, LangChain.

---

# Use it from your tools

Anything that speaks the OpenAI API works with Porten. The recipe is always the same: set the **base URL** to `https://porten.ai/v1` and the **API key** to your `sk-porten-…` key.

## OpenCode

[OpenCode](https://opencode.ai) is a terminal coding agent. Add Porten as a provider in `~/.config/opencode/opencode.jsonc`:

```jsonc
{
  "provider": {
    "porten": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Porten",
      "options": {
        "baseURL": "https://porten.ai/v1",
        "apiKey": "{env:PORTEN_API_KEY}"
      },
      "models": {
        "qwen3-coder-next": { "name": "Qwen3-Coder-Next" },
        "qwen2.5-coder-32b": { "name": "Qwen2.5 Coder 32B" }
      }
    }
  }
}
```

Then select a `porten/…` model in OpenCode. Because models load on demand, the first turn against a cold model takes longer while the fleet provisions it — the request simply waits rather than failing.

## Cursor

In **Settings → Models**, enable *Override OpenAI Base URL*, set it to `https://porten.ai/v1`, and paste your key. Add the model id (e.g. `qwen2.5-coder-32b`) under custom models. Cursor will route chat through Porten.
## Continue (VS Code / JetBrains)

In `~/.continue/config.json`:

```json
{
  "models": [
    {
      "title": "Porten — Qwen Coder",
      "provider": "openai",
      "model": "qwen2.5-coder-32b",
      "apiBase": "https://porten.ai/v1",
      "apiKey": "sk-porten-…"
    }
  ]
}
```

## LangChain (Python)

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="qwen2.5-coder-32b",
    base_url="https://porten.ai/v1",
    api_key="sk-porten-…",
)
print(llm.invoke("Summarize the CAP theorem in two sentences.").content)
```

## LlamaIndex (Python)

```python
from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    model="qwen2.5-coder-32b",
    api_base="https://porten.ai/v1",
    api_key="sk-porten-…",
    is_chat_model=True,
)
```

## Anything else

If your tool has an "OpenAI-compatible" or "custom OpenAI endpoint" option, use it. Set:

- **Base URL / API base:** `https://porten.ai/v1`
- **API key:** `sk-porten-…`
- **Model:** any id from [`GET /v1/models`](/docs/api-reference#get-v1models)

> **Tip:** if a tool sends a model id the catalog doesn't recognize, you'll get a `404 model_not_found`. List the models first and copy the exact canonical id.

---

# API reference

Base URL: `https://porten.ai/v1`. Authenticate with `Authorization: Bearer sk-porten-…`. The surface is OpenAI-compatible, so any OpenAI SDK works by overriding `base_url`.

## POST /v1/chat/completions

The core endpoint. Streaming and non-streaming.

**Request:**

```json
{
  "model": "qwen2.5-coder-32b",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Write a haiku about the aurora." }
  ],
  "temperature": 0.7,
  "max_tokens": 256,
  "stream": false
}
```

**Response (non-streaming):**

```json
{
  "id": "chatcmpl-porten-7f3a2b",
  "object": "chat.completion",
  "model": "qwen2.5-coder-32b",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "Green fire dances…" },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 32, "completion_tokens": 41, "total_tokens": 73 }
}
```

**Response (streaming, `stream: true`)** — OpenAI-style SSE:

```
data: {"id":"…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Green"},"finish_reason":null}]}

data: {"id":"…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```

### Parameters

| Param | Status | Behaviour |
|---|---|---|
| `model` | ✅ | Canonical id; unknown → `404 model_not_found` |
| `messages` | ✅ | `system` / `user` / `assistant` / `tool` roles |
| `stream` | ✅ | SSE chunks |
| `stream_options.include_usage` | ✅ | Hub fills `usage` from its own count |
| `max_tokens` / `max_completion_tokens` | ✅ | Both accepted |
| `temperature`, `top_p`, `stop`, `seed` | ⚠️ | Passed to the engine; `seed` honoured only if it supports it |
| `presence_penalty`, `frequency_penalty` | ⚠️ | Passed through; ignored if the engine lacks them |
| `response_format` | ✅ | JSON mode / JSON schema, forwarded to the engine (best-effort per engine) |
| `tools`, `tool_choice` | ✅ | Routed to a model whose catalog entry declares `tools` (the same capability `/v1/models` shows); response carries `tool_calls` + `finish_reason: "tool_calls"`. A model without tool support → `400` (clear message), not a capacity error |
| `content` with `image_url` | ✅ | Inline `data:` images forwarded to vision models. Remote `http(s)` image URLs are **not** fetched (SSRF protection) — inline them as data URLs |
| `n` | ⚠️ | Only `n=1`; `n>1` → `400 unsupported_parameter` |
| `user`, `metadata` | ✅ | Logged for usage/abuse |

**Principle:** unknown *convenience* params are ignored silently (forward-compatible); params that would change semantics but can't be honoured (`n>1`) are rejected with `400` rather than silently producing the wrong result.

### Reasoning models

Models that "think" (e.g. DeepSeek-R1 family) return their reasoning separately as `reasoning_content` (a delta field in streaming), kept distinct from the answer text — so you can show or hide the chain of thought.

**Budget enough tokens.** Reasoning is generated *before* the answer and counts against `max_tokens`. With a small cap the model can spend the whole budget thinking and you get `finish_reason: "length"` with empty `content`. For reasoning models, set `max_tokens` to at least **4096** (more for hard prompts).

## POST /v1/embeddings

```json
{ "model": "nomic-embed-text", "input": ["text to embed", "and another"] }
```

```json
{
  "object": "list",
  "data": [
    { "object": "embedding", "index": 0, "embedding": [0.0123, -0.045] },
    { "object": "embedding", "index": 1, "embedding": [0.0210, -0.011] }
  ],
  "model": "nomic-embed-text",
  "usage": { "prompt_tokens": 12, "total_tokens": 12 }
}
```

## GET /v1/models

Every offered model, aggregated and deduplicated across the fleet.

```json
{
  "object": "list",
  "data": [
    {
      "id": "qwen2.5-coder-32b",
      "object": "model",
      "owned_by": "porten",
      "x_porten": { "ready": true, "type": "chat", "ctx": 32768 }
    },
    {
      "id": "qwen3-coder-next",
      "object": "model",
      "owned_by": "porten",
      "x_porten": { "ready": false, "type": "chat", "ctx": 262144 }
    }
  ]
}
```

`ready: false` means the model is offered but not loaded this instant — your first request will trigger an on-demand load. See [Models & on-demand loading](/docs/models).
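As a rough illustration of reading the `x_porten` block, the sketch below splits a `/v1/models` payload into models that are already loaded and models that will load on first use. It runs on a sample copied from the response above, so no network call is made. `split_by_readiness` is a hypothetical helper name, and treating a missing `x_porten` block as "cold" is an assumption of this sketch, not documented behaviour.

```python
import json

# Sample shaped like the GET /v1/models response shown above.
SAMPLE = json.loads("""
{
  "object": "list",
  "data": [
    {"id": "qwen2.5-coder-32b", "object": "model", "owned_by": "porten",
     "x_porten": {"ready": true, "type": "chat", "ctx": 32768}},
    {"id": "qwen3-coder-next", "object": "model", "owned_by": "porten",
     "x_porten": {"ready": false, "type": "chat", "ctx": 262144}}
  ]
}
""")


def split_by_readiness(models_response):
    """Return (warm_ids, cold_ids) from a /v1/models payload.

    Assumption: a model without an x_porten block is counted as cold.
    """
    warm, cold = [], []
    for model in models_response.get("data", []):
        ready = model.get("x_porten", {}).get("ready", False)
        (warm if ready else cold).append(model["id"])
    return warm, cold


warm, cold = split_by_readiness(SAMPLE)
print("served right now:", warm)
print("will load on first use:", cold)
```

A client could use the cold list to warn users that the first request to those models may take longer, or to send a warm-up request ahead of time.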
## Headers

| Header | Direction | Note |
|---|---|---|
| `Authorization: Bearer sk-porten-…` | in | Required |
| `X-Request-Id` | out | Correlation id, echoed in logs |
| `X-Porten-Node` | out | Which node served the request |
| `Retry-After` | out | On `429` / `503` |

## Errors

All errors follow OpenAI's format: `{"error":{"message","type","code","param"}}`.

| HTTP | `code` | Meaning |
|---|---|---|
| 401 | `invalid_api_key` | Invalid or revoked key |
| 403 | `model_not_allowed` | The key may not use this model (region/policy) |
| 404 | `model_not_found` | No node advertises this model and it isn't offered |
| 429 | `rate_limit_exceeded` | Quota/rate exhausted (`Retry-After`) |
| 502 | `node_error` | All candidate nodes failed |
| 503 | `no_available_node` / `model_warming` | Model exists but no free/healthy node, or it's still loading (`Retry-After`) |
| 504 | `gateway_timeout` | Total timeout exceeded |

> A `503 model_warming` is expected the first time you hit a cold model and the load takes longer than the request's warm-up budget. Retry — it'll be ready shortly. Most clients won't see it because the request blocks until the model is ready.

---

# Models & on-demand loading

## The catalog is curated

Porten doesn't expose every model under the sun — it offers an **opinionated, curated catalog**. An operator enables the models worth serving (good quality-per-VRAM, current architectures) and disables the rest, so hardware isn't wasted on models that a newer one strictly beats.

[`GET /v1/models`](/docs/api-reference#get-v1models) returns exactly what's offered. Each model has a canonical id (e.g. `qwen2.5-coder-32b`), a context length, capability flags (`chat`, `tools`, `json`, vision), and a VRAM footprint.

## Loading on demand

**You can request any offered model, loaded or not.** The catalog decouples "offered" from "currently in memory":

1. You send a request for a model that no node has loaded.
2. The Hub recognizes it's offered and tells a node that can fit it to load it — for a built-in node, that means downloading the weights (once) and starting the engine.
3. Your request **blocks** until the model is serving, then completes normally.

For an interactive client like the [playground](/build/playground), this shows as a real progress bar — byte-level download percentage, then "loading into memory", then the answer streams.

For an API client (OpenCode, a script), there's nothing to handle: **the request just takes longer** on the first cold call. A first load of a large model can take several minutes while weights download; after that it's warm and fast. If a load takes longer than the request's warm-up budget, you get a `503` with code `model_warming` — retry and it'll be ready.

### Checking warm-up progress

The portal polls a public endpoint for live load progress:

```
GET /public/v1/models/warming?model=qwen3-coder-next
→ { "model": "…", "ready": false, "phase": "downloading",
    "done_mb": 22272, "total_mb": 49131, "pct": 45 }
```

`phase` moves `downloading → loading → ready`.

## Idle eviction

VRAM is finite, so the fleet **unloads models you're not using**. A provisioned model that goes a full demand window (default ~10 minutes) with no requests is unloaded to free room — its weights stay on disk, so it reloads quickly next time.

## Demand-driven placement and eviction

The Hub continuously measures demand per model (served and unmet requests). When a model is requested and can't fit alongside what's already loaded, the planner **evicts a lower-demand model immediately** to make room — a just-requested model doesn't have to wait out an anti-thrash delay. This is what lets you "just ask for" a big model and have the fleet rearrange itself to serve it.

## Picking a model

- **Coding agent:** `qwen3-coder-next` (frontier-ish open coding model; needs a large-VRAM node or a cluster) or `qwen2.5-coder-32b` (excellent, fits a single 24–48 GB GPU or a 64 GB Mac).
- **General chat:** a current 7–14B instruct model is fast and cheap.
- **Reasoning:** a DeepSeek-R1-distill class model when you want visible chain-of-thought.
- **Frontier / huge context:** see [Build a combined machine](/docs/cluster-thunderbolt) — some models only fit on a multi-machine cluster.

What actually fits depends on your hardware — see the [Hardware guide](/docs/hardware) for the VRAM-per-model table.

---

# Regions & data sovereignty

Porten is built to be **EU-sovereign**: the Hub and the nodes it routes to run on European-owned infrastructure, not the EU regions of US-owned clouds. If data residency matters to you, this is the point of the project.

## Region-pinned API keys

An API key can be restricted to one or more regions. A request made with that key only routes to nodes in those regions; if none can serve the model there, you get a clear error rather than a silent fall-back to another region.

- Set a key's regions in the portal under [API keys](/build/keys).
- A request that can't be satisfied within the key's allowed regions returns `403 model_not_allowed` or `404 model_not_found` (no eligible node), never a cross-region route.

## Node trust classes

Nodes carry an operator-assigned **trust class** (e.g. community, verified, trusted, confidential). A key's policy can require a minimum class, so sensitive workloads only land on vetted nodes. Region and class are operator-authoritative — a node can't self-declare them — so routing guarantees like "EU-only, trusted-only" hold.

## What's logged

Usage is metered per request (model, token counts, timestamp) for billing and abuse prevention. Prompt and completion **content** is not part of the durable usage record. See the portal's account and usage views for what's retained.

> If you're evaluating Porten for a regulated workload, the combination of region-pinned keys + minimum node class is the mechanism that keeps inference on infrastructure you trust.
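To make the region and trust-class policy concrete, here is a minimal sketch of a fail-closed node filter. It is not the Hub's routing code: `Node`, `eligible_nodes`, and `NoEligibleNode` are hypothetical names, and the low-to-high ordering of trust classes is an assumption (the docs list the classes only as examples).

```python
from dataclasses import dataclass

# Assumed ordering, lowest to highest. Illustrative only.
TRUST_ORDER = ["community", "verified", "trusted", "confidential"]


@dataclass(frozen=True)
class Node:
    name: str
    region: str
    trust_class: str


class NoEligibleNode(Exception):
    """Raised instead of routing outside the key's policy (fail-closed)."""


def eligible_nodes(nodes, allowed_regions, min_class):
    """Keep only nodes inside the allowed regions at or above min_class."""
    floor = TRUST_ORDER.index(min_class)
    matches = [
        n for n in nodes
        if n.region in allowed_regions
        and TRUST_ORDER.index(n.trust_class) >= floor
    ]
    if not matches:
        # Never widen the search to other regions or lower classes.
        raise NoEligibleNode("no node satisfies the key's region/class policy")
    return matches


fleet = [
    Node("a", "eu", "community"),
    Node("b", "eu", "trusted"),
    Node("c", "us", "confidential"),
]
print([n.name for n in eligible_nodes(fleet, {"eu"}, "trusted")])
```

The point of the design is the `raise`: when nothing matches, the caller gets an error rather than a quiet route to node `c` in another region.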
---

# Sovereign inference — your hardware, your region

For regulated and public-sector workloads you can guarantee that inference runs **only on machines you own** and **only in the region you choose**. The policy is enforced at routing time, not merely promised: if no compliant node can serve, the request **fails** — it never falls back to hardware or a region you didn't allow.

This builds on [Regions & data sovereignty](/docs/regions) and [Run a node](/docs/running-a-node).

## What you get

- **Own hardware only** — requests route exclusively to nodes you operate.
- **Your jurisdiction** — pinned to the EU (or a region you set).
- **Fail-closed** — no compliant node → a clean `503`, never a silent leak.
- **Managed for you** — on-demand model loading, scheduling, routing and uptime. Your team ships against one OpenAI-compatible API without operating ML infra.

## 1. Bring your hardware online

On each of your machines (run them in the EU for an EU-sovereign setup):

```bash
curl -fsSL https://porten.ai/install.sh | sh
```

It installs the node, you approve it in your account from the browser, and it registers as **yours**. See [Run a node](/docs/running-a-node). A node carries the region it's onboarded in — make sure yours are tagged for your jurisdiction.

## 2. Lock the key down

In **Build → API keys**, on the key your application uses:

- **Fleet → "My nodes only"** — routing is restricted to nodes you own.
- **Region → EU** (the region picker) — routing is restricted to EU nodes.

Both constraints are stored on the key and applied **together** at routing time. Operators can also set them per key via the admin API.

## 3. Point your app at the API

```bash
curl https://porten.ai/v1/chat/completions \
  -H "Authorization: Bearer $PORTEN_API_KEY" \
  -d '{"model":"…","messages":[{"role":"user","content":"…"}]}'
```

Every request from this key now runs only on your hardware, in your region.

## 4. Verify it's enforced (fail-closed)

Stop your nodes (or take them off the network) and send a request. You get:

```json
503 {"error":{"code":"no_available_node","type":"service_unavailable", ... }}
```

— rather than the request routing somewhere else. That's the guarantee: **no compliant node, no inference.** It can't silently spill to a machine or region you didn't allow.

## Total control: self-host the platform

For an air-gapped or fully in-house deployment, run the **entire** platform — the Hub included — on your own infrastructure. It's a single Go binary plus Postgres and Redis (a Docker Compose stack), so nothing, including the control plane, leaves your network. A self-host bundle is available on request.

## What about the control plane?

With the **hosted** platform, a request transits the Hub (European-owned) on its way to your node and back — inference still happens only on your hardware, but the prompt passes through the Hub in transit. With a **self-hosted** Hub it never leaves your network at all. Hardware-enforced privacy on *shared* infrastructure (confidential compute via TEE + remote attestation) is on the roadmap.

---

# Run a node & earn

Have a GPU — a gaming card, a Mac with lots of unified memory, or a cluster? Serve models on Porten and get paid for the tokens you produce. A node is a small agent (`porten`) that runs an inference engine and streams results back to the Hub.

## Install in one line

On any Mac or Linux box:

```bash
curl -fsSL https://porten.ai/install.sh | sh
```

It downloads the agent, picks an engine (uses [Ollama](https://ollama.com) if it's running, otherwise the built-in engine), enrolls the machine **through your browser** — you approve it in the portal, no token to paste — and installs the background service. On a Mac it installs the [desktop app](/download) instead.

That's it: the machine appears under **Earn** and starts serving.
## How enrollment works (browser login)

`porten login` prints a short code and a URL. Open it on any device, sign in, and approve the machine — its key is bound to your account at that moment. The agent then keeps a persistent connection to the Hub and registers what it can serve; the Hub routes matching requests to it, and you're paid per token (see Payouts in the portal).

No screen on the box? Fine — the link prints in the terminal and you approve it from your phone or laptop.

## Engines

`porten` supports several back-ends — pick the one that matches your setup:

| Engine | Use when | Notes |
|---|---|---|
| `builtin` | Single box, you want the Hub to manage models | Runs `llama-server`; **auto-downloads** GGUF weights from the catalog's `download_ref` and loads/unloads on demand. This is what powers on-demand loading. |
| `ollama` | You already run [Ollama](https://ollama.com) | The agent advertises an allow-list of your Ollama models. |
| `openai` | You front a runtime that exposes an OpenAI endpoint | e.g. an [exo](/docs/cluster-thunderbolt) Mac cluster, vLLM, or llama.cpp server. |

## Configuration

The installer picks sensible defaults. To tune them, set env vars **before** `porten service install` (they're baked into the service unit):

```bash
export PORTEN_ENGINE=builtin            # builtin | ollama | openai
export PORTEN_VRAM_BUDGET_MB=24000      # how much VRAM the agent may use
export PORTEN_AUTO_DOWNLOAD_MB=200000   # disk budget for cached weights
```

For an Ollama node, set `PORTEN_ENGINE=ollama` and an allow-list (`PORTEN_ENGINE_MODELS=llama3.2:3b,deepseek-r1:latest`). For a cluster, see [Build a combined machine](/docs/cluster-thunderbolt).

The agent runs as a service — `launchd` on macOS, a `systemd --user` unit on Linux (run `loginctl enable-linger` so it survives without a login session).
## Automated & bulk setup

For scripted provisioning (cloud-init, fleets of identical machines) you can skip the browser: mint an enrollment token in the portal (**Earn → Add node**, [`/earn/add`](/earn/add)) and bake it in.

```bash
porten enroll
porten service install
```

## What "provisionable" means

A built-in node is **provisionable**: the Hub may fetch and run any enabled catalog model that fits its VRAM budget, on demand. It packs as many *demanded* models as fit, unloads idle ones, and downloads new weights when something is requested that isn't on disk yet. You set the VRAM and disk budgets; the Hub handles placement.

Ollama / OpenAI-engine nodes serve a fixed set you've installed — they aren't auto-provisioned, and their models aren't idle-evicted by the Hub.

## Getting paid

Connect a payout account in the portal. Serving is metered per token; payouts run on a schedule against your served volume. The portal's **Earn** section shows your nodes, what they're serving, and your earnings.

## What to build

See the [Hardware guide](/docs/hardware) for which models fit which hardware, and the [Thunderbolt cluster guide](/docs/cluster-thunderbolt) for pooling multiple Macs into one big node.

---

# Hardware guide — what to build

The single binding constraint for local inference is **memory** — GPU VRAM, or a Mac's unified memory. This page tells you how much you need for the model you want, and what to actually buy.

## The memory math

A model needs room for its **weights** plus a **KV cache** for the context it's processing.


**Weights** scale with parameter count and quantization (bits per weight):

| Quant | Bits/weight | ≈ GB per 1B params | Quality |
|---|---|---|---|
| FP16 | 16 | ~2.0 | reference, rarely needed locally |
| Q8 | 8 | ~1.1 | near-lossless |
| **Q4_K_M** | ~4.5 | **~0.55** | the sweet spot for local serving |
| Q3 / Q2 | 3 / 2 | ~0.4 / ~0.3 | last resort to fit a bigger model |

So at Q4_K_M: a **7B** ≈ 4–5 GB, a **14B** ≈ 8–9 GB, a **32B** ≈ 18–20 GB, a **70B dense** ≈ 40 GB, an **80B MoE** (e.g. Qwen3-Coder-Next) ≈ 50 GB.

> **MoE models** (mixture-of-experts) hold *all* expert weights in memory (so memory tracks **total** params) but only activate a few per token (so speed tracks **active** params). That's why an 80B MoE can be fast yet still needs ~50 GB resident.

**KV cache** grows with context length. Budget a few GB on top at typical context (8–32k); at very long context (100k+) the KV cache can rival the weights, so size memory for the context you actually use — not just the weights.

## GPU tiers — what fits

| Memory | Example hardware | Runs well at Q4 |
|---|---|---|
| 8–12 GB | RTX 3060 / 4060, base Macs | 7–8B instruct, small coding models |
| 16 GB | RTX 4060 Ti 16GB, 4070 Ti Super | 13–14B, 7B with long context |
| **24 GB** | **RTX 3090 / 4090** | **32B coding (e.g. Qwen2.5-Coder-32B)** ✓, 14B comfortably |
| 32 GB | RTX 5090 | 32B with more context, 34B |
| 48 GB | 2× RTX 3090, RTX 6000 Ada, A6000 | 70B dense, 32B with big context |
| 64–96 GB | Apple Silicon Mac (unified) | **Qwen3-Coder-Next (80B MoE ~50 GB)** ✓ |
| 128 GB | Mac Studio / high-mem | 80B MoE with headroom, 70B at higher quant |
| 256 GB+ | 2–6 Mac Thunderbolt cluster | frontier MoEs — see the [cluster guide](/docs/cluster-thunderbolt) |

The RTX 3090 (24 GB, used, cheap) is the best value-per-VRAM for a single-box node. Two of them (48 GB) is the classic home-lab build for 70B.
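The memory math above can be sketched as a small calculator. The GB-per-billion figures come from the quantization table; the KV-cache formula is the standard one for transformer attention (keys plus values, per layer, per KV head, per token). The function names and the 2 GB headroom are assumptions of this sketch, and the layer/head shape in the example is illustrative rather than taken from any model card.

```python
# Approximate GB of weights per 1B parameters, from the quantization table.
GB_PER_BILLION = {"FP16": 2.0, "Q8": 1.1, "Q4_K_M": 0.55}


def weights_gb(params_billion, quant="Q4_K_M"):
    """Rough weight footprint in GB (before runtime overhead)."""
    return params_billion * GB_PER_BILLION[quant]


def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache size in GB for a given context length.

    The leading 2 accounts for storing both keys and values;
    bytes_per_value=2 assumes an FP16 cache.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9


def fits(memory_gb, params_billion, quant, kv_gb, headroom_gb=2.0):
    """True if weights + KV cache + an assumed headroom fit in memory_gb."""
    return weights_gb(params_billion, quant) + kv_gb + headroom_gb <= memory_gb


# Illustrative 32B shape: 64 layers, 8 KV heads, head dimension 128.
kv_8k = kv_cache_gb(8192, layers=64, kv_heads=8, head_dim=128)
kv_32k = kv_cache_gb(32768, layers=64, kv_heads=8, head_dim=128)

print(f"32B weights at Q4_K_M: {weights_gb(32):.1f} GB")
print(f"KV cache at 8k / 32k tokens: {kv_8k:.1f} GB / {kv_32k:.1f} GB")
print("fits 24 GB at 8k context:", fits(24, 32, "Q4_K_M", kv_8k))
print("fits 24 GB at 32k context:", fits(24, 32, "Q4_K_M", kv_32k))
```

Under these assumptions the same 32B model that fits a 24 GB card at 8k context no longer fits at 32k, which is the reason the tier table lists "32B with more context" one row up.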
## Apple Silicon notes

Macs are uniquely good for **large** local models because unified memory is shared between CPU and GPU — a 128 GB Mac can dedicate most of it to a model. Caveats:

- By default macOS reserves a chunk for the system; you can raise the GPU's share with `sudo sysctl iogpu.wired_limit_mb=…`.
- Bandwidth, not just capacity, sets token speed. Studio (Max/Ultra) chips have far more memory bandwidth than the base chips — prefer them for big models.
- A single 64–128 GB Mac is the cheapest way to run an 80B MoE coding model without a cluster.

## Building a flagship-competitive coding agent

The realistic goal for a self-hosted coding agent that gets close to closed flagships:

- **Best single-box pick:** **Qwen3-Coder-Next** (80B MoE, ~50 GB at Q4). Runs on a **64 GB+ Mac**, a 96–128 GB Mac comfortably, or a 2× 24 GB + offload setup. This is the strongest open coding model you can self-host on one machine.
- **Excellent and cheaper:** **Qwen2.5-Coder-32B** (~20 GB) on a single **RTX 3090/4090 or a 32–64 GB Mac**. A great daily-driver coding model.
- **Frontier / 1M context:** beyond a single box — pool Macs over Thunderbolt 5. See [Build a combined machine](/docs/cluster-thunderbolt).

> **Reality check (mid-2026):** the best *open* models (e.g. DeepSeek V4-class MoEs ~80% SWE-bench) are close to but still below the best closed flagships (~88% SWE-bench), and the very largest only run on multi-machine clusters. A single 64–128 GB Mac running Qwen3-Coder-Next gets you a genuinely useful, private coding agent today; a Thunderbolt cluster gets you to the open frontier.

## Throughput expectations

- A 24 GB GPU on a 32B Q4 model: comfortably interactive (tens of tok/s).
- A single Mac on an 80B MoE: usable but slower than a discrete GPU on a smaller model — bandwidth-bound.
- A Thunderbolt cluster on a huge MoE: tens of tok/s *with* RDMA, single digits without. It trades speed for the ability to hold the model at all.
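"Bandwidth-bound" can be made concrete with a back-of-the-envelope ceiling: during decoding, each generated token has to read roughly every *active* weight once, so tokens per second cannot exceed memory bandwidth divided by the active weight bytes. The sketch below is that estimate only. It ignores KV-cache reads, compute, and interconnect cost, so real numbers will be lower; `decode_ceiling_tok_s` is a hypothetical name, and the bandwidth figures are illustrative placeholders, not measured specs.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, active_params_billion, gb_per_billion=0.55):
    """Upper bound on decode speed for a memory-bandwidth-bound model.

    gb_per_billion defaults to the Q4_K_M figure from the memory math.
    For an MoE, pass the ACTIVE parameter count, not the total.
    """
    gb_read_per_token = active_params_billion * gb_per_billion
    return bandwidth_gb_s / gb_read_per_token


# Illustrative inputs: a dense 32B on a ~900 GB/s GPU, and an MoE with
# 13B active parameters on a machine with ~800 GB/s of memory bandwidth.
dense_32b = decode_ceiling_tok_s(900, 32)
moe_13b_active = decode_ceiling_tok_s(800, 13)

print(f"dense 32B ceiling: {dense_32b:.0f} tok/s")
print(f"MoE (13B active) ceiling: {moe_13b_active:.0f} tok/s")
```

The comparison shows why memory tracks total parameters while speed tracks active ones: the MoE reads far fewer bytes per token even though it needs more memory to stay resident.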
## How this maps to a node

Set `PORTEN_VRAM_BUDGET_MB` on the agent to the memory you want to allow for models, and `PORTEN_AUTO_DOWNLOAD_MB` to the disk you'll allow for cached weights. The Hub packs as many demanded models as fit your budget and unloads idle ones. See [Run a node](/docs/running-a-node).

---

# Build a combined machine (Thunderbolt 5)

Pool **2–6 Apple-Silicon Macs** over Thunderbolt into a **single logical node** with their combined unified memory, so Porten can route a model no single box could hold — large MoEs, 1M-context models, the open frontier.

To the Hub it's **one node**: one agent, one tunnel, advertising the *aggregate* VRAM. No Hub changes — this is purely host setup.

## How it fits together

```
[ Mac 1 ]──TB5──[ Mac 2 ]──TB5──[ Mac 3 ]      exo pools unified memory and
     └────────── exo (MLX distributed) ──────┐  shards ONE model across the boxes,
                                             ▼  exposing ONE OpenAI endpoint.
               head Mac: porten (engine=openai)
                                             │  PORTEN_VRAM_BUDGET_MB = pooled total
                                             ▼
                    Porten Hub ── one node, big VRAM
```

The cluster does the cross-box tensor/pipeline parallelism internally; tokens stream back through the single tunnel. The Hub's `openai` engine proxies and health-checks the endpoint, and `PORTEN_VRAM_BUDGET_MB` overrides detected VRAM — so the fleet treats the whole cluster as one fat node.

## Hardware

- **2–6 Apple-Silicon Macs** (Studio/Max preferred — more memory bandwidth). More unified memory per box = fewer boxes for a given model.
- **Thunderbolt 5** between them. Use **RDMA over TB5** — it's the difference between ~5 tok/s and ~25 tok/s on a big MoE. TB5 gives ~80 Gb/s, which is what makes sharding across boxes practical.
- Daisy-chain or hub the Macs over TB5; exo auto-discovers them.


Memory math (≈128 GB/Mac; Studios can have far more):

| Macs | Pooled memory | Unlocks |
|---|---|---|
| 2 | ~256 GB | DeepSeek **V4-Flash** Q4 (~110 GB), 70B dense, mid MoEs |
| 3–4 | ~384–512 GB | **MiniMax M3** (1M context), 235B-class MoE |
| 6 | ~768 GB | DeepSeek **V4-Pro** (1.6T) at low quant — the open frontier |

> V4-Pro is ~400 GB even at Q2 → ~6 high-memory Macs. V4-Flash (284B total / 13B active) is the realistic frontier target for a 2–4 Mac cluster. **1M context** is only practical via MiniMax M3 on a 3–4 Mac pool — KV cache for 1M tokens is huge, so size memory for context, not just weights.

## Step 1 — pool the Macs with exo

On every Mac (head + workers), install [exo](https://github.com/exo-explore/exo) and start it. They auto-discover over Thunderbolt and elect a head that serves an OpenAI-compatible API (default `:8000`). Pull/serve the model you want — exo fetches the weights and shards them across the pool.

Verify from the head:

```bash
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"deepseek-v4-flash","messages":[{"role":"user","content":"hi"}]}'
```

(MLX distributed, llama.cpp `rpc-server`, or GPUStack are alternatives — any runtime that exposes one OpenAI endpoint across the boxes works.)

## Step 2 — run porten on the head Mac

Point the agent's `openai` engine at exo and tell it the **aggregate** usable VRAM:

```bash
export PORTEN_HUB_URL=wss://porten.ai
export PORTEN_ENGINE=openai
export PORTEN_ENGINE_URL=http://localhost:8000   # exo's endpoint
export PORTEN_VRAM_BUDGET_MB=262144              # pooled total across the Macs
export PORTEN_ENGINE_MODELS=deepseek-v4-flash    # what exo is serving

porten login             # approve in your browser (no token)
porten service install   # run as a service
```

It registers as **one node** advertising `PORTEN_VRAM_BUDGET_MB`, so the fleet can place a model that needs that much memory on it.
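The `262144` used above is simply the pooled memory of two 128 GB Macs expressed in MiB. A small sketch of that arithmetic follows; `pooled_vram_budget_mb` is a hypothetical helper, and the optional per-Mac reserve is an assumption of this sketch for leaving some memory to macOS, not a documented agent setting.

```python
def pooled_vram_budget_mb(macs, gb_per_mac, reserve_gb_per_mac=0):
    """Value for PORTEN_VRAM_BUDGET_MB: usable GB per Mac, summed, in MiB."""
    usable_gb = macs * (gb_per_mac - reserve_gb_per_mac)
    return usable_gb * 1024


# Two 128 GB Macs with no reserve reproduce the value used in Step 2.
print(f"export PORTEN_VRAM_BUDGET_MB={pooled_vram_budget_mb(2, 128)}")

# Holding back 8 GB per Mac for the system gives a more conservative budget.
print(f"export PORTEN_VRAM_BUDGET_MB={pooled_vram_budget_mb(2, 128, reserve_gb_per_mac=8)}")
```

Because the budget overrides detected VRAM, advertising more than the cluster can really hold would let the Hub place a model that cannot load, so erring on the low side is the safer choice.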
## Step 3 — make the model available

Two things have to line up for the cluster to serve a frontier model:

1. **It's offered by the Hub.** The Hub serves a curated catalog, so a model only routes if it's been enabled there. If you operate the Hub, enable it; if you don't, ask whoever does to offer the model you're serving.
2. **Your node advertises it.** The cluster node advertises whatever exo serves, via `PORTEN_ENGINE_MODELS` — so the canonical id must match exactly what exo serves.

Once both hold, point any client at `https://porten.ai/v1` with `model` set to that id.

## Caveats

- **Throughput is modest** — a Mac cluster trades speed for memory. Expect tens of tok/s on big MoEs with RDMA, single digits without.
- **Quality** — the best open frontier (V4-Pro ~80.6% SWE-bench) is close to but below the closed flagships (~88.7%).
- **Keep the runtime current** — frontier MoE architectures need recent exo / MLX / llama.cpp.
- If you just want a strong coding agent on one machine, you don't need a cluster — a 64–128 GB Mac runs Qwen3-Coder-Next. See the [Hardware guide](/docs/hardware).

---