
Models & on-demand loading

The catalog is curated

Porten doesn't expose every model under the sun — it offers an opinionated, curated catalog. An operator enables the models worth serving (good quality-per-VRAM, current architectures) and disables the rest, so hardware isn't wasted on models that a newer one strictly beats. GET /v1/models returns exactly what's offered.

Each model has a canonical id (e.g. qwen2.5-coder-32b), a context length, capability flags (chat, tools, json, vision), and a VRAM footprint.
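
As a rough illustration of how a client might use those attributes, the sketch below models a catalog entry and filters by capability and VRAM budget. The field names (`context_length`, `capabilities`, `vram_gb`), the model ids, and every number are hypothetical placeholders; the actual shape of the GET /v1/models response is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    """One catalog entry. Field names are assumptions, not Porten's schema."""
    id: str
    context_length: int
    capabilities: frozenset
    vram_gb: float


def select_models(catalog, required, max_vram_gb):
    """Return entries that have every required capability and fit the VRAM budget."""
    required = frozenset(required)
    return [
        m for m in catalog
        if required <= m.capabilities and m.vram_gb <= max_vram_gb
    ]


# Placeholder data for illustration only; not real catalog values.
CATALOG = [
    ModelEntry("example-coder-32b", 32768, frozenset({"chat", "tools", "json"}), 20.0),
    ModelEntry("example-vision-7b", 8192, frozenset({"chat", "vision"}), 6.0),
    ModelEntry("example-huge-vl", 131072, frozenset({"chat", "vision", "tools"}), 90.0),
]

if __name__ == "__main__":
    for entry in select_models(CATALOG, {"chat", "vision"}, max_vram_gb=24):
        print(entry.id)
```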

Loading on demand

You can request any offered model, loaded or not. The catalog decouples "offered" from "currently in memory":

  1. You send a request for a model that no node has loaded.
  2. The Hub recognizes it's offered and tells a node that can fit it to load it — for a built-in node, that means downloading the weights (once) and starting the engine.
  3. Your request blocks until the model is serving, then completes normally.

For an interactive client like the playground, this shows as a real progress bar — byte-level download percentage, then "loading into memory", then the answer streams. For an API client (OpenCode, a script), there's nothing to handle: the request just takes longer on the first cold call. A first load of a large model can take several minutes while weights download; after that it's warm and fast.

If a load takes longer than the request's warm-up budget, you get a 503 with code model_warming. The load keeps going, so retry after a short wait and the request completes once the model is serving.
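
A script that wants to ride out a cold start could wrap its call in a small retry loop, sketched below. This assumes the error arrives as a JSON body with a top-level `code` field equal to `model_warming`; the exact body layout is not documented here, and the attempt count and delay are arbitrary choices. The `send` callable is a hypothetical stand-in for whatever HTTP client you use and returns `(status, body)`.

```python
import time


class ModelStillWarming(Exception):
    """Raised when the model is still loading after all retry attempts."""


def is_model_warming(status, body):
    # Assumption: the 503 body is a dict carrying {"code": "model_warming"}.
    return status == 503 and isinstance(body, dict) and body.get("code") == "model_warming"


def call_with_warmup_retry(send, max_attempts=5, delay_s=10.0, sleep=time.sleep):
    """Call send() until it returns something other than a model_warming 503."""
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if not is_model_warming(status, body):
            return status, body
        if attempt < max_attempts:
            sleep(delay_s)  # give the load time to progress before retrying
    raise ModelStillWarming(f"model still warming after {max_attempts} attempts")
```

Any other status, including a different 503, is returned to the caller unchanged, so ordinary error handling stays where it already lives.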

Checking warm-up progress

The portal polls a public endpoint for live load progress:

GET /public/v1/models/warming?model=qwen3-coder-next
→ { "model": "…", "ready": false, "phase": "downloading",
    "done_mb": 22272, "total_mb": 49131, "pct": 45 }

phase moves downloading → loading → ready.
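
A client other than the portal could poll the same endpoint. The sketch below builds the URL, formats a status line from the fields shown above, and loops until `ready` is true. The base URL, the polling interval, and the poll limit are assumptions, and the code uses `.get` for the phase because only the `downloading` response is shown above.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen


def warming_url(base_url, model):
    """Build the warm-up status URL; base_url is whatever host your Hub runs on."""
    return f"{base_url.rstrip('/')}/public/v1/models/warming?{urlencode({'model': model})}"


def fetch_status(base_url, model):
    """GET the status document and decode it as JSON."""
    with urlopen(warming_url(base_url, model), timeout=10) as resp:
        return json.load(resp)


def describe(status):
    """Turn one status document into a one-line progress message."""
    name = status.get("model")
    if status.get("ready"):
        return f"{name}: ready"
    if status.get("phase") == "downloading":
        return (f"{name}: downloading {status['done_mb']}/{status['total_mb']} MB "
                f"({status['pct']}%)")
    return f"{name}: {status.get('phase')}"


def wait_until_ready(fetch, interval_s=2.0, max_polls=300, sleep=time.sleep):
    """Poll fetch() until the model reports ready, printing progress each time."""
    for _ in range(max_polls):
        status = fetch()
        print(describe(status))
        if status.get("ready"):
            return status
        sleep(interval_s)
    raise TimeoutError(f"model not ready after {max_polls} polls")
```

Usage would look like `wait_until_ready(lambda: fetch_status("https://hub.example", "qwen3-coder-next"))`, where `hub.example` is a placeholder host.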

Idle eviction

VRAM is finite, so the fleet unloads models you're not using. A provisioned model that goes a full demand window (default ~10 minutes) with no requests is unloaded to free room — its weights stay on disk, so it reloads quickly next time.
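
The idle rule can be pictured as a simple timestamp check. This is a sketch of the idea, not the Hub's actual implementation: the function name, the `last_request_at` mapping, and the 600-second value (taken from the "~10 minutes" default) are all illustrative assumptions.

```python
def models_to_unload(last_request_at, now, demand_window_s=600.0):
    """Return ids of loaded models with no request in the last demand window.

    last_request_at maps model id -> timestamp (seconds) of its latest request.
    """
    return sorted(
        model_id
        for model_id, last_seen in last_request_at.items()
        if now - last_seen >= demand_window_s
    )


if __name__ == "__main__":
    # Hypothetical timestamps: model-b was last used 15 minutes ago.
    print(models_to_unload({"model-a": 1000.0, "model-b": 400.0}, now=1300.0))
```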

Demand-driven placement and eviction

The Hub continuously measures demand per model (served and unmet requests). When a model is requested and can't fit alongside what's already loaded, the planner evicts a lower-demand model immediately to make room — a just-requested model doesn't have to wait out an anti-thrash delay. This is what lets you "just ask for" a big model and have the fleet rearrange itself to serve it.
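
One way to picture the eviction decision is a greedy pass over loaded models from lowest demand upward, stopping as soon as the requested model fits. The sketch below is an assumption about how such a planner could work on a single node, not a description of the Hub's real algorithm; the `Loaded` type, the demand scores, and the VRAM figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Loaded:
    id: str
    vram_gb: float
    demand: float  # any comparable demand score, e.g. recent requests per window


def plan_evictions(loaded, capacity_gb, needed_gb, requested_demand):
    """Pick lower-demand models to evict so a model needing needed_gb fits.

    Returns the ids to evict (possibly empty), or None if the model cannot
    fit without evicting something at least as in demand as the request.
    """
    free = capacity_gb - sum(m.vram_gb for m in loaded)
    evict = []
    for m in sorted(loaded, key=lambda m: m.demand):
        if free >= needed_gb:
            break
        if m.demand >= requested_demand:
            break  # only lower-demand models are eligible for eviction
        evict.append(m.id)
        free += m.vram_gb
    return evict if free >= needed_gb else None
```

Returning `None` rather than a partial list keeps the caller from evicting models when the request still would not fit.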

Picking a model

  • Coding agent: qwen3-coder-next (frontier-ish open coding model; needs a large-VRAM node or a cluster) or qwen2.5-coder-32b (excellent, fits a single 24–48 GB GPU or a 64 GB Mac).
  • General chat: a current 7–14B instruct model is fast and cheap.
  • Reasoning: a DeepSeek-R1-distill-class model when you want visible chain-of-thought.
  • Frontier / huge context: see Build a combined machine — some models only fit on a multi-machine cluster.

What actually fits depends on your hardware — see the Hardware guide for the VRAM-per-model table.

📄 Reading as a machine? This page is available as raw Markdown at https://porten.ai/docs/models.md — or grab the whole site via llms.txt / llms-full.txt.