# Models & on-demand loading

## The catalog is curated

Porten doesn't expose every model under the sun — it offers an **opinionated, curated catalog**. An operator enables the models worth serving (good quality-per-VRAM, current architectures) and disables the rest, so hardware isn't wasted on models that a newer one strictly beats. [`GET /v1/models`](/docs/api-reference#get-v1models) returns exactly what's offered.

Each model has a canonical id (e.g. `qwen2.5-coder-32b`), a context length, capability flags (`chat`, `tools`, `json`, `vision`), and a VRAM footprint.
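As a rough illustration of how a client might hold these attributes locally, here is a minimal sketch. The class name, field names, model ids, and numbers are all hypothetical placeholders; the actual shape of the `GET /v1/models` response is defined in the API reference, not here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogModel:
    # Field names are assumptions made for this sketch, not the API's schema.
    id: str
    context_length: int
    capabilities: frozenset
    vram_gb: float


def models_with(catalog, *required):
    """Return ids of catalog entries that have every required capability flag."""
    need = set(required)
    return [m.id for m in catalog if need <= m.capabilities]


# Placeholder entries; ids and numbers are invented for illustration only.
catalog = [
    CatalogModel("example-chat-7b", 32768, frozenset({"chat", "json"}), 6.0),
    CatalogModel("example-coder-32b", 32768, frozenset({"chat", "tools", "json"}), 22.0),
    CatalogModel("example-vision-12b", 16384, frozenset({"chat", "vision"}), 10.0),
]

tool_capable = models_with(catalog, "chat", "tools")
```

Filtering on capability flags like this is one way a client could pick a model that supports tool calls or vision before sending a request.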

## Loading on demand

**You can request any offered model, loaded or not.** The catalog decouples "offered" from "currently in memory":

1. You send a request for a model that no node has loaded.
2. The Hub recognizes it's offered and tells a node that can fit it to load it — for a built-in node, that means downloading the weights (once) and starting the engine.
3. Your request **blocks** until the model is serving, then completes normally.

For an interactive client like the [playground](/build/playground), this shows as a real progress bar — byte-level download percentage, then "loading into memory", then the answer streams. For an API client (OpenCode, a script), there's nothing to handle: **the request just takes longer** on the first cold call. A first load of a large model can take several minutes while weights download; after that it's warm and fast.

If a load takes longer than the request's warm-up budget, you get a `503` with code `model_warming`. Retry after a short wait and the request goes through once the model is serving.
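A client that wants to handle the `model_warming` case automatically could wrap its call in a small retry loop. The sketch below is transport-agnostic: `send` is a hypothetical callable that you supply and that returns `(status, body)`. The location of the error code in the body (`body["error"]["code"]` or a top-level `body["code"]`) is an assumption, as are the attempt count and delay.

```python
import time


def is_model_warming(status, body):
    """True for the 503 / model_warming case. The body layout is an assumption."""
    if status != 503 or not isinstance(body, dict):
        return False
    error = body.get("error")
    code = error.get("code") if isinstance(error, dict) else body.get("code")
    return code == "model_warming"


def request_with_warmup_retry(send, max_attempts=5, delay_s=15.0, sleep=time.sleep):
    """Call send() until it returns something other than model_warming.

    send is a caller-supplied function returning (status, body).
    Returns the last (status, body) seen, even if it is still model_warming.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if not is_model_warming(status, body):
            return status, body
        if attempt < max_attempts:
            sleep(delay_s)  # give the load time to progress before retrying
    return status, body
```

Any other error (including a `503` with a different code) is returned to the caller unchanged, so only the warm-up case is retried.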

### Checking warm-up progress

The portal polls a public endpoint for live load progress:

```
GET /public/v1/models/warming?model=qwen3-coder-next
→ { "model": "…", "ready": false, "phase": "downloading",
    "done_mb": 22272, "total_mb": 49131, "pct": 45 }
```

`phase` moves `downloading → loading → ready`.
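A script can poll the same endpoint to wait for a model before sending real traffic. This is a minimal sketch using only the standard library. The base URL is a placeholder, the polling interval and timeout are arbitrary choices, and the code assumes the response fields shown in the example above.

```python
import json
import time
import urllib.parse
import urllib.request

BASE_URL = "https://hub.example.invalid"  # placeholder; use your Hub's address


def fetch_warming(model, base_url=BASE_URL):
    """GET the warming status for one model and decode the JSON body."""
    query = urllib.parse.urlencode({"model": model})
    url = f"{base_url}/public/v1/models/warming?{query}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def describe(status):
    """Turn a warming status dict into a one-line progress string."""
    if status.get("ready"):
        return "ready"
    phase = status.get("phase", "unknown")
    if phase == "downloading" and "pct" in status:
        done, total = status.get("done_mb"), status.get("total_mb")
        return f"downloading {status['pct']}% ({done}/{total} MB)"
    return phase


def wait_until_ready(model, fetch=fetch_warming, interval_s=2.0,
                     timeout_s=900.0, sleep=time.sleep, clock=time.monotonic):
    """Poll until the status reports ready, or raise TimeoutError."""
    deadline = clock() + timeout_s
    while True:
        status = fetch(model)
        if status.get("ready"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"{model} not ready after {timeout_s} s")
        sleep(interval_s)
```

Passing `fetch`, `sleep`, and `clock` as parameters keeps the loop easy to exercise without a live Hub.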

## Idle eviction

VRAM is finite, so the fleet **unloads models you're not using**. A provisioned model that goes a full demand window (default ~10 minutes) with no requests is unloaded to free room — its weights stay on disk, so it reloads quickly next time.
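The idle rule can be pictured with a small sketch. This is not the fleet's actual implementation: the function name and inputs are hypothetical, and the 600-second window simply mirrors the roughly 10-minute default mentioned above.

```python
def idle_models(last_request_at, now, window_s=600.0):
    """Return models whose most recent request is at least window_s old.

    last_request_at maps model id -> timestamp (seconds) of its last request.
    """
    return sorted(
        model for model, seen in last_request_at.items()
        if now - seen >= window_s
    )


# Example: at t=1000 s, only the model last used at t=300 s has been idle
# for a full window.
candidates = idle_models({"model-a": 300.0, "model-b": 950.0}, now=1000.0)
```

Unloading frees VRAM only; because the weights remain on disk, a later request skips the download and pays just the load into memory.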

## Demand-driven placement and eviction

The Hub continuously measures demand per model (served and unmet requests). When a model is requested and can't fit alongside what's already loaded, the planner **evicts a lower-demand model immediately** to make room — a just-requested model doesn't have to wait out an anti-thrash delay. This is what lets you "just ask for" a big model and have the fleet rearrange itself to serve it.
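To make the idea concrete, here is a simplified sketch of choosing what to evict on a single node. It is an illustration under stated assumptions, not the Hub's planner: the function, its inputs, and the greedy lowest-demand-first rule are all hypothetical simplifications.

```python
def plan_evictions(loaded, demand, capacity_gb, needed_gb):
    """Pick models to evict so a new model of needed_gb fits on one node.

    loaded maps model id -> VRAM in use (GB); demand maps model id -> recent
    request count. Evicts lowest-demand models first. Returns the list of ids
    to evict (possibly empty), or None if the model cannot fit even after
    evicting everything.
    """
    free = capacity_gb - sum(loaded.values())
    evict = []
    for model in sorted(loaded, key=lambda m: demand.get(m, 0)):
        if free >= needed_gb:
            break
        evict.append(model)
        free += loaded[model]
    return evict if free >= needed_gb else None


# A 32 GB node holding a 20 GB and a 10 GB model has 2 GB free. Fitting a
# 10 GB model requires evicting the lower-demand entry.
plan = plan_evictions({"busy": 20.0, "quiet": 10.0}, {"busy": 5, "quiet": 1},
                      capacity_gb=32.0, needed_gb=10.0)
```

A real planner also has to choose among nodes and weigh unmet demand; the sketch only shows the per-node trade of low-demand residents for a requested model.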

## Picking a model

- **Coding agent:** `qwen3-coder-next` (frontier-ish open coding model; needs a large-VRAM node or a cluster) or `qwen2.5-coder-32b` (excellent, fits a single 24–48 GB GPU or a 64 GB Mac).
- **General chat:** a current 7–14B instruct model is fast and cheap.
- **Reasoning:** a DeepSeek-R1-distill class model when you want visible chain-of-thought.
- **Frontier / huge context:** see [Build a combined machine](/docs/cluster-thunderbolt) — some models only fit on a multi-machine cluster.

What actually fits depends on your hardware — see the [Hardware guide](/docs/hardware) for the VRAM-per-model table.
