Models & on-demand loading
The catalog is curated
Porten doesn't expose every model under the sun — it offers an opinionated, curated catalog. An operator enables the models worth serving (good quality-per-VRAM, current architectures) and disables the rest, so hardware isn't wasted on models that a newer one strictly beats. GET /v1/models returns exactly what's offered.
Each model has a canonical id (e.g. qwen2.5-coder-32b), a context length, capability flags (chat, tools, json, vision), and a VRAM footprint.
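A catalog entry as described above can be thought of as a small record. The sketch below is illustrative only: the field names (`context_length`, `capabilities`, `vram_gb`), the example ids, and every number are assumptions made for the example, not Porten's actual `/v1/models` schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:
    """One catalog entry. Field names are hypothetical, not Porten's schema."""
    id: str                  # canonical id
    context_length: int      # in tokens
    capabilities: frozenset  # subset of {"chat", "tools", "json", "vision"}
    vram_gb: float           # VRAM footprint


def usable_models(catalog, needed, vram_budget_gb):
    """Return ids of entries that have every needed capability and fit the budget."""
    needed = frozenset(needed)
    return [
        m.id
        for m in catalog
        if needed <= m.capabilities and m.vram_gb <= vram_budget_gb
    ]


# Placeholder entries; ids and numbers are made up for illustration.
CATALOG = [
    ModelInfo("example-small-instruct", 32768, frozenset({"chat", "json"}), 8.0),
    ModelInfo("example-coder", 32768, frozenset({"chat", "tools", "json"}), 22.0),
    ModelInfo("example-vision", 16384, frozenset({"chat", "vision"}), 40.0),
]
```

With these placeholder entries, `usable_models(CATALOG, {"chat", "tools"}, 24.0)` returns `["example-coder"]`.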
Loading on demand
You can request any offered model, loaded or not. The catalog decouples "offered" from "currently in memory":
- You send a request for a model that no node has loaded.
- The Hub recognizes it's offered and tells a node that can fit it to load it — for a built-in node, that means downloading the weights (once) and starting the engine.
- Your request blocks until the model is serving, then completes normally.
For an interactive client like the playground, this shows as a real progress bar — byte-level download percentage, then "loading into memory", then the answer streams. For an API client (OpenCode, a script), there's nothing to handle: the request just takes longer on the first cold call. A first load of a large model can take several minutes while weights download; after that it's warm and fast.
If a load takes longer than the request's warm-up budget, you get a 503 with code model_warming — retry and it'll be ready.
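A script that wants to ride out a cold load can wrap its call in a small retry loop. This is a minimal sketch under stated assumptions: `send` is a hypothetical callable that performs the request and returns `(status, body)`, and the error code is assumed to sit at the top level of the JSON body under `"code"`. The real error envelope may differ, so adjust `is_model_warming` to match what your client actually receives.

```python
import time


def is_model_warming(status, body):
    """True for a 503 whose body carries code == "model_warming" (assumed location)."""
    return (
        status == 503
        and isinstance(body, dict)
        and body.get("code") == "model_warming"
    )


def call_with_warmup_retry(send, max_attempts=5, delay_s=10.0, sleep=time.sleep):
    """Call send() until it stops reporting model_warming or attempts run out.

    Returns the last (status, body) pair. Any other response, including other
    errors, is returned immediately without retrying.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if not is_model_warming(status, body) or attempt == max_attempts:
            return status, body
        sleep(delay_s)  # give the load time to progress before retrying
```

The attempt count and delay are arbitrary defaults. For a large first download, a longer delay or more attempts may be appropriate.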
Checking warm-up progress
The portal polls a public endpoint for live load progress:
GET /public/v1/models/warming?model=qwen3-coder-next
→ { "model": "…", "ready": false, "phase": "downloading",
"done_mb": 22272, "total_mb": 49131, "pct": 45 }
The phase field advances in order: downloading → loading → ready.
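A client other than the portal could poll the same endpoint. The sketch below assumes the response fields shown above (`ready`, `phase`, `done_mb`, `total_mb`, `pct`). The `base_url` argument, the function names, and the polling interval are hypothetical choices for the example.

```python
import json
import time
import urllib.parse
import urllib.request


def fetch_progress(base_url, model):
    """GET the warming endpoint and return the decoded JSON object."""
    query = urllib.parse.urlencode({"model": model})
    url = f"{base_url}/public/v1/models/warming?{query}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def describe(progress):
    """Turn one progress payload into a short status line."""
    phase = progress.get("phase")
    if progress.get("ready") or phase == "ready":
        return "ready"
    if phase == "downloading":
        return (
            f"downloading {progress['done_mb']}/{progress['total_mb']} MB "
            f"({progress['pct']}%)"
        )
    if phase == "loading":
        return "loading into memory"
    return f"unknown phase: {phase}"


def wait_until_ready(fetch, poll_s=2.0, max_polls=300, sleep=time.sleep, report=print):
    """Poll fetch() until it reports ready. Returns False if max_polls is exhausted."""
    for _ in range(max_polls):
        progress = fetch()
        report(describe(progress))
        if progress.get("ready"):
            return True
        sleep(poll_s)
    return False
```

A caller would pass something like `lambda: fetch_progress(base_url, "qwen3-coder-next")` as `fetch`. For the sample payload above, `describe` returns `"downloading 22272/49131 MB (45%)"`.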
Idle eviction
VRAM is finite, so the fleet unloads models you're not using. A provisioned model that goes a full demand window (default ~10 minutes) with no requests is unloaded to free room — its weights stay on disk, so it reloads quickly next time.
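The idle rule can be sketched as a simple timestamp check. This is not the Hub's implementation. The function name, the timestamp map, and the use of seconds are assumptions. Only the roughly 10-minute default window comes from the text above.

```python
def idle_models(last_request_at, now, window_s=600.0):
    """Return ids of loaded models with no request for at least window_s seconds.

    last_request_at maps model id -> time of its most recent request, in the
    same clock and units (seconds) as now.
    """
    return sorted(
        model_id
        for model_id, seen in last_request_at.items()
        if now - seen >= window_s
    )
```

For example, `idle_models({"a": 0.0, "b": 500.0}, now=700.0)` returns `["a"]`: model `a` has been quiet for 700 seconds, model `b` for only 200.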
Demand-driven placement and eviction
The Hub continuously measures demand per model (served and unmet requests). When a model is requested and can't fit alongside what's already loaded, the planner evicts a lower-demand model immediately to make room — a just-requested model doesn't have to wait out an anti-thrash delay. This is what lets you "just ask for" a big model and have the fleet rearrange itself to serve it.
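To make the eviction idea concrete, here is a toy planner for a single node. It is a sketch under assumptions, not Porten's planner: it evicts the lowest-demand models first, never evicts a model whose demand is at least that of the requested one, and treats demand as a single number per model. All names and the data layout are hypothetical.

```python
def plan_evictions(loaded, requested_vram_gb, requested_demand, capacity_gb):
    """Pick models to evict so the requested model fits.

    loaded maps model id -> (vram_gb, demand). Returns the list of ids to
    evict (possibly empty), or None if the model cannot fit without evicting
    something with equal or higher demand.
    """
    free = capacity_gb - sum(vram for vram, _ in loaded.values())
    evict = []
    # Visit candidates from lowest demand to highest.
    for model_id, (vram, demand) in sorted(loaded.items(), key=lambda kv: kv[1][1]):
        if free >= requested_vram_gb:
            break  # already enough room
        if demand >= requested_demand:
            break  # everything left is at least as in-demand as the request
        evict.append(model_id)
        free += vram
    return evict if free >= requested_vram_gb else None
```

On a 48 GB node holding `{"a": (20, 5), "b": (20, 1)}`, a request for a 24 GB model with demand 3 yields `["b"]`. A request for a 40 GB model with the same demand yields `None`, because fitting it would require evicting the higher-demand model `a`.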
Picking a model
- Coding agent: qwen3-coder-next (frontier-ish open coding model; needs a large-VRAM node or a cluster) or qwen2.5-coder-32b (excellent, fits a single 24–48 GB GPU or a 64 GB Mac).
- General chat: a current 7–14B instruct model is fast and cheap.
- Reasoning: a DeepSeek-R1-distill class model when you want visible chain-of-thought.
- Frontier / huge context: see Build a combined machine — some models only fit on a multi-machine cluster.
What actually fits depends on your hardware — see the Hardware guide for the VRAM-per-model table.