# API reference

Base URL: `https://porten.ai/v1`. Authenticate with `Authorization: Bearer sk-porten-…`. The surface is OpenAI-compatible, so any OpenAI SDK works by overriding `base_url`.
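
For readers without an OpenAI SDK at hand, the following is a minimal standard-library sketch of an authenticated call. The helper names `build_chat_request` and `send`, the `Content-Type` header, and the 60-second timeout are illustrative assumptions, not part of the documented API.

```python
import json
import urllib.request

BASE_URL = "https://porten.ai/v1"  # base URL as stated above


def build_chat_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST to the chat completions endpoint."""
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            # Assumption: JSON bodies are declared the usual way.
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON body (non-streaming only)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the headers and body easy to inspect before anything goes over the network.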

## POST /v1/chat/completions

The core endpoint. It supports both streaming and non-streaming responses.

**Request:**

```json
{
  "model": "qwen2.5-coder-32b",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Write a haiku about the aurora." }
  ],
  "temperature": 0.7,
  "max_tokens": 256,
  "stream": false
}
```

**Response (non-streaming):**

```json
{
  "id": "chatcmpl-porten-7f3a2b",
  "object": "chat.completion",
  "model": "qwen2.5-coder-32b",
  "choices": [
    { "index": 0,
      "message": { "role": "assistant", "content": "Green fire dances…" },
      "finish_reason": "stop" }
  ],
  "usage": { "prompt_tokens": 32, "completion_tokens": 41, "total_tokens": 73 }
}
```

**Response (streaming, `stream: true`)** — OpenAI-style SSE:

```
data: {"id":"…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Green"},"finish_reason":null}]}

data: {"id":"…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```
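
A client that reads the stream by hand might decode it roughly as follows. This is a sketch that assumes the lines have already been read from the response body as text; `iter_chunks` and `collect_text` are hypothetical helper names.

```python
import json
from typing import Iterable, Iterator, Optional


def iter_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded chunk objects from SSE lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # end of stream; anything after this is ignored
        yield json.loads(data)


def collect_text(lines: Iterable[str]) -> tuple[str, Optional[str]]:
    """Concatenate content deltas and return (text, last finish_reason)."""
    parts: list[str] = []
    finish_reason: Optional[str] = None
    for chunk in iter_chunks(lines):
        # A chunk may carry no choices at all, so default to an empty list.
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
            finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(parts), finish_reason
```

Interactive clients would usually print each delta as it arrives rather than collecting the full text first.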

### Parameters

| Param | Status | Behaviour |
|---|---|---|
| `model` | ✅ | Canonical id; unknown → `404 model_not_found` |
| `messages` | ✅ | `system` / `user` / `assistant` / `tool` roles |
| `stream` | ✅ | SSE chunks |
| `stream_options.include_usage` | ✅ | Hub fills `usage` from its own count |
| `max_tokens` / `max_completion_tokens` | ✅ | Both accepted |
| `temperature`, `top_p`, `stop`, `seed` | ⚠️ | Passed to the engine; `seed` is honoured only if the engine supports it |
| `presence_penalty`, `frequency_penalty` | ⚠️ | Passed through; ignored if the engine lacks them |
| `response_format` | ✅ | JSON mode / JSON schema, forwarded to the engine (best-effort per engine) |
| `tools`, `tool_choice` | ✅ | Routed to a model whose catalog entry declares `tools` (the same capability `/v1/models` shows); response carries `tool_calls` + `finish_reason: "tool_calls"`. A model without tool support → `400` (clear message), not a capacity error |
| `content` with `image_url` | ✅ | Inline `data:` images forwarded to vision models. Remote `http(s)` image URLs are **not** fetched (SSRF protection) — inline them as data URLs |
| `n` | ⚠️ | Only `n=1`; `n>1` → `400 unsupported_parameter` |
| `user`, `metadata` | ✅ | Logged for usage/abuse |

**Principle:** unknown *convenience* params are ignored silently (forward-compatible); params that would change semantics but can't be honoured (`n>1`) are rejected with `400` rather than silently producing the wrong result.
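
Because remote image URLs are not fetched, images have to be inlined before the request is sent. The sketch below assumes the OpenAI content-part shape (`{"type": "image_url", "image_url": {"url": ...}}`), which follows from the compatibility claim but is not spelled out on this page. The helper names are hypothetical.

```python
import base64


def image_part(image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Wrap raw image bytes as an inline data-URL content part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


def remote_image_urls(messages: list) -> list:
    """Return any http(s) image URLs, which the API will not fetch."""
    found = []
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain string content carries no image parts
        for part in content:
            if isinstance(part, dict) and part.get("type") == "image_url":
                url = part.get("image_url", {}).get("url", "")
                if url.lower().startswith(("http://", "https://")):
                    found.append(url)
    return found
```

Running `remote_image_urls` over the messages before sending gives a client-side warning about images the model would otherwise never see.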

### Reasoning models

Models that "think" (e.g. DeepSeek-R1 family) return their reasoning separately as `reasoning_content` (a delta field in streaming), kept distinct from the answer text — so you can show or hide the chain of thought.

**Budget enough tokens.** Reasoning is generated *before* the answer and counts against `max_tokens`. With a small cap the model can spend the whole budget thinking and you get `finish_reason: "length"` with empty `content`. For reasoning models, set `max_tokens` to at least **4096** (more for hard prompts).
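
The budgeting advice above could be applied on the client side along these lines. This is a sketch: the function names are hypothetical, and only the 4096 floor, the `reasoning_content` delta field, and the `finish_reason: "length"` symptom come from the text.

```python
REASONING_MIN_TOKENS = 4096  # floor recommended above for reasoning models


def with_reasoning_budget(payload: dict, floor: int = REASONING_MIN_TOKENS) -> dict:
    """Return a copy of the payload whose max_tokens is at least `floor`."""
    updated = dict(payload)
    updated["max_tokens"] = max(int(payload.get("max_tokens") or 0), floor)
    return updated


def ran_out_while_thinking(response: dict) -> bool:
    """True when the budget went on reasoning and no answer text came back."""
    choice = response["choices"][0]
    content = choice["message"].get("content") or ""
    return choice.get("finish_reason") == "length" and not content.strip()


def split_deltas(chunks: list) -> tuple:
    """Separate streamed reasoning text from streamed answer text."""
    reasoning, answer = [], []
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            reasoning.append(delta.get("reasoning_content") or "")
            answer.append(delta.get("content") or "")
    return "".join(reasoning), "".join(answer)
```

When `ran_out_while_thinking` returns true, one reasonable reaction is to resend the request with a larger `max_tokens`.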

## POST /v1/embeddings

```json
{ "model": "nomic-embed-text", "input": ["text to embed", "and another"] }
```

```json
{
  "object": "list",
  "data": [
    { "object": "embedding", "index": 0, "embedding": [0.0123, -0.045] },
    { "object": "embedding", "index": 1, "embedding": [0.0210, -0.011] }
  ],
  "model": "nomic-embed-text",
  "usage": { "prompt_tokens": 12, "total_tokens": 12 }
}
```
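
A typical next step is comparing the returned vectors. The sketch below assumes, as in the OpenAI convention, that each item's `index` matches the position of its input. The helper names are illustrative.

```python
import math


def embeddings_in_order(response: dict) -> list:
    """Return the embedding vectors sorted by their `index` field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm
```

For example, `cosine_similarity(*embeddings_in_order(response))` scores the two inputs from the request above against each other.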

## GET /v1/models

Lists every offered model, aggregated and deduplicated across the fleet.

```json
{
  "object": "list",
  "data": [
    { "id": "qwen2.5-coder-32b", "object": "model", "owned_by": "porten",
      "x_porten": { "ready": true, "type": "chat", "ctx": 32768 } },
    { "id": "qwen3-coder-next", "object": "model", "owned_by": "porten",
      "x_porten": { "ready": false, "type": "chat", "ctx": 262144 } }
  ]
}
```

`ready: false` means the model is offered but not loaded this instant — your first request will trigger an on-demand load. See [Models & on-demand loading](/docs/models).
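
A client that wants to avoid cold starts could partition the list by the `ready` flag. This is a small sketch that operates on the already-decoded response body; `split_by_readiness` is a hypothetical helper name.

```python
def split_by_readiness(models_response: dict) -> tuple:
    """Return (ready_ids, cold_ids) from a decoded /v1/models response."""
    ready, cold = [], []
    for model in models_response.get("data", []):
        info = model.get("x_porten", {})
        # Models without the flag are treated as cold, the cautious default.
        (ready if info.get("ready") else cold).append(model["id"])
    return ready, cold
```

Cold models remain usable; the first request to one simply waits for the load.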

## Headers

| Header | Direction | Note |
|---|---|---|
| `Authorization: Bearer sk-porten-…` | in | Required |
| `X-Request-Id` | out | Correlation id, echoed in logs |
| `X-Porten-Node` | out | Which node served the request |
| `Retry-After` | out | On `429` / `503` |

## Errors

All errors follow OpenAI's format: `{"error":{"message","type","code","param"}}`.

| HTTP | `code` | Meaning |
|---|---|---|
| 401 | `invalid_api_key` | Invalid or revoked key |
| 403 | `model_not_allowed` | The key may not use this model (region/policy) |
| 404 | `model_not_found` | No node advertises this model and it isn't offered |
| 429 | `rate_limit_exceeded` | Quota/rate exhausted (`Retry-After`) |
| 502 | `node_error` | All candidate nodes failed |
| 503 | `no_available_node` / `model_warming` | Model exists but no free/healthy node, or it's still loading (`Retry-After`) |
| 504 | `gateway_timeout` | Total timeout exceeded |

> A `503 model_warming` is expected the first time you hit a cold model if the load takes longer than the request's warm-up budget. Retry after the `Retry-After` interval; the model will be ready shortly. Most clients never see this error, because the request normally blocks until the model is ready.
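
Retrying on `429` and `503` can be sketched as below. Several parts are assumptions made for illustration: the transport is a caller-supplied `send` function returning `(status, headers, body)`, the header is looked up under the exact key `Retry-After`, its value is a number of seconds, and the attempt count and fallback delay are arbitrary.

```python
import time
from typing import Callable, Mapping, Tuple

RETRYABLE = {429, 503}  # statuses documented above as carrying Retry-After

Response = Tuple[int, Mapping[str, str], dict]


def retry_delay(headers: Mapping[str, str], default: float) -> float:
    """Seconds to wait, from Retry-After when it is numeric, else `default`."""
    try:
        return max(0.0, float(headers.get("Retry-After")))
    except (TypeError, ValueError):
        return default  # header missing or not a plain number of seconds


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 5,
    default_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call `send` until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    response = send()
    for _ in range(max_attempts - 1):
        if response[0] not in RETRYABLE:
            break
        sleep(retry_delay(response[1], default_delay))
        response = send()
    return response
```

The last response is returned even if it is still a `429` or `503`, so the caller decides how to surface an exhausted retry budget.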
