API Overview

Pellet AI provides an OpenAI-compatible inference API with intelligent routing. Send your prompts to one endpoint and Pellet automatically selects the best model based on task type, complexity, and your routing preferences.

  • 10+ models available
  • ~200ms routing overhead
  • $2.50 free tier credit

Base URL

https://getpellet.io/v1

Authentication

All API requests require a valid API key sent in the Authorization header. You can create API keys from the API Keys page.

bash
curl https://getpellet.io/v1/chat/completions \
  -H "Authorization: Bearer pk_live_your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'

Keep your API key secret. Do not expose it in client-side code, public repositories, or browser requests. All API calls should be made from your server.

POST /v1/chat/completions

OpenAI-compatible chat completions endpoint. Drop-in replacement for the OpenAI SDK. Supports text, vision (images), streaming, and intelligent routing.

Request body

messages (array, required)

List of message objects. Each message has a role (system, user, assistant, or tool) and content (string or array of content parts for multimodal). Tool-role messages must include a tool_call_id. Assistant messages may include a tool_calls array.

model (string | null, default: null)

Model ID to use. Set to null or omit for auto-routing (recommended). Specify a model ID (e.g. meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo) to bypass routing.

stream (boolean, default: false)

If true, responses are streamed as server-sent events (SSE).

temperature (float, default: 0.7)

Sampling temperature between 0 and 2. Lower values are more deterministic.

max_tokens (integer, default: 1024)

Maximum number of tokens to generate.

top_p (float, default: 1.0)

Nucleus sampling threshold.

stop (string | array | null, default: null)

Stop sequence(s). Generation stops when encountered.

tools (array | null, default: null)

List of tool (function) definitions the model may call. Each tool has a type of function and a function object with name, description, and parameters (JSON Schema). See Tool Use section below.

tool_choice (string | object | null, default: null)

Controls tool calling behavior. auto lets the model decide, required forces a tool call, none disables tools. Pass an object like {"type":"function","function":{"name":"..."}} to force a specific tool.

pellet_config (object, default: {})

Pellet-specific routing configuration. See Routing Modes below.

tags (string[] | null, default: null)

Optional list of labels to attach to this request. Tags are stored with the request log and can be filtered in the dashboard. Max 10 tags, each max 64 characters, lowercase alphanumeric with hyphens, underscores, dots, or colons. Tags are echoed in pellet_metadata for non-streaming responses.
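Since invalid tags get a request rejected, it can be worth checking them client-side before sending. A sketch of a validator for the rules above; the exact pattern (in particular whether a tag may start with a punctuation character) is an assumption inferred from the prose, so adjust it if the API disagrees:

```python
import re

# Assumed pattern: lowercase alphanumeric first character, then lowercase
# alphanumerics plus hyphens, underscores, dots, or colons; max 64 chars total.
TAG_RE = re.compile(r"[a-z0-9][a-z0-9._:-]{0,63}")

def validate_tags(tags):
    """Raise ValueError if the tag list would likely be rejected; else return it."""
    if len(tags) > 10:
        raise ValueError("at most 10 tags per request")
    for tag in tags:
        if not TAG_RE.fullmatch(tag):
            raise ValueError(f"invalid tag: {tag!r}")
    return tags

validate_tags(["production", "onboarding-v2"])  # passes
```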

Example — text completion

bash
curl -X POST https://getpellet.io/v1/chat/completions \
  -H "Authorization: Bearer pk_live_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in 3 sentences."}
    ],
    "temperature": 0.5,
    "max_tokens": 256
  }'

Response

json
{
  "id": "pel_a1b2c3d4e5f6g7h8",
  "object": "chat.completion",
  "created": 1710000000,
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Quantum computing uses quantum bits (qubits)..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 28,
    "completion_tokens": 64,
    "total_tokens": 92
  },
  "pellet_metadata": {
    "routing_decision": "auto",
    "task_type": "qa",
    "complexity_score": 2.1,
    "model_confidence": 0.87,
    "latency_ms": 342,
    "cost_usd": 0.000031
  }
}
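If you call the endpoint with plain HTTP rather than an SDK, pellet_metadata arrives as an ordinary JSON field on the response body. A small logging helper, using the field names from the sample response above (summarize_completion is an illustrative name, not part of any SDK):

```python
def summarize_completion(resp: dict) -> str:
    """One-line summary of a non-streaming chat completion response."""
    content = resp["choices"][0]["message"]["content"]
    meta = resp.get("pellet_metadata", {})
    return (f"{resp['model']} | task={meta.get('task_type')} "
            f"| ${meta.get('cost_usd', 0):.6f} | {meta.get('latency_ms')}ms "
            f"| {len(content)} chars")

# Exercised against the shape of the sample response above.
resp = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    "choices": [{"message": {"role": "assistant", "content": "Quantum computing..."}}],
    "pellet_metadata": {"task_type": "qa", "cost_usd": 0.000031, "latency_ms": 342},
}
print(summarize_completion(resp))
```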

Tool Use (Function Calling)

Enable your model to call functions by passing tools in the request. When the model decides to use a tool, the response includes a tool_calls array instead of text content. You execute the function, then send the result back as a tool-role message to continue the conversation.

OpenAI-compatible. Tool use follows the same format as the OpenAI API. Existing code using the OpenAI SDK works without changes. For best results, use models with strong tool-use support such as meta-llama/Llama-3.3-70B-Instruct-Turbo or deepseek-ai/DeepSeek-V3.1.

Tool definition format

type (string, required)

Must be function.

function.name (string, required)

The function name the model will reference when calling this tool.

function.description (string)

Describes what the function does. Helps the model decide when to use it.

function.parameters (object)

A JSON Schema object defining the function's parameters.

Step 1 — Send tools with your request

bash
curl -X POST https://getpellet.io/v1/chat/completions \
  -H "Authorization: Bearer pk_live_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",
    "messages": [
      {"role": "user", "content": "What is the weather in Delhi?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a city",
          "parameters": {
            "type": "object",
            "properties": {
              "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
          }
        }
      }
    ],
    "tool_choice": "auto"
  }'

Response — model calls the tool

json
{
  "id": "pel_a1b2c3d4e5f6g7h8",
  "object": "chat.completion",
  "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"city\": \"Delhi\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": { "prompt_tokens": 85, "completion_tokens": 22, "total_tokens": 107 }
}
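Note that function.arguments is a JSON-encoded string, not an object, so it must be parsed before your function runs. Between steps 1 and 2 you execute the call yourself; a minimal dispatcher might look like this (the get_weather stub stands in for your real lookup):

```python
import json

def get_weather(city: str) -> dict:
    # Stand-in for a real weather lookup.
    return {"temp_c": 38, "condition": "Sunny"}

TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_calls(tool_calls):
    """Execute each tool call and return tool-role messages for step 2."""
    results = []
    for call in tool_calls:
        fn = TOOL_REGISTRY[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])  # arguments is a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(fn(**args)),
        })
    return results

# The tool_calls array from the sample response above:
calls = [{"id": "call_abc123", "type": "function",
          "function": {"name": "get_weather", "arguments": "{\"city\": \"Delhi\"}"}}]
print(run_tool_calls(calls))
```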

Step 2 — Send the tool result back

bash
curl -X POST https://getpellet.io/v1/chat/completions \
  -H "Authorization: Bearer pk_live_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",
    "messages": [
      {"role": "user", "content": "What is the weather in Delhi?"},
      {"role": "assistant", "tool_calls": [{"id": "call_abc123", "type": "function", "function": {"name": "get_weather", "arguments": "{\"city\": \"Delhi\"}"}}]},
      {"role": "tool", "tool_call_id": "call_abc123", "content": "{\"temp_c\": 38, \"condition\": \"Sunny\"}"}
    ],
    "tools": [...]
  }'

Example — Python with OpenAI SDK

python
from openai import OpenAI
import json

client = OpenAI(
    base_url="https://getpellet.io/v1",
    api_key="pk_live_your_key"
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
        }
    }
}]

# First call — model may request a tool call
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "What's the weather in Delhi?"}],
    tools=tools,
    tool_choice="auto",
)

message = response.choices[0].message

if message.tool_calls:
    # Execute the function (your logic here)
    tool_call = message.tool_calls[0]
    result = {"temp_c": 38, "condition": "Sunny"}

    # Second call — send the result back
    final = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
        messages=[
            {"role": "user", "content": "What's the weather in Delhi?"},
            message,  # assistant message with tool_calls
            {
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result),
            },
        ],
        tools=tools,
    )
    print(final.choices[0].message.content)
else:
    print(message.content)

Vision (Image Understanding)

POST /v1/chat/completions

Send images alongside text using the same chat completions endpoint. Use multimodal content parts in the content field. When image content is detected, Pellet automatically routes to a vision model.

Vision models: Pellet routes to Llama 3.2 11B Vision (fastest/auto) or Llama 3.2 90B Vision (quality mode). You can also specify a vision model directly via the model parameter.

Content parts format

type (string, required)

Either text or image_url.

text (string)

Text content (when type is text).

image_url (object)

Image object with a url field. Supports base64 data URLs: data:image/png;base64,...

Example — describe an image

bash
curl -X POST https://getpellet.io/v1/chat/completions \
  -H "Authorization: Bearer pk_live_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "data:image/png;base64,iVBORw0KGgo..."
            }
          },
          {
            "type": "text",
            "text": "What is in this image? Extract any text you see."
          }
        ]
      }
    ],
    "max_tokens": 512
  }'

Example — Python with base64 image

python
import base64
import requests

# Read image and encode to base64
with open("receipt.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "https://getpellet.io/v1/chat/completions",
    headers={"Authorization": "Bearer pk_live_your_key"},
    json={
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": "Extract all items and totals from this receipt as JSON."}
            ]
        }],
        "max_tokens": 1024
    }
)

print(resp.json()["choices"][0]["message"]["content"])

Supported image formats: PNG, JPEG, WebP, GIF.

POST /v1/audio/transcriptions

OpenAI-compatible speech-to-text endpoint. Upload an audio file and receive a text transcription powered by Whisper models.

Request (multipart form)

file (file, required)

The audio file to transcribe. Max 25MB. Supported formats: mp3, wav, ogg, flac, m4a, webm.

model (string, default: whisper-large-v3-turbo)

STT model to use. Options: whisper-large-v3-turbo (fastest), whisper-large-v3 (highest quality), distil-whisper-large-v3-en (English-only, fast).

language (string, default: auto-detect)

ISO 639-1 language code (e.g. en, es, fr). Auto-detected if not specified.

response_format (string, default: json)

Response format: json (text only), verbose_json (with segments and metadata), text (plain text).

Example — cURL

bash
curl -X POST https://getpellet.io/v1/audio/transcriptions \
  -H "Authorization: Bearer pk_live_your_key" \
  -F file=@meeting_recording.mp3 \
  -F model=whisper-large-v3-turbo \
  -F response_format=verbose_json

Example — Python

python
import requests

resp = requests.post(
    "https://getpellet.io/v1/audio/transcriptions",
    headers={"Authorization": "Bearer pk_live_your_key"},
    files={"file": open("recording.mp3", "rb")},
    data={"model": "whisper-large-v3-turbo"}
)

print(resp.json()["text"])

Response (verbose_json)

json
{
  "text": "Hello, welcome to our weekly standup meeting...",
  "task": "transcribe",
  "language": "en",
  "duration": 127.5,
  "segments": [
    {
      "start": 0.0,
      "end": 3.2,
      "text": "Hello, welcome to our weekly standup meeting."
    },
    {
      "start": 3.5,
      "end": 7.1,
      "text": "Let's start with updates from the engineering team."
    }
  ],
  "pellet_metadata": {
    "id": "pel_a1b2c3d4e5f6g7h8",
    "model": "whisper-large-v3-turbo",
    "latency_ms": 1842
  }
}
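The segments array in verbose_json carries start/end timestamps, which makes it straightforward to derive subtitles. A sketch that converts segments to SRT, using only the fields shown in the sample response above:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 00:00:03,200."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render verbose_json segments as an SRT subtitle document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{srt_timestamp(seg['start'])} --> "
                      f"{srt_timestamp(seg['end'])}\n{seg['text'].strip()}")
    return "\n\n".join(blocks)

segments = [
    {"start": 0.0, "end": 3.2, "text": "Hello, welcome to our weekly standup meeting."},
    {"start": 3.5, "end": 7.1, "text": "Let's start with updates from the engineering team."},
]
print(segments_to_srt(segments))
```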

Available STT models:

  • whisper-large-v3-turbo (default): fastest, multilingual
  • whisper-large-v3 (quality): highest accuracy, multilingual
  • distil-whisper-large-v3-en (English): English-only, very fast

Streaming

Set stream: true to receive responses as server-sent events (SSE). Works with both text and vision requests. Each event contains a delta chunk.

bash
curl -X POST https://getpellet.io/v1/chat/completions \
  -H "Authorization: Bearer pk_live_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Write a haiku about APIs"}],
    "stream": true
  }'

SSE event format

data: {"id":"pel_...","object":"chat.completion.chunk","choices":[{"delta":{"content":"Silent"},"index":0}]}

data: {"id":"pel_...","object":"chat.completion.chunk","choices":[{"delta":{"content":" endpoints"},"index":0}]}

data: [DONE]
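If you consume the stream without an SDK, each data: line is standalone JSON until the [DONE] sentinel. A minimal parser for the chunk format shown above:

```python
import json

def parse_sse_line(line: str):
    """Extract the delta text from one SSE line, or None.

    Returns None for blank lines, non-data lines, and the [DONE] sentinel.
    """
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload.strip() == "[DONE]":
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0].get("delta", {}).get("content")

# Reassemble the example chunks shown above:
lines = [
    'data: {"id":"pel_x","object":"chat.completion.chunk","choices":[{"delta":{"content":"Silent"},"index":0}]}',
    '',
    'data: {"id":"pel_x","object":"chat.completion.chunk","choices":[{"delta":{"content":" endpoints"},"index":0}]}',
    'data: [DONE]',
]
text = "".join(t for t in (parse_sse_line(l) for l in lines) if t)
print(text)  # Silent endpoints
```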

Example — Python with OpenAI SDK

python
from openai import OpenAI

client = OpenAI(
    base_url="https://getpellet.io/v1",
    api_key="pk_live_your_key"
)

stream = client.chat.completions.create(
    model=None,  # auto-route
    messages=[{"role": "user", "content": "Explain REST APIs"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Routing Modes

Control how Pellet selects models using the pellet_config object. The routing engine classifies your task (12+ types), scores complexity (1-5), and picks the optimal model.

  • auto (balanced): picks the best model for the task type and complexity. Uses smaller models for simple tasks, larger ones for complex tasks.
  • fastest (minimum latency): prioritizes the smallest, fastest models. Great for real-time apps and simple tasks.
  • cheapest (lowest cost): selects the most cost-effective model that can handle the task.
  • quality (best output): selects the largest, most capable model. Use for complex reasoning, long-form content, or critical tasks.

Full pellet_config options

json
{
  "pellet_config": {
    "routing_mode": "auto",
    "max_latency_ms": 500,
    "max_cost_per_token": 0.0001,
    "model_allowlist": ["meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"],
    "model_blocklist": ["deepseek-ai/DeepSeek-R1"]
  }
}

routing_mode (string, default: auto)

One of auto, fastest, cheapest, quality.

max_latency_ms (integer | null)

Maximum acceptable latency. Filters out slower models.

max_cost_per_token (float | null)

Maximum cost per token. Filters out expensive models.

model_allowlist (array | null)

Only consider these models. Overrides routing engine selection.

model_blocklist (array | null)

Exclude these models from consideration.
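pellet_config is not a standard OpenAI parameter, so when using the OpenAI Python SDK you pass it through the SDK's extra_body argument. A sketch of a helper that validates the config before sending (make_pellet_config is an illustrative name; the valid mode set mirrors the list above):

```python
VALID_MODES = {"auto", "fastest", "cheapest", "quality"}
VALID_LIMITS = {"max_latency_ms", "max_cost_per_token",
                "model_allowlist", "model_blocklist"}

def make_pellet_config(routing_mode="auto", **limits):
    """Build a pellet_config dict, rejecting unknown keys and modes."""
    if routing_mode not in VALID_MODES:
        raise ValueError(f"unknown routing_mode: {routing_mode!r}")
    unknown = set(limits) - VALID_LIMITS
    if unknown:
        raise ValueError(f"unknown pellet_config keys: {sorted(unknown)}")
    cfg = {"routing_mode": routing_mode}
    cfg.update({k: v for k, v in limits.items() if v is not None})
    return cfg

# With the OpenAI SDK, pass the result via extra_body:
# client.chat.completions.create(
#     messages=[...],
#     extra_body={"pellet_config": make_pellet_config("fastest", max_latency_ms=500)},
# )
print(make_pellet_config("fastest", max_latency_ms=500))
```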

Task types detected

classification, extraction, summarization, translation, code_gen, qa, structured_output, sentiment, content_gen, formatting, moderation, reasoning, vision, speech_to_text

POST /v1/routing/explain

Preview what model Pellet would select without making an inference call. Useful for debugging routing decisions and understanding model selection. No tokens are consumed.

bash
curl -X POST https://getpellet.io/v1/routing/explain \
  -H "Authorization: Bearer pk_live_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Translate this to French: Hello world"}
    ]
  }'

Response

json
{
  "model": "Qwen/Qwen2.5-7B-Instruct-Turbo",
  "confidence": 0.87,
  "task_type": "translation",
  "complexity_score": 1.5,
  "alternatives": [
    {"model": "llama-3.1-8b-instant", "confidence": 0.68}
  ],
  "reasoning": "Detected task type 'translation' with complexity 1.5/5.0. Selected Qwen/Qwen2.5-7B-Instruct-Turbo with confidence 0.87."
}
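A small wrapper for the endpoint, assuming the requests package used in the other examples; explain_routing and format_decision are illustrative names, and the formatter is exercised below against the sample response above:

```python
def explain_routing(messages, api_key):
    """Ask Pellet which model it would pick, without consuming tokens."""
    import requests  # same HTTP client used in the other examples
    resp = requests.post(
        "https://getpellet.io/v1/routing/explain",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"messages": messages},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def format_decision(decision: dict) -> str:
    """One-line summary of a routing decision for logs."""
    alts = ", ".join(a["model"] for a in decision.get("alternatives", []))
    return (f"{decision['task_type']} (complexity {decision['complexity_score']}) "
            f"-> {decision['model']} @ {decision['confidence']:.2f}"
            + (f" (alternatives: {alts})" if alts else ""))

sample = {"model": "Qwen/Qwen2.5-7B-Instruct-Turbo", "confidence": 0.87,
          "task_type": "translation", "complexity_score": 1.5,
          "alternatives": [{"model": "llama-3.1-8b-instant", "confidence": 0.68}]}
print(format_decision(sample))
```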

GET /v1/models

List all available models. Returns an OpenAI-compatible model list.

bash
curl https://getpellet.io/v1/models \
  -H "Authorization: Bearer pk_live_your_key"

Datasets

Curate training datasets from your production logs, JSONL uploads, or manual entries. Export as JSONL for fine-tuning. Accessed via the dashboard at /dashboard/datasets or programmatically via the dashboard API (JWT auth).

Creating datasets

Three ways to populate a dataset:

  • From tagged logs — select tags when creating a dataset; the backend pulls matching logged requests and converts them to training examples (up to 1000 per creation).
  • JSONL upload — drop a .jsonl or .json file (max 10MB) where each line is an OpenAI-format example: {"messages": [...]}.
  • Manual entry — add examples one at a time via the dashboard row editor.

Example format

json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."}
  ]
}
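Before uploading, it's worth validating locally that every line parses and follows the message shape above. A sketch (the content check is deliberately loose: assistant messages in tool-calling examples may carry tool_calls instead of a content string, so tighten it to your data):

```python
import json

VALID_ROLES = {"system", "user", "assistant", "tool"}

def check_example(obj) -> bool:
    """True if obj looks like an OpenAI-format training example."""
    msgs = obj.get("messages") if isinstance(obj, dict) else None
    if not isinstance(msgs, list) or not msgs:
        return False
    return all(isinstance(m, dict) and m.get("role") in VALID_ROLES
               and "content" in m for m in msgs)

def check_jsonl(text: str):
    """Return a list of (line_number, error) for a JSONL payload."""
    errors = []
    for i, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as e:
            errors.append((i, f"invalid JSON: {e}"))
            continue
        if not check_example(obj):
            errors.append((i, "not an OpenAI-format example"))
    return errors

good = '{"messages": [{"role": "user", "content": "hi"}]}'
print(check_jsonl(good))           # []
print(check_jsonl('{"oops": 1}'))  # [(1, 'not an OpenAI-format example')]
```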

REST API endpoints

All dataset endpoints use the dashboard API with JWT authentication (obtained via OAuth login). Base path: /api/datasets.

Method   Path                               Purpose
POST     /api/datasets                      Create (empty or from tags)
GET      /api/datasets                      List user's datasets
GET      /api/datasets/:id                  Get dataset details
PATCH    /api/datasets/:id                  Update name/description
DELETE   /api/datasets/:id                  Delete dataset + examples
GET      /api/datasets/:id/examples         List examples (paginated)
POST     /api/datasets/:id/examples/manual  Add manual example
POST     /api/datasets/:id/upload           Upload JSONL file
PATCH    /api/datasets/:id/examples/:eid    Edit example messages
DELETE   /api/datasets/:id/examples/:eid    Delete example
GET      /api/datasets/:id/export           Download as JSONL

Organizations

Pellet uses a multi-tenant organization model. Each org is a shared workspace with its own API keys, datasets, billing wallet, and usage analytics. Users can belong to multiple orgs and switch between them via the dashboard sidebar.

Roles

Role     Permissions
Admin    Everything: members, datasets, API keys, billing, settings, delete org
Editor   Dataset CRUD, fine-tuning; read-only API keys; no billing access
Viewer   Read-only datasets, logs, usage analytics; no API keys or billing

Key Concepts

  • Every user gets a Personal workspace on signup (cannot be deleted or left)
  • Create additional orgs for team collaboration via the org switcher
  • Invite members by email — pending invites auto-link when the invitee signs up
  • All resource endpoints are scoped: /api/orgs/:slug/keys, /api/orgs/:slug/datasets, etc.
  • Dashboard URLs are org-scoped: /dashboard/:slug/datasets
  • Orgs are soft-deleted — data is preserved and the slug is freed for reuse

Invite Flow

Admins invite members via the Members page. If the invitee already has a Pellet account, they see a banner on their dashboard with Accept/Decline buttons. If the invitee doesn't have an account yet, the invitation waits — when they sign up with the matching email, they're automatically added to the org.

Error Handling

Pellet uses standard HTTP status codes. Errors return a JSON body with an error object.

  • 400 Bad request: invalid parameters or unsupported file type
  • 401 Unauthorized: missing or invalid API key
  • 413 File too large: exceeds 25MB limit
  • 422 Validation error: invalid request body or message role
  • 429 Rate limited: too many requests
  • 502 Upstream error: model returned an error
  • 503 Service unavailable: feature not configured

Error response format

json
{
  "detail": {
    "error": {
      "message": "Rate limit exceeded. Try again in 3 seconds.",
      "type": "rate_limit_error"
    }
  }
}

Rate Limits

Rate limits are applied per API key using a sliding window. When rate limited, the API returns a 429 status code.

Plan        Requests/min   Requests/day
Free        100            1,000
Developer   100            50,000
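When a request trips the limiter, backing off exponentially before retrying is the usual pattern. A sketch using the requests package from the other examples; it does not parse the retry hint in the error message, it simply doubles the delay each attempt:

```python
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Delay before retry number `attempt` (0-based): 1s, 2s, 4s, ... capped."""
    return min(cap, base * (2 ** attempt))

def post_with_retry(url, headers, payload, max_retries=5):
    """POST, retrying on 429 with exponential backoff."""
    import requests  # same HTTP client used in the other examples
    for attempt in range(max_retries):
        resp = requests.post(url, headers=headers, json=payload, timeout=60)
        if resp.status_code != 429:
            resp.raise_for_status()  # surface non-429 errors immediately
            return resp.json()
        time.sleep(backoff_delay(attempt))
    raise RuntimeError(f"still rate limited after {max_retries} attempts")

print([backoff_delay(i) for i in range(6)])  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```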

SDK / Libraries

Use the official Pellet SDKs for typed responses, intelligent routing configuration, and Pellet-specific features like pellet_metadata and routing.explain().

Python — pellet-ai on PyPI · GitHub

bash
pip install pellet-ai
python
from pellet import Pellet

client = Pellet(api_key="pk_live_your_key")

response = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ],
    tags=["production", "onboarding-v2"]
)
print(response.choices[0].message.content)
print(response.pellet_metadata.task_type)   # e.g. "general_qa"
print(response.pellet_metadata.cost_usd)    # e.g. 0.00012
print(response.pellet_metadata.tags)        # ["production", "onboarding-v2"]

Node.js / TypeScript — pellet-ai on npm · GitHub

bash
npm install pellet-ai
typescript
import Pellet from "pellet-ai";

const client = new Pellet({ apiKey: "pk_live_your_key" });

const response = await client.chat.completions.create({
  messages: [
    { role: "user", content: "What is machine learning?" }
  ],
  tags: ["production", "onboarding-v2"],
});
console.log(response.choices[0].message.content);
console.log(response.pelletMetadata?.taskType);   // e.g. "general_qa"
console.log(response.pelletMetadata?.costUsd);     // e.g. 0.00012
console.log(response.pelletMetadata?.tags);        // ["production", "onboarding-v2"]

cURL

bash
curl -X POST https://getpellet.io/v1/chat/completions \
  -H "Authorization: Bearer pk_live_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "pellet_config": {"routing_mode": "fastest"}
  }'

OpenAI SDK compatible: You can also use the official OpenAI Python or Node SDK with base_url="https://getpellet.io/v1".