Same quality. 87% less cost.
Pellet intelligently routes every request to the smallest model that can handle it — zero quality loss.
Deploy in your VPC for full data sovereignty.
87%
Cost Saved
vs GPT-4o
12
Models
all open-source
<15ms
Routing
p99 latency
100%
OpenAI Compat
drop-in SDK
How it works
How we save you 87%
Completely transparent to your code.
Task Detection
Your prompt is classified into one of 15+ task types — extraction, reasoning, code, speech-to-text, document parsing, and more.
Complexity Score
A lightweight scorer rates complexity 1-5 based on prompt length, vocabulary, and task signals.
Model Selection
The routing matrix picks the smallest capable model from our curated catalog and routes to the optimal provider.
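The three steps above can be sketched in a few lines. Everything here is illustrative: the task keywords, complexity heuristic, and routing matrix are invented for the sketch and are not Pellet's actual classifier or catalog mapping.

```python
# Illustrative routing sketch: task detection -> complexity score -> model pick.
# Keywords, thresholds, and the routing matrix are hypothetical.

TASK_KEYWORDS = {
    "extraction": ("extract", "parse", "pull"),
    "code": ("function", "bug", "refactor"),
    "reasoning": ("prove", "step by step", "why"),
}

ROUTING_MATRIX = {
    # (task, complexity ceiling) -> smallest model that handles it
    ("extraction", 2): "Qwen/Qwen2.5-7B-Instruct-Turbo",
    ("extraction", 5): "meta-llama/Llama-3.3-70B-Instruct-Turbo",
    ("code", 3): "mistralai/Mistral-Small-24B-Instruct-2501",
    ("code", 5): "deepseek-ai/DeepSeek-V3.1",
    ("reasoning", 5): "deepseek-ai/DeepSeek-R1",
}

def detect_task(prompt: str) -> str:
    lowered = prompt.lower()
    for task, words in TASK_KEYWORDS.items():
        if any(w in lowered for w in words):
            return task
    return "reasoning"  # unknown prompts default to the safest bucket

def score_complexity(prompt: str) -> int:
    # Lightweight heuristic: longer prompts with a richer vocabulary score higher.
    words = prompt.split()
    score = 1 + len(words) // 50 + len(set(words)) // 40
    return min(score, 5)

def select_model(prompt: str) -> str:
    task = detect_task(prompt)
    score = score_complexity(prompt)
    # Smallest model whose complexity ceiling covers the score.
    candidates = [(cap, m) for (t, cap), m in ROUTING_MATRIX.items()
                  if t == task and cap >= score]
    return min(candidates)[1]

print(select_model("Extract the invoice total from this text"))
# -> Qwen/Qwen2.5-7B-Instruct-Turbo
```

A short extraction prompt scores complexity 1 and lands on the 7B model; the same logic sends a long chain-of-thought prompt to the largest reasoning model.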
Platform
More than just routing
A complete inference optimization platform that learns, adapts, and scales with your workload.
Adaptive Context Engine
Stateful learning layer that tracks performance per user and task, continuously improving routing accuracy over time.
Workload Analysis
Automated profiling of your API traffic — task distribution, quality scores, cost insights, and optimization recommendations.
Synthetic Data Pipeline
Generate high-quality training data from your production workloads. Automated seed extraction, generation, and QA.
Fine-tuning Pipeline
Managed LoRA/QLoRA fine-tuning with evaluation gates and one-click deployment. Train models tailored to your workload.
Self-Hosted Models
Deploy Pellet’s routing and inference stack inside your own VPC. Full data sovereignty, regulatory compliance, and zero data leaving your infrastructure.
Human-in-the-Loop
Quality validation network for edge cases. Human annotators ensure accuracy where model confidence is low.
Use Cases
Every task, the right model
Pellet knows which model fits each workload. No prompt engineering required.
Data Extraction
Pull structured data from unstructured text at a fraction of the cost.
Classification
Route, tag, or score content at scale with lightweight models.
Code Generation
Specialized code models for generation, review, and debugging.
Summarization
Condense long documents efficiently with great coherence.
Translation
50+ languages with near-human quality at 95% less cost.
Reasoning
Complex multi-step tasks routed to larger models only when needed.
Speech to Text
Transcribe audio and video with Whisper models. Fast and accurate.
Document Parsing
Extract tables and structured data from PDFs and scanned documents.
Content Generation
Blog posts, marketing copy, and creative writing with quality-optimized routing.
Cost Comparison
Stop burning budget on GPT-4o
Cost per 1,000 requests (typical prompt + completion)
| Task | GPT-4o | Claude Sonnet | Pellet | Savings |
|---|---|---|---|---|
| Classification | $0.050 | $0.030 | $0.001 | ↓ 98% |
| Data Extraction | $0.080 | $0.048 | $0.002 | ↓ 97% |
| Summarization | $0.100 | $0.060 | $0.005 | ↓ 95% |
| Code Generation | $0.120 | $0.075 | $0.012 | ↓ 90% |
| Translation | $0.070 | $0.042 | $0.001 | ↓ 99% |
| Complex Reasoning | $0.200 | $0.120 | $0.015 | ↓ 92% |
| Speech to Text | $0.006 | — | $0.003 | ↓ 50% |
| Document Parsing | $0.150 | $0.090 | $0.008 | ↓ 95% |
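The savings column follows directly from the two cost columns. A quick check against three rows of the table:

```python
# Recompute the "Savings" column from the cost-per-1K-requests figures above.
table = {
    "Classification": (0.050, 0.001),
    "Summarization": (0.100, 0.005),
    "Complex Reasoning": (0.200, 0.015),
}

def savings_pct(gpt4o_cost: float, pellet_cost: float) -> int:
    return round(100 * (1 - pellet_cost / gpt4o_cost))

for task, (gpt4o_cost, pellet_cost) in table.items():
    print(f"{task}: down {savings_pct(gpt4o_cost, pellet_cost)}%")
```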
Integration
Integrate in minutes
100% OpenAI SDK compatible. No new libraries to learn.
```python
from openai import OpenAI

client = OpenAI(
    api_key="plt_your_api_key",
    base_url="https://getpellet.io/v1",
)

response = client.chat.completions.create(
    model="auto",  # Pellet picks the best model
    messages=[
        {"role": "user", "content": "Extract the invoice total"},
    ],
)

print(response.choices[0].message.content)
# pellet_metadata in response.model_extra
```

Catalog
Curated models, one API
Open-source models from 809M to 685B parameters. All accessible through a single endpoint.
Gemma 3n E4B
4B · google/gemma-3n-E4B-it
Llama 3 8B Lite
8B · meta-llama/Meta-Llama-3-8B-Instruct-Lite
Qwen 2.5 7B
7B · Qwen/Qwen2.5-7B-Instruct-Turbo
Qwen 3.5 9B
9B · Qwen/Qwen3.5-9B
Mistral Small 24B
24B · mistralai/Mistral-Small-24B-Instruct-2501
Mixtral 8x7B
47B · mistralai/Mixtral-8x7B-Instruct-v0.1
Llama 3.3 70B
70B · meta-llama/Llama-3.3-70B-Instruct-Turbo
DeepSeek V3.1
685B · deepseek-ai/DeepSeek-V3.1
DeepSeek R1
685B · deepseek-ai/DeepSeek-R1
Whisper Large v3 Turbo
809M · openai/whisper-large-v3-turbo
Llama 3.2 Vision 11B
11B · meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo
Llama 3.2 Vision 90B
90B · meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo
Pricing
Save 60-97% vs GPT-4o
You pay the actual provider cost — no token markup. We charge a flat $0.001 per request for intelligent routing. That's the entire business model.
Pay as you go
+ at-cost inference · $2.50/mo free credits
- $2.50 / month in free credits
- At-cost inference — zero token markup
- $0.001 flat fee per request for routing
- All 12 models with intelligent auto-routing
- Full analytics dashboard
- 100 requests / minute
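Putting the pricing rules together, a monthly bill is at-cost inference plus the flat routing fee, minus the monthly credit. The request volume and average provider cost below are hypothetical, chosen only to make the arithmetic concrete:

```python
# Monthly bill under pay-as-you-go: at-cost inference + $0.001/request routing,
# minus the $2.50 monthly credit. Volume and provider cost are hypothetical.
requests = 100_000
avg_inference_cost = 0.0005            # at-cost provider spend per request

inference = requests * avg_inference_cost   # 50.00
routing = requests * 0.001                  # 100.00
bill = max(inference + routing - 2.50, 0.0)

print(f"${bill:.2f}")  # $147.50
```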
Your savings
Cost per 1K requests vs GPT-4o
How it works: Pellet routes to the smallest model that can handle your task. You were paying $15/M tokens with GPT-4o. Now you pay $0.36/M with Pellet.
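At the per-token rates quoted above, the headline savings figure is easy to verify. The 1,000-tokens-per-request average is an assumption for the sake of the cost-per-1K-requests conversion:

```python
# Verify the savings implied by the per-million-token rates above.
gpt_per_m = 15.00     # $/M tokens, from the example above
pellet_per_m = 0.36
tokens_per_request = 1_000  # assumed average prompt + completion

def cost_per_1k_requests(rate_per_m: float) -> float:
    return rate_per_m * tokens_per_request * 1_000 / 1_000_000

savings = 1 - pellet_per_m / gpt_per_m
print(f"Cost/1K requests: ${cost_per_1k_requests(gpt_per_m):.2f} "
      f"-> ${cost_per_1k_requests(pellet_per_m):.2f}")
print(f"Savings: {savings:.1%}")  # 97.6%
```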
FAQ
Frequently asked
Everything you need to know before you ship.
**Is Pellet a drop-in replacement for the OpenAI SDK?**
Yes. Change your base_url and API key. Your existing code, LangChain, and LlamaIndex integrations work without modification.
**How does routing work?**
Pellet classifies each request by task type and complexity, then selects the optimal model from our curated catalog. Choose auto, fastest, cheapest, or quality mode — or override with pellet_config.
**Where do the models run?**
Pellet routes to optimized inference providers to ensure low latency and high availability for every model in the catalog.
**What happens when routing confidence is low?**
Below a 0.6 confidence score, Pellet falls back to larger models (24B+ or MoE). Quality is never sacrificed.
**How much does it cost?**
Every account gets $2.50/month in free credits. After that, you pay at-cost inference (no token markup) plus a flat $0.001 per request routing fee. That's it. Billing via Paddle.
**How do I know which model handled my request?**
Every response includes pellet_metadata with model, task type, confidence, latency, and cost. Use /v1/routing/explain to preview routing decisions before sending requests.
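Since pellet_metadata travels with every response, it can drive application logic as well as debugging. The payload below is hypothetical: the field names are taken from the FAQ, the values are invented, and should_audit is an illustrative helper, not part of any SDK:

```python
# Hypothetical pellet_metadata payload; field names from the FAQ, values invented.
metadata = {
    "model": "Qwen/Qwen2.5-7B-Instruct-Turbo",
    "task_type": "extraction",
    "confidence": 0.91,
    "latency_ms": 412,
    "cost_usd": 0.0000021,
}

def should_audit(meta: dict, threshold: float = 0.6) -> bool:
    # Flag low-confidence routings for human review, mirroring the
    # 0.6 fallback threshold described above.
    return meta["confidence"] < threshold

print(should_audit(metadata))  # False: 0.91 is well above the threshold
```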
**What are the rate limits?**
All accounts start at 100 requests/minute. Need more? Contact us to increase your limits.
Roadmap
Where we're headed
A clear path from intelligent routing to a full inference optimization platform.
Foundation
- Core inference layer: vLLM backend, 10+ SLMs hosted, OpenAI-compatible API
- Static routing classifier with task detection
- Developer dashboard: usage analytics, model comparison, routing config
- Free tier launch and developer content push
Intelligence
- Context engine v1: performance tracking, per-user routing personalization
- Calibration flow for new users
- Workload analytics on dashboard (task distribution, quality scores, cost insights)
- Team tier launch
Enterprise
- Synthetic data pipeline: workload analysis, seed extraction, generation, QA
- Fine-tuning pipeline: managed LoRA/QLoRA, eval gates, one-click deployment
- Private cloud deployment: run routing + inference inside your VPC for data sovereignty and compliance
- Enterprise features: SSO, audit logs, SLA, dedicated endpoints
Scale
- Context engine v2: collaborative filtering, automatic fine-tuning recommendations
- Edge deployment: serve fine-tuned SLMs at the edge for ultra-low latency
- Marketplace: community routing strategies and fine-tuned model templates
- Multi-modal routing: extend to vision and audio SLMs
Ready to save 87%?
Join developers shipping production AI apps at a fraction of the cost. No credit card required to start.
Start Building Free