Same quality. 87% less cost.
Pellet intelligently routes every request to the smallest model that can handle it — zero quality loss.
Deploy in your VPC for full data sovereignty.
87%
Cost Saved
vs GPT-4o
12
Models
all open-source
<15ms
Routing
p99 latency
100%
OpenAI Compat
drop-in SDK
How it works
How we save you 87%
Completely transparent to your code.
Task Detection
Your prompt is classified into one of 15+ task types — extraction, reasoning, code, speech-to-text, document parsing, and more.
Complexity Score
A lightweight scorer rates complexity 1-5 based on prompt length, vocabulary, and task signals.
Model Selection
The routing matrix picks the smallest capable model from our curated catalog and routes to the optimal provider.
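The three steps above can be sketched in a few lines. Everything here is illustrative: the task keywords, complexity heuristic, and routing matrix are invented for the sketch and are not Pellet's actual classifier or catalog mapping.

```python
# Illustrative routing sketch: task detection -> complexity score -> model pick.
# Keywords, thresholds, and the routing matrix are hypothetical.

TASK_KEYWORDS = {
    "extraction": ("extract", "parse", "pull"),
    "code": ("function", "bug", "refactor"),
    "reasoning": ("prove", "step by step", "why"),
}

ROUTING_MATRIX = {
    # (task, complexity ceiling) -> smallest model that handles it
    ("extraction", 2): "Qwen/Qwen2.5-7B-Instruct-Turbo",
    ("extraction", 5): "meta-llama/Llama-3.3-70B-Instruct-Turbo",
    ("code", 3): "mistralai/Mistral-Small-24B-Instruct-2501",
    ("code", 5): "deepseek-ai/DeepSeek-V3.1",
    ("reasoning", 5): "deepseek-ai/DeepSeek-R1",
}

def detect_task(prompt: str) -> str:
    lowered = prompt.lower()
    for task, words in TASK_KEYWORDS.items():
        if any(w in lowered for w in words):
            return task
    return "reasoning"  # unknown prompts default to the safest bucket

def score_complexity(prompt: str) -> int:
    # Lightweight heuristic: longer prompts with a richer vocabulary score higher.
    words = prompt.split()
    score = 1 + len(words) // 50 + len(set(words)) // 40
    return min(score, 5)

def select_model(prompt: str) -> str:
    task = detect_task(prompt)
    score = score_complexity(prompt)
    # Smallest model whose complexity ceiling covers the score.
    candidates = [(cap, m) for (t, cap), m in ROUTING_MATRIX.items()
                  if t == task and cap >= score]
    return min(candidates)[1]

print(select_model("Extract the invoice total from this text"))
# -> Qwen/Qwen2.5-7B-Instruct-Turbo
```

A short extraction prompt scores complexity 1 and lands on the 7B model; the same logic sends a long chain-of-thought prompt to the largest reasoning model.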
Platform
More than just routing
A complete inference optimization platform that learns, adapts, and scales with your workload.
Adaptive Context Engine
Stateful learning layer that tracks performance per user and task, continuously improving routing accuracy over time.
Workload Analysis
Automated profiling of your API traffic — task distribution, quality scores, cost insights, and optimization recommendations.
Synthetic Data Pipeline
Generate high-quality training data from your production workloads. Automated seed extraction, generation, and QA.
Fine-tuning Pipeline
Managed LoRA/QLoRA fine-tuning with evaluation gates and one-click deployment. Train models tailored to your workload.
Self-Hosted Models
Deploy Pellet’s routing and inference stack inside your own VPC. Full data sovereignty, regulatory compliance, and zero data leaving your infrastructure.
Human-in-the-Loop
Quality validation network for edge cases. Human annotators ensure accuracy where model confidence is low.
Use Cases
Every task, the right model
Pellet knows which model fits each workload. No prompt engineering required.
Data Extraction
Pull structured data from unstructured text at a fraction of the cost.
Classification
Route, tag, or score content at scale with lightweight models.
Code Generation
Specialized code models for generation, review, and debugging.
Summarization
Condense long documents efficiently with great coherence.
Translation
50+ languages with near-human quality at 95% less cost.
Reasoning
Complex multi-step tasks routed to larger models only when needed.
Speech to Text
Transcribe audio and video with Whisper models. Fast and accurate.
Document Parsing
Extract tables and structured data from PDFs and scanned documents.
Content Generation
Blog posts, marketing copy, and creative writing with quality-optimized routing.
Cost Comparison
Stop burning budget on GPT-4o
Cost per 1,000 requests (typical prompt + completion)
| Task | GPT-4o | Claude Sonnet | Pellet | Savings |
|---|---|---|---|---|
| Classification | $0.050 | $0.030 | $0.001 | ↓ 98% |
| Data Extraction | $0.080 | $0.048 | $0.002 | ↓ 97% |
| Summarization | $0.100 | $0.060 | $0.005 | ↓ 95% |
| Code Generation | $0.120 | $0.075 | $0.012 | ↓ 90% |
| Translation | $0.070 | $0.042 | $0.001 | ↓ 99% |
| Complex Reasoning | $0.200 | $0.120 | $0.015 | ↓ 92% |
| Speech to Text | $0.006 | — | $0.003 | ↓ 50% |
| Document Parsing | $0.150 | $0.090 | $0.008 | ↓ 95% |
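The savings column follows directly from the two cost columns. A quick check against three rows of the table:

```python
# Recompute the "Savings" column from the cost-per-1K-requests figures above.
table = {
    "Classification": (0.050, 0.001),
    "Summarization": (0.100, 0.005),
    "Complex Reasoning": (0.200, 0.015),
}

def savings_pct(gpt4o_cost: float, pellet_cost: float) -> int:
    return round(100 * (1 - pellet_cost / gpt4o_cost))

for task, (gpt4o_cost, pellet_cost) in table.items():
    print(f"{task}: down {savings_pct(gpt4o_cost, pellet_cost)}%")
```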
Integration
Integrate in minutes
100% OpenAI SDK compatible. No new libraries to learn.
```python
from openai import OpenAI

client = OpenAI(
    api_key="plt_your_api_key",
    base_url="https://getpellet.io/v1",
)

response = client.chat.completions.create(
    model="auto",  # Pellet picks the best model
    messages=[
        {"role": "user", "content": "Extract the invoice total"},
    ],
)

print(response.choices[0].message.content)
# pellet_metadata in response.model_extra
```

Catalog
Curated models, one API
Open-source models from 809M to 685B parameters. All accessible through a single endpoint.
Gemma 3n E4B
4B · google/gemma-3n-E4B-it
Llama 3 8B Lite
8B · meta-llama/Meta-Llama-3-8B-Instruct-Lite
Qwen 2.5 7B
7B · Qwen/Qwen2.5-7B-Instruct-Turbo
Qwen 3.5 9B
9B · Qwen/Qwen3.5-9B
Mistral Small 24B
24B · mistralai/Mistral-Small-24B-Instruct-2501
Mixtral 8x7B
47B · mistralai/Mixtral-8x7B-Instruct-v0.1
Llama 3.3 70B
70B · meta-llama/Llama-3.3-70B-Instruct-Turbo
DeepSeek V3.1
685B · deepseek-ai/DeepSeek-V3.1
DeepSeek R1
685B · deepseek-ai/DeepSeek-R1
Whisper Large v3 Turbo
809M · openai/whisper-large-v3-turbo
Llama 3.2 Vision 11B
11B · meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo
Llama 3.2 Vision 90B
90B · meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo
Pricing
Save 60-97% vs GPT-4o
You pay the actual provider cost — no token markup. We charge a flat $0.001 per request for intelligent routing. That's the entire business model.
Pay as you go
+ at-cost inference · $2.50/mo free credits
- $2.50 / month in free credits
- At-cost inference — zero token markup
- $0.001 flat fee per request for routing
- All 12 models with intelligent auto-routing
- Full analytics dashboard
- 100 requests / minute
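Putting the pricing rules together, a monthly bill is at-cost inference plus the flat routing fee, minus the monthly credit. The request volume and average provider cost below are hypothetical, chosen only to make the arithmetic concrete:

```python
# Monthly bill under pay-as-you-go: at-cost inference + $0.001/request routing,
# minus the $2.50 monthly credit. Volume and provider cost are hypothetical.
requests = 100_000
avg_inference_cost = 0.0005            # at-cost provider spend per request

inference = requests * avg_inference_cost   # 50.00
routing = requests * 0.001                  # 100.00
bill = max(inference + routing - 2.50, 0.0)

print(f"${bill:.2f}")  # $147.50
```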
Your savings
Cost per 1K requests vs GPT-4o
How it works: Pellet routes to the smallest model that can handle your task. You were paying $15/M tokens with GPT-4o. Now you pay $0.36/M with Pellet.
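At the per-token rates quoted above, the headline savings figure is easy to verify. The 1,000-tokens-per-request average is an assumption for the sake of the cost-per-1K-requests conversion:

```python
# Verify the savings implied by the per-million-token rates above.
gpt_per_m = 15.00     # $/M tokens, from the example above
pellet_per_m = 0.36
tokens_per_request = 1_000  # assumed average prompt + completion

def cost_per_1k_requests(rate_per_m: float) -> float:
    return rate_per_m * tokens_per_request * 1_000 / 1_000_000

savings = 1 - pellet_per_m / gpt_per_m
print(f"Cost/1K requests: ${cost_per_1k_requests(gpt_per_m):.2f} "
      f"-> ${cost_per_1k_requests(pellet_per_m):.2f}")
print(f"Savings: {savings:.1%}")  # 97.6%
```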
FAQ
Frequently asked
Everything you need to know before you ship.
**Is Pellet a drop-in replacement for the OpenAI SDK?**
Yes. Change your base_url and API key. Your existing code, LangChain, and LlamaIndex integrations work without modification.
**How does routing work?**
Pellet classifies each request by task type and complexity, then selects the optimal model from our curated catalog. Choose auto, fastest, cheapest, or quality mode — or override with pellet_config.
**Where do the models run?**
Pellet routes to optimized inference providers to ensure low latency and high availability for every model in the catalog.
**What happens when routing confidence is low?**
Below a 0.6 confidence score, Pellet falls back to larger models (24B+ or MoE). Quality is never sacrificed.
**How much does it cost?**
Every account gets $2.50/month in free credits. After that, you pay at-cost inference (no token markup) plus a flat $0.001 per request routing fee. That's it. Billing via Paddle.
**How do I know which model handled my request?**
Every response includes pellet_metadata with model, task type, confidence, latency, and cost. Use /v1/routing/explain to preview routing decisions before sending requests.
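Since pellet_metadata travels with every response, it can drive application logic as well as debugging. The payload below is hypothetical: the field names are taken from the FAQ, the values are invented, and should_audit is an illustrative helper, not part of any SDK:

```python
# Hypothetical pellet_metadata payload; field names from the FAQ, values invented.
metadata = {
    "model": "Qwen/Qwen2.5-7B-Instruct-Turbo",
    "task_type": "extraction",
    "confidence": 0.91,
    "latency_ms": 412,
    "cost_usd": 0.0000021,
}

def should_audit(meta: dict, threshold: float = 0.6) -> bool:
    # Flag low-confidence routings for human review, mirroring the
    # 0.6 fallback threshold described above.
    return meta["confidence"] < threshold

print(should_audit(metadata))  # False: 0.91 is well above the threshold
```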
**What are the rate limits?**
All accounts start at 100 requests/minute. Need more? Contact us to increase your limits.
Roadmap
Where we're headed
A clear path from intelligent routing to a full inference optimization platform.
Foundation
- Core inference layer: vLLM backend, 10+ SLMs hosted, OpenAI-compatible API
- Static routing classifier with task detection
- Developer dashboard: usage analytics, model comparison, routing config
- Free tier launch and developer content push
Intelligence
- Context engine v1: performance tracking, per-user routing personalization
- Calibration flow for new users
- Workload analytics on dashboard (task distribution, quality scores, cost insights)
- Team tier launch
Enterprise
- Synthetic data pipeline: workload analysis, seed extraction, generation, QA
- Fine-tuning pipeline: managed LoRA/QLoRA, eval gates, one-click deployment
- Private cloud deployment: run routing + inference inside your VPC for data sovereignty and compliance
- Enterprise features: SSO, audit logs, SLA, dedicated endpoints
Scale
- Context engine v2: collaborative filtering, automatic fine-tuning recommendations
- Edge deployment: serve fine-tuned SLMs at the edge for ultra-low latency
- Marketplace: community routing strategies and fine-tuned model templates
- Multi-modal routing: extend to vision and audio SLMs
Ready to save 87%?
Join developers shipping production AI apps at a fraction of the cost. No credit card required to start.
Start Building Free