LLM Cost Optimization: Cut Your API Bill by 80%
The $15K Wake-Up Call
Month 3 of our AI feature: the OpenAI bill hits $15,000. The CEO asks questions. We optimize. Month 4: $3,200, even though traffic grew 20%.
Here's exactly what we did.
The Cost Breakdown
Where LLM costs come from:
- Input tokens (prompts)
- Output tokens (completions)
- Model choice (GPT-4 vs GPT-3.5 vs Claude)
- Request volume
The math:
GPT-4: $30/1M input tokens, $60/1M output tokens
Claude Sonnet: $3/1M input, $15/1M output
GPT-3.5: $0.50/1M input, $1.50/1M output
Small changes compound at scale; the quick arithmetic below shows how much model choice alone moves the needle.
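To make those rates concrete, here's the per-request arithmetic for a typical call (the token counts are illustrative):

# Illustrative request: 500 input tokens, 200 output tokens
input_tokens, output_tokens = 500, 200

# Rates above are $ per 1M tokens
gpt4_cost = input_tokens / 1e6 * 30 + output_tokens / 1e6 * 60        # ~$0.027
gpt35_cost = input_tokens / 1e6 * 0.50 + output_tokens / 1e6 * 1.50   # ~$0.00055

print(f"GPT-4: ${gpt4_cost:.4f} per request, GPT-3.5: ${gpt35_cost:.5f} per request")

At 500K requests a month, that's the difference between roughly $13,500 and $275.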
1. Prompt Compression: The Quick Win
Before (350 tokens):
You are a highly experienced customer service representative with deep knowledge of our product... [250 tokens of unnecessary preamble]
Customer question: {question}
Please provide a helpful response.
After (80 tokens):
Answer customer questions accurately using the provided context.
Context: {relevant_context}
Question: {question}
Answer:
Savings: 77% fewer input tokens
Key tactics:
- Remove flowery language
- Cache system messages (Anthropic)
- Use abbreviations where clear
- Strip unnecessary instructions
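To verify what a compression pass actually saves, count tokens locally before shipping the change. A minimal sketch using tiktoken; the strings and model name are illustrative:

import tiktoken

# tiktoken maps OpenAI model names to their tokenizers
enc = tiktoken.encoding_for_model("gpt-4")

verbose_prompt = "You are a highly experienced customer service representative..."
compressed_prompt = "Answer customer questions accurately using the provided context."

print(len(enc.encode(verbose_prompt)), "->", len(enc.encode(compressed_prompt)), "tokens")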
2. Model Routing: Right Tool, Right Job
Not every task needs GPT-4.
Our routing logic:
def route_to_model(task_type, complexity):
    if task_type == "simple_classification":
        return "gpt-3.5-turbo"   # $0.50/M input
    elif task_type == "extraction" and complexity == "low":
        return "gpt-3.5-turbo"
    elif task_type == "reasoning" and complexity == "high":
        return "gpt-4"           # $30/M input
    elif task_type == "long_context":
        return "claude-sonnet"   # Better $/performance for long context
    else:
        return "gpt-3.5-turbo"   # Default to cheap
Result: 60% of our tasks ran on GPT-3.5 instead of GPT-4.
Savings: ~$7K/month
3. Response Truncation: Stop When Done
Models keep generating until they hit max_tokens or a stop sequence.
Bad:
response = openai.chat.completions.create(
    model="gpt-4",
    max_tokens=1000,  # Often generates 800+ unnecessary tokens
    messages=[...]
)
Good:
response = openai.chat.completions.create(
    model="gpt-4",
    max_tokens=200,        # Tight constraint
    stop=["###", "---"],   # Stop early if the model signals it's done
    messages=[...]
)
Savings: 60-70% fewer output tokens
4. Batch Processing: Group API Calls
Before: 1 API call per item
for item in items:
    result = llm.process(item)
Cost: N API calls
After: Batch items per request
import json

def batch_process(items, batch_size=10):
    results = []
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        prompt = f"Process these items:\n{json.dumps(batch)}"
        results.append(llm.process(prompt))
    return results
Savings:
- Fewer API calls = lower overhead
- Shared system message across items
- Often 40-50% cost reduction
Pro tip: Balance batch size with latency needs.
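One detail the batching snippet glosses over is getting per-item results back out. A simple approach (an assumption, not something the API does for you) is to ask the model to reply with a JSON array in the same order as the inputs, then validate it:

import json

def parse_batch_response(raw_text, expected_count):
    # Expect a JSON array with one entry per input item, in order
    try:
        results = json.loads(raw_text)
        if isinstance(results, list) and len(results) == expected_count:
            return results
    except json.JSONDecodeError:
        pass
    return None  # Caller can fall back to processing items one by one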
5. Caching: Don't Recompute
Anthropic's Prompt Caching:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,  # This block gets cached
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": user_query}]
)
Cost:
- First request: Full price
- Cache hits (next 5 min): 90% discount on cached portion
Our savings: $2K/month on system message repetition alone.
6. Smart Sampling: Don't Call LLM When You Don't Need To
Before: Every user action triggered an LLM call
def handle_user_input(user_input):
    return llm.generate_response(user_input)
After: Filter first
def handle_user_input(user_input):
    # Quick heuristics first
    if is_faq(user_input):
        return cached_faq_response(user_input)
    if is_simple_query(user_input):
        return rule_based_response(user_input)
    # Only complex queries hit the LLM
    return llm.generate_response(user_input)
Result: 35% of requests handled without LLM calls.
Savings: ~$4K/month
7. Output Constraints: Force Brevity
Bad:
"Explain this concept to the user."
Average output: 300 tokens
Good:
"Explain this concept in exactly 2 sentences."
Average output: 50 tokens
Savings: 83% fewer output tokens
Tactics:
- Specify exact sentence/word count
- Request bullet points instead of paragraphs
- Use JSON output (shorter than prose)
- Add "be concise" instruction
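Several of these tactics combine naturally in one call. A minimal sketch against the OpenAI chat API (model, cap, and prompt are illustrative; JSON mode requires mentioning JSON in the prompt):

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    max_tokens=100,                           # Hard cap on output spend
    response_format={"type": "json_object"},  # JSON is terser than prose
    messages=[
        {"role": "system", "content": 'Reply as JSON: {"summary": "<2 sentences>"}. Be concise.'},
        {"role": "user", "content": "Explain what a vector database is."},
    ],
)
print(response.choices[0].message.content)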
8. Model Cascading: Try Cheap First
Pattern:
def generate_with_cascade(prompt):
    # Try the cheap model first
    response = gpt_35(prompt)
    # Check if the response is good enough
    if quality_check(response) > 0.8:
        return response
    # Fall back to the expensive model only if needed
    return gpt_4(prompt)
Result: 70% of requests satisfied by cheap model.
Savings: massive, because you only pay GPT-4 prices for the ~30% of requests that actually need it.
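What quality_check looks like is entirely task-dependent. A hypothetical heuristic version (the thresholds are made up; you could also use a cheap model-graded check) might be:

def quality_check(response: str) -> float:
    # Cheap heuristics only; no extra LLM call
    score = 1.0
    if len(response.strip()) < 20:          # Too short to be a real answer
        score -= 0.5
    if "i'm not sure" in response.lower():  # Hedging often means the cheap model struggled
        score -= 0.3
    if not response.rstrip().endswith((".", "!", "?", "}", "]")):  # Likely truncated
        score -= 0.3
    return max(score, 0.0)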
9. Fine-Tuning: Long-Term Optimization
When it makes sense:
- Repetitive task (classification, extraction)
- High volume (>10K requests/day)
- Stable task definition
Cost comparison (1M requests):
- GPT-4 prompting: $30K
- GPT-3.5 prompting: $500
- Fine-tuned GPT-3.5: $300 (training) + $400 (inference) = $700 total
Break-even: ~50K requests
Our fine-tuning wins:
- Support ticket classification: GPT-4 → fine-tuned 3.5 (same accuracy, 98% cost reduction)
- Sentiment analysis: Same accuracy as Claude, 95% cheaper
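If you go this route, launching the job is the easy part; building a clean JSONL training set is where the work is. A minimal sketch using the OpenAI SDK (the filename is a placeholder):

from openai import OpenAI

client = OpenAI()

# Upload chat-formatted training examples (JSONL)
training_file = client.files.create(
    file=open("support_tickets.jsonl", "rb"),
    purpose="fine-tune",
)

# Fine-tune the cheap base model
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)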
10. Monitoring: Track Cost Per Feature
Essential metrics:
{
    "feature_name": "content_generation",
    "daily_requests": 1200,
    "avg_input_tokens": 250,
    "avg_output_tokens": 180,
    "model": "gpt-4",
    "daily_cost": "$45",
    "monthly_projection": "$1350"
}
Dashboard we built:
- Cost per feature
- Cost per user
- Token usage trends
- Model distribution
Result: Identified that 80% of costs came from 20% of features. Optimized those first.
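The 80/20 finding fell out of a simple roll-up. A sketch assuming your cost logger emits dicts with feature and cost fields (the record shape is an assumption):

from collections import defaultdict

def cost_by_feature(cost_records):
    # Sum logged cost per feature and sort highest-spend first
    totals = defaultdict(float)
    for record in cost_records:
        totals[record["feature"]] += record["cost"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)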
11. Open-Source Models: The Nuclear Option
When to consider:
- Stable, well-defined task
- Willing to host infrastructure
- High volume (>1M requests/day)
Cost comparison (1M requests):
- GPT-4 API: $30K
- Self-hosted Llama 3 70B: ~$2K/month infra + one-time setup
Break-even: ~15K requests/day
Trade-offs:
- Lower accuracy (often)
- Ops overhead
- Upfront investment
When it's worth it: Mature products with stable, high-volume use cases.
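Switching doesn't have to mean rewriting call sites. Serving stacks like vLLM expose an OpenAI-compatible endpoint, so the change can be as small as pointing the client at your own server; a sketch (the URL and model name are assumptions about your deployment):

from openai import OpenAI

# Self-hosted, OpenAI-compatible endpoint (e.g., vLLM) instead of the hosted API
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # Whatever your server is configured to serve
    max_tokens=200,
    messages=[{"role": "user", "content": "Classify this ticket: 'My invoice total looks wrong.'"}],
)
print(response.choices[0].message.content)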
12. The Latency-Cost Trade-off
High quality, high cost:
response = gpt_4(prompt, max_tokens=1000)
Cheaper (and usually faster):
# Use a cheaper model with a tighter output cap
response = gpt_35(prompt, max_tokens=200)
Smart middle ground:
# Stream response, stop early if user is satisfied
for chunk in gpt_4_stream(prompt):
    yield chunk
    if user_stopped_reading():  # Placeholder for your own client-side signal
        break  # Don't generate (or pay for) the remaining tokens
Savings: 20-30% on output tokens
Real Numbers: Our Optimization Journey
Baseline (before optimization):
- Requests: 500K
- Cost: $15,000
- Cost per request: $0.03
One month later (after optimizations):
- Requests: 600K (20% growth)
- Cost: $3,200
- Cost per request: $0.0053
What we did:
- Prompt compression → ~30% savings
- Model routing → ~45% savings
- Caching → ~15% savings
- Batching → ~10% savings
(These per-lever estimates overlap, so they don't simply add up.)
Net effect: 83% lower cost per request
The Optimization Framework
Step 1: Measure
- Track cost per feature/endpoint
- Measure avg tokens per request
- Identify high-cost areas
Step 2: Low-Hanging Fruit
- Compress prompts
- Add output limits
- Remove unnecessary calls
Step 3: Architecture
- Route to cheaper models
- Cache aggressively
- Batch where possible
Step 4: Long-Term
- Fine-tune for high-volume tasks
- Consider open-source for stable workloads
- Continuously monitor and optimize
Gotchas to Avoid
Don't:
- Over-optimize at the expense of quality
- Batch to the point of bad UX (latency)
- Fine-tune before validating task stability
- Remove safety checks to save tokens
Do:
- A/B test optimizations
- Monitor quality metrics alongside cost
- Start with quick wins (prompt compression)
- Automate cost tracking
Tools & Code
Cost tracking:
import anthropic
from functools import wraps

client = anthropic.Anthropic()
def track_cost(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        response = func(*args, **kwargs)
        # MODEL_INPUT_COST / MODEL_OUTPUT_COST: $ per token for the model in use
        input_cost = response.usage.input_tokens * MODEL_INPUT_COST
        output_cost = response.usage.output_tokens * MODEL_OUTPUT_COST
        log_cost(  # log_cost is your own sink: a DB row, a metrics event, etc.
            feature=func.__name__,
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
            cost=input_cost + output_cost
        )
        return response
    return wrapper
@track_cost
def generate_content(prompt):
    return client.messages.create(...)
Model router:
class ModelRouter:
    def __init__(self):
        # Approximate input cost, $ per 1M tokens
        self.costs = {
            "gpt-4": 30,
            "gpt-3.5": 0.5,
            "claude-sonnet": 3
        }

    def route(self, task_type, budget):
        if budget == "low":
            return "gpt-3.5"
        elif task_type == "simple":
            return "gpt-3.5"
        elif task_type == "long_context":
            return "claude-sonnet"
        else:
            return "gpt-4"
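For completeness, usage is a one-liner per call site:

router = ModelRouter()
print(router.route("long_context", budget="high"))  # claude-sonnet
print(router.route("simple", budget="high"))        # gpt-3.5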
Start Here
- Track costs: Implement per-feature cost monitoring
- Compress prompts: Remove unnecessary tokens
- Route smartly: Use GPT-3.5 where possible
- Set output limits: Add max_tokens constraints
- Cache aggressively: Use prompt caching
You'll see 40-60% savings in the first week.
How much are you spending? Let me know your optimization wins on Twitter or email.