LLM Cost Optimization: Cut Your API Bill by 80%
The $15K Wake-Up Call
Month 3 of our AI feature: the OpenAI bill hits $15,000. The CEO asks questions. We optimize. Month 4: $3,200, even though traffic grew 20%.
Here's exactly what we did.
The Cost Breakdown
Where LLM costs come from:
- Input tokens (prompts)
- Output tokens (completions)
- Model choice (GPT-4 vs GPT-3.5 vs Claude)
- Request volume
The math:
GPT-4: $30/1M input tokens, $60/1M output tokens
Claude Sonnet: $3/1M input, $15/1M output
GPT-3.5: $0.50/1M input, $1.50/1M output
Small changes compound at scale; the quick arithmetic below shows how much model choice alone moves the needle.
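To make those rates concrete, here's the per-request arithmetic for a typical call (the token counts are illustrative):

# Illustrative request: 500 input tokens, 200 output tokens
input_tokens, output_tokens = 500, 200

# Rates above are $ per 1M tokens
gpt4_cost = input_tokens / 1e6 * 30 + output_tokens / 1e6 * 60        # ~$0.027
gpt35_cost = input_tokens / 1e6 * 0.50 + output_tokens / 1e6 * 1.50   # ~$0.00055

print(f"GPT-4: ${gpt4_cost:.4f} per request, GPT-3.5: ${gpt35_cost:.5f} per request")

At 500K requests a month, that's the difference between roughly $13,500 and $275.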
1. Prompt Compression: The Quick Win
Before (350 tokens):
You are a highly experienced customer service representative with deep knowledge of our product... [250 tokens of unnecessary preamble]
Customer question: {question}
Please provide a helpful response.
After (80 tokens):
Answer customer questions accurately using the provided context.
Context: {relevant_context}
Question: {question}
Answer:
Savings: 77% fewer input tokens
Key tactics:
- Remove flowery language
- Cache system messages (Anthropic)
- Use abbreviations where clear
- Strip unnecessary instructions
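To verify what a compression pass actually saves, count tokens locally before shipping the change. A minimal sketch using tiktoken; the strings and model name are illustrative:

import tiktoken

# tiktoken maps OpenAI model names to their tokenizers
enc = tiktoken.encoding_for_model("gpt-4")

verbose_prompt = "You are a highly experienced customer service representative..."
compressed_prompt = "Answer customer questions accurately using the provided context."

print(len(enc.encode(verbose_prompt)), "->", len(enc.encode(compressed_prompt)), "tokens")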
2. Model Routing: Right Tool, Right Job
Not every task needs GPT-4.
Our routing logic:
def route_to_model(task_type, complexity):
    if task_type == "simple_classification":
        return "gpt-3.5-turbo"   # $0.50/M input
    elif task_type == "extraction" and complexity == "low":
        return "gpt-3.5-turbo"
    elif task_type == "reasoning" and complexity == "high":
        return "gpt-4"           # $30/M input
    elif task_type == "long_context":
        return "claude-sonnet"   # Better $/performance for long context
    else:
        return "gpt-3.5-turbo"   # Default to cheap
Result: 60% of our tasks ran on GPT-3.5 instead of GPT-4.
Savings: ~$7K/month
3. Response Truncation: Stop When Done
Models keep generating until they hit max_tokens or a stop sequence.
Bad:
response = openai.chat.completions.create(
    model="gpt-4",
    max_tokens=1000,  # Often generates 800+ unnecessary tokens
    messages=[...]
)
Good:
response = openai.chat.completions.create(
    model="gpt-4",
    max_tokens=200,        # Tight constraint
    stop=["###", "---"],   # Stop early if the model signals it's done
    messages=[...]
)
Savings: 60-70% fewer output tokens
4. Batch Processing: Group API Calls
Before: 1 API call per item
for item in items:
    result = llm.process(item)
Cost: N API calls
After: Batch items per request
import json

def batch_process(items, batch_size=10):
    results = []
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        prompt = f"Process these items:\n{json.dumps(batch)}"
        results.append(llm.process(prompt))
    return results
Savings:
- Fewer API calls = lower overhead
- Shared system message across items
- Often 40-50% cost reduction
Pro tip: Balance batch size with latency needs.
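One detail the batching snippet glosses over is getting per-item results back out. A simple approach (an assumption, not something the API does for you) is to ask the model to reply with a JSON array in the same order as the inputs, then validate it:

import json

def parse_batch_response(raw_text, expected_count):
    # Expect a JSON array with one entry per input item, in order
    try:
        results = json.loads(raw_text)
        if isinstance(results, list) and len(results) == expected_count:
            return results
    except json.JSONDecodeError:
        pass
    return None  # Caller can fall back to processing items one by one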
5. Caching: Don't Recompute
Anthropic's Prompt Caching:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,  # This block gets cached
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": user_query}]
)
Cost:
- First request: Full price
- Cache hits (next 5 min): 90% discount on cached portion
Our savings: $2K/month on system message repetition alone.
6. Smart Sampling: Don't Call LLM When You Don't Need To
Before: Every user action triggered an LLM call
def handle_user_input(user_input):
    return llm.generate_response(user_input)
After: Filter first
def handle_user_input(user_input):
    # Quick heuristics first
    if is_faq(user_input):
        return cached_faq_response(user_input)
    if is_simple_query(user_input):
        return rule_based_response(user_input)
    # Only complex queries hit the LLM
    return llm.generate_response(user_input)
Result: 35% of requests handled without LLM calls.
Savings: ~$4K/month
7. Output Constraints: Force Brevity
Bad:
"Explain this concept to the user."
Average output: 300 tokens
Good:
"Explain this concept in exactly 2 sentences."
Average output: 50 tokens
Savings: 83% fewer output tokens
Tactics:
- Specify exact sentence/word count
- Request bullet points instead of paragraphs
- Use JSON output (shorter than prose)
- Add "be concise" instruction
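Several of these tactics combine naturally in one call. A minimal sketch against the OpenAI chat API (model, cap, and prompt are illustrative; JSON mode requires mentioning JSON in the prompt):

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    max_tokens=100,                           # Hard cap on output spend
    response_format={"type": "json_object"},  # JSON is terser than prose
    messages=[
        {"role": "system", "content": 'Reply as JSON: {"summary": "<2 sentences>"}. Be concise.'},
        {"role": "user", "content": "Explain what a vector database is."},
    ],
)
print(response.choices[0].message.content)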
8. Model Cascading: Try Cheap First
Pattern:
def generate_with_cascade(prompt):
    # Try the cheap model first
    response = gpt_35(prompt)
    # Check if the response is good enough
    if quality_check(response) > 0.8:
        return response
    # Fall back to the expensive model only if needed
    return gpt_4(prompt)
Result: 70% of requests satisfied by cheap model.
Savings: massive, because you only pay GPT-4 prices for the ~30% of requests that actually need it.
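What quality_check looks like is entirely task-dependent. A hypothetical heuristic version (the thresholds are made up; you could also use a cheap model-graded check) might be:

def quality_check(response: str) -> float:
    # Cheap heuristics only; no extra LLM call
    score = 1.0
    if len(response.strip()) < 20:          # Too short to be a real answer
        score -= 0.5
    if "i'm not sure" in response.lower():  # Hedging often means the cheap model struggled
        score -= 0.3
    if not response.rstrip().endswith((".", "!", "?", "}", "]")):  # Likely truncated
        score -= 0.3
    return max(score, 0.0)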
9. Fine-Tuning: Long-Term Optimization
When it makes sense:
- Repetitive task (classification, extraction)
- High volume (>10K requests/day)
- Stable task definition
Cost comparison (1M requests):
- GPT-4 prompting: $30K
- GPT-3.5 prompting: $500
- Fine-tuned GPT-3.5: $300 (training) + $400 (inference) = $700 total
Break-even: ~50K requests
Our fine-tuning wins:
- Support ticket classification: GPT-4 → fine-tuned 3.5 (same accuracy, 98% cost reduction)
- Sentiment analysis: Same accuracy as Claude, 95% cheaper
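If you go this route, launching the job is the easy part; building a clean JSONL training set is where the work is. A minimal sketch using the OpenAI SDK (the filename is a placeholder):

from openai import OpenAI

client = OpenAI()

# Upload chat-formatted training examples (JSONL)
training_file = client.files.create(
    file=open("support_tickets.jsonl", "rb"),
    purpose="fine-tune",
)

# Fine-tune the cheap base model
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)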
10. Monitoring: Track Cost Per Feature
Essential metrics:
{
    "feature_name": "content_generation",
    "daily_requests": 1200,
    "avg_input_tokens": 250,
    "avg_output_tokens": 180,
    "model": "gpt-4",
    "daily_cost": "$45",
    "monthly_projection": "$1350"
}
Dashboard we built:
- Cost per feature
- Cost per user
- Token usage trends
- Model distribution
Result: Identified that 80% of costs came from 20% of features. Optimized those first.
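The 80/20 finding fell out of a simple roll-up. A sketch assuming your cost logger emits dicts with feature and cost fields (the record shape is an assumption):

from collections import defaultdict

def cost_by_feature(cost_records):
    # Sum logged cost per feature and sort highest-spend first
    totals = defaultdict(float)
    for record in cost_records:
        totals[record["feature"]] += record["cost"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)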
11. Open-Source Models: The Nuclear Option
When to consider:
- Stable, well-defined task
- Willing to host infrastructure
- High volume (>1M requests/day)
Cost comparison (1M requests):
- GPT-4 API: $30K
- Self-hosted Llama 3 70B: ~$2K/month infra + one-time setup
Break-even: ~15K requests/day
Trade-offs:
- Lower accuracy (often)
- Ops overhead
- Upfront investment
When it's worth it: Mature products with stable, high-volume use cases.
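Switching doesn't have to mean rewriting call sites. Serving stacks like vLLM expose an OpenAI-compatible endpoint, so the change can be as small as pointing the client at your own server; a sketch (the URL and model name are assumptions about your deployment):

from openai import OpenAI

# Self-hosted, OpenAI-compatible endpoint (e.g., vLLM) instead of the hosted API
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # Whatever your server is configured to serve
    max_tokens=200,
    messages=[{"role": "user", "content": "Classify this ticket: 'My invoice total looks wrong.'"}],
)
print(response.choices[0].message.content)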
12. The Latency-Cost Trade-off
High quality, high cost:
response = gpt_4(prompt, max_tokens=1000)
Cheaper (and usually faster):
# Use a cheaper model with a tighter output cap
response = gpt_35(prompt, max_tokens=200)
Smart middle ground:
# Stream response, stop early if user is satisfied
for chunk in gpt_4_stream(prompt):
    yield chunk
    if user_stopped_reading():  # Placeholder for your own client-side signal
        break  # Don't generate (or pay for) the remaining tokens
Savings: 20-30% on output tokens
Real Numbers: Our Optimization Journey
Baseline (before optimization):
- Requests: 500K
- Cost: $15,000
- Cost per request: $0.03
One month later (after optimizations):
- Requests: 600K (20% growth)
- Cost: $3,200
- Cost per request: $0.0053
What we did:
- Prompt compression → ~30% savings
- Model routing → ~45% savings
- Caching → ~15% savings
- Batching → ~10% savings
(These per-lever estimates overlap, so they don't simply add up.)
Net effect: 83% lower cost per request
The Optimization Framework
Step 1: Measure
- Track cost per feature/endpoint
- Measure avg tokens per request
- Identify high-cost areas
Step 2: Low-Hanging Fruit
- Compress prompts
- Add output limits
- Remove unnecessary calls
Step 3: Architecture
- Route to cheaper models
- Cache aggressively
- Batch where possible
Step 4: Long-Term
- Fine-tune for high-volume tasks
- Consider open-source for stable workloads
- Continuously monitor and optimize
Gotchas to Avoid
Don't:
- Over-optimize at the expense of quality
- Batch to the point of bad UX (latency)
- Fine-tune before validating task stability
- Remove safety checks to save tokens
Do:
- A/B test optimizations
- Monitor quality metrics alongside cost
- Start with quick wins (prompt compression)
- Automate cost tracking
Tools & Code
Cost tracking:
import anthropic
from functools import wraps

client = anthropic.Anthropic()
def track_cost(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        response = func(*args, **kwargs)
        # MODEL_INPUT_COST / MODEL_OUTPUT_COST: $ per token for the model in use
        input_cost = response.usage.input_tokens * MODEL_INPUT_COST
        output_cost = response.usage.output_tokens * MODEL_OUTPUT_COST
        log_cost(  # log_cost is your own sink: a DB row, a metrics event, etc.
            feature=func.__name__,
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
            cost=input_cost + output_cost
        )
        return response
    return wrapper
@track_cost
def generate_content(prompt):
    return client.messages.create(...)
Model router:
class ModelRouter:
    def __init__(self):
        # Approximate input cost, $ per 1M tokens
        self.costs = {
            "gpt-4": 30,
            "gpt-3.5": 0.5,
            "claude-sonnet": 3
        }

    def route(self, task_type, budget):
        if budget == "low":
            return "gpt-3.5"
        elif task_type == "simple":
            return "gpt-3.5"
        elif task_type == "long_context":
            return "claude-sonnet"
        else:
            return "gpt-4"
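For completeness, usage is a one-liner per call site:

router = ModelRouter()
print(router.route("long_context", budget="high"))  # claude-sonnet
print(router.route("simple", budget="high"))        # gpt-3.5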
Start Here
- Track costs: Implement per-feature cost monitoring
- Compress prompts: Remove unnecessary tokens
- Route smartly: Use GPT-3.5 where possible
- Set output limits: Add max_tokens constraints
- Cache aggressively: Use prompt caching
You'll see 40-60% savings in the first week.
How much are you spending? Let me know your optimization wins on Twitter or email.