The invoice surprise
Most teams don't know their AI API costs until the invoice arrives.
You ship a feature. It works. Nobody thinks too hard about what it costs per request. Then three months later, someone opens the AWS or OpenAI billing dashboard and the number is... larger than expected. You start digging: which features are expensive? Which team? Which model? The provider dashboard shows a monthly total. It doesn't tell you that your support chat is cheap, your document summarizer is expensive, and one engineer left a test loop running for two weeks.
This is normal. It's also completely avoidable.
Why provider dashboards aren't enough
OpenAI, Anthropic, and Google all provide usage dashboards. They're fine for seeing monthly totals and model breakdowns. But they don't answer the questions that actually matter for a production system:
- Which feature in your app is driving spend?
- Which team or customer is responsible for the most usage?
- Did a code change last Tuesday spike your token count?
- Is your staging environment leaking into production?
Provider dashboards show you what you spent. They don't show you why. For that, you need attribution — tagging every request with context so you can slice the data by feature, team, user, or environment.
Without attribution, cost reduction is guesswork. You can switch to a cheaper model and hope it helps, but you can't prioritize. You don't know which 20% of your requests are generating 80% of your spend.
The main drivers of high AI API costs
Before jumping to solutions, it's worth understanding the common patterns that push costs up.
Oversized models for simple tasks. GPT-4o costs $2.50 per million input tokens. GPT-4o-mini costs $0.15. That's a 16x difference. If you're using GPT-4o to classify support tickets into five categories or extract a date from a sentence, you're spending 16x more than you need to. Not every task needs frontier model reasoning.
Verbose system prompts. System prompts run on every request. A 2,000-token system prompt on a feature that handles 100,000 requests per month adds 200 million tokens of input cost before a user has typed a single word. Teams often accumulate context in system prompts over time — old examples, caveats, edge case handling — without realizing each addition multiplies across every call.
No caching. Many AI workloads involve repeated or near-identical prompts. FAQ answering, document summarization of the same documents, code explanation for a shared codebase. If the same prompt hits the API twice, you pay twice. Semantic caching — matching new queries against recent responses — can cut costs dramatically for read-heavy workloads.
Retries without backoff. A transient API error triggers a retry. The retry fails and triggers another. Exponential backoff is table stakes, but it's often not implemented correctly — especially in async workers and queue processors where error handling gets complicated. Each failed retry is a paid API call.
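A minimal backoff wrapper makes the pattern concrete. This is a sketch under stated assumptions, not a drop-in implementation: the attempt count, base delay, and jitter factor are illustrative, and a real version should retry only on retryable errors (rate limits, timeouts), not on validation failures.

```typescript
// Sketch: exponential backoff with jitter. For brevity this retries on
// any error; production code should inspect the error first.
async function withBackoff<T>(
  call: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts - 1) break; // out of attempts; don't sleep
      // Double the wait each time; jitter spreads out synchronized retries.
      const delayMs = baseDelayMs * 2 ** attempt * (0.5 + Math.random());
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```

The key property: each failed attempt waits longer than the last, so a transient outage produces a handful of paid calls instead of a flood.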
Long conversation histories. Chat applications typically append the full conversation history to every request so the model has context. A conversation that's 50 messages deep means every new message carries 50 messages of input tokens. Without a truncation or summarization strategy, each message's input cost grows linearly with conversation length — which means total conversation cost grows quadratically — and users who chat the most become your most expensive users.
Model costs: what you're actually paying
Here's a snapshot of pricing at the time of writing for the models most teams are running:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude 3.7 Sonnet | $3.00 | $15.00 |
| Claude 3.5 Haiku | $0.80 | $4.00 |
| Gemini 2.0 Flash | $0.10 | $0.40 |
| Gemini 2.5 Pro | $1.25 | $10.00 |
Output tokens are typically 4-5x more expensive than input, and the gap can be wider (8x for Gemini 2.5 Pro in the table above). That matters for tasks that generate long responses. A feature that asks for a 1,000-word analysis costs significantly more than one that asks for a yes/no classification — even on the same model.
The 10x-20x spread between budget and frontier models is the biggest lever most teams haven't pulled yet.
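To make that spread concrete, here is a small cost helper with two sets of prices hardcoded from the table above; check the provider's pricing page before relying on these numbers.

```typescript
// Prices per 1M tokens, copied from the table above. These change;
// treat this map as a snapshot, not a source of truth.
const PRICES: Record<string, { input: number; output: number }> = {
  "gpt-4o": { input: 2.5, output: 10.0 },
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
};

function requestCostUSD(
  model: string,
  inputTokens: number,
  outputTokens: number,
): number {
  const p = PRICES[model];
  if (!p) throw new Error(`unknown model: ${model}`);
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}
```

For a request with 2,000 input and 500 output tokens, this works out to $0.01 on GPT-4o versus $0.0006 on GPT-4o-mini.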
Practical cost reduction strategies
1. Route by task complexity
Not every request needs the same model. A classification task, an entity extraction, a format conversion — these don't need frontier reasoning. Reserve expensive models for genuinely complex tasks: multi-step reasoning, synthesis across many sources, nuanced generation.
A simple routing layer looks at task type and picks the model:
```typescript
// Cheap, well-defined tasks go to the budget model; everything else
// goes to the frontier model.
function selectModel(taskType: string): string {
  const cheapTasks = ["classify", "extract", "format", "summarize-short"];
  return cheapTasks.includes(taskType) ? "gpt-4o-mini" : "gpt-4o";
}
```
More sophisticated routing can look at prompt complexity, user tier, or confidence thresholds. If GPT-4o-mini returns a low-confidence classification, escalate to GPT-4o. You pay the premium only when you need it.
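That escalation path can be sketched as a wrapper that accepts the cheap model's answer only when it is confident. The `classifyCheap` and `classifyFrontier` callbacks and the 0.8 threshold are hypothetical; in practice you would derive confidence from log probabilities or a structured output field, and tune the threshold against your own data.

```typescript
// Hypothetical shape for a classification result with a confidence score.
interface Classification {
  label: string;
  confidence: number;
}

// Try the cheap model first; pay for the frontier model only when the
// cheap answer falls below the confidence threshold.
async function classifyWithEscalation(
  text: string,
  classifyCheap: (t: string) => Promise<Classification>,
  classifyFrontier: (t: string) => Promise<Classification>,
  threshold = 0.8,
): Promise<Classification> {
  const cheap = await classifyCheap(text);
  if (cheap.confidence >= threshold) return cheap;
  return classifyFrontier(text);
}
```

The design choice here is that the expensive call is the exception path: on a workload where the cheap model is confident 90% of the time, you pay frontier prices for only one request in ten.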
2. Trim system prompts
Audit your system prompts. Count the tokens. Look for:
- Old examples that accumulate over time
- Negative instructions ("don't do X, don't do Y") that can often be collapsed
- Boilerplate context that could be moved into the user turn or a retrieval step
- Duplicate instructions that got added at different points
Cutting a 1,500-token system prompt to 500 tokens shrinks that fixed per-request overhead by 3x, saving 1,000 input tokens on every call before the user's message is even counted. At scale, that's significant.
3. Cache repeated queries
If your application answers the same or similar questions repeatedly, cache at the semantic level. A user asking "how do I reset my password?" and another asking "what's the process for changing my password?" should hit the same cached response.
Build a lookup layer that embeds incoming queries, checks for similarity against recent responses (cosine similarity above a threshold), and returns the cached answer. Cache hits cost only the embedding call — a fraction of a full completion.
For exact repetitions (same prompt, same parameters), a simple key-value cache on the prompt hash is enough.
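One way to sketch the lookup layer: a linear scan over recent entries with a cosine-similarity threshold. The 0.92 threshold is illustrative and needs tuning; a production version would use a vector index and an expiry policy rather than a flat array, and `embed` calls (not shown) would produce the vectors.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface CacheEntry {
  embedding: number[];
  response: string;
}

class SemanticCache {
  private entries: CacheEntry[] = [];
  constructor(private threshold = 0.92) {}

  // Return the cached response for the closest stored query at or above
  // the similarity threshold, or null on a cache miss.
  lookup(queryEmbedding: number[]): string | null {
    let best: CacheEntry | null = null;
    let bestScore = this.threshold;
    for (const entry of this.entries) {
      const score = cosineSimilarity(queryEmbedding, entry.embedding);
      if (score >= bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best ? best.response : null;
  }

  store(queryEmbedding: number[], response: string): void {
    this.entries.push({ embedding: queryEmbedding, response });
  }
}
```

On a hit, the only cost is the embedding call for the incoming query; the completion is free.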
4. Truncate conversation history
Instead of sending the full conversation history, implement a rolling window or a summarization step. Options:
- Fixed window: keep only the last N messages (simple, effective for most chat UIs)
- Summarization: when history exceeds a token budget, ask the model to summarize the conversation so far, then continue with the summary as context
- Retrieval: embed all prior messages and retrieve only the most relevant ones for each new turn
The right approach depends on whether your users have long conversations and how much context continuity matters for your use case.
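The fixed-window option is the simplest to sketch. The message shape and the `keepLast` default are assumptions; a token-budget window (drop oldest messages until the history fits a budget) is a natural refinement of the same idea.

```typescript
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

// Keep the system prompt(s) plus only the last N conversational messages.
function truncateHistory(messages: Message[], keepLast = 10): Message[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  return [...system, ...rest.slice(-keepLast)];
}
```

This caps per-request input at a constant size, turning the quadratic cost curve of full-history chat back into a linear one.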
5. Set token limits and alerts
Most API clients let you set max_tokens on the response. Use it. If you're building a feature that returns a one-sentence answer, cap the output at 100 tokens. Models will pad responses if you don't constrain them.
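One way to keep those caps consistent across a codebase is a small helper that attaches a per-feature `max_tokens` to the request parameters. The feature names, limits, and fallback here are all hypothetical.

```typescript
// Hypothetical per-feature output caps; tune these to each feature's
// actual response shape.
const OUTPUT_CAPS: Record<string, number> = {
  "short-answer": 100,
  "summary": 400,
};

// Build request parameters with a capped max_tokens. The fallback of 256
// keeps any unlisted feature from generating unbounded output.
function requestParams(feature: string, messages: unknown[]) {
  return {
    model: "gpt-4o-mini",
    messages,
    max_tokens: OUTPUT_CAPS[feature] ?? 256,
  };
}
```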
Set up cost alerts before you need them. OpenAI and Anthropic both support billing alerts, but they trigger on total spend — not per-feature or per-team. For fine-grained alerting, you need attribution at the request level.
Cost allocation: knowing who's spending what
Reducing costs is easier when you know where they're coming from.
The pattern that works: add a metadata object to every API request with the context you care about — feature name, team, user ID, environment. Then aggregate on that metadata to produce the breakdowns you need.
```typescript
// Every AI call tagged with context via OpenAI's `user` field
const response = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages,
  user: `team:${teamId}|feature:support-chat|env:production`,
});
```
This is crude but workable. The limitation: you're stuck with whatever OpenAI or Anthropic exposes in their usage API. You can't easily cross-reference against your own cost data, set per-team budgets, or get alerts when a specific feature spikes.
For that level of control, you need a layer between your application and the providers.
How Grepture helps
Grepture is an AI gateway that sits between your application and your LLM providers. Every request flows through it — or is traced by the SDK — so cost data is captured with full context at the point of the call, not reconstructed from provider invoices after the fact.
Per-request cost attribution. Attach metadata to any request — feature, team, user, environment — and the dashboard aggregates it automatically. You can filter spend by any dimension: "show me this week's cost for the support-chat feature, production only, broken down by model."
Token usage breakdown. Every request in the traffic log shows input tokens, output tokens, and cost. You can see which prompts are token-heavy, which responses are long, and whether a recent code change moved the numbers.
Team-level cost tracking. Create separate API keys per team or per environment. Costs are attributed to the key automatically. No manual tagging needed for team-level breakdowns.
Cost trends over time. The dashboard shows daily spend curves, model mix, and provider distribution. You can spot when costs started rising and correlate it with deploys.
If you don't want to route traffic through a proxy — for latency-sensitive workloads, or while you're evaluating — trace mode gives you the same cost visibility with zero proxy overhead. The SDK captures usage data asynchronously after each response and sends it to Grepture in the background. Your requests go directly to the provider; traces arrive in the dashboard seconds later.
For a full walkthrough of setting up cost tracking with request attribution, see the track AI API costs guide. For the broader observability setup — logging, search, and conversation tracing — see how to monitor and log LLM API calls.