How to Reduce LLM API Costs: Caching and Routing in 2026
In short
I cut the LLM bill on my reputation SaaS from $41.60 to $8.90 per 1,000 AI-generated review replies, a 79% reduction, without any measurable drop in reply quality. The four levers, in order of effort: model routing, prompt caching, semantic caching, and batching with output caps. The pricing landscape moved again this spring: Claude Opus 4.8 fast mode launched May 28, 2026 at $10/$50 per million tokens, and GitHub Copilot switched to flex billing on June 1, 2026, so the routing math in older guides is stale. This post walks through my actual invoice math, which lever saved what, and the routing code I run in production.

Where does the money actually go in an LLM bill?
In my April 2026 invoice, input tokens were roughly 80% of the cost, output tokens about 14%, and retries the rest. That surprises people. Everyone fixates on expensive output tokens, but a prompt-heavy workload pays mostly for context it resends on every single call, which is exactly what makes caching the highest-leverage fix.
The concrete numbers from that month: 11,400 review replies generated, $474 invoiced, so $41.60 per 1,000 replies. Each call looked like this:
- ~3,200 input tokens: a system prompt, the tenant's brand voice profile, a few-shot example bank, and finally the actual review text
- ~130 output tokens: replies are capped at 3 to 4 lines, a product decision I made long before it was a cost decision
- ~8% retry overhead: timeouts, malformed outputs, and the occasional regenerate-on-demand from a tenant
At Opus 4.8 fast mode rates that is $0.032 of input against $0.0065 of output per reply. The bill is an input bill. This matters even more if your workload includes retrieval, because agentic retrieval loops multiply input tokens with every hop, something I dug into when I wrote about whether RAG is dead in 2026. And note this whole post is about the running bill, not the build. I covered build budgets separately in how much a RAG chatbot costs to build.
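The per-reply arithmetic above can be sketched in a few lines. This is an illustration, not my billing code: the names `CallProfile` and `costPerReply` are made up for the example, and retries are modeled as a flat surcharge on the full call cost, which is a simplifying assumption.

```typescript
// Per-million-token rates quoted in this post for Opus 4.8 fast mode.
const INPUT_USD_PER_MTOK = 10;
const OUTPUT_USD_PER_MTOK = 50;

interface CallProfile {
  inputTokens: number;   // tokens sent per call
  outputTokens: number;  // tokens generated per call
  retryOverhead: number; // extra calls as a fraction, e.g. 0.08 for 8%
}

interface CostBreakdown {
  inputUsd: number;
  outputUsd: number;
  retryUsd: number;
  totalUsd: number;
}

// Cost of one delivered reply. Assumption: a retry costs the same as a
// normal call, so retries are a flat surcharge on input plus output.
function costPerReply(p: CallProfile): CostBreakdown {
  const inputUsd = (p.inputTokens / 1_000_000) * INPUT_USD_PER_MTOK;
  const outputUsd = (p.outputTokens / 1_000_000) * OUTPUT_USD_PER_MTOK;
  const retryUsd = (inputUsd + outputUsd) * p.retryOverhead;
  return { inputUsd, outputUsd, retryUsd, totalUsd: inputUsd + outputUsd + retryUsd };
}

const april = costPerReply({ inputTokens: 3200, outputTokens: 130, retryOverhead: 0.08 });
console.log((april.totalUsd * 1000).toFixed(2)); // USD per 1,000 replies
```

With those inputs the total lands within a few cents of the $41.60 invoice figure. The flat-surcharge retry model puts input at a little over three quarters of the total, slightly under the invoice's rough 80%, so treat the split as approximate.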
Lever 1: How does model routing cut LLM costs?
Model routing sends easy requests to a cheap fast tier and hard ones to a frontier model. In my SaaS, 71% of review replies route to a small model billing at roughly a tenth of frontier rates, which dropped my cost from $41.60 to $17.10 per 1,000 replies, a 59% cut and the largest single lever.
The June 2026 price context makes the gap concrete: Claude Opus 4.8 fast mode costs $10/$50 per million tokens and launched May 28, 2026. A small fast tier runs near a tenth of that. When one tier is 10x cheaper, you do not need a clever router. You need an honest answer to "which requests actually need the big model?"
For review replies the answer is boring heuristics, not an LLM classifier:
```typescript
import Anthropic from "@anthropic-ai/sdk";

type Tier = "small" | "frontier";

interface Review {
  rating: number; // 1-5
  text: string;
  flagged: boolean; // refund demands, legal language, health and safety
}

const HARD_PATTERNS = /refund|lawyer|legal|health|injur|allerg|discriminat|scam|fraud/i;

export function pickTier(review: Review): Tier {
  if (review.rating <= 2) return "frontier"; // angry customers get the good model
  if (review.flagged || HARD_PATTERNS.test(review.text)) return "frontier";
  if (review.text.length > 600) return "frontier"; // long reviews carry nuance
  return "small"; // short, unflagged 3-5 star reviews: cheap tier
}

const MODELS: Record<Tier, { id: string; maxTokens: number }> = {
  small: { id: process.env.SMALL_MODEL!, maxTokens: 160 },
  frontier: { id: process.env.FRONTIER_MODEL!, maxTokens: 220 },
};

// Assumed helper, not shown in the original snippet: formats the volatile
// part of the prompt that goes into the user message.
function renderReview(review: Review): string {
  return `Rating: ${review.rating}/5\nReview: ${review.text}`;
}

export async function draftReply(client: Anthropic, review: Review, tenantPrompt: string) {
  const { id, maxTokens } = MODELS[pickTier(review)];
  return client.messages.create({
    model: id,
    max_tokens: maxTokens,
    system: [
      // Stable prefix first: warm cache reads bill at ~10% of the input rate
      { type: "text", text: tenantPrompt, cache_control: { type: "ephemeral" } },
    ],
    messages: [{ role: "user", content: renderReview(review) }],
  });
}
```
Two honest costs of routing. First, I spent about two weeks building an eval set of 400 graded replies before I trusted the split. Second, roughly 6% of small-tier drafts fail my quality check (too generic, wrong business detail) and re-run on the frontier tier, an escalation tax baked into the $17.10 figure.
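The escalation step described above could look roughly like this. It is a minimal sketch under assumptions: `generate` and `passesQuality` are hypothetical callbacks standing in for the model call and for whatever quality check you trust, and the function escalates at most once.

```typescript
type Tier = "small" | "frontier";

// Hypothetical stand-ins: `generate` wraps the model call for a tier,
// `passesQuality` wraps the quality check applied to a draft.
type Generate = (tier: Tier) => Promise<string>;
type QualityCheck = (text: string) => boolean;

interface DraftResult {
  tier: Tier;         // tier that produced the returned text
  text: string;
  escalated: boolean; // true when a small-tier draft was re-run
}

// Try the routed tier first; if a small-tier draft fails the check,
// re-run once on the frontier tier and return that draft instead.
async function draftWithEscalation(
  routed: Tier,
  generate: Generate,
  passesQuality: QualityCheck,
): Promise<DraftResult> {
  const first = await generate(routed);
  if (routed === "frontier" || passesQuality(first)) {
    return { tier: routed, text: first, escalated: false };
  }
  const retry = await generate("frontier");
  return { tier: "frontier", text: retry, escalated: true };
}
```

Counting how often `escalated` comes back true is one way to keep an eye on the escalation tax: an escalated reply pays for both the small-tier draft and the frontier re-run.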

Lever 2: How do you structure prompts for prompt caching?
Put stable content first, volatile content last, and set a cache breakpoint at the boundary. Cache reads bill at roughly a tenth of the normal input rate, and restructuring my prompts this way took my cost from $17.10 to $12.80 per 1,000 replies, with time to first token dropping from about 1.7s to 0.9s on warm requests.
The steps I followed, in order:
- Classify every piece of the prompt by stability. System prompt and few-shot bank: never change. Tenant brand voice profile: changes monthly. Review text: changes every call.
- Reorder so stability strictly decreases. Anything volatile that sits early in the prompt invalidates everything after it, because caching is a byte-exact prefix match.
- Place cache_control on the last stable block, which in my case caches about 2,700 of 3,200 input tokens.
- Verify with usage.cache_read_input_tokens in the API response, not with vibes.
- Hunt silent invalidators. My first deploy showed zero cache reads for a full day. The culprit: I interpolated the current date into the system prompt header, so every request had a unique prefix. Moving the date into the user message fixed it instantly.
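The steps above can be sketched as two small pure functions. The block and usage shapes follow the Anthropic Messages API fields mentioned in this post, but they are declared locally so the example has no dependencies. The names `PromptParts`, `buildPrompt`, and `cacheReadShare` are illustrative, and splitting the prefix into three blocks is an assumption rather than a description of my production prompt.

```typescript
// Local copy of the text-block shape used in the `system` parameter.
interface TextBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

interface PromptParts {
  systemPrompt: string; // never changes
  fewShotBank: string;  // never changes
  brandVoice: string;   // changes monthly
  reviewText: string;   // changes every call
  today: string;        // volatile, must stay out of the cached prefix
}

// Order blocks from most to least stable and put the single cache
// breakpoint on the last stable block. Volatile values go in the user turn.
function buildPrompt(p: PromptParts) {
  const system: TextBlock[] = [
    { type: "text", text: p.systemPrompt },
    { type: "text", text: p.fewShotBank },
    { type: "text", text: p.brandVoice, cache_control: { type: "ephemeral" } },
  ];
  const user = `Date: ${p.today}\n\nReview:\n${p.reviewText}`;
  return { system, messages: [{ role: "user" as const, content: user }] };
}

// Usage fields as reported on a Messages API response.
interface Usage {
  input_tokens: number; // uncached input tokens
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

// Fraction of all input tokens on one response that were read from cache.
function cacheReadShare(u: Usage): number {
  const read = u.cache_read_input_tokens ?? 0;
  const written = u.cache_creation_input_tokens ?? 0;
  const total = u.input_tokens + read + written;
  return total === 0 ? 0 : read / total;
}
```

A cheap regression test for silent invalidators is to build two prompts that differ only in volatile fields and assert that the serialized `system` arrays are identical. If `cacheReadShare` stays at zero across a burst of requests for one tenant, something in the prefix is changing.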
The economics: writes cost about 1.25x normal input, reads about 0.1x, so the cache pays for itself by the second request on the same prefix. My warm-hit rate sits around 75% because each tenant's review sync run fires dozens of requests inside the cache TTL window.
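To make the break-even claim concrete, here is a small worked calculation using the 1.25x write and 0.1x read multipliers quoted above. The function names are illustrative. The model assumes every request after the first is a warm hit, and that every miss is billed as a fresh cache write.

```typescript
// Multipliers relative to the normal input rate, as quoted in this post.
const WRITE_MULT = 1.25;
const READ_MULT = 0.1;

// Cost of sending the same prefix n times with caching, in units of
// "one uncached send". Assumes requests 2..n all hit a warm cache.
function cachedPrefixCost(n: number): number {
  if (n <= 0) return 0;
  return WRITE_MULT + READ_MULT * (n - 1);
}

// Average multiplier on the cached prefix at a given warm-hit rate,
// treating every miss as a fresh cache write.
function effectivePrefixMultiplier(hitRate: number): number {
  return hitRate * READ_MULT + (1 - hitRate) * WRITE_MULT;
}

for (const n of [1, 2, 10]) {
  console.log(`${n} sends: cached ${cachedPrefixCost(n).toFixed(2)} vs uncached ${n.toFixed(2)}`);
}
```

One send costs 1.25 against 1.00 uncached, two sends cost 1.35 against 2.00, which is where the "pays for itself by the second request" rule comes from. At a 75% warm-hit rate the second function gives about 0.39, meaning the cached part of the prompt bills at roughly 39% of the normal input rate on average.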
Lever 3: What is semantic caching and what hit rate is realistic?
Semantic caching is the practice of storing past model responses keyed by an embedding of the request, so a new request that is similar enough in meaning is served from the cache without calling the model at all. Published 2026 benchmarks show semantic caching hits 60-85% in support-style workloads with a ~97% latency cut on cache hits. My production number is lower and worth being honest about.
Five-star reviews of a restaurant are semantically near-identical, which makes review replies look like ideal cache material. But my replies must name the reviewer and reference one concrete detail from their review, so I can only reuse within the same tenant, rating band, and topic cluster, behind a 0.90 cosine similarity threshold. The result: 22% of all replies serve from cache (42% within the eligible pool of short 4-5 star reviews). On hits, response time is about 70ms instead of 2.4s, which matches the ~97% latency figure almost exactly. The lever saved another $2.40, landing me at $10.40 per 1,000 replies.
The implementation is just pgvector plus the same embedding pipeline I already run for retrieval in my RAG development work, so the marginal infrastructure cost was an afternoon and about $0.25 per 1,000 replies in embedding calls.
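The lookup logic, independent of storage, can be sketched like this. It is an in-memory stand-in for the pgvector query, not the production code: `CacheEntry`, `lookup`, and the rating-band labels are assumptions made for the example. What it does show is the scoping rule (same tenant, rating band, and topic cluster) applied before the 0.90 cosine threshold.

```typescript
interface CacheEntry {
  tenantId: string;
  ratingBand: "low" | "mid" | "high"; // hypothetical band labels
  topicCluster: string;
  embedding: number[];
  reply: string;
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Return the most similar entry within the same scope, or null when
// nothing in scope reaches the similarity threshold.
function lookup(
  entries: CacheEntry[],
  query: Omit<CacheEntry, "reply">,
  threshold = 0.9,
): CacheEntry | null {
  let best: CacheEntry | null = null;
  let bestScore = threshold;
  for (const e of entries) {
    const inScope =
      e.tenantId === query.tenantId &&
      e.ratingBand === query.ratingBand &&
      e.topicCluster === query.topicCluster;
    if (!inScope) continue;
    const score = cosineSimilarity(e.embedding, query.embedding);
    if (score >= bestScore) {
      best = e;
      bestScore = score;
    }
  }
  return best;
}
```

Filtering by scope before scoring is the design choice that matters: a near-identical review from another tenant must never produce a hit, however high its similarity.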
Lever 4: What do batching and output caps add?
Batch APIs price non-urgent requests at a 50% discount, and output caps stop runaway generations. Together they took me from $10.40 to $8.90 per 1,000 replies. Small in percentage terms, but they required two config changes and zero new code paths, the best effort-to-savings ratio of the four levers.
About 30% of my reply volume is tenants who auto-publish on a schedule rather than reviewing drafts live, and that traffic now runs through a nightly batch job at half price. For caps, my replies were already constrained to 3 to 4 lines by the prompt, so tightening max_tokens from 220 to 160 on the small tier saved little directly. What it actually killed was the long tail of malformed retries that used to burn full-length generations before failing validation.
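If the live-versus-batch split were done in application code, it might look like the sketch below. The entry shape (`custom_id` plus `params`) follows the Anthropic Message Batches API, while `PendingReply`, `splitForBatch`, and the `tenantAutoPublishes` flag are hypothetical names for this example. The model id and token cap are passed in rather than hard-coded.

```typescript
interface PendingReply {
  reviewId: string;
  tenantAutoPublishes: boolean; // true: nobody is waiting on the draft
  prompt: string;
}

// One entry of a batch request: a caller-chosen id plus the same params
// a normal messages call would take.
interface BatchEntry {
  custom_id: string;
  params: {
    model: string;
    max_tokens: number;
    messages: { role: "user"; content: string }[];
  };
}

// Keep interactive traffic on the live path and queue everything that
// can wait for the nightly discounted batch run.
function splitForBatch(pending: PendingReply[], model: string, maxTokens: number) {
  const live: PendingReply[] = [];
  const batch: BatchEntry[] = [];
  for (const p of pending) {
    if (!p.tenantAutoPublishes) {
      live.push(p);
      continue;
    }
    batch.push({
      custom_id: p.reviewId,
      params: {
        model,
        max_tokens: maxTokens, // the output cap applies to batch traffic too
        messages: [{ role: "user", content: p.prompt }],
      },
    });
  }
  return { live, batch };
}
```

Using the review id as `custom_id` makes it straightforward to match batch results back to the reviews they answer when the job completes.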
What did I deliberately not do, and why?
I skipped fine-tuning and self-hosting, and I would skip them again at my scale. Route by task difficulty, cache by prompt structure, and cap output tokens before you ever consider fine-tuning. Those three are reversible, model-agnostic, and require no training data pipeline.
Fine-tuning fails my math twice. The few-shot example bank in my cached prefix already buys most of the style consistency a fine-tune would, and at 0.1x cache-read pricing those examples are nearly free. Worse, a fine-tune welds you to one model snapshot, and 2026 has shipped a price-shifting release every few weeks. Self-hosting fails harder: a rented GPU for an open-weights model starts around a flat monthly cost that exceeds my entire optimized bill, before counting the ops time I would rather spend on product.
The deeper point is that none of this required re-architecting the app. Every lever bolted onto the existing pipeline, the same philosophy I described in adding AI features to an existing SaaS without a rewrite, and the same checklist I now run early in my AI solutions work for clients, because retrofitting cost discipline later is always more expensive.
What was the combined result?
My cost per 1,000 AI replies fell from $41.60 to $8.90, a 79% reduction, which lands near the top of the 47-80% LLM spend reduction that 2026 playbooks document from routing, caching, and batching combined. My workload is unusually cache-friendly, so treat my number as a ceiling, not a baseline.
| Stage (applied cumulatively) | Cost per 1,000 replies | Reduction vs baseline |
| --- | --- | --- |
| Baseline: everything on Opus 4.8 fast mode | $41.60 | 0% |
| + Model routing (71% to small tier) | $17.10 | 59% |
| + Prompt caching (2,700-token stable prefix) | $12.80 | 69% |
| + Semantic caching (22% full-reply hits) | $10.40 | 75% |
| + Batching and output caps | $8.90 | 79% |
In absolute terms: routing saved $24.50 per 1,000, prompt caching $4.30, semantic caching $2.40, batching and caps $1.50. The monthly invoice went from $474 to about $108 at slightly higher volume.
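The table's percentages and the per-lever savings can be re-derived from the stage costs alone. This is a short check, with `STAGES` and `summarize` as illustrative names. Costs are held in cents so the subtraction stays in integers.

```typescript
// Stage costs from the table, in cents per 1,000 replies.
const STAGES: [string, number][] = [
  ["Baseline", 4160],
  ["+ Model routing", 1710],
  ["+ Prompt caching", 1280],
  ["+ Semantic caching", 1040],
  ["+ Batching and output caps", 890],
];

// For each stage: cents saved versus the previous stage, and the
// cumulative reduction versus the baseline, rounded to a whole percent.
function summarize(stages: [string, number][]) {
  const baseline = stages[0][1];
  return stages.map(([name, cents], i) => ({
    name,
    savedCents: i === 0 ? 0 : stages[i - 1][1] - cents,
    reductionPct: Math.round(((baseline - cents) / baseline) * 100),
  }));
}

for (const row of summarize(STAGES)) {
  console.log(`${row.name}: saved $${(row.savedCents / 100).toFixed(2)}, ${row.reductionPct}% below baseline`);
}
```

The savings column is order-dependent by construction: each lever is credited only with what it removed from the bill the previous levers left behind.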
One caveat on attribution: the lever you apply first always looks like the hero. On a holdout slice where I applied prompt caching first, it alone cut 38% of the bill, with zero quality risk and one afternoon of work, while routing took two weeks of eval building and still carries its 6% escalation tax.
Routing gets the headlines, but caching pays the rent. Routing savings depend on a price gap between model tiers that providers can reprice overnight; cache hits are a structural discount you collect every month your prompts stay stable.
Cost optimization stopped being a year-end cleanup task and became a product discipline. Every new AI feature I ship now gets a cost-per-1,000-operations estimate next to its latency budget, in the spec, before any code.
Key takeaways
- My bill was 80% input tokens. Profile your invoice before optimizing; prompt-heavy workloads should attack input cost first.
- Routing was my biggest absolute saver ($24.50 per 1,000 replies) but cost two weeks of eval work and a 6% escalation tax.
- Prompt caching was the best ROI per hour: stable-first prompt structure plus one breakpoint cut 38% on its own in an afternoon.
- Semantic caching benchmarks at 60-85% hits in support workloads; my personalized replies managed 22%, so validate against your own constraints.
- Skip fine-tuning and self-hosting until routing, caching, and batching are exhausted; at small-SaaS scale they rarely pencil out.
FAQ
How much does prompt caching save?
Cache reads bill at roughly a tenth of the normal input rate, so on input-heavy prompts the ceiling is large: applied alone, it cut 38% of my total bill. Real savings depend on what share of your tokens sit in a stable prefix and how often requests land inside the cache TTL.
Is model routing worth it for a small SaaS?
Yes, once monthly LLM spend clears a few hundred dollars and your traffic has an obvious easy majority. Routing saved me $24.50 per 1,000 replies, but it needed an eval set and carries a 6% escalation tax. Below roughly $100 a month, do caching and output caps first.
Should I fine-tune a model to reduce costs?
Almost never as a first move. A cached few-shot example bank delivers most of the style consistency at 0.1x input pricing, stays portable across model releases, and ships in a day. Fine-tuning adds training data pipelines and re-tuning work every time providers ship a better base model, which in 2026 is constantly.
What semantic cache hit rate should I expect?
Published 2026 numbers show 60-85% in support-style workloads with a ~97% latency cut on hits. My review-reply system measured 22% overall and 42% within its eligible pool, because each reply needs personal details. Start with a 0.90 cosine threshold scoped per tenant, then tune against quality samples.
Malik Hamza Shabbir · Full-Stack & AI Engineer
I build full-stack and AI products solo: a reputation SaaS in production, RAG pipelines, and React Native apps. I write from what I ship, not from documentation summaries.