AI inference costs often look small in prototypes and then spike after launch. That is not because GPUs are “mysteriously expensive”; it is because production traffic exposes everything you did not model: long prompts, verbose outputs, low utilization, retries, and multi-call workflows.
This guide focuses on three foundational levers you can apply to almost any inference stack: caching, batching, and quantization. You will also learn a practical cost model, the metrics that matter, and a rollout playbook that reduces cost without breaking latency SLAs or quality.
Execution order that works
Most teams get the best results by optimizing in this order: measure → reduce tokens → increase reuse (caching) → increase utilization (batching) → reduce compute per token (quantization).
1. Why inference costs grow in production
Inference spend is rarely driven by one mistake. It is usually a cluster of “small” design decisions that compound:
- Context creep: system prompts accumulate policies, examples, tool instructions, and long chat history.
- Output creep: you ship a helpful assistant, then users push for detailed answers, and outputs get longer over time.
- Low utilization: expensive accelerators sit partially idle because requests are served one-by-one or batches are underfilled.
- Multi-call flows: RAG, tools, moderation, and retries turn one user action into multiple model calls.
- Tail latency: slow outliers trigger client retries, escalating cost and making traffic “spikier”.
- Reliability overhead: safety buffers, autoscaling headroom, and failover capacity cost money even when idle.
A typical hidden multiplier
A “simple” chat turn can become: (1) embeddings call, (2) retrieval, (3) generation call, (4) optional safety re-check, plus (5) retries on timeouts. Your unit cost is now 2–5× what the UI implies.
2. The simple cost model (tokens, prefill, decode)
To optimize costs, you need a mental model that is good enough to guide decisions. A practical decomposition is:
- Prefill cost: processing the input context (system prompt + user prompt + retrieved context).
- Decode cost: generating output tokens (token-by-token autoregressive generation).
- Memory cost: context length impacts KV cache size, which drives GPU memory pressure and OOM risk.
- Utilization: how well you keep the GPU busy (batching, concurrency, scheduling).
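To make the decomposition concrete, here is a minimal cost-model sketch that estimates per-request spend from token counts and per-token prices. The prices, token counts, and call multiplier below are illustrative placeholders, not benchmarks.
```python
# Minimal cost-model sketch: estimate per-request cost from token counts.
# All prices and token counts are illustrative placeholders.

PRICE_PER_1K_INPUT = 0.0005   # prefill (input) tokens, $/1K -- assumption
PRICE_PER_1K_OUTPUT = 0.0015  # decode (output) tokens, $/1K -- assumption

def request_cost(input_tokens: int, output_tokens: int, calls_per_action: float = 1.0) -> float:
    """Approximate cost of one user action, including multi-call flows."""
    prefill = input_tokens / 1000 * PRICE_PER_1K_INPUT
    decode = output_tokens / 1000 * PRICE_PER_1K_OUTPUT
    return (prefill + decode) * calls_per_action

# A long prompt with a verbose answer, multiplied by a RAG + retry pipeline:
print(round(request_cost(input_tokens=6000, output_tokens=800, calls_per_action=3.5), 6))
```
Even this crude model makes the levers visible: prefill cost scales with input tokens, decode cost with output tokens, and the whole thing with calls per user action.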
Two implications matter immediately:
- If your prompts are long, token reduction and KV caching tend to deliver fast wins.
- If your traffic is high-volume, batching tends to deliver fast wins by improving utilization.
Do not optimize blind
You cannot pick the right lever without knowing whether your cost is dominated by long contexts (prefill + memory), long outputs (decode), or low utilization (scheduling and batching).
3. Measure first: dashboards you actually need
Cost optimization fails when teams measure only one number (like “average latency”) and miss the drivers. Track these metrics at the endpoint level and slice them by model, route, and tenant (if multi-tenant):
3.1 Cost and usage
- Cost per successful request (exclude failed calls and retries; track separately).
- Input tokens (p50/p95) and output tokens (p50/p95).
- Tokens per user outcome (the best metric if you can define “outcome”).
- Calls per user action (especially for tool/RAG pipelines).
3.2 Performance and saturation
- Latency percentiles: p50, p90, p95, p99 (tail latency is where retries are born).
- Queue time: how long requests wait before inference starts (critical when batching is enabled).
- Throughput: tokens/sec and requests/sec under steady load.
- GPU utilization: compute and memory, plus “active time” vs “idle time”.
3.3 Reliability and quality
- Timeout rate, OOM rate, and provider error rate.
- Retry rate and duplicate request rate.
- Quality KPIs (task success, acceptance rate, escalation rate, CSAT).
A practical dashboard set
If you want a minimal set that still works: (1) token distributions, (2) p95/p99 latency with queue time, (3) cache hit rates, (4) batching stats (batch size distribution), (5) error/timeout/OOM rates, (6) quality KPI trend.
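As a sketch of how these numbers can fall out of ordinary request logs, the snippet below computes cost per successful request, retry rate, and token percentiles. The record fields are assumptions about your logging schema, not a standard.
```python
# Sketch: derive core dashboard numbers from per-request log records.
# The record fields (status, cost_usd, input_tokens, retried) are assumed names.

def pct(values, p):
    """Nearest-rank percentile; good enough for a dashboard."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))]

requests = [  # illustrative log records
    {"status": "ok", "cost_usd": 0.004, "input_tokens": 1200, "output_tokens": 350, "retried": False},
    {"status": "ok", "cost_usd": 0.010, "input_tokens": 5200, "output_tokens": 900, "retried": True},
    {"status": "timeout", "cost_usd": 0.006, "input_tokens": 5100, "output_tokens": 0, "retried": True},
]

successful = [r for r in requests if r["status"] == "ok"]
print("cost per successful request:", sum(r["cost_usd"] for r in successful) / len(successful))
print("spend on failed calls (track separately):",
      sum(r["cost_usd"] for r in requests if r["status"] != "ok"))
print("retry rate:", sum(r["retried"] for r in requests) / len(requests))
print("input tokens p50/p95:",
      pct([r["input_tokens"] for r in requests], 50),
      pct([r["input_tokens"] for r in requests], 95))
```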
4. Token reduction: the highest-leverage first step
Before caching, batching, or quantization, reduce the number of tokens you send and generate. Token reduction is the rare lever that cuts cost and often improves latency at the same time.
4.1 Shrink prompts without losing behavior
- Remove repeated policy text: centralize shared instructions into shorter, stable templates.
- Use structured constraints: prefer bullet rules over long paragraphs.
- Move examples out of the hot path: keep few-shot examples only when they materially improve quality.
- Summarize chat history: keep only relevant turns; summarize the rest.
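As one way to implement the history rule above, the sketch below keeps the most recent turns under a token budget and replaces older turns with a short summary placeholder. The token counter and summarizer are stand-ins for whatever your stack provides.
```python
# Sketch: keep recent chat turns within a token budget; summarize the rest.
# count_tokens() and summarize() are placeholders for your tokenizer/summarizer.

def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude approximation; swap in a real tokenizer

def summarize(turns: list[str]) -> str:
    return "Summary of earlier conversation: " + " | ".join(t[:40] for t in turns)

def trim_history(turns: list[str], budget_tokens: int = 1500) -> list[str]:
    kept, used = [], 0
    for turn in reversed(turns):            # walk from the newest turn backwards
        cost = count_tokens(turn)
        if used + cost > budget_tokens:
            break
        kept.append(turn)
        used += cost
    older = turns[: len(turns) - len(kept)]
    prefix = [summarize(older)] if older else []
    return prefix + list(reversed(kept))    # summary first, then recent turns in order
```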
4.2 Cap and shape outputs
- Set sensible max output tokens per endpoint and validate real user needs.
- Prefer structured outputs (JSON schemas) for workflows; reduce verbosity and rework.
- Stop early when done: terminate generation when the required structure is complete.
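A minimal way to treat output length as a product contract is to declare per-endpoint caps in one place and enforce them at request time. The endpoint names, limits, and stop sequence below are assumptions, not recommendations.
```python
# Sketch: per-endpoint output caps as an explicit product contract.
# Endpoint names, limits, and the stop sequence are illustrative.

OUTPUT_CAPS = {
    "help_text": 256,
    "chat": 700,
    "report_summary": 1200,
}
DEFAULT_CAP = 512

def decoding_params(endpoint: str, requested_max: int | None = None) -> dict:
    cap = OUTPUT_CAPS.get(endpoint, DEFAULT_CAP)
    # Callers may lower the cap, never raise it.
    max_tokens = min(requested_max, cap) if requested_max else cap
    return {"max_tokens": max_tokens, "stop": ["\n\n###"]}
```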
Token limits are a product decision
“Unlimited output” is rarely a feature users need. It is usually a default you forgot to set. Output length is one of the most predictable cost drivers—treat it as a product contract.
5. Caching: prompt, semantic, and KV cache (deep dive)
Caching is the most “economic” lever: it reduces cost by not doing work. But caching is not one thing. Treat it as a portfolio of techniques, each with specific safety constraints.
5.1 Choose the right caching layer
- Edge/app cache: avoids model calls entirely (best savings) but must be safe and consistent.
- Embedding/retrieval cache: reduces RAG overhead and stabilizes latency.
- Inference engine cache (KV/prefix): reduces compute for long contexts and multi-turn sessions.
5.2 Exact prompt/result caching (high ROI, low complexity)
Exact caching returns a stored response for a request that matches a previous request. It is best for: FAQs, templates, standard summaries, internal tooling prompts, and “help text” endpoints.
- Key design: build cache keys from the normalized prompt + model ID + system prompt version + tool/routing config.
- TTL strategy: short TTL for time-sensitive content; longer TTL for stable answers.
- Safety rule: do not cache personalized or user-specific outputs unless the key is user-scoped.
Normalization that improves hit rate
Normalize whitespace, trim boilerplate, and canonicalize parameter ordering (e.g., JSON keys) before hashing. High cache miss rates often come from superficial prompt differences.
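A sketch of key construction that follows these rules: normalize the prompt, include every version that should invalidate the entry, scope the key, and hash the result. The version and config fields are assumptions about what your deployment tracks.
```python
# Sketch: normalized, versioned, scoped cache key for exact prompt/result caching.
import hashlib
import json

def cache_key(prompt: str, *, model_id: str, system_prompt_version: str,
              tool_config: dict, scope: str = "public") -> str:
    normalized = " ".join(prompt.split()).strip()   # collapse whitespace; lowercasing is optional
    payload = json.dumps(
        {
            "prompt": normalized,
            "model": model_id,
            "system_version": system_prompt_version,
            "tools": tool_config,        # sort_keys canonicalizes key order
            "scope": scope,              # public / tenant:<id> / user:<id>
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```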
5.3 “Safe caching” rules most teams need
- Scope caches: public cache, tenant cache, user cache (do not mix).
- Do not cache secrets: if prompts can contain PII or secrets, cache only in encrypted stores with strict retention.
- Version everything: invalidate when you change system prompts, tools, retrieval, or policies.
- Keep observability separate: log cache metadata, not content (unless you have an explicit secure debug mode).
5.4 Semantic caching (higher hit rate, higher risk)
Semantic caching reuses a previous answer when a new query is “close enough” to a cached query, typically using embeddings. It can be powerful for natural-language search-like queries, but it can fail in subtle ways.
If you want semantic caching to be reliable, you need guardrails:
- Similarity threshold: enforce a strict cutoff; tune using real traffic, not toy examples.
- Domain boundaries: do not use semantic caching for legal/medical/financial advice or any high-stakes output.
- Personalization boundaries: do not reuse answers across users unless the content is truly generic.
- Fallback policy: if similarity is borderline, reuse only retrieved facts/snippets and regenerate the final answer.
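A minimal sketch of these guardrails in code: a lookup that only reuses an answer when the scope matches and similarity clears a strict threshold, and otherwise returns nothing so the caller regenerates. The embedding model and entry layout are placeholders.
```python
# Sketch: guarded semantic cache lookup. Vectors come from a placeholder embedder.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm or 1e-12)

def semantic_lookup(query_vec, entries, *, scope: str, threshold: float = 0.95):
    """entries: dicts with 'vector', 'answer', 'scope'. Returns an answer or None."""
    best, best_sim = None, 0.0
    for e in entries:
        if e["scope"] != scope:          # never cross tenant/user boundaries
            continue
        sim = cosine(query_vec, e["vector"])
        if sim > best_sim:
            best, best_sim = e, sim
    if best is not None and best_sim >= threshold:
        return best["answer"]
    return None                          # borderline or miss: regenerate instead of reusing
```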
Semantic caching failure mode
Users ask two similar questions with one critical difference. A reused answer looks confident but is wrong. If you cannot tolerate “wrong but plausible,” keep semantic caching off for that endpoint.
5.5 KV cache and prefix caching (compute savings for long contexts)
KV caching stores the attention key/value states of tokens the model has already processed, so it does not recompute them for every new token it generates. This matters most when:
- Prompts are long (large system instructions, large retrieved contexts, long chat histories).
- Users have multi-turn sessions (the same prefix persists across turns).
- You stream output and decode dominates runtime.
A closely related concept is prefix caching (sometimes called “prompt prefix cache”): if many requests share the same long system prompt or template prefix, the engine can reuse the cached prefix states across requests (implementation depends on your serving stack).
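Prefix caching only helps if requests actually share a byte-identical prefix, so prompt assembly should put stable content first and per-request content last. A sketch, assuming your serving stack reuses cached prefix states when they match exactly:
```python
# Sketch: order prompt parts so the stable prefix is shared across requests.
# Whether cached prefix states are actually reused depends on your serving stack.

SYSTEM_PROMPT_V3 = "You are a support assistant. Follow policy P-12. Answer concisely."

def build_prompt(retrieved_context: str, user_message: str) -> str:
    # Stable, shared prefix first (cache-friendly); request-specific content last.
    return (
        f"{SYSTEM_PROMPT_V3}\n\n"
        f"Context:\n{retrieved_context}\n\n"
        f"User: {user_message}\nAssistant:"
    )
```
Only the shared system prompt benefits from cross-request prefix reuse here; the retrieved context and user message still pay full prefill cost.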
5.6 Caching KPIs you should track
- Cache hit rate per cache type (exact, semantic, retrieval, KV/prefix).
- Hit rate by endpoint (it is normal that some endpoints do not benefit).
- Latency saved per hit (some hits save more than others).
- Eviction rate and memory pressure (especially for KV caches).
- Quality regressions on hits (semantic cache must be quality monitored).
6. Batching: dynamic and continuous batching (deep dive)
Batching reduces cost by increasing utilization: doing more useful work per unit of GPU time. But batching interacts strongly with latency and memory. Treat it as a controlled scheduling problem.
6.1 Static vs dynamic vs continuous batching
- Static batching: you explicitly group requests (common in offline jobs, not great for interactive products).
- Dynamic batching: the server batches requests arriving within a short time window.
- Continuous batching: the scheduler continuously merges and advances requests so the GPU stays busy, even while requests arrive over time.
If you run an interactive LLM endpoint, dynamic/continuous batching is usually where the money is.
6.2 The key trade-off: utilization vs tail latency
Larger batches often reduce cost per token, but waiting to form batches increases queue time. The “right” setup depends on your SLA and traffic pattern:
- High traffic, tight SLA: small max-wait (micro-batching) + continuous batching if available.
- Bursty traffic: queue-aware autoscaling and conservative max batch tokens to avoid OOM.
- Mixed prompt lengths: bucketing (grouping similar lengths) helps both latency and utilization.
Practical batching guardrail
Always cap batching by tokens, not just by number of requests. Token-based caps handle “one huge prompt” better than request-count caps.
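Here is a token-capped micro-batcher sketch: requests are flushed when the max-wait window expires or when the token cap is reached. It is a simplified, single-queue scheduler meant to illustrate the two knobs, not a production implementation.
```python
# Sketch: dynamic micro-batching with a max-wait window and a token cap.
# Intended to be driven by a serving loop; simplified for illustration.
import time
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_tokens: int
    payload: dict

@dataclass
class MicroBatcher:
    max_wait_s: float = 0.02          # latency traded for utilization
    max_batch_tokens: int = 8192      # cap by tokens, not request count
    queue: list[Request] = field(default_factory=list)

    def add(self, req: Request) -> None:
        self.queue.append(req)

    def _queued_tokens(self) -> int:
        return sum(r.prompt_tokens for r in self.queue)

    def take_batch(self) -> list[Request]:
        deadline = time.monotonic() + self.max_wait_s
        # Wait up to max_wait_s for more traffic, unless the token cap is already filled.
        while time.monotonic() < deadline and self._queued_tokens() < self.max_batch_tokens:
            time.sleep(0.001)
        batch, used = [], 0
        while self.queue and used + self.queue[0].prompt_tokens <= self.max_batch_tokens:
            req = self.queue.pop(0)
            batch.append(req)
            used += req.prompt_tokens
        return batch
```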
6.3 Bucketing and padding: stop letting long requests dominate
When batching, heterogeneous sequence lengths can waste compute because shorter requests “pad” to match longer ones. A simple mitigation is to bucket requests by approximate prompt length (and sometimes expected output length).
- Length buckets: e.g., 0–512, 513–1024, 1025–2048 tokens.
- Priority lanes: separate “interactive” vs “background” workloads with different batching windows.
- Max context policy: enforce upper bounds to protect the system from worst-case memory blowups.
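One way to implement length bucketing: map each request to a bucket by approximate prompt length and batch only within a bucket. The bucket edges below mirror the example ranges above and are assumptions, not tuned values.
```python
# Sketch: assign requests to length buckets so padding waste stays bounded.
BUCKET_EDGES = [512, 1024, 2048]   # example edges; tune against real traffic

def bucket_for(prompt_tokens: int) -> str:
    for edge in BUCKET_EDGES:
        if prompt_tokens <= edge:
            return f"<= {edge} tokens"
    return "over max context (reject or route to a long-context lane)"

print(bucket_for(300), "|", bucket_for(1500), "|", bucket_for(9000))
```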
6.4 Streaming and batching: what changes
Streaming improves perceived latency but does not automatically reduce compute. With streaming, your scheduler must handle many concurrent decodes. This is where continuous batching and KV cache efficiency become important.
If your stack supports it, measure these separately:
- Time to first token (TTFT): strongly influenced by queue time, prefill, and scheduling.
- Tokens per second during decode: strongly influenced by batching, KV cache, and quantization.
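A sketch of measuring both numbers around any token stream; the wrapper is generic and assumes only that your serving stack exposes the response as an iterator of tokens.
```python
# Sketch: measure time-to-first-token and decode tokens/sec for any token stream.
import time
from typing import Iterable, Iterator

def instrumented_stream(stream: Iterable[str]) -> Iterator[str]:
    start = time.monotonic()
    first_token_at = None
    count = 0
    for token in stream:
        if first_token_at is None:
            first_token_at = time.monotonic()
            print(f"TTFT: {first_token_at - start:.3f}s")   # queue + prefill + scheduling
        count += 1
        yield token
    end = time.monotonic()
    if first_token_at is not None and end > first_token_at:
        print(f"decode rate: {count / (end - first_token_at):.1f} tokens/s")
```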
6.5 Batching failure modes (and how to detect them)
- Queue explosions: p95 latency rises because requests wait too long. Detect via queue time and queue depth.
- OOM spikes: batches + long contexts exceed memory. Detect via OOM rate and memory utilization.
- Long-request poisoning: one giant request slows the whole batch. Detect via batch composition stats (max tokens per batch).
- Retry amplification: timeouts trigger retries, increasing load and making batching worse. Detect via retry rate correlated with latency.
Batching can increase total spend
If batching increases tail latency enough to trigger retries, total compute increases and costs go up. Always monitor retry rate and duplicate requests during batching rollouts.
7. Quantization: INT8 and 4-bit in practice (deep dive)
Quantization reduces cost by lowering numeric precision for model weights (and sometimes activations / KV cache). The primary benefits are lower memory and potentially higher throughput, depending on hardware and kernels.
7.1 What gets quantized
- Weights: the most common target (weight-only quantization is widely used for inference).
- Activations: can improve speed but often requires more careful kernel support.
- KV cache: lowering KV precision can reduce memory pressure for long contexts.
7.2 INT8 vs 4-bit: choosing the right level
A practical decision rule:
- INT8: start here if you want safer quality, good memory reduction, and broad support.
- 4-bit: consider when memory is the primary constraint (fit larger models on smaller GPUs), or when throughput gains are worth a stricter quality validation plan.
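As one concrete example, if you serve with Hugging Face Transformers, weight-only INT8 and 4-bit loading via bitsandbytes looks roughly like the sketch below; actual availability and speedups depend on your GPU, kernels, and library versions, and the model ID is a placeholder.
```python
# Sketch: weight-only quantized loading with Transformers + bitsandbytes.
# Requires a CUDA GPU and the bitsandbytes package; behavior depends on versions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

int8_config = BitsAndBytesConfig(load_in_8bit=True)           # safer first step
nf4_config = BitsAndBytesConfig(                               # more aggressive 4-bit
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-model",            # placeholder model ID
    quantization_config=int8_config,  # switch to nf4_config only after INT8 validates
    device_map="auto",
)
```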
Use a stepwise rollout
Do not jump from FP16/BF16 straight to aggressive 4-bit settings. Move stepwise, measure quality and latency, then proceed. Quantization failures are easiest to diagnose when changes are incremental.
7.3 PTQ vs QAT: what you should do first
- Post-training quantization (PTQ): fastest path; quantize after training; typically the first attempt.
- Quantization-aware training (QAT): can preserve quality better but requires training infrastructure and careful experimentation.
For many production teams, PTQ plus strong evaluation and a rollback plan is a pragmatic starting point.
7.4 Calibration and evaluation: the non-negotiables
Quantization is not “set a flag and forget it.” You must validate:
- Task-specific evals: use representative prompts and scoring; avoid relying on a handful of manual checks.
- Regression tests: especially for structured outputs, extraction accuracy, and tool routing behavior.
- Safety checks: verify that refusals and policy behaviors do not degrade.
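A minimal regression-harness sketch: run the same representative prompts through the baseline and the quantized deployment and compare the outcomes you care about. The generation functions and the scoring rule are placeholders for your own evaluation code.
```python
# Sketch: side-by-side regression check for a quantized model.
# generate_baseline(), generate_quantized(), and score() are placeholders.

def score(prompt: str, output: str) -> float:
    """Task-specific score in [0, 1]: exact match, schema validity, rubric, etc."""
    return 1.0 if output.strip() else 0.0

def regression_report(prompts, generate_baseline, generate_quantized, max_drop=0.02):
    base = [score(p, generate_baseline(p)) for p in prompts]
    quant = [score(p, generate_quantized(p)) for p in prompts]
    base_avg = sum(base) / len(base)
    quant_avg = sum(quant) / len(quant)
    verdict = "OK" if base_avg - quant_avg <= max_drop else "FAIL: roll back"
    return {"baseline": base_avg, "quantized": quant_avg, "verdict": verdict}
```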
Quality regressions can be subtle
Many quantization regressions do not show up as “wrong answers” but as: more verbosity, more hedging, worse formatting, more tool-call failures, or higher hallucination rates. Track the outcomes you actually care about.
7.5 Operational considerations
- Keep a rollback path: ability to route back to the baseline precision quickly.
- Pin kernels and versions: quantized performance is sensitive to runtime changes.
- Watch memory: quantization can reduce weight memory but KV cache can still dominate for long contexts.
- Benchmark realistically: measure with your real prompt length distribution and concurrency pattern.
8. Putting it together: a production optimization playbook
Below is a practical sequence that works well in production, especially when multiple teams touch the pipeline. Each step is designed to be measurable and reversible.
8.1 Step 0: freeze a baseline
- Pick the highest-volume endpoint (or the most expensive per request).
- Capture a baseline for tokens, latency percentiles, throughput, error rates, and quality KPIs.
- Document current configs: model version, context limit, decoding settings, routing rules, and autoscaling.
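A baseline is only useful if it is written down in one place. A sketch of a frozen snapshot record, with field names and numbers chosen purely for illustration:
```python
# Sketch: a frozen baseline snapshot to compare every later change against.
# Field names and values are illustrative.
import datetime
import json

baseline = {
    "captured_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "endpoint": "chat",
    "model_version": "model-x-2024-05",   # placeholder
    "decoding": {"max_tokens": 700, "temperature": 0.3},
    "tokens": {"input_p50": 1800, "input_p95": 6200, "output_p50": 320, "output_p95": 900},
    "latency_ms": {"p50": 850, "p95": 2400, "p99": 4100},
    "error_rates": {"timeout": 0.004, "oom": 0.0005, "retry": 0.02},
    "quality": {"task_success": 0.91},
}
print(json.dumps(baseline, indent=2))
```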
8.2 Step 1: token reduction
- Shorten system prompts and templates.
- Introduce output caps and structured outputs where appropriate.
- Summarize or truncate history with clear heuristics.
8.3 Step 2: caching (start exact, then consider semantic)
- Add exact caching for templated or repeated queries.
- Cache embeddings and retrieval results to stabilize RAG latency.
- Enable KV caching and/or prefix caching where supported.
- Only then consider semantic caching for safe endpoints.
8.4 Step 3: batching
- Enable dynamic or continuous batching.
- Set strict max-wait and max batch tokens.
- Introduce bucketing if prompt lengths are highly variable.
- Validate queue time and retry rate under real traffic.
8.5 Step 4: quantization
- Benchmark INT8 first; then evaluate 4-bit if needed.
- Run task-specific evals and monitor quality KPIs in an A/B rollout.
- Keep fast rollback and enforce version pinning.
The compounding effect
The biggest savings usually come from combining levers: shorter prompts reduce KV memory, which allows larger batches, which improves throughput, which makes quantization gains more visible. Optimize as a system, not as isolated tweaks.
9. Reference architectures that reduce cost
9.1 Split “interactive” from “background” inference
Serve latency-sensitive user requests separately from background jobs (summaries, offline enrichment). This lets you use different batching windows, different max tokens, and different autoscaling policies.
9.2 CPU for retrieval and routing, GPU for generation
Many pipelines benefit from doing lightweight steps on CPU (routing, embeddings, filtering) and spending GPU time only on generation. Keeping non-GPU work off the accelerators frees GPU capacity and improves effective utilization.
9.3 Multi-model routing
Route easy requests to smaller/cheaper models and hard requests to larger models. This reduces average cost while preserving quality for edge cases. Routing works best when you define clear criteria (intent, complexity, allowed latency, allowed budget) and you instrument outcomes.
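A sketch of those criteria in code: each request is scored for complexity and routed under latency and budget constraints. The heuristics, thresholds, and model names are assumptions; real routers are usually tuned against the instrumented outcomes mentioned above.
```python
# Sketch: rule-based multi-model routing. Model names and thresholds are illustrative.

def estimate_complexity(prompt: str) -> float:
    long_prompt = min(len(prompt) / 4000, 1.0)   # crude length signal
    hard_words = sum(w in prompt.lower() for w in ("prove", "legal", "diagnose", "refactor"))
    return min(1.0, 0.6 * long_prompt + 0.2 * hard_words)

def route(prompt: str, latency_budget_ms: int, cost_budget_usd: float) -> str:
    complexity = estimate_complexity(prompt)
    if complexity < 0.3 and latency_budget_ms < 1500:
        return "small-model"        # cheap and fast for easy requests
    if complexity < 0.7 or cost_budget_usd < 0.01:
        return "mid-model"
    return "large-model"            # reserve the expensive model for hard cases

print(route("Summarize this ticket in two sentences.", 1000, 0.002))
```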
10. Practical checklist (copy/paste)
- Instrument cost per successful request, input/output tokens, p95/p99 latency, queue time, and retries.
- Reduce tokens: shorten prompts, summarize history, cap outputs, prefer structured responses.
- Add exact caching for safe, repeated requests (version keys, TTLs, correct scope).
- Cache retrieval (embeddings and top-k results) to stabilize RAG latency and reduce repeated work.
- Enable KV/prefix caching where supported; monitor memory pressure and eviction.
- Enable batching with max-wait and max batch tokens; add bucketing for mixed lengths.
- Quantize stepwise (FP16/BF16 → INT8 → 4-bit) with task-specific evals and rollback.
- Monitor quality continuously; treat regressions as first-class incidents.
11. Frequently Asked Questions
What is the first thing I should do if inference spend is out of control?
Measure cost per successful request and token distributions, then reduce tokens (prompt trimming and output caps). Many cost “emergencies” are simply long prompts and uncapped outputs under load.
Should I start with caching or batching?
If you have repeated prompts or templated flows, start with caching. If you have high steady traffic and low utilization, start with batching. In practice, most teams do token reduction → exact caching → batching.
Is semantic caching worth it?
It can be, but only for endpoints where “close enough” answers are acceptable and where you can enforce strict thresholds and safe scoping. If wrong-but-plausible answers are costly, avoid semantic caching and cache only retrieval or structured facts.
Does quantization always make inference cheaper?
Not always. Quantization typically reduces memory and can improve throughput, but real-world benefits depend on hardware and kernel support. Benchmark on your workload and watch for quality regressions and formatting/tooling failures.
Key terms (quick glossary)
- Prefill: the phase where the model processes the input context (system prompt + user prompt + retrieved context) before generating output tokens.
- Decode: the token-by-token generation phase. Decode cost typically scales with the number of output tokens and concurrency.
- KV cache: stored key/value attention states used during autoregressive generation to avoid recomputing the same prefix.
- Exact prompt cache: caching complete responses for requests that match exactly (after normalization and versioning).
- Semantic cache: caching based on similarity (often via embeddings). Higher hit rates, but requires strict guardrails to avoid incorrect reuse.
- Dynamic batching: forming batches automatically from requests arriving within a small time window to improve throughput.
- Continuous batching: a scheduling approach that keeps the GPU busy by continuously merging and advancing requests as they arrive.
- Quantization: using lower-precision formats (e.g., INT8 or 4-bit) for weights and/or activations to reduce memory and potentially increase throughput.
- Tail latency: high-latency outliers (p95/p99). Tail latency often triggers retries and amplifies cost under load.