When AI Costs Drift in Production: How to Detect It Early and Build Reliable LLM Unit Economics

A practical guide to identifying silent AI cost drift in production systems, instrumenting the right metrics, and controlling spend without hurting response quality.

A few weeks into production, an AI system can look perfectly healthy while quietly getting expensive.

This happened in a system we recently reviewed.

From the outside, nothing looked broken:

  • responses were correct,
  • latency was acceptable,
  • usage was growing steadily.

In early testing, cost was too small to worry about:

  • one LLM call per query,
  • moderate prompts,
  • limited traffic.

Then production usage arrived and the economics changed fast.

After digging in, we found:

  • per-query cost had moved from about $0.02 to $0.25-$0.30,
  • average request size was consistently 40k-60k tokens,
  • some user flows were triggering 2-4 model calls per interaction,
  • similar queries were being recomputed repeatedly.

The system was not obviously inefficient. It was simply not designed with cost as a first-class constraint.

This post is a practical playbook for teams building LLM products who want to answer four production questions:

  1. How do we detect cost drift early?
  2. What exactly should we track?
  3. How do we define useful unit metrics and per-user economics?
  4. What changes reduce cost without hurting quality?

Why Cost Drift Happens Even in “Good” Systems

Most teams monitor quality and latency first.

That is correct, but incomplete.

LLM systems have a third axis: economics under load. If this is not instrumented from day one, drift is almost guaranteed.

The most common pattern is not one giant bug. It is several small decisions that compound:

  • context windows grow over time and are never trimmed,
  • retrieval is tuned for “safety” (more chunks) instead of precision,
  • multi-step flows call models even when nothing changed,
  • no caching on repeat queries or intermediate steps,
  • all traffic uses the same heavy path regardless of complexity,
  • no baseline definition of cost per request or workflow.

Each choice feels reasonable locally. Together they create a cost curve that steepens with adoption.

Detection: How to Know You Have Cost Drift

Cost drift is easiest to catch when you track ratio metrics, not just total cloud bill.

These are the early warning signals that matter most.

1) Cost per Request Is Rising Faster Than Product Value

If query volume doubles and total cost doubles, that may be fine.

If volume is steady but cost per request rises from $0.02 to $0.10, then to $0.25, that is structural drift.

Track this daily:

  • Cost per successful request = total LLM spend / successful requests

Segment by:

  • endpoint,
  • workflow,
  • user tier,
  • model,
  • query class.
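
As a sketch, this metric can be computed straight from per-call logs. The record fields below (`endpoint`, `cost`, `success`) are illustrative, not a prescribed schema:

```python
from collections import defaultdict

def cost_per_successful_request(records):
    """Cost per successful request, segmented by endpoint.

    Each record is assumed to carry: endpoint, cost (USD), success (bool).
    """
    spend = defaultdict(float)
    successes = defaultdict(int)
    for r in records:
        spend[r["endpoint"]] += r["cost"]
        if r["success"]:
            successes[r["endpoint"]] += 1
    # Divide total spend by successful requests; skip segments with no successes.
    return {ep: spend[ep] / successes[ep] for ep in spend if successes[ep]}

logs = [
    {"endpoint": "/search", "cost": 0.02, "success": True},
    {"endpoint": "/search", "cost": 0.04, "success": True},
    {"endpoint": "/report", "cost": 0.30, "success": False},
    {"endpoint": "/report", "cost": 0.25, "success": True},
]
print(cost_per_successful_request(logs))
# /search: 0.06 / 2 successes = 0.03; /report: 0.55 / 1 success = 0.55
```

Note that failed requests still count toward spend but not toward the denominator, so a rising failure rate shows up as rising cost per success.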

2) Tokens per Request Trend Upward Week by Week

Raw token volume is noisy. Normalized token volume is not.

Track:

  • Input tokens per request (p50, p90, p99)
  • Output tokens per request (p50, p90, p99)
  • Total tokens per request = input + output

If p90 input tokens keep increasing, context is likely bloating.
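
Percentiles are cheap to compute from the raw counts; a minimal sketch using the standard library (sample numbers are made up):

```python
import statistics

def token_percentiles(token_counts):
    """Return p50/p90/p99 of per-request token counts."""
    qs = statistics.quantiles(token_counts, n=100, method="inclusive")
    return {"p50": qs[49], "p90": qs[89], "p99": qs[98]}

# Illustrative weekly samples: p90 creeping upward signals context bloat
# even when the median barely moves.
week1 = [900, 1100, 1000, 1200, 5000, 1050, 980, 1010, 1150, 990]
week2 = [1500, 1800, 1700, 2100, 9000, 1900, 1600, 1750, 2000, 1650]
print(token_percentiles(week1)["p90"], token_percentiles(week2)["p90"])
```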

3) Calls per Interaction Increase Quietly

Agent-like flows can add retries, tool loops, and chained calls over time.

Track:

  • Model calls per user interaction (mean, p95)

If this grows from ~1.1 to ~2.8, spend can more than double without any visible quality gain.

4) Repeat Work Ratio Is High

Many expensive systems recompute nearly identical results.

Track:

  • Cache hit rate for final responses,
  • Intermediate cache hit rate for retrieval/tool outputs,
  • Duplicate query ratio over rolling windows.
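
The duplicate query ratio can be approximated without any infrastructure. The sketch below uses naive whitespace/case normalization; a real system might hash embeddings instead:

```python
from collections import deque

def duplicate_ratio(queries, window=1000):
    """Fraction of queries that repeat (after normalization) within a rolling window."""
    recent = deque(maxlen=window)
    dupes = 0
    for q in queries:
        # Cheap normalization; membership test on a deque is O(n), fine for a sketch.
        key = " ".join(q.lower().split())
        if key in recent:
            dupes += 1
        recent.append(key)
    return dupes / len(queries) if queries else 0.0

qs = ["reset password", "Reset  Password", "pricing", "reset password", "pricing"]
print(duplicate_ratio(qs))
```

A high ratio here is direct evidence that a response cache would pay for itself.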

5) Marginal Cost Is Rising While Latency and Quality Stay Flat

This is the classic danger zone.

If quality and latency are stable but cost climbs, your architecture may be over-serving complexity.

Instrumentation: What to Log on Every Request

If you cannot explain a single expensive request from logs, your observability is incomplete.

At minimum, capture the following at request level.

Request Metadata

  • request id,
  • user id or tenant id,
  • endpoint/workflow name,
  • timestamp,
  • model/provider,
  • route class (simple vs complex path).

LLM Usage Metadata (Per Call)

  • input tokens,
  • output tokens,
  • cached tokens (if provider reports this),
  • unit prices used for billing math,
  • computed call cost,
  • call duration,
  • retry count.

Retrieval/Context Metadata

  • number of chunks retrieved,
  • total context tokens injected,
  • chunk source types,
  • reranker on/off,
  • truncation flags.

Workflow Metadata

  • number of model calls in the interaction,
  • tool calls count,
  • branch path chosen,
  • cache hits/misses for each stage.

Outcome Metadata

  • success/failure,
  • user-visible quality proxy (thumbs up/down, task completion, escalation),
  • latency by stage and total.

This creates a full cost trace, not just a final invoice number.
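
A minimal sketch of the per-call portion of such a trace. Field names and unit prices are illustrative, not any provider's actual schema or pricing:

```python
from dataclasses import dataclass

@dataclass
class LLMCallRecord:
    """One model call inside a request trace."""
    request_id: str
    model: str
    input_tokens: int
    output_tokens: int
    input_price_per_1k: float   # unit prices captured at call time,
    output_price_per_1k: float  # so historical costs stay reproducible
    retry_count: int = 0

    @property
    def cost(self) -> float:
        return (self.input_tokens * self.input_price_per_1k
                + self.output_tokens * self.output_price_per_1k) / 1000

call = LLMCallRecord("req-42", "model-x", input_tokens=52_000,
                     output_tokens=800, input_price_per_1k=0.003,
                     output_price_per_1k=0.015)
print(f"${call.cost:.4f}")  # the 52k-token prompt dominates the call cost
```

Storing the unit prices alongside the token counts matters: provider pricing changes, and you want old traces to reproduce the bill they actually generated.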

Metrics Framework: The Minimum Dashboard That Actually Helps

A useful dashboard combines spend, behavior, and outcomes.

Use this as a baseline.

Cost Metrics

  • Total LLM spend/day
  • Cost per request
  • Cost per interaction
  • Cost per workflow (e.g., search, summary, support resolution)
  • Cost per active user (daily and monthly)
  • Cost per successful outcome

Token Metrics

  • Input tokens/request (p50, p90, p99)
  • Output tokens/request (p50, p90, p99)
  • Context tokens/request
  • Tokens per workflow step

Efficiency Metrics

  • Model calls/interaction
  • Cache hit rate (response + intermediate)
  • Retrieval precision proxy (kept chunks / retrieved chunks)
  • Simple-route share vs complex-route share

Quality and Reliability Guardrails

  • task success rate,
  • user rating or explicit feedback,
  • fallback/escalation rate,
  • latency SLO attainment.

Never optimize cost alone. Always pair cost metrics with quality and latency guardrails.

Unit Economics: The Metrics Leaders Actually Need

Engineering teams often stop at total spend. Product and finance teams need unit economics.

These are the most practical unit metrics.

1) Per-Request Unit Cost

Useful for API businesses and high-volume chat/search products.

Formula:

Per-request cost = total variable inference cost / successful requests

Use alongside percentile distributions and endpoint segmentation.

2) Per-User Cost

Useful for subscription and seat-based products.

Formula:

Per-user cost (period) = total variable AI cost in period / active users in period

Track by tier:

  • free,
  • pro,
  • enterprise.

This quickly exposes tier imbalance and misuse.
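
A sketch of the per-tier computation, assuming events carry a tier label and you know active user counts per tier:

```python
from collections import defaultdict

def per_user_cost_by_tier(events, active_users):
    """events: (user_id, tier, cost) tuples; active_users: tier -> active user count."""
    spend = defaultdict(float)
    for _user, tier, cost in events:
        spend[tier] += cost
    return {tier: spend[tier] / active_users[tier] for tier in active_users}

events = [("u1", "free", 0.10), ("u2", "free", 0.30), ("u3", "pro", 4.00)]
print(per_user_cost_by_tier(events, {"free": 100, "pro": 10}))
# free spend spreads over all active free users, including the silent ones
```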

3) Per-Workflow Cost

Useful when user interactions contain multiple AI steps.

Formula:

Workflow cost = sum(call costs + retrieval/tool costs) across one full task

Example workflows:

  • support ticket resolution,
  • report generation,
  • research synthesis.

4) Cost per Successful Outcome

Most meaningful business metric.

Formula:

Cost per successful outcome = total AI cost / number of completed successful outcomes

This links spend directly to value delivery.

5) Gross Margin Impact per Tier

If your product has monetized plans:

AI gross margin = (plan revenue - AI variable cost) / plan revenue

Even rough tracking here helps pricing and packaging decisions.
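
Even a one-line helper makes this trackable. The numbers below are hypothetical:

```python
def ai_gross_margin(plan_revenue: float, ai_variable_cost: float) -> float:
    """AI gross margin for one plan: (revenue - AI variable cost) / revenue."""
    return (plan_revenue - ai_variable_cost) / plan_revenue

# Hypothetical: a $20/month seat consuming $6 of inference per month.
print(f"{ai_gross_margin(20.0, 6.0):.0%}")  # 70%
```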

What to Look for During a Cost Investigation

When costs drift, run this checklist in order.

Context Bloat

Look for:

  • increasing prompt sizes over time,
  • old conversation history always included,
  • retrieval chunks added without a budget.

Fix:

  • hard token budgets,
  • recency/importance pruning,
  • selective memory summaries.
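
The hard-budget fix can be as simple as keeping the most recent history that fits. The token counter below is a word-count stand-in; a real system should use the model's tokenizer:

```python
def prune_history(messages, token_budget, count_tokens=lambda m: len(m.split())):
    """Keep the most recent messages that fit under a hard token budget."""
    kept, used = [], 0
    for msg in reversed(messages):  # walk newest-first
        cost = count_tokens(msg)
        if used + cost > token_budget:
            break  # budget exhausted; drop everything older
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order

history = ["old long message " * 50, "recent question", "latest answer"]
print(prune_history(history, token_budget=10))
```

The key property is the hard stop: prompt size no longer grows with conversation length.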

Retrieval Over-Inclusion

Look for:

  • top-k increased repeatedly “just to be safe”,
  • many low-value chunks passed to prompt,
  • no reranking stage.

Fix:

  • retrieve fewer candidates with better relevance,
  • rerank then keep only top evidence,
  • enforce context cap before generation.
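
A rerank-then-cap stage can be sketched like this. Scores would come from a reranker; the token counter is again a word-count stand-in:

```python
def build_context(chunks, scores, keep_top=5, token_cap=2000,
                  count_tokens=lambda c: len(c.split())):
    """Keep the highest-scoring chunks, stopping at a hard context token cap."""
    ranked = sorted(zip(scores, chunks), key=lambda p: p[0], reverse=True)
    kept, used = [], 0
    for _score, chunk in ranked[:keep_top]:
        cost = count_tokens(chunk)
        if used + cost > token_cap:
            break  # cap enforced before generation, not after
        kept.append(chunk)
        used += cost
    return kept

chunks = ["highly relevant fact", "weak chunk " * 100, "another strong fact"]
scores = [0.95, 0.20, 0.90]
print(build_context(chunks, scores, keep_top=2, token_cap=50))
```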

Redundant Multi-Step Calls

Look for:

  • repeated planning/reasoning calls,
  • expensive verification calls on every request,
  • loops without change detection.

Fix:

  • stage-level caching,
  • idempotent checks,
  • stop conditions and max-step guards.
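
Stage-level caching under a deterministic key can be sketched as follows; the stage and its inputs are hypothetical:

```python
import hashlib
import json

_stage_cache = {}

def cached_stage(stage_name, inputs, compute):
    """Memoize an expensive pipeline stage under a deterministic key.

    The key hashes the stage name plus canonicalized inputs, so unchanged
    inputs never trigger a second model call.
    """
    key = hashlib.sha256(
        json.dumps([stage_name, inputs], sort_keys=True).encode()
    ).hexdigest()
    if key not in _stage_cache:
        _stage_cache[key] = compute(inputs)
    return _stage_cache[key]

calls = []
def fake_planner(inputs):
    calls.append(inputs)  # stands in for an expensive LLM call
    return f"plan for {inputs['task']}"

cached_stage("plan", {"task": "summarize"}, fake_planner)
cached_stage("plan", {"task": "summarize"}, fake_planner)  # cache hit
print(len(calls))  # the expensive stage ran once
```

In production you would bound the cache (TTL or LRU) and include model/prompt versions in the key so stale plans are not replayed after a deploy.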

Missing Routing

Look for:

  • simple tasks going through full complex pipeline,
  • high-end models used universally.

Fix:

  • classify intent/complexity upfront,
  • route simple tasks to cheaper path,
  • escalate only when confidence is low.
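
A toy version of that routing logic, with made-up thresholds (real routers often use a small classifier instead of heuristics):

```python
def route(query: str, confidence: float) -> str:
    """Pick a serving path by rough complexity and classifier confidence."""
    if len(query.split()) <= 12 and confidence >= 0.8:
        return "simple"    # one cheap generation call
    if confidence >= 0.5:
        return "standard"  # retrieval + rerank + generation
    return "complex"       # full agentic chain, only when confidence is low

print(route("what is our refund policy", 0.9))  # simple
print(route("compare all Q3 vendor contracts and flag risk clauses", 0.4))
```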

Invisible Expensive Stages

Look for:

  • one workflow where costs are much higher but hidden in averages,
  • no stage-level cost attribution.

Fix:

  • per-stage cost accounting,
  • workflow-specific dashboards and alerts.

Practical Control Levers That Usually Work

In the system we reviewed, no model swap or major infrastructure rewrite was needed.

The highest impact came from structural controls.

1) Track Cost per Request Next to Latency

Cost should be a first-class SLO-adjacent metric.

If teams can see latency regressions immediately, they should see cost regressions the same way.

2) Enforce Token Budgets in Two Places

Set hard caps at:

  • retrieval output,
  • prompt assembly.

Do not rely on downstream truncation alone.

3) Cache Repeated Queries and Intermediate Outputs

Use:

  • semantic caching for near-identical user queries,
  • deterministic keys for intermediate planner/tool results.

Even moderate hit rates can materially lower spend.

4) Remove Redundant Calls in Agent Flows

Many flows run “just in case” checks that almost never change outcomes.

Audit each call:

  • what new information does this step add?
  • what decision changes if we skip it?

If neither answer is clear, remove or gate it.

5) Add Basic Routing by Complexity

Route classes can be simple:

  • class A: direct retrieval + one generation call,
  • class B: retrieval + reranker + generation,
  • class C: full agentic chain.

Most traffic is usually class A/B, not C.

Alerting: Catch Drift Before the Bill Arrives

Set alerts on normalized metrics, not only total spend.

Recommended alert examples:

  • Cost/request daily > baseline + 30% for 2 consecutive days
  • Input tokens/request p90 > threshold
  • Calls/interaction p95 > threshold
  • Cache hit rate drops below baseline
  • Cost rises while quality KPI does not improve

Tie alerts to actionable owners (platform, retrieval, agent logic), not a generic on-call queue.
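
The first alert rule above reduces to a few lines; thresholds here are the article's example numbers:

```python
def cost_drift_alert(daily_cost_per_request, baseline, pct=0.30, days=2):
    """Fire when cost/request exceeds baseline * (1 + pct) for `days` consecutive days."""
    threshold = baseline * (1 + pct)
    streak = 0
    for value in daily_cost_per_request:
        streak = streak + 1 if value > threshold else 0  # reset on a normal day
        if streak >= days:
            return True
    return False

series = [0.021, 0.022, 0.028, 0.031]  # baseline 0.02 -> threshold 0.026
print(cost_drift_alert(series, baseline=0.02))
```

Requiring consecutive days keeps one-off spikes (a batch job, a single heavy tenant) from paging anyone.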

A 30-Day Implementation Plan

If your system has no cost discipline yet, this sequence works well.

Week 1: Instrumentation and Baselines

  • log per-call token and cost data,
  • build request/workflow cost traces,
  • establish baseline dashboard for cost, tokens, calls/interaction, quality.

Week 2: Budget Controls

  • introduce retrieval and prompt token caps,
  • cap max model calls per interaction,
  • add stage-level timeout and loop guards.

Week 3: Caching + Routing

  • launch response cache for repeated patterns,
  • add intermediate cache where deterministic,
  • route simple queries to lightweight path.

Week 4: Tune Without Regressing Quality

  • compare pre/post on quality and latency guardrails,
  • tighten thresholds where safe,
  • finalize alerts and ownership.

Results You Should Expect

In our case, the pattern was clear after these structural changes:

  • token usage per request dropped around 40-60%,
  • cost became predictable under load,
  • response quality remained largely unchanged.

No model replacement. No major infra overhaul.

The core insight was simple:

Most production AI overspend does not come from the model itself. It comes from how the system uses the model.

Final Thought

If you are building with LLMs in production, treat cost as an engineering metric, not a finance postmortem.

Track it per request, per user, and per workflow.

Set explicit token and call budgets.

Design for precision, caching, and routing from the start.

When you do, you usually get all three together:

  • lower spend,
  • stable quality,
  • predictable scaling.

That combination is what turns an impressive demo into a durable product.
