Production AI economics

LLM Cost Optimization For Production AI Systems

Peter treats LLMs as tools, not default plumbing: use them where they create leverage, remove them where cheaper signals carry the job.

£1,800 → £10

SThree CV skill tagging

Per-million-CV cost cut 180× by moving inference to retrieval, filters, and a 7 kB head.

+12.03 pp

Cheap signals first

A three-line top-k cap by risk score was the single largest precision lift in the programme.

4–8 s → ~50 ms

Operational latency

A ~100× faster tagging path replaced per-skill runtime LLM validation.

174× fewer calls

LLM traffic that earned nothing

A four-line patch cut Tier-5 rescue calls 174× with a +0.00 pp precision change.

Cost-aware private assistant

Personal Assistant AI

Routine paths stay efficient, while deeper reasoning is reserved for tasks where it changes the result.

Spending intelligence where it changes the outcome

Personal Assistant AI adds a private-product example of model-cost discipline: a memory-heavy assistant should not spend the same reasoning budget on every request.

  • The system separates routine paths from deeper reasoning so AI spend follows product value rather than spectacle.
  • Private memory work is designed around trust, latency, and cost discipline; the implementation recipe is not published.
  • The public article deliberately avoids provider and cost recipes because the exact product behaviour is part of the future executive-assistant business.

Hot-path removal as engineering judgement

The Skills Taxonomy project is a clean case of knowing when an LLM is the wrong production primitive. LLMs still do offline labelling and evaluation; the inference path became cheaper, faster, and more inspectable.

  • V1 ran an LLM-per-skill validator (roughly seven LLM calls per CV) at about £1,800 per million CVs and 4–8 s latency.
  • The V3+head pipeline uses retrieval, deterministic filters, calibration, and a 7 kB classifier head at about £10 per million CVs and ~50 ms per CV.
  • The 180× figure is against the as-shipped 2024 V1; a prompt-cached V1 would narrow the gap to ~30–60×.
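The arithmetic behind these ratios can be checked in a few lines. This is a minimal sketch that uses only the figures quoted above; the variable names are illustrative, and the "implied cached V1 cost" is simply what a 30–60× gap against £10 would mean, not a measured number.

```python
# Figures quoted in the text; everything below them is derived arithmetic.
V1_COST_GBP_PER_MILLION = 1800.0    # LLM-per-skill validator
HEAD_COST_GBP_PER_MILLION = 10.0    # retrieval + filters + classifier head
V1_LLM_CALLS_PER_CV = 7             # "roughly seven", so treat results as approximate
V1_LATENCY_MS = (4000, 8000)        # 4-8 s
HEAD_LATENCY_MS = 50                # ~50 ms

# Headline cost ratio.
cost_ratio = V1_COST_GBP_PER_MILLION / HEAD_COST_GBP_PER_MILLION

# Per-CV and per-call cost of V1, derived from the per-million figure.
v1_cost_per_cv = V1_COST_GBP_PER_MILLION / 1_000_000
v1_cost_per_call = v1_cost_per_cv / V1_LLM_CALLS_PER_CV

# Latency speed-up range; "~100x" sits inside it.
latency_speedup = tuple(ms / HEAD_LATENCY_MS for ms in V1_LATENCY_MS)

# What a 30-60x gap would imply for a prompt-cached V1 (derived, not measured).
implied_cached_v1_cost = tuple(r * HEAD_COST_GBP_PER_MILLION for r in (30, 60))

print(f"cost ratio: {cost_ratio:.0f}x")
print(f"V1 cost per CV: £{v1_cost_per_cv:.4f}, per LLM call: £{v1_cost_per_call:.6f}")
print(f"latency speed-up: {latency_speedup[0]:.0f}x to {latency_speedup[1]:.0f}x")
print(f"implied cached V1 cost per million CVs: £{implied_cached_v1_cost[0]:.0f} to £{implied_cached_v1_cost[1]:.0f}")
```

The latency range works out to 80× to 160×, which is where the rounded ~100× headline comes from.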

Cheap signals before bigger models

The real move was not a bigger model. It was ranking and pruning the candidates retrieval had already produced.

  • A three-line top-k cap by per-URI risk score added +12.03 pp precision — the single largest lift in a 30-experiment programme.
  • The full seven-filter Wave-1 stack added +16.44 pp through calibration, voting, empirical-Bayes shrinkage, and isotonic regression.
  • A four-line patch cut Tier-5 LLM rescue calls 174× with a +0.00 pp precision change — the traffic was not earning quality.
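A top-k cap of this kind might look like the sketch below. It is not the project's actual patch: the `Candidate` type, the `risk` field, and the convention that a higher risk score means a more likely false positive are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A retrieved skill candidate (hypothetical shape)."""
    uri: str
    risk: float  # assumed convention: higher = more likely to be a false positive


def cap_top_k(candidates: List[Candidate], k: int) -> List[Candidate]:
    """Keep only the k lowest-risk candidates that retrieval produced."""
    if k <= 0:
        return []
    return sorted(candidates, key=lambda c: c.risk)[:k]


# Toy data: URIs and scores are made up.
retrieved = [
    Candidate("skill/python", 0.05),
    Candidate("skill/snake-handling", 0.90),
    Candidate("skill/sql", 0.10),
    Candidate("skill/excel", 0.40),
]

kept = cap_top_k(retrieved, k=2)
print([c.uri for c in kept])
```

The point of the sketch is the shape of the change: no new model, just a sort and a slice over candidates that already exist, which is why such a cap can be a few lines long.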

A tiny classifier head and threshold trade-offs

The logistic-regression head shows how a small supervised layer can carry earlier LLM-judge labels into a cheap inference boundary when the task is well-shaped.

  • The head is a tiny joblib artefact over frozen MiniLM embeddings — the only orthogonal lift in the whole programme.
  • τ thresholds make the trade explicit: 96.06% precision / 0.21% hallucination at τ=0.5, 96.38% / 0.05% at τ=0.7.
  • The headline lift is the honest +4.2 pp GroupKFold-by-CV figure, not the easier +5.71 pp stratified split.
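The two ideas in this list, a τ threshold on the head's output and evaluation folds grouped by CV, can be sketched in plain Python. The probabilities, labels, and CV ids below are toy values, and `group_folds` is a simplified round-robin stand-in for a grouped split such as scikit-learn's `GroupKFold`, not a reimplementation of it.

```python
from typing import List, Optional, Sequence, Tuple


def precision_at_threshold(
    probs: Sequence[float], labels: Sequence[int], tau: float
) -> Tuple[Optional[float], int]:
    """Return (precision, number kept) for predictions with probability >= tau."""
    kept = [y for p, y in zip(probs, labels) if p >= tau]
    if not kept:
        return None, 0  # nothing passes the threshold, precision is undefined
    return sum(kept) / len(kept), len(kept)


def group_folds(group_ids: Sequence[str], n_splits: int) -> List[int]:
    """Assign a fold to every row so that all rows of one group share a fold."""
    fold_of_group = {}
    for g in group_ids:
        if g not in fold_of_group:
            fold_of_group[g] = len(fold_of_group) % n_splits
    return [fold_of_group[g] for g in group_ids]


# Toy data: head probabilities, judge labels (1 = correct tag), and source CV ids.
probs = [0.95, 0.80, 0.65, 0.55, 0.30]
labels = [1, 1, 0, 1, 0]
cv_ids = ["cv1", "cv1", "cv2", "cv2", "cv3"]

for tau in (0.5, 0.7):
    precision, n_kept = precision_at_threshold(probs, labels, tau)
    print(f"tau={tau}: precision={precision}, kept={n_kept}")

folds = group_folds(cv_ids, n_splits=2)
print(folds)
```

Raising τ keeps fewer tags but a cleaner set, which is the trade the thresholds above make explicit. Grouping folds by CV keeps tags from one document out of both the training and test sides of a split, which is why the grouped figure is the more conservative one.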

The limits, stated plainly

Cost wins only count if the quality story holds. The Skills Taxonomy work names its limits as part of the result.

  • Every precision number is mediated by a single LLM judge; a 200-CV human spot-check is scheduled but not yet run.
  • On certain real-PDF prose, V3 hybrid precision drops to 62.83% and the ESCO Tier-2 encoder collapses to 3.7% — the largest known production gap.
  • All evaluation CVs and skill manifests are English-language and anglophone-Western; there is no non-English production evaluation.
