A recruiter's search results are only as good as the tags underneath them.
If a CV says "no Kubernetes experience" and the system emits Kubernetes anyway, a recruiter mis-ranks the candidate and builds the wrong talent pool. If a developer's CV mentions SOAP and the tagger reads it as a cleaning skill instead of a web-services protocol, that candidate vanishes from the right search and surfaces in the wrong one. At SThree, a global recruitment specialist, the Skills Taxonomy pipeline had to take CV prose and attach standardised skill labels accurate enough to drive search, candidate-role matching, and talent-pool building at high CV volume. The label universe held roughly 174,000 labels, unified across five source taxonomies (ESCO, O*NET, ACM CCS, the UK SSC catalogue, and Canada's NOC).
The first production system, V1, used the obvious modern answer: retrieve candidate skills, then ask an LLM to confirm or reject each one. It worked, reaching about 98% precision. It also cost roughly £1,800 per million CVs and took 4 to 8 seconds per CV — the wrong shape for an ingestion path that has to keep up with volume.
So the work became a subtraction problem. Keep the judgement. Remove the per-skill LLM call from the runtime path. Then measure hard enough to know whether the cheaper system still knew when to say no. It took 30 experiments across five waves. Here is what survived.
SThree Skills Taxonomy at a glance
What changed when the LLM left the hot path
- Largest single lift: E03 top-k risk filter. Sort each CV's emitted skills by historical risk score and keep the lowest-risk 20. The single biggest lift in a 30-experiment programme.
- Filter stack: Wave-1 V3 stack. Length gates, risk scoring, isotonic calibration, CV-level voting, empirical-Bayes shrinkage, and blacklist rules, all before any learned head.
- The orthogonal lift: E24 logistic-regression head. A 768-dim feature scored by L2 logistic regression, trained on 19,627 judge-labelled tuples. The headline uses the harder GroupKFold-by-CV number, not the +5.71pp stratified split.
- Final operating point: Diverse 500-CV benchmark. 500 CVs across 43 industry categories. No category below 92%; hard hallucination 7× cleaner than V1 at τ=0.7.
- Limits: V1's 98.23% comes from a separate internal 314-CV bench, so the ~2pp gap is cross-bench, not like-for-like. Every precision number is mediated by one LLM judge; the human spot-check is designed but not yet run. The head score is a τ-thresholding signal, not a calibrated probability (ECE 0.36). Evaluation is English-language only.
The bet: wrong tags leave cheap fingerprints
The question was deliberately narrow: could the LLM's judgement be reconstructed from signals the pipeline already produced, at a fraction of the cost, with no LLM on the runtime path?
The bet was that an LLM was the most expensive source of judgement available, not the only one. Wrong tags tend to share observable properties. They come from a small set of risky labels that the benchmark history shows go wrong often. They appear deep in the emitted list, after the strong matches. They attach to short or ambiguous sentence spans. They survive because the retrieval stage is too permissive, not because the system lacks another billion parameters of language understanding. If that is true, most bad emissions can be caught by cheap, deterministic signals, and the LLM is only needed for the genuinely ambiguous residue.
The first large result supported the bet. The idea: sort each CV's emitted skills by historical risk (for a given label, how often it had been wrong in past benchmark data) and keep only the lowest-risk 20. Three lines of Python. It lifted precision by 12 percentage points over the 75% baseline, with zero LLM calls, and it remained the single largest lift in the entire programme.
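A minimal sketch of that cap, under stated assumptions: the function name, the `risk_history` table, and the choice to treat labels with no history as maximally risky are all illustrative, not the production code.

```python
from typing import Mapping, Sequence


def keep_lowest_risk(
    emitted: Sequence[str],
    risk: Mapping[str, float],
    k: int = 20,
    default_risk: float = 1.0,  # assumption: unseen labels count as maximally risky
) -> list[str]:
    """Return the k emitted labels with the lowest historical risk."""
    # sorted() is stable, so labels with equal risk keep their emission order.
    ranked = sorted(emitted, key=lambda label: risk.get(label, default_risk))
    return ranked[:k]


# Hypothetical risk table: fraction of past emissions judged wrong, per label.
risk_history = {"python": 0.02, "kubernetes": 0.05, "cleaning": 0.61, "soap": 0.40}
emitted = ["cleaning", "python", "soap", "kubernetes", "brand-new-label"]

print(keep_lowest_risk(emitted, risk_history, k=3))
# ['python', 'kubernetes', 'soap']
```

With `default_risk=1.0`, a label that has never been judged sorts last and is the first to be dropped, which is one way the cold-start cost of this filter can show up.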
That result is almost rude in its simplicity, and that is the point. The biggest win came not from a bigger model but from noticing that a long tail of plausible-looking labels was doing disproportionate damage, then cutting that tail before it reached a recruiter. It is aggressive — it drops close to 80% of candidate emissions — and it carries a real cost: a brand-new label has no risk history, so it falls through the cap. But it reframed the whole programme. Stop trying to validate every skill; tighten the retrieval-and-filter stack before anything learned ever gets a vote.
A stack, not one trick
The loud result was the risk-score cap. Production quality came from composition. Seven cheap post-retrieval filters and calibrators were layered on top of the baseline: the risk cap, per-source calibration, shrinkage of the risk scores, CV-level voting to drop lone weak emissions, a length gate, a small hand-blacklist of known-bad labels, and a header rule.
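To make "shrinkage of the risk scores" concrete, here is one common empirical-Bayes form: the posterior mean under a Beta prior. The prior mean, the prior strength of 10 pseudo-observations, and the function name are assumptions for illustration; the source does not state which prior was used.

```python
def shrunk_risk(
    wrong: int,
    total: int,
    prior_risk: float,
    prior_strength: float = 10.0,  # assumption: prior worth 10 pseudo-observations
) -> float:
    """Posterior-mean risk for a label under a Beta prior centred on prior_risk."""
    if total < 0 or not 0 <= wrong <= total:
        raise ValueError("need 0 <= wrong <= total")
    alpha = prior_risk * prior_strength
    beta = (1.0 - prior_risk) * prior_strength
    return (wrong + alpha) / (total + alpha + beta)


# A label seen once and wrong once is pulled strongly toward the prior...
print(round(shrunk_risk(1, 1, prior_risk=0.2), 3))      # 0.273
# ...while a label wrong 100 times out of 100 keeps most of its raw risk.
print(round(shrunk_risk(100, 100, prior_risk=0.2), 3))  # 0.927
```

The effect is that labels with thin history are not ranked on one or two noisy judgements.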
Stacked, the seven filters delivered +16 points of precision over the baseline, reaching about 91% with zero LLM calls. The filters do not simply sum; they overlap, each catching some of the same bad emissions. That overlap is the engineering pattern: not one magic patch, but a chain of weak deterministic signals layered so that bad emissions get harder to create and uncertain ones get easier to reject.
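The overlap is easy to see in a toy chain. The sketch below uses two hypothetical filters (a length gate and a blacklist) with made-up thresholds and emissions; it illustrates the composition pattern, not the production stack.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Emission:
    label: str
    span: str  # sentence span the label was matched in


Filter = Callable[[list[Emission]], list[Emission]]


def length_gate(min_chars: int) -> Filter:
    return lambda ems: [e for e in ems if len(e.span) >= min_chars]


def blacklist(banned: set[str]) -> Filter:
    return lambda ems: [e for e in ems if e.label not in banned]


def run_stack(emissions: list[Emission], filters: list[tuple[str, Filter]]):
    """Apply filters in order and record how many emissions each one removed."""
    dropped = {}
    for name, apply_filter in filters:
        kept = apply_filter(emissions)
        dropped[name] = len(emissions) - len(kept)
        emissions = kept
    return emissions, dropped


emissions = [
    Emission("python", "Built ETL pipelines in Python for five years."),
    Emission("cleaning", "SOAP"),       # short span AND blacklisted label
    Emission("leadership", "Team lead"),
    Emission("cleaning", "Responsible for cleaning and validating datasets."),
]
stack = [("length_gate", length_gate(20)), ("blacklist", blacklist({"cleaning"}))]

survivors, dropped = run_stack(emissions, stack)
print([e.label for e in survivors])  # ['python']
print(dropped)                       # {'length_gate': 2, 'blacklist': 1}
```

Run on its own, the blacklist would remove two emissions; inside the chain it is credited with one, because the length gate already caught the other. That is why per-filter lifts do not add up.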
The stack also produced a useful diagnostic. Once it matured, most plausible improvements stopped moving quality at all — the system had reached a local optimum where any intervention inside the same signal space was shadowed by the dominant risk cap. Further lift had to come from a genuinely different axis. That told me where not to spend the next month.
Proving a model call wasn't earning its place
One four-line patch cut the low-confidence LLM rescue traffic — the calls fired on the most uncertain sentences — by 174×, from roughly 19 calls per CV down to about 0.11. The precision change was +0.00: bit-identical.
That reads like a failed experiment until you ask what it was testing. It was a question — is the remaining LLM traffic buying any measurable quality? — and the answer was no. The post-stack filters had already removed everything that rescue traffic would have caught. Cutting it did not make the precision number prettier; it made the system cheaper, simpler, and easier to operate, and it shipped. Proving that a model call is not earning its place is how a system gets smaller without getting worse.
The bug that made the model look worse than it was
The most instructive failure in the programme was not a model failure. It was text processing.
The first run on an external 500-CV benchmark reported 67% precision with a 22% hard-hallucination rate. That looked like a serious generalisation cliff — a model that worked on internal CVs and fell apart on external ones. The real cause was a 30-line corpus-construction bug. Two external sources stored CV text as CSV rows with no internal newlines or periods, so the document builder collapsed each CV into a single ~1,000-character paragraph. The sentence splitter could not segment a paragraph with no sentence boundaries, so it returned one giant span, and the matcher then matched every common-noun label inside that blob.
A 30-line fallback in the sentence splitter lifted precision on the same 500 CVs from 67% to 91%, a swing of roughly 24 points with no change to the model. The "out-of-distribution cliff" came entirely from data prep. In production CV processing, parsing is part of the model: a retrieval system can only be as sane as the spans it is asked to retrieve from. Diagnosing that correctly mattered more than any single filter, because the wrong diagnosis would have sent the next month chasing a generalisation problem that did not exist.
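The source does not show the fallback itself, so the following is only a sketch of the general idea: split on sentence boundaries where they exist, and when a span is still oversized, pack words into bounded windows. The regex, the 200-character limit, and the function names are assumptions.

```python
import re

# Split after sentence-ending punctuation followed by whitespace, or on newlines.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+|\n+")


def _window(piece: str, max_chars: int) -> list[str]:
    """Greedily pack words into chunks of at most max_chars characters."""
    chunks, current = [], ""
    for word in piece.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def split_spans(text: str, max_chars: int = 200) -> list[str]:
    """Sentence-split text, falling back to word windows for oversized spans."""
    spans = []
    for piece in _SENTENCE_END.split(text):
        piece = piece.strip()
        if not piece:
            continue
        if len(piece) <= max_chars:
            spans.append(piece)
        else:
            spans.extend(_window(piece, max_chars))  # fallback path
    return spans


# A CSV-style blob with no periods or newlines, like the broken external CVs.
blob = " ".join(["python sql kubernetes"] * 50)
print(len(blob), len(split_spans(blob)))
```

Without the fallback branch, the blob would come back as one span of more than a thousand characters, which is the condition that let the matcher fire on every common-noun label.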
The 7 kB head that earned its place
After the filter stack hit its local optimum, the remaining lift came from one genuinely different addition: a tiny learned classifier head.
The setup was deliberately modest. A CV sentence and a candidate skill label are each embedded with a small public sentence-transformer, concatenated, and scored by a logistic regression trained on ~19,600 LLM-judge-labelled examples. The artefact is 7 kB. It trains sub-second on a CPU and scores an emission in about 0.2 ms. That size is what lets it live inside an ingestion path rather than a research notebook.
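The shape of that head can be sketched without any dependencies. Everything concrete below is an assumption made for illustration: 4-dimensional random vectors stand in for the sentence-transformer embeddings, the labels are synthetic rather than judge-produced, and plain gradient descent replaces whatever solver was actually used.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def featurise(sentence_vec, label_vec):
    """Concatenate the sentence embedding and the label embedding."""
    return list(sentence_vec) + list(label_vec)


def score(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_head(X, y, l2=1e-3, lr=0.5, epochs=300):
    """L2-regularised logistic regression fitted by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = score(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Synthetic stand-in data: toy "embeddings" and a hidden linear labelling rule.
rng = random.Random(0)
true_w = [1.5, -1.0, 0.5, -0.5, 1.0, -1.5, 0.5, -0.5]
X, y = [], []
for _ in range(200):
    sentence_vec = [rng.gauss(0, 1) for _ in range(4)]
    label_vec = [rng.gauss(0, 1) for _ in range(4)]
    x = featurise(sentence_vec, label_vec)
    X.append(x)
    y.append(1 if sum(t * v for t, v in zip(true_w, x)) > 0 else 0)

w, b = train_head(X, y)
tau = 0.5
kept = [score(w, b, x) >= tau for x in X]
accuracy = sum(k == bool(t) for k, t in zip(kept, y)) / len(y)
print(f"{len(w) + 1} parameters, training accuracy {accuracy:.2f}")
```

The fitted model is just one weight per feature plus a bias, which is why an artefact of this kind stays in the kilobyte range and scores an emission with a single dot product.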
On the leak-resistant evaluation — where no CV's sentences appear in both training and test — the head lifts precision by +4.2 points, and it was the only such orthogonal lift in the whole 30-experiment programme. Every other late-stage idea (cross-encoder reranking, recalibration, query expansion, a local-LLM rescue route) was shadowed by the existing filters. The head worked because it learned something the deterministic filters could not: a boundary over the semantic relationship between a sentence and a label, not just the statistical risk of that label. It did not replace judgement everywhere; it made one retrieve-then-validate system good enough to change the cost curve.
The final operating point — and the honest trade
The shipped shape is retrieval, deterministic filters, calibration, and the 7 kB head. Zero LLM calls on the inference hot path. On a diverse final benchmark of 500 CVs across 43 industry categories it reaches about 96% precision at a hard-hallucination rate as low as 0.05%, and a single threshold (τ) acts as an explicit quality dial — raise it to trade recall for a cleaner emitted set.
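The dial behaviour of τ can be shown with a small sweep. The scores and judgements below are invented; only the mechanics (keep emissions scoring at or above τ, then measure precision and kept fraction) reflect the text.

```python
def sweep(scored, taus):
    """For each tau, return (tau, precision of kept emissions, kept fraction)."""
    rows = []
    for tau in taus:
        kept = [ok for s, ok in scored if s >= tau]
        precision = sum(kept) / len(kept) if kept else float("nan")
        rows.append((tau, precision, len(kept) / len(scored)))
    return rows


# Hypothetical (head_score, judged_correct) pairs for ten emissions.
scored = [
    (0.95, True), (0.90, True), (0.80, True), (0.75, False), (0.65, True),
    (0.60, True), (0.55, False), (0.40, False), (0.30, True), (0.20, False),
]

for tau, precision, kept_fraction in sweep(scored, [0.5, 0.7, 0.8]):
    print(f"tau={tau:.1f}  precision={precision:.2f}  kept={kept_fraction:.0%}")
# tau=0.5  precision=0.71  kept=70%
# tau=0.7  precision=0.75  kept=40%
# tau=0.8  precision=1.00  kept=30%
```

Raising τ shrinks the emitted set and, on this toy data, cleans it up; the emission at 0.30 that was actually correct is the recall being traded away.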
The economics moved from £1,800 to about £10 per million CVs, and from 4–8 seconds to about 50 ms per CV. That is roughly 180× cheaper and 100× faster.
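The multipliers follow directly from the figures quoted above. A quick check, using only those numbers:

```python
v1_cost_gbp, v3_cost_gbp = 1800.0, 10.0  # per million CVs, as quoted
v1_latency_s = (4.0, 8.0)                # per CV, as quoted
v3_latency_s = 0.050                     # about 50 ms per CV

cost_ratio = v1_cost_gbp / v3_cost_gbp
latency_low, latency_high = (t / v3_latency_s for t in v1_latency_s)

print(f"cost: {cost_ratio:.0f}x cheaper")
print(f"latency: {latency_low:.0f}x to {latency_high:.0f}x faster")
# cost: 180x cheaper
# latency: 80x to 160x faster
```

So "100× faster" is a round figure inside an 80× to 160× range, depending on which end of V1's 4–8 second band a CV fell.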
Two qualifiers belong on those headline multipliers, because the honest version of this result is a trade, not a free win. The 180× is measured against the as-shipped 2024 V1; against a hypothetical V1 rebuilt today with prompt caching, the cost gap would be closer to 30–60×. And the new system's ~96% is not a precision win over V1's ~98%, since V1 stays roughly 2 points ahead. The trade is explicit: about 2 points of precision surrendered for 180× lower cost, 100× lower latency, and a markedly lower hard-hallucination rate. For a recruiter product processing CVs at ingestion volume, that is the right trade. It is still a trade.
What changed is the operating shape, not just the price. Tags arrive fast enough to be useful during ingestion. Cost no longer scales with one LLM call per candidate skill. And the quality boundary is a visible threshold a recruiter-facing team can audit, tune, and defend, rather than an opaque model verdict.
The detail behind those headline numbers — and the experiments that didn't work — sits in the next two sections. If you only wanted the result, you have it: zero LLM on the hot path, ~180× cheaper, ~2pp behind the model it replaced. Read on for how hard the measurement had to be to trust that.
The measurement, in full
The headline lift comes from the leak-resistant split. Under GroupKFold by CV (no CV's sentences in both train and eval folds) the head lifts precision by +4.2 pp at τ=0.5 (96.62% pooled precision, 76% kept fraction) and +5.30 pp at τ=0.7 (98.00% pooled, 64% kept). A simpler stratified split gives a flattering +5.71 pp, but stratified splits leak: sentences from the same CV land on both sides. The portfolio claim is the GroupKFold number. At τ=0.5 the fold-to-fold standard deviation of precision was 0.43 pp, with a 95% CI on the lift of roughly [+3.83, +4.58] pp.
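The property that matters in a group split is simple to state and to check: every CV ID lands in exactly one test fold, and never on both sides of a split. The sketch below is a dependency-free, round-robin version written for illustration; scikit-learn's `GroupKFold` provides the same guarantee and additionally balances fold sizes. The `cv_ids` values are made up.

```python
def group_k_fold(groups, n_splits=5):
    """Yield (train_idx, test_idx) pairs with no group on both sides."""
    unique = sorted(set(groups))
    if n_splits > len(unique):
        raise ValueError("n_splits cannot exceed the number of distinct groups")
    # Assign whole groups to folds round-robin; rows follow their group.
    fold_of = {g: i % n_splits for i, g in enumerate(unique)}
    for fold in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, test


# One entry per (sentence, label) tuple; the value is the CV it came from.
cv_ids = ["cv1", "cv1", "cv2", "cv3", "cv3", "cv3", "cv4", "cv5", "cv5", "cv6"]

for train, test in group_k_fold(cv_ids, n_splits=3):
    train_cvs = {cv_ids[i] for i in train}
    test_cvs = {cv_ids[i] for i in test}
    assert not (train_cvs & test_cvs)  # no CV-level leakage
    print(sorted(test_cvs))
```

A stratified split makes no such promise: two sentences from the same CV can be separated, and the head then gets credit for recognising a CV it has already seen.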
A caveat travels with it. About 31% of the head's training tuples share their source corpus with the headline evaluation set. CV IDs are disjoint, so there is no CV-level leakage, but corpus-level surface conventions can travel, so +4.2 pp is best read as an upper bound on cross-corpus generalisation; a corpus-held-out retrain is expected to land closer to +3.5–3.8 pp.
On the final 500-CV benchmark stratified across 43 categories, V3+head reaches 96.06% precision / 0.21% hard hallucination / 18.7 skills per CV at τ=0.5, and 96.38% / 0.05% / 15.8 at τ=0.7. Precision rises and hallucination falls monotonically with τ. Breadth holds: no category sits below 92% at τ=0.5, 31 of 43 exceed 95%, and Wilson 95% lower bounds put 41 of 43 categories at ≥85% (lowest: Human Resources at an 85.3% lower bound, n=75).
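For readers who want to reproduce the per-category lower bounds, this is the standard Wilson score lower bound. In the example, n=75 matches the sample size quoted for Human Resources, but the count of 70 correct is an illustrative assumption, not a figure from the benchmark.

```python
import math


def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)


# Hypothetical counts: 70 of 75 emissions judged correct (93.3% observed).
print(round(wilson_lower_bound(70, 75), 3))  # 0.853
```

The gap between the observed rate and the lower bound is the price of a small sample: with only 75 observations, a category needs to score well above 85% to be credited with 85%.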
What did not work
Five of the nine final-wave experiments returned null or negative, and they belong in the record as much as the wins:
- Grey-zone-only LLM validation: zero discordant pairs at the survivor level — there was nothing left for it to fix.
- Post-stack cross-encoder rerank: Δ=0; the survivor pool is single-candidate by construction, so there is nothing to rerank.
- Stronger priors on the risk-score shrinkage: ≤0.17 pp, shadowed by the dominant cap.
- HyDE-style label expansion: −0.09 to −0.15 pp; it buys recall, not precision.
- A local 7B model as a rescue validator: agreed with the cloud LLM on only 47% of decisions and ran at ~4 requests/s on a shared GPU — failing on both quality and throughput.
The obvious model-shaped moves were tried. The ones that survived measurement stayed; the ones that did not were cut. None of the surviving machinery is novel: reciprocal rank fusion, isotonic calibration, empirical-Bayes shrinkage, a logistic-regression head on frozen embeddings, and a single LLM-as-judge are all borrowed, established techniques. The contribution is the programme: the disciplined search, the composition, and the negative-result accounting that found which borrowed pieces actually earned a place in production.
Limitations
- Single LLM judge. Every precision number, and the head's training labels, are mediated by one judge model at temperature 0. A 200-CV human spot-check was designed but not run, so there is no human-labelled validation yet.
- V1 comparison is cross-bench. V1's ~98% comes from a different internal benchmark; it was never re-run on the diverse 500-CV set. The ~2 pp gap is between distributions, not between systems on a shared bench.
- Real-PDF prose collapse. On a 155-CV real-PDF retest, hybrid precision fell to 62.83%, with one Tier-2 encoder collapsing on a single noisy corpus. This is the largest known production gap; a per-source minimum-cosine gate is the planned mitigation.
- The head's score is not a probability. Calibration error is high (ECE 0.36); the head is systematically under-confident. Its ranking is preserved, so τ-thresholding works, but the raw score should not be read as a literal probability without temperature scaling.
- Anglophone-Western coverage only. All five manifests and every evaluation CV are English-language. There is no production-volume non-English evaluation.
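On the calibration point above, a minimal sketch of a binned expected calibration error (ECE) shows what "under-confident" means in practice. The ten equal-width bins and the toy scores are assumptions; the source does not say how its ECE of 0.36 was binned.

```python
def expected_calibration_error(scores, outcomes, n_bins=10):
    """Weighted mean gap between mean score and observed accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for s, ok in zip(scores, outcomes):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the top bin
        bins[idx].append((s, ok))
    total = len(scores)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_score = sum(s for s, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_score)
    return ece


# Toy under-confident head: it says 0.6 but is right 9 times out of 10.
scores = [0.6] * 10
outcomes = [True] * 9 + [False]
print(round(expected_calibration_error(scores, outcomes), 2))  # 0.3
```

A head like this still ranks emissions sensibly, so thresholding on τ works, but reading 0.6 as "60% likely correct" would understate how often it is right.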
This is a strong result for retrieve-then-validate systems. It is not evidence that a small classifier head can replace an LLM everywhere.
When the model is the product, and when it is plumbing
The thesis is plain. An LLM on the runtime hot path was an authority knob, not just a cost knob — and most of that authority could be reconstructed from signals the pipeline already produced. The conclusion was never "LLMs are bad." It was narrower and more useful:
Use the LLM where it creates leverage. Remove it where cheaper evidence can carry the job.
For SThree's CV tagging, the LLM kept real value offline — generating training labels, supporting evaluation, building the judge corpus. It did not need to sit on the inference hot path for every CV. The long, plausible-looking tail of labels is usually where the confidently-wrong tags hide; sorting a pipeline's own emissions by historical risk and capping that tail cost three lines and bought twelve points, before anything learned got a vote.
That is the skill the cloud bill eventually charges you to learn: not how to call a frontier model, but when the model is the product, when it is the teacher, when it is the judge, and when it is expensive plumbing a 7 kB classifier can replace.
