Aurum at SThree: Enterprise Agentic AI Engineering

Agentic AI near hiring needs engineering discipline

Recruitment AI looks harmless right up until it matters. A model summarising a CV reads as admin. A model comparing interview notes reads as efficiency. But the moment those outputs influence who gets seen, who gets challenged, or who gets quietly pushed down the pile, the system is sitting near access to work. The usual way teams test that kind of software is nowhere near good enough: a few toy CVs, a reassuring demo, and a prompt that behaved itself once.

Aurum was my answer at SThree to a harder question. How do you build enterprise agentic AI for a hiring-adjacent workflow so that generation, evaluation, review, adversarial testing, and improvement are all inspectable before the system reaches real candidates?

What that became is a 25-agent evaluation workbench for generating and stress-testing a complete synthetic candidate lifecycle, plus a self-improvement research stream that raised evaluator mean fitness from a 0.627 baseline to 0.851 against a frozen synthetic proxy. The numbers below are the IP-safe shape of that work. Aurum is employer work at SThree, so this describes system scale and results, not the operating mechanics, internal designs, or build sequence that stay private.

Product boundary

Aurum is SThree employer work and is shown here as enterprise agentic AI engineering evidence. The public portfolio describes the capability and judgment behind the work; employer-IP details that would expose protected operating mechanics, infrastructure, or build sequence remain private.

Demo video coming soon

A public demo video is planned. It will use anonymised or synthetic fixtures and will show the product experience without exposing customer data, private operating details, or the implementation recipe.

The real problem is agentic evaluation infrastructure

Generating a plausible candidate is the easy version of the problem, and it is not enterprise AI engineering. The hard version is building an environment where agentic hiring workflows can be stressed, reviewed, compared, improved, and challenged before they affect a real person.

A useful test environment needs controlled candidate evidence, interview-style context, evaluator outputs, fairness diagnostics, adversarial examples, and audit records that a human can inspect later. It also has to separate what the system discovered from what it merely generated; otherwise synthetic data is impressive to look at and weak as evidence.

Aurum treats synthetic hiring scenarios as controlled test fixtures. The point is not to replace real applicants or certify a hiring system. The point is to create repeatable conditions where teams can ask sharper questions:

Does the evaluator treat different evidence types consistently?
Do non-traditional career paths get flattened into false risk?
Do demographic proxies or formatting artifacts move scores in suspicious ways?
Can adversarial text or hidden instructions manipulate the process?
Can a reviewer reconstruct what happened when a result looks wrong?

That is the engineering gap Aurum addresses. Recruitment AI does not need more uninspectable confidence; it needs a controlled way to find failure modes before deployment, and it needs the agentic system itself to produce state that can be reviewed rather than merely admired.

The product shape

Aurum is an enterprise agentic AI workbench for recruitment evaluation. It creates controlled hiring simulations, evaluates the behaviour of AI evaluators, surfaces fairness and consistency risks, records adversarial tests, preserves reviewable state, and keeps a human in the loop.

The public shape is a controlled lifecycle: role context becomes synthetic candidate evidence, interview-style context, multi-modal evaluator outputs, fairness diagnostics, and review artifacts. Across the runs that produced fairness artifacts, the system generated 219 candidate records over 11 fairness-bearing runs — enough volume to compare evaluator behaviour across roles and strength bands rather than reasoning from a single demo.

Every candidate gets a full interview, not just a CV, for one reason: interview evidence then stands as an independent modality instead of an artifact of the same selection step. That single decision is what makes the tri-modal comparison mean anything. The candidates are test fixtures; the product is the evaluation infrastructure around them.

Multi-agent work where it creates control

The 25 agents are not a swarm for its own sake. They are concentrated where disagreement, review, and challenge matter most: role understanding, synthetic evidence, interview simulation, multi-modal evaluation, fairness review, and adversarial probing. I am not publishing the allocation table or the contracts between them. The lesson worth carrying into other systems is simpler: evaluation gets extra structure because that is where disagreement has to be made visible before anyone trusts the output.

Hiring evaluation contains different kinds of work — understanding a role, creating test evidence, scoring across modalities, detecting risk patterns, preparing review artifacts, proposing adversarial probes. Blurred together, the system is hard to govern. Separated behind clear review boundaries, the work leaves artifacts a person can inspect. The agent count followed from that decomposition.

The same point holds beyond hiring. Agentic systems earn their keep when they sharpen who is responsible for what, and they become dangerous when they hide accountability behind a swarm of fluent components. Aurum is built as the former: bounded work, reviewable state, and enough of a record for a human to reconstruct why the system behaved as it did.

Fairness review without false certainty

The fairness layer runs seven test families over the completed candidate bundle, then splits the results by sample size into pass/fail checks and reporting-only metrics. A test running on too few candidates to be statistically meaningful contributes a number to look at rather than a pass or fail that would skew the overall rate. An underpowered test that quietly votes "pass" is how a fairness dashboard ends up lying with a green badge.
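
A minimal sketch of that split, assuming a fixed minimum sample size. The threshold of 30, the test family names, and the `FairnessResult` fields are all hypothetical; the source does not publish the real ones.

```python
from dataclasses import dataclass

# Hypothetical threshold: the real minimum sample size is not stated in the source.
MIN_GROUP_SIZE = 30


@dataclass
class FairnessResult:
    family: str       # illustrative test family name
    n: int            # candidates in the smallest group the test compared
    metric: float     # observed disparity metric
    threshold: float  # metric values at or below this count as a pass


def summarise(results):
    """Split results into gating checks and reporting-only metrics."""
    gating = [r for r in results if r.n >= MIN_GROUP_SIZE]
    reporting_only = [r for r in results if r.n < MIN_GROUP_SIZE]
    passed = [r for r in gating if r.metric <= r.threshold]
    # The pass rate covers gating checks only. None means no test was
    # powered enough to vote, which is different from "everything passed".
    pass_rate = len(passed) / len(gating) if gating else None
    return {
        "pass_rate": pass_rate,
        "gating": [r.family for r in gating],
        "reporting_only": {r.family: r.metric for r in reporting_only},
    }


results = [
    FairnessResult("score_gap_by_group", n=48, metric=0.03, threshold=0.05),
    FairnessResult("format_sensitivity", n=52, metric=0.08, threshold=0.05),
    FairnessResult("career_path_penalty", n=9, metric=0.01, threshold=0.05),
]
summary = summarise(results)
print(summary)
```

In this toy run the underpowered `career_path_penalty` check would have passed, but it is reported as a bare number and never counted, so the pass rate is one of two rather than two of three.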

That split is the difference between a diagnostic and a verdict. Fairness analysis in Aurum is review evidence — for finding suspicious patterns, comparing controlled runs, and pointing a human at where to look harder. On synthetic candidates it surfaces failure modes and catches regressions; it is not legal sign-off, certification, or a guarantee that a deployed hiring process is fair, and the synthetic data cannot make it one.

Red-team testing belongs inside the workflow

Recruitment AI has an unusual attack surface: the untrusted input is often the applicant artifact itself. A CV, portfolio, cover letter, or interview transcript can carry hidden instructions, formatting tricks, identity cues, or manipulative framing. If a hiring evaluator can be nudged by those artifacts, the risk is not theoretical.

Aurum brings adversarial testing into the same review environment as ordinary evaluation. The progression is deliberate: operator-chosen checks, gated agent-assisted probes, and bounded search over variants all sit under human control. Clean and challenged versions are compared through the same evaluation discipline, which makes the before-and-after comparable rather than anecdotal. The probes themselves stay out of this article so that it does not become an attack recipe.

The more autonomy a probe has, the more oversight sits in front of it. The point is comparable, reviewable attack evidence under human control.
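
The clean-versus-challenged comparison can be sketched as a paired check through one evaluator. This is an illustration only: `toy_evaluator` is a stand-in keyword counter, `MAX_SHIFT` is a hypothetical tolerance, and the challenged text is a harmless placeholder rather than a real probe.

```python
from typing import Callable

# Hypothetical tolerance: how far a score may move before a reviewer is alerted.
MAX_SHIFT = 2.0


def toy_evaluator(text: str) -> float:
    """Stand-in scorer for illustration only: counts evidence keywords."""
    keywords = ("delivered", "led", "shipped")
    return 10.0 * sum(text.lower().count(k) for k in keywords)


def compare(clean: str, challenged: str, evaluate: Callable[[str], float]) -> dict:
    """Score both versions with the same evaluator and record the shift."""
    clean_score = evaluate(clean)
    challenged_score = evaluate(challenged)
    shift = challenged_score - clean_score
    return {
        "clean_score": clean_score,
        "challenged_score": challenged_score,
        "shift": shift,
        "needs_review": abs(shift) > MAX_SHIFT,
    }


clean_cv = "Led a data migration and shipped it on time."
# Placeholder variant: repeated keywords appended to the same CV.
challenged_cv = clean_cv + " delivered delivered"
record = compare(clean_cv, challenged_cv, toy_evaluator)
print(record)
```

The record, not the score, is the output that matters here: a reviewer sees both scores, the size of the shift, and a flag asking for a human look.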

Production control is a design requirement

Recruitment AI safety cannot be solved by replacing one opaque decision-maker with another. The product has to give humans enough structure to review what happened, challenge the system, and decide what should change.

That is why Aurum is a product rather than a notebook. It has surfaces for configuring controlled tests, monitoring runs, inspecting evidence, reviewing risk signals, and exporting artifacts for later discussion. The system records enough state early enough that a run can still be reconstructed when something fails. Agentic AI workflows near a hiring decision need that: reviewable interfaces, durable records, and operational control, not just backend experiments. The full live interface stays out of the portfolio because the product is employer-IP bounded and the demo surface is being prepared separately.

Self-improving evaluators, without self-delusion

The hardest research question Aurum raised is the one that justifies all the infrastructure above: can an evaluator improve itself without quietly cheating?

A self-improving evaluator is easy to sell and easy to fool. Let it run, score its own failures, ask an optimizer for a better prompt, test the new version, keep the improvement, and the number goes up. The trouble is that the number can go up for the wrong reasons — by compressing scores into a safer range, by getting better at recovering a synthetic label it was never supposed to see, by improving mean fitness while ranking quietly gets worse. In recruitment, an evaluator that improves the wrong thing is a liability with a more convincing score.

So this stream was framed as a measurement problem before an optimisation one. The question was narrow and answerable: can evaluator architecture and reflective prompt evolution improve calibration and separation against a frozen synthetic proxy oracle, inside a checked-in benchmark, without the proxy label leaking into the evaluator's inputs? It is not a hiring-validity question, and it never pretends to be.

Results at a glance

What the self-improvement study actually showed

Pipeline result files: completed Phase 0-4 runs
Completed-run candidates: 215
Exp5 validation units: 20/20, zero failed

Phase 4 selection: MultiStep+GEPA
Mean fitness 0.8507; strong/weak separation 24.77
Came out top on specific CV-only generated-registry metrics; the top-3 ensemble still retained better rank metrics.

Exp5 validation: Fusion transfer
Mean fitness 0.8558; pipeline MAE 5.96; trajectory accuracy 0.5914
Came out top on the selected aggregate metrics; interview-only transfer was weaker on rank and separation.

Boundary: this is a bounded synthetic proxy-oracle result set. It does not establish real hiring validity, human ground truth, job-performance truth, legal sign-off, production temperature settings, or universal GEPA superiority.

The labels are the synthetic strength_level — weak, medium, or strong — assigned during candidate generation, before any evaluator scores anything. That label is not independent of the CV and interview artifacts, because those artifacts were conditioned on it. So the benchmark tests recovery of intended synthetic strata, not discovery of real-world quality. That is a bounded, honest thing to measure: whether the evaluator can rebuild strata it never saw, not whether those strata track the world.

The gate that makes any of it trustworthy is the leakage gate, enforced in code. The data loader carries strength_level and prior scores for metrics and stratification, but only cv_content, interview_transcript, job_title, and job_requirements are ever marked model-visible. The hidden label is never handed to the evaluator, and an input-leakage audit makes that mechanically checkable. (A MultiStep evaluator may produce its own strength_level_estimate — that is allowed, because it is inferred from visible evidence, and it is not the hidden label.)
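
A minimal sketch of such a gate, assuming records are plain dictionaries. The four visible field names and `strength_level` come from the description above; `prior_scores`, the function names, and the sample values are hypothetical.

```python
# Allowlist of fields the evaluator may see, as named in the text.
MODEL_VISIBLE = frozenset(
    {"cv_content", "interview_transcript", "job_title", "job_requirements"}
)


def model_inputs(record: dict) -> dict:
    """Return only the fields the evaluator is allowed to see."""
    return {k: v for k, v in record.items() if k in MODEL_VISIBLE}


def audit_inputs(inputs: dict) -> None:
    """Fail loudly if anything outside the allowlist reaches the evaluator."""
    leaked = set(inputs) - MODEL_VISIBLE
    if leaked:
        raise ValueError(f"non-visible fields in evaluator inputs: {sorted(leaked)}")


record = {
    "cv_content": "Synthetic CV text",
    "interview_transcript": "Synthetic interview transcript",
    "job_title": "Data Engineer",
    "job_requirements": "Python, SQL",
    "strength_level": "strong",  # hidden label, kept for metrics only
    "prior_scores": [71.0],      # assumed field name for the prior scores
}

visible = model_inputs(record)
audit_inputs(visible)  # passes silently: only allowlisted keys remain
print(sorted(visible))
```

An allowlist is used rather than a blocklist so that a newly added field is hidden by default. A key-level audit like this one does not detect a label that has been written into the CV or transcript text itself; that would need a separate content check.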

On top of that gate sits the comparison. Five evaluator designs — from default single-step up to a three-stage MultiStep evaluator and a top-3 ensemble — were run against a broad optimiser set: BootstrapFewShot, MIPROv2, SIMBA, InferRules, and GEPA, the genetic-Pareto reflective optimiser at the centre of it. GEPA earns that place because its learning medium is natural-language reflection over what failed, which is the shape of evaluator diagnosis: it does not just search the prompt space, it leaves a trail of why each prompt changed. The fitness function was written to be argued with — score-accuracy against explicit anchor bands, a direction penalty when a weak candidate outscores the medium anchor, and an evidence-quality term — so you can see what the optimiser is chasing and tell a prompt that improved calibration from one that merely reshuffled ranks.
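
To make the three terms concrete, here is a sketch of a fitness function with that shape. Every number in it is an illustrative assumption: the anchor bands, the medium anchor of 55, the weights, and the penalty size are not Aurum's. The label is used only by the metric, never by the evaluator.

```python
# All numbers below are illustrative assumptions, not Aurum's real bands or weights.
ANCHOR_BANDS = {"weak": (0.0, 40.0), "medium": (40.0, 70.0), "strong": (70.0, 100.0)}
MEDIUM_ANCHOR = 55.0
W_ACCURACY = 0.7
W_EVIDENCE = 0.3
DIRECTION_PENALTY = 0.25


def band_accuracy(score: float, label: str) -> float:
    """1.0 inside the label's anchor band, falling linearly with distance outside it."""
    low, high = ANCHOR_BANDS[label]
    if low <= score <= high:
        return 1.0
    distance = low - score if score < low else score - high
    return max(0.0, 1.0 - distance / 100.0)


def fitness(score: float, label: str, evidence_quality: float) -> float:
    """Combine score accuracy, an evidence-quality term, and a direction penalty."""
    value = W_ACCURACY * band_accuracy(score, label) + W_EVIDENCE * evidence_quality
    # Direction penalty: a weak candidate scored above the medium anchor
    # is a ranking error, not just a calibration miss.
    if label == "weak" and score > MEDIUM_ANCHOR:
        value -= DIRECTION_PENALTY
    return max(0.0, min(1.0, value))


in_band = fitness(30.0, "weak", evidence_quality=0.8)
wrong_direction = fitness(60.0, "weak", evidence_quality=0.8)
print(round(in_band, 2), round(wrong_direction, 2))
```

Keeping the terms separate is what lets a reviewer see which one moved when fitness changes: the same evidence quality gives a much lower value once the weak candidate crosses the medium anchor.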

The result is strong and deliberately narrow. MultiStep decomposition lifted baseline fitness from 0.627 to 0.790 before any optimiser ran — architecture, not prompt search, doing the early work. Adding GEPA produced the best CV-only design: 0.851 mean fitness and 24.77 strong-versus-weak separation, up from 16.4 points of separation at baseline. But MultiStep+GEPA does not win everything. A top-3 ensemble beats it on every rank metric — Spearman, Kendall tau-b, concordance — in all five seeds. A validation layer then moved the headline again: the best validated row was fusion transfer at 0.8558 fitness and 5.96 pipeline MAE, while interview-only transfer stayed weaker, so the work claims no uniform modality transfer. The whole campaign earned a bounded readiness score of 96.2 out of 100 — a measure of how well the evidence package documents itself, not a certification of hiring validity.
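
Why mean separation and rank metrics can disagree is easier to see on a toy example. The sketch below assumes separation is the difference between mean strong and mean weak scores, and uses one common definition of concordance (the share of differently-labelled pairs ordered correctly); the source defines neither, and the scores are invented.

```python
from itertools import combinations
from statistics import mean

ORDER = {"weak": 0, "medium": 1, "strong": 2}


def separation(scored):
    """Mean strong score minus mean weak score (assumed definition)."""
    strong = [s for s, label in scored if label == "strong"]
    weak = [s for s, label in scored if label == "weak"]
    return mean(strong) - mean(weak)


def concordance(scored):
    """Share of differently-labelled pairs that the scores order correctly."""
    concordant = comparable = 0
    for (s1, l1), (s2, l2) in combinations(scored, 2):
        if ORDER[l1] == ORDER[l2]:
            continue  # same stratum: the pair carries no ordering information
        comparable += 1
        # Tied scores count as not concordant.
        if (s1 - s2) * (ORDER[l1] - ORDER[l2]) > 0:
            concordant += 1
    return concordant / comparable


# Invented scores: one weak candidate outscores the medium one.
scored = [(82.0, "strong"), (74.0, "strong"), (61.0, "medium"),
          (66.0, "weak"), (48.0, "weak")]
print(separation(scored), concordance(scored))
```

Here the strong and weak means are 21 points apart, yet one of the eight comparable pairs is in the wrong order. A design can widen the first number without fixing the second, which is why both are reported.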

The platform lesson under all of that is governance. If an evaluator changes, a reviewer should be able to see what changed, why, what evidence supported it, and what limits still apply — so prompt lineage and prompt diffs are part of the evidence package, not throwaway text. A hiring-adjacent system should never silently rewrite its own judgment criteria and call it progress. Improvement is allowed here only with lineage, bounded claims, and human review.
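
One way to keep prompt lineage and diffs as evidence is an immutable revision record that points at its parent. This is a sketch under assumptions: the `PromptRevision` class, its fields, and the example prompts are hypothetical, and only Python's standard `difflib` and `hashlib` are used.

```python
import difflib
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PromptRevision:
    text: str
    reason: str                                # why the prompt changed
    parent: Optional["PromptRevision"] = None  # previous revision, if any
    approved_by: Optional[str] = None          # human reviewer sign-off

    @property
    def revision_id(self) -> str:
        """Short content hash, so identical text always maps to the same id."""
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()[:12]

    def diff(self) -> str:
        """Unified diff against the parent revision (empty parent if none)."""
        before = self.parent.text.splitlines() if self.parent else []
        after = self.text.splitlines()
        lines = difflib.unified_diff(
            before, after, fromfile="parent", tofile="revision", lineterm=""
        )
        return "\n".join(lines)


base = PromptRevision(
    text="Score the CV from 0 to 100.\nCite evidence.",
    reason="baseline",
)
revised = PromptRevision(
    text="Score the CV from 0 to 100.\nCite evidence for every claim.",
    reason="reflection: scores lacked cited evidence",
    parent=base,
    approved_by="reviewer",
)
print(revised.revision_id, revised.reason)
print(revised.diff())
```

Freezing the record means a revision cannot be edited after review; a change has to appear as a new revision with its own reason and approval.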

What this adds up to

Pull the pieces together: 25 agents, 219 candidate records over 11 fairness-bearing runs, seven fairness test families with the underpowered ones held out of the pass rate, red-team evidence under human oversight, and an evaluator that moved from 0.627 to 0.851 mean fitness — with the rank-metric loss reported next to the win, and the proxy label held out of its inputs the whole time. The systems engineering and the research sit on the same discipline: the work leaves artifacts a reviewer can inspect.

The product engineering is what turns that into something a person uses rather than a folder of experiments. An abstract trust problem becomes a workbench with controlled test evidence, review surfaces, and audit artifacts, where synthetic results stay synthetic, a human stays in the loop, and compliance stays a deployment-context question a demo cannot answer. The whole discipline comes down to one rule: the number going up only counts if you can show why it moved.

What stays private

This is not a build manual. It leaves out the operating mechanics, internal data structures, adversarial recipe details, vendor and model choices, infrastructure, cost assumptions, private screenshots, customer data, the evaluator rubrics and datasets, and the sequence someone could follow to reproduce the system. The scale and the results are the public surface; the recipe is employer IP. A demo video using synthetic fixtures is planned to show the product experience without crossing that line.