Multi-agent systems

Enterprise Agentic AI With Evaluation And Control

Aurum at SThree shows enterprise agentic AI engineering in a high-stakes domain: orchestration, evaluator architecture, self-improvement research, reviewable state, fairness diagnostics, red-team layers, and production control. Personal Assistant AI is the private daily-use counterpart. Conversico (with Greeta as the voice agent) adds real-time voice operations.

Real-time voice operations

Conversico (Greeta voice agent)

AI receptionist venture with bounded voice workflows, escalation logic, tenant privacy, and launch gates.

Enterprise agentic AI

Aurum

SThree employer work with controlled tests, evaluator architecture, reviewable state, fairness/red-team systems, and self-improvement research.

8 specialists · memory OS

Personal Assistant AI

A private multi-agent second brain with capture, durable recall, review, action, and quality monitoring.

Full recruitment lifecycle

Production platform

Candidate matching, outreach, scheduling, AI voice interviews, enrichment, and reporting.

Evaluator self-improvement research

Aurum research layer

GEPA prompt evolution, MultiStep evaluators, leakage audits, and Exp5 validation transfer.

P0–P12 patterns

Deep-research benchmark

A study of 13 orchestration patterns (P0–P12), from single-pass pipelines to multi-step agentic planning.

Zero LLM calls

Skills Taxonomy counterpoint

SThree CV skill tagging shows when the right move is removing the LLM from the hot path entirely.

Detailed plan

CivTech exploration plan

A human-in-the-loop impact-assessment architecture; exploration-stage strategy, not a delivered system.

Real-time voice operations

Conversico, with Greeta as the voice agent, shows multi-agent judgement under time pressure: the system has to recover missed calls, support bookings, handle uncertainty gracefully, and preserve practice-level privacy.

  • Public detail is limited to product-level behaviour: voice operations, booking support, escalation, privacy, and launch gates.
  • The system is described as in production and going to market; this is not a customer-adoption claim.
  • The private implementation keeps the product mechanics, sensitive materials, and exact workflows out of public view.

Private assistant architecture

Personal Assistant AI is the independent counterpoint to Aurum: not recruitment safety, but a private assistant case study for capture, memory, review, action, and proactive support.

  • Uses specialist agents only where decomposition improves control, while simpler paths stay simple.
  • Shows durable memory, review rhythm, and proactive background support without publishing the product recipe.
  • Keeps quality under review through monitoring, calibration, and regression detection.

Aurum architecture principles

Peter uses multi-agent systems when decomposition improves control, reliability, or the reviewer’s ability to understand what happened. In Aurum, the public architecture story is deliberately bounded because it is SThree employer work and the implementation remains employer IP.

  • Controlled tests, review surfaces, fairness diagnostics, and red-team testing are separated so accountability stays visible.
  • Structured handoffs and audit records make system behaviour inspectable without publishing the implementation recipe.
  • The public case study shows agentic AI judgement while keeping employer-IP implementation details private.

Production control plane

The production question is how to debug, evaluate, and govern the system after it leaves a notebook. Aurum treats runtime state as a first-class product concern.

  • Run history, event records, review bundles, and artifacts make long-running AI work inspectable.
  • Failure analysis is designed into the product rather than handled as an after-the-fact debugging exercise.
  • Human approval surfaces keep evaluator improvement reviewable rather than hidden.

Evaluator self-improvement research

A separate Aurum research layer shows Peter can build agentic AI evaluation architecture, not just orchestration. It studied reflective prompt evolution against a synthetic proxy oracle across a frozen 215-candidate snapshot.

  • Compared MultiStep evaluators, DSPy optimizers, GEPA prompt evolution, and top-3 ensembles across Phase 0–4 and four Exp5 validation experiments.
  • MultiStep+GEPA led Phase 4 on fitness (0.8507) and separation (24.77); the top-3 ensemble still beat it on Spearman, Kendall tau-b, and concordance in all five seeds.
  • Used DSPy input-leakage audits, prompt lineage, and a claim ledger, with full negative-result reporting from Phase 0 through Exp5 validation.
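
For readers less familiar with the rank-agreement metrics named above, the following is a minimal Python sketch of Spearman correlation, Kendall tau-b, and pairwise concordance between evaluator scores and oracle scores. It is illustrative only and is not Aurum code: the function names and toy data are hypothetical, and defining concordance as the share of untied pairs ordered the same way is an assumption, since the page does not define it.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b) if var_a and var_b else 0.0


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def pair_counts(x, y):
    """Count concordant, discordant, x-only-tied and y-only-tied pairs."""
    conc = disc = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both sides: contributes to no term
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            conc += 1
        else:
            disc += 1
    return conc, disc, ties_x, ties_y


def kendall_tau_b(x, y):
    conc, disc, ties_x, ties_y = pair_counts(x, y)
    denom = sqrt((conc + disc + ties_x) * (conc + disc + ties_y))
    return (conc - disc) / denom if denom else 0.0


def concordance(x, y):
    """Assumed definition: share of untied pairs ordered the same way."""
    conc, disc, _, _ = pair_counts(x, y)
    return conc / (conc + disc) if conc + disc else 0.0


# Hypothetical toy data: one adjacent swap relative to the oracle order.
oracle_scores = [1, 2, 3, 4, 5]
evaluator_scores = [1, 2, 3, 5, 4]
print(spearman(oracle_scores, evaluator_scores))
print(kendall_tau_b(oracle_scores, evaluator_scores))
print(concordance(oracle_scores, evaluator_scores))
```

The three metrics can disagree with a fitness or separation score because they only look at ordering, which is one way an ensemble can win on rank agreement while a single evaluator leads on fitness.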

Benchmarking architectures, not opinions

The deep-research benchmark shows the same instinct outside recruitment: orchestration claims need a controlled comparison, a shared tool layer, judge-reliability checks, and bounded interpretation.

  • Compared 13 orchestration patterns (P0–P12), from single-pass pipelines to multi-step agentic planning, search, synthesis, and verification.
  • Scored report quality with a three-judge panel (GPT-5.2, Claude Opus 4.1, Claude Sonnet 4.5) and reported Krippendorff's alpha and ICC reliability.
  • Used TOST/ROPE equivalence testing, citation-provenance audits, and FActScore/SAFE-style factuality checks rather than a leaderboard mean.
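
As a rough illustration of the equivalence-testing idea above, the sketch below runs a two one-sided tests (TOST) procedure on paired score differences between two patterns. It is a simplified sketch under stated assumptions, not the benchmark's code: it uses a large-sample normal approximation from the standard library where a real analysis would normally use the t distribution, and the function name, margin, and data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def tost_paired(diffs, margin, alpha=0.05):
    """TOST on paired differences using a normal approximation.

    Equivalence is claimed only if both one-sided nulls are rejected:
    mean difference <= -margin, and mean difference >= +margin.
    """
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired differences")
    se = stdev(diffs) / sqrt(n)
    if se == 0:
        raise ValueError("zero variance: the test statistic is undefined")
    d = mean(diffs)
    z_lower = (d + margin) / se  # tests H0: d <= -margin
    z_upper = (d - margin) / se  # tests H0: d >= +margin
    p_lower = 1 - NormalDist().cdf(z_lower)
    p_upper = NormalDist().cdf(z_upper)
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


# Hypothetical per-task score differences between two patterns.
small_diffs = [0.01, -0.02, 0.00, 0.02, -0.01, 0.01, -0.01, 0.00]
shifted_diffs = [d + 0.10 for d in small_diffs]

print(tost_paired(small_diffs, margin=0.05))
print(tost_paired(shifted_diffs, margin=0.05))
```

The design point is that a non-significant difference is not evidence of equivalence; TOST only reports two patterns as equivalent when the data rule out a difference larger than the chosen margin in either direction.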

When more agents are the wrong answer

The Skills Taxonomy work is the counterweight to agent-count storytelling. It shows Peter can decide when the right production move is retrieval, filters, calibration, and a small head rather than more LLM calls.

  • Moved SThree CV skill tagging from an LLM-per-skill validator to a pipeline with zero LLM calls on the inference hot path.
  • A three-line top-k cap by risk score added 12.03 pp of precision; the full seven-filter stack added 16.44 pp, all with no LLM calls.
  • Reports the open limits (single-judge evaluation and real-PDF prose drift) as part of the result.
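
To show how small a top-k cap can be, here is a minimal Python sketch of the general idea. It is an assumption-laden illustration, not the SThree pipeline: the `risk` field, its meaning (higher means more likely to be a false positive), the choice to keep the lowest-risk tags, and the sample data are all hypothetical.

```python
def cap_top_k(tags, k):
    """Keep the k candidate tags with the lowest risk score (assumed direction)."""
    ranked = sorted(tags, key=lambda tag: tag["risk"])
    return ranked[:k]


# Hypothetical candidate skill tags for one CV.
candidates = [
    {"skill": "python", "risk": 0.05},
    {"skill": "kubernetes", "risk": 0.40},
    {"skill": "sql", "risk": 0.10},
    {"skill": "leadership", "risk": 0.85},
]

kept = cap_top_k(candidates, k=2)
print([tag["skill"] for tag in kept])
```

A cap like this trades some recall for precision by discarding the tags the pipeline is least sure about, which is why it needs no model call at inference time.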
