AI evaluation systems

LLM Evaluation, Judge Panels, And Deep-Research Benchmarks

Peter builds evaluation systems that test the output, the judge, the citation trail, the operational gate, and the claims people are tempted to make from a score.

Code-owned checks

Praviar

Patent-intelligence support with code-owned checks, bounded legal reasoning, and internal benchmark discipline.

Launch-gated operations

Conversico

Voice AI case study where reliability, graceful failure, privacy, and operator readiness are treated as release conditions.

P0–P12 patterns

Deep-research benchmark

Around 13 orchestration patterns compared under one judging protocol with formal equivalence testing.

3-judge reliability panel

Judge-reliability discipline

GPT-5.2, Claude Opus 4.1, and Claude Sonnet 4.5, with Krippendorff's alpha and ICC agreement reported.

Citation checks

Citation accountability

Citation-provenance audits and FActScore/SAFE-style checks test whether a cited claim is actually supported.

9 optimizers benchmarked

Aurum self-improvement research

Evaluator-improvement claims held inside a synthetic proxy oracle, with a claim ledger and full negative-result reporting across Phase 0–4 and Exp5.

Continuous quality monitoring

Personal Assistant AI

Private assistant behavior reviewed through quality expectations, calibration, comparison, and regression alerts.

τ-thresholded precision

Skills Taxonomy judge limits

A single LLM judge scores every emission; the precision claim is reported with that limit named.

Evaluation inside regulated founder ventures

Praviar and Conversico extend the evaluation story into current founder ventures where the public writing lets people see the judgement without publishing the operating recipe.

  • Praviar frames AI as patent-intelligence support: code-owned verification and bounded reasoning assistance, not legal advice or a legal opinion.
  • Conversico frames voice AI as launch-gated operations: reliability, failure handling, privacy, and operator readiness before public adoption claims.
  • Both public case studies show engineering discipline while keeping the protected operating details private.

Quality monitoring for a private assistant

Personal Assistant AI carries the same evaluation instinct into a private assistant setting: when software holds personal context, prompt changes and model drift need monitoring rather than vibes.

  • Monitors real interactions against bounded quality expectations, with alerts when behavior drops against recent baselines.
  • Uses calibration and comparison workflows so behavior changes remain inspectable.
  • Frames the work as continuous quality monitoring with human calibration, not magical self-improvement or external certification.
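
As an illustration of the alerting idea above, here is a minimal sketch of a baseline-versus-recent comparison. The function name, window sizes, and the `max_drop` tolerance are hypothetical choices for this example, not the assistant's actual configuration, and the scores are made up.

```python
from statistics import mean

def regression_alert(scores, baseline_window=20, recent_window=5, max_drop=0.05):
    """Return True when the recent mean falls more than max_drop below the baseline mean.

    scores: chronological quality scores in [0, 1] (hypothetical scale).
    """
    if len(scores) < baseline_window + recent_window:
        return False  # not enough history to make a comparison
    recent = scores[-recent_window:]
    baseline = scores[-(baseline_window + recent_window):-recent_window]
    return mean(baseline) - mean(recent) > max_drop

# Made-up history: a stable stretch followed by a visible drop.
history = [0.90] * 20 + [0.80] * 5
print(regression_alert(history))
```

A fixed tolerance is the simplest possible rule; a real monitor would likely also account for score variance and for human calibration of the scores themselves.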

Comparing architectures, not opinions

The deep-research benchmark does not just build agents; it tests whether architecture claims survive a controlled comparison across a registry of patterns.

  • Benchmarks around 13 orchestration patterns (P0–P12), from simple single-pass pipelines to multi-step planning, search, synthesis, and verification.
  • Compares frontier hosted GPT-4o pipelines against local ~7B open models — Qwen2.5-7B and DeepResearcher-7B — as a deliberate cross-scale test.
  • Uses TOST and ROPE equivalence testing, so it can state when two patterns are practically equivalent rather than reading a small mean gap as a win.
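
To make the equivalence idea concrete, the following is a minimal sketch of a TOST (two one-sided tests) check on paired scores. It assumes a normal approximation, and the `margin` argument plays the role of the half-width of the region of practical equivalence. The function name, the scores, and the margin values are illustrative assumptions, not the benchmark's actual protocol or data.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def tost_paired(a, b, margin, alpha=0.05):
    """TOST on paired differences a[i] - b[i], using a normal approximation.

    Equivalence is declared only if both one-sided nulls are rejected:
    H0_lower: mean diff <= -margin, and H0_upper: mean diff >= +margin.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    d = mean(diffs)
    se = stdev(diffs) / sqrt(n)
    if se == 0:
        # No variation: the mean difference is known exactly.
        inside = abs(d) < margin
        return {"mean_diff": d, "p_value": 0.0 if inside else 1.0, "equivalent": inside}
    z = NormalDist()
    p_lower = 1 - z.cdf((d + margin) / se)  # evidence against diff <= -margin
    p_upper = z.cdf((d - margin) / se)      # evidence against diff >= +margin
    p = max(p_lower, p_upper)
    return {"mean_diff": d, "p_value": p, "equivalent": p < alpha}

# Made-up judge scores for two patterns on the same eight tasks.
pattern_a = [7.1, 6.8, 7.4, 7.0, 6.9, 7.2, 7.3, 6.7]
pattern_b = [7.0, 6.9, 7.3, 7.1, 6.8, 7.2, 7.4, 6.6]
print(tost_paired(pattern_a, pattern_b, margin=0.5))
```

The point of the design is that a small mean gap is never enough on its own: the verdict depends on the margin chosen in advance, and a tighter margin on the same data can fail to show equivalence.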

Judge reliability and citation accountability

The work treats LLM-as-judge as an object of evaluation, not a scoreboard, and follows the claims down into citation provenance.

  • Scores report quality with a three-judge panel (GPT-5.2, Claude Opus 4.1, Claude Sonnet 4.5) and reports Krippendorff's alpha and ICC inter-rater reliability.
  • Audits citation provenance: each citation in a generated report is checked for whether it actually supports the claim it is attached to.
  • Verifies factual accuracy with FActScore/SAFE-style atomic-fact decomposition rather than trusting a single quality score.
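
For readers unfamiliar with the agreement statistic named above, here is a minimal sketch of Krippendorff's alpha for interval-scaled ratings. It assumes each report is a list of numeric judge scores, with `None` for a missing rating. The function name and the example scores are illustrative, not the benchmark's data or code.

```python
from itertools import permutations

def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval (squared-difference) metric.

    units: one list of ratings per rated item; None marks a missing rating.
    """
    # Only items rated by at least two judges contribute pairable values.
    rated = [[v for v in u if v is not None] for u in units]
    rated = [u for u in rated if len(u) >= 2]
    values = [v for u in rated for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable ratings")
    # Observed disagreement: squared differences between judges within an item.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in rated
    ) / n
    # Expected disagreement: squared differences across all pooled ratings.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("ratings have no variation; alpha is undefined")
    return 1 - d_o / d_e

# Made-up scores: four reports, three judges each.
panel = [[8, 8, 7], [5, 6, 5], [9, 9, 9], [3, 4, 3]]
print(round(krippendorff_alpha_interval(panel), 3))
```

Alpha is 1 when judges agree perfectly and falls toward 0 (or below) as within-item disagreement approaches what random pairing of scores would produce, which is why it is reported alongside the scores rather than assumed.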

RL training, kept bounded

The benchmark also trains agents rather than only scoring them, and frames the training honestly as applied work with established methods.

  • Trains agents with GRPO reinforcement learning and fine-tunes the ~7B local models with QLoRA parameter-efficient adaptation.
  • Frames any equivalence to hosted pipelines through a chosen ROPE margin — a bounded "no worse than" claim, never a flat "the small model won".
  • Treats automated factuality and provenance checks as proxies that inherit the verifying model’s error modes, not as human ground truth.
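
As background on the training method named above, GRPO scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative normalisation step. The function name, the use of population standard deviation, and the `eps` constant are assumptions for illustration; the full policy update and the benchmark's actual reward design are not shown.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]

# Made-up rewards for four sampled responses to one prompt.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.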

LLM-judge limits in production ML

The Skills Taxonomy work is the production counterpart: an LLM judge can bootstrap labels and evaluation, but its limits have to be named before the system belongs in operations.

  • A single LLM-as-judge — Azure OpenAI gpt-5.2 at temperature 0 — scores every emission, so the headline uses the harder +4.2 pp GroupKFold-by-CV lift, not the easier +5.71 pp stratified split.
  • Claims from the diverse 500-CV benchmark are bounded: V1 was not re-run on it, so the V1 comparison is between distributions, not systems.
  • Roughly 31% of head training tuples share a source corpus with the evaluation, so the +4.2 pp is reported as an upper bound on cross-corpus generalisation.
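
The difference between the two splits can be sketched as follows, assuming "GroupKFold-by-CV" means every tuple from the same CV is kept on one side of the train/test boundary. This is a small standard-library stand-in written for illustration (libraries such as scikit-learn provide a `GroupKFold` splitter); the function name and the CV identifiers are hypothetical.

```python
from collections import defaultdict

def group_kfold(groups, n_splits=5):
    """Yield (train_idx, test_idx) pairs so that no group straddles the split.

    groups: one group label per sample, e.g. the CV each tuple came from.
    """
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    if len(by_group) < n_splits:
        raise ValueError("need at least as many groups as splits")
    # Greedy balancing: place the largest groups first into the smallest fold.
    folds = [[] for _ in range(n_splits)]
    for g in sorted(by_group, key=lambda g: -len(by_group[g])):
        min(folds, key=len).extend(by_group[g])
    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
        yield train, test

# Made-up example: seven tuples drawn from four CVs.
cv_ids = ["cv1", "cv1", "cv2", "cv2", "cv3", "cv3", "cv4"]
for train, test in group_kfold(cv_ids, n_splits=2):
    print("train", train, "test", test)
```

A stratified split can place tuples from the same CV in both train and test, which makes the evaluation easier; grouping removes that leakage within the dataset, though it does not address overlap at the level of the source corpus.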

Assistant evaluation harness

Graded against 38 cases, gated at ≥85%.

The profile AI is not shipped on vibes. An offline suite of 38 graded cases across 8 dimensions holds every dimension to a ≥85% pass-rate gate, run against the deployed assistant before release. The published dimensions are asserted against the live fixtures in the build, so this scorecard cannot drift from what is actually tested. Citations are checked against the site's source list, so fabricated sources are structurally impossible.

85% release gate
  • Factuality: 13 cases

    Answers match the verified knowledge base — no invented metrics, titles, employers, or dates.

  • Article grounding: 5 cases

    Blog-chat answers stay grounded in the specific article and honour its private boundaries.

  • Jailbreak resistance: 5 cases

    Resists prompt-injection and system-prompt-extraction attempts without breaking character.

  • Refusal correctness: 5 cases

    Declines out-of-scope or unknowable questions directly instead of guessing.

  • Empty-KB discipline: 4 cases

    When the knowledge base has nothing on a topic, it says so rather than fabricating.

  • Citation accuracy: 2 cases

    Citations resolve to real entries in the source registry — fabricated sources are structurally impossible.

  • Follow-up quality: 2 cases

    Suggested follow-up questions are relevant and grounded in the conversation.

  • Latency: 2 cases

    Responses stream within the latency budget.

How it is scored: fixed, repeatable checks rate whether answers are accurate, refuse what they should, resist attempts to break the rules, cite real sources, and respond fast — all against a set list of test questions. The published figure is the minimum bar, not a single run — the AI's scores vary from run to run, so the meaningful guarantee is the bar every measure must clear before release.
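
The per-dimension gate can be sketched as below. The case counts and the 85% threshold come from the scorecard above; the dictionary keys, function name, and data layout are hypothetical, and this is an illustration of the gating rule rather than the site's actual harness.

```python
# Case counts per dimension, as published in the scorecard.
CASES_PER_DIMENSION = {
    "factuality": 13,
    "article_grounding": 5,
    "jailbreak_resistance": 5,
    "refusal_correctness": 5,
    "empty_kb_discipline": 4,
    "citation_accuracy": 2,
    "follow_up_quality": 2,
    "latency": 2,
}
GATE = 0.85  # every dimension must reach this pass rate

def release_gate(passed, total=CASES_PER_DIMENSION, gate=GATE):
    """Return (ok, failing), where failing maps dimension -> pass rate below the gate.

    passed: number of passing cases per dimension (missing keys count as 0).
    """
    failing = {}
    for dim, n in total.items():
        rate = passed.get(dim, 0) / n
        if rate < gate:
            failing[dim] = rate
    return not failing, failing

# The published counts should add up to the 38-case suite.
assert sum(CASES_PER_DIMENSION.values()) == 38

# Made-up run: one factuality case fails, everything else passes.
run = dict(CASES_PER_DIMENSION, factuality=12)
print(release_gate(run))
```

Because the gate applies per dimension, small dimensions are strict: with 2 cases, a single failure gives 50% and blocks release, while the 13-case factuality dimension tolerates one failure (12/13 is about 92%) but not two (11/13 is about 85%, just under the bar).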