AI evaluation systems

LLM Evaluation, Judge Panels, And Deep-Research Benchmarks

Peter builds evaluation systems that test the output, the judge, the citation trail, the operational gate, and the claims people are tempted to make from a score.

Code-owned checks

Praviar

Patent-intelligence support with code-owned checks, bounded legal reasoning, and internal benchmark discipline.

Launch-gated operations

Conversico

Voice AI case study where reliability, graceful failure, privacy, and operator readiness are treated as release conditions.

P0–P12 patterns

Deep-research benchmark

Thirteen orchestration patterns (P0–P12) compared under one judging protocol with formal equivalence testing.

3-judge reliability panel

Judge-reliability discipline

GPT-5.2, Claude Opus 4.1, and Claude Sonnet 4.5, with Krippendorff's alpha and ICC agreement reported.

Citation checks

Citation accountability

Citation-provenance audits and FActScore/SAFE-style checks test whether a cited claim is actually supported.

9 optimizers benchmarked

Aurum self-improvement research

Evaluator-improvement claims held inside a synthetic proxy oracle, with a claim ledger and full negative-result reporting across Phase 0–4 and Exp5.

Continuous quality monitoring

Personal Assistant AI

Private assistant behavior reviewed through quality expectations, calibration, comparison, and regression alerts.

τ-thresholded precision

Skills Taxonomy judge limits

A single LLM judge scores every emission; the precision claim is reported with that limit named.

Evaluation inside regulated founder ventures

Praviar and Conversico extend the evaluation story into current founder ventures where the public writing lets people see the judgement without publishing the operating recipe.

Praviar frames AI as patent-intelligence support: code-owned verification and bounded reasoning assistance, not legal advice or a legal opinion.
Conversico frames voice AI as launch-gated operations: reliability, failure handling, privacy, and operator readiness before public adoption claims.
Both public case studies show engineering discipline while keeping the protected operating details private.

Quality monitoring for a private assistant

Personal Assistant AI carries the same evaluation instinct into a private assistant setting: when software holds personal context, prompt changes and model drift need monitoring rather than vibes.

Monitors real interactions against bounded quality expectations, with alerts when behavior drops against recent baselines.
Uses calibration and comparison workflows so behavior changes remain inspectable.
Frames the work as continuous quality monitoring with human calibration, not magical self-improvement or external certification.
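
As a rough illustration of the "alert when behavior drops against recent baselines" idea, here is a minimal sketch. The function name, the 0.05 drop threshold, and the example scores are all hypothetical; the source does not say how the real monitor windows or scores interactions.

```python
from statistics import mean


def regression_alert(baseline_scores, recent_scores, max_drop=0.05):
    """Return True when the recent mean falls more than max_drop below the baseline mean.

    Both inputs are lists of quality scores on the same scale (assumed 0..1 here).
    max_drop is an illustrative tolerance, not a value taken from the source.
    """
    if not baseline_scores or not recent_scores:
        raise ValueError("both windows need at least one score")
    return mean(baseline_scores) - mean(recent_scores) > max_drop


# Hypothetical scores: a baseline window and a more recent window.
baseline = [0.90, 0.88, 0.92, 0.91]
recent = [0.80, 0.82, 0.79]
print("alert:", regression_alert(baseline, recent))
```

A fixed mean-drop threshold is the simplest possible rule; a real monitor would likely also account for window size and score variance before alerting.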

Comparing architectures, not opinions

The deep-research benchmark does not just build agents; it tests whether architecture claims survive a controlled comparison across a registry of patterns.

Benchmarks 13 orchestration patterns (P0–P12), from simple single-pass pipelines to multi-step planning, search, synthesis, and verification.
Compares frontier hosted GPT-4o pipelines against local ~7B open models — Qwen2.5-7B and DeepResearcher-7B — as a deliberate cross-scale test.
Uses TOST and ROPE equivalence testing, so it can state when two patterns are practically equivalent rather than reading a small mean gap as a win.
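
To make the equivalence idea concrete, here is a minimal sketch of a two one-sided tests (TOST) procedure on paired score differences between two patterns. It is an illustration, not the benchmark's code: it uses a large-sample normal approximation instead of a t distribution, and the function name, margin, and data are assumptions. The same margin is what a ROPE would be built from in a Bayesian analysis.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def tost_paired(diffs, margin, alpha=0.05):
    """TOST for equivalence of paired differences within +/- margin.

    Returns (p_value, equivalent). Uses a normal approximation, which is an
    assumption suited to larger samples; a t-based test is stricter for small n.
    """
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired differences")
    se = stdev(diffs) / sqrt(n)
    if se == 0:
        raise ValueError("differences have zero variance")
    d = mean(diffs)
    # H0 (lower): true difference <= -margin. Reject when d is well above -margin.
    p_lower = 1 - NormalDist().cdf((d + margin) / se)
    # H0 (upper): true difference >= +margin. Reject when d is well below +margin.
    p_upper = NormalDist().cdf((d - margin) / se)
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


# Hypothetical per-question score differences between two patterns.
diffs = [0.01, -0.02, 0.00, 0.02, -0.01, 0.01, 0.00, -0.01, 0.02, -0.02]
p_value, equivalent = tost_paired(diffs, margin=0.05)
print(p_value, equivalent)
```

Equivalence is only declared when both one-sided tests reject, which is why a small mean gap on its own is never enough to call two patterns the same or to call one a winner.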

Judge reliability and citation accountability

The work treats LLM-as-judge as an object of evaluation, not a scoreboard, and follows the claims down into citation provenance.

Scores report quality with a three-judge panel (GPT-5.2, Claude Opus 4.1, Claude Sonnet 4.5) and reports Krippendorff's alpha and ICC inter-rater reliability.
Audits citation provenance: each citation in a generated report is checked for whether it actually supports the claim it is attached to.
Verifies factual accuracy with FActScore/SAFE-style atomic-fact decomposition rather than trusting a single quality score.
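
For readers unfamiliar with the agreement statistic, the following is a minimal sketch of Krippendorff's alpha with the interval (squared-difference) metric. It assumes numeric judge scores and is not the benchmark's implementation; the function name and the example scores are hypothetical.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha for interval data.

    units: one list of numeric scores per rated item, one score per judge.
    Items with fewer than two scores are not pairable and are skipped.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable scores")
    # Observed disagreement: squared differences within each item.
    observed = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: squared differences across all pooled scores.
    expected = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when all scores are identical")
    return 1 - observed / expected


# Hypothetical scores: four reports, three judges each.
scores = [[7, 8, 7], [4, 5, 4], [9, 9, 8], [3, 2, 3]]
print(krippendorff_alpha_interval(scores))
```

An alpha of 1 means the judges agree perfectly; values near 0 mean agreement is no better than chance, which is the case where a panel average should not be treated as a reliable score.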

RL training, kept bounded

The benchmark also trains agents rather than only scoring them, and frames the training honestly as applied work with established methods.

Trains agents with GRPO reinforcement learning and fine-tunes the ~7B local models with QLoRA parameter-efficient adaptation.
Frames any equivalence to hosted pipelines through a chosen ROPE margin — a bounded "no worse than" claim, never a flat "the small model won".
Treats automated factuality and provenance checks as proxies that inherit the verifying model’s error modes, not as human ground truth.
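
The core step that gives GRPO its name is the group-relative advantage: several completions are sampled for one prompt, and each reward is normalised against the group's mean and spread. The sketch below shows only that step. The reward values, the use of population standard deviation, and the epsilon are assumptions; the source does not describe the benchmark's reward function or hyperparameters.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise one prompt's sampled rewards to zero mean and unit spread.

    rewards: one scalar reward per sampled completion for the same prompt.
    eps guards against division by zero when every reward is identical.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled completions of one prompt.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because the baseline is the group itself, no separate value network is needed, which is part of why the method is practical for ~7B models trained with parameter-efficient adapters.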

LLM-judge limits in production ML

The Skills Taxonomy work is the production counterpart: an LLM judge can bootstrap labels and evaluation, but its limits have to be named before the system belongs in operations.

A single LLM-as-judge (Azure OpenAI gpt-5.2 at temperature 0) scores every emission. Given that limit, the headline uses the harder +4.2 pp GroupKFold-by-CV lift, not the easier +5.71 pp stratified split.
The diverse 500-CV benchmark is bounded because V1 was not re-run on it, so the V1 comparison is between distributions, not systems.
Roughly 31% of head training tuples share a source corpus with the evaluation, so the +4.2 pp is reported as an upper bound on cross-corpus generalisation.
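
The reason the GroupKFold-by-CV number is the harder one is that no CV can contribute tuples to both the training and the evaluation side of a fold. The sketch below is a stdlib-only illustration of that constraint, not the project's code; scikit-learn's `GroupKFold` provides the same guarantee. The function name and the CV identifiers are hypothetical.

```python
from collections import defaultdict


def group_kfold(groups, n_splits):
    """Yield (train_idx, test_idx) pairs with every group confined to one fold.

    groups: one group label (here, a CV id) per sample.
    """
    by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)
    if n_splits < 2 or n_splits > len(by_group):
        raise ValueError("n_splits must be between 2 and the number of groups")
    folds = [[] for _ in range(n_splits)]
    # Greedy balancing: place the largest groups first into the smallest fold.
    for label in sorted(by_group, key=lambda g: (-len(by_group[g]), str(g))):
        smallest = min(range(n_splits), key=lambda k: len(folds[k]))
        folds[smallest].extend(by_group[label])
    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j in range(n_splits) if j != k for i in folds[j])
        yield train, test


# Hypothetical CV ids: several tuples can come from the same CV.
cv_ids = ["cv1", "cv1", "cv2", "cv3", "cv3", "cv4"]
for train, test in group_kfold(cv_ids, n_splits=2):
    print("train:", train, "test:", test)
```

A stratified split, by contrast, can put tuples from the same CV on both sides, which lets the model benefit from per-CV regularities and is consistent with the larger +5.71 pp figure being the easier one.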

Continue exploring

Related work

Personal Assistant AI

Praviar patent intelligence

Conversico

Deep-research benchmark article

Multi-agent systems

Aurum self-improvement research

Skills Taxonomy hot-path removal

AI engineer profile

Graded against 38 cases, gated at ≥85%.