Code-owned checks
Praviar
Patent-intelligence support with code-owned checks, bounded legal reasoning, and internal benchmark discipline.
AI evaluation systems
Peter builds evaluation systems that test the output, the judge, the citation trail, the operational gate, and the claims people are tempted to make from a score.
Launch-gated operations
Voice AI case study where reliability, graceful failure, privacy, and operator readiness are treated as release conditions.
P0–P12 patterns
Around 13 orchestration patterns compared under one judging protocol with formal equivalence testing.
3-judge reliability panel
GPT-5.2, Claude Opus 4.1, and Claude Sonnet 4.5, with Krippendorff's alpha and ICC agreement reported.
Citation checks
Citation-provenance audits and FActScore/SAFE-style checks test whether a cited claim is actually supported.
9 optimizers benchmarked
Evaluator-improvement claims held inside a synthetic proxy oracle, with a claim ledger and full negative-result reporting across Phase 0–4 and Exp5.
Continuous quality monitoring
Private assistant behavior reviewed through quality expectations, calibration, comparison, and regression alerts.
τ-thresholded precision
A single LLM judge scores every emission; the precision claim is reported with that limit named.
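As a rough illustration of a τ-thresholded precision figure, the sketch below assumes the judge returns a score in [0, 1] for each emission and that an emission counts as correct when its score is at or above τ. The function name, the score scale, and the sample values are assumptions for illustration, not details taken from the project.

```python
def precision_at_tau(judge_scores, tau):
    """Share of emissions whose judge score is at or above tau.

    Assumption: scores are floats in [0, 1] produced by one LLM judge.
    Returns None when there are no emissions, since precision is undefined.
    """
    if not judge_scores:
        return None
    accepted = sum(1 for score in judge_scores if score >= tau)
    return accepted / len(judge_scores)


# Hypothetical judge scores for four emissions.
scores = [0.9, 0.4, 0.8, 0.7]
print(precision_at_tau(scores, tau=0.75))  # 2 of the 4 scores clear the threshold
```

Because every score comes from the same judge, a number computed this way inherits that judge's biases, which is why the limit is named alongside the claim.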
Praviar and Conversico extend the evaluation story into current founder ventures where the public writing lets people see the judgement without publishing the operating recipe.
Personal Assistant AI carries the same evaluation instinct into a private assistant setting: when software holds personal context, prompt changes and model drift need monitoring rather than vibes.
The deep-research benchmark does not just build agents; it tests whether architecture claims survive a controlled comparison across a registry of patterns.
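One common way to test whether two patterns are equivalent, rather than merely "not significantly different", is the two one-sided tests (TOST) procedure. The page does not say which procedure the benchmark uses, so the sketch below is only an assumed illustration: it pairs scores from two hypothetical patterns on the same tasks and uses a normal approximation, which is reasonable for large samples only (a t quantile would be the safer choice for small ones). The margin and the sample scores are made up.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def tost_paired(scores_a, scores_b, margin, alpha=0.05):
    """Paired TOST via confidence-interval inclusion (normal approximation).

    Declares equivalence when the (1 - 2*alpha) confidence interval for the
    mean paired difference lies strictly inside (-margin, +margin).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired scores")
    standard_error = stdev(diffs) / sqrt(n)
    z = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    lower = mean(diffs) - z * standard_error
    upper = mean(diffs) + z * standard_error
    return (-margin < lower and upper < margin), (lower, upper)


# Hypothetical per-task scores for two orchestration patterns.
pattern_a = [0.70, 0.72, 0.68, 0.71, 0.69, 0.70]
pattern_b = [0.71, 0.70, 0.69, 0.70, 0.70, 0.69]
equivalent, interval = tost_paired(pattern_a, pattern_b, margin=0.05)
print(equivalent, interval)
```

The equivalence margin has to be fixed before looking at the results; choosing it afterwards would turn the test back into a scoreboard.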
The work treats LLM-as-judge as an object of evaluation, not a scoreboard, and follows the claims down into citation provenance.
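Treating the judge as an object of evaluation usually starts with an agreement statistic across judges. The sketch below computes Krippendorff's alpha with the interval (squared-difference) metric from scratch; the input layout (one list of ratings per judged item) and the sample ratings are assumptions for illustration, and a maintained statistics package would be the safer choice for real reporting.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval (squared-difference) metric.

    `units` is a list of lists: one inner list of numeric ratings per judged
    item. Items with fewer than two ratings are not pairable and are skipped.
    """
    pairable = [u for u in units if len(u) >= 2]
    values = [v for u in pairable for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one item with two or more ratings")

    # Observed disagreement: ordered pairs within each item, weighted 1/(m-1).
    observed = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in pairable
    ) / n

    # Expected disagreement: ordered pairs across all pooled ratings.
    expected = sum(
        (a - b) ** 2 for a, b in permutations(values, 2)
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when every rating is identical")
    return 1 - observed / expected


# Hypothetical ratings: three judges scoring three items on a 1-5 scale.
print(krippendorff_alpha_interval([[4, 4, 5], [2, 3, 2], [5, 5, 5]]))
```

An alpha of 1 means the judges agree perfectly, while values near 0 mean agreement is no better than chance.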
The benchmark also trains agents rather than only scoring them, and frames the training honestly as applied work with established methods.
The Skills Taxonomy work is the production counterpart: an LLM judge can bootstrap labels and evaluation, but its limits have to be named before the system belongs in operations.
Continue exploring
Ongoing private-assistant case study with quality monitoring and human calibration built into the product loop.
Bounded pharmaceutical patent-intelligence case study with fixed-check discipline.
Real-time voice operations case study with launch gates and demo video coming soon.
The public narrative of the deep-research benchmark study and its boundaries.
How controlled evaluation changes the agentic architecture story.
A separate evaluator-optimization project with synthetic proxy-oracle boundaries.
How LLM-judge labels, a tiny head, and explicit boundaries changed a production CV tagging pipeline.
Broader senior AI engineering context.
Assistant evaluation harness
The profile AI is not shipped on vibes. An offline suite of 38 graded cases across 8 dimensions holds every dimension to a ≥85% pass-rate gate, run against the deployed assistant before release. The published dimensions are asserted against the live fixtures in the build, so this scorecard cannot drift from what is actually tested. Citations are checked against the site's source list, so fabricated sources are structurally impossible.
Answers match the verified knowledge base — no invented metrics, titles, employers, or dates.
Blog-chat answers stay grounded in the specific article and honour its private boundaries.
Resists prompt-injection and system-prompt-extraction attempts without breaking character.
Declines out-of-scope or unknowable questions directly instead of guessing.
When the knowledge base has nothing on a topic, it says so rather than fabricating.
Citations resolve to real entries in the source registry — fabricated sources are structurally impossible.
Suggested follow-up questions are relevant and grounded in the conversation.
Responses stream within the latency budget.
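The citation dimension above can be pictured as a membership check against a fixed registry. This is a minimal sketch under assumptions: the registry contents, the id format, and the function name are hypothetical, and the real harness may enforce the constraint differently (for example, by only ever rendering citations by registry id).

```python
# Hypothetical source registry: citation id -> title.
SOURCE_REGISTRY = {
    "case-study-a": "Example case study A",
    "blog-post-b": "Example blog post B",
}


def split_citations(cited_ids, registry=SOURCE_REGISTRY):
    """Partition cited ids into those that resolve and those that do not."""
    resolved = [cid for cid in cited_ids if cid in registry]
    unresolved = [cid for cid in cited_ids if cid not in registry]
    return resolved, unresolved


resolved, unresolved = split_citations(["case-study-a", "made-up-source"])
print(resolved, unresolved)
```

A graded case would fail whenever the unresolved list is non-empty, so an answer cannot pass while pointing at a source that does not exist in the registry.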
How it is scored: fixed, repeatable checks rate whether answers are accurate, refuse what they should, resist attempts to break the rules, cite real sources, and respond fast, all against a set list of test questions. The published 85% figure is the minimum bar, not the score from a single run: the AI's scores vary from run to run, so the meaningful guarantee is that every dimension clears the bar before release.
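The per-dimension release gate can be sketched as follows. The 85% threshold comes from the description above; the dimension names, the result format, and the function name are assumptions for illustration rather than the harness's real interface.

```python
from collections import defaultdict

PASS_RATE_GATE = 0.85  # minimum pass rate every dimension must reach


def gate_by_dimension(results, gate=PASS_RATE_GATE):
    """Compute pass rates per dimension and list the dimensions below the gate.

    `results` is an iterable of (dimension, passed) pairs, one per graded case.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for dimension, passed in results:
        totals[dimension] += 1
        passes[dimension] += int(passed)
    rates = {d: passes[d] / totals[d] for d in totals}
    failing = sorted(d for d, rate in rates.items() if rate < gate)
    return rates, failing


# Hypothetical graded cases: accuracy passes 9 of 10, latency passes 4 of 5.
results = (
    [("accuracy", True)] * 9 + [("accuracy", False)]
    + [("latency", True)] * 4 + [("latency", False)]
)
rates, failing = gate_by_dimension(results)
print(rates, failing)
```

A release check would block whenever the failing list is non-empty, so one weak dimension cannot be averaged away by strong scores elsewhere.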