The Problem: Don't try to retrofit LLM testing into old paradigms. Testing pyramids, CI/CD gates, deterministic assertions - these were built for a world where f(x) = y
every time. LLMs broke that contract.
The Core Idea
Traditional software has bugs. LLM software has behaviors.
You don't test behaviors - you observe them, measure them, and guide them. This requires a fundamentally different approach.
Three Modes of Observation
Mode 1: Introspection - The model evaluates itself during generation. Like a writer reviewing their draft before hitting send. Built-in confidence scoring, explanation generation, self-critique loops. This happens at inference time and costs tokens, but it provides immediate guardrails.
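A minimal sketch of an introspection loop, assuming a hypothetical call_model() helper that wraps your actual client; the 0.8 confidence cutoff and revision count are illustrative, not recommendations:

```python
def call_model(prompt: str) -> str:
    """Placeholder: wrap whatever LLM client you actually use here."""
    raise NotImplementedError

def generate_with_introspection(task: str, max_revisions: int = 2) -> dict:
    draft = call_model(task)
    confidence = 0.0
    for _ in range(max_revisions):
        # Ask the model to critique its own draft and score its confidence.
        critique = call_model(
            "Rate this answer from 0 to 1 and explain any problems.\n"
            f"Task: {task}\nAnswer: {draft}\nReply as: <score> | <critique>"
        )
        score_text, _sep, notes = critique.partition("|")
        try:
            confidence = float(score_text.strip())
        except ValueError:
            confidence = 0.0  # unparseable self-assessment counts as low confidence
        if confidence >= 0.8:  # confident enough: stop spending tokens
            break
        draft = call_model(
            f"Revise the answer using this critique: {notes.strip()}\n"
            f"Task: {task}\nPrevious answer: {draft}"
        )
    return {"answer": draft, "confidence": confidence}
```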
Mode 2: Instrumentation - Automated measurement of what actually happened. Response times, token usage, embedding similarities, retrieval success rates. Like application performance monitoring but for AI quality. Runs continuously, at scale.
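A sketch of instrumentation as a thin wrapper, assuming your client function returns the response text plus a usage dict; the field names and the citation check are stand-ins for whatever retrieval and quality metrics you actually track:

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class Observation:
    """One instrumented LLM call."""
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    retrieval_hit: bool  # did the answer cite the retrieved context?

def instrumented_call(client_fn, prompt: str) -> tuple[str, Observation]:
    """Wrap any client function that returns (text, usage_dict)."""
    start = time.perf_counter()
    text, usage = client_fn(prompt)
    obs = Observation(
        latency_ms=(time.perf_counter() - start) * 1000,
        prompt_tokens=usage.get("prompt_tokens", 0),
        completion_tokens=usage.get("completion_tokens", 0),
        retrieval_hit="[source]" in text,  # stand-in for a real citation check
    )
    print(json.dumps(asdict(obs)))  # structured log line for your metrics pipeline
    return text, obs
```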
Mode 3: Interpretation - Humans make sense of what matters. Not everything can be measured - sometimes you need human judgment on tone, helpfulness, or edge cases. Expensive but irreplaceable for subjective quality and ground truth.
What This Looks Like in Practice
During Development:
Start with exploration, not test cases - discover what your model can actually do
Capture interesting behaviors as you discover them (the good, bad, and weird)
Write expectations, not assertions: "should be empathetic", not "must contain 'sorry'" (see the sketch after this list)
Build your eval suite backwards - get it working first, then define what "good" means based on real outputs
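As referenced above, expectations can be encoded as rubric scores rather than string assertions. This sketch assumes a hypothetical judge_model() helper; the rubric prompt and the 0.7 threshold in the usage note are illustrative:

```python
def judge_model(prompt: str) -> str:
    """Placeholder: wrap a judge LLM here."""
    raise NotImplementedError

def expect(output: str, expectation: str) -> float:
    """Score how well an output meets a natural-language expectation (0-1)."""
    verdict = judge_model(
        "Score from 0 to 1 how well the response meets the expectation.\n"
        f"Expectation: {expectation}\nResponse: {output}\n"
        "Reply with only the number."
    )
    try:
        return max(0.0, min(1.0, float(verdict.strip())))
    except ValueError:
        return 0.0  # a judge that can't answer is treated as a failing score

# Usage: captured behaviors become the suite, thresholds come later.
# score = expect(reply, "should be empathetic when the user reports a failed payment")
# assert score >= 0.7  # only after you've seen real score distributions
```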
During Deployment:
Shadow scoring on real traffic before switching
Gradual rollout based on confidence thresholds
Automatic fallback to previous version if scores drop
No binary deploy/rollback - it's a confidence dial
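A sketch of that confidence dial, assuming shadow or online eval scores stream in per request; the window size, thresholds, and step sizes are illustrative:

```python
import random

class ConfidenceDial:
    """Route traffic between the stable and candidate model based on the
    candidate's rolling eval score."""

    def __init__(self, start_share: float = 0.05, window: int = 200):
        self.share = start_share  # fraction of traffic on the candidate
        self.scores: list[float] = []
        self.window = window

    def record(self, score: float) -> None:
        self.scores = (self.scores + [score])[-self.window:]
        avg = sum(self.scores) / len(self.scores)
        if avg >= 0.85:
            self.share = min(1.0, self.share + 0.05)   # turn the dial up slowly
        elif avg < 0.70:
            self.share = max(0.0, self.share - 0.25)   # fall back fast

    def route(self) -> str:
        return "candidate" if random.random() < self.share else "stable"
```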
During Operation:
Stream of observations, not error logs
Distribution of scores, not pass/fail counts
Anomaly detection, not threshold alerts
Behavioral drift tracking, not uptime monitoring
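One way to track that drift is to compare the recent score distribution against a baseline. This sketch uses total variation distance over histogram buckets; the 0.2 alert threshold is illustrative:

```python
from collections import Counter

def score_histogram(scores: list[float], bins: int = 10) -> list[float]:
    """Bucket 0-1 scores into a normalized histogram."""
    counts = Counter(min(int(s * bins), bins - 1) for s in scores)
    total = max(len(scores), 1)
    return [counts.get(b, 0) / total for b in range(bins)]

def drift(baseline: list[float], recent: list[float], bins: int = 10) -> float:
    """Total variation distance between score distributions
    (0 = identical, 1 = completely disjoint)."""
    p = score_histogram(baseline, bins)
    q = score_histogram(recent, bins)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# if drift(last_month_scores, last_hour_scores) > 0.2: open_incident("behavioral drift")
```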
During Improvement:
Human labels on strategic samples, not random QA
Disagreement cases between evaluators get priority
Corrections become test cases automatically (see the sketch after this list)
Fine-tuning happens on production-validated examples
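A sketch of the improvement loop's bookkeeping, as referenced above: disagreement between evaluators drives labeling priority, and human corrections flow straight into regression tests and fine-tuning sets. Field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Case:
    prompt: str
    model_output: str
    evaluator_scores: dict  # e.g. {"judge_a": 0.9, "judge_b": 0.3}
    human_label: float | None = None
    human_correction: str | None = None

def review_priority(case: Case) -> float:
    """Evaluator disagreement drives labeling priority; unscored cases go first."""
    scores = list(case.evaluator_scores.values())
    return max(scores) - min(scores) if scores else 1.0

def corrections_to_tests(cases: list[Case]) -> list[dict]:
    """Every human correction becomes a regression test / fine-tuning example."""
    return [
        {"prompt": c.prompt, "expected_behavior": c.human_correction}
        for c in cases
        if c.human_correction
    ]
```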
Embracing LLM Constraints
Embrace Uncertainty: Stop pretending LLMs are deterministic. Design for confidence intervals, not binary outcomes.
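A minimal sketch of gating on an interval rather than a single run, using a normal approximation over repeated eval scores; the sample count and the 0.75 floor in the usage note are illustrative:

```python
import statistics

def score_with_interval(scores: list[float]) -> tuple[float, float, float]:
    """Mean eval score with an approximate 95% interval."""
    mean = statistics.mean(scores)
    if len(scores) < 2:
        return mean, 0.0, 1.0  # one sample tells you almost nothing
    half_width = 1.96 * statistics.stdev(scores) / (len(scores) ** 0.5)
    return mean, max(0.0, mean - half_width), min(1.0, mean + half_width)

# Gate on the interval, not a single run:
# mean, lo, hi = score_with_interval(run_eval_n_times(prompt, n=20))
# ship = lo >= 0.75  # "confident it's at least 0.75", not "it passed once"
```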
Compose, Don't Compile: Evaluators should be Lego blocks - mix and match for your use case. Need safety + factuality + tone? Stack them.
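A sketch of that composition, assuming each evaluator maps (prompt, response) to a 0-1 score; the weights are illustrative:

```python
from typing import Callable

Evaluator = Callable[[str, str], float]  # (prompt, response) -> score in [0, 1]

def stack(*evaluators: Evaluator, weights: list[float] | None = None) -> Evaluator:
    """Compose standalone evaluators into one weighted score."""
    weights = weights or [1.0] * len(evaluators)

    def combined(prompt: str, response: str) -> float:
        scores = [e(prompt, response) for e in evaluators]
        return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

    return combined

# safety, factuality, and tone would each be small standalone evaluators:
# support_eval = stack(safety, factuality, tone, weights=[2.0, 1.0, 1.0])
```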
Optimize for Learning: Every evaluation should make the system smarter. If it doesn't feed back into improvement, why measure it?
Human Time is Sacred: Only escalate to humans when machines disagree or confidence is low. Make their input count by turning it into reusable signals.
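A sketch of that escalation rule, assuming each automated evaluator emits a 0-1 score; both cutoffs are illustrative and should be tuned against your own score distributions:

```python
def needs_human(evaluator_scores: dict[str, float],
                disagreement_cutoff: float = 0.4,
                low_confidence_cutoff: float = 0.6) -> bool:
    """Escalate only when machine evaluators disagree or all of them are unsure."""
    scores = list(evaluator_scores.values())
    if not scores:
        return True  # no automated signal at all: a human has to look
    disagreement = max(scores) - min(scores)
    return disagreement > disagreement_cutoff or max(scores) < low_confidence_cutoff
```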