The Problem: Don't try to retrofit LLM testing into old paradigms. Testing pyramids, CI/CD gates, deterministic assertions - these were built for a world where f(x) = y
every time. LLMs broke that contract.
The Core Idea
Traditional software has bugs. LLM software has behaviors.
You don't test behaviors - you observe them, measure them, and guide them. This requires a fundamentally different approach.
Three Modes of Observation
Mode 1: Introspection - The model evaluates itself during generation. Like a writer reviewing their draft before hitting send. Built-in confidence scoring, explanation generation, self-critique loops. This happens at inference time and costs tokens, but it provides immediate guardrails.
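A minimal sketch of an introspection loop, assuming a hypothetical call_model() helper that wraps your actual client; the 0.8 confidence cutoff and revision count are illustrative, not recommendations:

```python
def call_model(prompt: str) -> str:
    """Placeholder: wrap whatever LLM client you actually use here."""
    raise NotImplementedError

def generate_with_introspection(task: str, max_revisions: int = 2) -> dict:
    draft = call_model(task)
    confidence = 0.0
    for _ in range(max_revisions):
        # Ask the model to critique its own draft and score its confidence.
        critique = call_model(
            "Rate this answer from 0 to 1 and explain any problems.\n"
            f"Task: {task}\nAnswer: {draft}\nReply as: <score> | <critique>"
        )
        score_text, _sep, notes = critique.partition("|")
        try:
            confidence = float(score_text.strip())
        except ValueError:
            confidence = 0.0  # unparseable self-assessment counts as low confidence
        if confidence >= 0.8:  # confident enough: stop spending tokens
            break
        draft = call_model(
            f"Revise the answer using this critique: {notes.strip()}\n"
            f"Task: {task}\nPrevious answer: {draft}"
        )
    return {"answer": draft, "confidence": confidence}
```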
Mode 2: Instrumentation - Automated measurement of what actually happened. Response times, token usage, embedding similarities, retrieval success rates. Like application performance monitoring but for AI quality. Runs continuously, at scale.
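A sketch of instrumentation as a thin wrapper, assuming your client function returns the response text plus a usage dict; the field names and the citation check are stand-ins for whatever retrieval and quality metrics you actually track:

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class Observation:
    """One instrumented LLM call."""
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    retrieval_hit: bool  # did the answer cite the retrieved context?

def instrumented_call(client_fn, prompt: str) -> tuple[str, Observation]:
    """Wrap any client function that returns (text, usage_dict)."""
    start = time.perf_counter()
    text, usage = client_fn(prompt)
    obs = Observation(
        latency_ms=(time.perf_counter() - start) * 1000,
        prompt_tokens=usage.get("prompt_tokens", 0),
        completion_tokens=usage.get("completion_tokens", 0),
        retrieval_hit="[source]" in text,  # stand-in for a real citation check
    )
    print(json.dumps(asdict(obs)))  # structured log line for your metrics pipeline
    return text, obs
```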
Mode 3: Interpretation - Humans make sense of what matters. Not everything can be measured - sometimes you need human judgment on tone, helpfulness, or edge cases. Expensive but irreplaceable for subjective quality and ground truth.
What This Looks Like in Practice
During Development:
Start with exploration, not test cases - discover what your model can actually do
Capture interesting behaviors as you discover them (the good, bad, and weird)
Write expectations, not assertions: "should be empathetic", not "must contain 'sorry'" (see the sketch after this list)
Build your eval suite backwards - get it working first, then define what "good" means based on real outputs
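As referenced above, expectations can be encoded as rubric scores rather than string assertions. This sketch assumes a hypothetical judge_model() helper; the rubric prompt and the 0.7 threshold in the usage note are illustrative:

```python
def judge_model(prompt: str) -> str:
    """Placeholder: wrap a judge LLM here."""
    raise NotImplementedError

def expect(output: str, expectation: str) -> float:
    """Score how well an output meets a natural-language expectation (0-1)."""
    verdict = judge_model(
        "Score from 0 to 1 how well the response meets the expectation.\n"
        f"Expectation: {expectation}\nResponse: {output}\n"
        "Reply with only the number."
    )
    try:
        return max(0.0, min(1.0, float(verdict.strip())))
    except ValueError:
        return 0.0  # a judge that can't answer is treated as a failing score

# Usage: captured behaviors become the suite, thresholds come later.
# score = expect(reply, "should be empathetic when the user reports a failed payment")
# assert score >= 0.7  # only after you've seen real score distributions
```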
During Deployment:
Shadow scoring on real traffic before switching
Gradual rollout based on confidence thresholds
Automatic fallback to previous version if scores drop
No binary deploy/rollback - it's a confidence dial
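A sketch of that confidence dial, assuming shadow or online eval scores stream in per request; the window size, thresholds, and step sizes are illustrative:

```python
import random

class ConfidenceDial:
    """Route traffic between the stable and candidate model based on the
    candidate's rolling eval score."""

    def __init__(self, start_share: float = 0.05, window: int = 200):
        self.share = start_share  # fraction of traffic on the candidate
        self.scores: list[float] = []
        self.window = window

    def record(self, score: float) -> None:
        self.scores = (self.scores + [score])[-self.window:]
        avg = sum(self.scores) / len(self.scores)
        if avg >= 0.85:
            self.share = min(1.0, self.share + 0.05)   # turn the dial up slowly
        elif avg < 0.70:
            self.share = max(0.0, self.share - 0.25)   # fall back fast

    def route(self) -> str:
        return "candidate" if random.random() < self.share else "stable"
```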
During Operation:
Stream of observations, not error logs
Distribution of scores, not pass/fail counts
Anomaly detection, not threshold alerts
Behavioral drift tracking, not uptime monitoring
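One way to track that drift is to compare the recent score distribution against a baseline. This sketch uses total variation distance over histogram buckets; the 0.2 alert threshold is illustrative:

```python
from collections import Counter

def score_histogram(scores: list[float], bins: int = 10) -> list[float]:
    """Bucket 0-1 scores into a normalized histogram."""
    counts = Counter(min(int(s * bins), bins - 1) for s in scores)
    total = max(len(scores), 1)
    return [counts.get(b, 0) / total for b in range(bins)]

def drift(baseline: list[float], recent: list[float], bins: int = 10) -> float:
    """Total variation distance between score distributions
    (0 = identical, 1 = completely disjoint)."""
    p = score_histogram(baseline, bins)
    q = score_histogram(recent, bins)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# if drift(last_month_scores, last_hour_scores) > 0.2: open_incident("behavioral drift")
```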
During Improvement:
Human labels on strategic samples, not random QA
Disagreement cases between evaluators get priority
Corrections become test cases automatically (see the sketch after this list)
Fine-tuning happens on production-validated examples
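A sketch of the improvement loop's bookkeeping, as referenced above: disagreement between evaluators drives labeling priority, and human corrections flow straight into regression tests and fine-tuning sets. Field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Case:
    prompt: str
    model_output: str
    evaluator_scores: dict  # e.g. {"judge_a": 0.9, "judge_b": 0.3}
    human_label: float | None = None
    human_correction: str | None = None

def review_priority(case: Case) -> float:
    """Evaluator disagreement drives labeling priority; unscored cases go first."""
    scores = list(case.evaluator_scores.values())
    return max(scores) - min(scores) if scores else 1.0

def corrections_to_tests(cases: list[Case]) -> list[dict]:
    """Every human correction becomes a regression test / fine-tuning example."""
    return [
        {"prompt": c.prompt, "expected_behavior": c.human_correction}
        for c in cases
        if c.human_correction
    ]
```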
Embracing LLM Constraints
Embrace Uncertainty: Stop pretending LLMs are deterministic. Design for confidence intervals, not binary outcomes.
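A minimal sketch of gating on an interval rather than a single run, using a normal approximation over repeated eval scores; the sample count and the 0.75 floor in the usage note are illustrative:

```python
import statistics

def score_with_interval(scores: list[float]) -> tuple[float, float, float]:
    """Mean eval score with an approximate 95% interval."""
    mean = statistics.mean(scores)
    if len(scores) < 2:
        return mean, 0.0, 1.0  # one sample tells you almost nothing
    half_width = 1.96 * statistics.stdev(scores) / (len(scores) ** 0.5)
    return mean, max(0.0, mean - half_width), min(1.0, mean + half_width)

# Gate on the interval, not a single run:
# mean, lo, hi = score_with_interval(run_eval_n_times(prompt, n=20))
# ship = lo >= 0.75  # "confident it's at least 0.75", not "it passed once"
```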
Compose, Don't Compile: Evaluators should be Lego blocks - mix and match for your use case. Need safety + factuality + tone? Stack them.
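A sketch of that composition, assuming each evaluator maps (prompt, response) to a 0-1 score; the weights are illustrative:

```python
from typing import Callable

Evaluator = Callable[[str, str], float]  # (prompt, response) -> score in [0, 1]

def stack(*evaluators: Evaluator, weights: list[float] | None = None) -> Evaluator:
    """Compose standalone evaluators into one weighted score."""
    weights = weights or [1.0] * len(evaluators)

    def combined(prompt: str, response: str) -> float:
        scores = [e(prompt, response) for e in evaluators]
        return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

    return combined

# safety, factuality, and tone would each be small standalone evaluators:
# support_eval = stack(safety, factuality, tone, weights=[2.0, 1.0, 1.0])
```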
Optimize for Learning: Every evaluation should make the system smarter. If it doesn't feed back into improvement, why measure it?
Human Time is Sacred: Only escalate to humans when machines disagree or confidence is low. Make their input count by turning it into reusable signals.
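A sketch of that escalation rule, assuming each automated evaluator emits a 0-1 score; both cutoffs are illustrative and should be tuned against your own score distributions:

```python
def needs_human(evaluator_scores: dict[str, float],
                disagreement_cutoff: float = 0.4,
                low_confidence_cutoff: float = 0.6) -> bool:
    """Escalate only when machine evaluators disagree or all of them are unsure."""
    scores = list(evaluator_scores.values())
    if not scores:
        return True  # no automated signal at all: a human has to look
    disagreement = max(scores) - min(scores)
    return disagreement > disagreement_cutoff or max(scores) < low_confidence_cutoff
```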