After building my daily digest agent, an AI that curates a personalised morning newsletter from my RSS feeds and email newsletters, the natural next question was: is it any good? Not "does it run without errors" — GitHub Actions tells me that. But does it actually produce a digest I would want to read? Is it capturing what I care about? Are the summaries genuinely useful or just prettier headlines?
These are qualitative questions, and for those first few days I was answering them the way most people answer them: by reading my inbox and having a feeling. Some mornings the digest felt sharp. Other mornings it felt like a slightly rearranged version of my existing feeds. But "a feeling" does not compound. It does not tell you whether things are getting better or worse over time, which prompt changes helped, or which dimensions of quality are underperforming.
This is the problem that evaluation frameworks exist to solve. And this post is about how I built one: specifically, a system where I use an LLM to judge the quality of another LLM's output, calibrated to my own taste.
Why Not Just Use Standard Benchmarks?
The short answer: they were not built for this.
Standard benchmarks like MMLU or SWE-bench measure general model capabilities across standardised test sets. They are useful for choosing which model to use, but they tell you nothing about whether that model is performing your specific task well. My digest agent uses Claude Sonnet via the Claude Agent SDK — which means I can reasonably assume that Anthropic has already run the standard capability and safety evals on the underlying model. Layers 1 and 2 from Part 1 are covered by the SDK vendor.
What I need to evaluate is Layer 3: task-specific performance. Specifically: given my actual interest profile and source list, is the agent producing a digest that reflects my taste? No public benchmark measures that. I have to build it myself.
There are also general-purpose agent evaluation frameworks — GAIA, AgentBench, ToolBench — designed to measure whether an agent can complete multi-step tasks. But these measure general agentic capability, not the specific editorial judgment my digest requires. "Can the agent browse the web and synthesise information" is table stakes for any modern agent. What I care about is whether it synthesises the right information, with the right framing, from a sufficiently diverse set of sources.
Designing the Rubric: What Does "Good" Actually Mean?
Before writing a single line of evaluation code, I had to answer the harder question: what does a good digest actually look like? Not vaguely — specifically enough that a language model could score it consistently.
I settled on eight dimensions, each scored 1 to 5 by the judge. They fall into three natural groups.
Relevance (60% of score)
1. Interest Priority Adherence (weight: 25%) — My agent is configured with a user_profile.yaml file that explicitly categorises my interests as high, medium, or low priority. High priority: AI agents, LLM architecture, developer tools, robotics. Medium: VC funding, product launches. Low: general tech news, SF local events. A good digest should feel heavy on high-priority content. A 5/5 here means the digest reflects that hierarchy faithfully.
2. Summary Quality (weight: 20%) — The system prompt asks the agent to write summaries "like a knowledgeable friend who tells you only what actually matters." Do the summaries add genuine insight beyond the headline, or do they just rewrite the title? A 5/5 here means the summaries explain why a story matters — context, implication, or technical detail worth knowing.
3. Signal-to-Noise Ratio (weight: 15%) — Every item should earn its slot. The profile caps the digest at 20 items and sets a minimum relevance threshold. Does the agent curate ruthlessly, or does it pad with marginal content?
Curation (25% of score)
4. Source Diversity and Tool Use (weight: 15%) — The agent has access to six configured sources plus web search via the Exa API. Over-relying on a single source — for instance, drawing 60% of items from one blog — suggests the agent is not using its full toolkit. Scored using explicit percentage thresholds (more on this when we get to calibration).
5. Theme Detection and Editorial Voice (weight: 10%) — The system prompt asks for a 2-to-3 sentence editorial intro identifying the day's connecting theme. Does the agent actually do this, and is the theme insightful or generic? A 5/5 reads like a knowledgeable friend's take — not "here is what happened" but "here is what you should notice."
Quality and Freshness (15% of score)
6. Content Freshness (weight: 10%) — My profile flags content as stale after 48 hours. Is the digest capturing the current moment, or recycling last week's news?
7. Source Failure Recovery (weight: 3%) — When sources fail (documented in the agent's run_log.json), does the agent recover gracefully via web search, or does it leave gaps?
8. Novelty (weight: 2%) — Does the digest surface anything I am unlikely to have seen already? This dimension requires comparing today's digest to the past two days, which is handled by injecting recent digests into the judge prompt.
The weights reflect my actual priorities. Interest alignment and summary quality together account for 45% of the score, because if those two fail, nothing else saves the digest.
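The weighted average itself is trivial, but writing it down makes the rubric unambiguous. Here is a minimal sketch; the dimension names and weights mirror the rubric above, while the function name and dict shape are my own illustration, not the exact code:

```python
# Dimension weights from the rubric above; they must sum to 1.0.
WEIGHTS = {
    "interest_priority_adherence": 0.25,
    "summary_quality": 0.20,
    "signal_to_noise": 0.15,
    "source_diversity": 0.15,
    "theme_and_editorial_voice": 0.10,
    "content_freshness": 0.10,
    "source_failure_recovery": 0.03,
    "novelty": 0.02,
}

def overall_score(scores: dict[str, int]) -> float:
    """Weighted average of the eight 1-5 dimension scores."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
```

A digest scoring 5 on everything lands at 5.0; one that tanks only the two heaviest dimensions loses far more than one that tanks the bottom three combined, which is the point of the weighting.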
LLM as Judge
The core idea is straightforward: instead of me reading every digest and scoring it, I send each digest to a stronger LLM — Claude Opus 4.5, one tier above the Claude Sonnet 4.5 model that generates the digest — along with my user profile, the agent's system prompt, and the rubric, and ask it to score across all eight dimensions with a written explanation for each.
Using a stronger model to judge matters. Asking the same model that generated the digest to evaluate it introduces self-serving bias: the model is unlikely to be as hard on its own output as an independent evaluator would be. Opus has stronger reasoning and applies a rubric more consistently when asked to score strictly. This matches the approach used in the MT-Bench paper (Zheng et al., 2023), which popularised LLM-as-a-judge methodology and specifically recommended using a more capable model for evaluation.
The judge prompt has four sections: my user profile as the ground-truth baseline for what the digest should achieve; the agent's system prompt as context for what it was told to do; run metadata from the agent's log file (sources fetched, sources that failed, items included before filtering); and the full digest text to evaluate. The judge returns a structured JSON response: a score and two-to-three sentence explanation for each dimension, an overall weighted score, the single biggest issue, and the single biggest strength.
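The assembly step can be sketched in a few lines. The function name and section headers below are hypothetical; what matters is the four-section order described above:

```python
import json

def build_judge_prompt(profile: str, system_prompt: str,
                       run_meta: dict, digest: str) -> str:
    """Assemble the four-section judge prompt: user profile as ground
    truth, the agent's own system prompt as context, run metadata from
    the log file, then the digest under evaluation."""
    return "\n\n".join([
        "## User profile (ground truth for what the digest should achieve)\n"
        + profile,
        "## Agent system prompt (what the agent was told to do)\n"
        + system_prompt,
        "## Run metadata (sources fetched/failed, pre-filter item count)\n"
        + json.dumps(run_meta, indent=2),
        "## Digest to evaluate\n" + digest,
    ])
```

Putting the digest last keeps the evaluation criteria in front of the judge before it reads a single item.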
Here is an excerpt from a real judge output on one of my February digests:
```json
{
  "interest_priority_adherence": {
    "score": 5,
    "explanation": "The digest is almost entirely composed of high-priority content: AI agents, LLM architecture, developer tools, and AI research. The only medium-priority item is the Anthropic PAC story. No low-priority filler."
  },
  "source_diversity": {
    "score": 3,
    "explanation": "Simon Willison accounts for 10 of 17 items (59%). Product Hunt contributes 3, TLDR adds 2, TechCrunch 1. One source contributing 59% falls in the 41-60% range per the rubric threshold. Missing: Lenny's Newsletter (marked always_include) with no web search recovery noted."
  },
  "theme_and_editorial_voice": {
    "score": 5,
    "explanation": "Excellent editorial intro identifying 'AI stack professionalization' as the connecting thread and explaining why it matters. Grouping is genuinely thematic rather than source-based."
  }
}
```
The explanations are the most valuable part. They turn a number into something actionable.
The Calibration Layer: Making the Judge Agree with You
Writing a rubric and running the judge is not enough. The judge has its own interpretation of what "3 out of 5 on Source Diversity" means, and that interpretation may not match yours. Before you trust the scores, you need to verify that the judge is scoring the way you would.
This is the calibration layer. The process works like this: manually rate a set of past digests yourself using the same rubric, then compare your scores to the judge's scores using Pearson correlation. Pearson correlation (r) measures how closely two sets of numbers move together, on a scale from -1 (perfect inverse) to +1 (perfect agreement). If r is 0.60 or above on a dimension, the judge is interpreting that dimension similarly to you. If not, the rubric needs refinement.
I rated 21 digests manually: five real daily digests from February 2026 and sixteen synthetic test digests generated in early January. The January test digests were deliberately included because they span a wider quality range than the real ones — some are good, some have obvious problems. High-variance data is what calibration needs. If every digest were a 4.5 out of 5, correlation measurements would be meaningless.
The first calibration run was humbling.
Source Diversity: A Systematic Misalignment
The most glaring gap was Source Diversity. On digests where I rated 2 out of 5, the judge was giving 4 out of 5. The root cause was vague rubric language. My original rubric said things like "one source may contribute up to 40% of items" for a 4 out of 5 — but "up to 40%" is ambiguous. Does 38% count? What about 42%? The judge interpreted this loosely. I interpreted it strictly.
Human rating: 2/5 — "80% of the content was just taken from Simon Willison. This is not diverse."
Judge rating: 4/5 — "Content draws from multiple sources including a prominent technical blog, product news, and mainstream tech media. Reasonable diversity for a technical digest."
Delta: 2 full points on a 5-point scale. The judge was not wrong to read "multiple sources" as satisfied — I just had never specified that the distribution mattered, not just the count.
The fix was to replace every qualitative description in the Source Diversity rubric with an exact percentage threshold:
IMPORTANT: Calculate the exact percentage of items from the most-used source.
Use the thresholds below strictly - do not round or approximate.
| Score | Threshold |
|-------|-----------|
| 5 | <=30% |
| 4 | 31-40% |
| 3 | 41-60% |
| 2 | 61-80% |
| 1 | >=81% |
After this change, digests where Simon Willison contributed 53 to 59% of items correctly scored 3 out of 5, not 4 out of 5. The lesson is that for any objective dimension, you need exact numbers. "Noticeable clustering" is not a scoring criterion — it is an invitation for the judge to make a subjective call that may not match yours.
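The thresholds above are simple enough to express as a deterministic function, which is also a useful sanity check on the judge's arithmetic (the function name is my own illustration):

```python
def diversity_score(top_source_pct: float) -> int:
    """Map the most-used source's share of items to a 1-5 score,
    using the exact thresholds from the rubric table above."""
    if top_source_pct <= 30:
        return 5
    if top_source_pct <= 40:
        return 4
    if top_source_pct <= 60:
        return 3
    if top_source_pct <= 80:
        return 2
    return 1
```

When a rubric dimension reduces to a pure function like this, you can spot-check the judge against it for free; any disagreement is a judge arithmetic error, not a rubric ambiguity.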
Novelty: The Judge Needs Memory
The Novelty dimension had a different problem. The rubric was fine — it explicitly mentioned checking for repeated stories across consecutive days. The judge just did not have the information to apply it. Without access to yesterday's digest, the judge could only evaluate novelty relative to the single digest in front of it. A story that appeared yesterday looks completely fresh if yesterday's digest is not in context.
The fix was to inject the past two days of digests into the judge prompt with an explicit instruction: check for repeated stories and apply a one-point penalty per repeated item. The get_recent_digests() function handles retrieval from the digests/ archive. This is a general principle worth noting: if a dimension of your rubric requires context the judge does not have, it cannot score that dimension correctly regardless of how good the rubric language is.
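In the real system the penalty is applied by the judge inside the prompt, but the rule itself is deterministic. Here is a sketch of the same logic, assuming digest items are identified by URL (the function names are hypothetical; `get_recent_digests()` from the text would supply the prior digests):

```python
def novelty_penalty(today_urls: set[str],
                    recent_digests: list[set[str]]) -> int:
    """One-point penalty per item already seen in the past two
    days' digests, per the rubric instruction."""
    seen = set().union(*recent_digests) if recent_digests else set()
    return len(today_urls & seen)

def apply_novelty_penalty(base_score: int, penalty: int) -> int:
    """Clamp so the dimension score never drops below 1."""
    return max(1, base_score - penalty)
```

Keeping a deterministic reference implementation alongside the judge makes it cheap to audit whether the judge actually applied the penalty it was told to apply.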
The Result: Three Rounds, Three Days
Three rounds of rubric refinement over three days. After each round, I re-scored the most problematic digests and checked whether the delta closed. After the final round, overall Pearson correlation improved from r = 0.48 to r = 0.72. The Source Diversity dimension went from r = 0.21 to r = 0.65. The rubric was then locked, and the judge was trusted.
The total setup cost — generating test digests, running calibration scoring, and iterating through rubric refinement — was roughly $3.24 in API credits and about eight hours of work across three days. For a simple personal agent with one output type, that is the real cost of "just building an eval." It is not enormous, but it is not trivial either. It also gives you a sense of why evaluation infrastructure at scale is a serious engineering investment.
What the Scores Actually Revealed
Once the judge was calibrated, the interesting findings started coming through. The most consistent issue across my synthetic test digests was source diversity — and it was not subtle. The January test corpus had many digests where Simon Willison's blog accounted for 55 to 80% of items. The agent was technically following its instructions (Simon is marked always_include: true in my profile), but it was over-weighting one high-signal source to the exclusion of others.
This is the kind of pattern that is easy to miss when you are just reading your inbox. Any individual digest might feel fine — Simon Willison writes excellent content. But when you look at scores over time, source diversity has been systematically low, which means the agent is not using its full toolkit, and the digest is effectively one person's curation wearing multiple hats.
The fix is straightforward: adjust the system prompt to more explicitly cap any single source's contribution, and add a stronger directive to draw from TLDR, Product Hunt, and Lenny's in every digest. But I would only have known to make that specific change because the judge gave me a precise, consistent diagnosis across 21 digests — not just a feeling.
The Cost Question
One thing worth flagging explicitly: evaluation is significantly more expensive than generation.
| Component | Model | Cost per Digest | Monthly (daily cadence) |
|---|---|---|---|
| Digest generation | Claude Sonnet 4.5 | ~$0.015 | ~$0.45 |
| Judge evaluation | Claude Opus 4.5 | ~$0.10 | ~$3.00 |
| Total | | ~$0.115 | ~$3.45 |
The judge costs about seven times more per run than the agent itself. This is because the judge prompt is substantially longer: it includes the user profile, the system prompt, run metadata, the past two digests for novelty context, the full rubric, and the digest being evaluated. Opus is also priced higher than Sonnet. The combination adds up.
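The arithmetic behind the table is worth making explicit, because the ratio is the number that survives any pricing change:

```python
# Per-run figures from the table above (approximate).
gen_cost, judge_cost = 0.015, 0.10

monthly = 30 * (gen_cost + judge_cost)  # daily cadence
ratio = judge_cost / gen_cost           # judge overhead vs generation
```

At these figures the ratio is roughly 6.7x; the headline lesson (evaluation dominates the bill) holds as long as the judge prompt stays several times longer than the generation prompt and runs on a pricier model.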
Is it worth it? For my use case, yes — $3 a month for a continuous quality signal is inexpensive. But it illustrates a broader truth: building robust evaluation infrastructure is not free, and at scale, eval costs can rival or exceed generation costs. If you are running evaluations across thousands of outputs, this needs to be part of your cost model from the start.
The Closed-Loop Temptation
An obvious question follows from all of this: if the judge can identify problems, can the agent just fix them automatically? Could I wire the scores back into a loop where a second agent reads the judge's output, revises the curation criteria, and the next digest automatically improves?
Technically, yes: the architecture supports it. But I would be cautious about closing the loop without oversight.
The risk is drift. If the judge identifies that source diversity is low and an automated agent responds by changing my user profile to deprioritise Simon Willison — without my review — it might fix one metric at the cost of something I care about more. The judge is calibrated to score against my stated preferences, but my stated preferences are an approximation of what I actually want. An automated loop that optimises against the approximation without human checks can drift in ways that feel subtle until they are obvious.
Where this becomes genuinely transformative is at scale. If I were running fifty versions of this digest for different users, or a hundred different agentic workflows in a production system, I could not manually review every output. The judge lets me set quality thresholds and only look at the things that fall below them — the difference between monitoring everything and being alerted only when something actually needs attention.
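The threshold-alerting pattern is a one-liner once the scores exist. A minimal sketch (the function name, score shape, and 3.5 cutoff are my own illustration):

```python
def flag_below_threshold(digest_scores: dict[str, dict[str, float]],
                         threshold: float = 3.5) -> dict[str, float]:
    """Given {digest_id: {"overall": score, ...}}, return only the
    digests whose overall score fell below the alert threshold --
    monitor everything, look only at what needs attention."""
    return {d: s["overall"] for d, s in digest_scores.items()
            if s["overall"] < threshold}
```

Everything above the threshold ships unread; everything below it comes with the judge's written explanation attached, so the alert already contains its own diagnosis.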
The real value of building a proper eval framework — even for something as small as a personal digest — is that the pattern transfers. The same structure I used here (rubric design, LLM judge, Pearson calibration, threshold alerting) applies directly to a customer-facing summarisation product, a content moderation pipeline, or any agentic workflow where output quality matters and volume makes manual review impractical. You build the eval once for one use case; you validate that it captures your intent; then you scale the framework. The bottleneck for scaling AI products is often not the model or the infrastructure — it is whether you can trust what the model is producing. That is an eval problem.
Production-Grade Eval Tooling: MLflow and Braintrust
Building a custom eval framework from scratch — as I did here — is instructive. But at production scale, teams reach for dedicated tooling. Two worth knowing are MLflow (open-source, maintained by Databricks) and Braintrust. I recently attended a tech community session where both were covered in depth. Here is what stood out.
MLflow has expanded well beyond its original experiment-tracking roots into a full AI observability layer. For LLM applications, it captures traces and telemetry at the request level — you can see exactly what prompt was sent, what the model returned, how long it took, and what tools were called. On top of that trace infrastructure, MLflow supports LLM-as-judge evaluation natively: you define judge criteria (safety, PII detection, relevance, tone) and run them across trace batches. The GEPA prompt optimizer is particularly useful — it treats your prompt as a variable and optimises it systematically against your rubric, rather than relying on manual iteration. MLflow also has native MCP integration, which means it can plug directly into agentic tool-use workflows without custom instrumentation.
Braintrust approaches evaluation through the lens of continuous integration. The core idea — what they call eval-driven development — is that every prompt change should trigger a scored eval run, the same way a code change triggers a test suite. This reframes evaluation from an occasional quality check into a continuous feedback loop: you do not ship a prompt update until the evals pass. Braintrust supports single-agent and multi-agent evaluation patterns, as well as multi-turn conversation evals, which are considerably harder to design than single-output evals. The practical implication is that teams using Braintrust tend to catch regressions before deployment rather than in production — the same shift that unit testing enabled for software engineering.
Both tools operate on a principle that aligns with what I learned building this system manually: the value of evaluation compounds. A single scored output tells you little. A dashboard of scores over time tells you whether a prompt change helped or hurt, which dimensions are systematically underperforming, and when something drifts. That compounding diagnostic value is what turns eval infrastructure from a nice-to-have into a core part of how you improve an AI system.
What I Would Do Differently
Start with the rubric, not the code. The hardest part of this whole exercise was not implementing the scoring pipeline or the Streamlit dashboard. It was writing rubric language precise enough for the judge to interpret the way I intended. I would spend more time on the rubric before touching a keyboard — specifically, identifying upfront which dimensions need quantitative thresholds and writing those first.
Calibration is not optional. The initial misalignment between my scores and the judge's on source diversity would have made the whole system misleading if I had not caught it. I would have been optimising against a metric that did not actually reflect my taste. Running calibration before trusting the judge is the step that makes everything else meaningful.
The judge needs maximum context. Early versions of the judge prompt did not include the agent's run logs — the metadata about which sources were fetched, which failed, how many items were considered before the final selection. Adding that context made the judgments significantly more accurate, especially for source failure recovery and source diversity. Judges need to understand intent, not just output.
The system now runs end to end. Every morning the agent generates a digest, archives it, and the scoring pipeline evaluates it automatically via GitHub Actions. A Streamlit dashboard shows score trends over time, flags any digest that scored below threshold, and surfaces the judge's specific explanation for why. When a dimension starts trending down, I know exactly what to fix, and I know whether the fix worked.
There is a broader point worth sitting with. In a Stanford engineering lecture I came across, a speaker observed that models will only develop as fast as they can be evaluated. It is easy to treat evaluation as a secondary concern — the thing you do after you have built the thing that matters. But the history of how frontier models improved suggests the opposite. The jumps in capability — RLHF, constitutional AI, scalable oversight — all depended on having evaluation infrastructure that could credibly signal what "better" meant. Anthropic's engineering team has written about this directly: good evaluations are what allow teams to ship AI agents confidently, and their value compounds over the lifecycle of a system. Without them, you are flying without instruments.
For operators building with AI — not just researchers training models — the same logic applies. You do not need a research team to build a meaningful eval. You need a rubric precise enough to be scored consistently, a judge calibrated to your taste, and the discipline to look at the scores rather than just reading your inbox and having a feeling. That is the whole system. And it is more tractable than it looks.