Every week, a new model drops. The announcement follows a familiar template: state-of-the-art on benchmark X, beats previous SOTA on benchmark Y, human-level performance on Z. The numbers are real. The benchmarks are real. And yet, if you have spent any time actually deploying these models on real tasks, you have probably had the experience of picking the "best" model by the numbers and being quietly disappointed by the results.
This is the central tension of AI evaluation: the gap between what we can measure and what we actually care about. Understanding that gap is the starting point for thinking clearly about how to evaluate AI for your own use case. This post covers the landscape of how models are evaluated today, where the methods work well, where they break down, and what that means for anyone choosing or building with AI in a production setting.
The Three Layers of Model Evaluation
Model evaluation is not one thing. It is really three overlapping concerns, each asking a different question about a model.
The first is general capability: is this model broadly intelligent? Can it reason, write, follow complex instructions, solve novel problems? This is what most public benchmarks measure. MMLU (Massive Multitask Language Understanding) tests knowledge across 57 academic subjects. ARC-Challenge tests scientific reasoning. MATH tests mathematical problem-solving. HumanEval and SWE-bench test coding. These benchmarks are designed to be hard enough that a model cannot just pattern-match its way through them — they are meant to probe something like general reasoning ability.
The second layer is safety and alignment: will this model do what I want without doing things I do not want? This is where red-teaming, adversarial testing, and constitutional AI evaluations live. How does the model handle requests to help with harmful content? Does it refuse appropriately, or refuse too aggressively? Does it have problematic biases in how it treats different groups? This layer is less visible in public benchmarks but is increasingly where labs spend the most internal engineering effort.
The third layer is task-specific performance: can this model do the particular thing I need it to do, and do it well? This is where general benchmarks become inadequate. A model that aces MMLU might still write mediocre marketing copy, give imprecise legal summaries, or fail to extract the information you need from a contract. Task-specific evals are the ones you usually have to build yourself — and they are the most meaningful signal for any real deployment decision.
What Model Cards Tell Us
If you look at the model cards that labs publish alongside their releases — OpenAI's for GPT-4o, Anthropic's for Claude 3.5 Sonnet, Google's for Gemini 1.5 Pro — you will see all three layers represented, but not equally weighted. General capability numbers sit prominently in the comparison tables. Safety evaluations get their own dedicated section, often running to several pages. Task-specific benchmarks appear selectively: coding benchmarks for models positioned toward developers, medical benchmarks for models pitched at healthcare applications.
What is interesting is where each lab chooses to place its emphasis. These are not just aesthetic choices. They reflect each lab's theory of what its audience most needs to know, and in some cases, what the lab is most proud of.
The Benchmark Landscape
For text-based language models, the evaluation toolbox is relatively mature. General capability benchmarks like MMLU have been around since 2021. Coding benchmarks like HumanEval and SWE-bench have become increasingly important as models are applied to software engineering. SWE-bench in particular is worth noting: rather than testing whether a model can write code, it tests whether a model can autonomously resolve real GitHub issues — a much harder and more realistic proxy for practical utility.
| Benchmark | What it tests | Layer | Notes |
|---|---|---|---|
| MMLU (Massive Multitask Language Understanding) | Knowledge across 57 academic subjects | Capability | Most referenced general benchmark; increasingly contaminated by training |
| ARC-Challenge (AI2 Reasoning Challenge) | Scientific reasoning at grade-school level | Capability | Now mostly saturated by frontier models; less differentiating |
| MATH | Competition-level mathematics | Capability | Still differentiates frontier from mid-tier models meaningfully |
| HumanEval | Code generation from docstrings | Capability | Single-function scope; SWE-bench is a more realistic test |
| SWE-bench Verified | Autonomous resolution of real GitHub issues | Capability / Task | Multi-step agentic coding; harder and more predictive of real use |
| GPQA Diamond (Graduate-Level Google-Proof Q&A) | Expert-level science Q&A written by PhD researchers | Capability | Designed to resist contamination; remains a meaningful signal |
| TruthfulQA | Factual accuracy and hallucination resistance | Safety | Tests calibration; can be overfitted through targeted training |
| FID / CLIP score (Fréchet Inception Distance / Contrastive Language-Image Pre-training) | Image quality and text-image alignment | Capability | Narrow technical measures; human preference studies still required |
| GAIA (General AI Assistants) | Multi-step research tasks: web browsing, file manipulation, reasoning | Task | Purpose-built for agentic models; growing adoption |
| TerminalBench | Terminal task completion in shell and file system environments | Task | Purpose-built for agentic models; still maturing as a benchmark |
As you move away from text, the evaluation landscape becomes thinner and less reliable. For image generation, metrics like FID and CLIP scores exist, but they measure narrow technical properties rather than whether the image actually looks good or correctly depicts the scene you described. The field largely falls back on human preference studies: show raters pairs of images, ask which is better, aggregate the results.
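The aggregation step in those preference studies can be as simple as a win-rate computation over pairwise judgments. A minimal sketch, with hypothetical model names and rater data:

```python
from collections import defaultdict

def win_rates(judgments):
    """Aggregate pairwise preference judgments into per-model win rates.

    judgments: list of (model_a, model_b, winner) tuples, where winner
    is one of the two models shown. Ties are simply omitted here.
    """
    wins = defaultdict(int)
    comparisons = defaultdict(int)
    for a, b, winner in judgments:
        comparisons[a] += 1
        comparisons[b] += 1
        wins[winner] += 1
    return {m: wins[m] / comparisons[m] for m in comparisons}

# Hypothetical rater data: each tuple is one side-by-side comparison.
judgments = [
    ("model_x", "model_y", "model_x"),
    ("model_x", "model_y", "model_x"),
    ("model_x", "model_y", "model_y"),
]
print(win_rates(judgments))  # model_x wins 2 of its 3 comparisons
```

Real studies typically go further, fitting a Bradley-Terry or Elo model so scores are comparable across many model pairs, but the raw input is the same: thousands of human "which is better?" votes.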
Video is even harder. When OpenAI launched Sora, the evaluation was almost entirely qualitative — curated demo clips, not benchmark numbers. This was not a failure of rigor; it reflected the honest state of video generation metrics, which largely cannot capture what makes a video impressive or useful. Human preference data was the only credible signal.
For agentic models, benchmarks like GAIA and TerminalBench are emerging, but agent evaluation carries its own structural challenges. An agent's performance depends not just on the model itself but on the tools it has access to, the quality of its system prompt, the reliability of the APIs it calls, and the specific context it is operating in. Two agents with the same underlying model can perform very differently depending on configuration. Benchmarking the model in isolation tells you something — but not how the deployed system will actually perform.
Why Evals Are Imperfect Signals
The most important thing to understand about benchmarks is that they degrade over time. If the evaluation criteria are static and public, then models can be — and are — trained on them. What starts as a meaningful test of generalization becomes, over several training iterations, closer to a memorized dataset. Researchers call this "benchmark contamination," and it is endemic to the field.
A model that scores 90% on MMLU in 2025 is not necessarily more capable than one that scored 85% in 2023. The benchmark may simply be closer to its training distribution. This is why new, harder benchmarks keep appearing — ARC-AGI, GPQA, Humanity's Last Exam — each trying to stay one step ahead of models that are increasingly good at acing tests that have already been published.
Notice what happens as you move away from text toward images, video, and agents: the field falls back on human preference data faster, automated metrics become less trustworthy, and the benchmarks that exist tend to measure narrow technical properties rather than the thing you actually care about. The unsexy truth of AI evaluation is that thousands of hours of human annotation — real people rating outputs — remain the bedrock of both training and evaluation for anything that is not text. The same human judgment that powered RLHF in the early days of ChatGPT is still the gold standard for evaluating whether a generated video looks good, whether an agent completed a task correctly, or whether a curated digest actually reflects what its reader cares about.
For agents and multimodal outputs, building your own eval framework is not just a reasonable choice. In many cases, it is the only rigorous one.
So What Should You Actually Use?
My professor at Kellogg, Josh D'Arcy, raised a point in a seminar that I keep coming back to: he is skeptical about looking at benchmark X to decide which model to use. Not because the benchmarks are worthless — they are useful for shortlisting and for verifying that basic thresholds are met. But they are generic by design. If you are building a legal research tool, MMLU does not tell you how well the model reads contracts. If you are building a customer support bot, HumanEval does not tell you how well the model handles edge-case refund requests.
The practical implication is that some evals should always be in place. Safety and alignment evals are non-negotiable regardless of use case — you want to know that a model will not behave in harmful ways before you deploy it anywhere. But once you have crossed that baseline, picking a model based on generic capability benchmarks alone is not best practice. Everyone is optimizing for something different. And there is a price-performance dimension that benchmarks rarely capture.
In my own project — building an agentic daily digest that curates a personalised morning newsletter — I went through several model iterations. In v1, I used free-tier open-source models via OpenRouter: Llama 3.3 70B, DeepSeek R1, Gemma. Some of these technically benchmark comparably to or better than Claude Haiku on certain general capability tasks. But in practice, Haiku won, and it was not close. Free-tier models were flaky: rate limits kicked in mid-pipeline, model availability varied, and the structured output I needed was not consistently reliable. Haiku cost $0.018 per day but delivered consistent quality and fast inference. The right model for a production task is not always the one with the highest benchmark score — it is the one that performs your specific task reliably, at a price you can sustain.
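The price-performance arithmetic is worth making explicit, because it is what benchmark tables leave out. A back-of-envelope sketch — the per-token prices and token counts below are illustrative assumptions, not my actual billing data:

```python
# Illustrative per-million-token prices; check your provider's current pricing.
PRICE_PER_MTOK_IN = 0.80   # assumed $/million input tokens
PRICE_PER_MTOK_OUT = 4.00  # assumed $/million output tokens

def daily_cost(input_tokens, output_tokens):
    """Estimate one day's API spend for a pipeline run."""
    return (input_tokens * PRICE_PER_MTOK_IN
            + output_tokens * PRICE_PER_MTOK_OUT) / 1_000_000

# e.g. ~15k input tokens and ~1.5k output tokens per daily run
print(f"${daily_cost(15_000, 1_500):.4f} per day")  # → $0.0180 per day
```

At this scale, a model that is twice as expensive per token but never retries a failed run can easily be the cheaper option overall.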
Use public benchmarks to shortlist models and verify that safety thresholds are met. Then build your own task-specific eval to make the final selection. The benchmark gets you to a shortlist of candidates; your custom eval tells you which one to actually deploy — and whether it is actually working once it is live.
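A task-specific eval does not need to be elaborate to be useful. A minimal sketch of the loop, where `model_fn` and `judge_fn` are placeholders for your own API calls (the judge returns a score in [0, 1] against your rubric):

```python
def run_eval(model_fn, judge_fn, cases, threshold=0.8):
    """Run each test case through the model, score the output with a
    judge against the case's rubric, and report the overall pass rate.

    model_fn and judge_fn are stand-ins for your own model and judge
    calls; cases is a list of dicts with "id", "input", and "rubric".
    """
    results = []
    for case in cases:
        output = model_fn(case["input"])
        score = judge_fn(case["input"], output, case["rubric"])
        results.append({"id": case["id"], "score": score,
                        "passed": score >= threshold})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results
```

The same harness works at selection time (swap `model_fn` across candidates) and in production (run it on a sample of live outputs and watch the pass rate over time).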
For text-based applications with a clear success criterion, building a custom eval is tractable. For image, video, and agent outputs, it is harder but still necessary — and in many cases, human evaluation remains the only rigorous option. A thousand automated metrics will not tell you whether your users find the generated content useful. Only your users can.
In the next post, I walk through exactly how I built a custom evaluation framework for my digest agent: a judge model that scores every digest against my own calibrated rubric, what it cost, what it revealed, and why even at small scale, getting evals right takes more work than you might expect.