Applied AI Thinking for Operators · Evaluation Series · Part 1 of 2

How Do We Know If an AI Model Is Actually Good?

The gap between benchmark scores and real-world performance is wider than the leaderboards suggest. Here is how to think about it.

Every week, a new model drops. The announcement follows a familiar template: state-of-the-art on benchmark X, beats previous SOTA on benchmark Y, human-level performance on Z. The numbers are real. The benchmarks are real. And yet, if you have spent any time actually deploying these models on real tasks, you have probably had the experience of picking the "best" model by the numbers and being quietly disappointed by the results.

This is the central tension of AI evaluation: the gap between what we can measure and what we actually care about. Understanding that gap is the starting point for thinking clearly about how to evaluate AI for your own use case. This post covers the landscape of how models are evaluated today, where the methods work well, where they break down, and what that means for anyone choosing or building with AI in a production setting.

The Three Layers of Model Evaluation

Model evaluation is not one thing. It is really three overlapping concerns, each asking a different question about a model.

The first is general capability: is this model broadly intelligent? Can it reason, write, follow complex instructions, solve novel problems? This is what most public benchmarks measure. MMLU (Massive Multitask Language Understanding) tests knowledge across 57 academic subjects. ARC-Challenge tests scientific reasoning. MATH tests mathematical problem-solving. HumanEval and SWE-bench test coding. These benchmarks are designed to be hard enough that a model cannot just pattern-match its way through them — they are meant to probe something like general reasoning ability.

The second layer is safety and alignment: will this model do what I want without doing things I do not want? This is where red-teaming, adversarial testing, and constitutional AI evaluations live. How does the model handle requests to help with harmful content? Does it refuse appropriately, or refuse too aggressively? Does it have problematic biases in how it treats different groups? This layer is less visible in public benchmarks but is increasingly where labs spend the most internal engineering effort.

The third layer is task-specific performance: can this model do the particular thing I need it to do, and do it well? This is where general benchmarks become inadequate. A model that aces MMLU might still write mediocre marketing copy, give imprecise legal summaries, or fail to extract the information you need from a contract. Task-specific evals are the ones you usually have to build yourself — and they are the most meaningful signal for any real deployment decision.

How the Weight of Each Evaluation Layer Shifts by Model Type

| Layer | Base / General LLM (GPT-4o, Claude, Gemini) | Specialized LLM (fine-tuned / domain models) | Agentic Model (multi-step, tool-using) |
|---|---|---|---|
| Layer 1: General Capability (MMLU, ARC, HumanEval) | Primary signal: core selection criteria | Table stakes: verify threshold is met | Table stakes: already assumed capable |
| Layer 2: Safety and Alignment (red-teaming, TruthfulQA) | Non-negotiable: labs invest heavily here | Non-negotiable: domain-specific risks apply | Non-negotiable: agentic actions carry real risk |
| Layer 3: Task-Specific (custom evals, domain tests) | Not applicable: no deployment context yet | Primary signal: domain task performance | Most critical signal: general benchmarks cannot measure this |
Layer 1 (general capability) is the primary selection signal for base models but becomes table stakes for agents. Layer 3 (task-specific) barely matters for a base model selection — there is no deployment context yet — but becomes the most critical signal for any deployed agentic system. Layer 2 (safety and alignment) is non-negotiable at every level.

What Model Cards Tell Us

If you look at the model cards that labs publish alongside their releases — OpenAI's for GPT-4o, Anthropic's for Claude 3.5 Sonnet, Google's for Gemini 1.5 Pro — you will see all three layers represented, but not equally weighted. General capability numbers sit prominently in the comparison tables. Safety evaluations get their own dedicated section, often running to several pages. Task-specific benchmarks appear selectively: coding benchmarks for models positioned toward developers, medical benchmarks for models pitched at healthcare applications.

What is interesting is where each lab chooses to place its emphasis. These are not just aesthetic choices. They reflect each lab's theory of what its audience most needs to know, and in some cases, what the lab is most proud of.

Model Card Comparison: Structure and Emphasis by Lab

| | OpenAI · GPT-4o (May 2024) | Anthropic · Claude 3.5 Sonnet (Oct 2024) | Google · Gemini 1.5 Pro (Feb 2024) |
|---|---|---|---|
| Emphasis | Capability-first | Safety co-equal | Domain and scale |
| MMLU (5-shot) | 88.7% | 88.7% | 85.9% |
| MATH (0-shot) | 76.6% | 71.1% | 67.7% |
| HumanEval (0-shot) | 90.2% | 92.0% | 84.1% |
| GPQA Diamond | 53.6% | 59.4% | 46.2% |
| Safety and alignment | TruthfulQA, BBQ bias evals, red-team results listed but not foregrounded | Constitutional AI, RSP policy, CBRN refusals, persuasion resistance in a dedicated section | Domain and multimodal benchmarks foregrounded: medical QA, multilingual, video understanding, long context |
| Approx. page allocation | Capability 70% / Safety 30% | Capability 50% / Safety 50% | Capability 40% / Safety 20% / Domain 40% |
Simplified model card excerpts based on publicly documented evaluations. Bar allocations are approximate representations of relative page emphasis, not exact measurements. Anthropic gives safety evaluations co-equal prominence with capability benchmarks — both segments run to comparable depth — consistent with its Constitutional AI and Responsible Scaling Policy commitments. OpenAI leads with raw capability numbers. Google foregrounds domain-specific and multimodal benchmarks alongside general capability, reflecting its positioning across consumer and enterprise verticals. All three labs publish all three layers — the difference is what gets headlined.

The Benchmark Landscape

For text-based language models, the evaluation toolbox is relatively mature. General capability benchmarks like MMLU have been around since 2021. Coding benchmarks like HumanEval and SWE-bench have become increasingly important as models are applied to software engineering. SWE-bench in particular is worth noting: rather than testing whether a model can write code, it tests whether a model can autonomously resolve real GitHub issues — a much harder and more realistic proxy for practical utility.

| Benchmark | What It Tests | Layer | Notes |
|---|---|---|---|
| MMLU (Massive Multitask Language Understanding) | Knowledge across 57 academic subjects | Capability | Most referenced general benchmark; increasingly contaminated by training |
| ARC-Challenge (AI2 Reasoning Challenge) | Scientific reasoning at grade-school level | Capability | Now mostly saturated by frontier models; less differentiating |
| MATH | Competition-level mathematics | Capability | Still differentiates frontier from mid-tier models meaningfully |
| HumanEval | Code generation from docstrings | Capability | Single-function scope; SWE-bench is a more realistic test |
| SWE-bench Verified | Autonomous resolution of real GitHub issues | Capability / Task | Multi-step agentic coding; harder and more predictive of real use |
| GPQA Diamond (Graduate-Level Google-Proof Q&A) | Expert-level science Q&A written by PhD researchers | Capability | Designed to resist contamination; remains a meaningful signal |
| TruthfulQA | Factual accuracy and hallucination resistance | Safety | Tests calibration; can be overfitted through targeted training |
| FID / CLIP score (Fréchet Inception Distance / Contrastive Language-Image Pre-training) | Image quality and text-image alignment | Capability | Narrow technical measures; human preference studies still required |
| GAIA (General AI Assistants) | Multi-step research tasks: web browsing, file manipulation, reasoning | Task | Purpose-built for agentic models; growing adoption |
| TerminalBench | Terminal task completion in shell and file system environments | Task | Purpose-built for agentic models; still maturing as a benchmark |

As you move away from text, the evaluation landscape becomes thinner and less reliable. For image generation, metrics like FID and CLIP scores exist, but they measure narrow technical properties rather than whether the image actually looks good or correctly depicts the scene you described. The field largely falls back on human preference studies: show raters pairs of images, ask which is better, aggregate the results.
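The aggregation step in such preference studies is simple enough to sketch: count how often each model wins its pairwise comparisons. A minimal version, with model names and judgments purely illustrative:

```python
from collections import Counter

def win_rates(pairwise_judgments):
    """Aggregate pairwise preference judgments into per-model win rates.

    Each judgment is a (model_a, model_b, winner) tuple, where winner
    is one of the two model names.
    """
    wins = Counter()
    appearances = Counter()
    for model_a, model_b, winner in pairwise_judgments:
        appearances[model_a] += 1
        appearances[model_b] += 1
        wins[winner] += 1
    return {model: wins[model] / appearances[model] for model in appearances}

# Illustrative: three raters compare outputs from two image models
judgments = [
    ("model_x", "model_y", "model_x"),
    ("model_x", "model_y", "model_x"),
    ("model_x", "model_y", "model_y"),
]
print(win_rates(judgments))  # model_x wins 2 of 3 comparisons
```

In practice, labs use more sophisticated aggregation (Elo or Bradley-Terry style ratings, as on public arenas), but the raw input is the same: humans picking winners, one pair at a time.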

Video is even harder. When OpenAI launched Sora, the evaluation was almost entirely qualitative — curated demo clips, not benchmark numbers. This was not a failure of rigor; it reflected the honest state of video generation metrics, which largely cannot capture what makes a video impressive or useful. Human preference data was the only credible signal.

For agentic models, benchmarks like GAIA and TerminalBench are emerging, but agent evaluation carries its own structural challenges. An agent's performance depends not just on the model itself but on the tools it has access to, the quality of its system prompt, the reliability of the APIs it calls, and the specific context it is operating in. Two agents with the same underlying model can perform very differently depending on configuration. Benchmarking the model in isolation tells you something — but not how the deployed system will actually perform.
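One practical response is to evaluate the deployed system rather than the bare model: run your full agent configuration against a fixed set of tasks, each with a programmatic success check. A minimal sketch, where the agent runner and tasks are hypothetical stand-ins for your own setup:

```python
def evaluate_agent(run_agent, tasks):
    """Run an agent configuration against tasks and report the success rate.

    run_agent: callable taking a task prompt and returning the agent's output.
    tasks: list of (prompt, check) pairs, where check(output) -> bool.
    """
    results = []
    for prompt, check in tasks:
        try:
            output = run_agent(prompt)
            results.append(check(output))
        except Exception:
            # Tool or API failures count as task failures: reliability
            # is part of what you are measuring.
            results.append(False)
    return sum(results) / len(results)

# Illustrative: a stub agent and two tasks with programmatic checks
stub_agent = lambda prompt: "42" if "answer" in prompt else ""
tasks = [
    ("What is the answer?", lambda out: out == "42"),
    ("Summarize the file.", lambda out: len(out) > 0),
]
print(evaluate_agent(stub_agent, tasks))  # 0.5 with this stub
```

The point of the harness is that it scores the whole configuration: swap the model, the system prompt, or a tool, and the same task set tells you whether the deployed system got better or worse.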

Why Evals Are Imperfect Signals

The most important thing to understand about benchmarks is that they degrade over time. If the evaluation criteria are static and public, then models can be — and are — trained on them. What starts as a meaningful test of generalization becomes, over several training iterations, closer to a memorized dataset. Researchers call this "benchmark contamination," and it is endemic to the field.

A model that scores 90% on MMLU in 2025 is not necessarily more capable than one that scored 85% in 2023. The benchmark may simply be closer to its training distribution. This is why new, harder benchmarks keep appearing — ARC-AGI, GPQA, Humanity's Last Exam — each trying to stay one step ahead of models that are increasingly good at acing tests that have already been published.
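A crude but common contamination screen is n-gram overlap: flag benchmark items that share long word sequences with the training corpus. A minimal sketch (real contamination audits are more involved, with normalization and fuzzy matching):

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_items, corpus_text, n=8):
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    corpus_grams = ngrams(corpus_text, n)
    flagged = sum(1 for item in benchmark_items
                  if ngrams(item, n) & corpus_grams)
    return flagged / len(benchmark_items)

# Illustrative, with a short n for the toy example
items = ["the quick brown fox jumps", "an entirely novel question here"]
corpus = "training text containing the quick brown fox among other things"
print(contamination_rate(items, corpus, n=3))  # one of the two items overlaps
```

The deeper problem is that this only catches verbatim leakage; paraphrased test items slip through, which is why new benchmarks keep being written faster than old ones can be cleaned.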

The Contamination Problem in Practice
The contamination problem is most acute for general capability benchmarks, where test sets are public and fixed. Safety evals are somewhat more resistant, because red-teaming is iterative and adversarial — new attack vectors keep being discovered. Task-specific evals, especially custom ones, are the most robust, because models have never seen them. As you move from general to task-specific, benchmarks become less gameable and more meaningful.

Notice what happens as you move away from text toward images, video, and agents: the field falls back on human preference data faster, automated metrics become less trustworthy, and the benchmarks that exist tend to measure narrow technical properties rather than the thing you actually care about. The unsexy truth of AI evaluation is that thousands of hours of human annotation — real people rating outputs — remain the bedrock of both training and evaluation for anything that is not text. The same human judgment that powered RLHF in the early days of ChatGPT is still the gold standard for evaluating whether a generated video looks good, whether an agent completed a task correctly, or whether a curated digest actually reflects what its reader cares about.

For agents and multimodal outputs, building your own eval framework is not just a reasonable choice. In many cases, it is the only rigorous one.

So What Should You Actually Use?

My professor at Kellogg, Josh D'Arcy, made a point in a seminar that I keep coming back to: be skeptical of choosing a model because it tops benchmark X. Not because the benchmarks are worthless — they are useful for shortlisting and for verifying that basic thresholds are met. But they are generic by design. If you are building a legal research tool, MMLU does not tell you how well the model reads contracts. If you are building a customer support bot, HumanEval does not tell you how well the model handles edge-case refund requests.

The practical implication is that some evals should always be in place. Safety and alignment evals are non-negotiable regardless of use case — you want to know that a model will not behave in harmful ways before you deploy it anywhere. But once you have crossed that baseline, picking a model based on generic capability benchmarks alone is not best practice. Everyone is optimizing for something different. And there is a price-performance dimension that benchmarks rarely capture.

In my own project — building an agentic daily digest that curates a personalized morning newsletter — I went through several model iterations. In v1, I used free-tier open-source models via OpenRouter: Llama 3.3 70B, DeepSeek R1, Gemma. Some of these technically benchmark comparably to or better than Claude Haiku on certain general capability tasks. But in practice, Haiku won, and it was not close. Free-tier models were flaky: rate limits kicked in mid-pipeline, model availability varied, and the structured output I needed was not consistently reliable. Haiku cost $0.018 per day but delivered consistent quality and fast inference. The right model for a production task is not always the one with the highest benchmark score — it is the one that performs your specific task reliably, at a price you can sustain.
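That reliability gap is itself measurable. One of the simplest task-specific evals you can run before committing to a model is structured-output reliability: how often does it return valid JSON with the fields your pipeline needs? A sketch, where `call_model` is a placeholder for whatever client you use (OpenRouter, a lab SDK, etc.):

```python
import json

def structured_output_reliability(call_model, prompts, required_keys):
    """Fraction of prompts for which the model returns valid JSON
    containing all required keys.

    call_model: placeholder callable taking a prompt, returning a string.
    """
    valid = 0
    for prompt in prompts:
        try:
            parsed = json.loads(call_model(prompt))
            if all(key in parsed for key in required_keys):
                valid += 1
        except (json.JSONDecodeError, TypeError):
            pass  # malformed output counts against reliability
    return valid / len(prompts)

# Illustrative stub: valid JSON for one prompt, garbage for the other
stub = lambda p: '{"title": "t", "summary": "s"}' if "good" in p else "not json"
print(structured_output_reliability(stub, ["good prompt", "bad prompt"],
                                    ["title", "summary"]))  # 0.5
```

Run the same prompt set against each candidate model a few dozen times and the flakiness that benchmarks never surface shows up immediately in the numbers.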

The Practical Takeaway

Use public benchmarks to shortlist models and verify that safety thresholds are met. Then build your own task-specific eval to make the final selection. The benchmark gets you to a shortlist of candidates; your custom eval tells you which one to actually deploy — and whether it is actually working once it is live.
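This workflow reduces to a few lines of selection logic. A sketch, where the model names, scores, and the 0.80 shortlisting threshold are all illustrative:

```python
def select_model(candidates, benchmark_scores, safety_pass, custom_eval_score):
    """Shortlist by public benchmark and a safety gate,
    then pick the winner by task-specific eval score.
    """
    shortlist = [m for m in candidates
                 if benchmark_scores[m] >= 0.80 and safety_pass[m]]
    return max(shortlist, key=lambda m: custom_eval_score[m])

candidates = ["model_a", "model_b", "model_c"]
benchmark = {"model_a": 0.91, "model_b": 0.86, "model_c": 0.72}
safety = {"model_a": True, "model_b": True, "model_c": True}
custom = {"model_a": 0.61, "model_b": 0.83, "model_c": 0.90}

# model_c tops the custom eval but misses the benchmark threshold;
# model_a tops the benchmark but loses on the task that matters.
print(select_model(candidates, benchmark, safety, custom))  # model_b
```

Note that neither the benchmark leader nor the custom-eval leader wins: the benchmark gates, the custom eval decides.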

For text-based applications with a clear success criterion, building a custom eval is tractable. For image, video, and agent outputs, it is harder but still necessary — and in many cases, human evaluation remains the only rigorous option. A thousand automated metrics will not tell you whether your users find the generated content useful. Only your users can.

In the next post, I walk through exactly how I built a custom evaluation framework for my digest agent: a judge model that scores every digest against my own calibrated rubric, what it cost, what it revealed, and why even at small scale, getting evals right takes more work than you might expect.