Applied AI Thinking for Operators · Evaluation Series · Part 1 of 2

How Do We Know If an AI Model Is Actually Good?

The gap between benchmark scores and real-world performance is wider than the leaderboards suggest. Here is how to think about it.

Every week, a new model drops. The announcement follows a familiar template: state-of-the-art on benchmark X, beats previous SOTA on benchmark Y, human-level performance on Z. The numbers are real. The benchmarks are real. And yet, if you have spent any time actually deploying these models on real tasks, you have probably had the experience of picking the "best" model by the numbers and being quietly disappointed by the results.

This is the central tension of AI evaluation: the gap between what we can measure and what we actually care about. Understanding that gap is the starting point for thinking clearly about how to evaluate AI for your own use case. This post covers the landscape of how models are evaluated today, where the methods work well, where they break down, and what that means for anyone choosing or building with AI in a production setting.

The Three Layers of Model Evaluation

Model evaluation is not one thing. It is really three overlapping concerns, each asking a different question about a model.

The first is general capability: is this model broadly intelligent? Can it reason, write, follow complex instructions, solve novel problems? This is what most public benchmarks measure. MMLU (Massive Multitask Language Understanding) tests knowledge across 57 academic subjects. ARC-Challenge tests scientific reasoning. MATH tests mathematical problem-solving. HumanEval and SWE-bench test coding. These benchmarks are designed to be hard enough that a model cannot just pattern-match its way through them — they are meant to probe something like general reasoning ability.

The second layer is safety and alignment: will this model do what I want without doing things I do not want? This is where red-teaming, adversarial testing, and constitutional AI evaluations live. How does the model handle requests to help with harmful content? Does it refuse appropriately, or refuse too aggressively? Does it have problematic biases in how it treats different groups? This layer is less visible in public benchmarks but is increasingly where labs spend the most internal engineering effort.

The third layer is task-specific performance: can this model do the particular thing I need it to do, and do it well? This is where general benchmarks become inadequate. A model that aces MMLU might still write mediocre marketing copy, give imprecise legal summaries, or fail to extract the information you need from a contract. Task-specific evals are the ones you usually have to build yourself — and they are the most meaningful signal for any real deployment decision.

How the Weight of Each Evaluation Layer Shifts by Model Type

| Layer | Base / General LLM (GPT-4o, Claude, Gemini) | Specialized LLM (fine-tuned / domain models) | Agentic Model (multi-step, tool-using) |
|---|---|---|---|
| Layer 1: General Capability (MMLU, ARC, HumanEval) | Primary signal: core selection criteria | Table stakes: verify threshold is met | Table stakes: already assumed capable |
| Layer 2: Safety and Alignment (red-teaming, TruthfulQA) | Non-negotiable: labs invest heavily here | Non-negotiable: domain-specific risks apply | Non-negotiable: agentic actions carry real risk |
| Layer 3: Task-Specific (custom evals, domain tests) | Not applicable: no deployment context yet | Primary signal: domain task performance | Most critical signal: general benchmarks cannot measure this |
Layer 1 (general capability) is the primary selection signal for base models but becomes table stakes for agents. Layer 3 (task-specific) barely matters for a base model selection — there is no deployment context yet — but becomes the most critical signal for any deployed agentic system. Layer 2 (safety and alignment) is non-negotiable at every level.

What Model Cards Tell Us

If you look at the model cards that labs publish alongside their releases — OpenAI's for GPT-4o, Anthropic's for Claude 3.5 Sonnet, Google's for Gemini 1.5 Pro — you will see all three layers represented, but not equally weighted. General capability numbers sit prominently in the comparison tables. Safety evaluations get their own dedicated section, often running to several pages. Task-specific benchmarks appear selectively: coding benchmarks for models positioned toward developers, medical benchmarks for models pitched at healthcare applications.

What is interesting is where each lab chooses to place its emphasis. These are not just aesthetic choices. They reflect each lab's theory of what its audience most needs to know, and in some cases, what the lab is most proud of.

Model Card Comparison: Structure and Emphasis by Lab

| | OpenAI · GPT-4o (May 2024) | Anthropic · Claude 3.5 Sonnet (Oct 2024) | Google · Gemini 1.5 Pro (Feb 2024) |
|---|---|---|---|
| Emphasis | Capability-first | Safety co-equal | Domain and scale |
| MMLU (5-shot) | 88.7% | 88.7% | 85.9% |
| MATH (0-shot) | 76.6% | 71.1% | 67.7% |
| HumanEval (0-shot) | 90.2% | 92.0% | 84.1% |
| GPQA Diamond | 53.6% | 59.4% | 46.2% |
| Safety and alignment | TruthfulQA, BBQ bias evals, red-team results listed but not foregrounded | Constitutional AI, RSP policy, CBRN refusals, persuasion resistance in a dedicated section | Domain and multimodal benchmarks foregrounded: medical QA, multilingual, video understanding, long context |
| Approx. page allocation | Capability 70% / Safety 30% | Capability 50% / Safety 50% | Capability 40% / Safety 20% / Domain 40% |
Simplified model card excerpts based on publicly documented evaluations. Bar allocations are approximate representations of relative page emphasis, not exact measurements. Anthropic gives safety evaluations co-equal prominence with capability benchmarks — both segments run to comparable depth — consistent with its Constitutional AI and Responsible Scaling Policy commitments. OpenAI leads with raw capability numbers. Google foregrounds domain-specific and multimodal benchmarks alongside general capability, reflecting its positioning across consumer and enterprise verticals. All three labs publish all three layers — the difference is what gets headlined.

The Benchmark Landscape

For text-based language models, the evaluation toolbox is relatively mature. General capability benchmarks like MMLU have been around since 2021. Coding benchmarks like HumanEval and SWE-bench have become increasingly important as models are applied to software engineering. SWE-bench in particular is worth noting: rather than testing whether a model can write code, it tests whether a model can autonomously resolve real GitHub issues — a much harder and more realistic proxy for practical utility.

| Benchmark | What It Tests | Layer | Notes |
|---|---|---|---|
| MMLU (Massive Multitask Language Understanding) | Knowledge across 57 academic subjects | Capability | Most referenced general benchmark; increasingly contaminated by training |
| ARC-Challenge (AI2 Reasoning Challenge) | Scientific reasoning at grade-school level | Capability | Now mostly saturated by frontier models; less differentiating |
| MATH | Competition-level mathematics | Capability | Still differentiates frontier from mid-tier models meaningfully |
| HumanEval | Code generation from docstrings | Capability | Single-function scope; SWE-bench is a more realistic test |
| SWE-bench Verified | Autonomous resolution of real GitHub issues | Capability / Task | Multi-step agentic coding; harder and more predictive of real use |
| GPQA Diamond (Graduate-Level Google-Proof Q&A) | Expert-level science Q&A written by PhD researchers | Capability | Designed to resist contamination; remains a meaningful signal |
| TruthfulQA | Factual accuracy and hallucination resistance | Safety | Tests calibration; can be overfitted through targeted training |
| FID / CLIP score (Fréchet Inception Distance / Contrastive Language-Image Pre-training) | Image quality and text-image alignment | Capability | Narrow technical measures; human preference studies still required |
| GAIA (General AI Assistants) | Multi-step research tasks: web browsing, file manipulation, reasoning | Task | Purpose-built for agentic models; growing adoption |
| TerminalBench | Terminal task completion in shell and file system environments | Task | Purpose-built for agentic models; still maturing as a benchmark |

As you move away from text, the evaluation landscape becomes thinner and less reliable. For image generation, metrics like FID and CLIP scores exist, but they measure narrow technical properties rather than whether the image actually looks good or correctly depicts the scene you described. The field largely falls back on human preference studies: show raters pairs of images, ask which is better, aggregate the results.
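The aggregation step in such preference studies is simple enough to sketch: count how often each model wins its pairwise comparisons. A minimal version, with model names and judgments purely illustrative:

```python
from collections import Counter

def win_rates(pairwise_judgments):
    """Aggregate pairwise preference judgments into per-model win rates.

    Each judgment is a (model_a, model_b, winner) tuple, where winner
    is one of the two model names.
    """
    wins = Counter()
    appearances = Counter()
    for model_a, model_b, winner in pairwise_judgments:
        appearances[model_a] += 1
        appearances[model_b] += 1
        wins[winner] += 1
    return {model: wins[model] / appearances[model] for model in appearances}

# Illustrative: three raters compare outputs from two image models
judgments = [
    ("model_x", "model_y", "model_x"),
    ("model_x", "model_y", "model_x"),
    ("model_x", "model_y", "model_y"),
]
print(win_rates(judgments))  # model_x wins 2 of 3 comparisons
```

In practice, labs use more sophisticated aggregation (Elo or Bradley-Terry style ratings, as on public arenas), but the raw input is the same: humans picking winners, one pair at a time.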

Video is even harder. When OpenAI launched Sora, the evaluation was almost entirely qualitative — curated demo clips, not benchmark numbers. This was not a failure of rigor; it reflected the honest state of video generation metrics, which largely cannot capture what makes a video impressive or useful. Human preference data was the only credible signal.

For agentic models, benchmarks like GAIA and TerminalBench are emerging, but agent evaluation carries its own structural challenges. An agent's performance depends not just on the model itself but on the tools it has access to, the quality of its system prompt, the reliability of the APIs it calls, and the specific context it is operating in. Two agents with the same underlying model can perform very differently depending on configuration. Benchmarking the model in isolation tells you something — but not how the deployed system will actually perform.
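One practical response is to evaluate the deployed system rather than the bare model: run your full agent configuration against a fixed set of tasks, each with a programmatic success check. A minimal sketch, where the agent runner and tasks are hypothetical stand-ins for your own setup:

```python
def evaluate_agent(run_agent, tasks):
    """Run an agent configuration against tasks and report the success rate.

    run_agent: callable taking a task prompt and returning the agent's output.
    tasks: list of (prompt, check) pairs, where check(output) -> bool.
    """
    results = []
    for prompt, check in tasks:
        try:
            output = run_agent(prompt)
            results.append(check(output))
        except Exception:
            # Tool or API failures count as task failures: reliability
            # is part of what you are measuring.
            results.append(False)
    return sum(results) / len(results)

# Illustrative: a stub agent and two tasks with programmatic checks
stub_agent = lambda prompt: "42" if "answer" in prompt else ""
tasks = [
    ("What is the answer?", lambda out: out == "42"),
    ("Summarize the file.", lambda out: len(out) > 0),
]
print(evaluate_agent(stub_agent, tasks))  # 0.5 with this stub
```

The point of the harness is that it scores the whole configuration: swap the model, the system prompt, or a tool, and the same task set tells you whether the deployed system got better or worse.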

Why Evals Are Imperfect Signals

The most important thing to understand about benchmarks is that they degrade over time. If the evaluation criteria are static and public, then models can be — and are — trained on them. What starts as a meaningful test of generalization becomes, over several training iterations, closer to a memorized dataset. Researchers call this "benchmark contamination," and it is endemic to the field.

A model that scores 90% on MMLU in 2025 is not necessarily more capable than one that scored 85% in 2023. The benchmark may simply be closer to its training distribution. This is why new, harder benchmarks keep appearing — ARC-AGI, GPQA, Humanity's Last Exam — each trying to stay one step ahead of models that are increasingly good at acing tests that have already been published.
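A crude but common contamination screen is n-gram overlap: flag benchmark items that share long word sequences with the training corpus. A minimal sketch (real contamination audits are more involved, with normalization and fuzzy matching):

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_items, corpus_text, n=8):
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    corpus_grams = ngrams(corpus_text, n)
    flagged = sum(1 for item in benchmark_items
                  if ngrams(item, n) & corpus_grams)
    return flagged / len(benchmark_items)

# Illustrative, with a short n for the toy example
items = ["the quick brown fox jumps", "an entirely novel question here"]
corpus = "training text containing the quick brown fox among other things"
print(contamination_rate(items, corpus, n=3))  # one of the two items overlaps
```

The deeper problem is that this only catches verbatim leakage; paraphrased test items slip through, which is why new benchmarks keep being written faster than old ones can be cleaned.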

The Contamination Problem in Practice
The contamination problem is most acute for general capability benchmarks, where test sets are public and fixed. Safety evals are somewhat more resistant, because red-teaming is iterative and adversarial — new attack vectors keep being discovered. Task-specific evals, especially custom ones, are the most robust, because models have never seen them. As you move from general to task-specific, benchmarks become less gameable and more meaningful.

Notice what happens as you move away from text toward images, video, and agents: the field falls back on human preference data faster, automated metrics become less trustworthy, and the benchmarks that exist tend to measure narrow technical properties rather than the thing you actually care about. The unsexy truth of AI evaluation is that thousands of hours of human annotation — real people rating outputs — remain the bedrock of both training and evaluation for anything that is not text. The same human judgment that powered RLHF in the early days of ChatGPT is still the gold standard for evaluating whether a generated video looks good, whether an agent completed a task correctly, or whether a curated digest actually reflects what its reader cares about.

For agents and multimodal outputs, building your own eval framework is not just a reasonable choice. In many cases, it is the only rigorous one.

So What Should You Actually Use?

My professor at Kellogg, Josh D'Arcy, made a point in a seminar that I keep coming back to: be skeptical of choosing a model because it tops benchmark X. Not because the benchmarks are worthless — they are useful for shortlisting and for verifying that basic thresholds are met. But they are generic by design. If you are building a legal research tool, MMLU does not tell you how well the model reads contracts. If you are building a customer support bot, HumanEval does not tell you how well the model handles edge-case refund requests.

The practical implication is that some evals should always be in place. Safety and alignment evals are non-negotiable regardless of use case — you want to know that a model will not behave in harmful ways before you deploy it anywhere. But once you have crossed that baseline, picking a model based on generic capability benchmarks alone is not best practice. Everyone is optimizing for something different. And there is a price-performance dimension that benchmarks rarely capture.

In my own project — building an agentic daily digest that curates a personalized morning newsletter — I went through several model iterations. In v1, I used free-tier open-source models via OpenRouter: Llama 3.3 70B, DeepSeek R1, Gemma. Some of these technically benchmark comparably to or better than Claude Haiku on certain general capability tasks. But in practice, Haiku won, and it was not close. Free-tier models were flaky: rate limits kicked in mid-pipeline, model availability varied, and the structured output I needed was not consistently reliable. Haiku cost $0.018 per day but delivered consistent quality and fast inference. The right model for a production task is not always the one with the highest benchmark score — it is the one that performs your specific task reliably, at a price you can sustain.
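That reliability gap is itself measurable. One of the simplest task-specific evals you can run before committing to a model is structured-output reliability: how often does it return valid JSON with the fields your pipeline needs? A sketch, where `call_model` is a placeholder for whatever client you use (OpenRouter, a lab SDK, etc.):

```python
import json

def structured_output_reliability(call_model, prompts, required_keys):
    """Fraction of prompts for which the model returns valid JSON
    containing all required keys.

    call_model: placeholder callable taking a prompt, returning a string.
    """
    valid = 0
    for prompt in prompts:
        try:
            parsed = json.loads(call_model(prompt))
            if all(key in parsed for key in required_keys):
                valid += 1
        except (json.JSONDecodeError, TypeError):
            pass  # malformed output counts against reliability
    return valid / len(prompts)

# Illustrative stub: valid JSON for one prompt, garbage for the other
stub = lambda p: '{"title": "t", "summary": "s"}' if "good" in p else "not json"
print(structured_output_reliability(stub, ["good prompt", "bad prompt"],
                                    ["title", "summary"]))  # 0.5
```

Run the same prompt set against each candidate model a few dozen times and the flakiness that benchmarks never surface shows up immediately in the numbers.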

The Practical Takeaway

Use public benchmarks to shortlist models and verify that safety thresholds are met. Then build your own task-specific eval to make the final selection. The benchmark gets you to a shortlist of candidates; your custom eval tells you which one to actually deploy — and whether it is actually working once it is live.
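This workflow reduces to a few lines of selection logic. A sketch, where the model names, scores, and the 0.80 shortlisting threshold are all illustrative:

```python
def select_model(candidates, benchmark_scores, safety_pass, custom_eval_score):
    """Shortlist by public benchmark and a safety gate,
    then pick the winner by task-specific eval score.
    """
    shortlist = [m for m in candidates
                 if benchmark_scores[m] >= 0.80 and safety_pass[m]]
    return max(shortlist, key=lambda m: custom_eval_score[m])

candidates = ["model_a", "model_b", "model_c"]
benchmark = {"model_a": 0.91, "model_b": 0.86, "model_c": 0.72}
safety = {"model_a": True, "model_b": True, "model_c": True}
custom = {"model_a": 0.61, "model_b": 0.83, "model_c": 0.90}

# model_c tops the custom eval but misses the benchmark threshold;
# model_a tops the benchmark but loses on the task that matters.
print(select_model(candidates, benchmark, safety, custom))  # model_b
```

Note that neither the benchmark leader nor the custom-eval leader wins: the benchmark gates, the custom eval decides.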

For text-based applications with a clear success criterion, building a custom eval is tractable. For image, video, and agent outputs, it is harder but still necessary — and in many cases, human evaluation remains the only rigorous option. A thousand automated metrics will not tell you whether your users find the generated content useful. Only your users can.

In the next post, I walk through exactly how I built a custom evaluation framework for my digest agent: a judge model that scores every digest against my own calibrated rubric, what it cost, what it revealed, and why even at small scale, getting evals right takes more work than you might expect.