    AI Evaluation (Eval)

    The discipline of measuring whether an AI system is actually doing the job correctly.

    Also known as: LLM eval · Model evaluation · AI testing

    What is AI Evaluation (Eval)?

    AI evaluation, usually shortened to 'evals', is how you measure the quality of LLM output on a defined task. Evals can be deterministic (a regex match or an exact-answer check) or LLM-judged (one model grades another model's output). Without evals you cannot tell whether a prompt change made things better or worse, whether a model upgrade is safe to ship, or whether your agent is regressing in production. Frameworks such as Anthropic Evals, OpenAI Evals, Promptfoo, and Inspect AI are widely used as of 2026.
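
    To make the two grading styles concrete, here is a minimal sketch in plain Python. It does not use the API of any framework named above. Every name in it (Case, exact_match, regex_match, make_llm_judge, run_eval, the stub models and the stub judge) is hypothetical and chosen for illustration, and the "models" are hard-coded lookup tables standing in for real LLM calls.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    prompt: str
    expected: str  # exact answer, or a regex pattern when used with regex_match


Model = Callable[[str], str]          # prompt -> model output
Grader = Callable[[str, Case], bool]  # (output, case) -> pass/fail


def exact_match(output: str, case: Case) -> bool:
    """Deterministic grader: case-insensitive exact answer."""
    return output.strip().lower() == case.expected.strip().lower()


def regex_match(output: str, case: Case) -> bool:
    """Deterministic grader: case.expected is treated as a regex pattern."""
    return re.search(case.expected, output) is not None


def make_llm_judge(judge: Model) -> Grader:
    """LLM-judged grader: a second model is asked to reply PASS or FAIL."""
    def grade(output: str, case: Case) -> bool:
        verdict = judge(
            f"Question: {case.prompt}\n"
            f"Reference: {case.expected}\n"
            f"Answer: {output}\n"
            "Reply PASS or FAIL."
        )
        return verdict.strip().upper().startswith("PASS")
    return grade


def run_eval(model: Model, cases: List[Case], grader: Grader) -> float:
    """Return the fraction of cases the model passes under the given grader."""
    passed = sum(grader(model(c.prompt), c) for c in cases)
    return passed / len(cases)


# --- Stand-ins for real model calls (assumptions, not a real API) ---
def baseline_model(prompt: str) -> str:
    answers = {"2 + 2 = ?": "4", "Capital of France?": "paris"}
    return answers.get(prompt, "I don't know")


def candidate_model(prompt: str) -> str:
    answers = {"2 + 2 = ?": "4", "Capital of France?": "Paris",
               "Largest planet?": "Jupiter"}
    return answers.get(prompt, "I don't know")


def stub_judge(grading_prompt: str) -> str:
    """Fake judge: passes when the reference text appears in the answer."""
    fields = dict(line.split(": ", 1)
                  for line in grading_prompt.splitlines() if ": " in line)
    ok = fields["Reference"].lower() in fields["Answer"].lower()
    return "PASS" if ok else "FAIL"


CASES = [
    Case("2 + 2 = ?", "4"),
    Case("Capital of France?", "Paris"),
    Case("Largest planet?", "Jupiter"),
]

if __name__ == "__main__":
    for name, model in [("baseline", baseline_model),
                        ("candidate", candidate_model)]:
        det = run_eval(model, CASES, exact_match)
        judged = run_eval(model, CASES, make_llm_judge(stub_judge))
        print(f"{name}: exact-match={det:.2f} llm-judged={judged:.2f}")
```

    Running the same fixed set of cases against the baseline and the candidate is what turns "did this change help?" into a number you can compare. In a real setup the stub functions would be replaced by calls to the model under test and to a separate judge model.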