When your agents handle hundreds or thousands of conversations, you can’t read them all to know whether they’re doing a good job. Spot-checking a handful by hand is slow, inconsistent, and impossible to repeat.

Evaluation solves that: it measures the quality of your agent conversations consistently and at scale, so you can answer questions like “Did the agent resolve the caller’s task?” or “Did it ever give out wrong information?” across every conversation, not just the few you happened to open.

You measure quality two ways, and you can combine them:
  • An AI judge — a top-tier large language model that reads a conversation and answers your quality questions automatically.
  • Human reviewers — people on your team who grade conversations themselves.
Both produce the same kind of result, so you can compare them directly. That comparison is what eventually lets you trust the AI judge to score on its own.

Where it lives

Everything below lives in the Analytics section of the dashboard. Inside Analytics you’ll find these tabs:
| Tab | What it’s for |
| --- | --- |
| Overview | The scoreboard: compliant rates and counts for each metric. |
| Metrics | Define what you measure (metrics and their criteria). |
| Evaluations | Bundle metrics together and configure the judge. |
| Runs | Trigger evaluations and inspect individual results. |
| Calibration | Decide whether the AI judge is trustworthy. |

The mental model

The most important thing to learn first is the hierarchy — how the pieces nest. Read this once and the rest of the section will make sense.
Metric  ────────────────  "Task Resolution"
  └─ Criterion (atom) ──── "Did the agent resolve the caller's request?" → expects true
  └─ Criterion (atom) ──── "Did the agent give wrong info?"            → expects false

Evaluation  ────────────  bundles one or more Metrics + judge instructions

  └─ run against conversations

        ├─ Run  ─────────  one conversation, one assessor (AI or human)
        │     └─ Verdict per criterion: outcome + quote + confidence
        └─ Run  ─────────  another conversation, or the same one by a human
Here’s each term, defined the first time you’ll meet it:
| Term | What it is |
| --- | --- |
| Metric | A thing you measure about a conversation, such as “Task Resolution” or “Information Accuracy”. A metric groups one or more criteria. |
| Criterion (also called an atom) | A single yes/no check inside a metric. It has a question (the exact yes/no question asked about the conversation) and an expected_value (true or false, whichever answer means “good”). A criterion can be deterministic (a hard, reproducible rule) or judged (decided by the AI). |
| Outcome | The answer to a criterion’s question on one conversation: true, false, abstain (the judge couldn’t decide), or na (the criterion doesn’t apply here). |
| Compliant | When the outcome matches the criterion’s expected_value. So if a criterion expects false (like “Did the agent give wrong info?”), a false outcome is the good result. |
| Evaluation | A reusable bundle of metrics plus a judge configuration and a free-text instructions field. You run an evaluation against conversations. |
| Run | One assessment of one conversation by one assessor. The assessor is either the AI judge (ai) or a human reviewer (human). |
| Verdict (criterion result) | The result for one criterion within one run: an outcome, a verbatim quote of the supporting evidence, a confidence, and reasoning. |
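
To make the nesting concrete, here is a minimal Python sketch of the terms defined above. It is illustrative only: the class names, field names, and the `is_compliant` helper are assumptions made for this example, not the product's actual schema or API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Literal, Optional

# Hypothetical type aliases mirroring the values named in the table.
Outcome = Literal["true", "false", "abstain", "na"]
Assessor = Literal["ai", "human"]


@dataclass
class Criterion:
    question: str            # the exact yes/no question
    expected_value: bool     # the answer that means "good"
    kind: Literal["deterministic", "judged"] = "judged"


@dataclass
class Metric:
    name: str
    criteria: list[Criterion]


@dataclass
class Verdict:
    criterion: Criterion
    outcome: Outcome
    quote: str = ""                      # verbatim supporting evidence
    confidence: Optional[float] = None
    reasoning: str = ""

    def is_compliant(self) -> Optional[bool]:
        """True/False for answered verdicts, None for abstain or na."""
        if self.outcome in ("abstain", "na"):
            return None
        return (self.outcome == "true") == self.criterion.expected_value


@dataclass
class Run:
    conversation_id: str
    assessor: Assessor
    verdicts: list[Verdict] = field(default_factory=list)


# A criterion that expects false: a "false" outcome is the good result.
wrong_info = Criterion("Did the agent give wrong info?", expected_value=False)
task_resolution = Metric("Task Resolution", [wrong_info])

run = Run("conv-1", "ai", [Verdict(wrong_info, outcome="false")])
print(run.verdicts[0].is_compliant())  # True
```

The point of the sketch is that compliance is derived from two values, the outcome and the criterion's expected_value, rather than stored as a separate answer.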
A criterion expecting false flips the intuition: “false is good.” Throughout the dashboard, the compliant rate is what matters, not the raw share of true answers. Compliant rate = compliant ÷ answered, where answered counts only true and false verdicts; abstain and na are left out of the denominator.
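
The compliant-rate formula can be sketched in a few lines of Python. The function name `compliant_rate` and the choice to return `None` when nothing was answered are assumptions for this example; the dashboard may present that case differently.

```python
from typing import Iterable, Optional


def compliant_rate(outcomes: Iterable[str], expected_value: bool) -> Optional[float]:
    """compliant / answered, where answered counts only true and false."""
    expected = "true" if expected_value else "false"
    # abstain and na are excluded from the denominator.
    answered = [o for o in outcomes if o in ("true", "false")]
    if not answered:
        return None  # assumption: no rate when nothing was answered
    compliant = sum(1 for o in answered if o == expected)
    return compliant / len(answered)


# Criterion "Did the agent give wrong info?" expects false.
outcomes = ["false", "false", "true", "abstain", "na", "false"]
rate = compliant_rate(outcomes, expected_value=False)
print(rate)  # 0.75
```

Here four verdicts are answered (three false, one true), so the compliant rate is 3 ÷ 4 = 0.75, even though only one of six raw outcomes is true.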

AI judge vs. human reviewers

The same conversation can be scored by the AI judge and by a person, and each scoring is its own run. That’s deliberate: when you have an AI run and a human run for the same conversations, you can measure how often they agree. High agreement is your evidence that the AI judge is trustworthy enough to score on its own. Low agreement tells you the question needs sharpening or the judge needs more context.

Human grading is designed to be fast. It works “grade-by-exception”: every criterion starts on the no-violation (compliant) answer, and you only flag the exceptions. A human run can also confirm or correct an existing AI run.
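
As a rough illustration of "how often they agree", the sketch below computes simple percent agreement over the verdicts that both an AI run and a human run produced. This is an assumption made for the example: the text does not specify which agreement statistic Calibration uses, and the function name, the key layout, and the sample data are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical key: (conversation_id, criterion_id) -> outcome string.
Verdicts = Dict[Tuple[str, str], str]


def agreement_rate(ai: Verdicts, human: Verdicts) -> Optional[float]:
    """Share of verdicts scored by both assessors where the outcomes match."""
    shared = ai.keys() & human.keys()
    if not shared:
        return None  # nothing scored by both assessors yet
    matches = sum(1 for key in shared if ai[key] == human[key])
    return matches / len(shared)


ai_verdicts = {
    ("c1", "resolved"): "true",
    ("c2", "resolved"): "false",
    ("c3", "resolved"): "true",
    ("c4", "resolved"): "true",   # no human run yet, so it is ignored
}
human_verdicts = {
    ("c1", "resolved"): "true",
    ("c2", "resolved"): "true",   # the reviewer disagrees with the AI judge
    ("c3", "resolved"): "true",
}

print(agreement_rate(ai_verdicts, human_verdicts))
```

In this made-up data the two assessors match on two of the three shared verdicts. Only conversations scored by both count, which is why pairing AI and human runs on the same conversations matters.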

The end-to-end lifecycle

Putting it together, here’s the path from nothing to trusted, automatic scoring:
1. Define metrics
   In Metrics, create the metrics and criteria that capture what “good” means for your conversations. See Creating metrics & criteria.

2. Bundle into an evaluation
   In Evaluations, group the metrics you want to score together and add judge instructions (often your agent’s knowledge base, so the judge knows what “correct” means).

3. Run it
   From Runs, trigger the evaluation against one or many conversations, a pasted transcript, or an audio URL. Each conversation becomes its own run, processed in parallel.

4. Read the results
   In Overview, read compliant rates per metric and drill into the individual verdicts behind any number.

5. Calibrate
   In Calibration, compare AI and human runs to ask the real question: is the AI judge trustworthy for this metric?

6. Graduate
   Once a metric’s agreement clears the bar, graduate it to turn on automatic AI scoring. Graduation is explicit: it happens only when you click.

You don’t have to do all six steps before you get value. Defining one metric and running it against a handful of conversations already tells you something. Calibration and graduation are how you scale that up with confidence.

Start here

  • Quickstart: The fastest path: create a metric, run an evaluation, and read your first results.
  • Core concepts: A deeper tour of metrics, criteria, outcomes, runs, and verdicts.
  • Creating metrics & criteria: Write good yes/no criteria and choose the right expected value.