Evaluation is how you measure the quality of your agents’ conversations consistently — using an AI judge, human reviewers, or both. It lives in the dashboard under the Analytics section, where the tabs are Overview, Metrics, Evaluations, Runs, and Calibration. This page is a reference. Whenever you hit a term anywhere in the evaluation docs and want the exact meaning, scan the table below, then jump to that term’s section for a plain-language line followed by a precise one.

Quick summary

| Term | In one line |
| --- | --- |
| Metric | A thing you measure about a conversation (e.g. “Task Resolution”); groups one or more criteria. |
| Criterion (atom) | A single yes/no check inside a metric. |
| Question | The exact yes/no question the AI judge answers about the conversation. |
| Expected value | The answer (true or false) that means “good / compliant”. |
| Deterministic vs judged | A hard, reproducible rule vs a call the LLM makes. |
| Outcome | One verdict’s value: true, false, abstain, or na. |
| Compliant | The outcome matched the expected value. |
| Compliant rate | compliant ÷ answered (abstain and na excluded). |
| Evaluation | A reusable bundle of metrics + judge config + instructions. |
| Instructions | Free text the judge reads as scoring context. |
| Evaluation run | One assessment of one conversation by one assessor. |
| Assessor | Who produced the verdicts: ai or human. |
| Criterion result | The verdict for one criterion in one run (outcome + quote + confidence + reasoning). |
| Scoring mode (scored_by) | A metric’s capability badge: auto, hybrid, or human_only. |
| Calibration set | A frozen set of conversations used to measure agreement rigorously. |
| Rater / assignment | A person (or the AI) who grades a calibration set, and the work assigned to them. |
| Agreement | How often two assessors give the same verdict. |
| Cohen’s κ | A chance-corrected agreement score. |
| Gate 1 / Gate 2 / proxy | Human↔human / AI↔human / internal↔customer agreement gates. |
| Golden labels | A held-out set of known-correct verdicts used as a regression check. |
| Graduation | Explicitly turning on auto-scoring for a metric once its gates pass. |
| Modality | Whether a metric reads text or audio. |

Metric

Plain: a single thing you want to measure about a conversation — for example “Task Resolution” or “Information Accuracy”. Precise: a metric groups one or more criteria and carries a metric_kind, a modality (text or audio), and a scored_by status (its scoring mode). You view per-metric results as cards in the Analytics Overview tab.

Criterion (atom)

Plain: one specific yes/no check inside a metric. Criteria are sometimes called “atoms” because they’re the smallest unit that gets a verdict. Precise: a criterion holds a question, an expected value, an optional applies-when rule, and a flag for whether it’s deterministic or judged. Every evaluation run produces one criterion result per criterion.

Question

Plain: the exact yes/no question that gets answered about the conversation. Precise: this text is what the AI judge actually reads, so any hard rule — for instance a channel rule like “only count this for phone calls” — must be written into the question itself, phrased to stay channel-agnostic. Put rules in the question; put supporting facts in the evaluation’s instructions.

Applies-when (na rule)

Plain: an optional condition that says “skip this criterion when it doesn’t apply”. Precise: when the applies-when condition is not met (that is, the na rule triggers), the criterion resolves to the na outcome instead of being scored, and it is excluded from the compliant rate.

Deterministic vs judged criterion

Plain: a deterministic criterion is a hard rule that always gives the same answer; a judged criterion is one the AI decides. Precise: a deterministic criterion is reproducible by construction — it never needs the LLM and never needs calibration. A judged criterion is decided by the AI judge (or a human). A metric made entirely of deterministic criteria auto-certifies the moment you create it; a metric with any judged criterion has to earn trust through calibration before it can graduate.
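
As a rough illustration of the criterion fields and the auto-certify rule described above, here is a minimal Python sketch. The class name, field names, and the auto_certifies helper are hypothetical and chosen for readability; they are not the product’s schema or API, and the example question about call length is invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Criterion:
    """One yes/no check. Field names are illustrative, not the product schema."""
    question: str
    expected_value: bool                 # the answer that means "good / compliant"
    deterministic: bool                  # True = hard rule, False = judged
    applies_when: Optional[str] = None   # optional applicability rule


def auto_certifies(criteria: list[Criterion]) -> bool:
    """True only when the metric has criteria and every one is deterministic."""
    return bool(criteria) and all(c.deterministic for c in criteria)


# Invented example criteria: one hard rule, one judged check.
rule = Criterion("Did the call last under 10 minutes?", True, deterministic=True)
judged = Criterion("Did the agent give wrong information?", False, deterministic=False)

only_rules = auto_certifies([rule])        # every criterion is deterministic
mixed = auto_certifies([rule, judged])     # one judged criterion blocks auto-certify
```

A single judged criterion is enough to send the whole metric through calibration, which is why the check uses all() rather than any().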

Outcome

Plain: the value of a single verdict — what the judge or reviewer answered for one criterion. Precise: an outcome is exactly one of:

| Outcome | Meaning |
| --- | --- |
| true | The answer to the criterion’s question is yes. |
| false | The answer is no. |
| abstain | The assessor couldn’t decide. |
| na | The criterion doesn’t apply to this conversation (e.g. its applies-when rule wasn’t met, or an audio metric has no recording). |

Both abstain and na are non-answers: they are excluded from the compliant rate denominator.

Expected value

Plain: the answer that counts as “good”. Precise: a criterion’s expected_value is true or false — whichever verdict means the conversation was good or compliant. It’s set per criterion because some questions are phrased positively and some negatively (see the worked example below).

Compliant & compliant rate

Plain: a verdict is “compliant” when it matches the expected answer. The compliant rate is the share of answered verdicts that came out good. Precise: compliant = the outcome equals the criterion’s expected value. The compliant rate shown on dashboard cards is:
compliant rate = compliant ÷ answered
answered        = verdicts that are true OR false   (abstain and na are NOT counted)
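
The formula above can be sketched in a few lines of Python. This is an illustration only: the function name and the (outcome, expected_value) tuple shape are assumptions made for the example, not the dashboard’s implementation.

```python
from typing import Optional

ANSWERED = ("true", "false")  # abstain and na are non-answers


def compliant_rate(verdicts: list[tuple[str, bool]]) -> Optional[float]:
    """verdicts holds (outcome, expected_value) pairs.

    Returns None when nothing was answered, so there is no rate to report.
    """
    answered = [(o, e) for o, e in verdicts if o in ANSWERED]
    if not answered:
        return None
    # A verdict is compliant when its outcome equals the expected value.
    compliant = sum((o == "true") == e for o, e in answered)
    return compliant / len(answered)


verdicts = [
    ("true", True),      # compliant
    ("false", False),    # compliant: "false" was the expected answer
    ("true", False),     # not compliant
    ("abstain", True),   # excluded from the denominator
    ("na", True),        # excluded from the denominator
]
rate = compliant_rate(verdicts)  # 2 compliant out of 3 answered
```

Returning None for an empty denominator is a design choice for this sketch; it keeps “no answered verdicts” distinct from a genuine 0% rate.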

Worked example: why a “false” verdict can be compliant

Take the criterion “Did the agent give wrong information?” Wrong information is bad, so the good answer is “no” — meaning expected_value = false.
Question:        "Did the agent give wrong information?"
expected_value:  false
Verdict outcome: false   ->  matches expected_value  ->  COMPLIANT  (good!)
Verdict outcome: true    ->  agent gave wrong info   ->  NOT compliant
So a false verdict here is the good result. Always read compliant as “matched expected value”, not as “the verdict was true”.

Evaluation

Plain: a reusable bundle you point at conversations to score them. Precise: an evaluation bundles a set of metrics, the judge configuration, and a free-text instructions field. You run an evaluation against one or more conversations to produce runs. Build and edit evaluations in the Evaluations tab.

Instructions

Plain: notes you give the AI judge so it knows what “correct” looks like. Precise: free text the judge reads as scoring context — not as the question. A common use is pasting the agent’s knowledge base or scoring guidance so the judge knows what counts as in-scope or accurate. The judge anchors on each criterion’s question and treats instructions as supporting context, so put hard rules in the question and put facts in instructions. Editable in the Evaluation builder.

Evaluation run

Plain: one scoring of one conversation by one assessor. Precise: a run is a single assessment of one conversation by one assessor. Its status is pending, completed, or failed. AI runs are queued and processed asynchronously. You can have multiple runs on the same conversation — for example an AI run and a human run that confirms or corrects it.

Assessor

Plain: who produced the verdicts — the AI or a person. Precise: a run’s assessor_type is either ai (the AI judge) or human (a reviewer). In the Overview tab, the assessor toggle (“AI + Human”, “AI judge”, “Human”) chooses whose verdicts you’re currently looking at.

The AI judge

Plain: a top-tier LLM that reads the conversation and answers each criterion. Precise: for each criterion the judge reads the question (plus any na rule) and the evaluation’s instructions, then returns an outcome, a verbatim quote of evidence, a confidence, and reasoning. It runs asynchronously. For audio-modality metrics, if the conversation has no recording the verdict auto-resolves to na.
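
To make the flow above concrete, the following Python sketch shows one way the per-criterion step could look, including the audio short-circuit to na. Everything here is an assumption for illustration: CriterionResult, assess, and the judge callable are hypothetical names, and the real judge is an LLM call rather than the stub used below.

```python
from dataclasses import dataclass
from typing import Callable, Optional

OUTCOMES = ("true", "false", "abstain", "na")


@dataclass(frozen=True)
class CriterionResult:
    outcome: str
    quote: Optional[str]          # verbatim evidence from the transcript
    confidence: Optional[float]
    reasoning: str


def assess(
    question: str,
    instructions: str,
    modality: str,                # "text" or "audio"
    has_recording: bool,
    judge: Callable[[str, str], CriterionResult],
) -> CriterionResult:
    # An audio metric without a recording resolves to na without calling the judge.
    if modality == "audio" and not has_recording:
        return CriterionResult("na", None, None, "No recording for an audio metric.")
    result = judge(question, instructions)
    if result.outcome not in OUTCOMES:
        raise ValueError(f"unexpected outcome: {result.outcome!r}")
    return result


def fake_judge(question: str, instructions: str) -> CriterionResult:
    """Stand-in for the LLM judge, returning a fixed, invented verdict."""
    return CriterionResult("false", "Your order ships Monday.", 0.9, "No incorrect statement found.")


skipped = assess("Did the agent give wrong information?", "", "audio", False, fake_judge)
scored = assess("Did the agent give wrong information?", "", "text", False, fake_judge)
```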

Criterion result

Plain: the recorded verdict for one criterion within one run, with its evidence. Precise: a criterion result holds the outcome, a verbatim quote of the supporting evidence from the transcript, a confidence, and reasoning. One run has one criterion result per criterion in the evaluation’s metrics.

Scoring mode (scored_by)

Plain: a per-metric badge that says how much you trust the AI to score this metric on its own. Precise: scored_by is a status shown as a badge with three values:

| Badge | Meaning |
| --- | --- |
| human_only | The AI judge is not yet trusted to score this metric automatically. |
| hybrid | The AI judge scores it, with human spot-checks. |
| auto | The AI judge is trusted to score it automatically. |

The badge is the metric’s scoring capability, not the data you’re currently viewing. Both AI and human runs can exist for a human_only metric: the assessor toggle controls what you view, and the badge controls what the metric is allowed to do. The badge has a popover explaining its measured gates, blockers, and what it would take to graduate.

Calibration set

Plain: a frozen group of conversations you use to rigorously measure whether the AI judge agrees with humans. Precise: a calibration set freezes a fixed set of conversations, assigns multiple raters, and computes agreement. Each rater grades every conversation as a new, isolated run, separate from any other evaluation — so the measurement isn’t contaminated by existing verdicts. Set these up in the Calibration tab.
For a quick check without any setup, every evaluation also has an “AI vs Human” tab that computes Gate-2 κ on the fly over the AI and human runs that already exist. It takes the most-recent run per conversation and excludes na.
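
The selection rule used by the “AI vs Human” tab (most-recent run per conversation, na excluded) can be sketched as follows for a single criterion. The Verdict shape, the integer created_at field, and the latest_pairs helper are hypothetical. How abstain is treated is not stated here, so this sketch leaves abstain verdicts in as their own label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    conversation_id: str
    assessor_type: str   # "ai" or "human"
    created_at: int      # hypothetical timestamp; larger means newer
    outcome: str         # "true", "false", "abstain", or "na"


def latest_pairs(verdicts: list[Verdict]) -> list[tuple[str, str]]:
    """Return (ai_outcome, human_outcome) pairs, one per conversation."""
    latest: dict[tuple[str, str], Verdict] = {}
    for v in verdicts:
        key = (v.conversation_id, v.assessor_type)
        if key not in latest or v.created_at > latest[key].created_at:
            latest[key] = v  # keep only the most recent run per assessor type

    pairs = []
    for conversation_id in sorted({conv for conv, _ in latest}):
        ai = latest.get((conversation_id, "ai"))
        human = latest.get((conversation_id, "human"))
        if ai is None or human is None:
            continue  # need both sides to compare
        if "na" in (ai.outcome, human.outcome):
            continue  # na is excluded from the comparison
        pairs.append((ai.outcome, human.outcome))
    return pairs


runs = [
    Verdict("c1", "ai", 1, "true"),
    Verdict("c1", "ai", 2, "false"),     # newer AI run replaces the first
    Verdict("c1", "human", 1, "false"),
    Verdict("c2", "ai", 1, "na"),        # dropped: na
    Verdict("c2", "human", 1, "true"),
    Verdict("c3", "ai", 1, "true"),      # dropped: no human run
    Verdict("c4", "ai", 1, "true"),
    Verdict("c4", "human", 5, "true"),
]
pairs = latest_pairs(runs)
```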

Rater & assignment

Plain: a rater is someone (or the AI) doing the grading; an assignment is the work given to them. Precise: in a calibration set, each rater grades every frozen conversation as its own isolated run. You need at least two human raters to measure Gate 1; one human plus one AI rater gives you Gate 2 only.

Agreement

Plain: how often two assessors land on the same verdict. Precise: agreement compares two assessors’ outcomes on the same conversations. Raw % agreement is shown but is misleading on its own — see Cohen’s κ.

Cohen’s κ (kappa)

Plain: an agreement score that corrects for lucky guesses. Precise: when outcomes are skewed — say the answer is almost always the same — two assessors can agree 80%+ purely by chance, so raw % overstates trust. Cohen’s κ corrects for that chance agreement. The dashboard also shows Gwet AC1, Krippendorff α, and prevalence (the share of true). Interpret κ with the Landis–Koch bands:

| Cohen’s κ | Interpretation |
| --- | --- |
| ≥ 0.80 | Almost perfect |
| 0.60 – 0.80 | Substantial |
| 0.40 – 0.60 | Moderate |
| 0.20 – 0.40 | Fair |
| < 0.20 | Roughly chance |

A metric can show a high % agreement while κ is near 0; that is the trap κ is there to catch.
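
To see the trap in numbers, here is a small Python sketch of the standard two-rater Cohen’s κ formula, κ = (p_o − p_e) / (1 − p_e), plus a helper that maps κ onto the bands in the table. The function names are made up for the example, the data is invented, and treating each band’s lower bound as inclusive is an assumption, since the table’s ranges share their endpoints.

```python
from collections import Counter


def percent_agreement(a: list[str], b: list[str]) -> float:
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Two-rater Cohen's kappa over paired labels."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(a)
    observed = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # both raters used one identical label; kappa is undefined
    return (observed - expected) / (1 - expected)


def band(kappa: float) -> str:
    """Map kappa to the bands above (lower bounds inclusive, by assumption)."""
    if kappa >= 0.80:
        return "Almost perfect"
    if kappa >= 0.60:
        return "Substantial"
    if kappa >= 0.40:
        return "Moderate"
    if kappa >= 0.20:
        return "Fair"
    return "Roughly chance"


# Skewed outcomes: each rater says "true" 19 times out of 20,
# but their single "false" lands on different conversations.
rater_a = ["true"] * 18 + ["false", "true"]
rater_b = ["true"] * 18 + ["true", "false"]

agreement = percent_agreement(rater_a, rater_b)  # 18 of 20 verdicts match
kappa = cohens_kappa(rater_a, rater_b)           # slightly below zero
```

Here the raters match on 90% of conversations, yet κ comes out just under 0 because agreeing on “true” is what chance alone would produce when nearly every verdict is “true”.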

Gate 1 / Gate 2 / proxy

Plain: the three agreement checks that decide whether the AI judge is trustworthy. Precise:

| Gate | Compares | Needs |
| --- | --- | --- |
| Gate 1 | human ↔ human | at least 2 human raters |
| Gate 2 | AI ↔ human | 1 human + the AI |
| Proxy | internal rater ↔ customer rater | an internal and a customer rater |

Agreement cards on the calibration page come in two groupings, pooled per metric (“by metric”) and per criterion (“by criterion”); each card shows κ, AC1, α, % agreement, prevalence, and a threshold bar.
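
The “Needs” column can be read as a simple rule over the raters in a calibration set. The Python sketch below is one hedged way to express it; the Rater class, the party field, and the gate labels are hypothetical, and it assumes the proxy gate compares human raters from the two parties, which the table does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rater:
    assessor_type: str        # "ai" or "human"
    party: str = "internal"   # hypothetical field: "internal" or "customer"


def measurable_gates(raters: list[Rater]) -> set[str]:
    """Which agreement gates this mix of raters can produce."""
    humans = [r for r in raters if r.assessor_type == "human"]
    has_ai = any(r.assessor_type == "ai" for r in raters)

    gates = set()
    if len(humans) >= 2:
        gates.add("gate_1")   # human vs human
    if humans and has_ai:
        gates.add("gate_2")   # AI vs human
    # Assumption: proxy needs one internal and one customer human rater.
    if {"internal", "customer"} <= {r.party for r in humans}:
        gates.add("proxy")
    return gates


one_human_plus_ai = measurable_gates([Rater("human"), Rater("ai")])
two_humans_plus_ai = measurable_gates([Rater("human"), Rater("human"), Rater("ai")])
```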

Golden labels

Plain: a small set of verdicts you already know are correct, kept aside as a sanity check. Precise: golden labels are a held-out set of known-correct verdicts used as a regression check during graduation — they catch a judge that would otherwise pass the gates but get the known cases wrong.

Graduation

Plain: the deliberate act of turning on auto-scoring for a metric once it has earned trust. Precise: computing agreement on a calibration set refreshes each target metric’s eligibility (its gates, blockers, and “what it would take”) but never auto-graduates it. If the gates pass — for example Gate-2 κ at or above the threshold — the metric shows as eligible, and you then graduate it explicitly from its Status / Graduation panel. Its scored_by then flips to auto or hybrid. Nothing scores automatically until a human clicks Graduate. Conversely, a metric whose gates later fail is demoted automatically — losing trust needs no confirmation. Fully deterministic metrics skip all of this and auto-certify on creation.
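
The asymmetry described above (promotion is manual, demotion is automatic) can be sketched as a tiny state holder in Python. The class and method names are hypothetical, and demoting to human_only is an assumption; the text says only that a failing metric is demoted, not which badge it lands on.

```python
from dataclasses import dataclass


@dataclass
class MetricStatus:
    scored_by: str = "human_only"
    eligible: bool = False

    def refresh_eligibility(self, gates_pass: bool) -> None:
        """Called after agreement is computed. Never promotes the metric."""
        self.eligible = gates_pass
        if not gates_pass and self.scored_by != "human_only":
            # Losing trust needs no confirmation.
            # Assumption: the demotion target is human_only.
            self.scored_by = "human_only"

    def graduate(self, target: str) -> None:
        """The explicit, human-triggered step."""
        if target not in ("auto", "hybrid"):
            raise ValueError("target must be 'auto' or 'hybrid'")
        if not self.eligible:
            raise RuntimeError("metric is not eligible: its gates have not passed")
        self.scored_by = target


status = MetricStatus()
status.refresh_eligibility(gates_pass=True)
after_refresh = status.scored_by          # still human_only: eligible, not graduated
status.graduate("hybrid")
after_graduate = status.scored_by
status.refresh_eligibility(gates_pass=False)
after_failure = status.scored_by          # demoted without anyone clicking anything
```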

Modality

Plain: whether a metric reads the conversation’s text or its audio. Precise: a metric’s modality is text or audio. For an audio metric, if the conversation has no recording, the AI judge auto-resolves the verdict to na (so it never counts against the compliant rate).

Permissions

Viewing evaluation data requires the assessments:read scope. Creating or editing metrics and evaluations, triggering runs, and graduating metrics all require assessments:manage.
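
If you need to mirror these rules in your own tooling, a lookup table is usually enough. In the Python sketch below the action names and the is_allowed helper are hypothetical; only the two scope strings come from this page. It deliberately does not assume that assessments:manage implies assessments:read, because that is not stated here.

```python
# Action names are made up for this example; the scope strings are from the docs.
REQUIRED_SCOPE = {
    "view_evaluation_data": "assessments:read",
    "edit_metric": "assessments:manage",
    "edit_evaluation": "assessments:manage",
    "trigger_run": "assessments:manage",
    "graduate_metric": "assessments:manage",
}


def is_allowed(action: str, granted_scopes: set[str]) -> bool:
    """True when the caller holds the scope the action requires."""
    return REQUIRED_SCOPE[action] in granted_scopes


reader = {"assessments:read"}
can_view = is_allowed("view_evaluation_data", reader)
can_graduate = is_allowed("graduate_metric", reader)
```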

Next steps

Quickstart

Build your first metric and score a conversation end to end.

Creating metrics & criteria

Write good questions and set expected values that score the way you mean.

Calibration

Measure whether the AI judge agrees with your reviewers.