Quick summary
| Term | In one line |
|---|---|
| Metric | A thing you measure about a conversation (e.g. “Task Resolution”); groups one or more criteria. |
| Criterion (atom) | A single yes/no check inside a metric. |
| Question | The exact yes/no question the AI judge answers about the conversation. |
| Expected value | The answer (true or false) that means “good / compliant”. |
| Deterministic vs judged | A hard, reproducible rule vs a call the LLM makes. |
| Outcome | One verdict’s value: true, false, abstain, or na. |
| Compliant | The outcome matched the expected value. |
| Compliant rate | compliant ÷ answered (abstain and na excluded). |
| Evaluation | A reusable bundle of metrics + judge config + instructions. |
| Instructions | Free text the judge reads as scoring context. |
| Evaluation run | One assessment of one conversation by one assessor. |
| Assessor | Who produced the verdicts: ai or human. |
| Criterion result | The verdict for one criterion in one run (outcome + quote + confidence + reasoning). |
| Scoring mode (scored_by) | A metric’s capability badge: auto, hybrid, or human_only. |
| Calibration set | A frozen set of conversations used to measure agreement rigorously. |
| Rater / assignment | A person (or the AI) who grades a calibration set, and the grading work assigned to them. |
| Agreement | How often two assessors give the same verdict. |
| Cohen’s κ | A chance-corrected agreement score. |
| Gate 1 / Gate 2 / proxy | Human↔human / AI↔human / internal↔customer agreement gates. |
| Golden labels | A held-out set of known-correct verdicts used as a regression check. |
| Graduation | Explicitly turning on auto-scoring for a metric once its gates pass. |
| Modality | Whether a metric reads text or audio. |
Metric
Plain: a single thing you want to measure about a conversation, for example “Task Resolution” or “Information Accuracy”. Precise: a metric groups one or more criteria and carries a metric_kind, a modality (text or audio), and a scored_by status (its scoring mode). You view per-metric results as cards in the Analytics Overview tab.
Criterion (atom)
Plain: one specific yes/no check inside a metric. Criteria are sometimes called “atoms” because they’re the smallest unit that gets a verdict. Precise: a criterion holds a question, an expected value, an optional applies-when rule, and a flag for whether it’s deterministic or judged. Every evaluation run produces one criterion result per criterion.
Question
Plain: the exact yes/no question that gets answered about the conversation. Precise: this text is what the AI judge actually reads, so any hard rule (for instance a channel rule like “only count this for phone calls”) must be written into the question itself, phrased to stay channel-agnostic. Put rules in the question; put supporting facts in the evaluation’s instructions.
Applies-when (na rule)
Plain: an optional condition that says “skip this criterion when it doesn’t apply”. Precise: when the rule determines that the criterion doesn’t apply to a conversation, the criterion resolves to the na outcome instead of being scored, and it’s excluded from the compliant rate.
Deterministic vs judged criterion
Plain: a deterministic criterion is a hard rule that always gives the same answer; a judged criterion is one the AI decides. Precise: a deterministic criterion is reproducible by construction: it never needs the LLM and never needs calibration. A judged criterion is decided by the AI judge (or a human). A metric made entirely of deterministic criteria auto-certifies the moment you create it; a metric with any judged criterion has to earn trust through calibration before it can graduate.
Outcome
Plain: the value of a single verdict, meaning what the judge or reviewer answered for one criterion. Precise: an outcome is exactly one of:
| Outcome | Meaning |
|---|---|
| true | The answer to the criterion’s question is yes. |
| false | The answer is no. |
| abstain | The assessor couldn’t decide. |
| na | The criterion doesn’t apply to this conversation (e.g. its applies-when rule wasn’t met, or an audio metric has no recording). |
abstain and na are non-answers: they are excluded from the compliant rate denominator.
Expected value
Plain: the answer that counts as “good”. Precise: a criterion’s expected_value is true or false, whichever verdict means the conversation was good or compliant. It’s set per criterion because some questions are phrased positively and some negatively (see the worked example below).
Compliant & compliant rate
Plain: a verdict is “compliant” when it matches the expected answer. The compliant rate is the share of answered verdicts that came out good. Precise: compliant = the outcome equals the criterion’s expected value. The compliant rate shown on dashboard cards is: compliant rate = compliant ÷ answered, where answered counts only true and false outcomes (abstain and na are excluded).
Worked example: why a “false” verdict can be compliant
Take the criterion “Did the agent give wrong information?” Wrong information is bad, so the good answer is “no”, meaning expected_value = false.
A false verdict here is the good result. Always read compliant as “matched expected value”, not as “the verdict was true”.
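The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: the function names and the (outcome, expected_value) tuple shape are assumptions made for the example, and only the four outcome values and the compliant ÷ answered rule come from this page.

```python
from typing import Iterable, Optional, Tuple

ANSWERS = {"true", "false"}
NON_ANSWERS = {"abstain", "na"}  # excluded from the denominator


def is_compliant(outcome: str, expected_value: bool) -> Optional[bool]:
    """Return None for a non-answer, else whether the outcome matched expected_value."""
    if outcome in NON_ANSWERS:
        return None
    if outcome not in ANSWERS:
        raise ValueError(f"unknown outcome: {outcome!r}")
    return (outcome == "true") == expected_value


def compliant_rate(verdicts: Iterable[Tuple[str, bool]]) -> Optional[float]:
    """compliant / answered; None when nothing was answered."""
    results = [is_compliant(outcome, expected) for outcome, expected in verdicts]
    answered = [r for r in results if r is not None]
    if not answered:
        return None
    return sum(answered) / len(answered)


# Hypothetical verdicts as (outcome, expected_value) pairs.
verdicts = [
    ("false", False),   # "Did the agent give wrong information?" answered no: compliant
    ("true", False),    # wrong information was given: not compliant
    ("true", True),     # positively phrased criterion answered yes: compliant
    ("abstain", True),  # non-answer, excluded
    ("na", True),       # non-answer, excluded
]
rate = compliant_rate(verdicts)
print(f"{rate:.2f}")  # 2 compliant out of 3 answered
```

Returning None when nothing was answered avoids reporting 0% for a criterion that simply never applied; whether the dashboard does the same is not stated here.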
Evaluation
Plain: a reusable bundle you point at conversations to score them. Precise: an evaluation bundles a set of metrics, the judge configuration, and a free-text instructions field. You run an evaluation against one or more conversations to produce runs. Build and edit evaluations in the Evaluations tab.
Instructions
Plain: notes you give the AI judge so it knows what “correct” looks like. Precise: free text the judge reads as scoring context, not as the question. A common use is pasting the agent’s knowledge base or scoring guidance so the judge knows what counts as in-scope or accurate. The judge anchors on each criterion’s question and treats instructions as supporting context, so put hard rules in the question and put facts in instructions. Editable in the Evaluation builder.
Evaluation run
Plain: one scoring of one conversation by one assessor. Precise: a run is a single assessment of one conversation by one assessor. Its status is pending, completed, or failed. AI runs are queued and processed asynchronously. You can have multiple runs on the same conversation, for example an AI run and a human run that confirms or corrects it.
Assessor
Plain: who produced the verdicts, the AI or a person. Precise: a run’s assessor_type is either ai (the AI judge) or human (a reviewer). In the Overview tab, the assessor toggle (“AI + Human”, “AI judge”, “Human”) chooses whose verdicts you’re currently looking at.
The AI judge
Plain: a top-tier LLM that reads the conversation and answers each criterion. Precise: for each criterion the judge reads the question (plus any na rule) and the evaluation’s instructions, then returns an outcome, a verbatim quote of evidence, a confidence, and reasoning. It runs asynchronously. For audio-modality metrics, if the conversation has no recording the verdict auto-resolves to na.
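To make the shape of a verdict concrete, here is a hedged Python sketch of the four fields the judge returns, plus a sanity check that the quote really is verbatim. The class name, the field types, and the validation helper are illustrative assumptions. In particular, this page does not say how confidence is represented, so a 0 to 1 float is assumed.

```python
from dataclasses import dataclass
from typing import List

VALID_OUTCOMES = {"true", "false", "abstain", "na"}


@dataclass(frozen=True)
class JudgeVerdict:
    """Hypothetical container for what the judge returns for one criterion."""
    outcome: str       # one of VALID_OUTCOMES
    quote: str         # verbatim evidence copied from the transcript
    confidence: float  # assumed to be a 0..1 float; the real format is not documented here
    reasoning: str


def check_verdict(verdict: JudgeVerdict, transcript: str) -> List[str]:
    """Return a list of problems; an empty list means the verdict looks well-formed."""
    problems = []
    if verdict.outcome not in VALID_OUTCOMES:
        problems.append(f"unknown outcome {verdict.outcome!r}")
    # Only answered verdicts are expected to carry evidence in this sketch.
    if verdict.outcome in {"true", "false"} and verdict.quote not in transcript:
        problems.append("quote does not appear verbatim in the transcript")
    if not 0.0 <= verdict.confidence <= 1.0:
        problems.append("confidence is outside the assumed 0..1 range")
    return problems


transcript = "Agent: Your refund was issued on Monday. Customer: Great, thanks."
good = JudgeVerdict("true", "Your refund was issued on Monday", 0.9, "Agent confirmed the refund.")
paraphrased = JudgeVerdict("true", "The refund went out Monday", 0.9, "Agent confirmed the refund.")

print(check_verdict(good, transcript))
print(check_verdict(paraphrased, transcript))
```

A substring check is the simplest way to enforce "verbatim"; a real pipeline might normalise whitespace first.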
Criterion result
Plain: the recorded verdict for one criterion within one run, with its evidence. Precise: a criterion result holds the outcome, a verbatim quote of the supporting evidence from the transcript, a confidence, and reasoning. One run has one criterion result per criterion in the evaluation’s metrics.
Scoring mode (scored_by)
Plain: a per-metric badge that says how much you trust the AI to score this metric on its own. Precise: scored_by is a status shown as a badge with three values:
| Badge | Meaning |
|---|---|
| human_only | The AI judge is not yet trusted to score this metric automatically. |
| hybrid | The AI judge scores it, with human spot-checks. |
| auto | The AI judge is trusted to score it automatically. |
Calibration set
Plain: a frozen group of conversations you use to rigorously measure whether the AI judge agrees with humans. Precise: a calibration set freezes a fixed set of conversations, assigns multiple raters, and computes agreement. Each rater grades every conversation as a new, isolated run, separate from any other evaluation, so the measurement isn’t contaminated by existing verdicts. Set these up in the Calibration tab.
Rater & assignment
Plain: a rater is someone (or the AI) doing the grading; an assignment is the work given to them. Precise: in a calibration set, each rater grades every frozen conversation as its own isolated run. You need at least two human raters to measure Gate 1; one human plus one AI rater gives you Gate 2 only.
Agreement
Plain: how often two assessors land on the same verdict. Precise: agreement compares two assessors’ outcomes on the same conversations. Raw % agreement is shown but is misleading on its own; see Cohen’s κ.
Cohen’s κ (kappa)
Plain: an agreement score that corrects for lucky guesses. Precise: when outcomes are skewed (say the answer is almost always the same), two assessors can agree 80%+ purely by chance, so raw % overstates trust. Cohen’s κ corrects for that chance agreement. The dashboard also shows Gwet AC1, Krippendorff α, and prevalence (the share of true). Interpret κ with the Landis–Koch bands:
| Cohen’s κ | Interpretation |
|---|---|
| ≥ 0.80 | Almost perfect |
| 0.60 – 0.80 | Substantial |
| 0.40 – 0.60 | Moderate |
| 0.20 – 0.40 | Fair |
| < 0.20 | Roughly chance |
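The effect of skew is easiest to see with numbers. The sketch below uses the standard two-rater formula, κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. The function names and sample data are made up for illustration, and how the product treats abstain or na when computing κ is not specified on this page.

```python
import math
from collections import Counter
from typing import Sequence


def raw_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Share of items on which both raters gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Standard Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(a)
    observed = raw_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return math.nan  # both raters used one identical label; kappa is undefined
    return (observed - expected) / (1 - expected)


def interpret(kappa: float) -> str:
    """Map kappa to the bands in the table above."""
    if kappa >= 0.80:
        return "Almost perfect"
    if kappa >= 0.60:
        return "Substantial"
    if kappa >= 0.40:
        return "Moderate"
    if kappa >= 0.20:
        return "Fair"
    return "Roughly chance"


# Skewed outcomes: each rater says "true" on 9 of 10 conversations,
# but they disagree on which conversation is the exception.
human = ["true"] * 9 + ["false"]
ai = ["true"] * 8 + ["false", "true"]

raw = raw_agreement(human, ai)
kappa = cohens_kappa(human, ai)
print(f"raw agreement = {raw:.2f}, kappa = {kappa:.2f}, band = {interpret(kappa)}")
```

Here the two raters agree on 8 of 10 conversations, yet chance alone predicts 82% agreement for labels this skewed, so κ comes out slightly negative.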
Gate 1 / Gate 2 / proxy
Plain: the three agreement checks that decide whether the AI judge is trustworthy. Precise:
| Gate | Compares | Needs |
|---|---|---|
| Gate 1 | human ↔ human | at least 2 human raters |
| Gate 2 | AI ↔ human | 1 human + the AI |
| Proxy | internal rater ↔ customer rater | an internal and a customer rater |
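As a quick planning aid, the "Needs" column for Gate 1 and Gate 2 can be expressed as a small function over the rater types in a calibration set. This is an illustrative sketch: the function name and the gate identifiers are hypothetical, and the proxy gate is left out because it depends on an internal/customer distinction that this example does not model.

```python
from typing import List, Set


def measurable_gates(rater_types: List[str]) -> Set[str]:
    """Return which of Gate 1 / Gate 2 a set of raters can measure.

    rater_types holds one "human" or "ai" entry per rater.
    """
    humans = rater_types.count("human")
    ais = rater_types.count("ai")
    gates = set()
    if humans >= 2:
        gates.add("gate_1")  # human <-> human
    if humans >= 1 and ais >= 1:
        gates.add("gate_2")  # AI <-> human
    return gates


print(measurable_gates(["human", "ai"]))                    # Gate 2 only
print(sorted(measurable_gates(["human", "human", "ai"])))   # both gates
```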
Golden labels
Plain: a small set of verdicts you already know are correct, kept aside as a sanity check. Precise: golden labels are a held-out set of known-correct verdicts used as a regression check during graduation; they catch a judge that would otherwise pass the gates but get the known cases wrong.
Graduation
Plain: the deliberate act of turning on auto-scoring for a metric once it has earned trust. Precise: computing agreement on a calibration set refreshes each target metric’s eligibility (its gates, blockers, and “what it would take”) but never auto-graduates it. If the gates pass (for example Gate-2 κ at or above the threshold), the metric shows as eligible, and you then graduate it explicitly from its Status / Graduation panel. Its scored_by then flips to auto or hybrid. Nothing scores automatically until a human clicks Graduate. Conversely, a metric whose gates later fail is demoted automatically, since losing trust needs no confirmation. Fully deterministic metrics skip all of this and auto-certify on creation.
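The asymmetry between graduation (explicit) and demotion (automatic) can be sketched as a tiny state machine. Everything concrete here is an assumption for illustration: the 0.60 threshold, the requirement that both Gate 1 and Gate 2 pass, demotion landing on human_only, and all class and function names. Golden-label checks are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional

KAPPA_THRESHOLD = 0.60  # hypothetical; the real threshold is not stated on this page


@dataclass
class Metric:
    name: str
    scored_by: str = "human_only"
    eligible: bool = False
    blockers: List[str] = field(default_factory=list)


def refresh_eligibility(metric: Metric, gate1_kappa: Optional[float],
                        gate2_kappa: Optional[float],
                        threshold: float = KAPPA_THRESHOLD) -> None:
    """Recompute eligibility from gate results. Never promotes; may demote."""
    blockers = []
    for gate, kappa in (("Gate 1", gate1_kappa), ("Gate 2", gate2_kappa)):
        if kappa is None:
            blockers.append(f"{gate} has not been measured")
        elif kappa < threshold:
            blockers.append(f"{gate} kappa {kappa:.2f} is below {threshold:.2f}")
    metric.blockers = blockers
    metric.eligible = not blockers
    if not metric.eligible:
        # Assumption: demotion returns the metric to human_only.
        metric.scored_by = "human_only"


def graduate(metric: Metric, target: str = "auto") -> None:
    """The explicit human action; refuses unless the metric is eligible."""
    if target not in ("auto", "hybrid"):
        raise ValueError("target must be 'auto' or 'hybrid'")
    if not metric.eligible:
        raise ValueError("cannot graduate: " + "; ".join(metric.blockers))
    metric.scored_by = target


metric = Metric("Task Resolution")
refresh_eligibility(metric, gate1_kappa=0.72, gate2_kappa=0.66)
after_refresh = metric.scored_by   # eligible, but still human_only
graduate(metric, "auto")
after_graduate = metric.scored_by  # flipped only by the explicit call
refresh_eligibility(metric, gate1_kappa=0.72, gate2_kappa=0.35)
after_failure = metric.scored_by   # demoted without confirmation
print(after_refresh, after_graduate, after_failure)
```

Keeping promotion in its own function mirrors the rule that nothing scores automatically until a human acts, while the refresh step is free to remove trust on its own.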
Modality
Plain: whether a metric reads the conversation’s text or its audio. Precise: a metric’s modality is text or audio. For an audio metric, if the conversation has no recording, the AI judge auto-resolves the verdict to na (so it never counts against the compliant rate).
Permissions
Viewing evaluation data requires the assessments:read scope. Creating or editing metrics and evaluations, triggering runs, and graduating metrics all require assessments:manage.
Next steps
Quickstart
Build your first metric and score a conversation end to end.
Creating metrics & criteria
Write good questions and set expected values that score the way you mean.
Calibration
Measure whether the AI judge agrees with your reviewers.

