- An AI judge — a top-tier large language model that reads a conversation and answers your quality questions automatically.
- Human reviewers — people on your team who grade conversations themselves.
Where it lives
Everything below lives in the Analytics section of the dashboard. Inside Analytics you’ll find these tabs:

| Tab | What it’s for |
|---|---|
| Overview | The scoreboard — compliant rates and counts for each metric. |
| Metrics | Define what you measure (metrics and their criteria). |
| Evaluations | Bundle metrics together and configure the judge. |
| Runs | Trigger evaluations and inspect individual results. |
| Calibration | Decide whether the AI judge is trustworthy. |
The mental model
The most important thing to learn first is the hierarchy: how the pieces nest. Read this once and the rest of the section will make sense.

| Term | What it is |
|---|---|
| Metric | A thing you measure about a conversation, such as “Task Resolution” or “Information Accuracy”. A metric groups one or more criteria. |
| Criterion (also called an atom) | A single yes/no check inside a metric. It has a question (the exact yes/no question asked about the conversation) and an expected_value (true or false — the answer that means “good”). A criterion can be deterministic (a hard, reproducible rule) or judged (decided by the AI). |
| Outcome | The answer to a criterion’s question on one conversation: true, false, abstain (the judge couldn’t decide), or na (the criterion doesn’t apply here). |
| Compliant | When the outcome matches the criterion’s expected_value. So if a criterion expects false (like “Did the agent give wrong info?”), a false outcome is the good result. |
| Evaluation | A reusable bundle of metrics plus a judge configuration and a free-text instructions field. You run an evaluation against conversations. |
| Run | One assessment of one conversation by one assessor. The assessor is either the AI judge (ai) or a human reviewer (human). |
| Verdict (criterion result) | The result for one criterion within one run: an outcome, a verbatim quote of the supporting evidence, a confidence, and reasoning. |
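
To make the Outcome and Compliant rows concrete, here is a minimal Python sketch. The class and function names (`Criterion`, `Verdict`, `is_compliant`, `compliant_rate`) are hypothetical and are not the product's API; only the four outcome values, the `expected_value` field, and the rule that the rate is compliant ÷ answered (with answered counting only true and false verdicts) come from this page.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Outcomes that count as "answered" for the compliant rate.
# "abstain" and "na" are deliberately excluded from the denominator.
ANSWERED = {"true", "false"}


@dataclass(frozen=True)
class Criterion:
    """Hypothetical stand-in for a criterion: a yes/no question plus the good answer."""
    question: str
    expected_value: bool  # the answer that means "good"


@dataclass(frozen=True)
class Verdict:
    """Hypothetical stand-in for one criterion result within one run."""
    criterion: Criterion
    outcome: str  # "true", "false", "abstain", or "na"


def is_compliant(verdict: Verdict) -> Optional[bool]:
    """Return True/False for answered verdicts, None for abstain or na."""
    if verdict.outcome not in ANSWERED:
        return None
    return (verdict.outcome == "true") == verdict.criterion.expected_value


def compliant_rate(verdicts: Iterable[Verdict]) -> Optional[float]:
    """Compliant rate = compliant / answered; None when nothing was answered."""
    answered = [r for r in map(is_compliant, verdicts) if r is not None]
    if not answered:
        return None
    return sum(answered) / len(answered)


# A criterion where "false" is the good answer.
wrong_info = Criterion("Did the agent give wrong info?", expected_value=False)
verdicts = [
    Verdict(wrong_info, "false"),    # compliant
    Verdict(wrong_info, "false"),    # compliant
    Verdict(wrong_info, "true"),     # not compliant
    Verdict(wrong_info, "abstain"),  # ignored
    Verdict(wrong_info, "na"),       # ignored
]
rate = compliant_rate(verdicts)  # 2 compliant out of 3 answered
```

Returning `None` for an empty denominator is a design choice of this sketch, so that "no answered verdicts" is not confused with a 0% rate.
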
A criterion expecting false flips the intuition: “false is good.” Throughout the dashboard, the compliant rate is what matters, not the raw share of true answers. Compliant rate = compliant ÷ answered, where answered counts only true and false verdicts; abstain and na are left out of the denominator.

AI judge vs. human reviewers
The same conversation can be scored by the AI judge and by a person, and each scoring is its own run. That’s deliberate: when you have an AI run and a human run for the same conversations, you can measure how often they agree. High agreement is your evidence that the AI judge is trustworthy enough to score on its own. Low agreement tells you the question needs sharpening or the judge needs more context.

Human grading is designed to be fast. It works “grade-by-exception”: every criterion starts on the no-violation (compliant) answer, and you only flag the exceptions. A human run can also confirm or correct an existing AI run.

The end-to-end lifecycle
Putting it together, here’s the path from nothing to trusted, automatic scoring:

Define metrics
In Metrics, create the metrics and criteria that capture what “good” means for your conversations. See Creating metrics & criteria.
Bundle into an evaluation
In Evaluations, group the metrics you want to score together and add judge instructions (often your agent’s knowledge base, so the judge knows what “correct” means).
Run it
From Runs, trigger the evaluation against one or many conversations, a pasted transcript, or an audio URL. Each conversation becomes its own run, processed in parallel.
Read the results
In Overview, read compliant rates per metric and drill into the individual verdicts behind any number.
Calibrate
In Calibration, compare AI and human runs to ask the real question: is the AI judge trustworthy for this metric?
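
The comparison behind the Calibrate step can be sketched as a simple agreement rate. This is an illustration under stated assumptions, not the dashboard's actual calculation: this page only says you can measure how often AI and human runs agree, so the function name `agreement_rate`, the conversation IDs, and the choice of plain percent agreement are all hypothetical. The real Calibration tab may use a different statistic.

```python
from typing import Dict, Optional


def agreement_rate(ai: Dict[str, str], human: Dict[str, str]) -> Optional[float]:
    """Share of conversations scored by both assessors whose outcomes match.

    Both arguments map a conversation ID to the outcome for one criterion.
    Returns None when no conversation has both an AI run and a human run.
    """
    shared = ai.keys() & human.keys()
    if not shared:
        return None
    matches = sum(1 for conversation_id in shared if ai[conversation_id] == human[conversation_id])
    return matches / len(shared)


# Hypothetical outcomes for a single criterion.
ai_outcomes = {"conv-1": "true", "conv-2": "false", "conv-3": "true", "conv-4": "abstain"}
human_outcomes = {"conv-1": "true", "conv-2": "true", "conv-3": "true"}

# conv-4 has no human run, so only three conversations are compared.
rate = agreement_rate(ai_outcomes, human_outcomes)  # 2 matches out of 3 shared
```

Only conversations with both an AI run and a human run enter the comparison, which mirrors the point that each scoring is its own run.
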
Start here
Quickstart
The fastest path: create a metric, run an evaluation, and read your first results.
Core concepts
A deeper tour of metrics, criteria, outcomes, runs, and verdicts.
Creating metrics & criteria
Write good yes/no criteria and choose the right expected value.

