- A metric is one thing you measure about a conversation (for example, “Task Resolution”). Each metric groups one or more criteria — individual yes/no checks.
- A verdict (or outcome) is the answer to one criterion for one conversation. Every verdict is one of four values: true, false, abstain (the grader couldn’t decide), or na (the criterion didn’t apply to that conversation).
The per-metric cards
The Overview tab shows one card per metric. Each card is a scoreboard for how that metric is performing across all the conversations you’ve evaluated. The headline number on a card is the compliant rate. To understand it, you first need to know what “compliant” means. A criterion has an expected value: the answer that counts as good. Some criteria are phrased so that true is good (“Did the agent confirm the caller’s identity?”), and some so that false is good (“Did the agent give wrong information?”). A verdict is compliant when its outcome matches that expected value, so a “false” verdict can be the good outcome, depending on the criterion. The card shows four counts that together give the full picture:

| Count | What it means |
|---|---|
| Compliant | Verdicts where the outcome matched the expected value — the “good” answers. |
| Answered | Verdicts that came back as either true or false — a real decision was made. |
| Abstain | The grader read the conversation but couldn’t decide. |
| Na | The criterion didn’t apply to that conversation (for example, an audio-only check on a chat with no recording). |
The compliant, answered, abstain, and na totals on each card are exact. They come straight from the server across all matching verdicts — they are not capped or sampled, so the rate you read is the true rate for the filters you’ve applied.
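
The counting rules above can be sketched in a few lines of Python. This is only an illustration of the definitions, not the server's implementation: the `Verdict` class, its field names, and the `card_counts` helper are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Verdict:
    # Hypothetical record shape, used only for this illustration.
    outcome: str   # one of "true", "false", "abstain", "na"
    expected: str  # "true" or "false": the answer that counts as good

def is_compliant(v: Verdict) -> bool:
    # Only a real true/false decision can match the expected value.
    return v.outcome in ("true", "false") and v.outcome == v.expected

def card_counts(verdicts):
    """Tally the four counts shown on a metric card."""
    counts = {"compliant": 0, "answered": 0, "abstain": 0, "na": 0}
    for v in verdicts:
        if v.outcome in ("true", "false"):
            counts["answered"] += 1
            if is_compliant(v):
                counts["compliant"] += 1
        elif v.outcome in ("abstain", "na"):
            counts[v.outcome] += 1
        else:
            raise ValueError(f"unknown outcome: {v.outcome!r}")
    return counts
```

Note that compliant is a subset of answered, so the four counts overlap rather than sum to the total number of verdicts.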
How to read a metric card — a worked example
Say the Information Accuracy card shows these numbers over the last 30 days: 47 compliant, 50 answered, 2 abstain, and 8 na.
- There were 60 verdicts in total (47 + 3 + 2 + 8; the 3 is the non-compliant share of the answered group).
- 8 were na — the criterion didn’t apply to those conversations, so they’re ignored.
- 2 were abstain — the grader saw them but couldn’t decide, so they’re ignored too.
- That leaves 50 answered verdicts. Of those, 47 were compliant.
- 47 ÷ 50 = 94% compliant rate.
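
The same arithmetic, written out as a short Python sketch. The function name `compliant_rate` is hypothetical, and returning `None` when nothing was answered is an assumption made here to avoid dividing by zero; the product may display that case differently.

```python
def compliant_rate(compliant: int, answered: int):
    """Compliant rate = compliant / answered; abstain and na are excluded."""
    if answered == 0:
        return None  # assumption: no rate when no verdict was answered
    return compliant / answered

total = 47 + 3 + 2 + 8    # all verdicts
answered = total - 8 - 2  # drop na and abstain
rate = compliant_rate(47, answered)
print(f"{rate:.0%}")      # prints 94%
```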
The assessor toggle — whose verdicts am I looking at?
At the top of the Overview is an assessor toggle with three options:
- AI + Human: verdicts from both the AI judge and human reviewers, combined.
- AI judge: only verdicts produced by the AI judge.
- Human: only verdicts produced by human reviewers.
The assessor toggle is not the same thing as a metric’s scoring-mode badge (auto / hybrid / human_only). The toggle controls what you’re currently viewing; the badge describes a metric’s capability — whether the AI judge is trusted to score it automatically. Both AI and human verdicts can exist for a metric no matter what its badge says. See Graduation for what the badge means and how a metric earns it.
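
One way to picture the toggle is as a plain filter over verdicts that never consults the metric's badge. The sketch below assumes hypothetical assessor labels (`"ai"`, `"human"`) and view names; none of these identifiers come from the product.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Verdict:
    # Hypothetical record shape for illustration only.
    assessor: str  # "ai" or "human"
    outcome: str   # "true", "false", "abstain", or "na"

# Hypothetical view names mapped to the assessors they include.
VIEWS = {
    "ai_and_human": {"ai", "human"},
    "ai": {"ai"},
    "human": {"human"},
}

def filter_by_assessor(verdicts, view: str):
    """Keep only verdicts produced by the assessors selected in the toggle."""
    allowed = VIEWS[view]
    return [v for v in verdicts if v.assessor in allowed]
```

Because the filter depends only on who produced each verdict, switching the toggle changes what you see without changing a metric's scoring mode.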
The date-range filter
Next to the assessor toggle is a date-range filter. It scopes every card to verdicts from conversations in that window, so you can compare last week to this week, or zoom into a specific incident. The counts and the compliant rate recompute for the range you choose.
Drilling into a metric’s exceptions
A compliant rate tells you what share of answered verdicts passed, and therefore how many had a problem. To see which ones and why, click into a metric to open its drill-down. The drill-down lists that metric’s exceptions: the individual verdicts that were not compliant. Each exception is a real conversation that failed the check, shown with the supporting evidence:
- The criterion that failed.
- A verbatim quote from the conversation — the exact line the grader anchored on.
- The grader’s reasoning and confidence.
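
As a rough sketch, the exception list can be thought of as the answered verdicts whose outcome differs from the expected value. The record shape and field names below are hypothetical, and treating abstain and na verdicts as non-exceptions is an assumption based on the description of an exception as a check that failed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Verdict:
    # Hypothetical fields mirroring the evidence listed above.
    criterion: str
    outcome: str       # "true", "false", "abstain", or "na"
    expected: str      # "true" or "false"
    quote: str         # verbatim line the grader anchored on
    reasoning: str
    confidence: float  # assumed here to be a number between 0 and 1

def exceptions(verdicts):
    """Return answered verdicts whose outcome does not match the expected value."""
    return [
        v for v in verdicts
        if v.outcome in ("true", "false") and v.outcome != v.expected
    ]
```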
Run tiles and the runs-by-day chart
Below or alongside the metric cards, the Overview also summarizes the runs themselves. A run is one assessment of one conversation by one assessor: the unit of work behind every verdict. You’ll see three run tiles:

| Tile | What it counts |
|---|---|
| Total | All evaluation runs in the current range. |
| Completed | Runs that finished successfully and produced verdicts. |
| Failed | Runs that errored out and produced nothing. |
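
The three tiles amount to a simple tally over run statuses. In the sketch below the status strings (`"completed"`, `"failed"`, `"running"`) and the `run_tiles` helper are hypothetical; the only thing taken from the table is what each tile counts.

```python
from collections import Counter

def run_tiles(statuses):
    """Summarize a list of run status strings into the three tile values."""
    by_status = Counter(statuses)
    return {
        "total": len(statuses),                # all runs in the current range
        "completed": by_status["completed"],   # finished and produced verdicts
        "failed": by_status["failed"],         # errored out, produced nothing
    }
```

If runs can be in any other state (for example still in progress, which is an assumption here), Total will be larger than Completed plus Failed.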
Permissions
Viewing Analytics requires the assessments:read scope. Creating, editing, triggering, or graduating metrics requires the assessments:manage scope.
Next steps
Calibration & agreement
Learn whether the AI judge actually agrees with your humans — using Cohen’s κ, not raw percentages.
Graduation
Understand the scoring-mode badge and how a metric earns automatic AI scoring.

