Once you’ve run some evaluations, the numbers live in the dashboard under the Analytics section, on the Overview tab. This page walks you through how to read what you see there: what each score means, how to filter by who did the grading, and how to drill into the conversations that failed.
Before we start, two quick definitions you’ll need throughout:
  • A metric is one thing you measure about a conversation (for example, “Task Resolution”). Each metric groups one or more criteria — individual yes/no checks.
  • A verdict (or outcome) is the answer to one criterion for one conversation. Every verdict is one of four values: true, false, abstain (the grader couldn’t decide), or na (the criterion didn’t apply to that conversation).

The per-metric cards

The Overview tab shows one card per metric. Each card is a scoreboard for how that metric is performing across all the conversations you’ve evaluated. The headline number on a card is the compliant rate.
To understand it, you first need to know what “compliant” means. A criterion has an expected value: the answer that counts as good. Some criteria are phrased so that true is good (“Did the agent confirm the caller’s identity?”), and some so that false is good (“Did the agent give wrong information?”). A verdict is compliant when its outcome matches that expected value, so a “false” verdict can be the good outcome, depending on the criterion.
The card shows four counts that together give the full picture (compliant is a subset of answered):
Count | What it means
Compliant | Verdicts where the outcome matched the expected value (the “good” answers).
Answered | Verdicts that came back as either true or false, so a real decision was made.
Abstain | The grader read the conversation but couldn’t decide.
Na | The criterion didn’t apply to that conversation (for example, an audio-only check on a chat with no recording).
The compliant rate is then:
compliant rate = compliant ÷ answered
The key thing to notice is the denominator: answered, not the total number of verdicts. Abstain and na verdicts are deliberately left out of the math. A criterion that didn’t apply, or that the grader skipped, shouldn’t drag a score up or down — only real decisions count.
The compliant, answered, abstain, and na totals on each card are exact. They come straight from the server across all matching verdicts — they are not capped or sampled, so the rate you read is the true rate for the filters you’ve applied.
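The counting rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the dashboard's implementation: the `(outcome, expected)` pair format and the `summarize` helper are hypothetical names chosen for the example.

```python
# Hypothetical verdict records for one metric: (outcome, expected) pairs.
# "expected" is the answer that counts as good for that criterion.
verdicts = [
    ("true", "true"),     # compliant: true was the good answer
    ("false", "false"),   # compliant: false was the good answer
    ("true", "false"),    # answered, but not compliant
    ("abstain", "true"),  # grader could not decide: left out of the rate
    ("na", "false"),      # criterion did not apply: left out of the rate
]

def summarize(verdicts):
    """Tally the four card counts and compute compliant / answered."""
    counts = {"compliant": 0, "answered": 0, "abstain": 0, "na": 0}
    for outcome, expected in verdicts:
        if outcome in ("true", "false"):
            counts["answered"] += 1
            if outcome == expected:
                counts["compliant"] += 1
        else:
            counts[outcome] += 1  # "abstain" or "na"
    # No answered verdicts means there is no rate to report.
    rate = counts["compliant"] / counts["answered"] if counts["answered"] else None
    return counts, rate

counts, rate = summarize(verdicts)
print(counts, rate)
```

Here two of the three answered verdicts are compliant, so the rate is 2 ÷ 3; the abstain and na verdicts are counted but never enter the denominator.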

How to read a metric card — a worked example

Say the Information Accuracy card shows these numbers over the last 30 days:
Information Accuracy
compliant rate: 94%
compliant 47 · answered 50 · abstain 2 · na 8
Read it like this:
  • There were 60 verdicts in total (47 + 3 + 2 + 8 — the 3 is the non-compliant share of the answered group).
  • 8 were na — the criterion didn’t apply to those conversations, so they’re ignored.
  • 2 were abstain — the grader saw them but couldn’t decide, so they’re ignored too.
  • That leaves 50 answered verdicts. Of those, 47 were compliant.
  • 47 ÷ 50 = 94% compliant rate.
So three verdicts (at most three conversations) flagged a real accuracy problem worth looking at, and those three are exactly what you’ll find when you drill in (below).
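The same arithmetic, written out as a quick check. The four numbers are copied from the example card; the variable names are only for illustration. The last line shows what the rate would be if abstain and na verdicts were wrongly included in the denominator.

```python
# Counts copied from the Information Accuracy example card.
compliant, answered, abstain, na = 47, 50, 2, 8

non_compliant = answered - compliant   # answered verdicts that missed the expected value
total = answered + abstain + na        # every verdict behind the card
rate = compliant / answered            # the compliant rate as defined on this page
naive_rate = compliant / total         # NOT the card's rate: shown only for contrast

print(f"total={total} non_compliant={non_compliant} rate={rate:.0%}")
print(f"rate if abstain and na were counted: {naive_rate:.0%}")
```

Dividing by all 60 verdicts would report roughly 78%, which understates the metric by treating skipped and non-applicable checks as failures.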

The assessor toggle — whose verdicts am I looking at?

At the top of the Overview is an assessor toggle with three options:
  • AI + Human — verdicts from both the AI judge and human reviewers, combined.
  • AI judge — only verdicts produced by the AI judge.
  • Human — only verdicts produced by human reviewers.
This toggle changes whose verdicts every card is counting. The same metric can have very different numbers depending on which option you pick, because AI and human reviewers don’t always agree.
The assessor toggle is not the same thing as a metric’s scoring-mode badge (auto / hybrid / human_only). The toggle controls what you’re currently viewing; the badge describes a metric’s capability — whether the AI judge is trusted to score it automatically. Both AI and human verdicts can exist for a metric no matter what its badge says. See Graduation for what the badge means and how a metric earns it.
If a metric you think of as “Human only” is showing surprising numbers — a suspiciously high count, or a rate that doesn’t match what your reviewers told you — check the assessor toggle first. You’re very likely viewing AI judge or AI + Human verdicts rather than just the human ones. Flip the toggle to Human to see only what your reviewers actually graded.
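To see why the same metric can report different rates under each toggle option, here is a minimal sketch. The `"ai"` and `"human"` labels, the tuple layout, and the `compliant_rate` helper are assumptions made for the example, not the product's data model.

```python
# Hypothetical verdicts for one metric: (assessor, outcome, expected).
verdicts = [
    ("ai", "true", "true"),
    ("ai", "true", "true"),
    ("ai", "false", "true"),
    ("human", "false", "true"),
    ("human", "true", "true"),
    ("human", "abstain", "true"),
]

# Each toggle option maps to the set of assessors whose verdicts are counted.
TOGGLE = {
    "AI + Human": {"ai", "human"},
    "AI judge": {"ai"},
    "Human": {"human"},
}

def compliant_rate(verdicts, option):
    """Compliant rate over answered verdicts from the selected assessors."""
    allowed = TOGGLE[option]
    answered = [
        (outcome, expected)
        for assessor, outcome, expected in verdicts
        if assessor in allowed and outcome in ("true", "false")
    ]
    if not answered:
        return None
    return sum(outcome == expected for outcome, expected in answered) / len(answered)

for option in TOGGLE:
    print(option, compliant_rate(verdicts, option))
```

With this sample data the three options give 3 ÷ 5, 2 ÷ 3, and 1 ÷ 2, so a surprising number is often just a different toggle position.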

The date-range filter

Next to the assessor toggle is a date-range filter. It scopes every card to verdicts from conversations in that window, so you can compare last week to this week, or zoom into a specific incident. The counts and the compliant rate recompute for the range you choose.
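A week-over-week comparison can be sketched the same way. The dates, the tuple layout, and the choice to treat both ends of the range as inclusive are assumptions for this example; the page does not say how the dashboard handles range boundaries.

```python
from datetime import date

# Hypothetical verdicts: (conversation_date, outcome, expected).
verdicts = [
    (date(2024, 5, 6), "true", "true"),
    (date(2024, 5, 7), "false", "true"),
    (date(2024, 5, 13), "true", "true"),
    (date(2024, 5, 14), "true", "true"),
]

def rate_in_range(verdicts, start, end):
    """Compliant rate for conversations dated start..end (inclusive, by assumption)."""
    answered = [
        (outcome, expected)
        for day, outcome, expected in verdicts
        if start <= day <= end and outcome in ("true", "false")
    ]
    if not answered:
        return None
    return sum(outcome == expected for outcome, expected in answered) / len(answered)

last_week = rate_in_range(verdicts, date(2024, 5, 6), date(2024, 5, 12))
this_week = rate_in_range(verdicts, date(2024, 5, 13), date(2024, 5, 19))
print(last_week, this_week)
```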

Drilling into a metric’s exceptions

A compliant rate tells you how often a check failed, not which conversations failed it or why. To see which ones and why, click into a metric to open its drill-down. The drill-down lists that metric’s exceptions: the individual verdicts that were not compliant. Each exception comes from a real conversation that failed the check, shown with the supporting evidence:
  • The criterion that failed.
  • A verbatim quote from the conversation — the exact line the grader anchored on.
  • The grader’s reasoning and confidence.
This is where a score becomes actionable. Instead of “94% accurate,” you see the three specific moments where the agent said something wrong, in the agent’s own words, so you can decide whether it’s a prompt fix, a knowledge-base gap, or a one-off. You can open the underlying conversation from here to read the full transcript.
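Conceptually, the exception list is a filter over a metric's verdicts. The sketch below assumes an exception is an answered verdict whose outcome differs from the expected value, which matches the worked example on this page; the `Verdict` fields, the numeric confidence, and the `exceptions` helper are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    conversation_id: str
    criterion: str
    outcome: str       # "true", "false", "abstain", or "na"
    expected: str      # "true" or "false"
    quote: str         # verbatim line the grader anchored on
    reasoning: str     # the grader's explanation
    confidence: float  # assumed to be numeric for this sketch

def exceptions(verdicts):
    """Keep answered verdicts whose outcome differs from the expected value."""
    return [
        v for v in verdicts
        if v.outcome in ("true", "false") and v.outcome != v.expected
    ]

criterion = "Did the agent give wrong information?"
sample = [
    Verdict("conv-1", criterion, "true", "false",
            "(verbatim line from the transcript)", "(grader's explanation)", 0.9),
    Verdict("conv-2", criterion, "false", "false", "", "", 0.8),    # compliant
    Verdict("conv-3", criterion, "abstain", "false", "", "", 0.3),  # not answered
]

for v in exceptions(sample):
    print(v.conversation_id, v.criterion, v.quote, v.reasoning, v.confidence)
```

Only `conv-1` is listed: `conv-2` matched the expected value, and `conv-3` was never answered, so neither counts as a failure.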

Run tiles and the runs-by-day chart

Below or alongside the metric cards, the Overview also summarizes the runs themselves. A run is one assessment of one conversation by one assessor — the unit of work behind every verdict. You’ll see three run tiles:
Tile | What it counts
Total | All evaluation runs in the current range.
Completed | Runs that finished successfully and produced verdicts.
Failed | Runs that errored out and produced nothing.
A healthy failed count is near zero. If it climbs, runs aren’t finishing, which is a signal to investigate before you trust the rates on the cards: failed runs contribute no verdicts, so the cards reflect only the runs that completed.
The runs-by-day chart plots run volume over time, so you can see when evaluation activity spiked or went quiet across the selected date range.
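The tiles and the chart are simple aggregations over runs. A minimal sketch, assuming each run record carries a date and a status of `"completed"` or `"failed"` (both field names and values are hypothetical):

```python
from collections import Counter
from datetime import date

# Hypothetical run records: (run_date, status).
runs = [
    (date(2024, 5, 6), "completed"),
    (date(2024, 5, 6), "completed"),
    (date(2024, 5, 7), "failed"),
    (date(2024, 5, 8), "completed"),
]

status_counts = Counter(status for _, status in runs)
tiles = {
    "total": len(runs),
    "completed": status_counts["completed"],
    "failed": status_counts["failed"],
}

# Run volume per day: the series a runs-by-day chart would plot.
runs_by_day = Counter(day for day, _ in runs)

# Share of runs that produced no verdicts; worth checking before trusting the rates.
failed_share = tiles["failed"] / tiles["total"] if tiles["total"] else 0.0

print(tiles)
print(sorted(runs_by_day.items()))
print(f"failed share: {failed_share:.0%}")
```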

Permissions

Viewing Analytics requires the assessments:read scope. Creating, editing, triggering, or graduating metrics requires assessments:manage.

Next steps

  • Calibration & agreement: Learn whether the AI judge actually agrees with your humans, using Cohen’s κ, not raw percentages.
  • Graduation: Understand the scoring-mode badge and how a metric earns automatic AI scoring.