# Reading the dashboard

> Make sense of metric scores, filter by who graded, and drill into exceptions.

Once you've run some evaluations, the numbers live in the dashboard under the **Analytics** section, on the **Overview** tab. This page walks you through how to read what you see there — what each score means, how to filter by who did the grading, and how to drill into the conversations that failed.

Before we start, two quick definitions you'll need throughout:

* A **metric** is one thing you measure about a conversation (for example, "Task Resolution"). Each metric groups one or more **criteria** — individual yes/no checks.
* A **verdict** (or **outcome**) is the answer to one criterion for one conversation. Every verdict is one of four values: **true**, **false**, **abstain** (the grader couldn't decide), or **na** (the criterion didn't apply to that conversation).

## The per-metric cards

The Overview tab shows one **card per metric**. Each card is a scoreboard for how that metric is performing across all the conversations you've evaluated.

The headline number on a card is the **compliant rate**. To understand it, you first need to know what "compliant" means.

A criterion has an **expected value** — the answer that counts as good. Some criteria are phrased so that **true** is good ("Did the agent confirm the caller's identity?"), and some so that **false** is good ("Did the agent give wrong information?"). A verdict is **compliant** when its outcome matches that expected value. So a "false" verdict can absolutely be the good outcome — it depends on the criterion.
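The rule above fits in a few lines. This is a minimal sketch, not the product's implementation: the `Verdict` class and its field names are hypothetical, chosen only to illustrate the comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Hypothetical record: one criterion's answer for one conversation."""
    outcome: str   # "true", "false", "abstain", or "na"
    expected: str  # the criterion's expected value: "true" or "false"


def is_compliant(verdict: Verdict) -> bool:
    """A verdict is compliant when its outcome matches the expected value."""
    # abstain and na never match an expected value of "true" or "false",
    # so they are never counted as compliant.
    return verdict.outcome in ("true", "false") and verdict.outcome == verdict.expected


# "Did the agent give wrong information?" expects "false", so "false" is the good answer.
print(is_compliant(Verdict(outcome="false", expected="false")))  # True
print(is_compliant(Verdict(outcome="true", expected="false")))   # False
```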

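The card shows four counts. **Answered**, **abstain**, and **na** together account for every verdict; **compliant** is the subset of **answered** that matched the expected value: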

| Count         | What it means                                                                                                   |
| ------------- | --------------------------------------------------------------------------------------------------------------- |
| **Compliant** | Verdicts where the outcome matched the expected value — the "good" answers.                                     |
| **Answered**  | Verdicts that came back as either **true** or **false** — a real decision was made.                             |
| **Abstain**   | The grader read the conversation but couldn't decide.                                                           |
| **Na**        | The criterion didn't apply to that conversation (for example, an audio-only check on a chat with no recording). |

The **compliant rate** is then:

```
compliant rate = compliant ÷ answered
```

The key thing to notice is the denominator: **answered**, not the total number of verdicts. **Abstain** and **na** verdicts are deliberately left out of the math. A criterion that didn't apply, or that the grader couldn't decide, shouldn't drag a score up or down; only real decisions count.
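To make the denominator concrete, here is a rough sketch that tallies a list of verdicts into the four card counts and then computes the rate. The `(outcome, expected)` tuple shape and the function names are assumptions for illustration. Returning `None` when nothing was answered is this sketch's own choice; this page does not say how the dashboard displays that case.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def summarize(verdicts: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Tally hypothetical (outcome, expected) pairs into the four card counts."""
    counts: Counter = Counter()
    for outcome, expected in verdicts:
        if outcome in ("true", "false"):
            counts["answered"] += 1
            if outcome == expected:
                counts["compliant"] += 1
        elif outcome in ("abstain", "na"):
            counts[outcome] += 1
    return {key: counts[key] for key in ("compliant", "answered", "abstain", "na")}


def compliant_rate(compliant: int, answered: int) -> Optional[float]:
    """compliant / answered; None when there are no answered verdicts (assumption)."""
    if answered == 0:
        return None
    return compliant / answered


sample = [("true", "true"), ("false", "true"), ("abstain", "true"), ("na", "true")]
card = summarize(sample)
print(card)  # {'compliant': 1, 'answered': 2, 'abstain': 1, 'na': 1}
print(compliant_rate(card["compliant"], card["answered"]))  # 0.5
```

Note that the abstain and na rows change the counts but not the rate: only the two answered rows reach the denominator.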

<Note>
  The compliant, answered, abstain, and na totals on each card are exact. They come straight from the server across **all** matching verdicts — they are not capped or sampled, so the rate you read is the true rate for the filters you've applied.
</Note>

### How to read a metric card — a worked example

Say the **Information Accuracy** card shows these numbers over the last 30 days:

```
Information Accuracy
compliant rate: 94%
compliant 47 · answered 50 · abstain 2 · na 8
```

Read it like this:

* There were **60** verdicts in total (47 + 3 + 2 + 8 — the 3 is the non-compliant share of the answered group).
* **8** were **na** — the criterion didn't apply to those conversations, so they're ignored.
* **2** were **abstain** — the grader saw them but couldn't decide, so they're ignored too.
* That leaves **50 answered** verdicts. Of those, **47** were compliant.
* 47 ÷ 50 = **94%** compliant rate.

So three answered verdicts were non-compliant, each one a real accuracy problem worth looking at, and those three are exactly what you'll find when you drill in (below).
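The arithmetic in this example can be checked directly. The snippet below uses only the numbers from the card above; the variable names are illustrative. The last line shows what the rate would look like if abstain and na were wrongly included in the denominator.

```python
# Counts copied from the Information Accuracy card in the worked example.
compliant, answered, abstain, na = 47, 50, 2, 8

non_compliant = answered - compliant  # answered verdicts that missed the expected value
total = answered + abstain + na       # every verdict, whatever its outcome
rate = compliant / answered           # the compliant rate shown on the card

print(non_compliant)  # 3
print(total)          # 60
print(f"{rate:.0%}")  # 94%

# For contrast only: dividing by the total would understate the score.
print(f"{compliant / total:.0%}")  # 78%
```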

## The assessor toggle — whose verdicts am I looking at?

At the top of the Overview is an **assessor toggle** with three options:

* **AI + Human** — verdicts from both the AI judge and human reviewers, combined.
* **AI judge** — only verdicts produced by the AI judge.
* **Human** — only verdicts produced by human reviewers.

This toggle changes **whose** verdicts every card is counting. The same metric can have very different numbers depending on which option you pick, because AI and human reviewers don't always agree.

<Note>
  The assessor toggle is **not** the same thing as a metric's scoring-mode badge (auto / hybrid / human\_only). The toggle controls **what you're currently viewing**; the badge describes a metric's **capability** — whether the AI judge is trusted to score it automatically. Both AI and human verdicts can exist for a metric no matter what its badge says. See [Graduation](/evaluation/graduation) for what the badge means and how a metric earns it.
</Note>

<Tip>
  If a metric you think of as "Human only" is showing surprising numbers — a suspiciously high count, or a rate that doesn't match what your reviewers told you — check the assessor toggle first. You're very likely viewing **AI judge** or **AI + Human** verdicts rather than just the human ones. Flip the toggle to **Human** to see only what your reviewers actually graded.
</Tip>
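Conceptually, the assessor toggle is a filter applied before any counting happens. The sketch below assumes each verdict carries an `assessor` field with the value `"ai"` or `"human"`; that field, its values, and the mapping table are hypothetical and only illustrate the idea.

```python
from typing import Dict, List, Set

# Hypothetical mapping from the toggle label to the assessor types it includes.
TOGGLE_TO_ASSESSORS: Dict[str, Set[str]] = {
    "AI + Human": {"ai", "human"},
    "AI judge": {"ai"},
    "Human": {"human"},
}


def filter_by_toggle(verdicts: List[dict], toggle: str) -> List[dict]:
    """Keep only the verdicts produced by the assessors the toggle selects."""
    allowed = TOGGLE_TO_ASSESSORS[toggle]
    return [v for v in verdicts if v["assessor"] in allowed]


verdicts = [
    {"assessor": "ai", "outcome": "true"},
    {"assessor": "ai", "outcome": "true"},
    {"assessor": "ai", "outcome": "false"},
    {"assessor": "human", "outcome": "false"},
]

for toggle in TOGGLE_TO_ASSESSORS:
    print(toggle, len(filter_by_toggle(verdicts, toggle)))
# AI + Human 4
# AI judge 3
# Human 1
```

The same metric yields 4, 3, or 1 verdicts depending on the toggle, which is why a count can look surprising until you check which option is selected.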

## The date-range filter

Next to the assessor toggle is a **date-range filter**. It scopes every card to verdicts from conversations in that window, so you can compare last week to this week, or zoom into a specific incident. The counts and the compliant rate recompute for the range you choose.
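A week-over-week comparison can be pictured as running the same rate calculation over two filtered slices. This is a sketch under stated assumptions: the `(conversation_date, outcome, expected)` row shape is hypothetical, and treating both ends of the range as inclusive is a guess, not documented behavior.

```python
from datetime import date
from typing import List, Optional, Tuple

# Hypothetical row: (conversation_date, outcome, expected).
Row = Tuple[date, str, str]


def in_range(rows: List[Row], start: date, end: date) -> List[Row]:
    """Keep rows whose conversation date falls in [start, end] (inclusive by assumption)."""
    return [row for row in rows if start <= row[0] <= end]


def rate(rows: List[Row]) -> Optional[float]:
    """compliant / answered for the given rows, or None if nothing was answered."""
    answered = [r for r in rows if r[1] in ("true", "false")]
    if not answered:
        return None
    return sum(1 for r in answered if r[1] == r[2]) / len(answered)


rows = [
    (date(2024, 5, 6), "true", "true"),
    (date(2024, 5, 7), "false", "true"),
    (date(2024, 5, 13), "true", "true"),
    (date(2024, 5, 14), "na", "true"),
]

last_week = in_range(rows, date(2024, 5, 6), date(2024, 5, 12))
this_week = in_range(rows, date(2024, 5, 13), date(2024, 5, 19))
print(rate(last_week), rate(this_week))  # 0.5 1.0
```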

## Drilling into a metric's exceptions

A compliant rate tells you *how many* conversations had a problem. To see *which ones* and *why*, click into a metric to open its drill-down.

The drill-down lists that metric's **exceptions** — the individual verdicts that were **not** compliant. Each exception is a real conversation that failed the check, shown with the supporting evidence:

* The **criterion** that failed.
* A verbatim **quote** from the conversation — the exact line the grader anchored on.
* The grader's **reasoning** and **confidence**.

This is where a score becomes actionable. Instead of "94% accurate," you see the three specific moments where the agent said something wrong, in the agent's own words, so you can decide whether it's a prompt fix, a knowledge-base gap, or a one-off. You can open the underlying conversation from here to read the full [transcript](/conversations/conversation-detail).
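In data terms, the exception list is the set of answered verdicts whose outcome missed the expected value. The sketch below shows that selection. The `Verdict` fields mirror the evidence listed above, but their names and types (for example, `confidence` as a float) are assumptions, and the sample rows are invented placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Verdict:
    """Hypothetical shape; field names and types are illustrative only."""
    conversation_id: str
    criterion: str
    outcome: str       # "true", "false", "abstain", or "na"
    expected: str      # "true" or "false"
    quote: str         # verbatim line the grader anchored on
    reasoning: str
    confidence: float  # assumed to be a number; the real format may differ


def exceptions(verdicts: List[Verdict]) -> List[Verdict]:
    """Answered verdicts whose outcome did not match the expected value."""
    return [
        v for v in verdicts
        if v.outcome in ("true", "false") and v.outcome != v.expected
    ]


# Invented sample data, for illustration only.
verdicts = [
    Verdict("c1", "Gave wrong information?", "false", "false", "", "", 0.9),
    Verdict("c2", "Gave wrong information?", "true", "false", "sample quote", "sample reasoning", 0.8),
    Verdict("c3", "Gave wrong information?", "abstain", "false", "", "", 0.3),
]

for v in exceptions(verdicts):
    print(v.conversation_id, v.criterion, v.quote)
# c2 Gave wrong information? sample quote
```

Abstain and na verdicts are excluded here, consistent with the worked example, where the drill-down contains exactly the three non-compliant answered verdicts.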

## Run tiles and the runs-by-day chart

Below or alongside the metric cards, the Overview also summarizes the **runs** themselves. A **run** is one assessment of one conversation by one assessor — the unit of work behind every verdict.

You'll see three **run tiles**:

| Tile          | What it counts                                         |
| ------------- | ------------------------------------------------------ |
| **Total**     | All evaluation runs in the current range.              |
| **Completed** | Runs that finished successfully and produced verdicts. |
| **Failed**    | Runs that errored out and produced nothing.            |

A healthy **failed** count is near zero. If it climbs, runs aren't finishing, so investigate before you trust the rates above: failed runs contribute no verdicts, and the cards reflect only the runs that completed.

The **runs-by-day chart** plots run volume over time, so you can see when evaluation activity spiked or went quiet across the selected date range.
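The tiles and the chart are two aggregations over the same run records. The following sketch assumes each run is a `(day, status)` pair with the status strings `"completed"` and `"failed"`; those names are hypothetical, and real runs may have other statuses that count toward **Total** only.

```python
from collections import Counter
from datetime import date

# Hypothetical run records: (day the run happened, final status).
runs = [
    (date(2024, 5, 6), "completed"),
    (date(2024, 5, 6), "failed"),
    (date(2024, 5, 7), "completed"),
    (date(2024, 5, 7), "completed"),
]

# Run tiles: one total plus a count per status.
status_counts = Counter(status for _, status in runs)
tiles = {
    "total": len(runs),
    "completed": status_counts["completed"],
    "failed": status_counts["failed"],
}
print(tiles)  # {'total': 4, 'completed': 3, 'failed': 1}

# Runs-by-day chart: run volume grouped by calendar day.
runs_by_day = Counter(day for day, _ in runs)
for day, count in sorted(runs_by_day.items()):
    print(day.isoformat(), count)
# 2024-05-06 2
# 2024-05-07 2

# Share of runs that failed; a rising value is the signal to investigate.
failed_share = tiles["failed"] / tiles["total"] if tiles["total"] else 0.0
print(failed_share)  # 0.25
```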

## Permissions

Viewing Analytics requires the `assessments:read` scope. Creating, editing, triggering, or graduating metrics requires `assessments:manage`.

## Next steps

<CardGroup cols={2}>
  <Card title="Calibration & agreement" icon="scale-balanced" href="/evaluation/calibration">
    Learn whether the AI judge actually agrees with your humans — using Cohen's κ, not raw percentages.
  </Card>

  <Card title="Graduation" icon="graduation-cap" href="/evaluation/graduation">
    Understand the scoring-mode badge and how a metric earns automatic AI scoring.
  </Card>
</CardGroup>
