# Evaluation overview

> What conversation evaluation is, the pieces it's made of, and how they fit together.

When your agents handle hundreds or thousands of conversations, you can't read them all to know whether they're doing a good job. Spot-checking a handful by hand is slow, inconsistent, and impossible to repeat. **Evaluation** solves that: it measures the quality of your agent conversations consistently and at scale, so you can answer questions like "Did the agent resolve the caller's task?" or "Did it ever give out wrong information?" across every conversation, not just the few you happened to open.

You measure quality two ways, and you can combine them:

* An **AI judge** — a top-tier large language model that reads a conversation and answers your quality questions automatically.
* **Human reviewers** — people on your team who grade conversations themselves.

Both produce the same kind of result, so you can compare them directly. That comparison is what eventually lets you trust the AI judge to score on its own.

## Where it lives

Everything below lives in the **Analytics** section of the dashboard. Inside Analytics you'll find these tabs:

| Tab             | What it's for                                                |
| --------------- | ------------------------------------------------------------ |
| **Overview**    | The scoreboard — compliant rates and counts for each metric. |
| **Metrics**     | Define what you measure (metrics and their criteria).        |
| **Evaluations** | Bundle metrics together and configure the judge.             |
| **Runs**        | Trigger evaluations and inspect individual results.          |
| **Calibration** | Decide whether the AI judge is trustworthy.                  |

## The mental model

The most important thing to learn first is the hierarchy — how the pieces nest. Read this once and the rest of the section will make sense.

```
Metric  ────────────────  "Task Resolution"
  ├─ Criterion (atom) ──── "Did the agent resolve the caller's request?" → expects true
  └─ Criterion (atom) ──── "Did the agent give wrong info?"            → expects false

Evaluation  ────────────  bundles one or more Metrics + judge instructions
  │
  └─ run against conversations
        │
        ├─ Run  ─────────  one conversation, one assessor (AI or human)
        │     └─ Verdict per criterion: outcome + quote + confidence + reasoning
        └─ Run  ─────────  another conversation, or the same one by a human
```

Here's what each term means:

| Term                                    | What it is                                                                                                                                                                                                                                                                                               |
| --------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Metric**                              | A thing you measure about a conversation, such as "Task Resolution" or "Information Accuracy". A metric groups one or more criteria.                                                                                                                                                                     |
| **Criterion** (also called an **atom**) | A single yes/no check inside a metric. It has a **question** (the exact yes/no question asked about the conversation) and an **expected\_value** (`true` or `false` — the answer that means "good"). A criterion can be **deterministic** (a hard, reproducible rule) or **judged** (decided by the AI). |
| **Outcome**                             | The answer to a criterion's question on one conversation: `true`, `false`, `abstain` (the judge couldn't decide), or `na` (the criterion doesn't apply here).                                                                                                                                            |
| **Compliant**                           | When the outcome matches the criterion's expected\_value. So if a criterion expects `false` (like "Did the agent give wrong info?"), a `false` outcome is the *good* result.                                                                                                                             |
| **Evaluation**                          | A reusable bundle of metrics plus a judge configuration and a free-text **instructions** field. You run an evaluation against conversations.                                                                                                                                                             |
| **Run**                                 | One assessment of one conversation by one assessor. The assessor is either the AI judge (`ai`) or a human reviewer (`human`).                                                                                                                                                                            |
| **Verdict** (criterion result)          | The result for one criterion within one run: an outcome, a verbatim **quote** of the supporting evidence, a **confidence**, and reasoning.                                                                                                                                                               |
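
The relationship between a criterion, its expected value, and a verdict can be sketched in a few lines of Python. This is an illustrative model only, not the product's API: the class names, field types, and the `is_compliant` helper are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# The four outcomes named in the table above.
Outcome = Literal["true", "false", "abstain", "na"]


@dataclass(frozen=True)
class Criterion:
    question: str
    expected_value: bool  # the answer that means "good"


@dataclass(frozen=True)
class Verdict:
    criterion: Criterion
    outcome: Outcome
    quote: str = ""
    confidence: Optional[float] = None  # scale is an assumption; not specified here


def is_compliant(verdict: Verdict) -> Optional[bool]:
    """Return True/False for answered verdicts, None for abstain or na."""
    if verdict.outcome not in ("true", "false"):
        return None
    return (verdict.outcome == "true") == verdict.criterion.expected_value


resolved = Criterion("Did the agent resolve the caller's request?", expected_value=True)
wrong_info = Criterion("Did the agent give wrong info?", expected_value=False)

# A `false` outcome on a criterion that expects `false` is the good result.
print(is_compliant(Verdict(wrong_info, "false")))
```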

<Note>
  A criterion that expects `false` flips the intuition: "false is good." Throughout the dashboard, the **compliant rate** is what matters, not the raw share of `true` answers. Compliant rate = compliant ÷ answered, where *answered* counts only verdicts whose outcome is `true` or `false`; `abstain` and `na` are left out of the denominator.
</Note>
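
As a worked example of that formula, here is a minimal Python sketch. The function name and the `(outcome, expected_value)` tuple shape are hypothetical, and returning `None` when nothing was answered is an assumption about how an empty denominator might be handled.

```python
from typing import Iterable, Optional, Tuple


def compliant_rate(results: Iterable[Tuple[str, bool]]) -> Optional[float]:
    """Compute compliant / answered for (outcome, expected_value) pairs."""
    answered = 0
    compliant = 0
    for outcome, expected in results:
        if outcome not in ("true", "false"):
            continue  # abstain and na stay out of the denominator
        answered += 1
        if (outcome == "true") == expected:
            compliant += 1
    # Assumption: with no answered verdicts there is no rate to report.
    return compliant / answered if answered else None


# Five verdicts for "Did the agent give wrong info?" (expects false).
results = [
    ("false", False),    # compliant
    ("false", False),    # compliant
    ("true", False),     # answered, not compliant
    ("abstain", False),  # excluded
    ("na", False),       # excluded
]
rate = compliant_rate(results)
print(rate)  # 2 compliant out of 3 answered
```

Note that the raw share of `true` answers here is one in three, while the compliant rate is two in three, which is why the dashboard reports the latter.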

## AI judge vs. human reviewers

The same conversation can be scored by the AI judge and by a person, and each scoring is its own **run**. That's deliberate: when you have an AI run and a human run for the same conversations, you can measure how often they agree. High agreement is your evidence that the AI judge is trustworthy enough to score on its own. Low agreement tells you the question needs sharpening or the judge needs more context.
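
This page does not say how agreement is calculated, so the following Python sketch assumes the simplest possible definition: the share of (conversation, criterion) pairs where the AI run and the human run recorded the same outcome. The identifiers and the function name are hypothetical.

```python
from typing import Dict, Optional, Tuple

# (conversation_id, criterion_id); both identifiers are made up for this example.
Key = Tuple[str, str]


def agreement_rate(ai: Dict[Key, str], human: Dict[Key, str]) -> Optional[float]:
    """Share of pairs scored by both assessors where the outcomes match."""
    shared = [key for key in ai if key in human]
    if not shared:
        return None  # nothing to compare yet
    matches = sum(1 for key in shared if ai[key] == human[key])
    return matches / len(shared)


ai_outcomes = {
    ("conv-1", "resolved"): "true",
    ("conv-1", "wrong_info"): "false",
    ("conv-2", "resolved"): "true",
    ("conv-2", "wrong_info"): "false",
    ("conv-3", "resolved"): "true",  # no human run for conv-3
}
human_outcomes = {
    ("conv-1", "resolved"): "true",
    ("conv-1", "wrong_info"): "false",
    ("conv-2", "resolved"): "false",  # the reviewer disagreed here
    ("conv-2", "wrong_info"): "false",
}
agreement = agreement_rate(ai_outcomes, human_outcomes)
print(agreement)  # 3 matches out of 4 shared pairs
```

Whether `abstain` and `na` outcomes should count toward agreement is a design choice this sketch leaves open; here they are compared like any other outcome.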

Human grading is designed to be fast. It uses a "grade-by-exception" flow: every criterion starts on its compliant (no-violation) answer, and you flag only the exceptions. A human run can also confirm or correct an existing AI run.
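
The grade-by-exception idea can be sketched as follows. This is an illustration of the concept only, not how the dashboard is implemented; the function name and criterion identifiers are hypothetical.

```python
from typing import Dict, Iterable


def grade_by_exception(expected: Dict[str, bool], flagged: Iterable[str]) -> Dict[str, bool]:
    """Start every criterion on its compliant answer, then flip the flagged ones."""
    outcomes = dict(expected)  # default: outcome equals expected_value
    for criterion_id in flagged:
        if criterion_id not in expected:
            raise KeyError(f"unknown criterion: {criterion_id}")
        outcomes[criterion_id] = not expected[criterion_id]
    return outcomes


# expected_value per criterion for one conversation.
expected = {"resolved": True, "wrong_info": False}

# The reviewer flags a single exception: the agent did give wrong info.
graded = grade_by_exception(expected, flagged=["wrong_info"])
print(graded)
```

Because unflagged criteria keep their compliant default, a reviewer who finds no problems records a full human run without changing anything.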

## The end-to-end lifecycle

Putting it together, here's the path from nothing to trusted, automatic scoring:

<Steps>
  <Step title="Define metrics">
    In **Metrics**, create the metrics and criteria that capture what "good" means for your conversations. See [Creating metrics & criteria](/evaluation/creating-metrics).
  </Step>

  <Step title="Bundle into an evaluation">
    In **Evaluations**, group the metrics you want to score together and add judge instructions (often your agent's knowledge base, so the judge knows what "correct" means).
  </Step>

  <Step title="Run it">
    From **Runs**, trigger the evaluation against one or many conversations, a pasted transcript, or an audio URL. Each conversation becomes its own run, processed in parallel.
  </Step>

  <Step title="Read the results">
    In **Overview**, read compliant rates per metric and drill into the individual verdicts behind any number.
  </Step>

  <Step title="Calibrate">
    In **Calibration**, compare AI and human runs to ask the real question: *is the AI judge trustworthy for this metric?*
  </Step>

  <Step title="Graduate">
    Once a metric's agreement clears the bar, **graduate** it to turn on automatic AI scoring — explicitly, with a click.
  </Step>
</Steps>

<Tip>
  You don't have to do all six steps before you get value. Defining one metric and running it against a handful of conversations already tells you something. Calibration and graduation are how you scale that up with confidence.
</Tip>

## Start here

<CardGroup cols={2}>
  <Card title="Quickstart" icon="rocket" href="/evaluation/quickstart">
    The fastest path: create a metric, run an evaluation, and read your first results.
  </Card>

  <Card title="Core concepts" icon="book-open" href="/evaluation/concepts">
    A deeper tour of metrics, criteria, outcomes, runs, and verdicts.
  </Card>

  <Card title="Creating metrics & criteria" icon="ruler" href="/evaluation/creating-metrics">
    Write good yes/no criteria and choose the right expected value.
  </Card>
</CardGroup>