# Running an evaluation

> Trigger the AI judge on one or many conversations, and review or correct verdicts by hand.

Once you've built an evaluation, running it is how you actually get scores. An **evaluation** is a reusable bundle of metrics plus the judge configuration and a free-text **instructions** field. When you run it, Anyreach produces verdicts for every criterion in those metrics, against the conversations you choose.

There are two ways to get verdicts, and you'll use both:

* **The AI judge** scores conversations automatically and at scale. This is the everyday path — point it at one conversation or thousands and let it work.
* **Human review** lets a person grade a conversation themselves, to confirm or correct what the judge said. Those human verdicts are the ground truth that [calibration](/evaluation/calibration) later compares the judge against.

Everything here lives in the dashboard under the **Analytics** section. A quick reminder of the vocabulary, in case you're new: a **metric** is a thing you measure (like "Task Resolution"); a **criterion** is one yes/no check inside it; an **outcome** is the verdict for that check (`true`, `false`, `abstain`, or `na`); and a **run** is one assessment of one conversation by one assessor. If any of that is fuzzy, read [Core concepts](/evaluation/concepts) first.
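
If it helps to see that vocabulary as data, here is a minimal sketch. The class and field names are illustrative assumptions, not Anyreach's actual schema. Only the four outcome values and the `expected_value` idea come from this guide.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(str, Enum):
    """The four verdicts a criterion can receive."""
    TRUE = "true"
    FALSE = "false"
    ABSTAIN = "abstain"
    NA = "na"


@dataclass(frozen=True)
class Criterion:
    question: str         # one yes/no check
    expected_value: bool  # which answer counts as compliant


@dataclass(frozen=True)
class Metric:
    name: str                       # e.g. "Task Resolution"
    criteria: tuple[Criterion, ...]


@dataclass
class Run:
    """One assessment of one conversation by one assessor."""
    conversation_id: str
    assessor: str  # hypothetical labels: "ai" or "human"
    outcomes: dict[str, Outcome] = field(default_factory=dict)  # keyed by question
```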

<Note>
  Triggering runs and grading conversations both require the `assessments:manage` scope. With only `assessments:read` you can view results but not produce them.
</Note>
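
As a rough sketch of that permission rule: the two scope strings come from the note above, while the function names are hypothetical. Treating `assessments:manage` as also allowing viewing is an assumption, not something this page states.

```python
MANAGE_SCOPE = "assessments:manage"
READ_SCOPE = "assessments:read"


def can_trigger_or_grade(scopes: set[str]) -> bool:
    """Triggering runs and grading conversations need the manage scope."""
    return MANAGE_SCOPE in scopes


def can_view_results(scopes: set[str]) -> bool:
    """Viewing needs read. Assumption: manage also permits viewing."""
    return READ_SCOPE in scopes or MANAGE_SCOPE in scopes
```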

## Part A — Run the AI judge

The AI judge is an LLM that reads a conversation and answers each criterion's yes/no question for you. You start it from the **Trigger Run** dialog.

<Steps>
  <Step title="Open the Trigger Run dialog">
    From the **Analytics** section, open **Trigger Run**. This is where you tell Anyreach *what* to score and *which conversations* to score it against.
  </Step>

  <Step title="Pick an evaluation">
    Choose the evaluation you want to run. The evaluation decides which metrics and criteria get scored, and it carries the **instructions** the judge will read as context (more on that below).
  </Step>

  <Step title="Choose a source">
    Pick what the judge should read. You have three options:

    | Source            | Use it when                                         | Result                            |
    | ----------------- | --------------------------------------------------- | --------------------------------- |
    | **Conversations** | You want to score real interactions from your inbox | One run per selected conversation |
    | **Transcript**    | You want to test against text you paste in by hand  | A single run over that text       |
    | **Audio URL**     | You want to score a recording by link               | A single run over that audio      |

    For **Conversations**, you can select one or many. Selecting more than one is a **batch**: each conversation becomes its own independent run, and the runs are processed in parallel and scored separately instead of being merged into one combined score. That keeps every conversation's verdicts separate and comparable.
  </Step>

  <Step title="Trigger the run">
    Confirm, and the runs are queued. The judge works asynchronously, so the dialog doesn't block — verdicts fill in as each run finishes.
  </Step>
</Steps>
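
The source rules in the steps above can be sketched as a small planning function. This is an illustration only: `PlannedRun`, `plan_runs`, and the source identifiers `"conversations"`, `"transcript"`, and `"audio_url"` are hypothetical names, not part of any Anyreach API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedRun:
    source: str              # hypothetical identifiers, see plan_runs
    target: str              # conversation id, pasted text, or audio URL
    status: str = "pending"  # every run starts out queued


def plan_runs(source: str, targets: list[str]) -> list[PlannedRun]:
    """Conversations fan out to one run each; the other sources yield one run."""
    if source == "conversations":
        if not targets:
            raise ValueError("select at least one conversation")
        return [PlannedRun(source, target) for target in targets]
    if source in ("transcript", "audio_url"):
        if len(targets) != 1:
            raise ValueError(f"{source} takes exactly one input")
        return [PlannedRun(source, targets[0])]
    raise ValueError(f"unknown source: {source}")
```

Selecting three conversations plans three independent runs, which mirrors why a batch never collapses into a single combined score.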

### What the judge does for each criterion

For every criterion in the evaluation's metrics, the judge reads three things — the criterion **question**, any **na rule** attached to it, and the evaluation's **instructions** — and then returns:

* an **outcome** (`true`, `false`, `abstain`, or `na`),
* a verbatim **quote** from the conversation that supports the verdict,
* a **confidence**, and
* short **reasoning**.

The judge anchors on the **question** as the rule it's deciding, and treats the **instructions** as supporting context — so the question carries the hard rule, while the instructions tell the judge what "correct" or "in scope" looks like (for example, a pasted copy of the agent's [knowledge base](/agents/knowledge-base-attachment)).

<Note>
  For an **audio**-modality metric, if the conversation has no recording the judge can't listen to anything, so the verdict auto-resolves to `na` (not applicable) rather than guessing.
</Note>
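
To make the shape of a verdict concrete, here is a minimal sketch under stated assumptions. The four fields match the list above, but the `Verdict` class, the `problems` and `auto_resolve` helpers, the numeric type of `confidence`, and the `"text"` modality label are all hypothetical. The checks simply restate two rules from this section: a quote must be verbatim, and an audio metric with no recording resolves to `na`.

```python
from dataclasses import dataclass
from typing import Optional

VALID_OUTCOMES = {"true", "false", "abstain", "na"}


@dataclass(frozen=True)
class Verdict:
    outcome: str
    quote: str
    confidence: float  # assumption: the scale is not specified in this guide
    reasoning: str


def problems(verdict: Verdict, transcript: str) -> list[str]:
    """List the ways a verdict breaks the rules described above."""
    found = []
    if verdict.outcome not in VALID_OUTCOMES:
        found.append(f"unknown outcome: {verdict.outcome}")
    # A supporting quote has to appear word for word in the conversation.
    if verdict.quote and verdict.quote not in transcript:
        found.append("quote is not verbatim")
    return found


def auto_resolve(modality: str, has_recording: bool) -> Optional[str]:
    """Return "na" when an audio metric has nothing to listen to, else None."""
    if modality == "audio" and not has_recording:
        return "na"
    return None  # the judge scores the criterion normally
```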

### Run statuses

Each run moves through a status you can watch in the **Runs** tab:

| Status      | Meaning                                                      |
| ----------- | ------------------------------------------------------------ |
| `pending`   | Queued or in progress — the judge hasn't finished yet.       |
| `completed` | The run finished and produced verdicts.                      |
| `failed`    | The run couldn't complete (for example, a processing error). |

### Batches and partial failures

When you run a batch and some conversations fail while others succeed, Anyreach doesn't make you start over. **The failed conversations stay selected** in the **Trigger Run** dialog, so you can fix the cause and re-trigger just those, while the ones that already completed keep their results.
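
The dialog handles this selection for you. As a sketch of the same logic, assuming you had a mapping from conversation id to run status (the function name and the ids are hypothetical):

```python
def conversations_to_retry(status_by_conversation: dict[str, str]) -> list[str]:
    """Keep only the conversations whose run failed; completed ones are left alone."""
    return [
        conversation_id
        for conversation_id, status in status_by_conversation.items()
        if status == "failed"
    ]


batch = {"conv-1": "completed", "conv-2": "failed", "conv-3": "completed", "conv-4": "failed"}
retry = conversations_to_retry(batch)
```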

<Tip>
  Batches are the fast way to score a whole slice of history — filter your [inbox](/conversations/inbox-overview) down to the conversations you care about, then select them all in **Trigger Run**.
</Tip>

## Part B — Review or correct by hand

The AI judge is fast, but you don't have to take its word for it. A **human run** is the same kind of run, produced by a person instead of the model. You use human runs to spot-check the judge, to score metrics the judge isn't trusted on yet, and to build the ground-truth verdicts that [calibration](/evaluation/calibration) measures the judge against.

### Grade-by-exception

Grading a conversation yourself is designed to be quick. It's **grade-by-exception**: every criterion starts on the compliant answer — the one that means "no problem here" — and you only change the criteria where something actually went wrong. You're flagging exceptions, not filling out every field from scratch.

```
Default state:  every criterion → compliant (assume good)
Your job:       flip only the criteria where the agent slipped
```

Remember that "compliant" depends on the criterion's **expected\_value**: for a criterion like *"Did the agent give wrong information?"* the expected (good) answer is `false`, so leaving it on the compliant default means you're asserting the agent did **not** give wrong information.
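
A minimal sketch of that default-then-flip logic, assuming each criterion is identified by a key and carries a boolean `expected_value`. The function names and criterion keys are hypothetical, and the sketch only covers flipping to the opposite answer, not choosing `abstain` or `na`.

```python
def default_grades(expected_values: dict[str, bool]) -> dict[str, bool]:
    """Every criterion starts on its compliant answer."""
    return dict(expected_values)


def apply_exceptions(expected_values: dict[str, bool], exceptions: set[str]) -> dict[str, bool]:
    """Flip only the flagged criteria away from their compliant answer."""
    unknown = exceptions - set(expected_values)
    if unknown:
        raise KeyError(f"unknown criteria: {sorted(unknown)}")
    return {
        key: (not expected) if key in exceptions else expected
        for key, expected in expected_values.items()
    }


expected = {"gave_wrong_information": False, "resolved_task": True}
grades = apply_exceptions(expected, {"resolved_task"})
```

Here the reviewer flagged only `resolved_task`, so `gave_wrong_information` stays on `False`, which asserts that the agent did not give wrong information.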

### Confirming or correcting an AI run

A human run can stand on its own, but it can also **confirm or correct** an existing AI run on the same conversation. When you agree with the judge, your human verdicts line up with its verdicts; when you disagree, they diverge. That agreement (and disagreement) is exactly what [calibration](/evaluation/calibration) measures later — it's how Anyreach decides whether the judge is trustworthy enough to score a metric automatically. The more honest human verdicts you record, the more reliable that measurement becomes.
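
To give a feel for how agreement can be quantified, here is a sketch of the standard Cohen's κ formula applied to paired human and AI outcomes for the same criteria. It is a textbook computation, not Anyreach's calibration code. How calibration treats `abstain` or `na` is not covered on this page, so the sketch compares labels exactly as given.

```python
from collections import Counter


def cohens_kappa(human: list[str], ai: list[str]) -> float:
    """Agreement between two raters, corrected for agreement expected by chance."""
    if len(human) != len(ai) or not human:
        raise ValueError("need two non-empty, equal-length lists of outcomes")
    n = len(human)
    observed = sum(h == a for h, a in zip(human, ai)) / n
    human_counts = Counter(human)
    ai_counts = Counter(ai)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_counts[label] * ai_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


human_outcomes = ["true", "true", "false", "false"]
ai_outcomes = ["true", "true", "false", "true"]
kappa = cohens_kappa(human_outcomes, ai_outcomes)  # 0.5 for this data
```

The raters agree on three of four verdicts (0.75), but half of that agreement would be expected by chance given how often each rater says `true`, which is why κ comes out lower than the raw agreement rate.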

<Info>
  Both AI and human runs can exist for the same conversation and the same metric at once. On the [Analytics overview](/evaluation/analytics), the **assessor toggle** ("AI + Human", "AI judge", "Human") controls *whose* verdicts you're looking at.
</Info>

## Next steps

<CardGroup cols={2}>
  <Card title="Read your results" icon="chart-column" href="/evaluation/analytics">
    See compliant rates per metric, filter by assessor and date, and drill into the exceptions.
  </Card>

  <Card title="Calibrate the judge" icon="ruler" href="/evaluation/calibration">
    Compare AI and human runs with Cohen's κ to decide whether the judge can be trusted.
  </Card>
</CardGroup>