
# Bundling metrics into evaluations

> Group metrics into a reusable evaluation and give the AI judge the context it needs.

A **metric** measures one thing about a conversation (see [Creating metrics & criteria](/evaluation/creating-metrics)). But you rarely want to score just one thing. To run a real QA pass you want to check task resolution, information accuracy, tone, and compliance all at once, across many conversations. That is what an **evaluation** is for.

An **evaluation** is a reusable bundle that ties three things together:

* **Metrics** — the set of metrics (and therefore all their criteria) you want scored.
* **Judge configuration** — how the AI judge runs against this bundle.
* **Instructions** — a free-text field the judge reads as context, so it knows what "correct" and "in-scope" mean for your agent.

Once you define an evaluation, you can run it against one conversation or a whole batch. Every conversation is scored against the same metrics, with the same instructions, by the same judge — so the results are comparable. That consistency is the entire point: you build the bundle once, then reuse it.
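As a mental model, the bundle can be pictured as a small data structure. The sketch below is illustrative only: the class names, field names, and example criteria are assumptions made for this example, not the product's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    """One thing measured about a conversation, made of yes/no criteria."""
    name: str
    criteria: list[str]


@dataclass
class Evaluation:
    """A saved template that bundles metrics, judge settings, and instructions."""
    name: str
    metrics: list[Metric]
    instructions: str = ""
    judge_config: dict = field(default_factory=dict)  # hypothetical; contents not documented here

    def all_criteria(self) -> list[str]:
        # Selecting a metric pulls in every one of its criteria.
        return [question for metric in self.metrics for question in metric.criteria]


# Hypothetical metrics and criteria, used only to show the shape of the bundle.
accuracy = Metric("Information accuracy", ["Did the agent give correct information?"])
tone = Metric("Tone", ["Was the agent polite?", "Did the agent avoid jargon?"])

full_qa = Evaluation("Full QA suite", [accuracy, tone], instructions="<knowledge base text>")
print(len(full_qa.all_criteria()))  # 3
```

The template holds no scores. It only records what will be scored and with what context.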

You manage evaluations in the dashboard under **Analytics → Evaluations**.

<Note>
  An evaluation does not score anything on its own. It is a saved template. Scoring happens when you trigger a **run** — one assessment of one conversation. See [Running an evaluation](/evaluation/running-evaluations).
</Note>

## Why bundle metrics

Without bundling, you would have to remember which metrics make up "the QA suite" and re-select them every time. An evaluation freezes that decision:

* **Run a whole suite at once.** Select an evaluation, pick a batch of conversations, and every metric in the bundle is scored against every conversation in parallel.
* **Keep results comparable.** Because the metrics and instructions are fixed, two conversations scored by the same evaluation are measured the same way.
* **Reuse across time.** Run the same evaluation next week against new conversations and the numbers mean the same thing.

You can keep several evaluations side by side for different jobs. For example, you might keep a broad **Full QA suite** that includes every metric alongside a focused **Accuracy check** that includes only the information-accuracy metric for a quick spot-check. Each is its own bundle with its own instructions.
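To make the comparability point concrete, here is a minimal sketch of how a batch expands into one planned run per conversation, each carrying the identical metric set. The function name, dictionary keys, and conversation IDs are hypothetical and do not describe the platform's internals.

```python
def plan_batch(evaluation_name, metric_names, conversation_ids):
    """Return one planned run per conversation, each with the same metric set."""
    return [
        {
            "evaluation": evaluation_name,
            "conversation_id": conversation_id,
            "metrics": list(metric_names),  # identical for every run in the batch
        }
        for conversation_id in conversation_ids
    ]


runs = plan_batch("Accuracy check", ["Information accuracy"], ["conv-1", "conv-2", "conv-3"])
print(len(runs))  # 3
```

Because every planned run shares the same metric list, any two results from the batch can be compared directly.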

## Creating an evaluation

<Steps>
  <Step title="Open the Evaluations tab">
    Go to **Analytics → Evaluations** and click **New**.
  </Step>

  <Step title="Name the evaluation">
    Give it a clear name that describes the job, like **Full QA suite** or **Compliance check**. You will pick this name later when you trigger a run.
  </Step>

  <Step title="Select the metrics it includes">
    Choose which metrics belong in this bundle. Adding a metric pulls in all of its criteria automatically — you select at the metric level, not the criterion level. Include only what this evaluation is meant to measure; a focused evaluation with one metric is perfectly valid.
  </Step>

  <Step title="Write the instructions">
    Fill in the **instructions** field with the context the judge needs. This is covered in detail below.
  </Step>

  <Step title="Save">
    Save the evaluation. It is now reusable and shows up in the **Trigger Run** dialog.
  </Step>
</Steps>

## The instructions field

The **instructions** field is free text that the AI judge reads as scoring context before it answers any criterion. It is one of the most important things you configure, because it tells the judge what "correct" looks like for *your* agent.

The judge cannot know your business. It does not know which products you sell, which policies apply, or what counts as an accurate answer — unless you tell it. The classic use of the instructions field is to **paste the agent's knowledge base or scoring guidance** so the judge can decide things like "Did the agent give correct information?" or "Did the agent stay in scope?" against real facts instead of guessing.

<Tip>
  If your agent answers from a knowledge base, paste that same source material into the instructions. The judge then grades "Did the agent give wrong info?" against the truth your agent is supposed to know. See [Attaching a knowledge base](/agents/knowledge-base-attachment).
</Tip>

### Division of labor: question vs. instructions

The judge anchors on the **criterion question** — that exact yes/no text is the thing it is answering — and treats the **instructions** as supporting context. That difference tells you where to put each kind of information:

| Put this...                                        | ...here                         | Why                                                                                       |
| -------------------------------------------------- | ------------------------------- | ----------------------------------------------------------------------------------------- |
| Hard rules and the exact thing being judged        | The criterion **question**      | The judge answers the question directly, so a rule only binds if it is *in* the question. |
| Supporting facts, knowledge base, scoring guidance | The evaluation **instructions** | The judge reads these as context to decide the question correctly.                        |

So a rule like "the agent must never quote a price over chat" belongs *inside the criterion question* (phrased channel-agnostically), not in the instructions — because the judge enforces the question, not the context. The instructions are where the supporting knowledge lives: the price list, the policy document, the definition of "in-scope."

```
Criterion question       →  the rule the judge enforces       (hard, exact, binding)
Evaluation instructions  →  the facts the judge reasons with  (context, knowledge)
```

<Warning>
  Do not bury a pass/fail rule in the instructions and leave the question vague. The judge may treat it as background and not enforce it. Anything that must be true for a conversation to be compliant belongs in the criterion question.
</Warning>
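One way to picture this split is a prompt where the instructions appear as background and the criterion question is the final, explicit task. How the real judge assembles its prompt is not documented here, so treat the function below, including its name and section labels, as an assumption made purely for illustration.

```python
def build_judge_prompt(question: str, instructions: str, transcript: str) -> str:
    """Assemble a prompt with instructions as context and the question as the task."""
    sections = [
        "Context (supporting facts and scoring guidance):\n" + instructions.strip(),
        "Conversation transcript:\n" + transcript.strip(),
        # The question comes last because it is the thing being answered.
        "Answer this question with yes or no:\n" + question.strip(),
    ]
    return "\n\n".join(sections)


prompt = build_judge_prompt(
    question="Did the agent avoid quoting a price?",
    instructions="Price list: <paste the price list here>",
    transcript="Customer: How much is it?\nAgent: I can send you a quote by email.",
)
print(prompt)
```

In this sketch a rule that lives only in the context section is never asked about directly, which is why a binding rule belongs in the question.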

## Editing an evaluation

Open any evaluation from **Analytics → Evaluations** to change it. You can rename it, add or remove metrics, and edit the **instructions** at any time in the Evaluation builder. Editing the instructions is common: as you learn what the judge gets wrong, you sharpen the guidance and re-run.

<Note>
  Editing an evaluation changes how *future* runs are scored. It does not retroactively rescore runs you have already triggered.
</Note>
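The behavior in the note can be modeled as each run keeping a copy of what the judge read at trigger time. This is a sketch under that assumption: the `Run` class and `trigger_run` function are hypothetical, and the page does not say how the platform actually stores runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    conversation_id: str
    instructions_used: str  # what the judge read when this run was triggered


def trigger_run(evaluation: dict, conversation_id: str) -> Run:
    # Copy the current instructions onto the run instead of pointing back at the evaluation.
    return Run(conversation_id, evaluation["instructions"])


evaluation = {"name": "Full QA suite", "instructions": "v1 guidance"}
old_run = trigger_run(evaluation, "conv-1")

evaluation["instructions"] = "v2 guidance"  # edit the evaluation
new_run = trigger_run(evaluation, "conv-2")

print(old_run.instructions_used)  # v1 guidance
print(new_run.instructions_used)  # v2 guidance
```

To compare old and new guidance on the same conversation, trigger a fresh run after editing.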

## Archiving an evaluation

When an evaluation is no longer in use, archive it to keep your Evaluations list clean. Archiving removes it from everyday view without deleting the runs it produced, so your historical results stay intact.

## Permissions

Viewing evaluations requires the `assessments:read` scope. Creating, editing, and triggering them require the `assessments:manage` scope.
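If you are auditing which credentials can do what, the mapping above can be written as a small lookup. The scope strings come from this page, while the action names and the `can` helper are hypothetical. The sketch deliberately does not assume that `assessments:manage` implies `assessments:read`, since this page does not say so.

```python
REQUIRED_SCOPE = {
    "view": "assessments:read",
    "create": "assessments:manage",
    "edit": "assessments:manage",
    "trigger": "assessments:manage",
}


def can(action: str, granted_scopes: set[str]) -> bool:
    """Return True if the granted scopes include the one this action needs."""
    return REQUIRED_SCOPE[action] in granted_scopes


print(can("view", {"assessments:read"}))     # True
print(can("trigger", {"assessments:read"}))  # False
```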

## Next steps

<CardGroup cols={2}>
  <Card title="Run an evaluation" icon="play" href="/evaluation/running-evaluations">
    Trigger a run against a single conversation, a whole batch, a transcript, or an audio URL.
  </Card>

  <Card title="Create metrics & criteria" icon="list-checks" href="/evaluation/creating-metrics">
    Define the metrics and yes/no criteria that an evaluation bundles together.
  </Card>
</CardGroup>
