# Your first evaluation

> A 10-minute, end-to-end walkthrough: create a metric, run it on a real conversation, and read the result.

This walkthrough gets you a real result fast. You'll create one simple quality check, point it at a single conversation, let the AI judge score it, and read the verdict — all from the **Analytics** section of the dashboard. By the end you'll understand how the pieces fit together and be ready to build something more serious.

Before you start, it helps to know three terms. A **metric** is something you want to measure about a conversation (like "did the agent resolve the request?"). A **criterion** is a single yes/no question inside a metric. An **evaluation** is a reusable bundle of metrics you run against conversations. If you want the full picture first, read [Concepts](/evaluation/concepts) — otherwise just follow along.

<Note>
  You need the `assessments:manage` permission to create metrics, build evaluations, and trigger runs. If buttons are missing or greyed out, ask an admin for that scope. See [Roles & permissions](/organizations/roles-and-permissions).
</Note>

## Walk through it

<Steps>
  <Step title="Create a metric with one criterion">
    Open **Analytics → Metrics** and click **New metric**.

    Give the metric a clear name — for this walkthrough, use **Task Resolution**. A metric on its own measures nothing; the actual check lives in its **criteria**. Add one criterion with:

    * **Question**: `Did the agent fully resolve the member's request?`
    * **Expected value**: `true`

    The **question** is the most important field. It is the exact yes/no question the AI judge reads and answers about each conversation, so write it as a plain, self-contained question. The **expected value** is the answer that means "good": here, `true` means a resolved request is the outcome you want. (For a "bad behavior" check like *"Did the agent give incorrect information?"* you'd set the expected value to `false`, so a `false` answer is the good one.)

    Leave the criterion as a normal **judged** criterion — that means the AI decides the answer. Save the metric.

    <Tip>
      Keep your first criterion to a single, unambiguous question. One question per criterion makes verdicts easy to read and easy to trust later. You can always add more criteria once you see how scoring works. See [Creating metrics & criteria](/evaluation/creating-metrics).
    </Tip>
  </Step>

  <Step title="Bundle it into an evaluation">
    A metric isn't run directly — you run an **evaluation**, which is a reusable bundle of one or more metrics plus the configuration the judge uses.

    Go to **Analytics → Evaluations** and click **New evaluation**. Give it a name (for example **First eval**), then pick your **Task Resolution** metric so it's included.

    There's an **instructions** field — free text the judge reads as supporting context. You can leave it empty for now, or paste one line that tells the judge what "resolved" means for your team, e.g. `The agent's job is to answer membership questions and complete account changes. Treat a request as resolved only if the member's question is actually answered.` Save the evaluation.

    <Info>
      Put hard rules in the criterion **question** and supporting facts in the evaluation **instructions**. The judge anchors on the question and treats instructions as scoring context. A common use of instructions is to paste your agent's knowledge base so the judge knows what "correct" looks like — but that's optional for your first run. See [Evaluations](/evaluation/evaluations).
    </Info>
  </Step>

  <Step title="Trigger a run on one conversation">
    Now score a real conversation. Click **Trigger Run** to open the dialog.

    1. Pick your evaluation (**First eval**).
    2. For the **source**, choose **one conversation** from your history. (The same dialog can take many conversations as a batch, a pasted transcript, or an audio URL — but start with one.)
    3. Confirm to start the run.

    The run is **queued** and processed asynchronously, so it won't finish instantly. Behind the scenes the AI judge reads the conversation transcript, answers your criterion question, and records its reasoning. A single short conversation usually completes in well under a minute.

    <Note>
      An **evaluation run** is one assessment of one conversation by one assessor. Here the assessor is the AI judge. You can also have a human reviewer grade the same conversation — that comes later in [Running evaluations](/evaluation/running-evaluations).
    </Note>
  </Step>

  <Step title="Read the result">
    Once the run completes, open it from the **Analytics → Runs** tab (or jump to **Analytics → Overview** to see the rolled-up numbers).

    For your criterion you'll see a **criterion result** — the judge's verdict — made up of:

    * **Outcome**: one of `true`, `false`, `abstain` (the judge couldn't decide), or `na` (the criterion didn't apply). For **Task Resolution** with expected value `true`, an outcome of `true` is **compliant** (good).
    * **Quote**: a verbatim snippet from the transcript that the judge used as evidence. This is how you check the judge's work — does the quote actually support the verdict?
    * **Confidence**: how sure the judge is.
    * **Reasoning**: a short explanation of why it landed on that outcome.

    On **Analytics → Overview**, each metric shows a **compliant rate**: the share of answered verdicts that matched the expected value. With one run you'll see `1/1` (or `0/1`), provided the judge answered `true` or `false`. Only `true`/`false` verdicts count toward "answered"; `abstain` and `na` are excluded, so a run that ends in either one adds nothing to the rate.

    <Tip>
      Always read the **quote** alongside the outcome on your first few runs. If the quote supports the verdict, you're building justified confidence in the judge. If it doesn't, that's a signal your criterion question needs sharpening. See [Analytics](/evaluation/analytics) for reading results at scale.
    </Tip>
  </Step>
</Steps>
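
The scoring rule from the last step can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not Anyreach's implementation: `is_compliant` and `compliant_rate` are hypothetical helper names, and outcomes are assumed to arrive as the plain strings `true`, `false`, `abstain`, and `na`.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical helpers for illustration; not part of any Anyreach SDK.
ANSWERED = {"true", "false"}  # abstain and na do not count as answered


def is_compliant(outcome: str, expected: str) -> Optional[bool]:
    """Return True/False for an answered verdict, None for abstain or na."""
    if outcome not in ANSWERED:
        return None
    return outcome == expected


def compliant_rate(outcomes: Iterable[str], expected: str) -> Tuple[int, int]:
    """Return (compliant, answered) counts for one criterion."""
    verdicts = [is_compliant(o, expected) for o in outcomes]
    answered = [v for v in verdicts if v is not None]
    return sum(answered), len(answered)


# One run of Task Resolution (expected value: true) that came back true.
print(compliant_rate(["true"], expected="true"))  # (1, 1)

# abstain and na are dropped before counting.
print(compliant_rate(["true", "false", "abstain", "na"], expected="true"))  # (1, 2)

# A "bad behavior" check: expected value false, so false is the good answer.
print(compliant_rate(["false", "true"], expected="false"))  # (1, 2)
```

Returning the two counts rather than a percentage mirrors the `1/1` display and avoids dividing by zero when every verdict is `abstain` or `na`.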

## What you just built

```
Evaluation "First eval"
└── Metric "Task Resolution"
    └── Criterion: "Did the agent fully resolve the member's request?" (expected: true)
        └── Criterion result (AI verdict): outcome + quote + confidence + reasoning
```

You created a metric, gave it a criterion, bundled it into an evaluation, ran it on a conversation, and read a judged verdict with its evidence. That's the whole loop. Everything else — more criteria, batch runs, human review, and deciding when to trust the AI judge — builds on exactly these pieces.
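
If it helps to think in code, the same hierarchy can be modeled as nested data structures. The sketch below is a mental model only: the class and field names are assumptions chosen to match the terms on this page, not an Anyreach schema or API, and the type of `confidence` is a guess because this page only describes it as how sure the judge is.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CriterionResult:
    outcome: str       # "true", "false", "abstain", or "na"
    quote: str         # verbatim transcript snippet used as evidence
    confidence: float  # assumed numeric; the real representation may differ
    reasoning: str     # short explanation of the outcome


@dataclass
class Criterion:
    question: str
    expected_value: bool
    # Simplification: a real result belongs to a run, not to the criterion.
    result: Optional[CriterionResult] = None


@dataclass
class Metric:
    name: str
    criteria: List[Criterion] = field(default_factory=list)


@dataclass
class Evaluation:
    name: str
    instructions: str = ""
    metrics: List[Metric] = field(default_factory=list)


# The objects created in this walkthrough, before any run has completed.
first_eval = Evaluation(
    name="First eval",
    metrics=[
        Metric(
            name="Task Resolution",
            criteria=[
                Criterion(
                    question="Did the agent fully resolve the member's request?",
                    expected_value=True,
                )
            ],
        )
    ],
)

criterion = first_eval.metrics[0].criteria[0]
print(criterion.question, "| expected:", criterion.expected_value)
```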

## Where to next

<CardGroup cols={2}>
  <Card title="Creating metrics & criteria" icon="list-checks" href="/evaluation/creating-metrics">
    Write sharper questions, set expected values, and use deterministic rules and `na` conditions.
  </Card>

  <Card title="Running evaluations" icon="play" href="/evaluation/running-evaluations">
    Score many conversations at once as a batch, and add human reviewers alongside the AI judge.
  </Card>

  <Card title="Calibration" icon="scale-balanced" href="/evaluation/calibration">
    Measure whether the AI judge actually agrees with humans before you trust it.
  </Card>
</CardGroup>
