This walkthrough gets you a real result fast. You’ll create one simple quality check, point it at a single conversation, let the AI judge score it, and read the verdict, all from the Analytics section of the dashboard. By the end you’ll understand how the pieces fit together and be ready to build something more serious.
Before you start, it helps to know three terms. A metric is something you want to measure about a conversation (like “did the agent resolve the request?”). A criterion is a single yes/no question inside a metric. An evaluation is a reusable bundle of metrics you run against conversations. If you want the full picture first, read Concepts; otherwise, just follow along.
You need the assessments:manage permission to create metrics, build evaluations, and trigger runs. If buttons are missing or greyed out, ask an admin for that scope. See Roles & permissions.

Walk through it

1

Create a metric with one criterion

Open Analytics → Metrics and click New metric.
Give the metric a clear name (for this walkthrough, use Task Resolution). A metric on its own measures nothing; the actual check lives in its criteria. Add one criterion with:
  • Question: Did the agent fully resolve the member's request?
  • Expected value: true
The question is the most important field. It is the exact yes/no question the AI judge reads and answers about each conversation, so write it as a plain, self-contained question. The expected value is the answer that means “good”: here, true means a resolved request is the outcome you want. (For a “bad behavior” check like “Did the agent give incorrect information?” you’d set expected value to false, so a false answer is the good one.)
Leave the criterion as a normal judged criterion, which means the AI decides the answer. Save the metric.
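The way an expected value turns a judged answer into "good" or "bad" can be sketched in a few lines of Python. This is an illustrative model only, not the product's implementation: the `Criterion` class and the `is_compliant` function are hypothetical names, and treating abstain and na as neither compliant nor non-compliant is an assumption based on how the compliant rate is described later in this walkthrough.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Criterion:
    """Hypothetical model of a judged criterion: one yes/no question plus the 'good' answer."""
    question: str
    expected: bool


def is_compliant(criterion: Criterion, outcome: str) -> Optional[bool]:
    """Return True/False for answered verdicts, None when there is no yes/no answer."""
    if outcome not in ("true", "false"):
        return None  # assumption: "abstain" and "na" are neither good nor bad
    return (outcome == "true") == criterion.expected


resolved = Criterion("Did the agent fully resolve the member's request?", expected=True)
misinformed = Criterion("Did the agent give incorrect information?", expected=False)

print(is_compliant(resolved, "true"))      # True: a resolved request is the outcome you want
print(is_compliant(misinformed, "false"))  # True: no incorrect information is the good case
print(is_compliant(misinformed, "true"))   # False
```

The comparison makes the inversion explicit: for a "bad behavior" criterion, the same false outcome that would fail Task Resolution counts as the good result.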
Keep your first criterion to a single, unambiguous question. One question per criterion makes verdicts easy to read and easy to trust later. You can always add more criteria once you see how scoring works. See Creating metrics & criteria.
2

Bundle it into an evaluation

A metric isn’t run directly. Instead, you run an evaluation, which is a reusable bundle of one or more metrics plus the configuration the judge uses.
Go to Analytics → Evaluations and click New evaluation. Give it a name (for example First eval), then pick your Task Resolution metric so it’s included.
There’s an instructions field: free text the judge reads as supporting context. You can leave it empty for now, or paste one line that tells the judge what “resolved” means for your team, e.g. “The agent's job is to answer membership questions and complete account changes. Treat a request as resolved only if the member's question is actually answered.”
Save the evaluation.
Put hard rules in the criterion question and supporting facts in the evaluation instructions. The judge anchors on the question and treats instructions as scoring context. A common use of instructions is to paste your agent’s knowledge base so the judge knows what “correct” looks like — but that’s optional for your first run. See Evaluations.
3

Trigger a run on one conversation

Now score a real conversation. Click Trigger Run to open the dialog.
  1. Pick your evaluation (First eval).
  2. For the source, choose one conversation from your history. (The same dialog can take many conversations as a batch, a pasted transcript, or an audio URL — but start with one.)
  3. Click to run it.
The run is queued and processed asynchronously, so it won’t finish instantly. Behind the scenes the AI judge reads the conversation transcript, answers your criterion question, and records its reasoning. A single short conversation usually completes in well under a minute.
An evaluation run is one assessment of one conversation by one assessor. Here the assessor is the AI judge. You can also have a human reviewer grade the same conversation — that comes later in Running evaluations.
4

Read the result

Once the run completes, open it from the Analytics → Runs tab (or jump to Analytics → Overview to see the rolled-up numbers).
For your criterion you’ll see a criterion result (the judge’s verdict), made up of:
  • Outcome: one of true, false, abstain (the judge couldn’t decide), or na (the criterion didn’t apply). For Task Resolution with expected value true, an outcome of true is compliant (good).
  • Quote: a verbatim snippet from the transcript that the judge used as evidence. This is how you check the judge’s work — does the quote actually support the verdict?
  • Confidence: how sure the judge is.
  • Reasoning: a short explanation of why it landed on that outcome.
On Analytics → Overview, each metric shows a compliant rate: the share of answered verdicts that matched the expected value. Only true/false verdicts count toward “answered”; abstain and na are excluded. With a single run that returned true or false, you’ll see 1/1 (or 0/1).
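The arithmetic behind the compliant rate can be written out as a minimal Python sketch. The function name `compliant_rate`, the `(outcome, expected_value)` tuple shape, and returning `None` when nothing was answered are all assumptions made for illustration; the dashboard computes this for you and its internals may differ.

```python
from typing import Iterable, Optional, Tuple


def compliant_rate(verdicts: Iterable[Tuple[str, bool]]) -> Optional[Tuple[int, int]]:
    """Count (compliant, answered) over (outcome, expected_value) pairs.

    Returns None when no verdict was answered, so callers never divide by zero.
    """
    compliant = 0
    answered = 0
    for outcome, expected in verdicts:
        if outcome not in ("true", "false"):
            continue  # "abstain" and "na" are excluded from the denominator
        answered += 1
        if (outcome == "true") == expected:
            compliant += 1
    return (compliant, answered) if answered else None


# Four runs of a criterion whose expected value is true.
runs = [("true", True), ("false", True), ("abstain", True), ("na", True)]
print(compliant_rate(runs))              # (1, 2): one compliant out of two answered
print(compliant_rate([("true", True)]))  # (1, 1): the single-run case from this walkthrough
```

Because abstain and na drop out of the denominator, a metric with many unanswered verdicts can show a high rate on very few data points, so it is worth checking the answered count as well as the ratio.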
Always read the quote alongside the outcome on your first few runs. If the quote supports the verdict, you’re building justified confidence in the judge. If it doesn’t, that’s a signal your criterion question needs sharpening. See Analytics for reading results at scale.
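If you copy results out of the dashboard for review, a quick mechanical check is whether the judge's quote really appears in the transcript. The sketch below is a hypothetical helper, not a product feature: `quote_appears_in` and `normalize` are illustrative names, and ignoring case and whitespace differences is an assumption about how strictly "verbatim" should be read. It only confirms that the quote is present; whether the quote supports the verdict still needs a human read.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line wrapping doesn't cause false mismatches."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_appears_in(transcript: str, quote: str) -> bool:
    """True if the judge's quote occurs in the transcript after normalization."""
    needle = normalize(quote)
    return bool(needle) and needle in normalize(transcript)


transcript = """Member: Can you update my mailing address?
Agent: Done. Your address is now updated, and you'll get
a confirmation email shortly."""

print(quote_appears_in(transcript, "you'll get a confirmation email"))  # True
print(quote_appears_in(transcript, "I have cancelled your account"))    # False
```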

What you just built

Evaluation "First eval"
└── Metric "Task Resolution"
    └── Criterion: "Did the agent fully resolve the member's request?" (expected: true)
        └── Criterion result (AI verdict): outcome + quote + confidence + reasoning
You created a metric, gave it a criterion, bundled it into an evaluation, ran it on a conversation, and read a judged verdict with its evidence. That’s the whole loop. Everything else — more criteria, batch runs, human review, and deciding when to trust the AI judge — builds on exactly these pieces.

Where to next

Creating metrics & criteria

Write sharper questions, set expected values, and use deterministic rules and na conditions.

Running evaluations

Score many conversations at once as a batch, and add human reviewers alongside the AI judge.

Calibration

Measure whether the AI judge actually agrees with humans before you trust it.