You need the assessments:manage permission to create metrics, build evaluations, and trigger runs. If buttons are missing or greyed out, ask an admin for that scope. See Roles & permissions.
Walk through it
Create a metric with one criterion
Open Analytics → Metrics and click New metric.
Give the metric a clear name (for this walkthrough, use Task Resolution). A metric on its own measures nothing; the actual check lives in its criteria. Add one criterion with:
- Question: Did the agent fully resolve the member's request?
- Expected value: true
An expected value of true means a resolved request is the outcome you want. (For a “bad behavior” check like “Did the agent give incorrect information?” you’d set the expected value to false, so a false answer is the good one.)
Leave the criterion as a normal judged criterion, which means the AI decides the answer. Save the metric.
Bundle it into an evaluation
A metric isn’t run directly. Instead, you run an evaluation, which is a reusable bundle of one or more metrics plus the configuration the judge uses.
Go to Analytics → Evaluations and click New evaluation. Give it a name (for example First eval), then pick your Task Resolution metric so it’s included.
There’s an instructions field: free text the judge reads as supporting context. You can leave it empty for now, or paste one line that tells the judge what “resolved” means for your team, e.g.
The agent's job is to answer membership questions and complete account changes. Treat a request as resolved only if the member's question is actually answered.
Save the evaluation.
Put hard rules in the criterion question and supporting facts in the evaluation instructions. The judge anchors on the question and treats instructions as scoring context. A common use of instructions is to paste your agent’s knowledge base so the judge knows what “correct” looks like, but that’s optional for your first run. See Evaluations.
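As a mental model only, the relationship between a criterion, a metric, and an evaluation can be sketched in Python. Every class, field, and function name below is hypothetical and is not the product's API, and the prompt layout is an assumption chosen to mirror the advice above: the question leads and the instructions follow as context.

```python
from dataclasses import dataclass, field


@dataclass
class Criterion:
    question: str         # the hard rule the judge answers
    expected_value: bool  # which answer counts as the good outcome


@dataclass
class Metric:
    name: str
    criteria: list[Criterion] = field(default_factory=list)


@dataclass
class Evaluation:
    name: str
    metrics: list[Metric]
    instructions: str = ""  # optional supporting context for the judge


def build_judge_prompt(evaluation: Evaluation, criterion: Criterion, transcript: str) -> str:
    """Assumed layout: question first, then instructions as context, then the transcript."""
    parts = [f"Question: {criterion.question}"]
    if evaluation.instructions:
        parts.append(f"Context: {evaluation.instructions}")
    parts.append(f"Transcript:\n{transcript}")
    return "\n\n".join(parts)


task_resolution = Metric(
    name="Task Resolution",
    criteria=[Criterion("Did the agent fully resolve the member's request?", expected_value=True)],
)
first_eval = Evaluation(
    name="First eval",
    metrics=[task_resolution],
    instructions="Treat a request as resolved only if the member's question is actually answered.",
)
prompt = build_judge_prompt(
    first_eval,
    task_resolution.criteria[0],
    "Member: Can you update my address?\nAgent: Done, your address is now updated.",
)
print(prompt)
```

Keeping the instructions optional in the sketch reflects that an empty instructions field is fine for a first run.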
Trigger a run on one conversation
Now score a real conversation. Click Trigger Run to open the dialog.
- Pick your evaluation (First eval).
- For the source, choose one conversation from your history. (The same dialog can take many conversations as a batch, a pasted transcript, or an audio URL — but start with one.)
- Click to run it.
An evaluation run is one assessment of one conversation by one assessor. Here the assessor is the AI judge. You can also have a human reviewer grade the same conversation — that comes later in Running evaluations.
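The "one conversation, one assessor" definition implies that a batch, or an added human reviewer, multiplies the number of runs. The sketch below illustrates that counting; the `Run` type, the `plan_runs` helper, and the identifiers are hypothetical, and it is an assumption that the product splits a batch exactly this way.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Run:
    evaluation: str
    conversation_id: str
    assessor: str  # e.g. the AI judge or a named human reviewer


def plan_runs(evaluation: str, conversation_ids: list[str], assessors: list[str]) -> list[Run]:
    """One run per (conversation, assessor) pair."""
    return [Run(evaluation, c, a) for c, a in product(conversation_ids, assessors)]


# This walkthrough: one conversation scored by the AI judge.
single = plan_runs("First eval", ["conv-1"], ["ai_judge"])

# A later setup: three conversations, each scored by the judge and one human.
batch = plan_runs("First eval", ["conv-1", "conv-2", "conv-3"], ["ai_judge", "human:sam"])

print(len(single), len(batch))  # 1 6
```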
Read the result
Once the run completes, open it from the Analytics → Runs tab (or jump to Analytics → Overview to see the rolled-up numbers).
For your criterion you’ll see a criterion result (the judge’s verdict) made up of:
- Outcome: one of true, false, abstain (the judge couldn’t decide), or na (the criterion didn’t apply). For Task Resolution with expected value true, an outcome of true is compliant (good).
- Quote: a verbatim snippet from the transcript that the judge used as evidence. This is how you check the judge’s work: does the quote actually support the verdict?
- Confidence: how sure the judge is.
- Reasoning: a short explanation of why it landed on that outcome.
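To make the result fields concrete, here is a minimal sketch of a criterion result, the compliance rule (outcome compared with expected value), and answered-only counting in which abstain and na are left out. All names are hypothetical rather than the product's schema, and treating confidence as a number between 0 and 1 is an assumption. The quote check only confirms that the quote appears verbatim in the transcript; whether it supports the verdict still needs a human read.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CriterionResult:
    outcome: str       # "true", "false", "abstain", or "na"
    quote: str         # evidence the judge cites from the transcript
    confidence: float  # assumed here to be a number between 0 and 1
    reasoning: str


def is_compliant(result: CriterionResult, expected_value: bool) -> Optional[bool]:
    """True/False for answered verdicts; None for abstain and na."""
    if result.outcome not in ("true", "false"):
        return None
    return (result.outcome == "true") == expected_value


def quote_in_transcript(result: CriterionResult, transcript: str) -> bool:
    """First-pass check that the quote is verbatim, ignoring whitespace differences."""
    quote = " ".join(result.quote.split())
    return bool(quote) and quote in " ".join(transcript.split())


def score(results: list[CriterionResult], expected_value: bool) -> tuple[int, int]:
    """Return (compliant, answered); abstain and na are excluded from both counts."""
    verdicts = [is_compliant(r, expected_value) for r in results]
    answered = [v for v in verdicts if v is not None]
    return sum(answered), len(answered)


transcript = "Member: Can you update my address?\nAgent: Done, your address is now updated."
result = CriterionResult("true", "Done, your address is now updated.", 0.9,
                         "The agent confirmed the change.")
abstained = CriterionResult("abstain", "", 0.2, "The transcript was unclear.")

print(is_compliant(result, True))          # True
print(quote_in_transcript(result, transcript))  # True
print(score([result, abstained], True))    # (1, 1)
```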
With a single answered criterion, the score reads 1/1 (or 0/1). Only true/false verdicts count toward “answered”; abstain and na are excluded.
What you just built
Where to next
Creating metrics & criteria
Write sharper questions, set expected values, and use deterministic rules and na conditions.
Running evaluations
Score many conversations at once as a batch, and add human reviewers alongside the AI judge.
Calibration
Measure whether the AI judge actually agrees with humans before you trust it.

