A metric measures one thing about a conversation (see Creating metrics & criteria). But you rarely want to score just one thing. To run a real QA pass you want to check task resolution, information accuracy, tone, and compliance all at once, across many conversations. That is what an evaluation is for.
An evaluation is a reusable bundle that ties three things together:
  • Metrics — the set of metrics (and therefore all their criteria) you want scored.
  • Judge configuration — how the AI judge runs against this bundle.
  • Instructions — a free-text field the judge reads as context, so it knows what “correct” and “in-scope” mean for your agent.
Once you define an evaluation, you can run it against one conversation or a whole batch. Every conversation is scored against the same metrics, with the same instructions, by the same judge — so the results are comparable. That consistency is the entire point: you build the bundle once, then reuse it. You manage evaluations in the dashboard under Analytics → Evaluations.
An evaluation does not score anything on its own. It is a saved template. Scoring happens when you trigger a run — one assessment of one conversation. See Running an evaluation.
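To make the three-part bundle concrete, here is a minimal sketch of how it could be modeled in code. This is not the product's schema: the class names, field names, and the `model` setting are all hypothetical, chosen only to mirror the three parts described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeConfig:
    # Hypothetical setting; this guide does not list the real judge options.
    model: str = "default-judge"


@dataclass(frozen=True)
class Evaluation:
    """A saved template: it holds configuration and scores nothing by itself."""
    name: str
    metric_names: tuple[str, ...]  # which metrics are in the bundle
    judge: JudgeConfig             # how the AI judge runs against the bundle
    instructions: str              # free-text context the judge reads


full_qa = Evaluation(
    name="Full QA suite",
    metric_names=("Task resolution", "Information accuracy", "Tone", "Compliance"),
    judge=JudgeConfig(),
    instructions="Scope: billing and shipping questions only.",
)
```

The dataclass is frozen to reflect the idea that the bundle is a fixed template: nothing in it changes while it is being used to score conversations.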

Why bundle metrics

Without bundling, you would have to remember which metrics make up “the QA suite” and re-select them every time. An evaluation freezes that decision:
  • Run a whole suite at once. Select an evaluation, pick a batch of conversations, and every metric in the bundle is scored against every conversation in parallel.
  • Keep results comparable. Because the metrics and instructions are fixed, two conversations scored by the same evaluation are measured the same way.
  • Reuse across time. Run the same evaluation next week against new conversations and the numbers mean the same thing.
You can keep several evaluations side by side for different jobs. For example, a broad Full QA suite that includes every metric, and a focused Accuracy check that includes only the information-accuracy metric for a quick spot-check. Each is its own bundle with its own instructions.
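The "every metric against every conversation" behavior can be pictured as a cross product evaluated concurrently. The sketch below is illustrative only: `judge_stub` is a stand-in for the AI judge, and `run_batch` is a hypothetical helper, not a documented function.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def judge_stub(metric: str, conversation: str, instructions: str) -> bool:
    # Placeholder for the AI judge: passes when the metric name appears in the text.
    return metric.lower() in conversation.lower()


def run_batch(metrics, instructions, conversations):
    """Score every metric against every conversation with the same instructions."""
    pairs = list(product(conversations, metrics))
    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map preserves input order, so verdicts line up with pairs.
        verdicts = list(pool.map(lambda p: judge_stub(p[1], p[0], instructions), pairs))
    return dict(zip(pairs, verdicts))


results = run_batch(
    metrics=["tone", "compliance"],
    instructions="Same context for every conversation.",
    conversations=["Tone was friendly.", "Compliance and tone both fine.", "Off-topic chat."],
)
print(len(results))  # 3 conversations x 2 metrics = 6 scored pairs
```

Because the metric list and the instructions are passed once and shared by every pair, no conversation can be scored against a different suite than its neighbors.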

Creating an evaluation

1. Open the Evaluations tab
   Go to Analytics → Evaluations and click New.

2. Name the evaluation
   Give it a clear name that describes the job, like Full QA suite or Compliance check. You will pick this name later when you trigger a run.

3. Select the metrics it includes
   Choose which metrics belong in this bundle. Adding a metric pulls in all of its criteria automatically; you select at the metric level, not the criterion level. Include only what this evaluation is meant to measure. A focused evaluation with one metric is perfectly valid.

4. Write the instructions
   Fill in the instructions field with the context the judge needs. This is covered in detail below.

5. Save
   Save the evaluation. It is now reusable and shows up in the Trigger Run dialog.
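Selecting at the metric level means the set of criteria is derived, not hand-picked. A rough illustration, assuming a hypothetical `METRIC_LIBRARY` mapping (the metric names and questions here are examples, not a real export):

```python
# Hypothetical metric library: metric name -> its yes/no criterion questions.
METRIC_LIBRARY = {
    "Information accuracy": [
        "Did the agent give correct information?",
        "Did the agent stay in scope?",
    ],
    "Tone": ["Was the agent polite throughout?"],
}


def criteria_for(selected_metrics):
    """Return every criterion of every selected metric, in selection order."""
    questions = []
    for metric in selected_metrics:
        # The whole metric is pulled in; there is no way to pick a subset.
        questions.extend(METRIC_LIBRARY[metric])
    return questions


accuracy_check = criteria_for(["Information accuracy"])
full_suite = criteria_for(["Information accuracy", "Tone"])
```

Here `accuracy_check` carries both accuracy criteria even though only one metric was selected, and `full_suite` adds the tone criterion on top.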

The instructions field

The instructions field is free text that the AI judge reads as scoring context before it answers any criterion. It is one of the most important things you configure, because it tells the judge what “correct” looks like for your agent.
The judge cannot know your business. It does not know which products you sell, which policies apply, or what counts as an accurate answer unless you tell it. The classic use of the instructions field is to paste the agent’s knowledge base or scoring guidance so the judge can decide things like “Did the agent give correct information?” or “Did the agent stay in scope?” against real facts instead of guessing.
If your agent answers from a knowledge base, paste that same source material into the instructions. The judge then grades “Did the agent give wrong info?” against the truth your agent is supposed to know. See Attaching a knowledge base.

Division of labor: question vs. instructions

The judge anchors on the criterion question — that exact yes/no text is the thing it is answering — and treats the instructions as supporting context. That difference tells you where to put each kind of information:
| Put this… | …here | Why |
| --- | --- | --- |
| Hard rules and the exact thing being judged | The criterion question | The judge answers the question directly, so a rule only binds if it is in the question. |
| Supporting facts, knowledge base, scoring guidance | The evaluation instructions | The judge reads these as context to decide the question correctly. |
So a rule like “the agent must never quote a price over chat” belongs inside the criterion question (phrased channel-agnostically), not in the instructions — because the judge enforces the question, not the context. The instructions are where the supporting knowledge lives: the price list, the policy document, the definition of “in-scope.”
Criterion question       →  the rule the judge enforces       (hard, exact, binding)
Evaluation instructions  →  the facts the judge reasons with  (context, knowledge)
Do not bury a pass/fail rule in the instructions and leave the question vague. The judge may treat it as background and not enforce it. Anything that must be true for a conversation to be compliant belongs in the criterion question.
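One way to picture this division of labor is as a prompt with separate slots. The real judge's prompt format is not documented here, so the layout, the section labels, and the `build_judge_prompt` helper below are assumptions made purely for illustration.

```python
def build_judge_prompt(question: str, instructions: str, transcript: str) -> str:
    """Assemble a prompt: context first, the binding yes/no question last."""
    return "\n\n".join([
        "CONTEXT (supporting facts, not rules):\n" + instructions.strip(),
        "CONVERSATION:\n" + transcript.strip(),
        "QUESTION (answer yes or no):\n" + question.strip(),
    ])


prompt = build_judge_prompt(
    # The hard rule lives in the question, so it is the thing being answered.
    question="Did the agent avoid quoting a price?",
    # The price list is supporting knowledge, so it lives in the instructions.
    instructions="Price list: Basic plan 10 USD, Pro plan 25 USD.",
    transcript="Customer: How much is Pro?\nAgent: I can email you our pricing.",
)
print(prompt)
```

In this sketch the rule ("avoid quoting a price") would still be enforced if the instructions were empty, whereas a rule placed only in the context slot has no question asking about it.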

Editing an evaluation

Open any evaluation from Analytics → Evaluations to change it. You can rename it, add or remove metrics, and edit the instructions at any time in the Evaluation builder. Editing the instructions is common: as you learn what the judge gets wrong, you sharpen the guidance and re-run.
Editing an evaluation changes how future runs are scored. It does not retroactively rescore runs you have already triggered.
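The "future runs only" behavior is what you would get if each run kept the evaluation as it was at trigger time. Whether the platform stores such a snapshot is an assumption; this guide only states that past runs are not rescored. The `Evaluation`, `Run`, and `trigger_run` names are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Evaluation:
    name: str
    instructions: str


@dataclass(frozen=True)
class Run:
    conversation_id: str
    evaluation: Evaluation  # the evaluation as it was when the run was triggered


def trigger_run(evaluation: Evaluation, conversation_id: str) -> Run:
    return Run(conversation_id, evaluation)


v1 = Evaluation("Compliance check", "Refund window is 30 days.")
old_run = trigger_run(v1, "conv-1")

# Editing produces a new version; the earlier run still points at the old one.
v2 = replace(v1, instructions="Refund window is 30 days. Gift cards are non-refundable.")
new_run = trigger_run(v2, "conv-2")
```

To have an older conversation scored under the sharpened instructions, you would trigger a new run for it rather than expect the old result to change.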

Archiving an evaluation

When an evaluation is no longer in use, archive it to keep your Evaluations list clean. Archiving removes it from everyday view without deleting the runs it produced, so your historical results stay intact.

Permissions

Viewing evaluations requires the assessments:read scope. Creating, editing, and triggering them requires assessments:manage.
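The scope rule can be summarized as a lookup from action to required scope. The scope strings come from this page; the action names and the `is_allowed` helper are hypothetical, and the sketch deliberately does not assume that assessments:manage implies assessments:read.

```python
# Action names are illustrative; only the scope strings come from this page.
REQUIRED_SCOPE = {
    "view": "assessments:read",
    "create": "assessments:manage",
    "edit": "assessments:manage",
    "trigger": "assessments:manage",
}


def is_allowed(action: str, granted_scopes: set[str]) -> bool:
    """True when the scope required for the action is among the granted scopes."""
    return REQUIRED_SCOPE[action] in granted_scopes


read_only = {"assessments:read"}
print(is_allowed("view", read_only))     # True
print(is_allowed("trigger", read_only))  # False
```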

Next steps

Run an evaluation

Trigger a run against one conversation, a whole batch, a transcript, or an audio URL.

Create metrics & criteria

Define the metrics and yes/no criteria that an evaluation bundles together.