- Metrics — the set of metrics (and therefore all their criteria) you want scored.
- Judge configuration — how the AI judge runs against this bundle.
- Instructions — a free-text field the judge reads as context, so it knows what “correct” and “in-scope” mean for your agent.
An evaluation does not score anything on its own. It is a saved template. Scoring happens when you trigger a run — one assessment of one conversation. See Running an evaluation.
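The split between a saved template and a run can be pictured with a small data model. This is only an illustrative sketch: the class names, field names, `trigger_run`, and the sample values are all hypothetical and do not reflect the product's actual API or storage.

```python
from dataclasses import dataclass, field

# All names here are hypothetical; they mirror the concepts in the text, not a real API.


@dataclass(frozen=True)
class Evaluation:
    """A saved template: it describes what to score but scores nothing itself."""
    name: str
    metrics: tuple[str, ...]   # metric names; each one brings all of its criteria
    instructions: str = ""     # free-text context the judge reads
    judge_config: dict = field(default_factory=dict)  # how the judge runs


@dataclass(frozen=True)
class Run:
    """One assessment of one conversation, created by triggering an evaluation."""
    evaluation: Evaluation
    conversation_id: str


def trigger_run(evaluation: Evaluation, conversation_id: str) -> Run:
    # Scoring would start here; the evaluation alone never produces scores.
    return Run(evaluation=evaluation, conversation_id=conversation_id)


# Made-up example values for illustration only.
qa_suite = Evaluation(
    name="Full QA suite",
    metrics=("Accuracy", "Compliance"),
    instructions="Refunds are allowed within 30 days of purchase.",
)
run = trigger_run(qa_suite, "conv-001")
```

Nothing is scored until `trigger_run` is called, which matches the idea that the evaluation is a reusable definition and the run is the unit of work.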
Why bundle metrics
Without bundling, you would have to remember which metrics make up “the QA suite” and re-select them every time. An evaluation freezes that decision:
- Run a whole suite at once. Select an evaluation, pick a batch of conversations, and every metric in the bundle is scored against every conversation in parallel.
- Keep results comparable. Because the metrics and instructions are fixed, two conversations scored by the same evaluation are measured the same way.
- Reuse across time. Run the same evaluation next week against new conversations and the numbers mean the same thing.
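The "every metric against every conversation" behavior described above amounts to a cross product of the two lists. The sketch below shows that fan-out with a thread pool; the `score` function, metric names, and conversation IDs are hypothetical placeholders, and how the product actually schedules this work is not specified here.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


# Hypothetical stand-in for the judge; a real judge would read the conversation.
def score(metric: str, conversation_id: str) -> tuple[str, str, str]:
    return (conversation_id, metric, "scored")


metrics = ["Accuracy", "Compliance", "Tone"]
conversations = ["conv-001", "conv-002"]

# Pair every metric with every conversation: 3 metrics x 2 conversations = 6 jobs.
jobs = list(product(metrics, conversations))

# Run the jobs concurrently; pool.map returns results in the same order as jobs.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda job: score(*job), jobs))

print(len(results))  # 6
```

Because the metric list is fixed by the evaluation, the same set of jobs is generated for any batch, which is what keeps results comparable.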
Creating an evaluation
Name the evaluation
Give it a clear name that describes the job, like Full QA suite or Compliance check. You will pick this name later when you trigger a run.
Select the metrics it includes
Choose which metrics belong in this bundle. Adding a metric pulls in all of its criteria automatically — you select at the metric level, not the criterion level. Include only what this evaluation is meant to measure; a focused evaluation with one metric is perfectly valid.
Write the instructions
Fill in the instructions field with the context the judge needs. This is covered in detail below.
The instructions field
The instructions field is free text that the AI judge reads as scoring context before it answers any criterion. It is one of the most important things you configure, because it tells the judge what “correct” looks like for your agent.
The judge cannot know your business. It does not know which products you sell, which policies apply, or what counts as an accurate answer unless you tell it. The classic use of the instructions field is to paste the agent’s knowledge base or scoring guidance so the judge can decide things like “Did the agent give correct information?” or “Did the agent stay in scope?” against real facts instead of guessing.
Division of labor: question vs. instructions
The judge anchors on the criterion question (that exact yes/no text is the thing it is answering) and treats the instructions as supporting context. That difference tells you where to put each kind of information:

| Put this… | …here | Why |
|---|---|---|
| Hard rules and the exact thing being judged | The criterion question | The judge answers the question directly, so a rule only binds if it is in the question. |
| Supporting facts, knowledge base, scoring guidance | The evaluation instructions | The judge reads these as context to decide the question correctly. |
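One way to picture this division of labor is a prompt in which the instructions appear as background and the criterion question is the single thing to answer. The sketch below is purely illustrative: `build_judge_prompt`, the prompt layout, and the sample text are assumptions, not the judge's real prompt format.

```python
def build_judge_prompt(question: str, instructions: str, transcript: str) -> str:
    """Assemble a prompt where the question is the task and instructions are context."""
    return "\n\n".join([
        "Context (evaluation instructions):\n" + instructions.strip(),
        "Conversation:\n" + transcript.strip(),
        # The criterion question comes last and is the only thing to answer.
        "Answer yes or no to this question:\n" + question.strip(),
    ])


# Made-up example: the hard rule (30 days) lives in the question itself,
# while the knowledge-base fact that supports it lives in the instructions.
prompt = build_judge_prompt(
    question="Did the agent state the refund window as 30 days?",
    instructions="Knowledge base: refunds are accepted within 30 days of purchase.",
    transcript="Customer: Can I return this?\nAgent: Yes, within 30 days.",
)
```

If the 30-day rule appeared only in the instructions, the question would no longer name what is being judged, which is the situation the table warns against.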
Editing an evaluation
Open any evaluation from Analytics → Evaluations to change it. You can rename it, add or remove metrics, and edit the instructions at any time in the Evaluation builder. Editing the instructions is common: as you learn what the judge gets wrong, you sharpen the guidance and re-run.
Editing an evaluation changes how future runs are scored. It does not retroactively rescore runs you have already triggered.
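The "no retroactive rescoring" behavior can be modeled as each run keeping a snapshot of the evaluation it was triggered with. This is a sketch of one way that behavior could arise, not a description of the product's internals; the class names, the `scored_with` field, and the sample instructions are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Evaluation:
    name: str
    instructions: str


@dataclass(frozen=True)
class Run:
    conversation_id: str
    scored_with: Evaluation  # snapshot of the evaluation at trigger time


v1 = Evaluation(name="Compliance check", instructions="Flag any missing disclosure.")
old_run = Run(conversation_id="conv-001", scored_with=v1)

# Editing produces a new version; the earlier run still points at v1.
v2 = replace(v1, instructions="Flag any missing disclosure. Greetings are exempt.")
new_run = Run(conversation_id="conv-002", scored_with=v2)
```

Under this model, comparing `old_run` and `new_run` means comparing results produced under different guidance, so re-running older conversations after an edit is the way to get like-for-like numbers.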
Archiving an evaluation
When an evaluation is no longer in use, archive it to keep your Evaluations list clean. Archiving removes it from everyday view without deleting the runs it produced, so your historical results stay intact.
Permissions
Viewing evaluations requires the assessments:read scope. Creating, editing, and triggering them require assessments:manage.
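The scope rules above can be expressed as a small lookup. The scope strings come from the text; the action names and the `is_allowed` helper are hypothetical, and the sketch assumes the two scopes are independent, since the text does not say whether assessments:manage also grants read access.

```python
READ_SCOPE = "assessments:read"
MANAGE_SCOPE = "assessments:manage"

# Hypothetical action names mapped to the scope the text says each one needs.
REQUIRED_SCOPE = {
    "view": READ_SCOPE,
    "create": MANAGE_SCOPE,
    "edit": MANAGE_SCOPE,
    "trigger": MANAGE_SCOPE,
}


def is_allowed(action: str, granted_scopes: set[str]) -> bool:
    """Return True only if the scope required for the action was granted."""
    # Assumption: scopes are checked independently; manage does not imply read.
    return REQUIRED_SCOPE[action] in granted_scopes


print(is_allowed("view", {READ_SCOPE}))   # True
print(is_allowed("edit", {READ_SCOPE}))   # False
```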
Next steps
Run an evaluation
Trigger a run against a single conversation, a whole batch, a transcript, or an audio URL.
Create metrics & criteria
Define the metrics and yes/no criteria that an evaluation bundles together.

