assessments:read scope; creating or editing requires assessments:manage.
Open the Metric Builder
Go to Analytics → Metrics
Open the Analytics section in the dashboard and select the Metrics tab. You’ll see every metric your organization has defined, each with a scored-by badge (more on that in Graduation).
Click New metric
Press New metric to open the Metric Builder. This is where you name the metric, choose how it’s measured, and add criteria.
Name and describe it
Give the metric a short, plain-language name and a one-line description. The name is what you’ll see on every analytics card and run, so make it concrete (“Booked an appointment”) rather than abstract (“Quality”).
Choose the metric kind and modality
Set the metric kind (the category of thing you’re measuring) and the modality — whether this metric is graded from text (the transcript) or audio (the recording). See the warning below before choosing audio.
Add your criteria
Add one or more criteria to the metric. Each criterion is a yes/no check. The next section covers how to write them well.
Save
Save the metric. It’s now available to add to an evaluation.
Writing a good criterion
A criterion has a few parts, but two of them carry all the weight: the question and the expected value. Get these right and your scores will be reliable; get them wrong and the judge will answer a question you didn’t mean to ask.

| Part | What it is |
|---|---|
| Question | The exact yes/no question answered about the conversation. This is the text the AI judge reads. |
| Expected value | The answer (true or false) that means “good” or “compliant”. |
| Applies-when / na rule | Optional conditions under which the criterion doesn’t apply, so it can opt out of irrelevant conversations. |
| Type | Whether the criterion is deterministic (a hard rule) or judged (decided by the LLM). |
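As a rough mental model, the four parts in the table can be pictured as one small record. The sketch below is illustrative only: the class, enum, and field names are hypothetical and are not the product's actual schema or API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CriterionType(Enum):
    DETERMINISTIC = "deterministic"  # a hard rule
    JUDGED = "judged"                # decided by the LLM judge


@dataclass(frozen=True)
class Criterion:
    # Hypothetical field names, chosen only to mirror the table above.
    question: str                    # the exact yes/no question the judge reads
    expected_value: bool             # the answer that means "good" or "compliant"
    criterion_type: CriterionType    # deterministic or judged
    na_rule: Optional[str] = None    # optional conditions for opting out


booked = Criterion(
    question="Did the agent book an appointment for the caller?",
    expected_value=True,
    criterion_type=CriterionType.JUDGED,
    na_rule="Only applies when the caller asked for an appointment.",
)
```

The question and the expected value are the two required fields that drive scoring; the na rule is the only optional part.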
The question is what the judge actually reads
The AI judge anchors on the criterion question. It reads the question (plus any na rule), consults the evaluation’s free-text instructions as background context, and returns an outcome with a verbatim supporting quote, a confidence, and reasoning. Because the question is the anchor, everything that decides the verdict must live in the question itself.
Phrase it as a clear yes/no question. A criterion is a single check that resolves to true or false, so write it that way. For example, ask “Did the agent book an appointment?” rather than “How was the booking handled?”
Expected value, and the “false-is-good” case
The expected value is the answer that means good. The dashboard calls a verdict compliant when its outcome matches the expected value, so “good” isn’t always “true”. When your question describes something you want to happen, set expected value to true. When it describes something you don’t want to happen, set expected value to false, so that a false outcome counts as compliant.
Only answered verdicts count toward a metric’s compliant rate. Answered means the outcome was true or false; abstain (the judge couldn’t decide) and na (the criterion didn’t apply) are excluded from the denominator. So a criterion that correctly opts out of irrelevant conversations doesn’t drag your rate down; it simply isn’t counted there.
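The arithmetic behind the compliant rate can be sketched in a few lines. This is a minimal illustration of the rule described above, not the dashboard's implementation; the function names and the use of plain strings for outcomes are assumptions made for the example.

```python
from typing import List, Optional

ANSWERED = ("true", "false")  # assumed string encoding of answered outcomes


def is_compliant(outcome: str, expected_value: bool) -> Optional[bool]:
    """Return True/False for answered verdicts, None for abstain or na."""
    if outcome not in ANSWERED:
        return None
    return (outcome == "true") == expected_value


def compliant_rate(outcomes: List[str], expected_value: bool) -> Optional[float]:
    """Compliant verdicts divided by answered verdicts only."""
    answered = [o for o in outcomes if o in ANSWERED]
    if not answered:
        return None  # nothing answered, so there is no rate to report
    compliant = sum(1 for o in answered if is_compliant(o, expected_value))
    return compliant / len(answered)


outcomes = ["true", "true", "false", "na", "abstain"]

# Expected value true: 2 compliant out of 3 answered; na and abstain are ignored.
rate = compliant_rate(outcomes, expected_value=True)

# Expected value false (the "false-is-good" case): 1 compliant out of 3 answered.
flipped_rate = compliant_rate(outcomes, expected_value=False)
```

Note that the denominator is 3 in both cases, not 5: the na and abstain outcomes never enter the calculation.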
Applies-when / na rules
Some checks only make sense for some conversations. A criterion about appointment booking is meaningless on a call where the caller never wanted an appointment. Use the optional applies-when / na rule to describe the conditions under which the criterion applies; when they aren’t met, the outcome resolves to na and the criterion sits out that conversation instead of being scored as a pass or a fail.
Deterministic vs judged criteria
A criterion can be one of two types:

| Type | How it’s decided | When to use it |
|---|---|---|
| Deterministic | A hard, reproducible rule — the same conversation always yields the same outcome. | Mechanical checks that don’t need interpretation. |
| Judged | The AI judge (an LLM) reads the question and decides. | Anything that needs reading comprehension or judgment. |
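To make the distinction concrete, here is what a mechanical check might look like if written by hand. This page does not describe how deterministic rules are actually configured, so the transcript shape, the function name, and the phrase being matched are all hypothetical; the point is only that the same input always yields the same outcome.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (speaker, text): a hypothetical transcript shape


def mentions_recording_notice(transcript: List[Turn]) -> bool:
    """Deterministic check: did any agent turn contain the exact phrase?"""
    phrase = "this call may be recorded"  # hypothetical required wording
    return any(
        speaker == "agent" and phrase in text.lower()
        for speaker, text in transcript
    )


transcript = [
    ("agent", "Hi! This call may be recorded for quality purposes."),
    ("caller", "Sure, I'd like to book a cleaning."),
]

# Running the check twice on the same transcript gives the same answer.
first = mentions_recording_notice(transcript)
second = mentions_recording_notice(transcript)
```

A question such as “Did the agent sound reassuring?” cannot be reduced to a rule like this, which is when a judged criterion is the better fit.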
Sanity-check before you save
If the builder offers a lint or preview, use it. Lint flags wording problems (a question that isn’t really yes/no, a missing expected value). Preview lets you see how the judge would read and answer the criterion against a sample conversation, so you can catch a mis-phrased question before it scores hundreds of runs. A quick preview now saves you re-running an evaluation later.
Next steps
Build an evaluation
Bundle your metrics with judge instructions into a reusable evaluation you can run against conversations.
Run an evaluation
Trigger runs over one conversation, a batch, a pasted transcript, or an audio URL.

