A metric is one thing you want to measure about a conversation, for example “Task Resolution” or “Information Accuracy”. Inside each metric you write one or more criteria (also called atoms): single yes/no checks that, taken together, tell you whether the metric was met.

This page walks you through the Metric Builder and then, most importantly, teaches you how to write a criterion the AI judge can grade consistently. Getting the criterion wording right is the single highest-leverage thing you can do for trustworthy scores, so we spend the bulk of this page there.

You author metrics under Analytics → Metrics. Viewing requires the assessments:read scope; creating or editing requires assessments:manage.
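As a rough illustration of the scope rule, the sketch below maps actions to the scope each one needs. The two scope names come from this page; the action names and the `can_perform` helper are hypothetical, and the sketch does not assume that assessments:manage also grants viewing.

```python
# Scope names are taken from this page; action names and this helper are hypothetical.
REQUIRED_SCOPE = {
    "view": "assessments:read",
    "create": "assessments:manage",
    "edit": "assessments:manage",
}


def can_perform(action: str, granted_scopes: set[str]) -> bool:
    """Return True if the granted scopes include the scope this action needs."""
    return REQUIRED_SCOPE[action] in granted_scopes


print(can_perform("view", {"assessments:read"}))
print(can_perform("edit", {"assessments:read"}))
```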

Open the Metric Builder

1. Go to Analytics → Metrics

Open the Analytics section in the dashboard and select the Metrics tab. You’ll see every metric your organization has defined, each with a scored-by badge (more on that in Graduation).

2. Click New metric

Press New metric to open the Metric Builder. This is where you name the metric, choose how it’s measured, and add criteria.

3. Name and describe it

Give the metric a short, plain-language name and a one-line description. The name is what you’ll see on every analytics card and run, so make it concrete (“Booked an appointment”) rather than abstract (“Quality”).

4. Choose the metric kind and modality

Set the metric kind (the category of thing you’re measuring) and the modality: whether this metric is graded from text (the transcript) or audio (the recording). See the warning below before choosing audio.

5. Add your criteria

Add one or more criteria to the metric. Each criterion is a yes/no check. The next section covers how to write them well.

6. Save

Save the metric. It’s now available to add to an evaluation.
Warning: Audio metrics need a recording. If you set the modality to audio and run the metric against a conversation that has no recording, every criterion in that metric auto-resolves to na (“not applicable”) for that conversation, because the judge can’t listen to something that isn’t there. Use text modality unless you specifically need to grade audio characteristics (tone, talk-over, dead air) that the transcript can’t capture.
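The audio rule can be pictured as a pre-check that runs before any grading. This is a minimal sketch of that behavior, not the product's implementation: the function name, its parameters, and the dictionary shape are all assumptions made for illustration.

```python
def preresolve_outcomes(modality: str, has_recording: bool,
                        criterion_ids: list[str]) -> dict[str, str]:
    """Return outcomes that are settled before the judge runs (hypothetical helper).

    An audio metric with no recording resolves every criterion to "na".
    An empty dict means nothing was pre-resolved and grading should proceed.
    """
    if modality == "audio" and not has_recording:
        return {criterion_id: "na" for criterion_id in criterion_ids}
    return {}


print(preresolve_outcomes("audio", False, ["tone", "dead_air"]))
print(preresolve_outcomes("text", False, ["tone", "dead_air"]))
```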

Writing a good criterion

A criterion has a few parts, but two of them carry all the weight: the question and the expected value. Get these right and your scores will be reliable; get them wrong and the judge will answer a question you didn’t mean to ask.
| Part | What it is |
| --- | --- |
| Question | The exact yes/no question answered about the conversation. This is the text the AI judge reads. |
| Expected value | The answer (true or false) that means “good” or “compliant”. |
| Applies-when / na rule | Optional conditions under which the criterion doesn’t apply, so it can opt out of irrelevant conversations. |
| Type | Whether the criterion is deterministic (a hard rule) or judged (decided by the LLM). |
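To make the four parts concrete, here is one way a criterion could be modeled as a record. The `Criterion` class and its field names are hypothetical and only mirror the parts listed above; the platform's real data model may differ.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass(frozen=True)
class Criterion:
    """Hypothetical record mirroring the parts of a criterion described on this page."""
    question: str                                      # the text the AI judge reads
    expected_value: bool                               # the answer that means "good"
    criterion_type: Literal["deterministic", "judged"]
    na_rule: Optional[str] = None                      # optional applies-when / na rule


booking_time = Criterion(
    question="Did the agent collect the caller's preferred appointment time?",
    expected_value=True,
    criterion_type="judged",
    na_rule="Skip this criterion if the caller did not ask to book an appointment.",
)
print(booking_time.question)
```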

The question is what the judge actually reads

The AI judge anchors on the criterion question. It reads the question (plus any na rule), consults the evaluation’s free-text instructions as background context, and returns an outcome with a verbatim supporting quote, a confidence, and reasoning. Because the question is the anchor, everything that decides the verdict must live in the question itself.

Phrase it as a clear yes/no question. A criterion is a single check that resolves to true or false, so write it that way:
Good:  "Did the agent confirm the caller's email address before sending anything?"
Vague: "Email handling"
Keep criteria atomic. One criterion should test exactly one thing. If your question contains “and” or “or”, split it into two criteria. “Did the agent greet the caller and verify their identity?” is impossible to answer cleanly when the agent did one but not the other — the judge is forced to guess which half you cared about.
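The "and"/"or" rule of thumb is easy to automate as a first-pass check. The helper below is a hypothetical sketch, not a feature of the Metric Builder; it only flags a question for human review, since a conjunction does not always mean two separate checks.

```python
import re

# Match "and" / "or" only as whole words, so "before" or "handled" do not trigger.
_CONJUNCTION = re.compile(r"\b(and|or)\b", re.IGNORECASE)


def looks_compound(question: str) -> bool:
    """Return True if the question contains a standalone 'and' or 'or'."""
    return _CONJUNCTION.search(question) is not None


print(looks_compound("Did the agent greet the caller and verify their identity?"))
print(looks_compound("Did the agent confirm the caller's email address before sending anything?"))
```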
Bake every rule into the question, and keep it channel-agnostic. The evaluation’s instructions are context, not rules — the judge treats them as supporting facts, not as the thing to enforce. So any hard rule has to be phrased into the question. And because the same metric can grade phone, web chat, email, and text conversations, write the rule so it doesn’t assume a channel:
Good (rule in the question, channel-agnostic):
  "Did the agent provide a callback or follow-up contact method before ending
   the conversation?"

Risky (rule lives only in the instructions):
  "Was the handoff handled correctly?"   ← "correctly" is defined elsewhere;
                                            the judge has to infer it

Risky (assumes a channel):
  "Did the agent read the disclosure aloud?"  ← meaningless on web chat/email
A quick test: read your question to a colleague who hasn’t seen the instructions. If they can answer yes or no from the question alone, the judge can too. If they ask “what counts as correct here?”, the rule isn’t in the question yet.

Expected value, and the “false-is-good” case

The expected value is the answer that means good. The dashboard calls a verdict compliant when its outcome matches the expected value, so “good” isn’t always “true”.

When your question describes something you want to happen, set expected value to true:
Question:       "Did the agent answer the caller's question accurately?"
Expected value: true        → a "true" verdict is compliant (good)
When your question describes a violation — something you want to catch — set expected value to false. A “false” verdict then means the bad thing did not happen, which is the good outcome:
Question:       "Did the agent give the caller incorrect information?"
Expected value: false       → a "false" verdict is compliant (good);
                              a "true" verdict is a violation (bad)
Both styles are valid. Phrasing checks as violations (“Did the agent …?” with expected value false) pairs naturally with human grade-by-exception review, where reviewers only flag the things that went wrong.
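The compliance rule reduces to a single comparison: a verdict is compliant when the outcome equals the expected value. A minimal sketch, assuming an answered verdict is represented as a boolean (the function name is hypothetical):

```python
def is_compliant(outcome: bool, expected_value: bool) -> bool:
    """A verdict is compliant when its outcome matches the expected value."""
    return outcome == expected_value


# "Did the agent give the caller incorrect information?" with expected value false:
print(is_compliant(outcome=False, expected_value=False))  # the violation did not happen
print(is_compliant(outcome=True, expected_value=False))   # the violation happened
```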
Only answered verdicts count toward a metric’s compliant rate. Answered means the outcome was true or false; abstain (the judge couldn’t decide) and na (the criterion didn’t apply) are excluded from the denominator. So a criterion that correctly opts out of irrelevant conversations doesn’t drag your rate down — it simply isn’t counted there.
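The arithmetic behind the compliant rate can be sketched as follows. The outcome strings match the four outcomes named on this page; the function name and the choice to return `None` when nothing was answered are assumptions made for the example.

```python
from typing import Optional

ANSWERED = {"true", "false"}  # "abstain" and "na" are excluded from the denominator


def compliant_rate(outcomes: list[str], expected_value: bool) -> Optional[float]:
    """Share of answered verdicts whose outcome matches the expected value."""
    expected = "true" if expected_value else "false"
    answered = [outcome for outcome in outcomes if outcome in ANSWERED]
    if not answered:
        return None  # assumption: no answered verdicts means no rate to report
    return sum(outcome == expected for outcome in answered) / len(answered)


# Five conversations, three answered, two of them compliant: the rate is 2/3, not 2/5.
print(compliant_rate(["true", "true", "false", "na", "abstain"], expected_value=True))
```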

Applies-when / na rules

Some checks only make sense for some conversations. A criterion about appointment booking is meaningless on a call where the caller never wanted an appointment. Use the optional applies-when / na rule to describe the conditions under which the criterion applies; when they aren’t met, the outcome resolves to na and the criterion sits out that conversation instead of being scored as a pass or a fail.
Question:    "Did the agent collect the caller's preferred appointment time?"
na rule:     "Skip this criterion if the caller did not ask to book an
              appointment."
Write the na rule so the judge can decide it from the conversation itself, the same way you write the question — concrete and observable, not “if relevant”.

Deterministic vs judged criteria

A criterion can be one of two types:
| Type | How it’s decided | When to use it |
| --- | --- | --- |
| Deterministic | A hard, reproducible rule: the same conversation always yields the same outcome. | Mechanical checks that don’t need interpretation. |
| Judged | The AI judge (an LLM) reads the question and decides. | Anything that needs reading comprehension or judgment. |
This distinction matters later: a metric whose criteria are all deterministic is reproducible by construction, so it auto-certifies on creation and needs no calibration. Any judged criterion means the metric has to earn trust before it scores automatically — see Graduation.
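The auto-certification rule amounts to an "all criteria are deterministic" test. The sketch below is illustrative only: the function name is hypothetical, and treating a metric with no criteria as not auto-certifying is an assumption this page does not confirm.

```python
def auto_certifies(criterion_types: list[str]) -> bool:
    """True when every criterion is deterministic; any judged criterion blocks it."""
    if not criterion_types:
        return False  # assumption: an empty metric is not treated as certified
    return all(kind == "deterministic" for kind in criterion_types)


print(auto_certifies(["deterministic", "deterministic"]))
print(auto_certifies(["deterministic", "judged"]))
```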

Sanity-check before you save

If the builder offers a lint or preview, use it. Lint flags wording problems (a question that isn’t really yes/no, a missing expected value). Preview lets you see how the judge would read and answer the criterion against a sample conversation, so you can catch a mis-phrased question before it scores hundreds of runs. A quick preview now saves you re-running an evaluation later.
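For a sense of what such a lint might check, here is a simplified stand-in covering the two problems mentioned above. It is not the builder's actual lint; the function name, the list of question openers, and the message wording are all assumptions.

```python
from typing import Optional

# Words that commonly open a yes/no question (illustrative, not exhaustive).
YES_NO_OPENERS = ("did", "was", "were", "is", "are", "does", "do", "has", "have")


def lint_criterion(question: str, expected_value: Optional[bool]) -> list[str]:
    """Return a list of wording problems; an empty list means none were found."""
    problems = []
    text = question.strip()
    first_word = text.split()[0].lower() if text else ""
    if not text.endswith("?") or first_word not in YES_NO_OPENERS:
        problems.append("question does not read as a yes/no question")
    if expected_value is None:
        problems.append("expected value is missing")
    return problems


print(lint_criterion("Email handling", None))
print(lint_criterion("Did the agent confirm the caller's email address before sending anything?", True))
```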

Next steps

Build an evaluation

Bundle your metrics with judge instructions into a reusable evaluation you can run against conversations.

Run an evaluation

Trigger runs over one conversation, a batch, a pasted transcript, or an audio URL.