
# Creating metrics & criteria

> Use the Metric Builder to define what you measure — and write criteria the AI judge can grade reliably.

A **metric** is one thing you want to measure about a conversation — for example "Task Resolution" or "Information Accuracy". Inside each metric you write one or more **criteria** (also called **atoms**): single yes/no checks that, taken together, tell you whether the metric was met.

This page walks you through the **Metric Builder** and then — most importantly — teaches you how to write a criterion the [AI judge](/evaluation/overview) can grade consistently. Getting the criterion wording right is the single highest-leverage thing you can do for trustworthy scores, so we spend the bulk of this page there.

You author metrics under **Analytics → Metrics**. Viewing requires the `assessments:read` scope; creating or editing requires `assessments:manage`.

## Open the Metric Builder

<Steps>
  <Step title="Go to Analytics → Metrics">
    Open the **Analytics** section in the dashboard and select the **Metrics** tab. You'll see every metric your organization has defined, each with a **scored-by** badge (more on that in [Graduation](/evaluation/graduation)).
  </Step>

  <Step title="Click New metric">
    Press **New metric** to open the Metric Builder. This is where you name the metric, choose how it's measured, and add criteria.
  </Step>

  <Step title="Name and describe it">
    Give the metric a short, plain-language **name** and a one-line **description**. The name is what you'll see on every analytics card and run, so make it concrete ("Booked an appointment") rather than abstract ("Quality").
  </Step>

  <Step title="Choose the metric kind and modality">
    Set the **metric kind** (the category of thing you're measuring) and the **modality** — whether this metric is graded from **text** (the transcript) or **audio** (the recording). See the warning below before choosing audio.
  </Step>

  <Step title="Add your criteria">
    Add one or more **criteria** to the metric. Each criterion is a yes/no check. The next section covers how to write them well.
  </Step>

  <Step title="Save">
    Save the metric. It's now available to add to an [evaluation](/evaluation/evaluations).
  </Step>
</Steps>

<Warning>
  **Audio metrics need a recording.** If you set the modality to **audio** and you run the metric against a conversation that has no recording, every criterion in that metric auto-resolves to **na** ("not applicable") for that conversation — the judge can't listen to something that isn't there. Use **text** modality unless you specifically need to grade audio characteristics (tone, talk-over, dead air) that the transcript can't capture.
</Warning>
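
As a rough illustration of the rule in the warning above, the sketch below shows the short-circuit it describes: an audio metric with no recording resolves to **na** before any grading happens. The function name, the modality strings, and the `recording_url` parameter are hypothetical and are not part of the product's API.

```python
from typing import Optional


def resolve_modality_outcome(modality: str, recording_url: Optional[str]) -> Optional[str]:
    """Return "na" when an audio metric has no recording to grade.

    Returns None in every other case, meaning the criterion still has
    to be graded normally. All names here are illustrative assumptions.
    """
    if modality == "audio" and not recording_url:
        return "na"
    return None


# An audio metric run against a conversation without a recording.
no_recording = resolve_modality_outcome("audio", None)
# A text metric does not need a recording at all.
text_metric = resolve_modality_outcome("text", None)
```

The point of the sketch is that the outcome depends only on the modality and on whether a recording exists, not on the wording of the criterion.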

## Writing a good criterion

A criterion has a few parts, but two of them carry most of the weight: the **question** and the **expected value**. Get these right and your scores will be reliable; get them wrong and the judge will answer a question you didn't mean to ask.

| Part                       | What it is                                                                                                  |
| -------------------------- | ----------------------------------------------------------------------------------------------------------- |
| **Question**               | The exact yes/no question answered about the conversation. **This is the text the AI judge reads.**         |
| **Expected value**         | The answer (**true** or **false**) that means "good" or "compliant".                                        |
| **Applies-when / na rule** | Optional conditions under which the criterion doesn't apply, so it can opt out of irrelevant conversations. |
| **Type**                   | Whether the criterion is **deterministic** (a hard rule) or **judged** (decided by the LLM).                |
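
If it helps to think of a criterion as data, the four parts in the table could be modeled roughly like this. This is only a mental model: the class and field names are assumptions made for illustration, not the schema the Metric Builder stores.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass(frozen=True)
class Criterion:
    """Hypothetical shape of one criterion (field names are assumed)."""

    question: str                   # the exact yes/no text the AI judge reads
    expected_value: bool            # the answer that means "good" or "compliant"
    na_rule: Optional[str] = None   # optional applies-when / na rule
    criterion_type: Literal["deterministic", "judged"] = "judged"


email_check = Criterion(
    question="Did the agent confirm the caller's email address before sending anything?",
    expected_value=True,
)
```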

### The question is what the judge actually reads

The AI judge anchors on the criterion **question**. It reads the question (plus any na rule), consults the evaluation's free-text **instructions** as background context, and returns an outcome with a verbatim supporting quote, a confidence, and reasoning. Because the question is the anchor, everything that decides the verdict must live *in the question itself*.

**Phrase it as a clear yes/no question.** A criterion is a single check that resolves to true or false, so write it that way:

```
Good:  "Did the agent confirm the caller's email address before sending anything?"
Vague: "Email handling"
```

<Tip>
  **Keep criteria atomic.** One criterion should test exactly one thing. If your question uses "and" or "or" to join two separate checks, split it into two criteria. "Did the agent greet the caller and verify their identity?" is impossible to answer cleanly when the agent did one but not the other: the judge is forced to guess which half you cared about.
</Tip>

**Bake every rule into the question, and keep it channel-agnostic.** The evaluation's [instructions](/evaluation/evaluations) are *context*, not rules — the judge treats them as supporting facts, not as the thing to enforce. So any hard rule has to be phrased into the question. And because the same metric can grade phone, web chat, email, and text conversations, write the rule so it doesn't assume a channel:

```
Good (rule in the question, channel-agnostic):
  "Did the agent provide a callback or follow-up contact method before ending
   the conversation?"

Risky (rule lives only in the instructions):
  "Was the handoff handled correctly?"   ← "correctly" is defined elsewhere;
                                            the judge has to infer it

Risky (assumes a channel):
  "Did the agent read the disclosure aloud?"  ← meaningless on web chat/email
```

<Tip>
  A quick test: read your question to a colleague who hasn't seen the instructions. If they can answer **yes** or **no** from the question alone, the judge can too. If they ask "what counts as correct here?", the rule isn't in the question yet.
</Tip>

### Expected value, and the "false-is-good" case

The **expected value** is the answer that means *good*. The dashboard calls a verdict **compliant** when its outcome matches the expected value — so "good" isn't always "true".

When your question describes something you *want* to happen, set expected value to **true**:

```
Question:       "Did the agent answer the caller's question accurately?"
Expected value: true        → a "true" verdict is compliant (good)
```

When your question describes a *violation* — something you want to catch — set expected value to **false**. A "false" verdict then means the bad thing did **not** happen, which is the good outcome:

```
Question:       "Did the agent give the caller incorrect information?"
Expected value: false       → a "false" verdict is compliant (good);
                              a "true" verdict is a violation (bad)
```
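
The two cases above reduce to a single comparison: a verdict is compliant when its outcome equals the expected value. A minimal sketch, with a hypothetical function name:

```python
def is_compliant(outcome: bool, expected_value: bool) -> bool:
    """A verdict is compliant when the outcome matches the expected value."""
    return outcome == expected_value


# "Did the agent answer the caller's question accurately?" (expected: true)
accurate_answer = is_compliant(outcome=True, expected_value=True)

# "Did the agent give the caller incorrect information?" (expected: false)
no_violation = is_compliant(outcome=False, expected_value=False)
violation = is_compliant(outcome=True, expected_value=False)
```

Here `accurate_answer` and `no_violation` are both compliant, while `violation` is not, even though its outcome is "true".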

Both styles are valid. Phrasing checks as violations ("Did the agent ...?" with expected value **false**) pairs naturally with [human grade-by-exception review](/evaluation/running-evaluations), where reviewers only flag the things that went wrong.

<Note>
  Only **answered** verdicts count toward a metric's compliant rate. **Answered** means the outcome was true or false; **abstain** (the judge couldn't decide) and **na** (the criterion didn't apply) are excluded from the denominator. So a criterion that correctly opts out of irrelevant conversations doesn't drag your rate down — it simply isn't counted there.
</Note>
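
To make the denominator concrete, here is a small worked sketch of the rate calculation over a flat list of verdicts. The outcome strings and the function name are assumptions for illustration; how verdicts from several criteria roll up into one metric is not specified here, so the sketch covers a single list only.

```python
from typing import Iterable, Optional


def compliant_rate(outcomes: Iterable[str], expected_value: bool) -> Optional[float]:
    """Share of answered verdicts that match the expected value.

    "abstain" and "na" are dropped from the denominator. Returns None
    when nothing was answered, since there is no rate to report.
    """
    expected = "true" if expected_value else "false"
    answered = [o for o in outcomes if o in ("true", "false")]
    if not answered:
        return None
    return sum(o == expected for o in answered) / len(answered)


# Five verdicts, but only three were answered; two of those are compliant.
outcomes = ["true", "true", "false", "na", "abstain"]
rate = compliant_rate(outcomes, expected_value=True)  # 2 / 3
```

Had the **na** and **abstain** verdicts counted as failures, the same data would give 2 / 5 instead of 2 / 3.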

### Applies-when / na rules

Some checks only make sense for some conversations. A criterion about appointment booking is meaningless on a call where the caller never wanted an appointment. Use the optional **applies-when / na rule** to describe the conditions under which the criterion applies; when they aren't met, the outcome resolves to **na** and the criterion sits out that conversation instead of being scored as a pass or a fail.

```
Question:    "Did the agent collect the caller's preferred appointment time?"
na rule:     "Skip this criterion if the caller did not ask to book an
              appointment."
```

<Tip>
  Write the na rule so the judge can decide it from the conversation itself, the same way you write the question — concrete and observable, not "if relevant".
</Tip>

### Deterministic vs judged criteria

A criterion can be one of two types:

| Type              | How it's decided                                                                  | When to use it                                         |
| ----------------- | --------------------------------------------------------------------------------- | ------------------------------------------------------ |
| **Deterministic** | A hard, reproducible rule — the same conversation always yields the same outcome. | Mechanical checks that don't need interpretation.      |
| **Judged**        | The AI judge (an LLM) reads the question and decides.                             | Anything that needs reading comprehension or judgment. |

This distinction matters later: a metric whose criteria are **all deterministic** is reproducible by construction, so it auto-certifies on creation and needs no [calibration](/evaluation/calibration). Any **judged** criterion means the metric has to earn trust before it scores automatically — see [Graduation](/evaluation/graduation).
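
The auto-certification rule above can be sketched as a check over the criterion types in a metric. The function name and type strings are hypothetical, and treating a metric with no criteria as not certifiable is an assumption of this sketch rather than documented behavior.

```python
from typing import Iterable


def auto_certifies(criterion_types: Iterable[str]) -> bool:
    """True only when every criterion in the metric is deterministic.

    Assumption: a metric with no criteria does not auto-certify.
    """
    types = list(criterion_types)
    return bool(types) and all(t == "deterministic" for t in types)


all_rules = auto_certifies(["deterministic", "deterministic"])
one_judged = auto_certifies(["deterministic", "judged"])
```

A single judged criterion is enough to make `one_judged` false, which matches the rule that any judged criterion sends the metric through graduation.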

## Sanity-check before you save

If the builder offers a **lint** or **preview**, use it. Lint flags wording problems (a question that isn't really yes/no, a missing expected value). Preview lets you see how the judge would read and answer the criterion against a sample conversation, so you can catch a mis-phrased question before it scores hundreds of runs. A quick preview now saves you re-running an [evaluation](/evaluation/running-evaluations) later.
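
If you draft criteria outside the builder, you can approximate the two lint checks named above with a few lines of code. This is a crude heuristic sketch, not the builder's actual lint: the function name, the list of opening words, and the messages are all assumptions.

```python
import re
from typing import List, Optional

# Words a yes/no question commonly opens with (an illustrative, incomplete list).
_YES_NO_OPENERS = (
    "did", "does", "do", "is", "are", "was", "were",
    "has", "have", "had", "can", "could", "will", "would", "should",
)


def lint_criterion(question: str, expected_value: Optional[bool]) -> List[str]:
    """Return a list of wording problems; an empty list means none were found."""
    problems = []
    text = question.strip()
    words = re.findall(r"[A-Za-z']+", text.lower())
    first_word = words[0] if words else ""
    if not text.endswith("?") or first_word not in _YES_NO_OPENERS:
        problems.append("question does not read as a yes/no question")
    if expected_value is None:
        problems.append("expected value is missing")
    return problems


vague = lint_criterion("Email handling", None)
clean = lint_criterion(
    "Did the agent confirm the caller's email address before sending anything?",
    True,
)
```

A heuristic like this only catches surface problems. It cannot tell whether the rule is fully stated in the question, which is what a preview against a sample conversation is for.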

## Next steps

<CardGroup cols={2}>
  <Card title="Build an evaluation" icon="layers" href="/evaluation/evaluations">
    Bundle your metrics with judge instructions into a reusable evaluation you can run against conversations.
  </Card>

  <Card title="Run an evaluation" icon="play" href="/evaluation/running-evaluations">
    Trigger runs over one conversation, a batch, a pasted transcript, or an audio URL.
  </Card>
</CardGroup>
