# Key concepts & glossary

> Plain-language and precise definitions of every term used in evaluation.

Evaluation is how you measure the quality of your agents' conversations consistently — using an AI judge, human reviewers, or both. It lives in the dashboard under the **Analytics** section, where the tabs are **Overview**, **Metrics**, **Evaluations**, **Runs**, and **Calibration**.

This page is a reference. Whenever you hit a term anywhere in the evaluation docs and want the exact meaning, scan the table below, then jump to that term's section for a plain-language line followed by a precise one.

## Quick summary

| Term                          | In one line                                                                                     |
| ----------------------------- | ----------------------------------------------------------------------------------------------- |
| **Metric**                    | A thing you measure about a conversation (e.g. "Task Resolution"); groups one or more criteria. |
| **Criterion** (atom)          | A single yes/no check inside a metric.                                                          |
| **Question**                  | The exact yes/no question the AI judge answers about the conversation.                          |
| **Expected value**            | The answer (`true` or `false`) that means "good / compliant".                                   |
| **Deterministic vs judged**   | A hard, reproducible rule vs a call the LLM makes.                                              |
| **Outcome**                   | One verdict's value: `true`, `false`, `abstain`, or `na`.                                       |
| **Compliant**                 | The outcome matched the expected value.                                                         |
| **Compliant rate**            | compliant ÷ answered (`abstain` and `na` excluded).                                             |
| **Evaluation**                | A reusable bundle of metrics + judge config + instructions.                                     |
| **Instructions**              | Free text the judge reads as scoring context.                                                   |
| **Evaluation run**            | One assessment of one conversation by one assessor.                                             |
| **Assessor**                  | Who produced the verdicts: `ai` or `human`.                                                     |
| **Criterion result**          | The verdict for one criterion in one run (outcome + quote + confidence + reasoning).            |
| **Scoring mode** (scored\_by) | A metric's capability badge: `auto`, `hybrid`, or `human_only`.                                 |
| **Calibration set**           | A frozen set of conversations used to measure agreement rigorously.                             |
| **Rater / assignment**        | A person (or AI) who grades a calibration set, and the work assigned to them.                   |
| **Agreement**                 | How often two assessors give the same verdict.                                                  |
| **Cohen's κ**                 | A chance-corrected agreement score.                                                             |
| **Gate 1 / Gate 2 / proxy**   | Human↔human / AI↔human / internal↔customer agreement gates.                                     |
| **Golden labels**             | A held-out set of known-correct verdicts used as a regression check.                            |
| **Graduation**                | Explicitly turning on auto-scoring for a metric once its gates pass.                            |
| **Modality**                  | Whether a metric reads text or audio.                                                           |

## Metric

**Plain:** a single thing you want to measure about a conversation — for example "Task Resolution" or "Information Accuracy".

**Precise:** a metric groups one or more criteria and carries a `metric_kind`, a `modality` (text or audio), and a `scored_by` status (its [scoring mode](#scoring-mode-scored_by)). You view per-metric results as cards in the **Overview** tab of **Analytics**.

## Criterion (atom)

**Plain:** one specific yes/no check inside a metric. Criteria are sometimes called "atoms" because they're the smallest unit that gets a verdict.

**Precise:** a criterion holds a [question](#question), an [expected value](#expected-value), an optional applies-when rule, and a flag for whether it's [deterministic or judged](#deterministic-vs-judged-criterion). Every [evaluation run](#evaluation-run) produces one [criterion result](#criterion-result) per criterion.

### Question

**Plain:** the exact yes/no question that gets answered about the conversation.

**Precise:** this text is what the AI judge actually reads, so any hard rule — for instance a channel rule like "only count this for phone calls" — must be written into the question itself, phrased to stay channel-agnostic. Put rules in the question; put supporting facts in the evaluation's [instructions](#instructions).

### Applies-when (na rule)

**Plain:** an optional condition that says "skip this criterion when it doesn't apply".

**Precise:** the applies-when rule (also called the na rule) states the condition under which the criterion applies. When that condition is not met, the criterion resolves to the [`na` outcome](#outcome) instead of being scored, and it's excluded from the compliant rate.

## Deterministic vs judged criterion

**Plain:** a deterministic criterion is a hard rule that always gives the same answer; a judged criterion is one the AI decides.

**Precise:** a **deterministic** criterion is reproducible by construction — it never needs the LLM and never needs calibration. A **judged** criterion is decided by the AI judge (or a human). A metric made entirely of deterministic criteria auto-certifies the moment you create it; a metric with any judged criterion has to earn trust through [calibration](#calibration-set) before it can [graduate](#graduation).

## Outcome

**Plain:** the value of a single verdict — what the judge or reviewer answered for one criterion.

**Precise:** an outcome is exactly one of:

| Outcome   | Meaning                                                                                                                                     |
| --------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| `true`    | The answer to the criterion's question is yes.                                                                                              |
| `false`   | The answer is no.                                                                                                                           |
| `abstain` | The assessor couldn't decide.                                                                                                               |
| `na`      | The criterion doesn't apply to this conversation (e.g. its applies-when rule wasn't met, or an [audio metric](#modality) has no recording). |

Both `abstain` and `na` are non-answers: they are excluded from the [compliant rate](#compliant--compliant-rate) denominator.

## Expected value

**Plain:** the answer that counts as "good".

**Precise:** a criterion's `expected_value` is `true` or `false` — whichever verdict means the conversation was good or compliant. It's set per criterion because some questions are phrased positively and some negatively (see the worked example below).

## Compliant & compliant rate

**Plain:** a verdict is "compliant" when it matches the expected answer. The compliant rate is the share of answered verdicts that came out good.

**Precise:** **compliant** = the [outcome](#outcome) equals the criterion's [expected value](#expected-value). The **compliant rate** shown on dashboard cards is:

```
compliant rate = compliant ÷ answered
answered        = verdicts that are true OR false   (abstain and na are NOT counted)
```

### Worked example: why a "false" verdict can be compliant

Take the criterion **"Did the agent give wrong information?"** Wrong information is bad, so the good answer is "no" — meaning `expected_value = false`.

```
Question:        "Did the agent give wrong information?"
expected_value:  false
Verdict outcome: false   ->  matches expected_value  ->  COMPLIANT  (good!)
Verdict outcome: true    ->  agent gave wrong info   ->  NOT compliant
```

So a `false` verdict here is the *good* result. Always read compliant as "matched expected value", not as "the verdict was true".
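
The formula and the worked example above can be sketched in a few lines of Python. This is an illustrative sketch, not the product's implementation: the names `is_compliant` and `compliant_rate`, the `(outcome, expected_value)` tuple shape, and the choice to return `None` when nothing was answered are all assumptions made for the example.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical shape for this sketch: (outcome, expected_value).
# outcome is one of "true", "false", "abstain", "na"; expected_value is a bool.
Verdict = Tuple[str, bool]


def is_compliant(outcome: str, expected_value: bool) -> bool:
    """A verdict is compliant when its outcome equals the expected value."""
    return outcome == ("true" if expected_value else "false")


def compliant_rate(verdicts: Iterable[Verdict]) -> Optional[float]:
    """compliant / answered, where answered counts only true/false outcomes."""
    answered = [(o, e) for o, e in verdicts if o in ("true", "false")]
    if not answered:
        return None  # assumption: the rate is undefined when nothing was answered
    compliant = sum(1 for o, e in answered if is_compliant(o, e))
    return compliant / len(answered)


# "Did the agent give wrong information?" has expected_value = False.
verdicts = [
    ("false", False),    # compliant: no wrong information
    ("false", False),    # compliant
    ("true", False),     # not compliant: the agent gave wrong information
    ("abstain", False),  # excluded from the denominator
    ("na", False),       # excluded from the denominator
]
rate = compliant_rate(verdicts)
print(rate)  # 2 compliant out of 3 answered
```

Note that the two non-answers change neither the numerator nor the denominator: five verdicts were recorded, but only three count as answered.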

## Evaluation

**Plain:** a reusable bundle you point at conversations to score them.

**Precise:** an evaluation bundles a set of [metrics](#metric), the judge configuration, and a free-text [instructions](#instructions) field. You run an evaluation against one or more conversations to produce [runs](#evaluation-run). Build and edit evaluations in the **Evaluations** tab.

## Instructions

**Plain:** notes you give the AI judge so it knows what "correct" looks like.

**Precise:** free text the judge reads as scoring *context* — not as the question. A common use is pasting the agent's [knowledge base](/agents/knowledge-base-attachment) or scoring guidance so the judge knows what counts as in-scope or accurate. The judge anchors on each criterion's [question](#question) and treats instructions as supporting context, so put hard rules in the question and put facts in instructions. Editable in the Evaluation builder.

## Evaluation run

**Plain:** one scoring of one conversation by one assessor.

**Precise:** a run is a single assessment of one conversation by one [assessor](#assessor). Its status is `pending`, `completed`, or `failed`. AI runs are queued and processed asynchronously. You can have multiple runs on the same conversation — for example an AI run and a human run that confirms or corrects it.

## Assessor

**Plain:** who produced the verdicts — the AI or a person.

**Precise:** a run's `assessor_type` is either `ai` (the [AI judge](#the-ai-judge)) or `human` (a reviewer). In the **Overview** tab, the **assessor toggle** ("AI + Human", "AI judge", "Human") chooses *whose* verdicts you're currently looking at.

### The AI judge

**Plain:** a top-tier LLM that reads the conversation and answers each criterion.

**Precise:** for each criterion the judge reads the [question](#question) (plus any na rule) and the evaluation's [instructions](#instructions), then returns an [outcome](#outcome), a verbatim [quote](#criterion-result) of evidence, a [confidence](#criterion-result), and reasoning. It runs asynchronously. For [audio-modality](#modality) metrics, if the conversation has no recording the verdict auto-resolves to `na`.

## Criterion result

**Plain:** the recorded verdict for one criterion within one run, with its evidence.

**Precise:** a criterion result holds the [outcome](#outcome), a verbatim **quote** of the supporting evidence from the transcript, a **confidence**, and reasoning. One run has one criterion result per criterion in the evaluation's metrics.
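
To see how a run, its assessor, and its criterion results fit together, here is a minimal data-model sketch in Python. Only `assessor_type`, the three status values, and the four outcome values come from this page. The class names, the `criterion_id` and `conversation_id` fields, and the choice of a float for confidence are assumptions for illustration, not the actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Literal, Optional

Outcome = Literal["true", "false", "abstain", "na"]


@dataclass
class CriterionResult:
    criterion_id: str                   # hypothetical identifier
    outcome: Outcome
    quote: Optional[str] = None         # verbatim evidence from the transcript
    confidence: Optional[float] = None  # assumed numeric; the docs only say "a confidence"
    reasoning: Optional[str] = None


@dataclass
class EvaluationRun:
    conversation_id: str                # hypothetical identifier
    assessor_type: Literal["ai", "human"]
    status: Literal["pending", "completed", "failed"] = "pending"
    # One result per criterion in the evaluation's metrics.
    results: List[CriterionResult] = field(default_factory=list)


# Illustrative values only.
run = EvaluationRun(conversation_id="conv-1", assessor_type="ai")
run.results.append(
    CriterionResult(
        criterion_id="wrong-info",
        outcome="false",
        quote="Your order ships on Monday.",
        confidence=0.9,
        reasoning="The shipping date matches the order record.",
    )
)
run.status = "completed"
```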

## Scoring mode (scored\_by)

**Plain:** a per-metric badge that says how much you trust the AI to score this metric on its own.

**Precise:** `scored_by` is a status shown as a badge with three values:

| Badge        | Meaning                                                             |
| ------------ | ------------------------------------------------------------------- |
| `human_only` | The AI judge is not yet trusted to score this metric automatically. |
| `hybrid`     | The AI judge scores it, with human spot-checks.                     |
| `auto`       | The AI judge is trusted to score it automatically.                  |

<Warning>
  The badge is the metric's **scoring capability**, not the data you're currently viewing. Both AI and human runs can exist for a `human_only` metric — the [assessor toggle](#assessor) controls what you *view*, the badge controls what the metric is *allowed to do*. The badge has a popover explaining its measured gates, blockers, and what it would take to [graduate](#graduation).
</Warning>

## Calibration set

**Plain:** a frozen group of conversations you use to rigorously measure whether the AI judge agrees with humans.

**Precise:** a calibration set freezes a fixed set of conversations, assigns multiple [raters](#rater--assignment), and computes [agreement](#agreement). Each rater grades every conversation as a *new, isolated* run, separate from any other evaluation — so the measurement isn't contaminated by existing verdicts. Set these up in the **Calibration** tab.

<Tip>
  For a quick check without any setup, every evaluation also has an **"AI vs Human"** tab that computes [Gate-2](#gate-1--gate-2--proxy) κ on the fly over the AI and human runs that already exist. It takes the most-recent run per conversation and excludes `na`.
</Tip>
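
The selection rule in the tip above (most-recent run per conversation, `na` excluded) might look like the following Python sketch. It is a simplification under stated assumptions: the `Run` tuple and its `created_at` field are hypothetical, each run carries a single outcome instead of one per criterion, and "most recent" is taken per conversation and per assessor type.

```python
from typing import Dict, List, NamedTuple, Tuple


class Run(NamedTuple):
    conversation_id: str
    assessor_type: str  # "ai" or "human"
    created_at: int     # hypothetical sortable timestamp
    outcome: str        # simplified: one criterion per run


def latest_pairs(runs: List[Run]) -> List[Tuple[str, str]]:
    """Return (ai_outcome, human_outcome) pairs ready for an agreement score."""
    # Keep only the most recent run per (conversation, assessor type).
    latest: Dict[Tuple[str, str], Run] = {}
    for run in runs:
        key = (run.conversation_id, run.assessor_type)
        if key not in latest or run.created_at > latest[key].created_at:
            latest[key] = run

    pairs = []
    for (conversation_id, assessor_type), ai_run in sorted(latest.items()):
        if assessor_type != "ai":
            continue
        human_run = latest.get((conversation_id, "human"))
        if human_run is None:
            continue  # no human run to compare against
        if "na" in (ai_run.outcome, human_run.outcome):
            continue  # na is excluded from the comparison
        pairs.append((ai_run.outcome, human_run.outcome))
    return pairs


runs = [
    Run("c1", "ai", 1, "true"),
    Run("c1", "ai", 2, "false"),    # newer AI run replaces the first one
    Run("c1", "human", 1, "false"),
    Run("c2", "ai", 1, "na"),       # dropped: na
    Run("c2", "human", 1, "true"),
    Run("c3", "ai", 1, "true"),     # dropped: no human run
]
pairs = latest_pairs(runs)
```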

## Rater & assignment

**Plain:** a rater is someone (or the AI) doing the grading; an assignment is the work given to them.

**Precise:** in a [calibration set](#calibration-set), each rater grades every frozen conversation as its own isolated run. You need at least two human raters to measure [Gate 1](#gate-1--gate-2--proxy); one human plus one AI rater gives you Gate 2 only.

## Agreement

**Plain:** how often two assessors land on the same verdict.

**Precise:** agreement compares two assessors' [outcomes](#outcome) on the same conversations. Raw % agreement is shown but is misleading on its own — see [Cohen's κ](#cohens-κ-kappa).

## Cohen's κ (kappa)

**Plain:** an agreement score that corrects for lucky guesses.

**Precise:** when outcomes are skewed (say the answer is almost always the same), two assessors can agree 80%+ purely by chance, so raw % overstates trust. For example, if each assessor answers `true` 90% of the time, they would agree about 82% of the time by chance alone (0.9 × 0.9 + 0.1 × 0.1). **Cohen's κ** corrects for that chance agreement. The dashboard also shows **Gwet AC1**, **Krippendorff α**, and **prevalence** (the share of `true`). Interpret κ with these bands, adapted from the Landis–Koch scale:

| Cohen's κ   | Interpretation |
| ----------- | -------------- |
| ≥ 0.80      | Almost perfect |
| 0.60 – 0.80 | Substantial    |
| 0.40 – 0.60 | Moderate       |
| 0.20 – 0.40 | Fair           |
| \< 0.20     | Roughly chance |

A metric can show a high % agreement while κ is near 0 — that's the trap κ is there to catch.
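
The trap is easy to reproduce. The sketch below computes Cohen's κ from its standard definition, κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance. It is an illustration, not the dashboard's code: the function names are made up for this example, the data is synthetic, and the lower edge of each band is treated as inclusive, which the table does not specify. Callers are assumed to have removed non-answers beforehand.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters who labeled the same items in the same order."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability that both pick the same label independently,
    # given each rater's own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n) for label in labels)
    if p_e == 1.0:
        return 1.0  # kappa is undefined here; returning 1.0 is a convention for this sketch
    return (p_o - p_e) / (1.0 - p_e)


def kappa_band(kappa: float) -> str:
    """Map kappa to the bands in the table above (lower edge inclusive, an assumption)."""
    if kappa >= 0.80:
        return "Almost perfect"
    if kappa >= 0.60:
        return "Substantial"
    if kappa >= 0.40:
        return "Moderate"
    if kappa >= 0.20:
        return "Fair"
    return "Roughly chance"


# Synthetic example: both raters say "true" 90% of the time.
rater_a = ["true"] * 90 + ["false"] * 10
rater_b = ["true"] * 81 + ["false"] * 9 + ["true"] * 9 + ["false"] * 1

raw_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
kappa = cohens_kappa(rater_a, rater_b)
print(raw_agreement, kappa_band(kappa))
```

Here the raters agree on 82 of 100 conversations, yet chance alone predicts about the same 82%, so κ comes out at approximately zero.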

## Gate 1 / Gate 2 / proxy

**Plain:** the three agreement checks that decide whether the AI judge is trustworthy.

**Precise:**

| Gate       | Compares                        | Needs                            |
| ---------- | ------------------------------- | -------------------------------- |
| **Gate 1** | human ↔ human                   | at least 2 human raters          |
| **Gate 2** | AI ↔ human                      | 1 human + the AI                 |
| **Proxy**  | internal rater ↔ customer rater | an internal and a customer rater |

[Agreement cards](#agreement) on the calibration page come in two groupings: pooled per metric ("by metric") and per criterion ("by criterion"). Each card shows κ, AC1, α, % agreement, prevalence, and a threshold bar.

## Golden labels

**Plain:** a small set of verdicts you already know are correct, kept aside as a sanity check.

**Precise:** golden labels are a held-out set of known-correct verdicts used as a regression check during [graduation](#graduation) — they catch a judge that would otherwise pass the gates but get the known cases wrong.

## Graduation

**Plain:** the deliberate act of turning on auto-scoring for a metric once it has earned trust.

**Precise:** computing agreement on a [calibration set](#calibration-set) *refreshes* each target metric's eligibility (its gates, blockers, and "what it would take") but never auto-graduates it. If the gates pass — for example Gate-2 κ at or above the threshold — the metric shows as **eligible**, and you then graduate it explicitly from its **Status / Graduation** panel. Its [`scored_by`](#scoring-mode-scored_by) then flips to `auto` or `hybrid`. Nothing scores automatically until a human clicks **Graduate**. Conversely, a metric whose gates later fail is demoted automatically — losing trust needs no confirmation. Fully [deterministic](#deterministic-vs-judged-criterion) metrics skip all of this and auto-certify on creation.
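
The split between refreshing eligibility and graduating can be sketched as two separate functions. Everything here is hypothetical and only mirrors the behavior described above: the `Metric` class, the function names, the gate names, and the 0.6 threshold are assumptions, and the real set of gates and thresholds may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Metric:
    name: str
    scored_by: str = "human_only"  # "human_only", "hybrid", or "auto"
    eligible: bool = False
    blockers: List[str] = field(default_factory=list)


def refresh_eligibility(
    metric: Metric, gate_kappas: Dict[str, Optional[float]], threshold: float
) -> None:
    """Recompute eligibility from measured gates. Never graduates the metric."""
    metric.blockers = []
    for gate, kappa in gate_kappas.items():
        if kappa is None:
            metric.blockers.append(f"{gate}: not measured")
        elif kappa < threshold:
            metric.blockers.append(f"{gate}: kappa {kappa:.2f} is below {threshold:.2f}")
    metric.eligible = not metric.blockers


def graduate(metric: Metric, target: str) -> None:
    """Explicit, human-triggered step that flips scored_by."""
    if target not in ("auto", "hybrid"):
        raise ValueError("target must be 'auto' or 'hybrid'")
    if not metric.eligible:
        raise RuntimeError("metric is not eligible: " + "; ".join(metric.blockers))
    metric.scored_by = target


metric = Metric("Task Resolution")
refresh_eligibility(metric, {"gate_2": 0.72}, threshold=0.6)  # hypothetical values
status_after_refresh = metric.scored_by  # still "human_only": refreshing never graduates
graduate(metric, "auto")                 # the explicit step a human takes
```

Keeping the two functions apart is the point of the design: computing agreement can run as often as needed without ever changing what scores automatically.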

## Modality

**Plain:** whether a metric reads the conversation's text or its audio.

**Precise:** a metric's `modality` is `text` or `audio`. For an `audio` metric, if the conversation has no recording, the [AI judge](#the-ai-judge) auto-resolves the verdict to [`na`](#outcome) (so it never counts against the compliant rate).

## Permissions

Viewing evaluation data requires the `assessments:read` scope. Creating or editing metrics and evaluations, triggering runs, and graduating metrics all require `assessments:manage`.

## Next steps

<CardGroup cols={2}>
  <Card title="Quickstart" icon="rocket" href="/evaluation/quickstart">
    Build your first metric and score a conversation end to end.
  </Card>

  <Card title="Creating metrics & criteria" icon="list-checks" href="/evaluation/creating-metrics">
    Write good questions and set expected values that score the way you mean.
  </Card>

  <Card title="Calibration" icon="scale-balanced" href="/evaluation/calibration">
    Measure whether the AI judge agrees with your reviewers.
  </Card>
</CardGroup>
