# Graduation & auto-scoring

> Turn on automatic AI scoring for a metric once it has earned your trust — and understand the scoring-mode badge.

A **metric** measures one quality of a conversation (for example *Task Resolution*) by grouping one or more yes/no **criteria**. You can have those criteria scored two ways: by the **AI judge** (a top-tier LLM that reads the conversation and answers each question) or by a **human reviewer** grading by hand. The AI judge is fast and cheap, but you shouldn't trust it to score a metric automatically until you've *proven* it agrees with your humans on that metric.

**Graduation** is the moment you flip that switch — you tell Anyreach "the AI judge is trustworthy for this metric, let it score automatically." This page explains the badge that tracks where each metric stands, and the flow for graduating one.

You'll find all of this in the dashboard under the **Analytics** section.

## The scoring-mode badge

Every metric shows a small **scoring mode** badge (also called **scored-by**). It has three values:

| Badge           | Meaning                                                                                                  |
| --------------- | -------------------------------------------------------------------------------------------------------- |
| **human\_only** | The AI judge is **not yet trusted** to score this metric on its own. It hasn't passed calibration.       |
| **hybrid**      | The AI judge scores this metric automatically, with humans **spot-checking** a sample to keep it honest. |
| **auto**        | The AI judge is **fully trusted** and scores this metric automatically.                                  |

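A metric with at least one judged criterion starts at **human\_only** and moves "up" to **hybrid** or **auto** only when you graduate it. Fully deterministic metrics are the exception; see [Deterministic metrics certify themselves](#deterministic-metrics-certify-themselves).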

<Warning>
  The badge is a **capability**, not a view. It answers "*is the AI trusted to auto-score this metric?*" — it does **not** mean you're only looking at AI verdicts (or only human verdicts) right now.

  Both AI and human runs can exist for a **human\_only** metric at the same time. What you're *viewing* is controlled by the **assessor toggle** ("AI + Human" / "AI judge" / "Human") on the [Analytics overview](/evaluation/analytics), not by this badge. Keep the two ideas separate: the **toggle** chooses verdicts to look at; the **badge** describes scoring capability.
</Warning>

## Why graduation exists

You could just let the AI judge score everything from day one — but then you'd have no idea whether its verdicts match what a careful human would say. Some metrics are easy for an LLM ("Did the agent greet the caller?"); others are subtle and the judge gets them wrong often enough to mislead your dashboards.

Graduation forces a check: prove the judge **agrees with humans** on a given metric *before* you rely on it. That proof comes from **calibration** — running the AI judge and your human reviewers over the same conversations and measuring how often they land on the same verdict. See [Calibration & agreement](/evaluation/calibration) for the full picture.

## The graduation flow

<Steps>
  <Step title="Collect AI and human verdicts on the same conversations">
    Run the AI judge over a set of conversations, and have a reviewer grade those **same** conversations by hand. You now have two opinions per conversation — the judge's and a human's — which is what agreement is measured from. (Human grading is "grade-by-exception": each criterion defaults to the compliant answer and you only flag the exceptions.)
  </Step>

  <Step title="Compute agreement">
    Measure how well the AI and human verdicts line up. Two ways:

    * **Quick:** open the **AI vs Human** tab on an evaluation. It computes agreement on the fly over the AI and human runs that already exist — no setup.
    * **Rigorous:** build a **calibration set** — a frozen set of conversations graded by assigned raters as fresh, isolated runs — and compute agreement there.

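    Agreement is reported as **Cohen's κ (kappa)**, not raw percent, because κ corrects for the agreement two raters would reach by chance. A skewed metric (almost always the same answer) can hit a high raw percent by chance alone while κ stays near zero. κ is 1 for perfect agreement and 0 for chance-level agreement, so higher κ means more real agreement.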

    <Note>
      Computing agreement on a calibration set **refreshes** each target metric's eligibility — its gates, its blockers, and its "what it would take" guidance. It does **not** graduate anything automatically. Refreshing eligibility and graduating are two separate steps.
    </Note>
  </Step>

  <Step title="Check whether the metric is ELIGIBLE">
    If the measured gates pass — for example **Gate-2 κ** (AI↔human agreement) at or above the threshold — the metric shows as **ELIGIBLE** to graduate. If a gate falls short, the badge popover tells you which one and what it would take to clear it (see below).
  </Step>

  <Step title="Graduate it explicitly">
    Open the metric's **Status / Graduation** panel and click **Graduate**. This is **graduate-on-click**: nothing flips automatically. When you graduate, the metric's scoring mode flips from **human\_only** to **auto** or **hybrid**, and from then on the AI judge scores it without waiting for a human.
  </Step>
</Steps>
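
To see why the **Compute agreement** step reports κ rather than raw percent, here is a minimal Python sketch of the standard Cohen's κ formula, κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance. It is illustrative only: the function names and sample verdicts are made up for this page, and it is not Anyreach's implementation.

```python
from collections import Counter


def raw_agreement(ai, human):
    """Share of conversations where the AI and human verdicts match."""
    return sum(a == h for a, h in zip(ai, human)) / len(ai)


def cohens_kappa(ai, human):
    """Cohen's kappa for two equal-length lists of verdicts."""
    if len(ai) != len(human) or not ai:
        raise ValueError("need two non-empty verdict lists of equal length")
    n = len(ai)
    p_o = raw_agreement(ai, human)
    # Chance agreement, from each rater's own label frequencies.
    ai_counts, human_counts = Counter(ai), Counter(human)
    p_e = sum(ai_counts[label] * human_counts[label] for label in ai_counts) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout: kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Skewed metric: humans say "yes" 95% of the time, the judge always says "yes".
skewed_human = ["yes"] * 95 + ["no"] * 5
skewed_ai = ["yes"] * 100
skewed_raw = raw_agreement(skewed_ai, skewed_human)    # 0.95
skewed_kappa = cohens_kappa(skewed_ai, skewed_human)   # 0.0

# Balanced metric: the judge matches the human on 45 of 50 in each class.
balanced_human = ["yes"] * 50 + ["no"] * 50
balanced_ai = ["yes"] * 45 + ["no"] * 5 + ["no"] * 45 + ["yes"] * 5
balanced_kappa = cohens_kappa(balanced_ai, balanced_human)  # about 0.8
```

The always-"yes" judge reaches 95% raw agreement yet scores κ = 0, because it never distinguishes the cases that matter. The balanced example has lower raw agreement (90%) but a much higher κ.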

<Tip>
  Graduating doesn't lock you out of human review. You can still grade conversations by hand on a graduated metric — a human run can confirm or correct an AI run at any time.
</Tip>

## Golden labels

When you graduate, Anyreach runs a regression check against **golden labels**: a held-out set of conversations whose correct verdicts are already known. Think of them as an answer key the AI judge never sees during calibration. If the judge still gets the golden labels right, you have evidence that its agreement generalizes beyond the conversations you calibrated on. Golden labels guard against a metric "passing" calibration but quietly failing on cases you didn't look at.
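
As a rough illustration of what a golden-label regression check involves, the sketch below compares a judge's verdicts with known-correct ones. Everything here is hypothetical: `golden_regression`, the `min_accuracy` threshold, and the keyword-based `toy_judge` are assumptions for the example, not Anyreach's actual check or pass criteria.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical shape: (conversation text, known-correct verdict).
GoldenLabel = Tuple[str, bool]


def golden_regression(
    judge: Callable[[str], bool],
    golden: Sequence[GoldenLabel],
    min_accuracy: float = 0.9,  # assumed threshold, not an Anyreach value
) -> Tuple[bool, float, List[str]]:
    """Return (passed, accuracy, missed conversations) for a judge."""
    if not golden:
        raise ValueError("golden label set is empty")
    misses = [conv for conv, expected in golden if judge(conv) != expected]
    accuracy = 1 - len(misses) / len(golden)
    return accuracy >= min_accuracy, accuracy, misses


def toy_judge(conversation: str) -> bool:
    """Stand-in for the AI judge on 'Did the agent greet the caller?'."""
    return "hello" in conversation.lower()


golden_set = [
    ("Agent: Hello, thanks for calling.", True),
    ("Agent: What do you want?", False),
    ("Agent: hello there!", True),
    ("Agent: Hi, how can I help?", True),  # a greeting the toy judge misses
]

passed, accuracy, misses = golden_regression(toy_judge, golden_set)
```

Here the toy judge gets three of four golden labels right (accuracy 0.75), so it fails the assumed 0.9 bar, and `misses` points at the exact conversation to investigate.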

## Demotion: losing trust is automatic

Trust can be revoked without a click. If a graduated metric's gates later **fail** — say agreement drifts below the threshold as you collect more human grades — the metric is **demoted automatically** back toward **human\_only**.

The asymmetry is deliberate: **granting** trust requires a human to click **Graduate**, but **losing** trust needs no confirmation. It's always safe to stop auto-scoring a metric the moment it stops agreeing with your humans.
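
The asymmetry can be pictured as a small state machine in which refreshing gate results may demote a metric but never promotes it. This is a sketch under stated assumptions: the class and function names are invented for illustration, and demotion is simplified to a direct drop to `human_only`, which the product may handle in finer steps.

```python
from dataclasses import dataclass
from enum import Enum


class ScoringMode(Enum):
    HUMAN_ONLY = "human_only"
    HYBRID = "hybrid"
    AUTO = "auto"


@dataclass
class Metric:
    name: str
    mode: ScoringMode = ScoringMode.HUMAN_ONLY
    gates_pass: bool = False


def refresh_eligibility(metric: Metric, gates_pass: bool) -> None:
    """Record fresh gate results. May demote; never promotes."""
    metric.gates_pass = gates_pass
    if not gates_pass and metric.mode is not ScoringMode.HUMAN_ONLY:
        # Losing trust needs no confirmation (simplified: straight to human_only).
        metric.mode = ScoringMode.HUMAN_ONLY


def graduate(metric: Metric, target: ScoringMode) -> None:
    """Explicit user action; the only path that grants trust."""
    if target is ScoringMode.HUMAN_ONLY:
        raise ValueError("graduation must target hybrid or auto")
    if not metric.gates_pass:
        raise ValueError(f"{metric.name} is not eligible to graduate")
    metric.mode = target


metric = Metric("Task Resolution")
refresh_eligibility(metric, gates_pass=True)   # eligible, still human_only
mode_after_refresh = metric.mode
graduate(metric, ScoringMode.HYBRID)           # the explicit click
mode_after_graduate = metric.mode
refresh_eligibility(metric, gates_pass=False)  # gates fail later
mode_after_drift = metric.mode
```

Passing gates alone leaves the mode untouched; only `graduate` moves it up, while a failed refresh moves it down without any call from the user.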

## Deterministic metrics certify themselves

Some criteria are **deterministic** — hard, reproducible rules rather than LLM judgment calls. A metric whose criteria are *all* deterministic is reproducible by construction: it returns the same verdict every time and there's no judge to calibrate. These metrics **auto-certify on creation** — they need no calibration set and no graduation step.

Graduation only matters for metrics that contain at least one **judged** criterion.
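
The rule reduces to a single check over a metric's criteria. A minimal sketch, assuming a hypothetical `Criterion` record with a `deterministic` flag (the field names and the example criteria are illustrative, not Anyreach's data model):

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Criterion:
    question: str
    deterministic: bool  # True for a hard rule, False for an LLM-judged question


def needs_graduation(criteria: Sequence[Criterion]) -> bool:
    """True if the metric has at least one judged criterion."""
    if not criteria:
        raise ValueError("a metric groups one or more criteria")
    return any(not c.deterministic for c in criteria)


# Hypothetical metrics for illustration.
all_rules = [
    Criterion("Was a ticket ID recorded?", deterministic=True),
    Criterion("Did the call end before the time limit?", deterministic=True),
]
mixed = all_rules + [Criterion("Did the agent greet the caller?", deterministic=False)]

rules_need_graduation = needs_graduation(all_rules)  # False: auto-certifies
mixed_needs_graduation = needs_graduation(mixed)     # True: one judged criterion
```

A single judged criterion is enough to put the whole metric on the calibration and graduation path.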

## Reading the badge popover

Click the scoring-mode badge to open its popover. It explains, in plain terms, where the metric stands:

* **Gates** — which measured gates (Gate 1 human↔human, Gate 2 AI↔human, Proxy internal↔customer) apply and whether each passes.
* **Blockers** — what's currently stopping the metric from graduating (for example, not enough human runs yet, or κ below threshold).
* **What it would take** — the concrete next action to become eligible (for example, "grade 8 more conversations" or "raise Gate-2 κ to 0.6").
* **Manual override** — where you graduate the metric once it's eligible, or where its mode is shown if it's already auto/hybrid.

Use the popover as your checklist: it tells you exactly why a metric isn't auto-scoring yet and what to do about it.
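
To make the relationship between gates, blockers, and guidance concrete, the sketch below derives the last two from a list of gate results. It is a hypothetical model: the `GateResult` fields, the message wording, and the 0.6 threshold are assumptions for illustration, and the real popover may compute these differently.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class GateResult:
    name: str
    kappa: Optional[float]  # None until there are enough runs to measure it
    threshold: float        # assumed value; real thresholds may differ

    @property
    def passes(self) -> bool:
        # A gate passes when kappa is at or above its threshold.
        return self.kappa is not None and self.kappa >= self.threshold


def popover_summary(gates: List[GateResult]) -> Dict[str, object]:
    """Derive eligibility, blockers, and next actions from gate results."""
    blockers: List[str] = []
    next_steps: List[str] = []
    for gate in gates:
        if gate.kappa is None:
            blockers.append(f"{gate.name}: not measured yet")
            next_steps.append(f"collect more human runs for {gate.name}")
        elif not gate.passes:
            blockers.append(f"{gate.name}: κ {gate.kappa:.2f} below {gate.threshold:.2f}")
            next_steps.append(f"raise {gate.name} κ to {gate.threshold:.2f}")
    return {
        "eligible": not blockers,
        "blockers": blockers,
        "what_it_would_take": next_steps,
    }


summary = popover_summary([
    GateResult("Gate 1", kappa=0.72, threshold=0.6),
    GateResult("Gate 2", kappa=0.41, threshold=0.6),
])
```

In this example Gate 1 passes and Gate 2 does not, so the metric is not eligible, the only blocker names Gate 2, and the guidance is to raise its κ to the threshold.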

## Permissions

Viewing scoring modes needs the **assessments:read** scope. Graduating a metric — like creating, editing, or triggering evaluations — needs **assessments:manage**.

## Next steps

<CardGroup cols={2}>
  <Card title="Calibration & agreement" icon="ruler-combined" href="/evaluation/calibration">
    How agreement, κ, and the gates that decide eligibility actually work.
  </Card>

  <Card title="Analytics overview" icon="chart-line" href="/evaluation/analytics">
    Read compliant rates and switch whose verdicts you're viewing with the assessor toggle.
  </Card>
</CardGroup>