
# Calibration & agreement

> Check whether the AI judge agrees with your human reviewers, using chance-corrected agreement rather than raw %.

You can ask the AI judge to score thousands of conversations in minutes. But before you let it score them *unsupervised*, you need a reason to trust it. **Calibration** is how you earn that trust: you check whether the AI judge reaches the same verdicts a careful human reviewer would.

Everything here lives under the **Analytics** section of the dashboard, on the **Calibration** tab and on each evaluation's **AI vs Human** view.

## Why raw % agreement lies

The obvious way to check the judge is to count how often it agrees with a human: out of 100 conversations, how many got the same verdict? That number, **raw percent agreement**, looks reassuring but is often misleading.

The problem is *skew*. Most criteria have a lopsided answer. Imagine a criterion "Did the agent stay professional?" where the true answer is "yes" 90% of the time. Now suppose both the human and the AI judge just say "yes" to almost everything:

```
                 AI says yes   AI says no
Human says yes       85            5
Human says no         5            5

Raw agreement = (85 + 5) / 100 = 90%
```

90% agreement sounds excellent. But two reviewers who *each said "yes" 90% of the time* without really reading the conversation would still agree about 82% of the time by luck alone. Raw % can't tell skill apart from chance, so on skewed criteria (which is most of them) it systematically overstates how good the judge is.

## Cohen's κ: agreement minus luck

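To fix this, Anyreach reports **Cohen's κ (kappa)**, a standard statistic that subtracts the agreement you'd expect *by chance* and rescales what's left: κ = (observed − chance) / (1 − chance). A κ of 0 means "no better than guessing", a κ of 1 means "perfect agreement", and a negative κ means worse than chance. For the table above, chance agreement is 0.9 × 0.9 + 0.1 × 0.1 = 82%, so κ = (0.90 − 0.82) / (1 − 0.82) ≈ 0.44, only "moderate" despite the 90% headline. On a more skewed criterion, the same 90% raw agreement can come with a κ near 0, which is the honest verdict: the judge hasn't been shown to add anything yet.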
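
As a worked check, here is a minimal Python sketch of the standard two-rater Cohen's κ computed from an agreement table. The function name `cohens_kappa` and the table layout are illustrative assumptions, not Anyreach's implementation.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square agreement table.

    table[i][j] = number of items rater A put in category i
    and rater B put in category j.
    """
    n = sum(sum(row) for row in table)
    k = len(table)
    # Observed agreement: share of items on the diagonal.
    p_o = sum(table[i][i] for i in range(k)) / n
    # Chance agreement: product of the raters' marginal rates, summed over categories.
    row_rate = [sum(table[i]) / n for i in range(k)]
    col_rate = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_e = sum(r * c for r, c in zip(row_rate, col_rate))
    if p_e == 1.0:
        return float("nan")  # undefined: both raters always gave the same single verdict
    return (p_o - p_e) / (1 - p_e)


# The 2x2 table from the previous section: 90% raw agreement.
print(round(cohens_kappa([[85, 5], [5, 5]]), 3))  # 0.444
# Same 90% raw agreement, but the raters never agree on a "no".
print(round(cohens_kappa([[90, 5], [5, 0]]), 3))  # -0.053
```

Both tables have 90% raw agreement, yet the second one scores slightly below chance because every agreement is a "yes" that both raters hand out almost unconditionally.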

Read κ against these bands, adapted from the **Landis–Koch** scale:

| Cohen's κ   | Interpretation |
| ----------- | -------------- |
| ≥ 0.80      | Almost perfect |
| 0.60 – 0.80 | Substantial    |
| 0.40 – 0.60 | Moderate       |
| 0.20 – 0.40 | Fair           |
| \< 0.20     | Roughly chance |
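
If you want to label κ values in your own notebooks, the table can be encoded as a small lookup. This is a sketch only: the function name is hypothetical, and treating each lower bound as inclusive is an assumption, since the table does not say which band a boundary value such as 0.60 falls into.

```python
def kappa_band(kappa):
    """Map a kappa value to the interpretation bands in the table above."""
    # Assumption: lower bounds are inclusive, matching the ">= 0.80" row.
    bands = [
        (0.80, "Almost perfect"),
        (0.60, "Substantial"),
        (0.40, "Moderate"),
        (0.20, "Fair"),
    ]
    for lower, label in bands:
        if kappa >= lower:
            return label
    return "Roughly chance"


print(kappa_band(0.44))   # Moderate
print(kappa_band(-0.05))  # Roughly chance
```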

<Note>
  A metric can show a high % agreement *and* a κ near 0 at the same time. When they disagree, trust κ — the high % is the chance illusion described above.
</Note>

Alongside κ, the dashboard shows a few companion measures so you can sanity-check it:

| Measure            | What it adds                                                                                                                                                 |
| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Gwet AC1**       | An alternative chance-corrected score that stays stable even when one outcome is extremely rare — useful when κ behaves erratically on very skewed criteria. |
| **Krippendorff α** | A general agreement coefficient that handles more than two raters and missing verdicts.                                                                      |
| **Prevalence**     | The share of verdicts that are "true." Tells you *how* skewed the criterion is, which explains why κ and % can diverge.                                      |
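
To see why the companions help, the sketch below computes raw agreement, prevalence, Cohen's κ, and Gwet's AC1 side by side for a two-rater yes/no table. It uses the textbook two-category formulas; the function name and return shape are assumptions for illustration, and the dashboard's exact computation may differ.

```python
def agreement_summary(a, b, c, d):
    """Agreement measures for a 2x2 yes/no table.

    a = both "yes", b = rater 1 yes / rater 2 no,
    c = rater 1 no / rater 2 yes, d = both "no".
    """
    n = a + b + c + d
    p_o = (a + d) / n                  # raw agreement
    yes_1 = (a + b) / n                # rater 1's "yes" rate
    yes_2 = (a + c) / n                # rater 2's "yes" rate
    prevalence = (yes_1 + yes_2) / 2   # share of all verdicts that are "yes"
    # Cohen's chance term uses each rater's own marginal rates.
    pe_kappa = yes_1 * yes_2 + (1 - yes_1) * (1 - yes_2)
    # Gwet's chance term 2*pi*(1 - pi) shrinks as the criterion gets more skewed.
    pe_ac1 = 2 * prevalence * (1 - prevalence)
    kappa = float("nan") if pe_kappa == 1.0 else (p_o - pe_kappa) / (1 - pe_kappa)
    return {
        "raw": p_o,
        "prevalence": prevalence,
        "kappa": kappa,
        "ac1": (p_o - pe_ac1) / (1 - pe_ac1),
    }


summary = agreement_summary(90, 5, 5, 0)
print({name: round(value, 3) for name, value in summary.items()})
# {'raw': 0.9, 'prevalence': 0.95, 'kappa': -0.053, 'ac1': 0.89}
```

On this very skewed table κ and AC1 point in opposite directions, and the 0.95 prevalence explains why: with so few "no" verdicts, there is almost no evidence about how the raters behave on the rare outcome.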

## Two paths to a κ

You can measure agreement two ways. The quick path reuses runs you already have; the rigorous path freezes a controlled set.

### Quick: the "AI vs Human" tab

Every evaluation has an **AI vs Human** tab. It computes a **Gate-2** κ (AI-vs-human agreement, defined below) on the fly, over whatever AI and human runs already exist for that evaluation. There's no setup — if you've run the AI judge on some conversations and a reviewer has also graded a few of them, this tab already has data.

To keep the comparison fair, it:

* takes the **most-recent run per conversation** for each assessor, so re-runs don't double-count, and
* **excludes `na`** verdicts (a criterion that didn't apply to a conversation can't agree or disagree).
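
The two rules above can be sketched as a small filtering step. The record fields (`assessor`, `conversation_id`, `verdict`, `created_at`) are hypothetical names chosen for illustration, and dropping `na` *after* picking the latest run is an assumption about the order of operations.

```python
from datetime import datetime


def latest_verdicts(runs):
    """Most recent verdict per (assessor, conversation), with 'na' dropped."""
    latest = {}
    for run in runs:
        key = (run["assessor"], run["conversation_id"])
        # Keep only the newest run for this assessor and conversation.
        if key not in latest or run["created_at"] > latest[key]["created_at"]:
            latest[key] = run
    # Assumed order: pick the latest run first, then exclude it if it is 'na'.
    return {k: r["verdict"] for k, r in latest.items() if r["verdict"] != "na"}


def verdict_pairs(verdicts, a="ai", b="human"):
    """(a, b) verdict pairs for conversations that both assessors scored."""
    scored_by_a = {conv for who, conv in verdicts if who == a}
    scored_by_b = {conv for who, conv in verdicts if who == b}
    return [(verdicts[(a, c)], verdicts[(b, c)]) for c in sorted(scored_by_a & scored_by_b)]


runs = [
    {"assessor": "ai", "conversation_id": "c1", "verdict": "no", "created_at": datetime(2024, 1, 1)},
    {"assessor": "ai", "conversation_id": "c1", "verdict": "yes", "created_at": datetime(2024, 1, 2)},
    {"assessor": "human", "conversation_id": "c1", "verdict": "yes", "created_at": datetime(2024, 1, 3)},
    {"assessor": "ai", "conversation_id": "c2", "verdict": "yes", "created_at": datetime(2024, 1, 1)},
    {"assessor": "human", "conversation_id": "c2", "verdict": "na", "created_at": datetime(2024, 1, 1)},
]
pairs = verdict_pairs(latest_verdicts(runs))
print(pairs)  # [('yes', 'yes')]
```

The AI's earlier "no" on `c1` is superseded by its re-run, and `c2` drops out entirely because the human's verdict was `na`.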

Use this tab for a fast pulse check while you're still building a metric.

### Rigorous: a Calibration set

When you're ready to make a trust decision, build a **Calibration set** from the **Calibration** tab. A calibration set is a *frozen* list of conversations plus a list of **raters**. Freezing matters: everyone scores the exact same conversations, so the agreement number can't drift as your traffic changes.

<Steps>
  <Step title="Pick the conversations">
    Choose a fixed set of conversations to grade. Aim for a spread of easy and hard cases so the κ reflects real difficulty.
  </Step>

  <Step title="Assign raters">
    A **rater** is anyone (or anything) whose verdicts you want to compare — typically one or more human reviewers plus the AI judge. Each rater grades **every** conversation in the set.
  </Step>

  <Step title="Let raters grade">
    Each rater's grading creates a **new, isolated run** for every conversation — separate from any other evaluation's runs. This isolation is the point: nobody is influenced by an existing verdict, so the agreement number is honest.
  </Step>

  <Step title="Compute agreement">
    Run the agreement computation. The page fills in the κ (and companions) for every gate that has enough raters.
  </Step>
</Steps>
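
One way to picture a calibration set is as an immutable record plus a completeness check before agreement is computed. This is a conceptual sketch, not the Anyreach data model: `CalibrationSet`, `missing_grades`, and the shape of `grades` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalibrationSet:
    conversation_ids: tuple[str, ...]  # the frozen list of conversations
    raters: tuple[str, ...]            # human reviewers and/or the AI judge


def missing_grades(cal_set, grades):
    """Return (rater, conversation_id) pairs that still need a verdict.

    grades maps (rater, conversation_id) -> verdict from that rater's own run.
    """
    return [
        (rater, conv)
        for rater in cal_set.raters
        for conv in cal_set.conversation_ids
        if (rater, conv) not in grades
    ]


cal_set = CalibrationSet(conversation_ids=("c1", "c2"), raters=("ana", "ai-judge"))
grades = {("ana", "c1"): "yes", ("ana", "c2"): "no", ("ai-judge", "c1"): "yes"}
print(missing_grades(cal_set, grades))  # [('ai-judge', 'c2')]
```

Using a frozen dataclass with tuples mirrors the point of step 1: once the set exists, nobody can quietly swap conversations in or out.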

## The three gates

Agreement is reported per **gate** — each gate compares a different *pair* of rater types, and each answers a different trust question:

| Gate       | Compares                        | Question it answers                                                                                                                   |
| ---------- | ------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| **Gate 1** | human ↔ human                   | Do your own reviewers even agree with each other? (If they don't, the task is ill-defined — fix the criterion before blaming the AI.) |
| **Gate 2** | AI ↔ human                      | Does the AI judge match your humans? This is the gate that justifies auto-scoring.                                                    |
| **Proxy**  | internal rater ↔ customer rater | Does an internal reviewer stand in for the customer's own judgment?                                                                   |

<Note>
  With just **one human rater and one AI rater**, you'll only see a **Gate 2** number. Gate 1 needs **two or more human** raters to compare humans against each other, and the proxy gate needs **both an internal and a customer** rater. Those cards stay empty until you add the raters they require — that's expected, not a bug.
</Note>
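
The rater requirements for each gate can be expressed as a short check. The `Rater` fields and gate labels below are hypothetical, and the sketch assumes the proxy gate compares human raters only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rater:
    name: str
    kind: str               # "human" or "ai"
    side: str = "internal"  # "internal" or "customer"


def available_gates(raters):
    """List the gates that have enough raters to produce a number."""
    humans = [r for r in raters if r.kind == "human"]
    ais = [r for r in raters if r.kind == "ai"]
    human_sides = {r.side for r in humans}
    gates = []
    if len(humans) >= 2:
        gates.append("gate_1")  # human vs human
    if humans and ais:
        gates.append("gate_2")  # AI vs human
    if {"internal", "customer"} <= human_sides:
        gates.append("proxy")   # internal vs customer
    return gates


print(available_gates([Rater("ana", "human"), Rater("judge", "ai")]))
# ['gate_2']
```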

## Reading the agreement cards

Both the **AI vs Human** tab and the **Calibration** page show **agreement cards**. They come in two granularities:

* **By metric**: agreement *pooled* across all the criteria in a metric. This is the headline number for that metric.
* **By criterion**: agreement for a single criterion. Use this to find the one criterion that's dragging a metric's κ down.

Each card shows the same fields:

| On the card       | How to read it                                                                                      |
| ----------------- | --------------------------------------------------------------------------------------------------- |
| **κ**             | The headline chance-corrected agreement. Compare it to the band table above.                        |
| **AC1** / **α**   | Companion coefficients — a cross-check on κ, especially on skewed criteria.                         |
| **% agreement**   | Raw agreement. Informative, but never the deciding number.                                          |
| **prevalence**    | Share of "true" verdicts — tells you how skewed the criterion is.                                   |
| **threshold bar** | A visual marker of where the κ sits relative to the bar it must clear to be considered trustworthy. |

<Tip>
  If a metric's pooled κ looks weak, switch to the **by criterion** cards. Often a single criterion with a confusing **question** is the culprit — tighten its wording (see [Creating metrics & criteria](/evaluation/creating-metrics)) and re-grade.
</Tip>
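
The drill-down from a pooled number to the weakest criterion can be sketched as follows. Two assumptions are baked in: verdicts are booleans, and "pooled" means concatenating the verdict pairs of every criterion before computing κ, which may not match how the dashboard pools. All names are illustrative.

```python
import math


def kappa_from_pairs(pairs):
    """Cohen's kappa for a list of (rater_a, rater_b) boolean verdicts."""
    n = len(pairs)
    p_o = sum(a == b for a, b in pairs) / n
    yes_a = sum(a for a, _ in pairs) / n
    yes_b = sum(b for _, b in pairs) / n
    p_e = yes_a * yes_b + (1 - yes_a) * (1 - yes_b)
    return float("nan") if p_e == 1.0 else (p_o - p_e) / (1 - p_e)


def metric_breakdown(pairs_by_criterion):
    """Return (pooled kappa, per-criterion kappas, weakest criterion)."""
    per_criterion = {name: kappa_from_pairs(p) for name, p in pairs_by_criterion.items()}
    # Assumption: pooling = one kappa over all criteria's verdict pairs together.
    pooled = kappa_from_pairs([pair for p in pairs_by_criterion.values() for pair in p])
    # Ignore criteria whose kappa is undefined when looking for the weakest one.
    defined = {name: k for name, k in per_criterion.items() if not math.isnan(k)}
    weakest = min(defined, key=defined.get) if defined else None
    return pooled, per_criterion, weakest


pairs_by_criterion = {
    "greeting": [(True, True)] * 8 + [(False, False)] * 2,
    "empathy": [(True, True)] * 4 + [(True, False)] * 3 + [(False, True)] * 2 + [(False, False)],
}
pooled, per_criterion, weakest = metric_breakdown(pairs_by_criterion)
print(weakest)  # empathy
```

Here "greeting" agrees perfectly while "empathy" sits below chance, so the pooled κ lands in between and only the per-criterion view shows where to look.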

## What calibration does and doesn't do

Computing agreement **refreshes** each target metric's eligibility — its gates, its blockers, and the "what it would take" guidance you see on the scoring-mode badge. It does **not** flip any metric to auto-scoring on its own. Turning on auto-scoring is a separate, deliberate step.

<Warning>
  Calibration only updates eligibility. A metric will not start scoring conversations automatically until a human explicitly graduates it. See [Graduation](/evaluation/graduation).
</Warning>

You'll also need the right access: viewing agreement requires the `assessments:read` scope, and building calibration sets, assigning raters, and computing agreement require `assessments:manage`.

## Next steps

<CardGroup cols={2}>
  <Card title="Graduation" icon="graduation-cap" href="/evaluation/graduation">
    Once Gate 2 clears the threshold, turn on auto-scoring — explicitly.
  </Card>

  <Card title="Analytics" icon="chart-line" href="/evaluation/analytics">
    Use the assessor toggle and scoring-mode badges to read results with the right lens.
  </Card>
</CardGroup>
