You can ask the AI judge to score thousands of conversations in minutes. But before you let it score them unsupervised, you need a reason to trust it. Calibration is how you earn that trust: you check whether the AI judge reaches the same verdicts a careful human reviewer would. Everything here lives under the Analytics section of the dashboard, on the Calibration tab and on each evaluation’s AI vs Human view.

Why raw % agreement lies

The obvious way to check the judge is to count how often it agrees with a human: out of 100 conversations, how many got the same verdict? That number, raw percent agreement, looks reassuring and is often misleading. The problem is skew. Most criteria have a lopsided answer. Imagine a criterion “Did the agent stay professional?” where the true answer is “yes” about 90% of the time. Now suppose both the human and the AI judge just say “yes” to almost everything:
                 AI says yes   AI says no
Human says yes       85            5
Human says no         5            5

Raw agreement = (85 + 5) / 100 = 90%
90% agreement sounds excellent. But two reviewers who each said “yes” to 90 of the 100 conversations without reading any of them would still agree on roughly four out of five by luck alone. Raw % can’t tell skill apart from chance, so on skewed criteria (which is most of them) it systematically overstates how good the judge is.

Cohen’s κ: agreement minus luck

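To fix this, Anyreach reports Cohen’s κ (kappa), a standard statistic that subtracts the agreement you’d expect by chance and rescales what’s left: κ = (observed − chance) / (1 − chance). A κ of 0 means “no better than guessing”; a κ of 1 means “perfect agreement.” In the table above both raters say “yes” 90% of the time, so chance agreement is 0.9 × 0.9 + 0.1 × 0.1 = 82% and κ = (0.90 − 0.82) / (1 − 0.82) ≈ 0.44, only moderate despite the 90% headline. If the raters had never agreed on a single “no” (cells 90 / 5 / 5 / 0), raw agreement would still be 90% but κ would fall to about −0.05, which is the honest verdict: the judge hasn’t been shown to add anything yet. Read κ with these bands, adapted from Landis and Koch: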
Cohen’s κ      Interpretation
≥ 0.80         Almost perfect
0.60 – 0.80    Substantial
0.40 – 0.60    Moderate
0.20 – 0.40    Fair
< 0.20         Roughly chance
A metric can show a high % agreement and a κ near 0 at the same time. When they disagree, trust κ — the high % is the chance illusion described above.
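To make the arithmetic concrete, here is a minimal Python sketch of Cohen’s κ for a 2×2 table of yes/no verdicts. The function names are illustrative and are not part of any Anyreach API, and treating each band’s lower bound as inclusive is an assumption, since the table above doesn’t say which band a boundary value belongs to.

```python
def cohen_kappa(both_yes, human_yes_ai_no, human_no_ai_yes, both_no):
    """Cohen's kappa for two raters giving yes/no verdicts (2x2 table)."""
    total = both_yes + human_yes_ai_no + human_no_ai_yes + both_no
    if total == 0:
        raise ValueError("the table is empty")

    # Observed agreement: the two diagonal cells.
    p_observed = (both_yes + both_no) / total

    # Chance agreement: both say "yes" by luck, or both say "no" by luck.
    human_yes = (both_yes + human_yes_ai_no) / total
    ai_yes = (both_yes + human_no_ai_yes) / total
    p_chance = human_yes * ai_yes + (1 - human_yes) * (1 - ai_yes)

    if p_chance == 1:
        raise ValueError("kappa is undefined when both raters give one verdict only")
    return (p_observed - p_chance) / (1 - p_chance)


def band(kappa):
    """Map a kappa to the band labels used on this page (lower bound inclusive, assumed)."""
    if kappa >= 0.80:
        return "Almost perfect"
    if kappa >= 0.60:
        return "Substantial"
    if kappa >= 0.40:
        return "Moderate"
    if kappa >= 0.20:
        return "Fair"
    return "Roughly chance"


# The 85 / 5 / 5 / 5 table from earlier on this page: 90% raw agreement.
k = cohen_kappa(85, 5, 5, 5)
print(round(k, 2), band(k))   # 0.44 Moderate

# Same 90% raw agreement, but the raters never agree on a "no".
k = cohen_kappa(90, 5, 5, 0)
print(round(k, 2), band(k))   # -0.05 Roughly chance
```

Both tables show 90% raw agreement, yet they land in different bands, which is why the raw number can’t be the deciding one.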
Alongside κ, the dashboard shows a few companion measures so you can sanity-check it:
Measure           What it adds
Gwet AC1          An alternative chance-corrected score that stays stable even when one outcome is extremely rare; useful when κ behaves erratically on very skewed criteria.
Krippendorff α    A general agreement coefficient that handles more than two raters and missing verdicts.
Prevalence        The share of verdicts that are “true.” Tells you how skewed the criterion is, which explains why κ and % can diverge.
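As a rough illustration of why AC1 is the steadier companion, the sketch below computes Gwet’s AC1 and prevalence for a 2×2 yes/no table. It uses the standard two-category AC1 formula; the function names are illustrative, and pooling both raters’ verdicts to get prevalence is an assumption about how the dashboard defines it.

```python
def _check_total(total):
    if total == 0:
        raise ValueError("the table is empty")


def prevalence(both_yes, human_yes_ai_no, human_no_ai_yes, both_no):
    """Share of "yes" verdicts, pooled over both raters (assumed definition)."""
    total = both_yes + human_yes_ai_no + human_no_ai_yes + both_no
    _check_total(total)
    human_yes = both_yes + human_yes_ai_no
    ai_yes = both_yes + human_no_ai_yes
    return (human_yes + ai_yes) / (2 * total)


def gwet_ac1(both_yes, human_yes_ai_no, human_no_ai_yes, both_no):
    """Gwet's AC1 for two raters and two categories."""
    total = both_yes + human_yes_ai_no + human_no_ai_yes + both_no
    _check_total(total)
    p_observed = (both_yes + both_no) / total

    # AC1's chance term shrinks as the criterion gets more skewed,
    # so it never exceeds 0.5 and the denominator below stays positive.
    pi_yes = prevalence(both_yes, human_yes_ai_no, human_no_ai_yes, both_no)
    p_chance = 2 * pi_yes * (1 - pi_yes)

    return (p_observed - p_chance) / (1 - p_chance)


# Raters agree on 90 "yes" verdicts, split on 10, and never agree on a "no".
print(round(prevalence(90, 5, 5, 0), 2))   # 0.95
print(round(gwet_ac1(90, 5, 5, 0), 2))     # 0.89
```

For this same table Cohen’s κ works out to about −0.05, so a large gap between κ and AC1 alongside a prevalence near 0 or 1 is a signal that skew, rather than real disagreement, is pulling κ down.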

Two paths to a κ

You can measure agreement two ways. The quick path reuses runs you already have; the rigorous path freezes a controlled set.

Quick: the “AI vs Human” tab

Every evaluation has an AI vs Human tab. It computes a Gate-2 κ (AI-vs-human agreement, defined below) on the fly, over whatever AI and human runs already exist for that evaluation. There’s no setup — if you’ve run the AI judge on some conversations and a reviewer has also graded a few of them, this tab already has data. To keep the comparison fair, it:
  • takes the most-recent run per conversation for each assessor, so re-runs don’t double-count, and
  • excludes na verdicts (a criterion that didn’t apply to a conversation can’t agree or disagree).
Use this tab for a fast pulse check while you’re still building a metric.
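The two fairness rules above can be sketched in a few lines of Python. Everything here is hypothetical: the `Run` record, its field names, and the "ai" / "human" assessor labels are stand-ins for illustration, not the product’s data model. The sketch also assumes the most-recent run is chosen first and then dropped if it is `na`, rather than falling back to an older run.

```python
from datetime import datetime
from typing import NamedTuple


class Run(NamedTuple):
    """Hypothetical record of one verdict on one criterion."""
    conversation_id: str
    assessor: str          # "ai" or "human" (assumed labels)
    verdict: str           # "yes", "no", or "na"
    created_at: datetime


def latest_verdicts(runs, assessor):
    """Keep only the most recent run per conversation for one assessor."""
    latest = {}
    for run in runs:
        if run.assessor != assessor:
            continue
        current = latest.get(run.conversation_id)
        if current is None or run.created_at > current.created_at:
            latest[run.conversation_id] = run
    return {cid: run.verdict for cid, run in latest.items()}


def comparable_pairs(runs):
    """Return (conversation_id, human_verdict, ai_verdict) for gradable overlaps."""
    human = latest_verdicts(runs, "human")
    ai = latest_verdicts(runs, "ai")
    pairs = []
    for cid in sorted(human.keys() & ai.keys()):
        # A criterion that didn't apply can't agree or disagree.
        if human[cid] == "na" or ai[cid] == "na":
            continue
        pairs.append((cid, human[cid], ai[cid]))
    return pairs


runs = [
    Run("c1", "ai", "no", datetime(2024, 1, 1)),
    Run("c1", "ai", "yes", datetime(2024, 1, 2)),   # re-run replaces the older one
    Run("c1", "human", "yes", datetime(2024, 1, 3)),
    Run("c2", "ai", "na", datetime(2024, 1, 1)),    # excluded: criterion didn't apply
    Run("c2", "human", "no", datetime(2024, 1, 1)),
    Run("c3", "ai", "yes", datetime(2024, 1, 1)),   # excluded: no human run yet
]
print(comparable_pairs(runs))   # [('c1', 'yes', 'yes')]
```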

Rigorous: a Calibration set

When you’re ready to make a trust decision, build a Calibration set from the Calibration tab. A calibration set is a frozen list of conversations plus a list of raters. Freezing matters: everyone scores the exact same conversations, so the agreement number can’t drift as your traffic changes.
1. Pick the conversations

Choose a fixed set of conversations to grade. Aim for a spread of easy and hard cases so the κ reflects real difficulty.

2. Assign raters

A rater is anyone (or anything) whose verdicts you want to compare, typically one or more human reviewers plus the AI judge. Each rater grades every conversation in the set.

3. Let raters grade

Each rater’s grading creates a new, isolated run for every conversation, separate from any other evaluation’s runs. This isolation is the point: nobody is influenced by an existing verdict, so the agreement number is honest.

4. Compute agreement

Run the agreement computation. The page fills in the κ (and companions) for every gate that has enough raters.
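One way to picture a calibration set is as an immutable record plus a completeness check before agreement is computed. The sketch below is a mental model only: the `CalibrationSet` class, its fields, and the `missing_grades` helper are hypothetical names, not how Anyreach stores or validates sets.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CalibrationSet:
    """Hypothetical model: a frozen list of conversations plus a list of raters."""
    conversation_ids: Tuple[str, ...]
    raters: Tuple[str, ...]


def missing_grades(
    cal_set: CalibrationSet, grades: Dict[Tuple[str, str], str]
) -> List[Tuple[str, str]]:
    """List every (rater, conversation) pair that has not been graded yet.

    `grades` maps (rater, conversation_id) to a verdict.
    """
    return [
        (rater, cid)
        for rater in cal_set.raters
        for cid in cal_set.conversation_ids
        if (rater, cid) not in grades
    ]


cal_set = CalibrationSet(("c1", "c2"), ("alice", "ai-judge"))
grades = {
    ("alice", "c1"): "yes",
    ("alice", "c2"): "no",
    ("ai-judge", "c1"): "yes",
}
print(missing_grades(cal_set, grades))   # [('ai-judge', 'c2')]
```

Because the dataclass is frozen and its fields are tuples, the conversation list can’t be edited after creation, which mirrors the reason for freezing: every rater scores exactly the same conversations.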

The three gates

Agreement is reported per gate — each gate compares a different pair of rater types, and each answers a different trust question:
Gate      Compares                           Question it answers
Gate 1    human ↔ human                      Do your own reviewers even agree with each other? (If they don’t, the task is ill-defined; fix the criterion before blaming the AI.)
Gate 2    AI ↔ human                         Does the AI judge match your humans? This is the gate that justifies auto-scoring.
Proxy     internal rater ↔ customer rater    Does an internal reviewer stand in for the customer’s own judgment?
With just one human rater and one AI rater, you’ll only see a Gate 2 number. Gate 1 needs two or more human raters to compare humans against each other, and the proxy gate needs both an internal and a customer rater. Those cards stay empty until you add the raters they require — that’s expected, not a bug.
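The rater requirements for each gate can be written down as a small check. This is a sketch under assumptions: the `Rater` record and its `kind` and `side` fields are hypothetical, and it assumes the proxy gate compares human raters on the internal and customer sides.

```python
from typing import NamedTuple


class Rater(NamedTuple):
    """Hypothetical rater record."""
    name: str
    kind: str   # "human" or "ai"
    side: str   # "internal" or "customer"


def available_gates(raters):
    """Return the gates that have enough raters to produce a number."""
    humans = [r for r in raters if r.kind == "human"]
    ais = [r for r in raters if r.kind == "ai"]
    human_sides = {r.side for r in humans}

    gates = []
    if len(humans) >= 2:
        gates.append("Gate 1")   # human vs human
    if humans and ais:
        gates.append("Gate 2")   # AI vs human
    if {"internal", "customer"} <= human_sides:
        gates.append("Proxy")    # internal vs customer
    return gates


one_human_one_ai = [
    Rater("alice", "human", "internal"),
    Rater("ai-judge", "ai", "internal"),
]
print(available_gates(one_human_one_ai))   # ['Gate 2']

with_customer = one_human_one_ai + [Rater("bob", "human", "customer")]
print(available_gates(with_customer))      # ['Gate 1', 'Gate 2', 'Proxy']
```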

Reading the agreement cards

Both the AI vs Human tab and the Calibration page show agreement cards. They come in two granularities:
  • By metric: agreement pooled across all the criteria in a metric. This is the headline number for that metric.
  • By criterion: agreement for a single criterion. Use this to find the one criterion that’s dragging a metric’s κ down.
Each card shows the same fields:
On the card      How to read it
κ                The headline chance-corrected agreement. Compare it to the band table above.
AC1 / α          Companion coefficients: a cross-check on κ, especially on skewed criteria.
% agreement      Raw agreement. Informative, but never the deciding number.
prevalence       Share of “true” verdicts; tells you how skewed the criterion is.
threshold bar    A visual marker of where the κ sits relative to the bar it must clear to be considered trustworthy.
If a metric’s pooled κ looks weak, switch to the by criterion cards. Often a single criterion with a confusing question is the culprit — tighten its wording (see Creating metrics & criteria) and re-grade.
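The drill-down from a weak metric to its weakest criterion can be sketched as follows. The criterion names and verdict data are made up for illustration, and pooling by concatenating every criterion’s verdict pairs is an assumption about how the by-metric number is formed.

```python
def cohen_kappa(pairs):
    """Cohen's kappa from a list of (human_verdict, ai_verdict) booleans."""
    n = len(pairs)
    if n == 0:
        raise ValueError("no verdict pairs")
    p_observed = sum(h == a for h, a in pairs) / n
    human_yes = sum(h for h, _ in pairs) / n
    ai_yes = sum(a for _, a in pairs) / n
    p_chance = human_yes * ai_yes + (1 - human_yes) * (1 - ai_yes)
    if p_chance == 1:
        raise ValueError("kappa is undefined when both raters give one verdict only")
    return (p_observed - p_chance) / (1 - p_chance)


def pooled_kappa(verdicts_by_criterion):
    """By-metric view: pool every criterion's pairs, then compute one kappa (assumed)."""
    all_pairs = [p for pairs in verdicts_by_criterion.values() for p in pairs]
    return cohen_kappa(all_pairs)


def kappa_by_criterion(verdicts_by_criterion):
    """By-criterion view: (kappa, criterion) tuples, weakest first."""
    return sorted(
        (cohen_kappa(pairs), name) for name, pairs in verdicts_by_criterion.items()
    )


# Made-up data: "greeting" is graded consistently, "tone" is a coin flip.
verdicts = {
    "greeting": [(True, True)] * 4 + [(False, False)] * 4,
    "tone": [(True, True)] * 2 + [(True, False)] * 2
            + [(False, True)] * 2 + [(False, False)] * 2,
}
print(pooled_kappa(verdicts))            # 0.5
print(kappa_by_criterion(verdicts)[0])   # (0.0, 'tone')
```

Here the pooled κ of 0.5 hides the fact that one criterion is perfect and the other is at chance, which is exactly the situation the by-criterion cards are meant to expose.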

What calibration does and doesn’t do

Computing agreement refreshes each target metric’s eligibility — its gates, its blockers, and the “what it would take” guidance you see on the scoring-mode badge. It does not flip any metric to auto-scoring on its own. Turning on auto-scoring is a separate, deliberate step.
Calibration only updates eligibility. A metric will not start scoring conversations automatically until a human explicitly graduates it. See Graduation.
You’ll also need the right access: viewing agreement requires the assessments:read scope, and building calibration sets, assigning raters, and computing agreement require assessments:manage.

Next steps

Graduation

Once Gate 2 clears the threshold, turn on auto-scoring — explicitly.

Analytics

Use the assessor toggle and scoring-mode badges to read results with the right lens.