Calibration & agreement

You can ask the AI judge to score thousands of conversations in minutes. But before you let it score them unsupervised, you need a reason to trust it. Calibration is how you earn that trust: you check whether the AI judge reaches the same verdicts a careful human reviewer would. Everything here lives under the Analytics section of the dashboard, on the Calibration tab and on each evaluation’s AI vs Human view.

Why raw % agreement lies

The obvious way to check the judge is to count how often it agrees with a human: out of 100 conversations, how many got the same verdict? That number, raw percent agreement, looks reassuring but is often misleading. The problem is skew. Most criteria have a lopsided answer. Imagine a criterion “Did the agent stay professional?” where the true answer is “yes” about 90% of the time. Now suppose both the human and the AI judge say “yes” to almost everything:

                 AI says yes   AI says no
Human says yes       85            5
Human says no         5            5

Raw agreement = (85 + 5) / 100 = 90%

90% agreement sounds excellent. But two reviewers who each said “yes” 90% of the time without looking at the conversation would still agree on about 82% of conversations by luck alone (0.9 × 0.9 + 0.1 × 0.1 = 0.82). Raw % can’t tell skill apart from chance, so on skewed criteria (which is most of them) it systematically overstates how good the judge is.
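The arithmetic behind the example can be checked with a few lines of Python. This is a minimal sketch, not Anyreach code: the `table` layout and the function name `raw_and_chance_agreement` are illustrative assumptions. It computes the observed agreement and the agreement two independent raters with the same "yes" rates would reach by chance.

```python
# Confusion table from the example above: (human verdict, AI verdict) -> count.
table = {
    ("yes", "yes"): 85,
    ("yes", "no"): 5,
    ("no", "yes"): 5,
    ("no", "no"): 5,
}


def raw_and_chance_agreement(table):
    """Return (observed agreement, agreement expected by chance)."""
    total = sum(table.values())
    labels = {label for pair in table for label in pair}
    # Observed agreement: share of conversations on the diagonal.
    observed = sum(table.get((label, label), 0) for label in labels) / total
    # Chance agreement: for each label, multiply the two raters' shares.
    chance = 0.0
    for label in labels:
        human_share = sum(n for (h, _), n in table.items() if h == label) / total
        ai_share = sum(n for (_, a), n in table.items() if a == label) / total
        chance += human_share * ai_share
    return observed, chance


observed, chance = raw_and_chance_agreement(table)
print(f"raw agreement:      {observed:.2f}")  # 0.90
print(f"expected by chance: {chance:.2f}")    # 0.82
```

Only the gap between the two numbers, 8 points out of a possible 18, reflects agreement the raters had to earn.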

Cohen’s κ: agreement minus luck

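To fix this, Anyreach reports Cohen’s κ (kappa), a standard statistic that subtracts the agreement you’d expect by chance and rescales what’s left: κ = (observed − expected) / (1 − expected). A κ of 0 means “no better than guessing”; a κ of 1 means “perfect agreement”; a negative κ means the raters agree less often than chance would predict. For the 90%-raw example above, expected agreement is 82%, so κ = (0.90 − 0.82) / (1 − 0.82) ≈ 0.44: only “moderate,” and far less impressive than 90% suggests. On a more heavily skewed criterion, the same 90% raw agreement can come with a κ near 0, which is the honest verdict: the judge hasn’t been shown to add anything yet. Read κ with the Landis–Koch bands: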

Cohen’s κ	Interpretation
≥ 0.80	Almost perfect
0.60 – 0.80	Substantial
0.40 – 0.60	Moderate
0.20 – 0.40	Fair
< 0.20	Roughly chance

A metric can show a high % agreement and a κ near 0 at the same time. When they disagree, trust κ — the high % is the chance illusion described above.
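To make the divergence concrete, here is a minimal sketch of Cohen's κ for two raters, with a lookup into the band table above. The function names `cohens_kappa` and `band` are illustrative, not part of Anyreach, and placing a boundary value such as 0.80 in the upper band is an assumption. The second data set is a made-up, more heavily skewed table that also has 90% raw agreement.

```python
def cohens_kappa(pairs):
    """Cohen's kappa for two raters; `pairs` is a list of (rater_a, rater_b) verdicts."""
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least one pair of verdicts")
    labels = {verdict for pair in pairs for verdict in pair}
    observed = sum(a == b for a, b in pairs) / n
    chance = sum(
        (sum(a == label for a, _ in pairs) / n)
        * (sum(b == label for _, b in pairs) / n)
        for label in labels
    )
    if chance == 1.0:
        # Both raters gave one identical verdict everywhere: kappa is undefined.
        return float("nan")
    return (observed - chance) / (1 - chance)


def band(kappa):
    """Map a kappa value to the interpretation bands in the table above."""
    for floor, name in [(0.80, "Almost perfect"), (0.60, "Substantial"),
                        (0.40, "Moderate"), (0.20, "Fair")]:
        if kappa >= floor:
            return name
    return "Roughly chance"


def expand(yes_yes, yes_no, no_yes, no_no):
    """Turn 2x2 counts (human, AI) into a list of verdict pairs."""
    return ([("yes", "yes")] * yes_yes + [("yes", "no")] * yes_no
            + [("no", "yes")] * no_yes + [("no", "no")] * no_no)


example = cohens_kappa(expand(85, 5, 5, 5))   # table from the example above
skewed = cohens_kappa(expand(90, 5, 5, 0))    # hypothetical, more skewed table

print(f"{example:.2f} {band(example)}")  # 0.44 Moderate
print(f"{skewed:.2f} {band(skewed)}")    # -0.05 Roughly chance
```

Both tables have 90% raw agreement, yet only the first shows any agreement beyond chance.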

Alongside κ, the dashboard shows a few companion measures so you can sanity-check it:

Measure	What it adds
Gwet AC1	An alternative chance-corrected score that stays stable even when one outcome is extremely rare — useful when κ behaves erratically on very skewed criteria.
Krippendorff α	A general agreement coefficient that handles more than two raters and missing verdicts.
Prevalence	The share of verdicts that are “true.” Tells you how skewed the criterion is, which explains why κ and % can diverge.
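For the two-rater, true/false case, prevalence and Gwet's AC1 can be sketched as follows. This uses the standard AC1 formula for two raters and two categories; how the dashboard computes its own figures is not documented here, so averaging the two raters' "true" shares to get prevalence is an assumption, as is the function name.

```python
def prevalence_and_ac1(pairs, positive=True):
    """Return (prevalence, Gwet's AC1) for two raters with binary verdicts.

    `pairs` is a list of (rater_a, rater_b) verdicts; `positive` is the
    value that counts as "true".
    """
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least one pair of verdicts")
    observed = sum(a == b for a, b in pairs) / n
    share_a = sum(a == positive for a, _ in pairs) / n
    share_b = sum(b == positive for _, b in pairs) / n
    # Assumption: prevalence is the mean "true" share across both raters.
    prevalence = (share_a + share_b) / 2
    # AC1 chance term for two categories; it never exceeds 0.5,
    # so the denominator below cannot reach zero.
    chance = 2 * prevalence * (1 - prevalence)
    ac1 = (observed - chance) / (1 - chance)
    return prevalence, ac1


# Same 2x2 table as the raw-agreement example (human verdict, AI verdict).
pairs = ([(True, True)] * 85 + [(True, False)] * 5
         + [(False, True)] * 5 + [(False, False)] * 5)

prevalence, ac1 = prevalence_and_ac1(pairs)
print(f"prevalence: {prevalence:.2f}")  # 0.90
print(f"AC1:        {ac1:.2f}")         # 0.88
```

On this table AC1 comes out around 0.88 while κ is about 0.44. The high prevalence explains the gap, which is why the two coefficients are best read together on skewed criteria.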

Two paths to a κ

You can measure agreement two ways. The quick path reuses runs you already have; the rigorous path freezes a controlled set.

Quick: the “AI vs Human” tab

Every evaluation has an AI vs Human tab. It computes a Gate-2 κ (AI-vs-human agreement, defined below) on the fly, over whatever AI and human runs already exist for that evaluation. There’s no setup — if you’ve run the AI judge on some conversations and a reviewer has also graded a few of them, this tab already has data. To keep the comparison fair, it:

takes the most-recent run per conversation for each assessor, so re-runs don’t double-count, and
excludes na verdicts (a criterion that didn’t apply to a conversation can’t agree or disagree).
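The two rules above can be sketched in a few lines. The record shape here (`conversation_id`, `assessor`, `verdict`, `created_at`) is hypothetical and chosen only for illustration. It is not the Anyreach data model, and each record is simplified to a single criterion.

```python
from datetime import datetime


def paired_verdicts(runs):
    """Pair AI and human verdicts, keeping each assessor's latest run per conversation."""
    latest = {}
    for run in runs:
        key = (run["conversation_id"], run["assessor"])
        # Rule 1: a re-run replaces the earlier run instead of adding to it.
        if key not in latest or run["created_at"] > latest[key]["created_at"]:
            latest[key] = run

    pairs = []
    for cid in sorted({conversation_id for conversation_id, _ in latest}):
        ai = latest.get((cid, "ai"))
        human = latest.get((cid, "human"))
        if ai is None or human is None:
            continue  # only one assessor graded this conversation
        # Rule 2: "na" verdicts can neither agree nor disagree.
        if ai["verdict"] == "na" or human["verdict"] == "na":
            continue
        pairs.append((cid, ai["verdict"], human["verdict"]))
    return pairs


def run(cid, assessor, verdict, day):
    """Build one hypothetical run record."""
    return {"conversation_id": cid, "assessor": assessor,
            "verdict": verdict, "created_at": datetime(2024, 1, day)}


runs = [
    run("c1", "ai", "true", 1),
    run("c1", "ai", "false", 2),    # re-run: this one wins
    run("c1", "human", "false", 1),
    run("c2", "ai", "na", 1),       # dropped: criterion did not apply
    run("c2", "human", "true", 1),
    run("c3", "ai", "true", 1),     # dropped: no human grade yet
    run("c4", "ai", "true", 1),
    run("c4", "human", "true", 1),
]

print(paired_verdicts(runs))
# [('c1', 'false', 'false'), ('c4', 'true', 'true')]
```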

Use this tab for a fast pulse check while you’re still building a metric.

Rigorous: a Calibration set

When you’re ready to make a trust decision, build a Calibration set from the Calibration tab. A calibration set is a frozen list of conversations plus a list of raters. Freezing matters: everyone scores the exact same conversations, so the agreement number can’t drift as your traffic changes.

Pick the conversations

Choose a fixed set of conversations to grade. Aim for a spread of easy and hard cases so the κ reflects real difficulty.

Assign raters

A rater is anyone (or anything) whose verdicts you want to compare — typically one or more human reviewers plus the AI judge. Each rater grades every conversation in the set.

Let raters grade

Each rater’s grading creates a new, isolated run for every conversation — separate from any other evaluation’s runs. This isolation is the point: nobody is influenced by an existing verdict, so the agreement number is honest.

Compute agreement

Run the agreement computation. The page fills in the κ (and companions) for every gate that has enough raters.

The three gates

Agreement is reported per gate — each gate compares a different pair of rater types, and each answers a different trust question:

Gate	Compares	Question it answers
Gate 1	human ↔ human	Do your own reviewers even agree with each other? (If they don’t, the task is ill-defined — fix the criterion before blaming the AI.)
Gate 2	AI ↔ human	Does the AI judge match your humans? This is the gate that justifies auto-scoring.
Proxy	internal rater ↔ customer rater	Does an internal reviewer stand in for the customer’s own judgment?

With just one human rater and one AI rater, you’ll only see a Gate 2 number. Gate 1 needs two or more human raters to compare humans against each other, and the proxy gate needs both an internal and a customer rater. Those cards stay empty until you add the raters they require — that’s expected, not a bug.
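The rater requirements for each gate amount to a small set of checks. The sketch below is illustrative only: the rater fields `kind` and `side` are hypothetical names, and it assumes that only human raters are tagged as internal or customer.

```python
def available_gates(raters):
    """Report which gates have enough raters to produce a number.

    Each rater is a dict with a hypothetical `kind` ("human" or "ai") and,
    for humans, a hypothetical `side` ("internal" or "customer").
    """
    humans = [r for r in raters if r["kind"] == "human"]
    ais = [r for r in raters if r["kind"] == "ai"]
    sides = {r.get("side") for r in humans}
    return {
        "gate_1": len(humans) >= 2,                   # human vs human
        "gate_2": bool(humans) and bool(ais),         # AI vs human
        "proxy": {"internal", "customer"} <= sides,   # internal vs customer
    }


minimal = [
    {"kind": "human", "side": "internal"},
    {"kind": "ai"},
]
full = minimal + [{"kind": "human", "side": "customer"}]

print(available_gates(minimal))
# {'gate_1': False, 'gate_2': True, 'proxy': False}
print(available_gates(full))
# {'gate_1': True, 'gate_2': True, 'proxy': True}
```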

Reading the agreement cards

Both the AI vs Human tab and the Calibration page show agreement cards. They come in two granularities:

By metric — agreement pooled across all the criteria in a metric. This is the headline number for that metric.
By criterion — agreement for a single criterion. Use this to find the one criterion that’s dragging a metric’s κ down.

Each card shows the same fields:

On the card	How to read it
κ	The headline chance-corrected agreement. Compare it to the band table above.
AC1 / α	Companion coefficients — a cross-check on κ, especially on skewed criteria.
% agreement	Raw agreement. Informative, but never the deciding number.
prevalence	Share of “true” verdicts — tells you how skewed the criterion is.
threshold bar	A visual marker of where the κ sits relative to the bar it must clear to be considered trustworthy.

If a metric’s pooled κ looks weak, switch to the by criterion cards. Often a single criterion with a confusing question is the culprit — tighten its wording (see Creating metrics & criteria) and re-grade.
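The drill-down from a pooled number to its weakest criterion can be sketched as below. The criterion names and verdicts are made up, and treating "pooled" as concatenating every criterion's verdict pairs before computing κ is an assumption. The dashboard's exact pooling method is not specified here.

```python
def kappa(pairs):
    """Cohen's kappa for a list of (ai, human) verdict pairs; None if undefined."""
    n = len(pairs)
    labels = {verdict for pair in pairs for verdict in pair}
    observed = sum(a == b for a, b in pairs) / n
    chance = sum(
        (sum(a == label for a, _ in pairs) / n)
        * (sum(b == label for _, b in pairs) / n)
        for label in labels
    )
    return None if chance == 1.0 else (observed - chance) / (1 - chance)


# Hypothetical verdict pairs for two criteria of one metric.
by_criterion = {
    "greeted_customer": [(True, True)] * 4 + [(False, False)] * 4,
    "issue_resolved": [(True, True)] * 2 + [(False, False)] * 2
                      + [(True, False)] * 2 + [(False, True)] * 2,
}

# Assumption: pooling means concatenating the pairs from every criterion.
pooled = kappa([pair for pairs in by_criterion.values() for pair in pairs])

# Rank criteria from weakest to strongest, skipping undefined values.
scores = {name: kappa(pairs) for name, pairs in by_criterion.items()}
ranked = sorted((k, name) for name, k in scores.items() if k is not None)

print(f"pooled kappa: {pooled:.2f}")   # 0.50
for k, name in ranked:
    print(f"{name}: {k:.2f}")
# issue_resolved: 0.00
# greeted_customer: 1.00
```

In this toy data a middling pooled κ hides one criterion in perfect agreement and one at chance level, which is the pattern the by criterion cards are meant to expose.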

What calibration does and doesn’t do

Computing agreement refreshes each target metric’s eligibility — its gates, its blockers, and the “what it would take” guidance you see on the scoring-mode badge. It does not flip any metric to auto-scoring on its own. Turning on auto-scoring is a separate, deliberate step.

Calibration only updates eligibility. A metric will not start scoring conversations automatically until a human explicitly graduates it. See Graduation.

You’ll also need the right access: viewing agreement requires the assessments:read scope, and building calibration sets, assigning raters, and computing agreement require assessments:manage.

Next steps

Graduation

Once Gate 2 clears the threshold, turn on auto-scoring — explicitly.

Analytics

Use the assessor toggle and scoring-mode badges to read results with the right lens.