Why raw % agreement lies
The obvious way to check the judge is to count how often it agrees with a human: out of 100 conversations, how many got the same verdict? That number, raw percent agreement, looks reassuring and is almost always misleading. The problem is skew. Most criteria have a lopsided answer. Imagine a criterion “Did the agent stay professional?” where the true answer is “yes” 95% of the time. Now suppose both the human and the AI judge just say “yes” to almost everything: if each says “yes” about 95% of the time, independently of the other, they agree on roughly 90% of conversations by luck alone (0.95 × 0.95 + 0.05 × 0.05 ≈ 0.905), without either one having judged anything.
Cohen’s κ: agreement minus luck
To fix this, Anyreach reports Cohen’s κ (kappa), a standard statistic that subtracts the agreement you’d expect by chance and rescales what’s left. A κ of 0 means “no better than guessing”; a κ of 1 means “perfect agreement.” The same 90%-raw example above can have a κ near 0, which is the honest verdict: the judge hasn’t been shown to add anything yet. Read κ with the Landis–Koch bands:

| Cohen’s κ | Interpretation |
|---|---|
| ≥ 0.80 | Almost perfect |
| 0.60 – 0.80 | Substantial |
| 0.40 – 0.60 | Moderate |
| 0.20 – 0.40 | Fair |
| < 0.20 | Roughly chance |
A metric can show a high % agreement and a κ near 0 at the same time. When they disagree, trust κ — the high % is the chance illusion described above.
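To make the gap between raw agreement and κ concrete, here is a minimal Python sketch of the standard two-rater Cohen’s κ formula, applied to a made-up skewed case (95 “yes” verdicts out of 100 for each rater). The function names, the sample data, and the handling of band boundaries are assumptions for illustration, not Anyreach’s implementation.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of items on which the two raters gave the same verdict."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label if each
    # keeps their own label frequencies but answers independently.
    p_e = sum((count_a[k] / n) * (count_b[k] / n) for k in set(a) | set(b))
    if p_e == 1.0:
        return 0.0  # assumption: report 0 when kappa is undefined (one label only)
    return (p_o - p_e) / (1 - p_e)


def band(kappa):
    """Map kappa to the bands in the table (boundary handling is an assumption)."""
    if kappa >= 0.80:
        return "Almost perfect"
    if kappa >= 0.60:
        return "Substantial"
    if kappa >= 0.40:
        return "Moderate"
    if kappa >= 0.20:
        return "Fair"
    return "Roughly chance"


# Hypothetical data: both raters say "yes" 95 times, but never agree on a "no".
human = [False] * 5 + [True] * 95
ai = [True] * 5 + [False] * 5 + [True] * 90

print(f"{percent_agreement(human, ai):.2f}")  # 0.90
print(f"{cohens_kappa(human, ai):.2f}")       # -0.05
print(band(cohens_kappa(human, ai)))          # Roughly chance
```

The two raters match on 90 of 100 conversations, yet the expected chance agreement is 0.905, so κ lands slightly below zero.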
| Measure | What it adds |
|---|---|
| Gwet AC1 | An alternative chance-corrected score that stays stable even when one outcome is extremely rare — useful when κ behaves erratically on very skewed criteria. |
| Krippendorff α | A general agreement coefficient that handles more than two raters and missing verdicts. |
| Prevalence | The share of verdicts that are “true.” Tells you how skewed the criterion is, which explains why κ and % can diverge. |
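As a rough illustration of why AC1 stays stable under skew, the sketch below applies Gwet’s published two-rater AC1 formula and a simple prevalence count to hypothetical data. The function names, the sample verdicts, and the choice to pool both raters’ verdicts when computing prevalence are assumptions, not a description of how Anyreach computes these values.

```python
from collections import Counter


def gwet_ac1(a, b, categories=(True, False)):
    """Gwet's AC1 for two raters over a fixed set of categories."""
    n = len(a)
    q = len(categories)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    counts = Counter(a) + Counter(b)
    # pi_k: average share of verdicts in category k across the two raters.
    pi = [counts[k] / (2 * n) for k in categories]
    # AC1's chance term shrinks as one category dominates.
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)
    return (p_o - p_e) / (1 - p_e)


def prevalence(a, b):
    """Share of "true" verdicts, pooled over both raters (an assumption)."""
    return (sum(a) + sum(b)) / (len(a) + len(b))


# Hypothetical skewed data: 95 "yes" verdicts per rater, 90 matching items.
human = [False] * 5 + [True] * 95
ai = [True] * 5 + [False] * 5 + [True] * 90

print(f"{gwet_ac1(human, ai):.2f}")    # 0.89
print(f"{prevalence(human, ai):.2f}")  # 0.95
```

On this data Cohen’s κ is close to zero while AC1 is about 0.89, and the prevalence of 0.95 explains the divergence: the criterion is heavily skewed.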
Two paths to a κ
You can measure agreement two ways. The quick path reuses runs you already have; the rigorous path freezes a controlled set.
Quick: the “AI vs Human” tab
Every evaluation has an AI vs Human tab. It computes a Gate-2 κ (AI-vs-human agreement, defined below) on the fly, over whatever AI and human runs already exist for that evaluation. There’s no setup: if you’ve run the AI judge on some conversations and a reviewer has also graded a few of them, this tab already has data. To keep the comparison fair, it:
- takes the most recent run per conversation for each assessor, so re-runs don’t double-count, and
- excludes `na` verdicts (a criterion that didn’t apply to a conversation can’t agree or disagree).
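The two fairness rules above can be sketched as a small filter. Everything here is hypothetical: the `Run` record, its field names, and the simplification to a single criterion per run are assumptions for illustration, as is dropping a pair when either side is `na`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    conversation_id: str
    assessor: str    # "ai" or "human" (hypothetical labels)
    created_at: int  # any sortable timestamp
    verdict: str     # "true", "false", or "na"


def latest_per_conversation(runs, assessor):
    """Keep only the most recent run per conversation for one assessor."""
    latest = {}
    for run in runs:
        if run.assessor != assessor:
            continue
        current = latest.get(run.conversation_id)
        if current is None or run.created_at > current.created_at:
            latest[run.conversation_id] = run
    return latest


def comparable_pairs(runs):
    """Pair AI and human verdicts per conversation, skipping "na" verdicts."""
    ai = latest_per_conversation(runs, "ai")
    human = latest_per_conversation(runs, "human")
    pairs = {}
    for cid in sorted(ai.keys() & human.keys()):
        a, h = ai[cid].verdict, human[cid].verdict
        if a == "na" or h == "na":
            continue  # a criterion that did not apply cannot agree or disagree
        pairs[cid] = (a, h)
    return pairs


runs = [
    Run("c1", "ai", 1, "false"),    # superseded by the re-run below
    Run("c1", "ai", 2, "true"),
    Run("c1", "human", 1, "true"),
    Run("c2", "ai", 1, "true"),
    Run("c2", "human", 1, "na"),    # excluded: criterion did not apply
    Run("c3", "ai", 1, "false"),    # excluded: no human verdict yet
]

print(comparable_pairs(runs))  # {'c1': ('true', 'true')}
```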
Rigorous: a Calibration set
When you’re ready to make a trust decision, build a Calibration set from the Calibration tab. A calibration set is a frozen list of conversations plus a list of raters. Freezing matters: everyone scores the exact same conversations, so the agreement number can’t drift as your traffic changes.
Pick the conversations
Choose a fixed set of conversations to grade. Aim for a spread of easy and hard cases so the κ reflects real difficulty.
Assign raters
A rater is anyone (or anything) whose verdicts you want to compare — typically one or more human reviewers plus the AI judge. Each rater grades every conversation in the set.
Let raters grade
Each rater’s grading creates a new, isolated run for every conversation — separate from any other evaluation’s runs. This isolation is the point: nobody is influenced by an existing verdict, so the agreement number is honest.
The three gates
Agreement is reported per gate. Each gate compares a different pair of rater types, and each answers a different trust question:

| Gate | Compares | Question it answers |
|---|---|---|
| Gate 1 | human ↔ human | Do your own reviewers even agree with each other? (If they don’t, the task is ill-defined — fix the criterion before blaming the AI.) |
| Gate 2 | AI ↔ human | Does the AI judge match your humans? This is the gate that justifies auto-scoring. |
| Proxy | internal rater ↔ customer rater | Does an internal reviewer stand in for the customer’s own judgment? |
With just one human rater and one AI rater, you’ll only see a Gate 2 number. Gate 1 needs two or more human raters to compare humans against each other, and the proxy gate needs both an internal and a customer rater. Those cards stay empty until you add the raters they require — that’s expected, not a bug.
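The rule for which cards have data can be expressed as a short check over the raters in a set. This is a sketch under assumptions: the `Rater` record, its `kind` and `side` fields, and the reading that the proxy gate compares human raters only are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rater:
    name: str
    kind: str  # "human" or "ai" (hypothetical labels)
    side: str  # "internal" or "customer" (hypothetical labels)


def available_gates(raters):
    """Report which gates have enough raters to produce a number."""
    humans = [r for r in raters if r.kind == "human"]
    has_ai = any(r.kind == "ai" for r in raters)
    human_sides = {r.side for r in humans}
    return {
        "gate_1": len(humans) >= 2,               # human vs human
        "gate_2": has_ai and len(humans) >= 1,    # AI vs human
        "proxy": {"internal", "customer"} <= human_sides,
    }


raters = [
    Rater("reviewer-1", "human", "internal"),
    Rater("judge", "ai", "internal"),
]

print(available_gates(raters))
# {'gate_1': False, 'gate_2': True, 'proxy': False}
```

Adding a second human rater on the customer side would make all three gates available in this sketch.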
Reading the agreement cards
Both the AI vs Human tab and the Calibration page show agreement cards. They come in two granularities:
- By metric: agreement pooled across all the criteria in a metric. This is the headline number for that metric.
- By criterion: agreement for a single criterion. Use this to find the one atom that’s dragging a metric’s κ down.
| On the card | How to read it |
|---|---|
| κ | The headline chance-corrected agreement. Compare it to the band table above. |
| AC1 / α | Companion coefficients — a cross-check on κ, especially on skewed criteria. |
| % agreement | Raw agreement. Informative, but never the deciding number. |
| prevalence | Share of “true” verdicts — tells you how skewed the criterion is. |
| threshold bar | A visual marker of where the κ sits relative to the bar it must clear to be considered trustworthy. |
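The two card granularities can be illustrated with a small sketch that computes κ per criterion and then over all criteria together. Treating “pooled” as concatenating every criterion’s verdict pairs before computing κ is an assumption, and the criterion names and verdicts are made up.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum((count_a[k] / n) * (count_b[k] / n) for k in set(a) | set(b))
    return 0.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)


# Hypothetical metric with two criteria; each maps to (human, ai) verdicts.
metric = {
    "polite": ([True, True, False, False], [True, True, False, False]),
    "accurate": ([True, False, True, False], [True, True, False, False]),
}

# By criterion: one kappa per criterion.
by_criterion = {name: cohens_kappa(h, a) for name, (h, a) in metric.items()}

# By metric: concatenate all verdict pairs, then compute a single kappa.
pooled_human = [v for h, _ in metric.values() for v in h]
pooled_ai = [v for _, a in metric.values() for v in a]
by_metric = cohens_kappa(pooled_human, pooled_ai)

print(by_criterion)  # {'polite': 1.0, 'accurate': 0.0}
print(by_metric)     # 0.5
print(min(by_criterion, key=by_criterion.get))  # accurate
```

Here the metric-level κ of 0.5 hides one criterion with perfect agreement and one at chance, which is the situation the by-criterion cards are meant to expose.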
What calibration does and doesn’t do
Computing agreement refreshes each target metric’s eligibility: its gates, its blockers, and the “what it would take” guidance you see on the scoring-mode badge. It does not flip any metric to auto-scoring on its own. Turning on auto-scoring is a separate, deliberate step. You’ll also need the right access: viewing agreement requires the `assessments:read` scope, and building calibration sets, assigning raters, and computing agreement require `assessments:manage`.
Next steps
Graduation
Once Gate 2 clears the threshold, turn on auto-scoring — explicitly.
Analytics
Use the assessor toggle and scoring-mode badges to read results with the right lens.

