The scoring-mode badge
Every metric shows a small scoring mode badge (also called scored-by). It has three values:

| Badge | Meaning |
|---|---|
| human_only | The AI judge is not yet trusted to score this metric on its own. It hasn’t passed calibration. |
| hybrid | The AI judge scores this metric automatically, with humans spot-checking a sample to keep it honest. |
| auto | The AI judge is fully trusted and scores this metric automatically. |
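
As a rough illustration, the three badge values can be modeled as an enumeration. Only the three string values come from the table above; the class and helper names are hypothetical and are not part of any Anyreach API.

```python
from enum import Enum


class ScoringMode(str, Enum):
    """The three badge values from the table (names here are illustrative)."""
    HUMAN_ONLY = "human_only"
    HYBRID = "hybrid"
    AUTO = "auto"


def judge_scores_automatically(mode: ScoringMode) -> bool:
    # In hybrid and auto, the AI judge scores the metric automatically.
    return mode in (ScoringMode.HYBRID, ScoringMode.AUTO)


def humans_still_review(mode: ScoringMode) -> bool:
    # human_only relies on human grading; hybrid keeps human spot-checks on a sample.
    return mode in (ScoringMode.HUMAN_ONLY, ScoringMode.HYBRID)
```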
Why graduation exists
You could just let the AI judge score everything from day one, but then you’d have no idea whether its verdicts match what a careful human would say. Some metrics are easy for an LLM (“Did the agent greet the caller?”); others are subtle, and the judge gets them wrong often enough to mislead your dashboards. Graduation forces a check: prove the judge agrees with humans on a given metric before you rely on it. That proof comes from calibration: running the AI judge and your human reviewers over the same conversations and measuring how often they land on the same verdict. See Calibration & agreement for the full picture.
The graduation flow
Collect AI and human verdicts on the same conversations
Run the AI judge over a set of conversations, and have a reviewer grade those same conversations by hand. You now have two opinions per conversation — the judge’s and a human’s — which is what agreement is measured from. (Human grading is “grade-by-exception”: each criterion defaults to the compliant answer and you only flag the exceptions.)
Compute agreement
Measure how well the AI and human verdicts line up. Two ways:
- Quick: open the AI vs Human tab on an evaluation. It computes agreement on the fly over the AI and human runs that already exist — no setup.
- Rigorous: build a calibration set — a frozen set of conversations graded by assigned raters as fresh, isolated runs — and compute agreement there.
Computing agreement on a calibration set refreshes each target metric’s eligibility — its gates, its blockers, and its “what it would take” guidance. It does not graduate anything automatically. Refreshing eligibility and graduating are two separate steps.
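
To make the agreement number concrete, here is a minimal sketch of Cohen's κ for two raters. It assumes that Cohen's κ is the statistic behind the κ gates, which this page does not state explicitly (see Calibration & agreement). The function name and the verdict labels are illustrative.

```python
from collections import Counter


def cohen_kappa(ai_verdicts, human_verdicts):
    """Cohen's kappa for two raters who labeled the same conversations."""
    if len(ai_verdicts) != len(human_verdicts) or not ai_verdicts:
        raise ValueError("need two non-empty verdict lists of equal length")

    n = len(ai_verdicts)
    # Observed agreement: share of conversations where both verdicts match.
    observed = sum(a == h for a, h in zip(ai_verdicts, human_verdicts)) / n

    # Chance agreement: product of each rater's label frequencies, summed over labels.
    ai_counts = Counter(ai_verdicts)
    human_counts = Counter(human_verdicts)
    expected = sum(ai_counts[label] * human_counts[label] for label in ai_counts) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts for eight conversations.
ai = ["compliant", "compliant", "non_compliant", "compliant",
      "non_compliant", "compliant", "compliant", "non_compliant"]
human = ["compliant", "compliant", "non_compliant", "non_compliant",
         "non_compliant", "compliant", "compliant", "compliant"]
kappa = cohen_kappa(ai, human)
```

In this example the two raters agree on 6 of 8 conversations (75%), yet κ comes out at about 0.47, because a good part of that agreement would be expected by chance alone. That gap is why a chance-corrected statistic is a stricter test than raw agreement.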
Check whether the metric is ELIGIBLE
If the measured gates pass — for example Gate-2 κ (AI↔human agreement) at or above the threshold — the metric shows as ELIGIBLE to graduate. If a gate falls short, the badge popover tells you which one and what it would take to clear it (see below).
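
The eligibility rule described above can be sketched as a check that every measured gate meets its threshold. This is a simplified model and not Anyreach's implementation: the `Gate` class, the gate names, and the 0.6 threshold are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Gate:
    name: str                   # hypothetical identifier, e.g. "gate_2_ai_human"
    measured: Optional[float]   # None means the gate has not been measured yet
    threshold: float            # assumed threshold; real values may differ

    @property
    def passes(self) -> bool:
        # "At or above the threshold" counts as passing.
        return self.measured is not None and self.measured >= self.threshold


def is_eligible(gates: List[Gate]) -> bool:
    # A metric is eligible only if at least one gate applies and all of them pass.
    return bool(gates) and all(g.passes for g in gates)


def failing_gates(gates: List[Gate]) -> List[str]:
    # Names of the gates that currently fall short.
    return [g.name for g in gates if not g.passes]
```

Note that eligibility in this sketch only reports a state. Consistent with the flow above, nothing here changes the metric's scoring mode.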
Golden labels
When you graduate, Anyreach runs a regression check against golden labels: a held-out set of conversations whose correct verdicts are already known. Think of them as an answer key the AI judge never trains on. If the judge still gets the golden labels right, you have evidence it isn’t just memorizing the conversations you calibrated on. Golden labels guard against a metric “passing” calibration but quietly failing on cases you didn’t look at.
Demotion: losing trust is automatic
Trust can be revoked without a click. If a graduated metric’s gates later fail (say, agreement drifts below the threshold as you collect more human grades), the metric is demoted automatically back toward human_only. The asymmetry is deliberate: granting trust requires a human to click Graduate, but losing trust needs no confirmation. It’s always safe to stop auto-scoring a metric the moment it stops agreeing with your humans.
Deterministic metrics certify themselves
Some criteria are deterministic: hard, reproducible rules rather than LLM judgment calls. A metric whose criteria are all deterministic is reproducible by construction: it returns the same verdict every time and there’s no judge to calibrate. These metrics auto-certify on creation; they need no calibration set and no graduation step. Graduation only matters for metrics that contain at least one judged criterion.
Reading the badge popover
Click the scoring-mode badge to open its popover. It explains, in plain terms, where the metric stands:
- Gates: which measured gates (Gate 1 human↔human, Gate 2 AI↔human, Proxy internal↔customer) apply and whether each passes.
- Blockers: what’s currently stopping the metric from graduating (for example, not enough human runs yet, or κ below threshold).
- What it would take: the concrete next action to become eligible (for example, “grade 8 more conversations” or “raise Gate-2 κ to 0.6”).
- Manual override: where you graduate the metric once it’s eligible, or where its mode is shown if it’s already auto/hybrid.
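
One way to picture how “what it would take” guidance could follow from the blockers is the sketch below. It is purely illustrative: the function, its parameters, and the minimum-grades requirement are hypothetical, and the real popover may derive its messages differently.

```python
from typing import List, Optional


def what_it_would_take(
    graded_conversations: int,
    min_graded_conversations: int,   # assumed minimum sample size, not a documented value
    gate2_kappa: Optional[float],    # None if Gate 2 has not been measured yet
    kappa_threshold: float,          # assumed threshold for Gate 2
) -> List[str]:
    """Turn current blockers into concrete next actions (illustrative only)."""
    actions = []

    # Blocker: not enough human grades yet.
    missing = min_graded_conversations - graded_conversations
    if missing > 0:
        actions.append(f"grade {missing} more conversations")

    # Blocker: measured AI-vs-human agreement is below the threshold.
    if gate2_kappa is not None and gate2_kappa < kappa_threshold:
        actions.append(f"raise Gate-2 κ to {kappa_threshold}")

    return actions


# Hypothetical metric: 12 of 20 required grades, κ measured at 0.48.
guidance = what_it_would_take(12, 20, 0.48, 0.6)
```

An empty list in this sketch would mean no blockers remain, which corresponds to the metric showing as ELIGIBLE.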
Permissions
Viewing scoring modes needs the assessments:read scope. Graduating a metric, like creating, editing, or triggering evaluations, needs assessments:manage.
Next steps
Calibration & agreement
How agreement, κ, and the gates that decide eligibility actually work.
Analytics overview
Read compliant rates and switch whose verdicts you’re viewing with the assessor toggle.

