Graduation & auto-scoring

A metric measures one quality of a conversation (for example Task Resolution) by grouping one or more yes/no criteria. You can have those criteria scored two ways: by the AI judge (a top-tier LLM that reads the conversation and answers each question) or by a human reviewer grading by hand. The AI judge is fast and cheap, but you shouldn’t trust it to score a metric automatically until you’ve proven it agrees with your humans on that metric. Graduation is the moment you flip that switch — you tell Anyreach “the AI judge is trustworthy for this metric, let it score automatically.” This page explains the badge that tracks where each metric stands, and the flow for graduating one. You’ll find all of this in the dashboard under the Analytics section.

The scoring-mode badge

Every metric shows a small scoring mode badge (also called scored-by). It has three values:

Badge	Meaning
human_only	The AI judge is not yet trusted to score this metric on its own. It hasn’t passed calibration.
hybrid	The AI judge scores this metric automatically, with humans spot-checking a sample to keep it honest.
auto	The AI judge is fully trusted and scores this metric automatically.

A metric starts at human_only and moves “up” to hybrid or auto only when you graduate it.

The badge is a capability, not a view. It answers “is the AI trusted to auto-score this metric?” It does not mean you’re only looking at AI verdicts (or only human verdicts) right now. Both AI and human runs can exist for a human_only metric at the same time. What you’re viewing is controlled by the assessor toggle (“AI + Human” / “AI judge” / “Human”) on the Analytics overview, not by this badge. Keep the two ideas separate: the toggle chooses which verdicts you look at; the badge describes scoring capability.
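To make the badge/toggle split concrete, here is a minimal sketch that models them as two independent values. The class names, the `Run` fields, and the `"ai"` / `"human"` assessor strings are all hypothetical, chosen for illustration; they are not Anyreach's actual data model.

```python
from dataclasses import dataclass
from enum import Enum


class ScoringMode(Enum):
    # The three badge values (a capability of the metric).
    HUMAN_ONLY = "human_only"
    HYBRID = "hybrid"
    AUTO = "auto"


class AssessorToggle(Enum):
    # The three toggle positions (a view over existing verdicts).
    AI_AND_HUMAN = "AI + Human"
    AI_JUDGE = "AI judge"
    HUMAN = "Human"


@dataclass(frozen=True)
class Run:
    conversation_id: str
    assessor: str  # hypothetical values: "ai" or "human"
    compliant: bool


def visible_runs(runs, toggle):
    """Filter runs by the toggle only; the badge plays no part here."""
    if toggle is AssessorToggle.AI_AND_HUMAN:
        return list(runs)
    wanted = "ai" if toggle is AssessorToggle.AI_JUDGE else "human"
    return [r for r in runs if r.assessor == wanted]


# A human_only metric can still hold both kinds of run.
mode = ScoringMode.HUMAN_ONLY
runs = [
    Run("c1", "ai", True),
    Run("c1", "human", False),
    Run("c2", "human", True),
]
ai_only = visible_runs(runs, AssessorToggle.AI_JUDGE)
```

Note that `visible_runs` never reads `mode`: changing the toggle changes what you see, while changing the badge changes who is trusted to score.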

Why graduation exists

You could just let the AI judge score everything from day one — but then you’d have no idea whether its verdicts match what a careful human would say. Some metrics are easy for an LLM (“Did the agent greet the caller?”); others are subtle and the judge gets them wrong often enough to mislead your dashboards. Graduation forces a check: prove the judge agrees with humans on a given metric before you rely on it. That proof comes from calibration — running the AI judge and your human reviewers over the same conversations and measuring how often they land on the same verdict. See Calibration & agreement for the full picture.

The graduation flow

Collect AI and human verdicts on the same conversations

Run the AI judge over a set of conversations, and have a reviewer grade those same conversations by hand. You now have two opinions per conversation — the judge’s and a human’s — which is what agreement is measured from. (Human grading is “grade-by-exception”: each criterion defaults to the compliant answer and you only flag the exceptions.)
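As a rough illustration of grade-by-exception, the sketch below defaults every criterion to compliant and flips only the ones the reviewer flags, then pairs the result with the judge's verdicts. The function names and criterion names are hypothetical; this is not the product's API.

```python
def grade_by_exception(criteria, exceptions):
    """Return {criterion: compliant?}, defaulting every criterion to compliant."""
    unknown = set(exceptions) - set(criteria)
    if unknown:
        raise ValueError(f"unknown criteria flagged: {sorted(unknown)}")
    return {name: name not in exceptions for name in criteria}


def pair_verdicts(ai, human):
    """Pair AI and human verdicts for the criteria both sides scored."""
    return [(ai[name], human[name]) for name in ai if name in human]


criteria = ["greeted_caller", "resolved_task", "confirmed_next_steps"]

# The reviewer only flags the one criterion that was not met.
human = grade_by_exception(criteria, exceptions={"resolved_task"})

# Hypothetical AI judge verdicts for the same conversation.
ai = {"greeted_caller": True, "resolved_task": True, "confirmed_next_steps": True}

pairs = pair_verdicts(ai, human)
```

The resulting `pairs` list is the raw material for agreement: one (AI, human) verdict pair per criterion per conversation.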

Compute agreement

Measure how well the AI and human verdicts line up. Two ways:

Quick: open the AI vs Human tab on an evaluation. It computes agreement on the fly over the AI and human runs that already exist — no setup.
Rigorous: build a calibration set — a frozen set of conversations graded by assigned raters as fresh, isolated runs — and compute agreement there.

Agreement is reported as Cohen’s κ (kappa), not raw percent agreement, because on a skewed metric (one where the answer is almost always the same) two graders can reach a high raw percent by chance alone while κ stays near zero. κ corrects for that chance agreement, so a higher κ means more real agreement.
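The difference between raw percent and κ is easiest to see with numbers. The sketch below implements the standard Cohen's κ formula for yes/no verdicts; it is illustrative only and is not claimed to be the computation Anyreach runs internally. The sample data is made up.

```python
def raw_agreement(ai, human):
    """Fraction of conversations where both graders gave the same verdict."""
    return sum(a == h for a, h in zip(ai, human)) / len(ai)


def cohens_kappa(ai, human):
    """Cohen's kappa for two equal-length lists of yes/no (bool) verdicts."""
    if len(ai) != len(human) or not ai:
        raise ValueError("need two non-empty verdict lists of equal length")
    n = len(ai)
    observed = raw_agreement(ai, human)
    p_ai_yes = sum(ai) / n
    p_human_yes = sum(human) / n
    # Agreement expected by chance if the two graders were independent.
    expected = p_ai_yes * p_human_yes + (1 - p_ai_yes) * (1 - p_human_yes)
    if expected == 1.0:
        # Both graders gave the same constant answer; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


# Skewed metric: the human says "yes" 19 times out of 20,
# and the judge simply says "yes" every time.
human_skewed = [True] * 19 + [False]
ai_skewed = [True] * 20

# Balanced metric where the judge tracks the human on 3 of 4 conversations.
ai_balanced = [True, True, False, False]
human_balanced = [True, True, False, True]
```

On the skewed data, raw agreement is 0.95 but κ is 0: the judge never disagreed with the obvious answer, yet it also never caught the one exception. On the balanced data, raw agreement is 0.75 and κ is 0.5.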

Computing agreement on a calibration set refreshes each target metric’s eligibility — its gates, its blockers, and its “what it would take” guidance. It does not graduate anything automatically. Refreshing eligibility and graduating are two separate steps.

Check whether the metric is ELIGIBLE

If the measured gates pass — for example Gate-2 κ (AI↔human agreement) at or above the threshold — the metric shows as ELIGIBLE to graduate. If a gate falls short, the badge popover tells you which one and what it would take to clear it (see below).
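One way to picture an eligibility check is as a function that returns blockers and guidance rather than changing any state. The sketch below assumes a κ threshold of 0.6 and a minimum of 20 human runs; both numbers, and every name in the code, are hypothetical placeholders, not documented Anyreach values.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Eligibility:
    eligible: bool
    blockers: List[str] = field(default_factory=list)
    what_it_would_take: List[str] = field(default_factory=list)


def check_eligibility(
    gate2_kappa: Optional[float],
    human_runs: int,
    kappa_threshold: float = 0.6,  # assumed threshold, for illustration
    min_human_runs: int = 20,      # assumed minimum, for illustration
) -> Eligibility:
    """Report gate status only; this never changes the scoring mode."""
    blockers: List[str] = []
    guidance: List[str] = []
    if human_runs < min_human_runs:
        blockers.append("not enough human runs yet")
        guidance.append(f"grade {min_human_runs - human_runs} more conversations")
    if gate2_kappa is None:
        blockers.append("Gate-2 kappa not measured yet")
        guidance.append("compute agreement on a calibration set")
    elif gate2_kappa < kappa_threshold:
        blockers.append("Gate-2 kappa below threshold")
        guidance.append(f"raise Gate-2 kappa to {kappa_threshold}")
    return Eligibility(not blockers, blockers, guidance)


status = check_eligibility(gate2_kappa=0.41, human_runs=12)
```

Here `status.eligible` is `False`, with two blockers and matching guidance. Only Gate 2 and a run count are modelled; the other gates would follow the same pattern.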

Graduate it explicitly

Open the metric’s Status / Graduation panel and click Graduate. This is graduate-on-click: nothing flips automatically. When you graduate, the metric’s scoring mode flips from human_only to auto or hybrid, and from then on the AI judge scores it without waiting for a human.

Graduating doesn’t lock you out of human review. You can still grade conversations by hand on a graduated metric — a human run can confirm or correct an AI run at any time.

Golden labels

When you graduate, Anyreach runs a regression check against golden labels — a held-out set of conversations whose correct verdicts are already known. Think of them as an answer key the AI judge never trains on. If the judge still gets the golden labels right, you have evidence it isn’t just memorizing the conversations you calibrated on. Golden labels guard against a metric “passing” calibration but quietly failing on cases you didn’t look at.
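A golden-label regression check can be sketched as comparing the judge's verdicts against a held-out answer key. The pass criterion is not specified on this page, so the `min_accuracy` parameter below is an assumption, as are the function name and the dictionary shapes.

```python
def golden_regression(judge_verdicts, golden_labels, min_accuracy=1.0):
    """Compare judge verdicts with a held-out answer key.

    Both arguments map conversation id -> bool verdict.
    Returns (passed, mismatched_ids). min_accuracy is an assumed knob.
    """
    if not golden_labels:
        raise ValueError("golden label set is empty")
    missing = sorted(cid for cid in golden_labels if cid not in judge_verdicts)
    if missing:
        raise KeyError(f"judge has no verdict for: {missing}")
    mismatches = sorted(
        cid for cid, expected in golden_labels.items()
        if judge_verdicts[cid] != expected
    )
    accuracy = 1 - len(mismatches) / len(golden_labels)
    return accuracy >= min_accuracy, mismatches


# Hypothetical answer key and judge output for three held-out conversations.
golden = {"g1": True, "g2": False, "g3": True}
judge = {"g1": True, "g2": True, "g3": True}

passed, mismatches = golden_regression(judge, golden)
```

With the strict default, one wrong answer on `g2` fails the check, and the returned ids show which held-out conversations to inspect.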

Demotion: losing trust is automatic

Trust can be revoked without a click. If a graduated metric’s gates later fail — say agreement drifts below the threshold as you collect more human grades — the metric is demoted automatically back toward human_only. The asymmetry is deliberate: granting trust requires a human to click Graduate, but losing trust needs no confirmation. It’s always safe to stop auto-scoring a metric the moment it stops agreeing with your humans.
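The asymmetry between graduating and demoting can be sketched as a small state machine: refreshing eligibility may demote a metric on its own, but it never promotes one. The class below is a hypothetical model, not Anyreach's implementation. In particular, it demotes straight to human_only, which is a simplifying assumption, since this page only says the metric moves back "toward" human_only.

```python
class Metric:
    """Toy model of graduate-on-click plus automatic demotion."""

    def __init__(self, name):
        self.name = name
        self.mode = "human_only"
        self.eligible = False

    def refresh_eligibility(self, gates_pass):
        """Record the latest gate result; may demote, never promotes."""
        self.eligible = gates_pass
        if not gates_pass and self.mode != "human_only":
            # Losing trust needs no confirmation (assumed: straight to human_only).
            self.mode = "human_only"

    def graduate(self, target="auto"):
        """The explicit click: only works while the metric is eligible."""
        if target not in ("auto", "hybrid"):
            raise ValueError("target must be 'auto' or 'hybrid'")
        if not self.eligible:
            raise RuntimeError(f"{self.name} is not eligible to graduate")
        self.mode = target


metric = Metric("Task Resolution")
metric.refresh_eligibility(gates_pass=True)
mode_after_refresh = metric.mode   # still "human_only": eligibility alone flips nothing
metric.graduate("hybrid")
mode_after_click = metric.mode     # "hybrid": changed only by the explicit call
metric.refresh_eligibility(gates_pass=False)
mode_after_drift = metric.mode     # "human_only": demoted without a click
```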

Deterministic metrics certify themselves

Some criteria are deterministic: hard, reproducible rules rather than LLM judgment calls. A metric whose criteria are all deterministic is reproducible by construction: it returns the same verdict every time and there’s no judge to calibrate. These metrics auto-certify on creation; they need no calibration set and no graduation step. Graduation only matters for metrics that contain at least one judged criterion.

Reading the badge popover

Click the scoring-mode badge to open its popover. It explains, in plain terms, where the metric stands:

Gates — which measured gates (Gate 1 human↔human, Gate 2 AI↔human, Proxy internal↔customer) apply and whether each passes.
Blockers — what’s currently stopping the metric from graduating (for example, not enough human runs yet, or κ below threshold).
What it would take — the concrete next action to become eligible (for example, “grade 8 more conversations” or “raise Gate-2 κ to 0.6”).
Manual override — where you graduate the metric once it’s eligible, or where its mode is shown if it’s already auto/hybrid.

Use the popover as your checklist: it tells you exactly why a metric isn’t auto-scoring yet and what to do about it.

Permissions

Viewing scoring modes needs the assessments:read scope. Graduating a metric — like creating, editing, or triggering evaluations — needs assessments:manage.
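The scope rules above amount to a lookup from action to required scope. The sketch below uses the two scope strings from this page; the action names and the `is_allowed` helper are hypothetical. It deliberately does not assume that assessments:manage implies assessments:read, since this page does not say so.

```python
# Scope strings are from this page; the action names are made up for illustration.
REQUIRED_SCOPE = {
    "view_scoring_mode": "assessments:read",
    "graduate_metric": "assessments:manage",
    "create_evaluation": "assessments:manage",
    "edit_evaluation": "assessments:manage",
    "trigger_evaluation": "assessments:manage",
}


def is_allowed(action, granted_scopes):
    """True if the caller holds the scope this action requires."""
    return REQUIRED_SCOPE[action] in granted_scopes


reader = {"assessments:read"}
can_view = is_allowed("view_scoring_mode", reader)
can_graduate = is_allowed("graduate_metric", reader)
```

A read-only caller can see every badge but cannot graduate a metric.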

Next steps

Calibration & agreement

How agreement, κ, and the gates that decide eligibility actually work.

Analytics overview

Read compliant rates and switch whose verdicts you’re viewing with the assessor toggle.
