- The AI judge scores conversations automatically and at scale. This is the everyday path — point it at one conversation or thousands and let it work.
- Human review lets a person grade a conversation themselves, to confirm or correct what the judge said. Those human verdicts are the ground truth that calibration later compares the judge against.
true, false, abstain, or na); and a run is one assessment of one conversation by one assessor. If any of that is fuzzy, read Core concepts first.
Triggering runs and grading conversations both require the assessments:manage scope. With only assessments:read you can view results but not produce them.
Part A: Run the AI judge
The AI judge is a top-tier LLM that reads a conversation and answers each criterion’s yes/no question for you. You start it from the Trigger Run dialog.
Open the Trigger Run dialog
From the Analytics section, open Trigger Run. This is where you tell Anyreach what to score and which conversations to score it against.
Pick an evaluation
Choose the evaluation you want to run. The evaluation decides which metrics and criteria get scored, and it carries the instructions the judge will read as context (more on that below).
Choose a source
Pick what the judge should read. You have three options:

| Source | Use it when | Result |
|---|---|---|
| Conversations | You want to score real interactions from your inbox | One run per selected conversation |
| Transcript | You want to test against text you paste in by hand | A single run over that text |
| Audio URL | You want to score a recording by link | A single run over that audio |

For Conversations, you can select one or many. Selecting more than one is a batch: each conversation becomes its own independent run, and the runs are processed in parallel rather than as one combined score. That keeps every conversation’s verdicts separate and comparable.
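To make the batch behaviour concrete, here is a minimal Python sketch of the fan-out: one independent run per selected conversation, processed in parallel. It is an illustration only, not Anyreach's API. The names `Run`, `score_conversation`, and `trigger_batch`, and the example IDs, are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Run:
    """One assessment of one conversation (hypothetical shape)."""
    conversation_id: str
    evaluation_id: str
    status: str  # "pending", "completed", or "failed"


def score_conversation(evaluation_id: str, conversation_id: str) -> Run:
    """Hypothetical stand-in for a single judge run over one conversation."""
    return Run(conversation_id, evaluation_id, status="completed")


def trigger_batch(evaluation_id: str, conversation_ids: list[str]) -> list[Run]:
    """Fan a batch out into one independent run per conversation."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() preserves input order, so runs line up with conversation_ids.
        return list(
            pool.map(lambda cid: score_conversation(evaluation_id, cid), conversation_ids)
        )


runs = trigger_batch("eval-quality", ["conv-1", "conv-2", "conv-3"])
for run in runs:
    print(run.conversation_id, run.status)
```

Because each conversation gets its own `Run`, no result depends on any other conversation in the batch, which is what keeps the verdicts separate and comparable.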
What the judge does for each criterion
For every criterion in the evaluation’s metrics, the judge reads three things (the criterion question, any na rule attached to it, and the evaluation’s instructions) and then returns:

- an outcome (true, false, abstain, or na),
- a verbatim quote from the conversation that supports the verdict,
- a confidence, and
- short reasoning.
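The four parts of a verdict listed above can be pictured as a small record. The following Python sketch is a hypothetical model, not Anyreach's actual schema: the field names, the 0 to 1 confidence range, and the `quote_is_verbatim` helper are assumptions made for illustration.

```python
from dataclasses import dataclass

# The four outcomes named in this guide.
OUTCOMES = {"true", "false", "abstain", "na"}


@dataclass(frozen=True)
class Verdict:
    """Hypothetical shape of one criterion verdict."""
    criterion: str
    outcome: str
    quote: str
    confidence: float  # assumed to lie in [0, 1]; the real scale may differ
    reasoning: str

    def __post_init__(self):
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


def quote_is_verbatim(verdict: Verdict, transcript: str) -> bool:
    """Check that the supporting quote appears word for word in the transcript."""
    return verdict.quote in transcript


transcript = "Agent: Your refund will arrive in 3 to 5 business days."
verdict = Verdict(
    criterion="Did the agent state the refund timeline?",
    outcome="true",
    quote="Your refund will arrive in 3 to 5 business days.",
    confidence=0.92,
    reasoning="The agent gave an explicit timeline.",
)
print(quote_is_verbatim(verdict, transcript))
```

A substring check like this is one simple way to confirm that a quote really is verbatim when you spot-check a verdict.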
For an audio-modality metric, if the conversation has no recording the judge can’t listen to anything, so the verdict auto-resolves to na (not applicable) rather than guessing.
Run statuses
Each run moves through a status you can watch in the Runs tab:

| Status | Meaning |
|---|---|
| pending | Queued or in progress; the judge hasn’t finished yet. |
| completed | The run finished and produced verdicts. |
| failed | The run couldn’t complete (for example, a processing error). |
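As a rough illustration of working with these three statuses, the Python sketch below groups a batch's runs by status and picks out the failed conversations, which are the ones a re-trigger would target. The `RunStatus` enum, the `group_by_status` helper, and the example data are hypothetical and do not come from Anyreach.

```python
from collections import defaultdict
from enum import Enum


class RunStatus(str, Enum):
    PENDING = "pending"
    COMPLETED = "completed"
    FAILED = "failed"


def group_by_status(runs):
    """Group (conversation_id, status) pairs by their run status."""
    groups = defaultdict(list)
    for conversation_id, status in runs:
        # RunStatus(status) raises ValueError on a status outside the table.
        groups[RunStatus(status)].append(conversation_id)
    return groups


batch = [
    ("conv-1", "completed"),
    ("conv-2", "failed"),
    ("conv-3", "pending"),
    ("conv-4", "failed"),
]
groups = group_by_status(batch)
to_retry = groups[RunStatus.FAILED]
print(to_retry)
```

Completed runs are left untouched by this split, so their results stay as they are while only the failed subset is retried.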
Batches and partial failures
When you run a batch and some conversations fail while others succeed, Anyreach doesn’t make you start over. The failed conversations stay selected in the Trigger Run dialog, so you can fix the cause and re-trigger just those, while the ones that already completed keep their results.
Part B: Review or correct by hand
The AI judge is fast, but you don’t have to take its word for it. A human run is the same kind of run, produced by a person instead of the model. You use human runs to spot-check the judge, to score metrics the judge isn’t trusted on yet, and to build the ground-truth verdicts that calibration measures the judge against.
Grade-by-exception
Grading a conversation yourself is designed to be quick. It’s grade-by-exception: every criterion starts on the compliant answer, the one that means “no problem here”, and you only change the criteria where something actually went wrong. You’re flagging exceptions, not filling out every field from scratch.
The compliant answer is not always true. For a criterion that asks whether the agent gave wrong information, the compliant answer is false, so leaving it on the compliant default means you’re asserting the agent did not give wrong information.
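Grade-by-exception can be sketched as "start from the compliant defaults, then overlay only the exceptions". The Python below is a minimal illustration under that reading. The criterion keys and the `grade_by_exception` helper are hypothetical, not part of the product.

```python
# Hypothetical criteria mapped to their compliant ("no problem here") answer.
# Note that the compliant answer can be "false" for a negatively phrased criterion.
COMPLIANT_ANSWERS = {
    "agent_greeted_customer": "true",
    "agent_gave_wrong_information": "false",
    "agent_confirmed_next_steps": "true",
}


def grade_by_exception(compliant_answers, exceptions):
    """Return a full set of verdicts: defaults everywhere, overrides where flagged."""
    unknown = set(exceptions) - set(compliant_answers)
    if unknown:
        raise KeyError(f"unknown criteria: {sorted(unknown)}")
    verdicts = dict(compliant_answers)  # copy, so the defaults are not mutated
    verdicts.update(exceptions)
    return verdicts


# The reviewer only flags the one thing that went wrong.
human_run = grade_by_exception(
    COMPLIANT_ANSWERS,
    {"agent_confirmed_next_steps": "false"},
)
print(human_run)
```

The reviewer touches one criterion here, yet the resulting run still carries a verdict for every criterion, which is why untouched defaults count as real assertions.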
Confirming or correcting an AI run
A human run can stand on its own, but it can also confirm or correct an existing AI run on the same conversation. When you agree with the judge, your human verdicts line up with its verdicts; when you disagree, they diverge. That agreement (and disagreement) is exactly what calibration measures later: it’s how Anyreach decides whether the judge is trustworthy enough to score a metric automatically. The more honest human verdicts you record, the more reliable that measurement becomes.
Both AI and human runs can exist for the same conversation and the same metric at once. On the Analytics overview, the assessor toggle (“AI + Human”, “AI judge”, “Human”) controls whose verdicts you’re looking at.
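Calibration compares AI and human runs with Cohen’s κ, which corrects raw agreement for the agreement two assessors would reach by chance: κ = (p_o − p_e) / (1 − p_e). The Python sketch below computes it for paired outcomes on the same criterion. It is a textbook calculation with made-up data, not Anyreach's implementation. In particular, how the product treats abstain and na verdicts during calibration is not covered here, so this sketch simply treats every outcome as an ordinary label.

```python
from collections import Counter


def cohens_kappa(ai_outcomes, human_outcomes):
    """Cohen's kappa for two equal-length lists of categorical outcomes."""
    if len(ai_outcomes) != len(human_outcomes) or not ai_outcomes:
        raise ValueError("need two equal-length, non-empty lists")
    n = len(ai_outcomes)

    # p_o: share of items where the two assessors gave the same outcome.
    observed = sum(a == h for a, h in zip(ai_outcomes, human_outcomes)) / n

    # p_e: agreement expected by chance, from each assessor's label frequencies.
    ai_counts = Counter(ai_outcomes)
    human_counts = Counter(human_outcomes)
    expected = sum(ai_counts[label] * human_counts[label] for label in ai_counts) / (n * n)

    if expected == 1.0:
        return 1.0  # both assessors used one identical label throughout
    return (observed - expected) / (1 - expected)


ai = ["true", "true", "false", "true", "false", "true", "true", "false"]
human = ["true", "true", "false", "false", "false", "true", "true", "true"]
kappa = cohens_kappa(ai, human)
print(round(kappa, 3))  # 0.467
```

In this example the assessors agree on 6 of 8 items (p_o = 0.75), but with these label frequencies chance alone would produce p_e = 0.53125, so κ comes out near 0.467 rather than 0.75.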
Next steps
Read your results
See compliant rates per metric, filter by assessor and date, and drill into the exceptions.
Calibrate the judge
Compare AI and human runs with Cohen’s κ to decide whether the judge can be trusted.

