Running an evaluation

Once you’ve built an evaluation, running it is how you actually get scores. An evaluation is a reusable bundle of metrics plus the judge configuration and a free-text instructions field. When you run it, Anyreach produces verdicts for every criterion in those metrics, against the conversations you choose. There are two ways to get verdicts, and you’ll use both:

The AI judge scores conversations automatically and at scale. This is the everyday path — point it at one conversation or thousands and let it work.
Human review lets a person grade a conversation themselves, to confirm or correct what the judge said. Those human verdicts are the ground truth that calibration later compares the judge against.

Everything here lives in the dashboard under the Analytics section. A quick reminder of the vocabulary, in case you’re new: a metric is a thing you measure (like “Task Resolution”); a criterion is one yes/no check inside it; an outcome is the verdict for that check (true, false, abstain, or na); and a run is one assessment of one conversation by one assessor. If any of that is fuzzy, read Core concepts first.

Triggering runs and grading conversations both require the assessments:manage scope. With only assessments:read you can view results but not produce them.
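The scope rule above can be sketched as a small permission check. This is an illustration only: the function names are hypothetical, and treating assessments:manage as also granting view access is an assumption this page does not confirm.

```python
# Scope names are taken from this page; everything else is hypothetical.
MANAGE_SCOPE = "assessments:manage"
READ_SCOPE = "assessments:read"


def can_produce_results(scopes: set[str]) -> bool:
    """Triggering runs and grading conversations need the manage scope."""
    return MANAGE_SCOPE in scopes


def can_view_results(scopes: set[str]) -> bool:
    """Viewing needs read; assuming (unconfirmed) that manage also allows it."""
    return READ_SCOPE in scopes or MANAGE_SCOPE in scopes
```

With only the read scope, `can_view_results` is true and `can_produce_results` is false, which matches the behaviour described above.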

Part A — Run the AI judge

The AI judge is a large language model (LLM) that reads a conversation and answers each criterion’s yes/no question for you. You start it from the Trigger Run dialog.

Open the Trigger Run dialog

From the Analytics section, open Trigger Run. This is where you tell Anyreach what to score and which conversations to score it against.

Pick an evaluation

Choose the evaluation you want to run. The evaluation decides which metrics and criteria get scored, and it carries the instructions the judge will read as context (more on that below).

Choose a source

Pick what the judge should read. You have three options:

Source	Use it when	Result
Conversations	You want to score real interactions from your inbox	One run per selected conversation
Transcript	You want to test against text you paste in by hand	A single run over that text
Audio URL	You want to score a recording by link	A single run over that audio

For Conversations, you can select one or many. Selecting more than one is a batch: each conversation becomes its own independent run, the runs are processed in parallel, and each one is scored on its own rather than rolled into a single combined score. That keeps every conversation’s verdicts separate and comparable.
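To make the "one run per conversation" rule concrete, here is a minimal sketch of how a batch fans out. The `RunRequest` type, its field names, and `plan_runs` are hypothetical and only model the behaviour described above; they are not Anyreach's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRequest:
    """Hypothetical model of one queued run."""
    evaluation_id: str
    conversation_id: str


def plan_runs(evaluation_id: str, conversation_ids: list[str]) -> list[RunRequest]:
    """A batch of N conversations becomes N independent runs, never one merged run."""
    return [RunRequest(evaluation_id, cid) for cid in conversation_ids]
```

Selecting three conversations therefore plans three runs, each with its own verdicts.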

Trigger the run

Confirm, and the runs are queued. The judge works asynchronously, so the dialog doesn’t block — verdicts fill in as each run finishes.

What the judge does for each criterion

For every criterion in the evaluation’s metrics, the judge reads three things — the criterion question, any na rule attached to it, and the evaluation’s instructions — and then returns:

an outcome (true, false, abstain, or na),
a verbatim quote from the conversation that supports the verdict,
a confidence, and
short reasoning.

The judge anchors on the question as the rule it’s deciding, and treats the instructions as supporting context — so the question carries the hard rule, while the instructions tell the judge what “correct” or “in scope” looks like (for example, a pasted copy of the agent’s knowledge base).
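The four things the judge returns can be pictured as one record per criterion. The sketch below is an assumption about shape, not the product's schema: the field names, the 0.0 to 1.0 confidence range, and the `check_verdict` helper are all hypothetical. The one rule it takes from this page is that the quote must appear verbatim in the conversation.

```python
from dataclasses import dataclass

# The four outcomes named on this page.
OUTCOMES = {"true", "false", "abstain", "na"}


@dataclass(frozen=True)
class Verdict:
    """Hypothetical shape of one per-criterion verdict."""
    criterion: str
    outcome: str
    quote: str
    confidence: float  # assumed to be 0.0-1.0; the real scale is not documented here
    reasoning: str


def check_verdict(verdict: Verdict, transcript: str) -> list[str]:
    """Return a list of problems; an empty list means the verdict looks well formed."""
    problems = []
    if verdict.outcome not in OUTCOMES:
        problems.append(f"unknown outcome: {verdict.outcome}")
    if verdict.quote not in transcript:
        problems.append("quote is not verbatim from the conversation")
    return problems
```

A check like this is a cheap way to spot a judge that paraphrases instead of quoting.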

For an audio-modality metric, if the conversation has no recording the judge can’t listen to anything, so the verdict auto-resolves to na (not applicable) rather than guessing.
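The audio rule amounts to a short-circuit that runs before the judge is consulted. A minimal sketch, assuming a hypothetical `modality` string and `has_recording` flag:

```python
from typing import Optional


def preresolve_outcome(modality: str, has_recording: bool) -> Optional[str]:
    """Return "na" when an audio metric has nothing to listen to.

    Returning None means no shortcut applies and the judge should
    evaluate the criterion as usual.
    """
    if modality == "audio" and not has_recording:
        return "na"
    return None
```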

Run statuses

Each run moves through a status you can watch in the Runs tab:

Status	Meaning
`pending`	Queued or in progress — the judge hasn’t finished yet.
`completed`	The run finished and produced verdicts.
`failed`	The run couldn’t complete (for example, a processing error).
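If you handle run statuses in your own tooling, the three values map naturally onto an enum. The status strings come from the table above; the class and helper names are hypothetical.

```python
from enum import Enum


class RunStatus(str, Enum):
    PENDING = "pending"      # queued or in progress
    COMPLETED = "completed"  # finished and produced verdicts
    FAILED = "failed"        # could not complete


def is_finished(status: RunStatus) -> bool:
    """A run is finished once it is no longer pending, whether it succeeded or not."""
    return status is not RunStatus.PENDING
```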

Batches and partial failures

When you run a batch and some conversations fail while others succeed, Anyreach doesn’t make you start over. The failed conversations stay selected in the Trigger Run dialog, so you can fix the cause and re-trigger just those, while the ones that already completed keep their results.
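The partial-failure behaviour can be modelled as a filter over the batch's results. In this sketch the `results` mapping (conversation ID to status string) is a hypothetical stand-in for whatever the Runs tab shows you.

```python
def conversations_to_retry(results: dict[str, str]) -> list[str]:
    """Keep only the failed conversations; completed and pending runs are left alone."""
    return [cid for cid, status in results.items() if status == "failed"]
```

For a batch where `c1` completed, `c2` failed, and `c3` is still pending, only `c2` is selected for re-triggering.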

Batches are the fast way to score a whole slice of history — filter your inbox down to the conversations you care about, then select them all in Trigger Run.

Part B — Review or correct by hand

The AI judge is fast, but you don’t have to take its word for it. A human run is the same kind of run, produced by a person instead of the model. You use human runs to spot-check the judge, to score metrics the judge isn’t trusted on yet, and to build the ground-truth verdicts that calibration measures the judge against.

Grade-by-exception

Grading a conversation yourself is designed to be quick. It’s grade-by-exception: every criterion starts on the compliant answer — the one that means “no problem here” — and you only change the criteria where something actually went wrong. You’re flagging exceptions, not filling out every field from scratch.

Default state:  every criterion → compliant (assume good)
Your job:       flip only the criteria where the agent slipped

Remember that “compliant” depends on the criterion’s expected_value: for a criterion like “Did the agent give wrong information?” the expected (good) answer is false, so leaving it on the compliant default means you’re asserting the agent did not give wrong information.
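The relationship between the compliant default and expected_value can be sketched as follows. The function names and the dictionary layout (criterion question mapped to its expected_value) are hypothetical; only the rule itself comes from this page.

```python
def default_grades(expected_values: dict[str, bool]) -> dict[str, bool]:
    """Every criterion starts on its compliant answer, which is its expected_value."""
    return dict(expected_values)


def flag_exception(grades: dict[str, bool], criterion: str) -> dict[str, bool]:
    """Flip one criterion away from its current answer, leaving the rest untouched."""
    updated = dict(grades)
    updated[criterion] = not updated[criterion]
    return updated


def exceptions(grades: dict[str, bool], expected_values: dict[str, bool]) -> list[str]:
    """List the criteria whose answer differs from the compliant one."""
    return [c for c, answer in grades.items() if answer != expected_values[c]]
```

Note that for "Did the agent give wrong information?" the default answer is `False`, and flagging it flips it to `True`, meaning the agent did slip.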

Confirming or correcting an AI run

A human run can stand on its own, but it can also confirm or correct an existing AI run on the same conversation. When you agree with the judge, your human verdicts line up with its verdicts; when you disagree, they diverge. That agreement (and disagreement) is exactly what calibration measures later — it’s how Anyreach decides whether the judge is trustworthy enough to score a metric automatically. The more honest human verdicts you record, the more reliable that measurement becomes.
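Calibration compares AI and human runs with Cohen’s κ, which corrects raw agreement for the agreement two assessors would reach by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below is the textbook formula over paired outcome labels. How Anyreach treats abstain and na verdicts, and any thresholds it applies, are not described here, so treat this as a generic illustration rather than the product's calculation.

```python
from collections import Counter


def cohens_kappa(ai: list[str], human: list[str]) -> float:
    """Cohen's kappa for two assessors' labels on the same criteria, in the same order."""
    if len(ai) != len(human) or not ai:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(ai)
    # Observed agreement: share of criteria where both gave the same outcome.
    observed = sum(a == h for a, h in zip(ai, human)) / n
    # Chance agreement: product of each assessor's label frequencies, summed over labels.
    ai_counts, human_counts = Counter(ai), Counter(human)
    labels = ai_counts.keys() | human_counts.keys()
    expected = sum(ai_counts[label] * human_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both assessors used one identical label throughout; kappa is undefined (0/0).
        # Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)
```

For example, with AI verdicts `["true", "true", "false", "false"]` and human verdicts `["true", "false", "false", "false"]`, observed agreement is 0.75, chance agreement is 0.5, and κ is 0.5.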

Both AI and human runs can exist for the same conversation and the same metric at once. On the Analytics overview, the assessor toggle (“AI + Human”, “AI judge”, “Human”) controls whose verdicts you’re looking at.

Next steps

Read your results

See compliant rates per metric, filter by assessor and date, and drill into the exceptions.

Calibrate the judge

Compare AI and human runs with Cohen’s κ to decide whether the judge can be trusted.
