JudgeEvaluator

The JudgeEvaluator uses an LLM as a judge to score agent outputs against a set of Categorical metrics. For each data point and each metric, it sends the agent's input and output to the judge LLM and records its verdict.

Note

JudgeEvaluator only accepts Categorical metrics. Passing a Numerical metric will log a warning and skip it.
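This filtering behavior can be pictured with a small pure-Python sketch. The `Categorical` and `Numerical` classes below are hypothetical stand-ins, not the railtracks types, and `filter_metrics` is an illustration of the warn-and-skip rule, not the library's implementation:

```python
import warnings

# Hypothetical stand-ins for the metric types (not the railtracks classes).
class Categorical:
    def __init__(self, name, categories):
        self.name, self.categories = name, categories

class Numerical:
    def __init__(self, name):
        self.name = name

def filter_metrics(metrics):
    """Keep categorical metrics; warn about and skip anything else."""
    kept = []
    for m in metrics:
        if isinstance(m, Categorical):
            kept.append(m)
        else:
            warnings.warn(f"JudgeEvaluator skips non-categorical metric {m.name!r}")
    return kept

metrics = [
    Categorical("Relevance", ["Relevant", "Irrelevant"]),
    Numerical("Latency"),
]
print([m.name for m in filter_metrics(metrics)])  # -> ['Relevance']
```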

Usage

from railtracks import evaluations as evals
import railtracks as rt

relevance = evals.metrics.Categorical(
    name="Relevance",
    categories=["Relevant", "Irrelevant"],
)

sentiment = evals.metrics.Categorical(
    name="Sentiment",
    categories=["Positive", "Negative", "Neutral"],
)

judge = evals.JudgeEvaluator(
    llm=rt.llm.OpenAILLM(model_name="gpt-4o"),
    metrics=[relevance, sentiment],
    reasoning=True,  # include the judge's reasoning in results
)

Parameters

Parameter       Description
llm             The LLM used as the judge.
metrics         Metrics to evaluate.
system_prompt   Override the default judge system prompt.
timeout         Timeout (seconds) for the judge flow.
reasoning       If True, the judge LLM also returns reasoning per result.
verbose         If True, log progress per data point.

Custom System Prompt

By default, JudgeEvaluator uses a built-in prompt that instructs the LLM to score agent quality. You can override it to focus on your domain:

judge = evals.JudgeEvaluator(
    llm=rt.llm.OpenAILLM(model_name="gpt-4o"),
    metrics=[relevance],
    system_prompt="You are a financial analyst. Evaluate whether the agent's response is accurate and compliant with regulations.",
)

Results

For each metric, the evaluator produces one MetricResult per data point, plus an aggregate breakdown of verdicts across categories. When reasoning=True, a corresponding {metric_name}_reasoning result is also stored alongside each verdict.
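As a rough sketch of the shape this describes, the snippet below aggregates a list of verdicts into a per-category breakdown with a `Counter` and stores reasoning under a `{metric_name}_reasoning` key. The verdict values and the `results` dict layout are illustrative assumptions, not the railtracks data model:

```python
from collections import Counter

# Made-up verdicts from a judge across five data points.
verdicts = ["Relevant", "Relevant", "Irrelevant", "Relevant", "Irrelevant"]
reasonings = ["on topic", "on topic", "off topic", "on topic", "off topic"]

metric_name = "Relevance"
results = {
    metric_name: verdicts,
    # With reasoning=True, a parallel reasoning entry sits alongside each verdict.
    f"{metric_name}_reasoning": reasonings,
}

# Aggregate breakdown across categories.
breakdown = Counter(results[metric_name])
print(dict(breakdown))  # -> {'Relevant': 3, 'Irrelevant': 2}
```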