Categorical Metrics
Categorical metrics are the recommended metrics for Evaluators that are themselves an LLM (or another agent). Research has shown that LLMs struggle to provide a reliable numerical score for a task, so categories (i.e. "labels") are a more reliable metric.
In Railtracks, these metrics are mainly used in JudgeEvaluator.
Usage
from railtracks import evaluation as eval

sentiment = eval.metrics.Categorical(
    name="Sentiment",
    categories=["Positive", "Negative", "Neutral"],
    description="Tone of the agent's response.",  # optional
)
Pass metrics into a JudgeEvaluator to evaluate agent runs against each category.
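To make "evaluating against each category" concrete, here is a minimal sketch that does not use Railtracks at all. It assumes a judge returns one free-text label per run and shows how such replies could be matched against the allowed categories and tallied. The names `normalize_label` and `tally_labels` are hypothetical helpers invented for this illustration; they are not part of the Railtracks API, and JudgeEvaluator may handle this differently.

```python
from typing import Dict, Optional, Sequence, Tuple

# Same categories as the Sentiment metric above.
CATEGORIES = ["Positive", "Negative", "Neutral"]


def normalize_label(raw: str, categories: Sequence[str]) -> Optional[str]:
    """Map a judge's raw reply onto an allowed category, or None if it matches none.

    Hypothetical helper: matching ignores case, surrounding whitespace,
    and surrounding periods or quotes.
    """
    cleaned = raw.strip().strip(".\"'").casefold()
    for category in categories:
        if cleaned == category.casefold():
            return category
    return None


def tally_labels(
    raw_labels: Sequence[str], categories: Sequence[str]
) -> Tuple[Dict[str, int], int]:
    """Count how often each category was assigned, plus the number of invalid replies."""
    counts = {category: 0 for category in categories}
    invalid = 0
    for raw in raw_labels:
        label = normalize_label(raw, categories)
        if label is None:
            invalid += 1  # e.g. the judge answered with a number instead of a label
        else:
            counts[label] += 1
    return counts, invalid


if __name__ == "__main__":
    replies = ["positive", " Neutral. ", "7/10", "NEGATIVE", "Positive"]
    counts, invalid = tally_labels(replies, CATEGORIES)
    print(counts, invalid)
```

Because the set of valid answers is closed, a reply that falls outside it (such as "7/10" here) can be detected and counted rather than silently averaged into a score.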