# Evaluations
Evaluations in railtracks let you analyze, aggregate, and visualize previously invoked agent runs. Sessions are automatically stored in `.railtracks/data/sessions`, so evaluations can be run at any time after invoking your agent.
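Because sessions are plain files under `.railtracks/data/sessions`, you can check what has been recorded before running an evaluation using only the standard library. A minimal sketch (the directory path comes from the text above; the glob pattern is an assumption, since the exact session file naming is not documented here):

```python
from pathlib import Path

sessions_dir = Path(".railtracks/data/sessions")

# List stored session files, if any runs have been recorded yet.
# "*" is a deliberately broad pattern; adjust if your session files
# use a specific extension.
session_files = sorted(sessions_dir.glob("*")) if sessions_dir.exists() else []

print(f"Found {len(session_files)} stored session file(s)")
for path in session_files:
    print(f"  {path.name}")
```

If this prints zero files, invoke an agent first; the evaluation script below will otherwise have no data points to load.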
## Evaluation Definition
```python
import railtracks as rt
from railtracks import evaluations as evals

# Load the stored session data
data = evals.extract_agent_data_points(".railtracks/data/sessions/")

# Default evaluators
t_evaluator = evals.ToolUseEvaluator()
llm_evaluator = evals.LLMInferenceEvaluator()

# Configurable evaluators
judge_evaluator = evals.JudgeEvaluator(
    llm=rt.llm.OpenAILLM(model_name="gpt-5.2"),
    metrics=[
        evals.metrics.Categorical(
            name="Helpfulness",
            description=(
                "How helpful was the agent's response in addressing "
                "the user's query or completing the task? Consider "
                "factors such as relevance, accuracy, and completeness."
            ),
            categories=["Not Helpful", "Somewhat Helpful", "Very Helpful"],
        ),
        evals.metrics.Categorical(
            name="Efficiency",
            description=(
                "How efficiently did the agent complete the task? "
                "Consider factors such as speed, resource usage, "
                "and overall effectiveness."
            ),
            categories=["Not Efficient", "Somewhat Efficient", "Very Efficient"],
        ),
    ],
    reasoning=True,
)

results = evals.evaluate(
    data=data,
    evaluators=[t_evaluator, llm_evaluator, judge_evaluator],
)
```
As long as you have previously run an agent using railtracks, the script above will prompt you to choose which agent's runs to evaluate:

```text
Multiple agents found in the data:
0: WebsearchAgent -> 5 data points
1: FinanceAgent -> 5 data points
Select agent index(es) (comma-separated), or -1 to evaluate all:
```
Upon selection, the evaluation results are automatically saved to your `.railtracks/data/evaluations` folder. You can then use the `railtracks viz` command to view and analyze them.