LLMInferenceEvaluator

The LLMInferenceEvaluator analyzes the LLM calls recorded in past agent runs and reports token usage, cost, and latency statistics. It requires no configuration; all metrics are derived automatically from the session data.

Usage

from railtracks import evaluations as evals

data = evals.extract_agent_data_points(".railtracks/data/sessions/")

evaluator = evals.LLMInferenceEvaluator()
results = evals.evaluate(data=data, evaluators=[evaluator])

Metrics Tracked

The following metrics are collected per LLM call, broken down by model name, model provider, and call index:

Metric        Description
InputTokens   Number of prompt tokens sent to the model.
OutputTokens  Number of tokens in the model's response.
TokenCost     Total cost of the call in USD.
Latency       Wall-clock time for the call in seconds.

Aggregated (mean) values are calculated across runs for each (model_name, model_provider, call_index) group, making it straightforward to compare cost and speed across agent versions or model providers.
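The grouping and averaging described above can be sketched with plain Python. This is an illustrative example, not the evaluator's internal implementation: the record shape and field names mirror the metrics table, but the `calls` data is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-call records shaped like the metrics described above.
calls = [
    {"model_name": "gpt-4o", "model_provider": "openai", "call_index": 0,
     "InputTokens": 512, "OutputTokens": 128, "TokenCost": 0.004, "Latency": 1.2},
    {"model_name": "gpt-4o", "model_provider": "openai", "call_index": 0,
     "InputTokens": 480, "OutputTokens": 140, "TokenCost": 0.004, "Latency": 1.5},
]

# Group calls by (model_name, model_provider, call_index).
groups = defaultdict(list)
for call in calls:
    key = (call["model_name"], call["model_provider"], call["call_index"])
    groups[key].append(call)

# Compute the mean of each metric within every group.
aggregated = {
    key: {m: mean(c[m] for c in rows)
          for m in ("InputTokens", "OutputTokens", "TokenCost", "Latency")}
    for key, rows in groups.items()
}
```

Keying on `call_index` keeps the first LLM call of every run comparable with the first call of every other run, so a regression in one step of the agent is not averaged away across the whole session.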