LLMInferenceEvaluator

The LLMInferenceEvaluator analyzes the LLM calls recorded in past agent runs and reports token usage, cost, and latency statistics. It requires no configuration; all metrics are derived automatically from the session data.

Usage

from railtracks import evaluations as evals

data = evals.extract_agent_data_points(".railtracks/data/sessions/")

evaluator = evals.LLMInferenceEvaluator()
results = evals.evaluate(data=data, evaluators=[evaluator])

Metrics Tracked

The following metrics are collected per LLM call, broken down by model name, model provider, and call index:

Metric        Description
InputTokens   Number of prompt tokens sent to the model.
OutputTokens  Number of tokens in the model's response.
TokenCost     Total cost of the call in USD.
Latency       Wall-clock time for the call in seconds.

Aggregated (mean) values are calculated across runs for each (model_name, model_provider, call_index) group, making it straightforward to compare cost and speed across agent versions or model providers.
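The grouping and averaging described above can be sketched with plain Python. This is an illustrative example, not the evaluator's internal implementation: the record shape and field names mirror the metrics table, but the `calls` data is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-call records shaped like the metrics described above.
calls = [
    {"model_name": "gpt-4o", "model_provider": "openai", "call_index": 0,
     "InputTokens": 512, "OutputTokens": 128, "TokenCost": 0.004, "Latency": 1.2},
    {"model_name": "gpt-4o", "model_provider": "openai", "call_index": 0,
     "InputTokens": 480, "OutputTokens": 140, "TokenCost": 0.004, "Latency": 1.5},
]

# Group calls by (model_name, model_provider, call_index).
groups = defaultdict(list)
for call in calls:
    key = (call["model_name"], call["model_provider"], call["call_index"])
    groups[key].append(call)

# Compute the mean of each metric within every group.
aggregated = {
    key: {m: mean(c[m] for c in rows)
          for m in ("InputTokens", "OutputTokens", "TokenCost", "Latency")}
    for key, rows in groups.items()
}
```

Keying on `call_index` keeps the first LLM call of every run comparable with the first call of every other run, so a regression in one step of the agent is not averaged away across the whole session.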