Preface

Testing has long been an essential part of the software engineering lifecycle. "Agents" are still software products, but because they are stochastic in nature, they also require evaluation approaches from other paradigms. This is an active area of research with no widely agreed-upon standards yet. In Railtracks, we therefore follow the philosophy of giving users the flexibility to define what "Evaluation of Agents" means to them.

We outline two potential avenues below:

Evaluations that analyze the past results of an agent
Evaluations that require an agent to be invoked

Railtracks currently provides direct support for the first case and indirect support for the second. We are actively working on APIs for "Agent Experimentation", which we believe is the encompassing term for the second case above.
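To make the distinction between the two cases concrete, here is a minimal sketch in plain Python. It does not use the Railtracks API: `RunRecord`, `evaluate_past_runs`, `evaluate_by_invocation`, and the stand-in agent and scorer are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunRecord:
    """A stored result of one agent run (hypothetical structure)."""
    prompt: str
    output: str


def evaluate_past_runs(
    records: list[RunRecord], score: Callable[[RunRecord], float]
) -> float:
    """Case 1: score results that already exist; the agent is never called."""
    if not records:
        return 0.0
    return sum(score(r) for r in records) / len(records)


def evaluate_by_invocation(
    agent: Callable[[str], str],
    prompts: list[str],
    score: Callable[[RunRecord], float],
) -> float:
    """Case 2: invoke the agent on each prompt, then score the fresh outputs."""
    fresh = [RunRecord(prompt=p, output=agent(p)) for p in prompts]
    return evaluate_past_runs(fresh, score)


def non_empty(record: RunRecord) -> float:
    """Stand-in scorer: 1.0 when the output is non-empty, else 0.0."""
    return 1.0 if record.output.strip() else 0.0


def echo_agent(prompt: str) -> str:
    """Stand-in agent (assumption): a real agent would call a model here."""
    return prompt.upper()


history = [RunRecord("hi", "hello"), RunRecord("bye", "")]
past_score = evaluate_past_runs(history, non_empty)
live_score = evaluate_by_invocation(echo_agent, ["hi", "bye"], non_empty)
```

The first function is deterministic given the same records and a deterministic scorer. The second calls the agent, so with a real stochastic agent its score can vary from run to run.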

Evaluation Flow

The diagram below illustrates a typical evaluation workflow:

graph TD
    Developer([Developer]) --> BuildAgent[Agent Build]

    BuildAgent --> Dataset[Dataset]

    subgraph Evaluation ["Evaluation Pipeline"]
        Dataset --> Evaluator[Evaluator]
        Evaluator --> Metric[Metric]
        Metric --> Result[Result]
    end

    Result -->|Iterate & Improve| BuildAgent

    Result --> Deploy[Deployment]
    Deploy -->|User Feedback| BuildAgent

    %% === COLOR THEMING ===
    %% Define color classes based on consistent theme
    classDef userClass fill:#60A5FA,fill-opacity:0.3
    classDef buildClass fill:#FBBF24,fill-opacity:0.3
    classDef evalClass fill:#34D399,fill-opacity:0.3
    classDef resultClass fill:#BFDBFE,fill-opacity:0.3
    classDef pipelineClass fill:#FECACA,fill-opacity:0.3
    classDef deployClass fill:#34D399,fill-opacity:0.3

    %% Apply color classes
    class Developer userClass;
    class BuildAgent buildClass;
    class Dataset,Evaluator,Metric pipelineClass;
    class Result resultClass;
    class Deploy deployClass;

    %% Subgraph style
    style Evaluation fill:transparent,stroke:#FFFFFF,stroke-width:1px
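
The Dataset, Evaluator, Metric, Result chain in the diagram can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the Railtracks API: `Example`, `Result`, `exact_match`, and `evaluate` are hypothetical names, and the dataset holds recorded agent outputs (the first evaluation case).

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Example:
    """One dataset row: a question, the expected answer, and a recorded agent output."""
    question: str
    expected: str
    agent_output: str


@dataclass
class Result:
    """Aggregated outcome of running one metric over a dataset."""
    metric_name: str
    value: float
    n: int


def exact_match(example: Example) -> float:
    """Metric: 1.0 if the output equals the expected answer, ignoring case and outer whitespace."""
    return float(
        example.agent_output.strip().lower() == example.expected.strip().lower()
    )


def evaluate(dataset: list[Example], metric: Callable[[Example], float]) -> Result:
    """Evaluator: apply the metric to every example and average the scores."""
    scores = [metric(ex) for ex in dataset]
    return Result(metric.__name__, mean(scores) if scores else 0.0, len(scores))


# Hypothetical dataset of recorded runs
dataset = [
    Example("2 + 2?", "4", "4"),
    Example("Capital of France?", "Paris", "paris "),
    Example("Largest planet?", "Jupiter", "Saturn"),
]

result = evaluate(dataset, exact_match)
print(f"{result.metric_name}: {result.value:.2f} over {result.n} examples")
```

Keeping the metric as a plain function passed into the evaluator means the definition of "good" stays in the user's hands, which matches the flexibility described in the preface.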