Evaluation

How to evaluate your AI workflow in Stack AI

The Evaluation section helps you analyze the performance of your LLM workflow across several tests.

How to run evaluations

To run an evaluation, fill in the evaluation table by specifying the inputs and parameters of your workflow. You set the following parameters:

  • Inputs (nodes with an id starting with in-)

  • URLs (nodes with an id starting with url-)

  • Images (nodes with an id starting with img2text-)

All of these parameters are set dynamically for each test.
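For example, a single test might supply values like these (a hypothetical mapping; the node ids in-0, url-0, and img2text-0 are illustrative and depend on your own workflow):

```python
# Hypothetical parameter values for one evaluation run.
# Node ids (in-0, url-0, img2text-0) are illustrative only.
test_case = {
    "in-0": "Summarize the key points of the linked article.",     # Input node
    "url-0": "https://example.com/blog/launch-announcement",       # URL node
    "img2text-0": "https://example.com/assets/product-photo.png",  # Image node
}
```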

You can add up to 50 parallel evaluations to your LLM workflow.

Uploading data

You can upload a CSV of evaluations containing the values of each parameter and the output to be used. For instance, a CSV may have the following structure:
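Here is a hypothetical layout, assuming one input node (in-0) and one URL node (url-0); the column names must match your workflow's node ids:

```csv
in-0,url-0,output
"What is your refund policy?",https://example.com/faq,"Refunds are available within 30 days of purchase."
"Who founded the company?",https://example.com/about,"The company was founded by Jane Doe in 2020."
```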

Once your evaluations are done, you can download your results as a separate CSV.

Model grading

You can grade a flow's performance by selecting an "Output to evaluate" and specifying "Grade Criteria" in the input box.

Under the hood, grading runs a pipeline in which grader LLMs score the completions produced by the other LLMs in your workflow.
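Stack AI doesn't expose this internal pipeline, but conceptually it resembles an LLM-as-judge loop, where a grader model scores each completion against your criteria. A minimal sketch, assuming the OpenAI Python SDK and an illustrative prompt format (not Stack AI's actual implementation):

```python
# Illustrative LLM-as-judge grader; not Stack AI's internal code.
# Assumes the OpenAI Python SDK (>= 1.0) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def grade_completion(completion: str, criteria: str, ground_truth: str | None = None) -> int:
    """Ask a grader model to score a completion from 1 to 10 against the criteria."""
    prompt = f"Grade the following answer on a scale of 1 to 10.\nCriteria: {criteria}\n"
    if ground_truth:
        # When a ground truth is provided, the grader also checks how closely
        # the completion matches it.
        prompt += f"Ground truth answer: {ground_truth}\n"
    prompt += f"Answer to grade: {completion}\nRespond with only the integer score."
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return int(response.choices[0].message.content.strip())
```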

Evaluation criteria work best when they follow these guidelines:

  • Specify clear instructions: state exactly what the goal of the LLM workflow is and how you expect it to respond.

  • Enumerate how to grade: if possible, describe what merits a 10-point score, a 5-point score, and a 1-point score.

  • Use plain English: avoid using technical jargon that wouldn't be immediately understandable by an LLM.
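Putting these guidelines together, a "Grade Criteria" entry might read like this (illustrative only):

```
The workflow should answer customer questions about our return policy
accurately and politely.
- 10 points: the answer is correct, complete, and polite.
- 5 points: the answer is partially correct or omits key details.
- 1 point: the answer is incorrect, off-topic, or rude.
```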

If the "ground truth" field is filled in, the model grader also evaluates how closely the LLM completion matches the ground truth.

How to optimize prompts

(Coming soon)
