Galileo lets you compare multiple evaluation runs side by side. You can see how different configurations of your system (e.g. different parameters, prompt templates, or retriever strategies) handled the same set of queries, so you can quickly evaluate, analyze, and annotate your experiments. This works for both single-step workflows and multi-step (chain) workflows.

How do I get started?

To enter Compare Runs mode, select the runs you want to compare and click “Compare Runs” on the Action Bar.

Two runs are comparable only if they were created from the same evaluation dataset.
Once you’re in Compare Runs, you can:

  • Compare how your different configurations responded to the same input.

  • Compare Metrics.

  • Expand to see the full Trace of a multi-step workflow and identify which steps went wrong.

  • Review and add Human Feedback.

  • Toggle back and forth between inputs in your evaluation set.
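
To make the comparability rule concrete, here is a minimal Python sketch of what a side-by-side comparison does conceptually. It does not use Galileo's API. The `Row` structure, the `compare_runs` function, and the `score` field are hypothetical names chosen for illustration, and the sketch assumes each query appears once per run.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One evaluated query in a run (hypothetical structure, not Galileo's schema)."""
    query: str
    response: str
    score: float  # stand-in for any per-row metric


def compare_runs(run_a: list[Row], run_b: list[Row]) -> list[dict]:
    """Pair rows from two runs by query and report the metric difference."""
    a_by_query = {row.query: row for row in run_a}
    b_by_query = {row.query: row for row in run_b}
    # Runs are only comparable if they cover exactly the same set of queries.
    if a_by_query.keys() != b_by_query.keys():
        raise ValueError("runs were not created from the same evaluation dataset")
    return [
        {
            "query": row.query,
            "response_a": row.response,
            "response_b": b_by_query[row.query].response,
            "score_delta": b_by_query[row.query].score - row.score,
        }
        for row in run_a
    ]


if __name__ == "__main__":
    baseline = [Row("What is RAG?", "answer from prompt v1", 0.5)]
    candidate = [Row("What is RAG?", "answer from prompt v2", 0.75)]
    for pair in compare_runs(baseline, candidate):
        print(pair)
```

Keying both runs on the query is why the shared dataset matters: without identical inputs there is nothing to line up row by row.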