
Two runs are comparable only if both were created from the same evaluation dataset. When comparing runs, you can:
- Compare how your different configurations responded to the same input.
- Compare Metrics between the runs.
- Expand a row to see the full Trace of the multi-step workflow and identify which steps went wrong.
- Review and add Human Feedback.
- Toggle back and forth between the inputs in your eval set.
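
The comparability rule can be pictured with a small, self-contained Python sketch. It is only an illustration of the idea, not the product's API: the `Run` type, the `compare_runs` function, and every field name are hypothetical. The sketch refuses to compare runs built from different datasets, then pairs up outputs and metric scores input by input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Hypothetical record of one evaluation run (not a real SDK type)."""
    name: str
    dataset_id: str
    outputs: dict[str, str]    # input text -> model output
    scores: dict[str, float]   # input text -> metric score


def compare_runs(a: Run, b: Run) -> list[dict]:
    """Pair results from two runs, one row per shared input."""
    # Runs are only comparable when they come from the same eval dataset.
    if a.dataset_id != b.dataset_id:
        raise ValueError(
            f"{a.name!r} and {b.name!r} use different datasets "
            f"({a.dataset_id!r} vs {b.dataset_id!r})"
        )

    # Keep only inputs that have an output and a score in both runs.
    shared = (
        a.outputs.keys() & b.outputs.keys()
        & a.scores.keys() & b.scores.keys()
    )

    rows = []
    for text in sorted(shared):
        rows.append({
            "input": text,
            "output_a": a.outputs[text],
            "output_b": b.outputs[text],
            # Positive delta means run b scored higher on this input.
            "score_delta": b.scores[text] - a.scores[text],
        })
    return rows
```

Calling `compare_runs(baseline, candidate)` returns one row per shared input, which mirrors the side-by-side view described in the list above.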