Logging and Comparing against your Expected Answers
Expected outputs are a key element for evaluating LLM applications. They provide benchmarks to measure model accuracy, identify errors, and ensure consistent assessments.
By comparing model responses to these predefined targets, you can pinpoint areas of improvement and track performance changes over time.
Because every run is scored against the same targets, including expected outputs also keeps your evaluations fair and replicable.
Logging Expected Output
There are a few ways to create runs, and each logs your Expected Output slightly differently:
pq.run() or Playground UI
If you’re using pq.run() or creating runs through the Playground UI, simply include your expected answers in a column called target in your evaluation set.
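As a minimal sketch of what such an evaluation set could look like, the snippet below writes a small CSV with a target column. Only the column name target comes from the text above; the question column, the example rows, and the build_eval_csv helper are assumptions made for illustration, and how the resulting data is handed to pq.run() or uploaded in the Playground UI is not shown here.

```python
import csv
import io

# Hypothetical evaluation examples. Only the "target" column name comes
# from the documentation; "question" is an assumed input column.
rows = [
    {"question": "What is the capital of France?", "target": "Paris"},
    {"question": "How many days are in a leap year?", "target": "366"},
]


def build_eval_csv(rows):
    """Serialize the evaluation set to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["question", "target"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


eval_csv = build_eval_csv(rows)
print(eval_csv)
```

Keeping the expected answer in the same row as its input means each response can be compared against the right target without any extra bookkeeping.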
Python Logger
If you’re logging your runs via EvaluateRun, you can set the expected output using the ground_truth parameter in the workflow creation methods.
To log your runs with Galileo, start with the usual flow of logging into Galileo, then construct your EvaluateRun object.
You can then integrate this logging into your existing application and include the expected output for each example in your evaluation set.
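The helper below is a rough sketch of how the expected output might be passed along while logging. The only detail confirmed by the text above is the ground_truth parameter. The method name add_workflow, its input and output keyword arguments, the example dictionary keys, and the helper log_with_ground_truth itself are assumptions, so check them against the client library before relying on this.

```python
from typing import Any, Iterable, Mapping


def log_with_ground_truth(evaluate_run: Any, examples: Iterable[Mapping[str, str]]) -> int:
    """Log each example as a workflow and attach its expected answer.

    `evaluate_run` is expected to be an already constructed EvaluateRun.
    Assumption: it exposes add_workflow(input=..., output=..., ground_truth=...).
    """
    logged = 0
    for example in examples:
        evaluate_run.add_workflow(
            input=example["input"],            # what the user asked
            output=example["output"],          # what your application answered
            ground_truth=example["expected"],  # the expected output
        )
        logged += 1
    return logged
```

Writing the helper against any object with that method keeps the login and run construction steps separate from the per-example logging.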
Langchain Callback
If you’re using a Langchain Callback, add your expected output by calling add_targets on your callback handler.
REST Endpoint
If you’re logging Evaluation runs via the REST endpoint, set the target field in the root node of each workflow.
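The sketch below illustrates where the target field sits when a workflow is serialized for a request body. Only the target field and its placement on the root node come from the text above. Every other field name (type, input, output, steps), the helper build_root_node, and the overall payload shape are hypothetical placeholders, not the actual endpoint schema.

```python
import json


def build_root_node(user_input, output, target, steps=None):
    """Build a workflow root node with the expected output attached.

    Assumption: all keys except "target" are illustrative placeholders.
    """
    return {
        "type": "workflow",
        "input": user_input,
        "output": output,
        "target": target,      # expected output lives on the root node
        "steps": steps or [],  # child nodes carry no target of their own
    }


root = build_root_node(
    user_input="What is the capital of France?",
    output="The capital of France is Paris.",
    target="Paris",
    steps=[{"type": "llm", "input": "What is the capital of France?",
            "output": "The capital of France is Paris."}],
)
body = json.dumps({"workflows": [root]}, indent=2)
print(body)
```

The point of the sketch is the placement: one target per workflow, set at the top level rather than on the nested steps.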
Comparing Output and Expected Output
Once Expected Output is logged, it appears next to your Output wherever your output is shown.
Metrics
When you add a ground truth, BLEU and ROUGE-1 are computed automatically and appear in the UI. BLEU and ROUGE measure syntactic equivalence (i.e., word-by-word similarity) between your Ground Truth and the actual responses.
Additionally, Ground Truth Adherence can be added as a metric to measure the semantic equivalence between your Ground Truth and actual responses.
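To make the word-by-word similarity concrete, here is a simplified unigram-overlap score in the spirit of ROUGE-1. It is an illustration only and not the implementation used by the product: tokenization is a plain lowercase whitespace split, which is an assumption, and production ROUGE implementations typically add their own normalization.

```python
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict:
    """Simplified ROUGE-1: clipped unigram overlap between two strings."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Counter intersection keeps the minimum count per shared word.
    overlap = sum((ref_counts & cand_counts).values())
    ref_total = sum(ref_counts.values())
    cand_total = sum(cand_counts.values())

    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Same words, high score.
close = rouge1("the cat sat on the mat", "the cat lay on the mat")

# Same meaning, different words: word overlap is low.
paraphrase = rouge1("366 days", "three hundred sixty-six days")
print(close, paraphrase)
```

The second pair shows why a semantic metric such as Ground Truth Adherence is a useful complement: an answer can match the expected meaning while sharing few words with it.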