Test Custom Metrics

Before you rely on a custom metric to measure your agentic system’s performance, test it to confirm it scores the way you intend. Galileo gives you three ways to test, from a quick sanity check to a quantitative measure of how closely the metric matches your ground truth.

Choose how to test

Method	Best for
Manual input	A quick check of a single input/output pair as you draft the prompt.
Current Logs	Seeing how the metric behaves on your most recent real sessions, traces, or spans.
Datasets	Measuring how closely the metric matches known-correct answers, with a macro F1 score or RMSE (root mean square error).

You’ll find all three on the Test Metric tab when creating or editing a metric.

Test with manual input

Provide an input and output and select Test. You’ll see the metric result, plus an explanation if step-by-step reasoning is turned on.

The manual input tab showing an input, output, a Test button, and a false result with an explanation below

Due to the complex structure of sessions and traces, Galileo supports manual input only for metrics that apply to spans.

Test against current logs

Select the project, choose the source type, then pick a Log Stream or Experiment. Select Test Metric to run the metric against your last 5 logged sessions, traces, or spans. Select any row to see the explanation.

The current logs tab showing 4 recent spans with input, output, metric score, and explanation columns

This is the fastest way to see how a metric behaves on real, representative data.

Test against a labeled dataset

Manual and log-based tests confirm a metric runs correctly. To know how much you can trust it, test it against a labeled dataset. Add a Metric Ground Truth column with the correct result for each row, and Galileo reports a single score for how well the metric’s output matches it: a macro F1 score for label-based metrics, or RMSE (root mean square error) for number-based metrics.

The datasets tab showing a loaded dataset with input, generated output, metric value, and Metric Ground Truth columns, alongside a Macro F1 score panel showing aligned and misaligned row counts

Dataset testing is available for metrics that apply to a trace or LLM span, where you can provide Metric Ground Truth per row. Galileo does not support dataset testing for session-level metrics.

Why test against a dataset

Testing against a labeled dataset does more than confirm a metric runs. It gives you a concrete score you can track and compare over time. The workflows below use the macro F1 score for label-based metrics; for number-based metrics, the same applies with RMSE.

Confirm scoring intent

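Test before you deploy. A high macro F1 score against a representative set of labeled examples is good evidence that the metric will score similar production data the way you intend. It is only as reliable as the dataset is representative, so include the edge cases you care about.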

Set a baseline

Record the macro F1 score when you first create a metric. Run the same dataset test again after any change (prompt, model, or number of judges) to see whether it helped or hurt.

Measure real improvements

Running the same dataset test again after iterating on a prompt confirms the improvement is real, not just a shift on a handful of examples. With Autotune, test on held-out examples to confirm the macro F1 score went up without overfitting to the examples used for tuning.

Catch drift early

Run the same dataset test again after a model upgrade or API change. If the macro F1 score drops, you know the metric needs attention before it affects your evaluations.
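The baseline and drift checks above come down to comparing a new score with a recorded one. The snippet below is a minimal sketch of that comparison, not part of Galileo: the function name `score_regressed`, the `tolerance` default, and the example scores are all assumptions made for illustration. Scores are assumed to be copied by hand from the dataset test results.

```python
def score_regressed(baseline, current, tolerance=0.02, higher_is_better=True):
    """Return True when `current` is worse than `baseline` by more than `tolerance`.

    Hypothetical helper: the tolerance is an arbitrary illustrative threshold.
    Use higher_is_better=True for macro F1 and False for RMSE.
    """
    # Express the change so that a negative number always means "got worse".
    change = current - baseline if higher_is_better else baseline - current
    return change < -tolerance


# Macro F1 fell from 0.91 to 0.84 after a model upgrade: flagged.
print(score_regressed(0.91, 0.84))  # True

# RMSE rose from 0.8 to 1.1 (lower is better): flagged.
print(score_regressed(0.8, 1.1, higher_is_better=False))  # True

# A 0.01 dip in macro F1 is within the tolerance: not flagged.
print(score_regressed(0.91, 0.90))  # False
```

Choose a tolerance that reflects how much run-to-run variation you see when you test the same metric on the same dataset twice.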

Run a dataset test

Add a dataset

On the Test Metric tab, select Datasets, then choose an existing dataset or upload one with your inputs and outputs.

Add Metric Ground Truth

For each row, provide the expected result in the Metric Ground Truth column. This is the value the metric should return when it scores correctly. A single dataset can hold ground truth for several metrics at once, stored separately and keyed by metric name, so you can reuse one dataset across your metric library.

Run the test

Run the metric across the dataset. Galileo scores every row and compares the output against your Metric Ground Truth.

Compute the score

Once the metric has run and Metric Ground Truth is in place for each row, select Compute Score. Galileo compares the metric’s output against the ground truth and reports a single aggregate score: a macro F1 score (0–1, higher is better) for boolean, categorical, or multi-label metrics, or RMSE (lower is better) for count, discrete, or percentage metrics. Use this score to decide whether the metric is ready to deploy or needs further iteration.
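To make the comparison step concrete, here is a rough sketch of counting aligned and misaligned rows for one metric when ground truth is keyed by metric name. The row layout, the field names `metric_output` and `metric_ground_truth`, and the metric names `is_polite` and `tone` are hypothetical and do not reflect Galileo's actual dataset schema.

```python
# Hypothetical rows for illustration only; not Galileo's dataset format.
# "metric_output" holds the result of the metric under test ("is_polite").
rows = [
    {"metric_output": True, "metric_ground_truth": {"is_polite": True, "tone": "formal"}},
    {"metric_output": False, "metric_ground_truth": {"is_polite": True}},
    {"metric_output": True, "metric_ground_truth": {"tone": "casual"}},  # no is_polite label
]


def alignment_counts(rows, metric_name):
    """Count rows where the metric output matches, differs from, or lacks ground truth."""
    counts = {"aligned": 0, "misaligned": 0, "unlabeled": 0}
    for row in rows:
        ground_truth = row.get("metric_ground_truth", {})
        if metric_name not in ground_truth:
            # Ground truth for other metrics does not count for this one.
            counts["unlabeled"] += 1
        elif row["metric_output"] == ground_truth[metric_name]:
            counts["aligned"] += 1
        else:
            counts["misaligned"] += 1
    return counts


print(alignment_counts(rows, "is_polite"))
# {'aligned': 1, 'misaligned': 1, 'unlabeled': 1}
```

Rows without a label for the metric being tested are kept separate here so they are not mistaken for disagreements.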

How scores are calculated

Galileo compares the metric’s output against the Metric Ground Truth for every labeled row, then reports a single score that fits the metric’s output type.

Label-based metrics: macro F1

For metrics that return a label (boolean, categorical, or multi-label), Galileo reports a macro F1 score: a single number from 0 to 1 for how well the metric’s labels match your Metric Ground Truth. F1 combines two things:

Precision: of the rows the metric flagged, the share that were actually correct. False positives bring this down.
Recall: of the rows that should have been flagged, the share the metric actually caught. False negatives bring this down.

F1 is the harmonic mean of the two, so a metric only scores well when it keeps both false positives and false negatives low. The macro F1 averages this across every label, so each label counts equally and a metric does not score well just by getting the most common case right. Iterate on the prompt, model, or number of judges and test again until the macro F1 is high enough for your use case.
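The calculation described above can be sketched in a few lines of Python. This is an illustrative implementation of the standard macro F1 definition for single-label outputs (boolean or categorical), not Galileo's own code; the function name `macro_f1` and the sample values are assumptions, and multi-label outputs would need each label scored as its own yes/no decision.

```python
from collections import Counter


def macro_f1(predicted, ground_truth):
    """Average the per-label F1 over every label seen in either list."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for pred, truth in zip(predicted, ground_truth):
        if pred == truth:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the metric returned this label when it should not have
            fn[truth] += 1  # the metric missed the correct label

    labels = set(predicted) | set(ground_truth)
    f1_scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall.
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores) if f1_scores else 0.0


# One false positive out of four rows.
print(round(macro_f1([True, True, False, True], [True, False, False, True]), 3))  # 0.733

# Always answering with the most common label: 90% of rows match,
# but the rare label is never caught, so the macro F1 is low.
print(round(macro_f1([True] * 10, [True] * 9 + [False]), 3))  # 0.474
```

The second example shows why each label counts equally: the missed rare label contributes an F1 of 0, which halves the average.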

For a deeper dive, see F1 score in machine learning and micro, macro, and weighted F1 averages explained.

Number-based metrics: RMSE

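For metrics that return a number (count, discrete, or percentage), a macro F1 score does not apply. Galileo reports RMSE (root mean square error): the square root of the average squared difference between the metric’s values and your Metric Ground Truth. Because errors are squared before averaging, a few large misses raise the score more than many small ones. Lower is better, and 0 means an exact match.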
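As a worked illustration, the standard RMSE formula can be computed as follows. The function name `rmse` and the sample values are assumptions for this sketch, which is not Galileo's implementation.

```python
import math


def rmse(predicted, ground_truth):
    """Root mean square error between metric values and ground truth values."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not predicted:
        raise ValueError("need at least one labeled row")
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted, ground_truth)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors are 0, 1, 0, and 2; squared they are 0, 1, 0, and 4, so the mean is 1.25.
print(round(rmse([3, 5, 2, 4], [3, 4, 2, 6]), 3))  # 1.118

# An exact match on every row gives 0.
print(rmse([1, 2, 3], [1, 2, 3]))  # 0.0
```

In the first example the single error of 2 accounts for most of the score, which reflects how squaring weights larger misses more heavily.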

For a deeper dive, see MAE, MSE, and RMSE simplified and comparing the robustness of MAE, MSE, and RMSE.

Score by output type

Output type	Score	Reading it
Boolean, Categorical, Multi-label	Macro F1	0–1, higher is better
Count, Discrete, Percentage	RMSE	Lower is better (0 = exact match)

Related resources

Improve metrics with Autotune

Turn feedback into prompt improvements, then test again to confirm the gain.

Custom LLM-as-a-Judge Metrics

Create the metrics you’ll test and validate.

Custom Code-Based Metrics

Create and test code-based metrics the same way.

Ground Truth Adherence

The output-level analog of the Metric Ground Truth column.
