The Metric select is searchable. It includes trainable Galileo LLM-as-judge metrics available to your workspace, plus custom metrics and prompts your org created.
The picker is organized into three groups:
Galileo presets — built-in Galileo scorers that Luna Studio can train.
Custom Galileo metrics — custom metrics already created in Galileo.
Saved custom prompts — prompts previously authored in Luna Studio.
Metrics that exist but are not yet trainable can appear disabled with a “not trainable yet” suffix. Multimodal metrics are filtered out because Luna Studio currently trains text metrics only.
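As a rough illustration of the grouping and filtering rules above, the sketch below models the picker in Python. The `Metric` dataclass, its field names, the sample metric names, and the `build_picker` helper are all hypothetical; they mirror the behavior described here and are not part of any Galileo API.

```python
from dataclasses import dataclass

# Hypothetical model of a picker entry; field names are assumptions, not Galileo's schema.
@dataclass
class Metric:
    name: str
    group: str               # "preset", "custom", or "saved_prompt"
    trainable: bool = True
    multimodal: bool = False

GROUP_TITLES = {
    "preset": "Galileo presets",
    "custom": "Custom Galileo metrics",
    "saved_prompt": "Saved custom prompts",
}

def build_picker(metrics, query=""):
    """Group metrics for display, hiding multimodal ones and disabling untrainable ones."""
    picker = {title: [] for title in GROUP_TITLES.values()}
    for m in metrics:
        if m.multimodal:
            continue  # filtered out: only text metrics are trained today
        if query.lower() not in m.name.lower():
            continue  # simple case-insensitive substring search
        label = m.name if m.trainable else f"{m.name} (not trainable yet)"
        picker[GROUP_TITLES[m.group]].append({"label": label, "disabled": not m.trainable})
    return picker

# Made-up sample metrics for illustration only.
metrics = [
    Metric("Toxicity", "preset"),
    Metric("Image quality", "preset", multimodal=True),
    Metric("Brand tone", "custom", trainable=False),
    Metric("Cites a source", "saved_prompt"),
]
picker = build_picker(metrics)
```

In this sketch an untrainable metric stays visible but disabled, while a multimodal metric never reaches the list at all, which matches the distinction drawn above.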
Boolean
Yes/no questions (“Is this toxic?”, “Does the answer cite a source?”).
Categorical
Picking one of a fixed list (e.g. positive / neutral / negative).
Other Galileo output types are not trainable in Luna Studio yet. The output type also constrains what label values your test set can use during validation. See Test sets.
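To make the link between output type and test-set labels concrete, here is a minimal sketch of a pre-upload label check. The function name `invalid_labels`, the lowercase type names, and the accepted Boolean spellings are assumptions for illustration; the Test sets page is the authority on the label formats Luna Studio actually accepts.

```python
# Hypothetical label check; the accepted values below are assumptions for illustration.
BOOLEAN_LABELS = {"true", "false"}

def invalid_labels(output_type, labels, categories=None):
    """Return the labels that do not fit the metric's output type."""
    if output_type == "boolean":
        allowed = BOOLEAN_LABELS
    elif output_type == "categorical":
        # Categorical labels must come from the metric's fixed list of classes.
        allowed = {c.strip().lower() for c in (categories or [])}
    else:
        raise ValueError(f"output type {output_type!r} is not trainable in this sketch")
    return [label for label in labels if str(label).strip().lower() not in allowed]

bad_boolean = invalid_labels("boolean", ["true", "False", "maybe"])
bad_categorical = invalid_labels(
    "categorical",
    ["positive", "mixed"],
    categories=["positive", "neutral", "negative"],
)
```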
LLM span
A single LLM call inside a trace. This is the default and the most common choice.
Retriever
A retrieval step (e.g. evaluating chunk relevance).
Agent span
A single agent step inside a trace.
Trace
The full trace — input, intermediate steps, and final output.
The right step depends on what your metric needs to see. For “is the final answer toxic?” use LLM span or Trace. For “are retrieved chunks relevant?” use Retriever.
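One way to think about the choice is in terms of which records the metric would receive. The sketch below uses a made-up trace layout; the span type strings, field names, and the `records_for_step` helper are assumptions, not Galileo's trace schema.

```python
# Hypothetical trace layout; span types and field names are assumptions for illustration.
trace = {
    "input": "What is our refund policy?",
    "output": "Refunds are available within 30 days.",
    "spans": [
        {"type": "retriever", "input": "refund policy",
         "output": ["Refunds within 30 days."]},
        {"type": "llm", "input": "Answer using the context.",
         "output": "Refunds are available within 30 days."},
    ],
}

def records_for_step(trace, step):
    """Return the records a metric attached to `step` would be given."""
    if step == "trace":
        return [trace]  # whole trace: input, intermediate steps, final output
    return [span for span in trace["spans"] if span["type"] == step]

retriever_records = records_for_step(trace, "retriever")
trace_records = records_for_step(trace, "trace")
```

A chunk-relevance metric only needs `retriever_records`, whereas a final-answer metric can work from either the LLM span or the full trace.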
Be specific. Define exactly what counts as a positive vs negative result.
Give examples. One or two short examples per outcome class is plenty.
Constrain the output. End the prompt with something like “Respond with only true or false.” for Boolean metrics.
Avoid open scales. “Score 1–10” is harder for an LLM judge to keep consistent than a binary or 3-class categorical.
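The tips above can be combined into a single Boolean prompt. The template below is only a sketch: its wording, the example responses, and the `{response}` placeholder are assumptions, and the placeholder syntax Luna Studio expects may differ.

```python
# Hypothetical Boolean judge prompt; the wording and {response} placeholder are assumptions.
PROMPT_TEMPLATE = """\
Decide whether the response cites a source.

A response cites a source only if it names a specific document, URL, or author
that supports its claim. Vague phrases such as "studies show" do not count.

Example (true): "According to the 2023 WHO report, cases fell by 10%."
Example (false): "Experts say cases fell recently."

Response:
{response}

Respond with only true or false."""

def build_prompt(response: str) -> str:
    """Fill the template with the text to be judged."""
    return PROMPT_TEMPLATE.format(response=response)

prompt = build_prompt("Sales rose last year, per the company's annual report.")
```

The template defines the positive and negative cases, gives one short example per outcome, and ends with the output constraint, so the judge has no room for an open-ended answer.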
Pro tip: For best results, we recommend first creating your metric in the Galileo console and using the Autotune feature to test and refine it on a labelled test dataset. This helps you optimize the metric’s performance before launching a full training run in Luna Studio.