Most metrics in Luna Studio come from Galileo presets or custom Galileo metrics. When you need a metric that does not fit an existing option, define a custom LLM-as-judge prompt in Step 1 of the run creation flow.
Custom metric in the run creation flow

Open custom prompt mode

From Step 1 of the run creation flow, open the metric dropdown and click Use custom prompt.

Fields

| Field | Required | Notes |
| --- | --- | --- |
| Metric name | No | Optional display name. If left blank, Luna Studio derives one from the run context. |
| Output type | Yes | The trainable return shape: Boolean or Categorical. |
| Step | Yes | The trace step the metric runs against: LLM span, Retriever, Agent span, or Trace. |
| Input step | Yes | The training input shape: Single message, Input / output pair, Full trace, or Full session. |
| Modality | n/a | Read-only. Fixed to Text today. |
| Prompt | Yes | The LLM-as-judge prompt. |
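If it helps to think of the form as a configuration object, the sketch below models its fields in Python. It is a hypothetical local data structure, not a Galileo or Luna Studio SDK type: the class and enum names and the `validate` method are assumptions made for illustration. It encodes only what the table states: which fields are required, the allowed values, and that modality is fixed to Text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class OutputType(Enum):
    BOOLEAN = "Boolean"
    CATEGORICAL = "Categorical"


class Step(Enum):
    LLM_SPAN = "LLM span"
    RETRIEVER = "Retriever"
    AGENT_SPAN = "Agent span"
    TRACE = "Trace"


class InputStep(Enum):
    SINGLE_MESSAGE = "Single message"
    INPUT_OUTPUT_PAIR = "Input / output pair"
    FULL_TRACE = "Full trace"
    FULL_SESSION = "Full session"


@dataclass(frozen=True)
class CustomMetricDefinition:
    """Hypothetical model of the custom-prompt form; not a Galileo SDK type."""

    output_type: OutputType
    step: Step
    input_step: InputStep
    prompt: str
    metric_name: Optional[str] = None  # optional: Luna Studio derives a name if blank
    modality: str = "Text"             # read-only in the form, fixed to Text

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the definition is complete."""
        problems = []
        if not self.prompt.strip():
            problems.append("prompt is required")
        if self.modality != "Text":
            problems.append("modality is fixed to Text")
        return problems


metric = CustomMetricDefinition(
    output_type=OutputType.BOOLEAN,
    step=Step.LLM_SPAN,
    input_step=InputStep.INPUT_OUTPUT_PAIR,
    prompt="Does the answer cite a source? Respond with only true or false.",
)
print(metric.validate())  # []
```

Using enums for the choice fields mirrors the dropdowns: looking up an unsupported value such as `OutputType("Numeric")` raises a `ValueError`, because only the two trainable shapes are defined.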

Output types in detail

| Output type | When to use |
| --- | --- |
| Boolean | Yes/no questions (“Is this toxic?”, “Does the answer cite a source?”). |
| Categorical | Picking one of a fixed list (e.g. positive / neutral / negative). |
Other Galileo output types are not trainable in Luna Studio yet. The output type also constrains what label values your test set can use during validation. See Test sets.
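As an illustration of how the output type can constrain test-set labels, the hypothetical `validate_label` helper below accepts only true/false for Boolean metrics and only a declared category for Categorical metrics. The exact acceptance rules (for instance, accepting the strings "true" and "false" in any case) are assumptions for this sketch; the Test sets page describes the validation Luna Studio actually applies.

```python
from typing import Optional, Sequence


def validate_label(output_type: str, label: object,
                   categories: Optional[Sequence[str]] = None) -> bool:
    """Hypothetical check: is `label` a legal test-set label for this output type?"""
    if output_type == "Boolean":
        # Assumption: Boolean labels are True/False or the strings "true"/"false".
        if isinstance(label, bool):
            return True
        return isinstance(label, str) and label.strip().lower() in {"true", "false"}
    if output_type == "Categorical":
        # Assumption: a Categorical label must exactly match one declared category.
        return categories is not None and label in categories
    # Only Boolean and Categorical are trainable in Luna Studio today.
    raise ValueError(f"output type not trainable in Luna Studio: {output_type!r}")


sentiment = ["positive", "neutral", "negative"]
print(validate_label("Boolean", True))                      # True
print(validate_label("Categorical", "neutral", sentiment))  # True
print(validate_label("Categorical", "mixed", sentiment))    # False
```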

Steps in detail

| Step | Where it fires |
| --- | --- |
| LLM span | A single LLM call inside a trace. The default and most common. |
| Retriever | A retrieval step (e.g. evaluating chunk relevance). |
| Agent span | A single agent step inside a trace. |
| Trace | The full trace: input, intermediate steps, and final output. |
The right step depends on what your metric needs to see. To judge whether the final answer is toxic, use LLM span or Trace; to judge whether retrieved chunks are relevant, use Retriever.
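One way to picture the difference between steps is granularity: in the toy model below, a span-level metric gets one evaluation per matching span, while a trace-level metric gets a single evaluation over the whole trace. The `Span` and `Trace` classes, the span-kind labels, and `units_for_step` are all hypothetical and far simpler than real Galileo traces.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Span:
    kind: str    # toy span kinds: "llm", "retriever", "agent"
    input: str
    output: str


@dataclass
class Trace:
    input: str
    output: str
    spans: List[Span] = field(default_factory=list)


# Toy mapping from the Step field to the span kinds used in this model.
SPAN_KIND = {"LLM span": "llm", "Retriever": "retriever", "Agent span": "agent"}


def units_for_step(trace: Trace, step: str) -> List[Tuple[str, str]]:
    """Return the (input, output) pairs a metric at `step` would judge in this toy model."""
    if step == "Trace":
        # One unit for the whole trace: intermediate outputs followed by the final output.
        seen = [s.output for s in trace.spans] + [trace.output]
        return [(trace.input, "\n".join(seen))]
    kind = SPAN_KIND[step]
    # One unit per span of the matching kind.
    return [(s.input, s.output) for s in trace.spans if s.kind == kind]


trace = Trace(
    input="What is the refund window?",
    output="Refunds are accepted within 30 days.",
    spans=[
        Span("retriever", "refund window", "Policy: refunds within 30 days of purchase."),
        Span("llm", "Answer from the policy text.", "Refunds are accepted within 30 days."),
    ],
)
for step in ("LLM span", "Retriever", "Agent span", "Trace"):
    print(step, len(units_for_step(trace, step)))
```

A relevance metric at the Retriever step would see only the retrieved text, while a Trace metric sees the question alongside every intermediate output.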

Input steps

| Input step | When to use |
| --- | --- |
| Single message | One text input per row. |
| Input / output pair | Rows that include both the prompt/input and the model output. |
| Full trace | Trace-level metrics that need the full request flow. |
| Full session | Session-level metrics that need multiple related traces. |
Full trace and full session inputs require user-supplied training data; synthetic generation is disabled for those shapes.
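The synthetic-generation rule can be stated as a small check. In the sketch below, `synthetic_generation_available` encodes the rule from the paragraph above, and `training_data_source` is a hypothetical planner (not a Luna Studio function) that prefers user-supplied rows when they exist and otherwise falls back to synthetic generation only where it is allowed.

```python
from enum import Enum


class InputStep(Enum):
    SINGLE_MESSAGE = "Single message"
    INPUT_OUTPUT_PAIR = "Input / output pair"
    FULL_TRACE = "Full trace"
    FULL_SESSION = "Full session"


# Input shapes for which synthetic generation is disabled.
USER_DATA_ONLY = frozenset({InputStep.FULL_TRACE, InputStep.FULL_SESSION})


def synthetic_generation_available(input_step: InputStep) -> bool:
    """False for Full trace and Full session, which need user-supplied data."""
    return input_step not in USER_DATA_ONLY


def training_data_source(input_step: InputStep, user_rows: int) -> str:
    """Hypothetical planner: decide where training rows come from."""
    if user_rows > 0:
        return "user-supplied"
    if synthetic_generation_available(input_step):
        return "synthetic"
    raise ValueError(f"{input_step.value} requires user-supplied training data")


print(training_data_source(InputStep.SINGLE_MESSAGE, 0))   # synthetic
print(training_data_source(InputStep.FULL_SESSION, 120))   # user-supplied
```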

Prompt-writing tips

  • Be specific. Define exactly what counts as a positive vs negative result.
  • Give examples. One or two short examples per outcome class is plenty.
  • Constrain the output. End the prompt with something like “Respond with only true or false.” for Boolean metrics.
  • Avoid open scales. “Score 1–10” is harder for an LLM judge to apply consistently than a binary or three-class categorical output.
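Putting the tips together, here is one possible Boolean judge prompt with a strict parser for its reply. The prompt wording, the `{answer}` placeholder, and both helper functions are illustrative assumptions, not a Galileo template. What matters is the shape: a precise definition, one short example per outcome, and a final instruction that constrains the output.

```python
# Hypothetical Boolean judge prompt that follows the tips above.
CITATION_JUDGE_PROMPT = """\
You are evaluating whether an answer cites a source.

Return true if the answer names at least one specific source (a document
title, URL, or retrieved passage) that supports its claims.
Return false if it makes claims without naming any source.

Example (true): "According to the product FAQ, refunds are accepted within 30 days."
Example (false): "Refunds are accepted within 30 days."

Answer to evaluate:
{answer}

Respond with only true or false."""


def build_prompt(answer: str) -> str:
    """Fill the hypothetical {answer} placeholder."""
    return CITATION_JUDGE_PROMPT.format(answer=answer)


def parse_boolean_verdict(raw: str) -> bool:
    """Strictly parse the judge's reply; anything other than true/false is rejected."""
    text = raw.strip().lower().rstrip(".")
    if text == "true":
        return True
    if text == "false":
        return False
    raise ValueError(f"judge returned a non-Boolean reply: {raw!r}")


print(parse_boolean_verdict(" True. "))  # True
```

Rejecting anything other than true or false, instead of guessing, makes it easy to notice when a prompt is not constraining the judge's output tightly enough.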

Submit

Continue through the run creation flow. Luna Studio saves the metric definition with the run and fine-tunes it once you launch.

Designing outside Luna Studio

Use the standalone Galileo metrics workflow when you want to design and test a metric outside of Luna Studio before bringing it into a run.

Where to go next

  • Step 1: Metric (in the run creation flow): define a custom metric inside a new run.
  • Test sets: schema rules and best practices for evaluation data.
  • Register a metric: publish a fine-tuned metric to Galileo.