
# Custom metrics

> Define a metric with a custom LLM-as-judge prompt inside the run creation flow.

Most metrics in Luna Studio come from Galileo presets or custom Galileo metrics. When you need a metric that does not fit an existing option, define a custom LLM-as-judge prompt in [Step 1 of the run creation flow](/luna-studio/ui/runs/new-run/step-1-metric#write-a-custom-prompt).

<Frame caption="Custom metric, run creation flow — Metric step with Custom prompt selected">
  <img src="https://mintcdn.com/v2galileo/-aQkdd7oOglUYIo1/images/luna-studio/metrics/create-custom-metric.png?fit=max&auto=format&n=-aQkdd7oOglUYIo1&q=85&s=c9513c4c6501fc1421368f65aef93c96" alt="Custom metric in the run creation flow" width="1024" height="583" data-path="images/luna-studio/metrics/create-custom-metric.png" />
</Frame>

## Open custom prompt mode

From Step 1 of the run creation flow, open the metric dropdown and click **Use custom prompt**.

## Fields

| Field       | Required | Notes                                                                                   |
| ----------- | -------- | --------------------------------------------------------------------------------------- |
| Metric name | No       | Optional display name. If blank, Luna Studio derives one from the run context.          |
| Output type | Yes      | The trainable return shape: Boolean or Categorical.                                     |
| Step        | Yes      | The trace step the metric runs against: LLM span, Retriever, Agent span, or Trace.      |
| Input step  | Yes      | Training input shape: Single message, Input / output pair, Full trace, or Full session. |
| Modality    | —        | Read-only. Fixed to **Text** today.                                                     |
| Prompt      | Yes      | The LLM-as-judge prompt.                                                                |
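
If you draft metric definitions outside the UI before entering them, a small local check can catch missing or invalid fields early. The sketch below mirrors the table above; the class name `CustomMetricDraft` and its `problems` method are hypothetical and are not part of any Galileo API.

```python
from dataclasses import dataclass
from typing import List, Optional

# Allowed values copied from the Fields table on this page.
OUTPUT_TYPES = ("Boolean", "Categorical")
STEPS = ("LLM span", "Retriever", "Agent span", "Trace")
INPUT_STEPS = ("Single message", "Input / output pair", "Full trace", "Full session")


@dataclass
class CustomMetricDraft:
    """Hypothetical local mirror of the form fields; not a Galileo API."""

    output_type: str
    step: str
    input_step: str
    prompt: str
    metric_name: Optional[str] = None  # optional; Luna Studio derives one if blank
    modality: str = "Text"  # read-only in the UI today

    def problems(self) -> List[str]:
        """Return a human-readable list of fields that would fail the form."""
        issues = []
        if self.output_type not in OUTPUT_TYPES:
            issues.append(f"Output type must be one of {OUTPUT_TYPES}")
        if self.step not in STEPS:
            issues.append(f"Step must be one of {STEPS}")
        if self.input_step not in INPUT_STEPS:
            issues.append(f"Input step must be one of {INPUT_STEPS}")
        if self.modality != "Text":
            issues.append("Modality is fixed to Text")
        if not self.prompt.strip():
            issues.append("Prompt is required")
        return issues


draft = CustomMetricDraft(
    output_type="Boolean",
    step="LLM span",
    input_step="Single message",
    prompt="Is this toxic? Respond with only `true` or `false`.",
)
print(draft.problems())
```

An empty list means every required field has an accepted value.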

## Output types in detail

| Output type | When to use                                                             |
| ----------- | ----------------------------------------------------------------------- |
| Boolean     | Yes/no questions ("Is this toxic?", "Does the answer cite a source?").  |
| Categorical | Picking one of a fixed list (e.g. `positive` / `neutral` / `negative`). |

Other Galileo output types are not trainable in Luna Studio yet. The output type also constrains what label values your test set can use during validation. See [Test sets](/luna-studio/ui/datasets/test-sets#required-schema).
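
To illustrate how the output type constrains labels, the sketch below checks a list of test-set labels before you upload them. The function name `invalid_labels`, the `true` / `false` spellings for Boolean labels, and passing the categorical classes explicitly are all assumptions made for this example; the Test sets page holds the authoritative schema.

```python
from typing import Iterable, List, Optional, Sequence

# Assumed spellings for Boolean labels; confirm against the Test sets schema.
BOOLEAN_LABELS = {"true", "false"}


def invalid_labels(
    labels: Iterable[str],
    output_type: str,
    categories: Optional[Sequence[str]] = None,
) -> List[str]:
    """Return the labels that do not fit the chosen output type."""
    if output_type == "Boolean":
        allowed = BOOLEAN_LABELS
    elif output_type == "Categorical":
        if not categories:
            raise ValueError("Categorical metrics need a fixed list of categories")
        allowed = set(categories)
    else:
        # Only Boolean and Categorical are trainable in Luna Studio today.
        raise ValueError(f"Output type {output_type!r} is not trainable")
    return [label for label in labels if label not in allowed]


sentiment_classes = ["positive", "neutral", "negative"]
print(invalid_labels(["positive", "angry"], "Categorical", sentiment_classes))
```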

## Steps in detail

| Step       | Where it fires                                                 |
| ---------- | -------------------------------------------------------------- |
| LLM span   | A single LLM call inside a trace. The default and most common. |
| Retriever  | A retrieval step (e.g. evaluating chunk relevance).            |
| Agent span | A single agent step inside a trace.                            |
| Trace      | The full trace — input, intermediate steps, and final output.  |

The right step depends on what your metric needs to see. To judge whether the final answer is toxic, use LLM span or Trace. To judge whether retrieved chunks are relevant, use Retriever.

## Input steps

| Input step          | When to use                                               |
| ------------------- | --------------------------------------------------------- |
| Single message      | One text input per row.                                   |
| Input / output pair | Rows that include both the prompt/input and model output. |
| Full trace          | Trace-level metrics that need the full request flow.      |
| Full session        | Session-level metrics that need multiple related traces.  |

Full trace and full session inputs require user-supplied training data; synthetic generation is disabled for those shapes.
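
The rule above can be expressed as a small lookup, which may be useful if you plan training data in a script. This is a minimal sketch; the helper name `synthetic_generation_available` is hypothetical and the logic only restates this page.

```python
# Input steps that require user-supplied training data, per this page.
USER_DATA_ONLY = {"Full trace", "Full session"}
ALL_INPUT_STEPS = {"Single message", "Input / output pair"} | USER_DATA_ONLY


def synthetic_generation_available(input_step: str) -> bool:
    """Return False for input steps where synthetic generation is disabled."""
    if input_step not in ALL_INPUT_STEPS:
        raise ValueError(f"Unknown input step: {input_step!r}")
    return input_step not in USER_DATA_ONLY


for step in sorted(ALL_INPUT_STEPS):
    print(step, synthetic_generation_available(step))
```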

## Prompt-writing tips

* **Be specific.** Define exactly what counts as a positive vs negative result.
* **Give examples.** One or two short examples per outcome class is plenty.
* **Constrain the output.** End the prompt with something like "Respond with only `true` or `false`." for Boolean metrics.
* **Avoid open scales.** "Score 1–10" is harder for an LLM judge to keep consistent than a binary or three-class categorical output.
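
One way to apply these tips together is to assemble the prompt from its parts: a specific question, a definition per outcome, one or two examples per outcome, and a closing output constraint. The helper `build_judge_prompt` below is a hypothetical sketch, not a Luna Studio feature, and the example wording is illustrative. How the text under evaluation is supplied to the prompt is not covered here.

```python
from typing import Dict, List


def build_judge_prompt(
    question: str,
    definitions: Dict[str, str],
    examples: Dict[str, List[str]],
) -> str:
    """Assemble a judge prompt: definitions, examples per class, output constraint."""
    labels = list(definitions)
    lines = [question, "", "Definitions:"]
    for label, meaning in definitions.items():
        lines.append(f"- {label}: {meaning}")
    lines += ["", "Examples:"]
    for label in labels:
        for text in examples.get(label, []):
            lines.append(f'- "{text}" -> {label}')
    # Constrain the output to the exact labels defined above.
    options = " or ".join(f"`{label}`" for label in labels)
    lines += ["", f"Respond with only {options}."]
    return "\n".join(lines)


prompt = build_judge_prompt(
    question="Is the response toxic?",
    definitions={
        "true": "The response contains insults, threats, or slurs.",
        "false": "The response is civil, even if it disagrees or refuses.",
    },
    examples={
        "true": ["You are too stupid to understand this."],
        "false": ["I can't help with that request."],
    },
)
print(prompt)
```

Paste the printed text into the **Prompt** field and adjust the wording to your use case.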

## Submit

Continue through the run creation flow. Luna Studio saves the metric definition with the run and fine-tunes the metric once you launch the run.

## Designing outside Luna Studio

Use the standalone Galileo metrics workflow when you want to design and test a metric outside of Luna Studio before bringing it into a run.

## Where to go next

<CardGroup cols={2}>
  <Card title="Step 1: Metric (in the run creation flow)" icon="sparkles" href="/luna-studio/ui/runs/new-run/step-1-metric">
    Define a custom metric inside a new run.
  </Card>

  <Card title="Test sets" icon="database" href="/luna-studio/ui/datasets/test-sets">
    Schema rules and best practices for evaluation data.
  </Card>

  <Card title="Register a metric" icon="circle-check" href="/luna-studio/ui/runs/register-metric">
    Publish a fine-tuned metric to Galileo.
  </Card>
</CardGroup>
