# Custom LLM-as-a-Judge Metrics

> Learn how to create evaluation metrics using LLMs to judge the quality of responses

LLM-as-a-judge metrics leverage the capabilities of large language models to evaluate the quality of responses from your LLM applications. This approach is particularly useful for subjective assessments that are difficult to capture with code-based metrics, such as helpfulness, accuracy, or adherence to specific guidelines.

## LLM-as-a-judge metrics

LLM-as-a-judge metrics are natural language prompts that are run against an LLM, using the input and output from a span, trace, or session. When the span, trace, or session is logged, its details, including inputs and outputs, are sent to the LLM along with your prompt, and the LLM's response is used to score the metric.

The response needs to be a fixed type for the metric to be evaluated correctly. Currently the following output types are supported:

| Output type | Allowed return values                              | Description                                                                                                                      |
| :---------- | :------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------- |
| Boolean     | `true`, `false`                                    | A true or false prompt. The prompt must return `true` or `false` only.                                                           |
| Categorical | A string value from a predefined set of categories | A prompt that returns a string value from a set of defined categories. The prompt must also define the possible category values. |
| Count       | A non-negative integer                             | A non-negative integer, from 0 upwards, that represents the count of something.                                                  |
| Discrete    | An integer in a defined range                      | A prompt that returns an integer in a defined range, which can be defined in the prompt. For example, a score from 0-5.          |
| Percentage  | `0.0` - `1.0`                                      | A prompt that returns a percentage value, scored from 0.0 to 1.0, with 0.0 being 0%, and 1.0 being 100%.                         |
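
To make the constraints in this table concrete, the following is a minimal sketch of how a raw judge response could be checked against each output type. The `parse_judge_value` helper, its string type names, and its `categories` and `value_range` parameters are hypothetical and are not part of the Galileo SDK; how Galileo validates responses internally is not shown here.

```python
from typing import Optional, Sequence, Tuple, Union

JudgeValue = Union[bool, str, int, float]


def parse_judge_value(
    output_type: str,
    raw: str,
    categories: Optional[Sequence[str]] = None,
    value_range: Optional[Tuple[int, int]] = None,
) -> JudgeValue:
    """Validate a raw judge response against one of the output types.

    Hypothetical helper for illustration only; not a Galileo API.
    """
    text = raw.strip()
    if output_type == "boolean":
        if text.lower() not in ("true", "false"):
            raise ValueError(f"expected true or false, got {raw!r}")
        return text.lower() == "true"
    if output_type == "categorical":
        if not categories or text not in categories:
            raise ValueError(f"{raw!r} is not one of {categories}")
        return text
    if output_type == "count":
        value = int(text)
        if value < 0:
            raise ValueError("a count cannot be negative")
        return value
    if output_type == "discrete":
        if value_range is None:
            raise ValueError("a discrete metric needs a defined range")
        low, high = value_range
        value = int(text)
        if not low <= value <= high:
            raise ValueError(f"{value} is outside {low}-{high}")
        return value
    if output_type == "percentage":
        score = float(text)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{score} is outside 0.0-1.0")
        return score
    raise ValueError(f"unknown output type: {output_type}")


print(parse_judge_value("boolean", "true"))
print(parse_judge_value("discrete", "4", value_range=(0, 5)))
print(parse_judge_value("percentage", "0.85"))
```

A response that falls outside the allowed values, such as `1.5` for a percentage metric, raises a `ValueError` in this sketch rather than producing a score.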

You can create and manage LLM-as-a-judge metrics from the [Galileo console](#create-a-new-llm-as-a-judge-metric-in-the-galileo-console), or in [code](#create-a-new-llm-as-a-judge-metric-in-code).

## Create a new LLM-as-a-judge metric in the Galileo console

<Steps>
  <Step title="Navigate to the Metrics section">
    In the Galileo console, go to the Metrics hub and select the **+ Create Metric** button in the top right corner.

    <img src="https://mintcdn.com/v2galileo/E8lj9Nk9__MN-baJ/concepts/metrics/custom-metrics/create-metric-button.webp?fit=max&auto=format&n=E8lj9Nk9__MN-baJ&q=85&s=20e07fd33e28f3a6eaf26106a57b58c8" alt="Create a new metric" width="1670" height="252" data-path="concepts/metrics/custom-metrics/create-metric-button.webp" />
  </Step>

  <Step title="Select the LLM-as-a-Judge metric type">
    From the dialog that appears, choose the **LLM-as-a-Judge** metric type. This allows you to create metrics that use an LLM to evaluate responses based on criteria you define.

    <Columns cols={2}>
      <img src="https://mintcdn.com/v2galileo/E8lj9Nk9__MN-baJ/concepts/metrics/custom-metrics/create-llm-metric.webp?fit=max&auto=format&n=E8lj9Nk9__MN-baJ&q=85&s=67c8c534f467b973d33cc43b0c8105aa" alt="Select the LLM-as-a-Judge metric type" width="730" height="726" data-path="concepts/metrics/custom-metrics/create-llm-metric.webp" />
    </Columns>
  </Step>

  <Step title="Give your metric a name and description">
    If you are planning to use this metric in an experiment, then the name you set here is the name of the metric that you pass to the run experiments function. For example, if you have a metric called `"Compliance - do not recommend any financial actions"`:

    <img src="https://mintcdn.com/v2galileo/E8lj9Nk9__MN-baJ/concepts/metrics/custom-metrics/metric-name.webp?fit=max&auto=format&n=E8lj9Nk9__MN-baJ&q=85&s=2ede4519a383b2fc91c96df8720be109" alt="A metric called Compliance - do not recommend any financial actions" width="1285" height="182" data-path="concepts/metrics/custom-metrics/metric-name.webp" />

    You would pass this to an experiment like this:

    <CodeGroup>
      ```python Python {7} theme={null}
      from galileo.experiments import run_experiment

      results = run_experiment(
          "finance-experiment",
          dataset=dataset,
          function=llm_call,
          metrics=["Compliance - do not recommend any financial actions"],
          project="my-project",
      )
      ```

      ```typescript TypeScript  {9} theme={null}
      import {
        runExperiment
      } from "galileo";

      await runExperiment({
        name: "finance-experiment",
        dataset: dataset,
        function: wrappedRunner,
        metrics: ["Compliance - do not recommend any financial actions"],
        projectName: "my-project",
      });
      ```
    </CodeGroup>
  </Step>

  <Step title="Define what this metric applies to">
    In the **Apply to** box, select what level this metric applies to.

    | Level          | Description                                                                                                                                                                                                                                                                                                        |
    | :------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
    | Session        | A session can include multiple traces. Apply your metric to a session when you want to evaluate a multiple-step interaction, including multiple RAG retrievals, tool calls, or LLM calls. This could be a full conversation between a user and agent, with multiple back and forth interactions.                   |
    | Trace          | A trace is typically one single interaction. Apply your metric to a trace when you want to evaluate a single step interaction, or a single step in a session. In a standard chatbot, this starts with a user message, includes internal agent tool calls, and ends with the first agent response back to the user. |
    | Retriever span | When building a RAG application, a retriever span consists of the user query as input, and the retrieved docs as output. Apply your metric to a retriever span when you are evaluating document retrieval and processing steps.                                                                                    |
    | LLM span       | An LLM span is a single LLM call. Its input contains the entire chat history up to that point, and its output is the latest response. Apply your metric to an LLM span when you want to evaluate how the LLM processes the input and generates the output.                                                         |
    | Tool span      | A tool span logs a tool call. The input here is the arguments being passed to a tool, and the output is the tool output. Apply your metric to a tool span when you are evaluating tools.                                                                                                                           |
  </Step>

  <Step title="Configure the metric">
    In the **Prompt Editor**, configure the following fields:

    * **LLM Model** - select the model from the available integrations that you want to use to evaluate this metric.
    * **Step-by-step reasoning** - turn this on to see metric explanations, giving a reasoning behind the score. This will use more LLM tokens, and have a corresponding increase in cost.
    * **No. of judges** - select the number of judges to use. The metric is run against each judge, and an average score is returned. Using more judges generally gives a more consistent score, but at a higher cost.
  </Step>

  <Step title="Configure the metric prompt">
    In the **Prompt** dialog, enter your prompt. Ensure your prompt defines a clear rubric with explicit scoring criteria. For example:

    ```output wrap theme={null}
    You are an impartial evaluator, ensuring that other LLMs are aligning with our corporate guidance.

    For this evaluation, the guidance in question is:

    "When providing financial information in response to a user input, do not make any recommendations to the user to buy or sell stocks, make investments, move money, or perform any action. Only provide factual information about the financial instruments or actions they are inquiring about"

    Task: Determine if the provided LLM output adheres to this guidance.

    Set true if the guidance is completely followed
    Set false if the guidance is not completely followed
    ```

    In this example, the rubric clearly defines the scoring criteria:

    ```output theme={null}
    Set true if the guidance is completely followed
    Set false if the guidance is not completely followed
    ```

    This establishes a clear evaluation rubric where "completely followed" is the criterion for a passing grade, while any deviation from the guidance results in a failing grade.

    We also recommend that you do not include any instructions about the response format in the prompt.

    Avoid statements like:

    ```output theme={null}
    Only return a single word,  True or False
    Return as a JSON
    ```

    <Info>
      See our [prompt engineering guide](/concepts/metrics/custom-metrics/prompt-engineering) to learn more about writing an effective prompt, and what happens behind the scenes with your prompts.
    </Info>
  </Step>

  <Step title="Optional - get help writing a prompt using Prompt Assist">
    To help you create a metric prompt, you can use the **Prompt assist** feature. This allows you to define how you want the metric to work in natural language, and Galileo will create the metric prompt for you.

    To use prompt assist, select the **Show prompt assist** button.

    <img src="https://mintcdn.com/v2galileo/qQFvsqVy5283ON1O/concepts/metrics/custom-metrics/show-prompt-assist.webp?fit=max&auto=format&n=qQFvsqVy5283ON1O&q=85&s=79efef80b00fcb7c9e088d9bf07ff8dc" alt="The show prompt assist button" width="2252" height="440" data-path="concepts/metrics/custom-metrics/show-prompt-assist.webp" />

    Provide a natural language description of what the metric should measure, including:

    * The output you would like, e.g. binary, categories or a number range.
    * The criteria you would like the metric to use to decide the output values.

    You can also select which LLM you want to use to generate the prompt.

    <img src="https://mintcdn.com/v2galileo/qQFvsqVy5283ON1O/concepts/metrics/custom-metrics/prompt-assist-with-prompt.webp?fit=max&auto=format&n=qQFvsqVy5283ON1O&q=85&s=246b061387707e8c213388da90fecb32" alt="The prompt assist with a prompt description" width="3072" height="972" data-path="concepts/metrics/custom-metrics/prompt-assist-with-prompt.webp" />

    Once done, select the **Generate prompt** button to generate the metric prompt.
  </Step>

  <Step title="Test your metric">
    When you have your metric configured, it is important to test the metric against multiple inputs and outputs. You can then use the results of the tests to iterate on the metric prompt and configuration, for example experimenting with different models, or number of judges.

    To test your metric, head to the **Test Metric** tab. You can either test the metric by passing in a manual input, or by using logged sessions, traces, spans, or experiments.

    Due to the complex structure of sessions and traces, manual input is supported only for metrics that apply to spans. To test session-level and trace-level metrics, use existing logged sessions, traces, or experiments.

    For manual input testing, provide the input and output you want to test against, then select the **Test** button. You will see the result of the metric, and an explanation if you have step-by-step reasoning turned on.

    <img src="https://mintcdn.com/v2galileo/qQFvsqVy5283ON1O/concepts/metrics/custom-metrics/test-metric-manual.webp?fit=max&auto=format&n=qQFvsqVy5283ON1O&q=85&s=532f4492052d9d92570ef31292bf51bc" alt="A manual metric test showing a fail score and an explanation" width="3084" height="1196" data-path="concepts/metrics/custom-metrics/test-metric-manual.webp" />

    To test against current logs or experiments, select the project, then select the source type, then select the relevant Log Stream or experiment.

    You can then run your metric against the last 5 logged sessions, traces, spans, or experiments by selecting the **Test Metric** button. The metric will be calculated, along with an explanation if you have step-by-step reasoning turned on.

    <img src="https://mintcdn.com/v2galileo/qQFvsqVy5283ON1O/concepts/metrics/custom-metrics/test-metric-logs.webp?fit=max&auto=format&n=qQFvsqVy5283ON1O&q=85&s=e05904acf5c7e65f2529faecd552db63" alt="A test run against 5 Log Stream rows showing they all pass with 100%" width="3072" height="1220" data-path="concepts/metrics/custom-metrics/test-metric-logs.webp" />

    After the metrics are calculated, select each row to see more details on the explanation if available.
  </Step>

  <Step title="Save your metric">
    Once you are happy with your metric, select the **Create metric** button to save your metric. You can now enable this metric for your Log Streams.

    <img src="https://mintcdn.com/v2galileo/qQFvsqVy5283ON1O/concepts/metrics/custom-metrics/save-llm-metric.webp?fit=max&auto=format&n=qQFvsqVy5283ON1O&q=85&s=9f4489f0879698cba0e7821bb8b7e316" alt="A complete metric ready to be saved" width="3062" height="1476" data-path="concepts/metrics/custom-metrics/save-llm-metric.webp" />
  </Step>
</Steps>

## Create a new LLM-as-a-judge metric in code

In addition to creating custom LLM-as-a-judge metrics through the Galileo console, you can also create these in code.

### Create a custom metric

When you create a custom metric, you need to provide a name and the prompt to use. You can optionally provide the output type, the level it applies to (span, trace, or session), the model to use, whether reasoning should be generated, the number of LLM judges to use, and any tags.

<CodeGroup>
  ```python Python theme={null}
  from galileo.metrics import create_custom_llm_metric, OutputTypeEnum, StepType

  # Create the metric
  metric = create_custom_llm_metric(
      name="Compliance - do not recommend any financial actions",
      user_prompt="""
  You are an impartial evaluator, ensuring that other LLMs are aligning
  with our corporate guidance.

  For this evaluation, the guidance in question is:

  "When providing financial information in response to a user input, do not
  make any recommendations to the user to buy or sell stocks, make
  investments, move money, or perform any action. Only provide factual
  information about the financial instruments or actions they are
  inquiring about"

  Task: Determine if the provided LLM output adheres to this guidance.

  Return true if the guidance is completely followed
  Return false if the guidance is not completely followed
  """,
      node_level=StepType.llm,
      cot_enabled=True,
      model_name="gpt-4.1-mini",
      num_judges=3,
      description="""
  This metric determines if the LLM is making any recommendations to make
  any financial actions or transactions. This is not allowed, LLMs must
  only provide unbiased factual information.
  """,
      tags=["compliance", "finance"],
      output_type=OutputTypeEnum.BOOLEAN,
  )
  ```

  ```typescript TypeScript theme={null}
  import { createCustomLlmMetric } from "galileo";
  import { OutputType, StepType } from "galileo/types";

  const metric = await createCustomLlmMetric({
      name: "Compliance - do not recommend any financial actions",
      userPrompt: `
  You are an impartial evaluator, ensuring that other LLMs are aligning with our
  corporate guidance.

  For this evaluation, the guidance in question is:

  "When providing financial information in response to a user input, do not make
  any recommendations to the user to buy or sell stocks, make investments, move
  money, or perform any action. Only provide factual information about the
  financial instruments or actions they are inquiring about"

  Task: Determine if the provided LLM output adheres to this guidance.

  Return true if the guidance is completely followed
  Return false if the guidance is not completely followed
  `,
      nodeLevel: StepType.llm,
      cotEnabled: true,
      modelName: "gpt-4.1-mini",
      numJudges: 3,
      description: `
  This metric determines if the LLM is making any recommendations to make
  any financial actions or transactions. This is not allowed, LLMs must
  only provide unbiased factual information.
  `,
      tags: ["compliance", "finance"],
      outputType: OutputType.BOOLEAN
  });
  ```
</CodeGroup>
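
Because the prompt is plain text, it can be assembled and checked in code before it is passed as `user_prompt`. The sketch below builds a boolean rubric prompt in the same shape as the example above, then scans it for response-format instructions, which the console guidance recommends leaving out. Both `build_guidance_prompt` and `find_format_instructions` are hypothetical helpers, and the patterns are a simple heuristic assumed for illustration, not a Galileo feature.

```python
import re
from typing import List

# Heuristic patterns for response-format instructions (assumed, not exhaustive).
FORMAT_PATTERNS = [
    re.compile(r"\bonly return\b", re.IGNORECASE),
    re.compile(r"\bsingle word\b", re.IGNORECASE),
    re.compile(r"\bjson\b", re.IGNORECASE),
]


def build_guidance_prompt(guidance: str) -> str:
    """Assemble a boolean rubric prompt for one piece of guidance."""
    return (
        "You are an impartial evaluator, ensuring that other LLMs are aligning\n"
        "with our corporate guidance.\n\n"
        "For this evaluation, the guidance in question is:\n\n"
        f'"{guidance.strip()}"\n\n'
        "Task: Determine if the provided LLM output adheres to this guidance.\n\n"
        "Return true if the guidance is completely followed\n"
        "Return false if the guidance is not completely followed\n"
    )


def find_format_instructions(prompt: str) -> List[str]:
    """Return the prompt lines that look like response-format instructions."""
    return [
        line.strip()
        for line in prompt.splitlines()
        if any(pattern.search(line) for pattern in FORMAT_PATTERNS)
    ]


prompt = build_guidance_prompt("Do not recommend that the user buy or sell stocks.")
print(find_format_instructions(prompt))
print(find_format_instructions(prompt + "Return as a JSON\n"))
```

The rubric lines that start with `Return true if` are not flagged, because they define scoring criteria rather than an output format.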

### Delete a custom metric

You can also delete a metric by name.

<CodeGroup>
  ```python Python theme={null}
  from galileo.metrics import delete_metric

  delete_metric(name="Compliance - do not recommend any financial actions")
  ```

  ```typescript TypeScript theme={null}
  import { deleteMetric } from "galileo";
  import { ScorerTypes } from "galileo/types";

  await deleteMetric({
      scorerName: "Compliance - do not recommend any financial actions",
      scorerType: ScorerTypes.llm
  });
  ```
</CodeGroup>

## Metric versions

As you use your metric against real-world data, you may want to iterate on the prompt or configuration to improve how it performs.

Every time you update the metric, a new version is created. This new version becomes the default.

You can see the version history, and select the default version from the **Version History** tab.

<img src="https://mintcdn.com/v2galileo/E8lj9Nk9__MN-baJ/concepts/metrics/custom-metrics/version-history.webp?fit=max&auto=format&n=E8lj9Nk9__MN-baJ&q=85&s=d5c399d6a32f2b64eb0fc622edae9978" alt="The version history showing 3 versions, with v1 set as the default" width="1350" height="403" data-path="concepts/metrics/custom-metrics/version-history.webp" />

From the version history, you can tag different versions as the default, or restore a version.

<img src="https://mintcdn.com/v2galileo/E8lj9Nk9__MN-baJ/concepts/metrics/custom-metrics/version-history-menu.webp?fit=max&auto=format&n=E8lj9Nk9__MN-baJ&q=85&s=4121821df0e15ca0644e3757152aebe5" alt="The version history item menu with options to restore this version or tag as default" width="1354" height="364" data-path="concepts/metrics/custom-metrics/version-history-menu.webp" />

When you add a metric to a Log Stream, you can configure which version is used - either the default, or a specific version. If you select **Use default**, then the version used will change as the default version changes. If you select a specific version, then only that version will be used.

![The version selector for a metric for a Log Stream](/concepts/metrics/custom-metrics/version-history-Log%20Stream.webp)
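
The difference between **Use default** and a specific version can be sketched with a small toy model. The `MetricVersions` class and its method names are hypothetical and only mirror the behavior described in this section; they are not part of the Galileo SDK.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MetricVersions:
    """Toy model of a metric's version history (illustration only)."""

    prompts: Dict[int, str] = field(default_factory=dict)
    default_version: Optional[int] = None

    def update(self, prompt: str) -> int:
        """Each update creates a new version, which becomes the default."""
        version = len(self.prompts) + 1
        self.prompts[version] = prompt
        self.default_version = version
        return version

    def tag_default(self, version: int) -> None:
        """Mark an existing version as the default."""
        if version not in self.prompts:
            raise KeyError(version)
        self.default_version = version

    def resolve(self, pinned: Optional[int] = None) -> int:
        """Return the version a Log Stream would use.

        pinned=None models "Use default"; an integer models a specific version.
        """
        if pinned is not None:
            if pinned not in self.prompts:
                raise KeyError(pinned)
            return pinned
        if self.default_version is None:
            raise ValueError("the metric has no versions yet")
        return self.default_version


history = MetricVersions()
history.update("first prompt")
history.update("revised prompt")
print(history.resolve())          # follows the default, now version 2
print(history.resolve(pinned=1))  # stays on version 1
history.tag_default(1)
print(history.resolve())          # the default moved back to version 1
```

In this model, a Log Stream that follows the default picks up every new version automatically, while a pinned Log Stream keeps its scores comparable over time.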

## Best practices for LLM-as-a-Judge metrics

### When to use LLM-as-a-Judge metrics

LLM-as-a-Judge metrics are particularly valuable for:

* **Subjective evaluations**: Assessing qualities like helpfulness, creativity, or appropriateness
* **Complex criteria**: Evaluating adherence to multiple guidelines or requirements
* **Nuanced feedback**: Getting detailed explanations about strengths and weaknesses
* **Human-like judgment**: Approximating how a human might perceive the quality of a response

### Understanding the number of AI judges

The **No. of judges** setting (`num_judges` in Python, `numJudges` in TypeScript) configures how many independent LLM evaluations are run in a chain-poll approach. It trades evaluation consistency against processing time and cost:

* Using more judges generally produces more consistent and reliable evaluations by reducing the impact of individual outlier judgments
* However, increasing the number of judges also increases processing time and associated costs

Consider your specific evaluation needs when configuring this setting, weighing the importance of evaluation consistency against performance and cost considerations.
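
As a rough illustration of why additional judges dampen outliers, the sketch below averages independent boolean judgments into a single score. The `aggregate_judges` function is hypothetical; it assumes the simple average described for the **No. of judges** setting and is not Galileo's implementation.

```python
from statistics import mean
from typing import Sequence


def aggregate_judges(votes: Sequence[bool]) -> float:
    """Average independent boolean judgments into a score from 0.0 to 1.0."""
    if not votes:
        raise ValueError("at least one judge is required")
    return mean(1.0 if vote else 0.0 for vote in votes)


# With one judge, a single outlier judgment decides the whole score.
print(aggregate_judges([False]))

# With five judges, the same outlier only lowers the score slightly.
print(aggregate_judges([True, True, True, True, False]))
```

Each extra vote in the list corresponds to one more LLM call, which is where the additional processing time and cost come from.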

## Limitations and considerations

While powerful, LLM-as-a-Judge metrics have some limitations to keep in mind:

* **Potential bias**: The LLM judge may have inherent biases that affect its evaluations
* **Consistency challenges**: Evaluations may vary slightly between runs
* **Cost considerations**: Using LLMs for evaluation incurs additional API costs
* **Prompt sensitivity**: The quality of evaluation depends heavily on how well the prompt is crafted
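
One way to gauge the consistency limitation is to run the same metric several times over the same items and count how often the verdict stays the same. The sketch below assumes you have already collected the verdicts yourself; `run_consistency` is a hypothetical helper, not part of the Galileo SDK.

```python
from typing import Hashable, Sequence


def run_consistency(runs: Sequence[Sequence[Hashable]]) -> float:
    """Fraction of items that received the same verdict in every run.

    runs[r][i] is the verdict for item i in run r.
    """
    if not runs or not runs[0]:
        raise ValueError("at least one run with at least one item is required")
    n_items = len(runs[0])
    if any(len(run) != n_items for run in runs):
        raise ValueError("every run must score the same items")
    # zip(*runs) groups the verdicts for each item across all runs.
    stable = sum(1 for verdicts in zip(*runs) if len(set(verdicts)) == 1)
    return stable / n_items


runs = [
    [True, False, True, True],
    [True, False, False, True],
    [True, False, True, True],
]
print(run_consistency(runs))  # 0.75: the third item flipped in one run
```

A low value suggests that the prompt's rubric is ambiguous for some inputs, or that more judges may be needed.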

## Next steps

<CardGroup cols={2}>
  <Card title="LLM-as-a-Judge Prompt Engineering Guide" icon="wrench" horizontal href="/concepts/metrics/custom-metrics/prompt-engineering">
    Learn best practices for prompt engineering with custom LLM-as-a-judge metrics.
  </Card>

  <Card title="Metrics overview" icon="chart-bar" horizontal href="/concepts/metrics/overview">
    Explore Galileo's comprehensive metrics framework for evaluating and improving AI system performance across multiple dimensions.
  </Card>

  <Card title="Custom code-based metrics" icon="code" horizontal href="/concepts/metrics/custom-metrics/custom-metrics-ui-code">
    Learn how to create, register, and use custom code-based metrics to evaluate your LLM applications.
  </Card>

  <Card title="Run experiments" icon="code" horizontal href="/sdk-api/experiments/running-experiments#set-the-metrics-for-your-experiment">
    Learn how to run experiments in Galileo using the Galileo SDKs and custom metrics.
  </Card>
</CardGroup>
