> ## Documentation Index
> Fetch the complete documentation index at: https://docs.galileo.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Experiment Metrics

> Learn how to use metrics in your experiments

{/* <!-- markdownlint-enable MD044 --> */}

Galileo AI provides a set of built-in, preset metrics designed to evaluate various aspects of LLM, agent, and retrieval-based workflows. You can also create custom metrics using LLM-as-a-judge, or code. This guide provides a reference for using these metrics in your experiments.

## Out-of-the-box metrics reference

The table below summarizes gives the constants used in code to access each metric. To use these metrics, import the relevant enum.

<CodeGroup>
  ```python Python theme={null}
  from galileo import GalileoMetrics
  ```

  ```typescript TypeScript theme={null}
  import { GalileoMetrics } from "galileo";
  ```
</CodeGroup>

### LLM-as-a-judge Metrics

<Tabs>
  <Tab title="Python">
    | Metric                                                                                           | Enum Value                                                        |
    | :----------------------------------------------------------------------------------------------- | :---------------------------------------------------------------- |
    | [Action Advancement](/concepts/metrics/agentic/action-advancement)                               | `GalileoMetrics.action_advancement`                               |
    | [Action Completion](/concepts/metrics/agentic/action-completion)                                 | `GalileoMetrics.action_completion`                                |
    | [Agent Efficiency](/concepts/metrics/agentic/agent-efficiency)                                   | `GalileoMetrics.agent_efficiency`                                 |
    | [Agent Flow](/concepts/metrics/agentic/agent-flow)                                               | `GalileoMetrics.agent_flow`                                       |
    | [BLEU](/concepts/metrics/expression-and-readability/bleu-and-rouge#understanding-bleu-score)     | `GalileoMetrics.bleu`                                             |
    | [Chunk Attribution Utilization](/concepts/metrics/rag/generation-quality/chunk-attribution)      | `GalileoMetrics.chunk_attribution_utilization`                    |
    | [Completeness](/concepts/metrics/rag/generation-quality/completeness)                            | `GalileoMetrics.completeness`                                     |
    | [Context Adherence](/concepts/metrics/rag/generation-quality/context-adherence)                  | `GalileoMetrics.context_adherence`                                |
    | [Context Precision](/concepts/metrics/rag/retrieval-quality/context-precision)                   | `GalileoMetrics.context_precision`                                |
    | [Context Relevance (Query Adherence)](/concepts/metrics/rag/retrieval-quality/context-relevance) | `GalileoMetrics.context_relevance`                                |
    | [Conversation Quality](/concepts/metrics/agentic/conversation-quality)                           | `GalileoMetrics.conversation_quality`                             |
    | [Correctness (factuality)](/concepts/metrics/response-quality/correctness)                       | `GalileoMetrics.correctness`                                      |
    | [Ground Truth Adherence](/concepts/metrics/response-quality/ground-truth-adherence)              | `GalileoMetrics.ground_truth_adherence`                           |
    | [Instruction Adherence](/concepts/metrics/response-quality/instruction-adherence)                | `GalileoMetrics.instruction_adherence`                            |
    | [Interruption Detection](/concepts/metrics/multimodal-quality/interruption-detection)            | `GalileoMetrics.interruption_detection`                           |
    | [PII (personally identifiable information)](/concepts/metrics/safety-and-compliance/pii)         | `GalileoMetrics.input_pii`, `GalileoMetrics.output_pii`           |
    | [Prompt Injection](/concepts/metrics/safety-and-compliance/prompt-injection)                     | `GalileoMetrics.prompt_injection`                                 |
    | [Prompt Perplexity](/concepts/metrics/model-confidence/prompt-perplexity)                        | `GalileoMetrics.prompt_perplexity`                                |
    | [ROUGE](/concepts/metrics/expression-and-readability/bleu-and-rouge#understanding-rouge)         | `GalileoMetrics.rouge`                                            |
    | [Sexism / Bias](/concepts/metrics/safety-and-compliance/sexism)                                  | `GalileoMetrics.input_sexism`, `GalileoMetrics.output_sexism`     |
    | [Tone](/concepts/metrics/expression-and-readability/tone)                                        | `GalileoMetrics.input_tone`, `GalileoMetrics.output_tone`         |
    | [Tool Errors](/concepts/metrics/agentic/tool-error)                                              | `GalileoMetrics.tool_error_rate`                                  |
    | [Tool Selection Quality](/concepts/metrics/agentic/tool-selection-quality)                       | `GalileoMetrics.tool_selection_quality`                           |
    | [Reasoning Coherence](/concepts/metrics/agentic/reasoning-coherence)                             | `GalileoMetrics.reasoning_coherence`                              |
    | [SQL Correctness](/concepts/metrics/text2sql/sql-correctness)                                    | `GalileoMetrics.sql_correctness`                                  |
    | [SQL Adherence](/concepts/metrics/text2sql/sql-adherence)                                        | `GalileoMetrics.sql_adherence`                                    |
    | [SQL Injection](/concepts/metrics/text2sql/sql-injection)                                        | `GalileoMetrics.sql_injection`                                    |
    | [SQL Efficiency](/concepts/metrics/text2sql/sql-efficiency)                                      | `GalileoMetrics.sql_efficiency`                                   |
    | [Toxicity](/concepts/metrics/safety-and-compliance/toxicity)                                     | `GalileoMetrics.input_toxicity`, `GalileoMetrics.output_toxicity` |
    | [User Intent Change](/concepts/metrics/agentic/intent-change)                                    | `GalileoMetrics.user_intent_change`                               |
    | [Visual Fidelity](/concepts/metrics/multimodal-quality/visual-fidelity)                          | `GalileoMetrics.visual_fidelity`                                  |
    | [Visual Quality](/concepts/metrics/multimodal-quality/visual-quality)                            | `GalileoMetrics.visual_quality`                                   |
  </Tab>

  <Tab title="TypeScript">
    | Metric                                                                                           | Enum Value                                                      |
    | :----------------------------------------------------------------------------------------------- | :-------------------------------------------------------------- |
    | [Action Advancement](/concepts/metrics/agentic/action-advancement)                               | `GalileoMetrics.actionAdvancement`                              |
    | [Action Completion](/concepts/metrics/agentic/action-completion)                                 | `GalileoMetrics.actionCompletion`                               |
    | [Agent Efficiency](/concepts/metrics/agentic/agent-efficiency)                                   | `GalileoMetrics.agentEfficiency`                                |
    | [Agent Flow](/concepts/metrics/agentic/agent-flow)                                               | `GalileoMetrics.agentFlow`                                      |
    | [BLEU](/concepts/metrics/expression-and-readability/bleu-and-rouge#understanding-bleu-score)     | `GalileoMetrics.bleu`                                           |
    | [Chunk Attribution Utilization](/concepts/metrics/rag/generation-quality/chunk-attribution)      | `GalileoMetrics.chunkAttributionUtilization`                    |
    | [Completeness](/concepts/metrics/rag/generation-quality/completeness)                            | `GalileoMetrics.completeness`                                   |
    | [Context Adherence](/concepts/metrics/rag/generation-quality/context-adherence)                  | `GalileoMetrics.contextAdherence`                               |
    | [Context Precision](/concepts/metrics/rag/retrieval-quality/context-precision)                   | `GalileoMetrics.contextPrecision`                               |
    | [Context Relevance (Query Adherence)](/concepts/metrics/rag/retrieval-quality/context-relevance) | `GalileoMetrics.contextRelevance`                               |
    | [Conversation Quality](/concepts/metrics/agentic/conversation-quality)                           | `GalileoMetrics.conversationQuality`                            |
    | [Correctness (factuality)](/concepts/metrics/response-quality/correctness)                       | `GalileoMetrics.correctness`                                    |
    | [Ground Truth Adherence](/concepts/metrics/response-quality/ground-truth-adherence)              | `GalileoMetrics.groundTruthAdherence`                           |
    | [Instruction Adherence](/concepts/metrics/response-quality/instruction-adherence)                | `GalileoMetrics.instructionAdherence`                           |
    | [PII (personally identifiable information)](/concepts/metrics/safety-and-compliance/pii)         | `GalileoMetrics.inputPii`, `GalileoMetrics.outputPii`           |
    | [Prompt Injection](/concepts/metrics/safety-and-compliance/prompt-injection)                     | `GalileoMetrics.promptInjection`                                |
    | [Prompt Perplexity](/concepts/metrics/model-confidence/prompt-perplexity)                        | `GalileoMetrics.promptPerplexity`                               |
    | [ROUGE](/concepts/metrics/expression-and-readability/bleu-and-rouge#understanding-rouge)         | `GalileoMetrics.rouge`                                          |
    | [Sexism / Bias](/concepts/metrics/safety-and-compliance/sexism)                                  | `GalileoMetrics.inputSexism`, `GalileoMetrics.outputSexism`     |
    | [Tone](/concepts/metrics/expression-and-readability/tone)                                        | `GalileoMetrics.inputTone`, `GalileoMetrics.outputTone`         |
    | [Tool Errors](/concepts/metrics/agentic/tool-error)                                              | `GalileoMetrics.toolErrorRate`                                  |
    | [Tool Selection Quality](/concepts/metrics/agentic/tool-selection-quality)                       | `GalileoMetrics.toolSelectionQuality`                           |
    | [SQL Correctness](/concepts/metrics/text2sql/sql-correctness)                                    | `GalileoMetrics.sqlCorrectness`                                 |
    | [SQL Adherence](/concepts/metrics/text2sql/sql-adherence)                                        | `GalileoMetrics.sqlAdherence`                                   |
    | [SQL Injection](/concepts/metrics/text2sql/sql-injection)                                        | `GalileoMetrics.sqlInjection`                                   |
    | [SQL Efficiency](/concepts/metrics/text2sql/sql-efficiency)                                      | `GalileoMetrics.sqlEfficiency`                                  |
    | [Toxicity](/concepts/metrics/safety-and-compliance/toxicity)                                     | `GalileoMetrics.inputToxicity`, `GalileoMetrics.outputToxicity` |
    | [User Intent Change](/concepts/metrics/agentic/intent-change)                                    | `GalileoMetrics.userIntentChange`                               |
  </Tab>
</Tabs>

### Luna-2 metrics

If you are using the [Galileo Luna-2 model](/concepts/luna/luna), then use these metric values.

<Tabs>
  <Tab title="Python">
    | Metric                                                                                      | Enum Value                                                                  |
    | :------------------------------------------------------------------------------------------ | :-------------------------------------------------------------------------- |
    | [Action Advancement](/concepts/metrics/agentic/action-advancement)                          | `GalileoMetrics.action_advancement_luna`                                    |
    | [Action Completion](/concepts/metrics/agentic/action-completion)                            | `GalileoMetrics.action_completion_luna`                                     |
    | [Chunk Attribution Utilization](/concepts/metrics/rag/generation-quality/chunk-attribution) | `GalileoMetrics.chunk_attribution_utilization_luna`                         |
    | [Completeness](/concepts/metrics/rag/generation-quality/completeness)                       | `GalileoMetrics.completeness_luna`                                          |
    | [Context Adherence](/concepts/metrics/rag/generation-quality/context-adherence)             | `GalileoMetrics.context_adherence_luna`                                     |
    | [PII (personally identifiable information)](/concepts/metrics/safety-and-compliance/pii)    | `GalileoMetrics.input_pii`, `GalileoMetrics.output_pii`                     |
    | [Prompt Injection](/concepts/metrics/safety-and-compliance/prompt-injection)                | `GalileoMetrics.prompt_injection_luna`                                      |
    | [Sexism / Bias](/concepts/metrics/safety-and-compliance/sexism)                             | `GalileoMetrics.input_sexism_luna`, `GalileoMetrics.output_sexism_luna`     |
    | [Tone](/concepts/metrics/expression-and-readability/tone)                                   | `GalileoMetrics.input_tone`, `GalileoMetrics.output_tone`                   |
    | [Tool Errors](/concepts/metrics/agentic/tool-error)                                         | `GalileoMetrics.tool_error_rate_luna`                                       |
    | [Tool Selection Quality](/concepts/metrics/agentic/tool-selection-quality)                  | `GalileoMetrics.tool_selection_quality_luna`                                |
    | [Toxicity](/concepts/metrics/safety-and-compliance/toxicity)                                | `GalileoMetrics.input_toxicity_luna`, `GalileoMetrics.output_toxicity_luna` |
    | [Uncertainty](/concepts/metrics/model-confidence/uncertainty)                               | `GalileoMetrics.uncertainty`                                                |
  </Tab>

  <Tab title="TypeScript">
    | Metric                                                                                      | Enum Value                                                              |
    | :------------------------------------------------------------------------------------------ | :---------------------------------------------------------------------- |
    | [Action Advancement](/concepts/metrics/agentic/action-advancement)                          | `GalileoMetrics.actionAdvancementLuna`                                  |
    | [Action Completion](/concepts/metrics/agentic/action-completion)                            | `GalileoMetrics.actionCompletionLuna`                                   |
    | [Chunk Attribution Utilization](/concepts/metrics/rag/generation-quality/chunk-attribution) | `GalileoMetrics.chunkAttributionUtilizationLuna`                        |
    | [Completeness](/concepts/metrics/rag/generation-quality/completeness)                       | `GalileoMetrics.completenessLuna`                                       |
    | [Context Adherence](/concepts/metrics/rag/generation-quality/context-adherence)             | `GalileoMetrics.contextAdherenceLuna`                                   |
    | [PII (personally identifiable information)](/concepts/metrics/safety-and-compliance/pii)    | `GalileoMetrics.inputPii`, `GalileoMetrics.outputPii`                   |
    | [Prompt Injection](/concepts/metrics/safety-and-compliance/prompt-injection)                | `GalileoMetrics.promptInjectionLuna`                                    |
    | [Sexism / Bias](/concepts/metrics/safety-and-compliance/sexism)                             | `GalileoMetrics.inputSexismLuna`, `GalileoMetrics.outputSexismLuna`     |
    | [Tone](/concepts/metrics/expression-and-readability/tone)                                   | `GalileoMetrics.inputTone`, `GalileoMetrics.outputTone`                 |
    | [Tool Errors](/concepts/metrics/agentic/tool-error)                                         | `GalileoMetrics.toolErrorRateLuna`                                      |
    | [Tool Selection Quality](/concepts/metrics/agentic/tool-selection-quality)                  | `GalileoMetrics.toolSelectionQualityLuna`                               |
    | [Toxicity](/concepts/metrics/safety-and-compliance/toxicity)                                | `GalileoMetrics.inputToxicityLuna`, `GalileoMetrics.outputToxicityLuna` |
    | [Uncertainty](/concepts/metrics/model-confidence/uncertainty)                               | `GalileoMetrics.uncertainty`                                            |
  </Tab>
</Tabs>

## How do I use metrics in experiments?

The `run experiment` function ([Python](/sdk-api/python/reference#galileo-experiments), [TypeScript](/sdk-api/typescript/reference#galileo-experiments)) takes a list of metrics as part of its arguments.

### Preset metrics

Supply a list of one or more metric names into the `run_experiment` function as shown below:

<CodeGroup>
  ```python Python theme={null}
  import os
  from galileo.experiments import run_experiment
  from galileo.datasets import get_dataset
  from galileo.openai import openai
  from galileo import GalileoMetrics

  dataset = get_dataset(name="fictional_character_names")

  # Define a custom "runner" function for your experiment.
  def my_custom_llm_runner(input):
    client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    return client.chat.completions.create(
          model="gpt-4o",
          messages=[
            { "role": "system", "content": "You are a great storyteller." },
            { "role": "user", "content": f"Write a story about {input["topic"]}" },
          ],
      ).choices[0].message.content

  # Run the experiment!
  results = run_experiment(
      "test-experiment",
      project="my-test-project-1",
      dataset=dataset,
      function=my_custom_llm_runner,
      metrics=[
          # List metrics here
          GalileoMetrics.action_advancement,
          GalileoMetrics.completeness,
          GalileoMetrics.instruction_adherence
      ],
  )
  ```

  ```typescript TypeScript theme={null}
  import { GalileoMetrics, runExperiment } from "galileo";

  async function runFunctionExperiment() {
    const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

    // Define a custom "runner" function for your experiment.
    const myCustomLLMRunner = async (input: any) => {
      const result = await openai.chat.completions.create({
        model: "gpt-4.1-mini",
        messages: [
          { role: "system", content: "You are a great storyteller." },
          { role: "user", content: `Write a story about ${input["topic"]}` }
        ]
      });

      return [result.choices[0].message.content];
    };

    // Run the experiment!
    await runExperiment({
      name: `Test Experiment`,
      projectName: "my-test-project-1",
      datasetName: "fictional_character_names",
      function: myCustomLLMRunner,
      metrics: [
        // List metrics here
        GalileoMetrics.actionAdvancement,
        GalileoMetrics.completeness,
        GalileoMetrics.instructionAdherence
      ]
    });
  }

  runFunctionExperiment();
  ```
</CodeGroup>

For more information, read about running experiments with the [Galileo SDKs](/sdk-api/experiments).

### Custom metrics

You can use [custom metrics](/concepts/metrics/custom-metrics/custom-metrics-ui-code) in the same way as Galileo's preset metrics. At a high level, this involves the following steps:

1. Create your metric in the [Galileo Console](https://app.galileo.ai) (or in [code](/concepts/metrics/custom-metrics/custom-metrics-ui-code#custom-scorers)). Your custom metric will return a numerical score based on its input.
2. Pass the name of your new metric into the `run experiment`, like in the example below.

For example, if you have a metric called `"Compliance - do not recommend any financial actions"`:

<img src="https://mintcdn.com/v2galileo/E8lj9Nk9__MN-baJ/concepts/metrics/custom-metrics/metric-name.webp?fit=max&auto=format&n=E8lj9Nk9__MN-baJ&q=85&s=2ede4519a383b2fc91c96df8720be109" alt="A metric called Compliance - do not recommend any financial actions" width="1285" height="182" data-path="concepts/metrics/custom-metrics/metric-name.webp" />

You would pass this to an experiment like this:

<CodeGroup>
  ```python Python {7} theme={null}
  from galileo.experiments import run_experiment

  results = run_experiment(
      "finance-experiment",
      dataset=dataset,
      function=llm_call,
      metrics=["Compliance - do not recommend any financial actions"],
      project="my-project",
  )
  ```

  ```typescript TypeScript  {9} theme={null}
  import {
    runExperiment
  } from "galileo";

  await runExperiment({
    name: "finance-experiment",
    dataset: dataset,
    function: wrappedRunner,
    metrics: ["Compliance - do not recommend any financial actions"],
    projectName: "my-project",
  });
  ```
</CodeGroup>

Custom metrics provide the flexibility to define precisely what you want to measure, enabling deep analysis and targeted improvement. For a detailed walkthrough on creating them, see [Custom Metrics](/concepts/metrics/custom-metrics/custom-metrics-ui-code).

## Ground truth data

**Ground truth** is the authoritative, validated answer or label used to benchmark model performance. For LLM metrics, this often means a gold-standard answer, fact, or supporting evidence against which outputs are compared.

The following metrics require ground truth data to compute their scores, as they involve direct comparison to a reference answer, label, or fact.

* [BLEU and ROUGE](/concepts/metrics/expression-and-readability/bleu-and-rouge)
* [Ground Truth Adherence](/concepts/metrics/response-quality/ground-truth-adherence)

<Note>These metrics are only supported in experiments, as they require the ground truth to be set in the dataset used by the experiment.</Note>

To set the ground truth, set this in the output of your dataset either in the [Galileo Console](/sdk-api/experiments/datasets#create-and-manage-datasets-via-the-console), or [in code](/sdk-api/experiments/running-experiments#ground-truth).

## Are metrics LLM-agnostic?

Yes, all metrics are designed to work across any LLM integrated with Galileo.

## Next steps

<CardGroup cols={2}>
  <Card title="Metrics Overview " icon="chart-bar" horizontal href="/concepts/metrics/overview">
    Explore Galileo's comprehensive metrics framework for evaluating and improving AI system performance across multiple dimensions.
  </Card>

  <Card title="Experiments Overview " icon="flask" horizontal href="/sdk-api/experiments/experiments">
    Learn how to use datasets and experiments to improve your application.
  </Card>

  <Card title="Run experiments" icon="code" horizontal href="/sdk-api/experiments/running-experiments#set-the-metrics-for-your-experiment">
    Learn how to run experiments in Galileo using the Galileo SDKs and custom metrics.
  </Card>
</CardGroup>
