Galileo comes with a set of ready-to-use Out-of-the-Box metrics that show how your AI is performing. With these metrics, you can quickly spot problems, track improvements, and make your AI work better for your users. Depending on the metric, these metrics apply to different node types, such as sessions, traces, or the different span types. You can extend them with custom metrics, created using LLM-as-a-judge or custom code.

To calculate Out-of-the-Box or LLM-as-a-judge metrics, you first need to configure an integration with an LLM or with Luna-2. Connect Galileo to your language model by adding your API key on the integrations page within the Galileo application.

You can improve metric calculation to match your requirements using Autotune, which lets you continuously provide feedback in natural language that automatically aligns the metrics with your domain or expected inputs and outputs. Metrics can be used with both experiments and Log Streams.
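For example, here is a minimal sketch of running an experiment with an Out-of-the-Box metric using the Galileo Python SDK. It assumes an OpenAI integration is already configured (see the next section); the experiment name, dataset, project, and model are illustrative, so check the experiments guide for the exact API in your SDK version.

```python
from galileo.experiments import run_experiment
from galileo.openai import openai  # Galileo's logging wrapper around the OpenAI client

client = openai.OpenAI()

def ask(input):
    # Hypothetical app under test: answers one question per dataset row.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": input}],
    )
    return response.choices[0].message.content

run_experiment(
    "ootb-metrics-baseline",                              # illustrative experiment name
    dataset=[{"input": "Which continent is France in?"}],
    function=ask,
    metrics=["correctness"],                              # an Out-of-the-Box metric
    project="my-project",                                 # illustrative project name
)
```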
Configure Galileo for Out-of-the-Box and LLM-as-a-judge metrics
Most Out-of-the-Box metrics and all LLM-as-a-judge metrics are LLM-based, meaning they use an LLM to evaluate inputs and outputs. To use these metrics from Log Streams or Experiments, you first need to configure an integration with an LLM platform.
Add an integration
Locate the LLM provider you are using (or specify a custom integration), then select the +Add Integration button.
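Note that the LLM integration itself is configured in the Galileo UI, while your application authenticates to Galileo separately. A minimal sketch, assuming the Python SDK's standard environment variables (the project and Log Stream names below are illustrative; a .env file works too):

```python
import os

# Standard Galileo Python SDK environment variables; set them before
# using the SDK. Verify the names against the SDK docs for your deployment.
os.environ["GALILEO_API_KEY"] = "your-galileo-api-key"
os.environ["GALILEO_PROJECT"] = "my-project"        # illustrative project name
os.environ["GALILEO_LOG_STREAM"] = "my-log-stream"  # illustrative Log Stream name
```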

Using metrics effectively
To get the most value from Galileo’s metrics:
- Start with key metrics - Focus on metrics most relevant to your use case
- Establish baselines - Understand your current performance before making changes
- Track trends over time - Monitor how metrics change as you iterate on your system
- Combine multiple metrics - Look at related metrics together for a more complete picture
- Set thresholds - Define acceptable ranges for critical metrics (see the sketch after this list)
- Improve the metrics - Use CLHF to continuously improve the metrics
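As an illustration of the thresholds point above, a lightweight approach is to compare metric scores against acceptable ranges in code. This is a generic sketch: the metric names, scores, and limits are examples and not tied to a specific Galileo API.

```python
# Example thresholds for critical metrics (values are illustrative only).
THRESHOLDS = {
    "correctness": 0.8,
    "context_adherence": 0.9,
    "toxicity": 0.1,  # lower is better for risk metrics
}

LOWER_IS_BETTER = {"toxicity"}

def check_thresholds(scores: dict[str, float]) -> list[str]:
    """Return human-readable violations for one run's metric scores."""
    violations = []
    for metric, limit in THRESHOLDS.items():
        score = scores.get(metric)
        if score is None:
            continue  # metric not computed for this run
        if metric in LOWER_IS_BETTER:
            if score > limit:
                violations.append(f"{metric}={score:.2f} exceeds max {limit}")
        elif score < limit:
            violations.append(f"{metric}={score:.2f} below min {limit}")
    return violations

print(check_thresholds({"correctness": 0.72, "toxicity": 0.03}))
# -> ['correctness=0.72 below min 0.8']
```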
Out-of-the-Box metric categories
Our metrics can be broken down into eight key categories, each addressing a specific aspect of AI system performance. Most teams benefit from combining metrics from more than one category, depending on which aspects matter most to them. Galileo also supports custom metrics that can be used alongside the Out-of-the-Box options. The Metrics Comparison provides a full list of the Out-of-the-Box metrics in each category.
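To make the custom-metric option concrete, the scorer below shows what custom code-based metric logic can look like. It is plain Python and deliberately simple; how you register it with Galileo, and the exact signature Galileo expects, are covered in the custom code-based metrics guide linked under Next steps.

```python
def response_length_score(output: str, max_words: int = 150) -> float:
    """Illustrative custom scorer: 1.0 within the word budget, scaled down beyond it."""
    words = len(output.split())
    return 1.0 if words <= max_words else max_words / words

print(response_length_score("A short, on-budget answer."))  # -> 1.0
```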
1. Agentic performance metrics
Agentic performance metrics evaluate how effectively AI agents perform tasks, use tools, and progress toward goals.
When to use: when building and optimizing AI systems that take actions, make decisions, or use tools to accomplish tasks.
2. Expression and readability metrics
Expression and readability metrics assess the style, tone, clarity, and overall presentation of AI-generated content.
When to use: when the format, tone, and presentation of AI outputs are important for user experience or brand consistency.
3. Model confidence metrics
Model confidence metrics measure how certain or uncertain your AI model is about its responses.
When to use: when you want to flag low-confidence responses for review, improve system reliability, or better understand model uncertainty.
4. Multimodal quality metrics
Multimodal quality metrics evaluate whether multimodal inputs and outputs (such as images and audio conversations) are usable and compliant for the task at hand.
When to use: when your AI workflow includes images or audio conversations and you need quality and compliance signals.
5. Response quality metrics
Response quality metrics evaluate how correctly and consistently your AI follows instructions and answers user queries, and how closely its answers align with ground truth, in any setting, with or without RAG. They include correctness, instruction adherence, and ground truth adherence.
6. RAG metrics
RAG metrics help you evaluate retrieval and generation quality in RAG pipelines, covering retrieval quality (chunk relevance, context relevance, context precision, Precision @ K) and generation quality (chunk attribution, chunk utilization, context adherence, completeness).
When to use: when evaluating how well retrieval systems find relevant context and how well models use that context to produce accurate, complete, and well-grounded responses.
7. Safety and compliance metrics
Safety and compliance metrics identify potential risks, harmful content, bias, or privacy concerns in AI interactions.
When to use: when ensuring AI systems meet regulatory requirements, protect user privacy, and avoid generating harmful or biased content.
8. Text-to-SQL metrics
Text-to-SQL metrics evaluate the accuracy and effectiveness of SQL queries generated by AI models from natural language inputs.
When to use: when building Text-to-SQL systems that need to produce syntactically correct queries grounded in your database schema.
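Although Galileo computes these metrics for you, a quick local syntactic pre-check can catch obviously malformed queries before evaluation. The sketch below uses the open-source sqlglot parser as a rough stand-in; it is not Galileo's metric implementation, and it only tests syntax, not grounding in your schema.

```python
import sqlglot
from sqlglot.errors import ParseError

def is_syntactically_valid(sql: str, dialect: str = "postgres") -> bool:
    """Rough pre-check: does the generated SQL parse at all?"""
    try:
        sqlglot.parse_one(sql, dialect=dialect)
        return True
    except ParseError:
        return False

print(is_syntactically_valid("SELECT id, name FROM users WHERE active = TRUE"))  # True
print(is_syntactically_valid("SELECT 1 +"))  # False: incomplete expression
```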
Next steps
Custom LLM-as-a-judge metrics
Learn how to create evaluation metrics using LLMs to judge the quality of responses
Custom code-based metrics
Learn how to create, register, and use custom metrics to evaluate your LLM applications
Improve LLM-as-a-Judge Metrics with Autotune
Learn how to improve your LLM-as-a-judge metrics using expert feedback with Autotune

