# RAG Metrics

> Evaluate retrieval and generation quality in your RAG pipeline using Galileo's RAG metrics

RAG metrics help you measure how well your retrieval-augmented generation system finds relevant context and produces accurate, complete, and well-grounded responses. These metrics are organized into two categories:

* **Retrieval Quality** – Evaluate whether the right chunks are retrieved and how well they are ranked (chunk relevance, context relevance, context precision, Precision @ K).
* **Generation Quality** – Evaluate whether the model uses the context effectively and grounds responses in retrieved context (chunk attribution utilization, context adherence, completeness). For ground truth adherence, correctness, and instruction adherence — which apply to both RAG and non-RAG use cases — see [Response Quality metrics](/concepts/metrics/response-quality/response-quality-overview).

## Problems RAG metrics help you solve

* **Answers hallucinate or contradict your source documents.** Use [Context Adherence](/concepts/metrics/rag/generation-quality/context-adherence) to see whether the model's claims stay grounded in the retrieved context.
* **Retrieved documents look unrelated to the question.** Use [Chunk Relevance](/concepts/metrics/rag/retrieval-quality/chunk-relevance) and [Context Relevance](/concepts/metrics/rag/retrieval-quality/context-relevance) to understand whether individual chunks — and the overall context — actually help answer the query.
* **Retrieval returns lots of noise in the top K results.** Use [Context Precision](/concepts/metrics/rag/retrieval-quality/context-precision) and [Precision @ K](/concepts/metrics/rag/retrieval-quality/precision-at-k) to quantify how many of the highest-ranked chunks are truly relevant and how that changes as you adjust K.
* **The model ignores some of the retrieved information.** Use [Chunk Attribution Utilization](/concepts/metrics/rag/generation-quality/chunk-attribution) to see which chunks influenced the answer and where useful context was left on the table.
* **Answers are grounded but still feel incomplete.** Use [Completeness](/concepts/metrics/rag/generation-quality/completeness) to detect when important details from the retrieved context never make it into the final response.

## Diagnose your RAG problem

Not sure which metric to start with? Walk through these symptoms to find the right one.

<AccordionGroup>
  <Accordion title="My answers contain made-up information" icon="triangle-exclamation">
    **Diagnosis:** The model is hallucinating — generating claims not supported by your documents.

    **Start with:** [Context Adherence](/concepts/metrics/rag/generation-quality/context-adherence) to measure how well responses stay grounded in retrieved context.

    **If Context Adherence is high but answers are still wrong:** The problem may be in retrieval. Check [Context Relevance](/concepts/metrics/rag/retrieval-quality/context-relevance) to see if you're retrieving the right documents in the first place.
  </Accordion>

  <Accordion title="My retrieval returns too much noise" icon="volume-high">
    **Diagnosis:** Your retriever is pulling in chunks that don't help answer the query.

    **Start with:** [Chunk Relevance](/concepts/metrics/rag/retrieval-quality/chunk-relevance) to see which individual chunks are useful vs. noise.

    **Then check:** [Context Precision](/concepts/metrics/rag/retrieval-quality/context-precision) to get an aggregate view of how much of your retrieved context is actually relevant.

    **To tune your K:** Use [Precision @ K](/concepts/metrics/rag/retrieval-quality/precision-at-k) to find the sweet spot where adding more chunks stops helping.
  </Accordion>

  <Accordion title="The answer is correct but feels thin" icon="feather">
    **Diagnosis:** The model is being too conservative — it's grounded but not using all available information.

    **Start with:** [Completeness](/concepts/metrics/rag/generation-quality/completeness) to detect when relevant details from the context are left out of the response.

    **Also check:** [Chunk Attribution Utilization](/concepts/metrics/rag/generation-quality/chunk-attribution) to see which chunks actually influenced the answer and which were ignored.
  </Accordion>

  <Accordion title="Some retrieved chunks never get used" icon="box-archive">
    **Diagnosis:** You're retrieving context that the model ignores — either the chunks are marginally relevant or your prompt isn't encouraging full utilization.

    **Start with:** [Chunk Attribution Utilization](/concepts/metrics/rag/generation-quality/chunk-attribution) to see exactly which chunks contributed to the response.

    **If attribution is low but chunks are relevant:** Consider adjusting your prompt to encourage the model to incorporate more context.
  </Accordion>
</AccordionGroup>
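
The last symptom above depends on reading two per-chunk signals together: whether a chunk was relevant, and whether it influenced the response. The sketch below cross-tabulates them. It assumes you have already exported per-chunk results into plain Python values; the `ChunkLabel` type, its field names, and the bucket names are hypothetical and not part of any Galileo API.

```python
from collections import defaultdict
from typing import NamedTuple


class ChunkLabel(NamedTuple):
    chunk_id: str
    relevant: bool    # assumed per-chunk Chunk Relevance result
    attributed: bool  # assumed per-chunk Chunk Attribution result


def bucket(chunk: ChunkLabel) -> str:
    """Name the quadrant a chunk falls into."""
    if chunk.relevant and chunk.attributed:
        return "used"
    if chunk.relevant:
        return "ignored"  # relevant but unused: revisit the prompt
    if chunk.attributed:
        # Not covered by the guidance above; worth a manual look.
        return "attributed_but_irrelevant"
    return "noise"  # irrelevant and unused: revisit the retriever


def summarize(chunks: list[ChunkLabel]) -> dict[str, list[str]]:
    """Group chunk ids by bucket, preserving retrieval order."""
    groups: dict[str, list[str]] = defaultdict(list)
    for chunk in chunks:
        groups[bucket(chunk)].append(chunk.chunk_id)
    return dict(groups)


chunks = [
    ChunkLabel("c1", relevant=True, attributed=True),
    ChunkLabel("c2", relevant=True, attributed=False),
    ChunkLabel("c3", relevant=False, attributed=False),
]
print(summarize(chunks))
```

A large `ignored` bucket points at the prompt, while a large `noise` bucket points back at the retriever.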

## How RAG metrics connect

RAG evaluation flows from retrieval to generation. Here's how the metrics relate:

```mermaid
flowchart LR
    subgraph Retrieval["Retrieval Quality"]
        CR[Chunk Relevance]
        CTR[Context Relevance]
        CP[Context Precision]
        PK[Precision @ K]
    end

    subgraph Generation["Generation Quality"]
        CA[Context Adherence]
        CAU[Chunk Attribution]
        COMP[Completeness]
    end

    CR --> CP
    CR --> PK
    CR --> CTR
    CTR --> CA
    CA --> COMP
    CTR --> CAU

    style Retrieval fill:#e8f4f8,stroke:#0ea5e9
    style Generation fill:#f0fdf4,stroke:#22c55e
```

**Reading the diagram:**

* **Chunk Relevance** is the foundation — it determines whether individual chunks help answer the query and feeds into precision metrics.
* **Context Relevance** asks whether the overall context is sufficient, bridging retrieval and generation.
* **Context Adherence** checks if the model stays grounded, while **Completeness** checks if it uses everything relevant.
* **Chunk Attribution** reveals which specific chunks actually influenced the response.

<Warning title="Common pitfall">
  **High Context Adherence + Low Completeness?** Your model is being too conservative. It's staying grounded but only using a fraction of the available context. Adjust your prompt to encourage more comprehensive answers.
</Warning>

<Tip title="Retrieval vs. generation problems">
  If **Context Relevance is low**, fix your retriever first — better embeddings, different chunking strategy, or higher K. If Context Relevance is high but **Context Adherence is low**, the problem is in generation — the model has the right context but isn't using it faithfully.
</Tip>
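
The two callouts above amount to a triage order: check retrieval first, then grounding, then coverage. A minimal sketch of that order follows, assuming each metric has been aggregated to a score between 0 and 1. The `RagScores` type, the `triage` function, and the 0.5 cutoff are all hypothetical illustrations, not Galileo defaults.

```python
from dataclasses import dataclass

# Hypothetical cutoff for illustration only; tune it against your own data.
LOW = 0.5


@dataclass
class RagScores:
    """Aggregate scores in [0, 1] for one trace or one experiment run."""
    context_relevance: float
    context_adherence: float
    completeness: float


def triage(scores: RagScores, low: float = LOW) -> str:
    """Return the pipeline stage to investigate first."""
    # Retrieval comes first: generation metrics mean little on bad context.
    if scores.context_relevance < low:
        return "retrieval: improve embeddings, chunking, or K"
    # Right context but an unfaithful answer: a generation problem.
    if scores.context_adherence < low:
        return "generation: the model is not staying grounded in the context"
    # Grounded but thin: the conservative-model pitfall.
    if scores.completeness < low:
        return "prompt: encourage more comprehensive use of the context"
    return "no obvious issue at this threshold"


print(triage(RagScores(context_relevance=0.9, context_adherence=0.95, completeness=0.3)))
```

The checks are ordered deliberately: a low Context Relevance score short-circuits the rest, because adherence and completeness are hard to interpret when the retrieved context is wrong.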

The tables below give a quick reference for all RAG metrics, grouped by category:

### Retrieval Quality

| Name                                                                                             | Description                                                                                                             | Supported Nodes | When to Use                                                                                                             | Example Use Case                                                                                                                        |
| :----------------------------------------------------------------------------------------------- | :---------------------------------------------------------------------------------------------------------------------- | :-------------- | :---------------------------------------------------------------------------------------------------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------- |
| [Chunk Relevance](/concepts/metrics/rag/retrieval-quality/chunk-relevance)                       | Measures whether each retrieved chunk contains information that could help answer the user's query.                     | Retriever span  | When evaluating the relevance of individual retrieved chunks to the query.                                              | A RAG system that needs to ensure each retrieved document chunk contributes useful information toward answering user questions.         |
| [Context Relevance (Query Adherence)](/concepts/metrics/rag/retrieval-quality/context-relevance) | Evaluates whether the retrieved context is relevant to the user's query.                                                | Retriever span  | When assessing the quality of your retrieval system's results.                                                          | An internal knowledge base search that retrieves company policies relevant to specific employee questions.                              |
| [Context Precision](/concepts/metrics/rag/retrieval-quality/context-precision)                   | Measures the percentage of relevant chunks in the retrieved context, weighted by their position in the retrieval order. | Retriever span  | When evaluating the overall quality of your retrieval system's results and ranking effectiveness.                       | A document search system that needs to ensure retrieved chunks are relevant and properly ranked by importance.                          |
| [Precision @ K](/concepts/metrics/rag/retrieval-quality/precision-at-k)                          | Measures the percentage of relevant chunks among the top K retrieved chunks at a specific rank position.                | Retriever span  | When determining the optimal number of chunks to retrieve (Top K) and evaluating ranking quality at specific positions. | A RAG system that needs to optimize retrieval parameters to balance between capturing all relevant chunks and avoiding irrelevant ones. |

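To make the two precision metrics concrete, here is a minimal sketch that computes them from per-chunk relevance labels listed in retrieval order. This page does not state the exact weighting Galileo uses for Context Precision, so the sketch assumes the common average-precision formulation (the mean of Precision @ k over the ranks that hold a relevant chunk). The function names are illustrative only.

```python
from typing import Sequence


def precision_at_k(relevant: Sequence[bool], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant.

    Divides by k, so ranks beyond the end of the list count as non-relevant
    (an assumption; other conventions exist).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevant[:k]) / k


def context_precision(relevant: Sequence[bool]) -> float:
    """Rank-weighted precision, assuming the average-precision formulation."""
    # Score each rank that holds a relevant chunk by the precision at that rank.
    hits = [
        precision_at_k(relevant, k)
        for k, is_relevant in enumerate(relevant, start=1)
        if is_relevant
    ]
    return sum(hits) / len(hits) if hits else 0.0


# Relevance labels for five retrieved chunks, best-ranked first.
labels = [True, False, True, True, False]

for k in range(1, len(labels) + 1):
    print(f"Precision @ {k}: {precision_at_k(labels, k):.2f}")
print(f"Context Precision: {context_precision(labels):.2f}")
```

Under this formulation, moving a relevant chunk higher in the ranking raises Context Precision even when the set of retrieved chunks is unchanged, which is what "weighted by position" is meant to capture.
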
### Generation Quality

| Name                                                                                        | Description                                                                                                                       | Supported Nodes | When to Use                                                                                                             | Example Use Case                                                                                                                                                 |
| :------------------------------------------------------------------------------------------ | :-------------------------------------------------------------------------------------------------------------------------------- | :-------------- | :---------------------------------------------------------------------------------------------------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [Chunk Attribution Utilization](/concepts/metrics/rag/generation-quality/chunk-attribution) | Assesses whether the response uses the retrieved chunks and properly attributes information to source documents.                  | Retriever span  | When you want to ensure proper attribution and efficient use of retrieved information in a RAG system.                  | A legal research assistant that must cite specific cases and statutes when providing legal information.                                                          |
| [Context Adherence](/concepts/metrics/rag/generation-quality/context-adherence)             | Measures how well the response aligns with the provided context.                                                                  | LLM span        | When you want to ensure the model is grounding its responses in the provided context.                                   | A financial advisor bot that must base investment recommendations on the client's specific financial situation and goals.                                        |
| [Completeness](/concepts/metrics/rag/generation-quality/completeness)                       | Measures how thoroughly the response covers the relevant information available in the provided context.                           | LLM span        | When evaluating whether responses fully address the user's intent.                                                      | A healthcare chatbot that, given a patient's medical record as context, must include all relevant critical information from that record in its response.         |

***

## Next steps

* [Back to Metrics Overview](/concepts/metrics/overview)
* [Compare all metrics](/concepts/metrics/metric-comparison)