Precision	Recall	F1-Score
{negativeLabel}	{negClass.precision.toFixed(2)}	{negClass.recall.toFixed(2)}	{negClass.f1.toFixed(2)}
{positiveLabel}	{posClass.precision.toFixed(2)}	{posClass.recall.toFixed(2)}	{posClass.f1.toFixed(2)}

Precision

Recall

F1-Score

{negativeLabel}

{negClass.precision.toFixed(2)}

{negClass.recall.toFixed(2)}

{negClass.f1.toFixed(2)}

{positiveLabel}

{posClass.precision.toFixed(2)}

{posClass.recall.toFixed(2)}

{posClass.f1.toFixed(2)}

{}

{titlePrefix}Confusion Matrix (Normalized)

{}

Predicted

{}

{displayPredictedLabels.left}

{displayPredictedLabels.right}

{}

Actual

{displayActualLabels.top}

{showCounts &&

{displayMatrix.tl.count}

}

{formatValue(displayMatrix.tl.pct)}

{showCounts &&

{displayMatrix.tr.count}

}

{formatValue(displayMatrix.tr.pct)}

{}

{displayActualLabels.bottom}

{showCounts &&

{displayMatrix.bl.count}

}

{formatValue(displayMatrix.bl.pct)}

{showCounts &&

{displayMatrix.br.count}

}

{formatValue(displayMatrix.br.pct)}

{}

{displayFormat === "fraction" ? "0.0" : "0%"}

{palette.map((color, idx) =>

)}

{displayFormat === "fraction" ? "1.0" : "100%"}

; }; export const DefinitionCard = ({children}) => { return

{children}

; }; export const Pill = ({label, color, backgroundColor}) => {label} ; Context Relevance measures whether your retrieved context, taken together, contains enough information to fully answer the user query. ## Context relevance Context Relevance asks whether your retrieved context, as a whole, contains enough information to fully answer the user query. High Context Relevance values indicate strong confidence that there is enough context to fully answer the question. Low Context Relevance values are a sign that you need to increase your Top K, modify your retrieval strategy, or use better embeddings. Context Relevance is differentiated from [Context Adherence](/concepts/metrics/rag/generation-quality/context-adherence): Context Relevance evaluates whether the retrieved context is relevant to a user's query whereas Context Adherence determines how well the response aligns to provided context. ### Chunk Relevance vs. Context Relevance * **Chunk Relevance** ([Chunk Relevance](/concepts/metrics/rag/retrieval-quality/chunk-relevance)) evaluates **each chunk individually**: does this chunk contain anything useful for answering the query? * **Context Relevance** evaluates **the retrieved context as a whole**: do all of these chunks, taken together, cover everything needed to answer the query end-to-end? Use **Chunk Relevance** when you're tuning chunking or reranking (which chunks should show up at all), and **Context Relevance** when you're deciding if retrieval "succeeded" for a given query and whether to adjust Top K, retriever configuration, or fallback behavior. ### Reading Context Relevance with Context Precision * **High Context Relevance & High Context Precision**: Retrieved context is both sufficient and mostly noise-free — focus next on generation quality and grounding. * **High Context Relevance & Low Context Precision**: The right information is present but mixed with a lot of irrelevant chunks — keep your recall but prune noise (better filters, reranking, or a lower Top K). * **Low Context Relevance & High Context Precision**: Most chunks are on-topic, but together they still miss pieces needed for a full answer — broaden retrieval (higher Top K, alternate retriever, or additional data sources). * **Low Context Relevance & Low Context Precision**: Retrieval is both incomplete and noisy — revisit embeddings, indexing, and query formulation end-to-end. ## Best practices Leverage Context Relevance when assessing the quality of your retrieval system's results and determining how accurately it adheres to queries. Use context relevance alongside context adherence, correctness, and completeness metrics for a comprehensive view of response quality. ## Performance Benchmarks We evaluated Context Relevance against human expert labels on an internal dataset of RAG samples using top frontier models. | Model | F1 (True) | | :---------------------- | :-------: | | GPT-4.1 | 0.82 | | GPT-4.1-mini (judges=3) | 0.85 | | Claude Sonnet 4.5 | 0.81 | | Gemini 3 Flash | 0.81 | ### GPT-4.1 Classification Report Benchmarks based on internal evaluation dataset. Performance may vary by use case. ## Related Resources If you would like to dive deeper or start implementing Context Relevance, check out the following resources: ### Examples * [Context Relevance Examples](https://app.galileo.ai) - Log in and explore the "Context Relevance" Log Stream in the "Preset Metric Examples" Project to see this metric in action. ### Related Concepts * [Context Adherence](/concepts/metrics/rag/generation-quality/context-adherence) * [Ground Truth Adherence](/concepts/metrics/response-quality/ground-truth-adherence) * [Correctness](/concepts/metrics/response-quality/correctness)