Precision	Recall	F1-Score
{negativeLabel}	{negClass.precision.toFixed(2)}	{negClass.recall.toFixed(2)}	{negClass.f1.toFixed(2)}
{positiveLabel}	{posClass.precision.toFixed(2)}	{posClass.recall.toFixed(2)}	{posClass.f1.toFixed(2)}

Precision

Recall

F1-Score

{negativeLabel}

{negClass.precision.toFixed(2)}

{negClass.recall.toFixed(2)}

{negClass.f1.toFixed(2)}

{positiveLabel}

{posClass.precision.toFixed(2)}

{posClass.recall.toFixed(2)}

{posClass.f1.toFixed(2)}

{}

{titlePrefix}Confusion Matrix (Normalized)

{}

Predicted

{}

{displayPredictedLabels.left}

{displayPredictedLabels.right}

{}

Actual

{displayActualLabels.top}

{showCounts &&

{displayMatrix.tl.count}

}

{formatValue(displayMatrix.tl.pct)}

{showCounts &&

{displayMatrix.tr.count}

}

{formatValue(displayMatrix.tr.pct)}

{}

{displayActualLabels.bottom}

{showCounts &&

{displayMatrix.bl.count}

}

{formatValue(displayMatrix.bl.pct)}

{showCounts &&

{displayMatrix.br.count}

}

{formatValue(displayMatrix.br.pct)}

{}

{displayFormat === "fraction" ? "0.0" : "0%"}

{palette.map((color, idx) =>

)}

{displayFormat === "fraction" ? "1.0" : "100%"}

; }; export const DefinitionCard = ({children}) => { return

{children}

; }; export const Scale = ({low, mid, high, lowLabel = "Low", midLabel = "Mid", highLabel = "High", lowDescription, midDescription, highDescription, midColor = "yellow", inverted = false}) => { const lowColor = inverted ? "green" : "red"; const highColor = inverted ? "red" : "green"; const gradientId = inverted ? "greenToRed" : "redToGreen"; return

{low}

{mid &&

{mid}

}

{high}

{lowLabel}

{lowDescription &&

{lowDescription}

}

{mid &&

{midLabel}

{midDescription &&

{midDescription}

}

{highLabel}

{highDescription &&

{highDescription}

}

; }; Ground Truth Adherence measures whether a model's response is semantically equivalent to your reference answer (Ground Truth). Ground Truth Adherence is a continuous metric ranging from 0 to 1: This metric helps evaluate how closely your model's outputs match expected or ideal responses, which is particularly valuable for: * Evaluating model performance against a benchmark dataset * Ensuring consistency in critical applications * Measuring the impact of model or prompt changes This metric is only supported in experiments, and requires a Ground Truth to be set in the `output` column of your experiment's [dataset](/sdk-api/experiments/datasets). ## Calculation method Ground Truth Adherence is computed through a multi-step process: Additional evaluation requests are sent to OpenAI's GPT4o model to analyze the semantic relationship between responses. A carefully engineered chain-of-thought prompt asks the model to evaluate whether the response and Ground Truth convey the same meaning. The system requests multiple distinct responses to this prompt to ensure robust evaluation through consensus. Each evaluation generates both an explanation of the reasoning and a binary judgment (yes/no) on semantic equivalence. The final Ground Truth Adherence score is computed as the ratio of 'yes' responses to the total number of evaluation responses. We also surface one of the generated explanations, always choosing one that aligns with the majority judgment among the responses. This metric is computed by prompting an LLM multiple times, and thus requires additional LLM calls to compute, which may impact usage and billing. ## Understanding ground truth adherence

Differentiating from Other Metrics

It's important to understand how Ground Truth Adherence differs from related metrics:

Ground Truth Adherence: Measures semantic equivalence to a reference answer.

Correctness: Measures factual accuracy regardless of any reference answer.

Context Adherence: Measures alignment with provided context, not a reference answer.

## Optimizing your AI system

Addressing Low Ground Truth Adherence

When responses have low Ground Truth Adherence scores, your model is generating outputs that differ semantically from your reference answers. To improve your system:

Analyze divergence patterns: Identify common ways in which responses differ from ground truth.

Refine your prompts: Adjust instructions to guide the model toward your expected output format and content.

Consider few-shot examples: Provide examples in your prompt that demonstrate the desired response pattern.

Evaluate ground truth quality: Ensure your reference answers are clear, consistent, and representative of ideal responses.

## Best practices Create a varied set of reference answers that cover different response styles and edge cases. Define what constitutes semantic equivalence for your specific use case and domain. Track Ground Truth Adherence when upgrading models to ensure consistent performance. Use Ground Truth Adherence alongside metrics like Correctness and Instruction Adherence for a complete evaluation. When optimizing for Ground Truth Adherence, remember that there may be multiple valid ways to express the same information. Consider whether strict adherence to specific wording is necessary, or if semantic equivalence is sufficient for your use case. ## Performance Benchmarks We evaluated Ground Truth Adherence against human expert labels on an internal dataset using top frontier models. | Model | F1 (True) | | :---------------------- | :-------: | | GPT-4.1 | 0.90 | | GPT-4.1-mini (judges=3) | 0.91 | | Claude Sonnet 4.5 | 0.91 | | Gemini 3 Flash | 0.94 | ### GPT-4.1 Classification Report Benchmarks based on internal evaluation dataset. Performance may vary by use case. ## Related Resources If you would like to dive deeper or start implementing Ground Truth Adherence, check out the following resources: ### Related Concepts * [Context Adherence](/concepts/metrics/rag/generation-quality/context-adherence) * [Context Relevance](/concepts/metrics/rag/retrieval-quality/context-relevance) * [Correctness](/concepts/metrics/response-quality/correctness)