Correctness measures whether a given model response contains factually accurate information.

Correctness is a continuous metric ranging from 0 to 1. Higher scores indicate that a larger share of the evaluation judgments found the response factually accurate.

This metric is particularly valuable for uncovering open-domain hallucinations: factual errors that don't relate to any specific documents or context provided to the model.

## Calculation method

Correctness is computed through a multi-step process:

1. Additional evaluation requests are sent to OpenAI's GPT-4o model to analyze the response.
2. A carefully engineered chain-of-thought prompt asks the model to evaluate whether the response contains factually accurate information.
3. The system requests multiple distinct responses to this prompt to ensure robust evaluation through consensus.
4. Each evaluation generates both an explanation of the reasoning and a binary judgment (yes/no) on factual accuracy.
5. The final Correctness score is computed as the ratio of 'yes' responses to the total number of evaluation responses.

We also surface one of the generated explanations, always choosing one that aligns with the majority judgment among the responses:

* If the score is greater than 0.5, the explanation will provide an argument that the response is factual
* If the score is less than 0.5, the explanation will provide an argument that it is not factual

This metric is computed by prompting an LLM multiple times, and thus requires additional LLM calls to compute, which may impact usage and billing.

## Understanding correctness
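As a rough illustration of the calculation method described above, the aggregation step can be sketched in a few lines of Python. This is a minimal sketch, not the production implementation: the `Judgment` class and both function names are hypothetical, the LLM calls themselves are omitted, and the handling of an exact 0.5 tie is an assumption because the description above does not cover that case.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One evaluation response (hypothetical structure for illustration)."""
    is_factual: bool   # the binary yes/no verdict
    explanation: str   # the reasoning that accompanied the verdict


def correctness_score(judgments: list[Judgment]) -> float:
    """Ratio of 'yes' verdicts to the total number of evaluation responses."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    yes_count = sum(1 for j in judgments if j.is_factual)
    return yes_count / len(judgments)


def pick_explanation(judgments: list[Judgment]) -> str:
    """Return an explanation that agrees with the majority verdict.

    Assumption: an exact tie (score == 0.5) is treated as 'not factual'.
    """
    majority_is_factual = correctness_score(judgments) > 0.5
    return next(
        j.explanation for j in judgments if j.is_factual == majority_is_factual
    )


if __name__ == "__main__":
    sample = [
        Judgment(True, "The stated date matches the historical record."),
        Judgment(True, "All claims are consistent with known facts."),
        Judgment(False, "The population figure appears outdated."),
    ]
    print(round(correctness_score(sample), 2))  # 0.67
    print(pick_explanation(sample))
```

With two of three judgments answering 'yes', the score is 2/3 and the surfaced explanation comes from one of the 'yes' judgments.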

### How Correctness Differs from Context Adherence

It's important to understand the distinction between related metrics:

* Correctness: Measures whether a model response has factually correct information, regardless of whether that information is contained in the provided context.
* Context Adherence: Measures whether the response adheres specifically to the information provided in the context.

Example: In a Text-to-SQL scenario, a response could be factually correct (high Correctness) but not derived from the provided context (low Context Adherence). Conversely, a response could faithfully represent the context (high Context Adherence) but contain factual errors if the context itself is incorrect.

## Optimizing your AI system

### Addressing Low Correctness Scores

When a response has a low Correctness score, it's likely that the response contains non-factual information. To improve your system:

* Flag and examine potentially non-factual responses: Identify patterns in responses that tend to contain factual errors.
* Adjust your prompts: Instruct the model to stick to information it's given in the context and avoid speculation.
* Implement verification steps: Add additional checks for factual accuracy before responses reach end users.
* Consider model selection: Some models may be more factually accurate than others for specific domains.
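The flagging and verification suggestions above could be wired into a pipeline along these lines. This is only a sketch under stated assumptions: `ScoredResponse` and `flag_low_correctness` are hypothetical names, the Correctness scores are assumed to have been computed already, and the 0.5 threshold is an illustrative default rather than a recommended value.

```python
from dataclasses import dataclass


@dataclass
class ScoredResponse:
    """A model response paired with its Correctness score (hypothetical)."""
    response_id: str
    text: str
    correctness: float  # assumed to be in the range [0, 1]


def flag_low_correctness(
    responses: list[ScoredResponse], threshold: float = 0.5
) -> tuple[list[ScoredResponse], list[ScoredResponse]]:
    """Split responses into (passed, flagged) by Correctness score.

    Responses scoring below the threshold are flagged for review
    instead of being sent on to end users.
    """
    passed: list[ScoredResponse] = []
    flagged: list[ScoredResponse] = []
    for response in responses:
        if response.correctness < threshold:
            flagged.append(response)
        else:
            passed.append(response)
    return passed, flagged


if __name__ == "__main__":
    batch = [
        ScoredResponse("r1", "Paris is the capital of France.", 1.0),
        ScoredResponse("r2", "The Moon is larger than Earth.", 0.0),
        ScoredResponse("r3", "Water boils at 100 °C at sea level.", 0.8),
    ]
    passed, flagged = flag_low_correctness(batch)
    print([r.response_id for r in passed])   # ['r1', 'r3']
    print([r.response_id for r in flagged])  # ['r2']
```

Collecting the flagged responses in one place makes it easier to look for recurring patterns, such as a topic or prompt template that repeatedly produces factual errors.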

## Best practices

* For critical applications, implement automated fact-checking against trusted knowledge bases or databases.
* Instruct models to ground their responses in verifiable information and cite sources when possible.
* Track Correctness scores across different knowledge domains to identify areas where your model may be less reliable.
* Develop domain-specific guardrails that can catch common factual errors before they reach users.

When optimizing for Correctness, remember that even human experts can disagree on certain facts. Consider implementing confidence levels for responses, especially in domains with evolving knowledge or subjective elements.

## Performance Benchmarks

We evaluated Correctness against human expert labels on an internal dataset using top frontier models.

| Model                   | F1 (True) |
| :---------------------- | :-------: |
| GPT-4.1                 |   0.90    |
| GPT-4.1-mini (judges=3) |   0.89    |
| Claude Sonnet 4.5       |   0.89    |
| Gemini 3 Flash          |   0.92    |

### GPT-4.1 Classification Report

Benchmarks based on internal evaluation dataset. Performance may vary by use case.

## Related Resources

If you would like to dive deeper or start implementing Correctness, check out the following resources:

### Examples

* [Correctness Examples](https://app.galileo.ai) - Log in and explore the "Correctness" Log Stream in the "Preset Metric Examples" Project to see this metric in action.

### Related Concepts

* [Context Adherence](/concepts/metrics/rag/generation-quality/context-adherence)
* [Context Relevance](/concepts/metrics/rag/retrieval-quality/context-relevance)
* [Ground Truth Adherence](/concepts/metrics/response-quality/ground-truth-adherence)