Visual Fidelity is a binary metric that evaluates whether a generated image in an LLM span satisfies every applicable provided brand rule, based solely on visible evidence in the image.

The Visual Fidelity metric is a rule-adherence check: using only visible evidence from the image and the provided brand rules, the evaluator determines whether the image satisfies every applicable rule. The metric is grounded in explicit rule compliance rather than pure aesthetics, prompt reconstruction, or any separate image-quality standard not written in the rules.

To use this metric, you will need to duplicate and edit the prompt to provide your rules in the specified section of the prompt.

## Visual Fidelity at a glance

| Property                       | Description        |
| :----------------------------- | :----------------- |
| **Name**                       | Visual Fidelity    |
| **Category**                   | Multimodal Quality |
| **Metric Level**               | LLM Span           |
| **LLM-as-a-judge Support**     | ✅                 |
| **Luna Support**               | ❌                 |
| **Protect Runtime Protection** | ❌                 |
| **Value Type**                 | Boolean            |

## Score interpretation

| Score     | Label         | Meaning                                                                                   |
| :-------- | :------------ | :---------------------------------------------------------------------------------------- |
| **False** | Non-Compliant | One or more applicable provided rules are violated based on visible evidence in the image |
| **True**  | Compliant     | All applicable provided rules pass based on visible evidence in the image                 |

## When to use this metric

## Example scenario

Brand rules compliance for a generated banner

Provided rules: “Logo must appear in the top-left”, “Primary color must be #E35454”, “No competitor logos”.

Compliant (True): The generated banner visibly satisfies each applicable rule (logo placement, correct primary color usage, no prohibited content).

Non-Compliant (False): The logo is missing or misplaced, the primary color rule is violated, or prohibited content is present. Any single rule violation fails the metric.
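
The pass/fail logic in this scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the evaluator's implementation: the actual per-rule judgments come from an LLM looking at the image, and the names `RuleResult` and `visual_fidelity`, as well as the fourth "tagline" rule, are hypothetical. The sketch also assumes that an image with no applicable rules scores `True`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RuleResult:
    """Hypothetical container for one rule's verdict on one image."""
    rule: str
    # True = rule passes, False = rule violated,
    # None = rule is not applicable to this image and is skipped.
    passed: Optional[bool]


def visual_fidelity(results: List[RuleResult]) -> bool:
    """Return True only if every applicable rule passes."""
    applicable = [r for r in results if r.passed is not None]
    # all() over an empty list is True (assumed edge-case behavior).
    return all(r.passed for r in applicable)


compliant_banner = [
    RuleResult("Logo must appear in the top-left", True),
    RuleResult("Primary color must be #E35454", True),
    RuleResult("No competitor logos", True),
    # Hypothetical extra rule: the image has no tagline, so it is skipped.
    RuleResult("Tagline must use the brand font", None),
]

non_compliant_banner = [
    RuleResult("Logo must appear in the top-left", False),  # logo misplaced
    RuleResult("Primary color must be #E35454", True),
    RuleResult("No competitor logos", True),
]

print(visual_fidelity(compliant_banner))      # True
print(visual_fidelity(non_compliant_banner))  # False
```

Keeping "not applicable" as a distinct third state, rather than treating it as a pass, makes it possible to report which rules were actually checked for a given image.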

## Inputs considered

The evaluator examines the following when available:

* The generated image produced by the LLM span (output image)
* The set of provided brand or content rules that apply to the image

Only rules that are **applicable** to the generated image are evaluated; inapplicable rules are skipped and do not affect the score. Compliance is determined solely from what is **visually observable**: the evaluator does not infer intent or reconstruct the original prompt.

## Calculation method

Visual Fidelity is computed through a multi-step process:

1. Determine which provided rules are applicable to the generated image and should be evaluated.
2. Using only visible evidence in the image, evaluate each applicable rule as pass or fail. The evaluator does not reconstruct prompts or apply any external image-quality standard.
3. Return True if and only if all applicable rules pass. Otherwise return False.

This metric is typically computed by prompting an LLM with access to the generated image and the provided rules, which may require additional LLM calls and can impact usage and billing.

## Best practices

* Each rule should describe something that can be confirmed or denied from visual inspection alone. Avoid rules that require knowledge of the generation prompt or model internals.
* Express one constraint per rule so that a failing rule identifies a specific violation rather than a bundle of requirements.
* Treat brand rule sets as versioned artifacts so changes in guidelines can be tracked and their effect on compliance scores can be measured.
* Use Instruction Adherence alongside this metric to check whether the LLM generating the images is following your instructions.

## Performance Benchmarks

We evaluated Visual Fidelity against human expert labels on an internal dataset of varied samples using top frontier models.

| Model             | F1 (True) |
| :---------------- | :-------: |
| GPT-4.1           |   0.79    |
| Claude Sonnet 4.6 |   0.76    |
| Gemini 3.1 Flash  |   0.81    |

## Related Resources

If you would like to dive deeper or start implementing Visual Fidelity, check out the following resources:

### Examples

* [Visual Fidelity Examples](https://app.galileo.ai) - Log in and explore the "Visual Fidelity" Log Stream in the "Preset Metric Examples" Project to see this metric in action.

### Related Concepts

* [Instruction Adherence](/concepts/metrics/response-quality/instruction-adherence)