The Visual Fidelity metric is a rule-adherence check: using only visible evidence from the image and the provided brand rules, the evaluator determines whether the image satisfies every applicable rule. The metric is grounded in explicit rule compliance rather than pure aesthetics, prompt reconstruction, or any separate image-quality standard not written in the rules.
To use this metric, you will need to duplicate and edit the prompt to provide your rules in the specified section of the prompt.

Visual Fidelity at a glance

Property                  Description
Name                      Visual Fidelity
Category                  Multimodal Quality
Metric Level              LLM Span
LLM-as-a-judge Support
Luna Support
Protect                   Runtime Protection
Value Type                Boolean

Score interpretation

Score   Label           Meaning
False   Non-Compliant   One or more applicable provided rules are violated based on visible evidence in the image
True    Compliant       All applicable provided rules pass based on visible evidence in the image

When to use this metric

Example scenario

Brand rules compliance for a generated banner

  • Provided rules: “Logo must appear in the top-left”, “Primary color must be #E35454”, “No competitor logos”.
  • Compliant (True): the generated banner visibly satisfies each applicable rule (correct logo placement, correct primary color usage, no prohibited content).
  • Non-Compliant (False): the logo is missing or misplaced, the primary color rule is violated, or prohibited content is present; any single rule violation fails the metric.
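The scenario above can be sketched in code. This is an illustrative stub, not the metric's implementation: in the real metric each per-rule verdict comes from a vision-capable LLM judge inspecting the image, whereas here the verdicts are hard-coded.

```python
# Illustrative stub of the banner scenario. In practice, each verdict
# is produced by an LLM judge looking at the generated image; here the
# per-rule results are hard-coded for demonstration.

rules = [
    "Logo must appear in the top-left",
    "Primary color must be #E35454",
    "No competitor logos",
]

# Compliant banner: every applicable rule passes.
compliant_verdicts = {rule: True for rule in rules}
print(all(compliant_verdicts.values()))  # True -> Compliant

# Non-compliant banner: a single violation (misplaced logo) fails the metric.
violating_verdicts = dict(compliant_verdicts, **{rules[0]: False})
print(all(violating_verdicts.values()))  # False -> Non-Compliant
```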

Inputs considered

The evaluator examines the following when available:
  • The generated image produced by the LLM span (output image)
  • The set of provided brand or content rules that apply to the image
Only rules that are applicable to the generated image are evaluated; inapplicable rules are skipped and do not affect the score. Compliance is determined solely from what is visually observable — the evaluator does not infer intent or reconstruct the original prompt.

Calculation method

Visual Fidelity is computed through a multi-step process:
1. Rule scoping: Determine which provided rules are applicable to the generated image and should be evaluated.
2. Visible-evidence evaluation: Using only visible evidence in the image, evaluate each applicable rule as pass or fail. The evaluator does not reconstruct prompts or apply any external image-quality standard.
3. All-rules decision: Return True if and only if all applicable rules pass. Otherwise return False.
This metric is typically computed by prompting an LLM with access to the generated image and the provided rules, which may require additional LLM calls to compute and can impact usage and billing.
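The steps above can be sketched as a small function. This is a simplified sketch, assuming that rule applicability and per-rule verdicts have already been extracted as predicates over evidence from the image (in the real metric both are LLM judgments); the `Rule` type and its `applies`/`passes` fields are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    text: str
    # Step 1: does this rule apply to the image at all?
    applies: Callable[[dict], bool]
    # Step 2: given visible evidence, does the image satisfy the rule?
    passes: Callable[[dict], bool]

def visual_fidelity(evidence: dict, rules: list[Rule]) -> bool:
    # Step 1: rule scoping -- inapplicable rules are skipped entirely
    # and do not affect the score.
    applicable = [r for r in rules if r.applies(evidence)]
    # Steps 2-3: True iff every applicable rule passes.
    return all(r.passes(evidence) for r in applicable)

# Usage with stubbed evidence extracted from an image.
rules = [
    Rule("Logo must appear in the top-left",
         applies=lambda e: "logo_position" in e,
         passes=lambda e: e["logo_position"] == "top-left"),
    Rule("No competitor logos",
         applies=lambda e: True,
         passes=lambda e: not e.get("competitor_logo", False)),
]
print(visual_fidelity({"logo_position": "top-left"}, rules))  # True
print(visual_fidelity({"logo_position": "center"}, rules))    # False
```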

Best practices

Write observable rules

Each rule should describe something that can be confirmed or denied from visual inspection alone. Avoid rules that require knowledge of the generation prompt or model internals.

Keep rules atomic

Express one constraint per rule so that a failing rule identifies a specific violation rather than a bundle of requirements.
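As a concrete illustration (the rule texts are hypothetical), compare a bundled rule with its atomic equivalent:

```python
# Bundled: one rule covers three constraints, so a single failing
# verdict does not tell you which constraint was violated.
bundled = [
    "Logo is top-left, the primary color is #E35454, and no competitor logos appear",
]

# Atomic: one constraint per rule, so a failing rule names the exact violation.
atomic = [
    "Logo must appear in the top-left",
    "Primary color must be #E35454",
    "No competitor logos",
]

# With atomic rules, a verdict map pinpoints the failure.
verdicts = {atomic[0]: True, atomic[1]: False, atomic[2]: True}
failed = [rule for rule, ok in verdicts.items() if not ok]
print(failed)  # ['Primary color must be #E35454']
```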

Version your rule sets

Treat brand rule sets as versioned artifacts so changes in guidelines can be tracked and their effect on compliance scores can be measured.
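One way to treat a rule set as a versioned artifact (the field names here are a suggestion, not a Galileo schema) is to store it with an explicit version string so compliance scores can be attributed to the exact guidelines in effect:

```python
import json

rule_set = {
    "name": "brand-banner-rules",   # hypothetical identifier
    "version": "2.1.0",             # bump whenever the guidelines change
    "rules": [
        "Logo must appear in the top-left",
        "Primary color must be #E35454",
        "No competitor logos",
    ],
}

# Persisting the set as JSON makes diffs between versions reviewable,
# and each evaluation run can record which rule-set version it used.
serialized = json.dumps(rule_set, indent=2)
restored = json.loads(serialized)
print(restored["version"])  # 2.1.0
```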

Combine with Instruction Adherence

Pair this metric with Instruction Adherence to also check whether your image-generating LLM is following the instructions in your prompt, not just your brand rules.

Performance Benchmarks

We evaluated Visual Fidelity against human expert labels on an internal dataset of varied samples using top frontier models.
Model               F1 (True)
GPT-4.1             0.79
Claude Sonnet 4.6   0.76
Gemini 3.1 Flash    0.81
If you would like to dive deeper or start implementing Visual Fidelity, check out the following resources:

Examples

  • Visual Fidelity Examples - Log in and explore the “Visual Fidelity” Log Stream in the “Preset Metric Examples” Project to see this metric in action.