The Visual Quality metric is task-grounded rather than purely aesthetic: an image / PDF is considered high quality only if the specific task can be completed reliably from the visual evidence. An image / PDF is considered low quality when blur, glare, compression artifacts, darkness, occlusion, crop, distortion, or other degradation makes task-critical information unprocessable.

Visual Quality at a glance

| Property | Description |
| --- | --- |
| Name | Visual Quality |
| Category | Multimodal Quality |
| Metric Level | LLM Span |
| LLM-as-a-judge Support | |
| Luna Support | |
| Protect Runtime Protection | |
| Value Type | Boolean |

Score interpretation

| Score | Label | Meaning |
| --- | --- | --- |
| `False` | Low Quality | The image / PDF quality is low and information required to perform the task is unprocessable. |
| `True` | High Quality | The image / PDF quality is high and performing the task is feasible. |

When to use this metric

Example scenario

Reading a serial number from a photo

Task prompt: “Read the device serial number and return it exactly.”
  • High quality: The serial number region is in focus, not occluded, and the characters are legible.
  • Low quality: Motion blur or glare obscures the serial number region, making the characters unreadable and the task infeasible.

Inputs considered

The evaluator examines the following when available:
  • The raw input image provided to the LLM span
  • The text prompt that defines the task the model is expected to perform on that image
Quality is evaluated strictly with respect to whether the task described in the prompt can be completed; a low-resolution image that is perfectly adequate for one task may be insufficient for another.

Calculation method

Visual Quality is computed through a multi-step process:
1. Task grounding: The evaluator reads the accompanying text prompt to identify what information in the image is required to complete the task.
2. Visual evidence review: The evaluator inspects the image for quality issues (blur, glare, compression, darkness, occlusion, crop, distortion) that affect task-critical regions.
3. Binary decision: The evaluator returns a binary label: `True` if the task is feasible from the visual evidence, otherwise `False`.
This metric is typically computed by prompting an LLM that has access to both the image and the text prompt; the additional LLM calls required can impact usage and billing.
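The three steps above can be sketched as a thin wrapper around a multimodal judge call. Everything here is a hypothetical illustration, not the evaluator's actual implementation: the `judge` callable and the prompt wording are assumptions standing in for whatever multimodal LLM API you use.

```python
from typing import Callable

# Degradations named in the metric definition.
DEGRADATIONS = ("blur", "glare", "compression", "darkness",
                "occlusion", "crop", "distortion")

def visual_quality(judge: Callable[[str, bytes], str],
                   task_prompt: str, image: bytes) -> bool:
    """Hypothetical sketch of the three-step judge flow.

    `judge` stands in for any multimodal LLM call that takes an
    instruction plus image bytes and returns text.
    """
    instruction = (
        # Step 1: ground the evaluation in the task.
        f"Task: {task_prompt}\n"
        "Identify which regions of the image the task depends on.\n"
        # Step 2: review the visual evidence in those regions.
        "Check those regions for: " + ", ".join(DEGRADATIONS) + ".\n"
        # Step 3: force a binary verdict.
        "Answer exactly 'True' if the task is feasible from the image, "
        "otherwise 'False'."
    )
    return judge(instruction, image).strip().lower() == "true"
```

In practice the callable would wrap your model provider's multimodal API; here it only shapes the prompt and parses the binary verdict.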

Best practices

Duplicate and create custom metric

We recommend duplicating this metric and modifying the prompt to describe your specific task; a task-grounded prompt improves judge performance.
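As a concrete sketch of baking your task into a duplicated metric's prompt, you might template it like this. The base wording and helper name are illustrative assumptions, not the shipped prompt:

```python
# Hypothetical base wording; the shipped judge prompt is not reproduced here.
BASE_JUDGE_PROMPT = (
    "Judge whether the image quality is sufficient to complete the task below.\n"
    "Answer exactly 'True' or 'False'.\n"
)

def customized_prompt(task_description: str) -> str:
    # Pin the duplicated metric's prompt to your specific task
    # instead of leaving it generic.
    return BASE_JUDGE_PROMPT + "Task: " + task_description
```

For the serial-number scenario above, `customized_prompt("Read the device serial number and return it exactly.")` produces a judge prompt grounded in that exact task.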

Use metric as a diagnostic tool

Visual Quality helps you diagnose why an agent failed to complete a given action by indicating whether image quality was the cause.

Combine with Action Completion

Use Action Completion to check whether your agent is completing user requests, then use Visual Quality to diagnose the spans where Action Completion is low.
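Assuming each logged span carries both metric values (the field names here are hypothetical, not a documented schema), the triage can be a simple filter:

```python
def triage(spans: list[dict]) -> list[tuple[str, str]]:
    """For spans where the action was not completed, report whether
    low visual quality is the likely cause. Keys are hypothetical."""
    findings = []
    for span in spans:
        if span["action_completion"]:
            continue  # the agent succeeded; nothing to diagnose
        cause = ("low visual quality" if not span["visual_quality"]
                 else "other (image quality was fine)")
        findings.append((span["id"], cause))
    return findings
```

Spans flagged "other" failed for reasons unrelated to the image, so you can focus quality-of-capture fixes on the first group only.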

Performance Benchmarks

We evaluated Visual Quality against human expert labels on an internal dataset of varied samples using top frontier models.
| Model | F1 (True) |
| --- | --- |
| GPT-4.1 | 0.84 |
| Claude Sonnet 4.6 | 0.81 |
| Gemini 3.1 Flash | 0.85 |
If you would like to dive deeper or start implementing Visual Quality, check out the following resources:

Examples

  • Visual Quality Examples - Log in and explore the “Visual Quality” Log Stream in the “Preset Metric Examples” Project to see this metric in action.