## Visual Quality at a glance
| Property | Description |
|---|---|
| Name | Visual Quality |
| Category | Multimodal Quality |
| Metric Level | LLM Span |
| LLM-as-a-judge Support | ✅ |
| Luna Support | ❌ |
| Protect Runtime Protection | ❌ |
| Value Type | boolean |
## Score interpretation
| Score | Label | Meaning |
|---|---|---|
| False | Low Quality | The image/PDF quality is low and the information required for the task cannot be extracted |
| True | High Quality | The image/PDF quality is high and performing the task is feasible |
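Because the metric is boolean, mapping a score back to its label is a one-line lookup. The helper below is a hypothetical convenience function, not part of any SDK:

```python
def interpret_visual_quality(score: bool) -> str:
    """Map the boolean Visual Quality score to its human-readable label."""
    return "High Quality" if score else "Low Quality"

print(interpret_visual_quality(True))   # High Quality
print(interpret_visual_quality(False))  # Low Quality
```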
## When to use this metric
### Example scenario

**Reading a serial number from a photo**

Task prompt: “Read the device serial number and return it exactly.”

- **High quality (True):** The serial number region is in focus, not occluded, and the characters are legible.
- **Low quality (False):** Motion blur or glare obscures the serial number region, making the characters unreadable and the task infeasible.
## Inputs considered

The evaluator examines the following when available:

- The raw input image provided to the LLM span
- The text prompt that defines the task the model is expected to perform on that image
## Calculation method

Visual Quality is computed through a multi-step process:

1. **Task grounding**: The evaluator reads the adjoining text prompt to identify what information in the image is required to complete the task.
2. **Visual evidence review**: The evaluator inspects the image for quality issues (blur, glare, compression, darkness, occlusion, cropping, distortion) that affect task-critical regions.

This metric is typically computed by prompting an LLM that has access to both the image and the text prompt. This may require additional LLM calls, which can affect usage and billing.
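The two steps above can be sketched as a single LLM-as-a-judge call. Everything in this sketch is illustrative: the prompt wording, the injected `judge_fn` interface, and the TRUE/FALSE parsing are assumptions, not the exact prompt or API the metric uses:

```python
# Illustrative judge prompt covering both steps: task grounding, then
# inspection of task-critical regions for quality defects.
JUDGE_PROMPT = """\
You are evaluating whether an image is of sufficient quality for a task.
Task: {task}
First, identify which regions of the image the task depends on.
Then, check those regions for blur, glare, compression artifacts, darkness,
occlusion, cropping, or distortion.
Answer with a single word: TRUE if the task is feasible, FALSE otherwise."""

def visual_quality(task: str, image_bytes: bytes, judge_fn) -> bool:
    """Ask an LLM judge (injected as judge_fn) whether the image supports the task."""
    prompt = JUDGE_PROMPT.format(task=task)
    response = judge_fn(prompt, image_bytes)  # a real judge_fn would call a vision LLM
    return response.strip().upper().startswith("TRUE")

# Stubbed judge for illustration only.
print(visual_quality("Read the serial number", b"...", lambda p, img: "TRUE"))  # True
```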
## Best practices
### Duplicate and create a custom metric

We recommend duplicating this metric and modifying the prompt to define your specific task; this typically improves performance.
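As a sketch of what "defining your task in the prompt" might look like, the template below bakes the serial-number task from the example scenario into the judge prompt. The wording and variable names are illustrative only:

```python
# Hypothetical customized judge prompt embedding a specific task definition.
CUSTOM_TASK = "Read the device serial number and return it exactly."

CUSTOM_JUDGE_PROMPT = (
    "You are judging image quality for this specific task:\n"
    f"{CUSTOM_TASK}\n"
    "Focus on the serial-number region: is it in focus, unoccluded, and legible?\n"
    "Answer TRUE if the task is feasible, FALSE otherwise."
)

print(CUSTOM_TASK in CUSTOM_JUDGE_PROMPT)  # True
```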
### Use the metric as a diagnostic tool

Visual Quality helps you diagnose why an agent failed to complete an action by indicating whether image quality was the cause.
### Combine with Action Completion

Use Action Completion to check whether your agent completes the user's requests, then use Visual Quality to diagnose cases where Action Completion is low.
## Performance Benchmarks

We evaluated Visual Quality against human expert labels on an internal dataset of varied samples using top frontier models.

| Model | F1 (True) |
|---|---|
| GPT-4.1 | 0.84 |
| Claude Sonnet 4.6 | 0.81 |
| Gemini 3.1 Flash | 0.85 |
## Related Resources

If you would like to dive deeper or start implementing Visual Quality, check out the following resources:

### Examples

- Visual Quality Examples: Log in and explore the “Visual Quality” Log Stream in the “Preset Metric Examples” Project to see this metric in action.