## Visual Quality at a glance
| Property | Description |
|---|---|
| Name | Visual Quality |
| Category | Multimodal Quality |
| Metric Level | LLM Span |
| LLM-as-a-judge Support | ✅ |
| Luna Support | ❌ |
| Protect Runtime Protection | ❌ |
| Value Type | boolean |
## Score interpretation
| Score | Label | Meaning |
|---|---|---|
| False | Low Quality | The image/PDF quality is low and the information required for the task cannot be extracted |
| True | High Quality | The image/PDF quality is high and performing the task is feasible |
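Because the metric is boolean, mapping a score back to its label is a one-line lookup. The helper below is a hypothetical convenience function, not part of any SDK:

```python
def interpret_visual_quality(score: bool) -> str:
    """Map the boolean Visual Quality score to its human-readable label."""
    return "High Quality" if score else "Low Quality"

print(interpret_visual_quality(True))   # High Quality
print(interpret_visual_quality(False))  # Low Quality
```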
## When to use this metric
### Example scenario

**Reading a serial number from a photo**

Task prompt: “Read the device serial number and return it exactly.”

- **High quality (True):** The serial number region is in focus, not occluded, and the characters are legible.
- **Low quality (False):** Motion blur or glare obscures the serial number region, making the characters unreadable and the task infeasible.
## Inputs considered

The evaluator examines the following when available:

- The raw input image provided to the LLM span
- The text prompt that defines the task the model is expected to perform on that image
## Calculation method

Visual Quality is computed through a multi-step process:

1. **Task grounding**: The evaluator reads the adjoining text prompt to identify what information in the image is required to complete the task.
2. **Visual evidence review**: The evaluator inspects the image for quality issues (blur, glare, compression, darkness, occlusion, cropping, distortion) that affect task-critical regions.

This metric is typically computed by prompting an LLM that has access to both the image and the text prompt. This may require additional LLM calls, which can affect usage and billing.
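The two steps above can be sketched as a single LLM-as-a-judge call. Everything in this sketch is illustrative: the prompt wording, the injected `judge_fn` interface, and the TRUE/FALSE parsing are assumptions, not the exact prompt or API the metric uses:

```python
# Illustrative judge prompt covering both steps: task grounding, then
# inspection of task-critical regions for quality defects.
JUDGE_PROMPT = """\
You are evaluating whether an image is of sufficient quality for a task.
Task: {task}
First, identify which regions of the image the task depends on.
Then, check those regions for blur, glare, compression artifacts, darkness,
occlusion, cropping, or distortion.
Answer with a single word: TRUE if the task is feasible, FALSE otherwise."""

def visual_quality(task: str, image_bytes: bytes, judge_fn) -> bool:
    """Ask an LLM judge (injected as judge_fn) whether the image supports the task."""
    prompt = JUDGE_PROMPT.format(task=task)
    response = judge_fn(prompt, image_bytes)  # a real judge_fn would call a vision LLM
    return response.strip().upper().startswith("TRUE")

# Stubbed judge for illustration only.
print(visual_quality("Read the serial number", b"...", lambda p, img: "TRUE"))  # True
```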
## Best practices
### Duplicate and create a custom metric

We recommend duplicating this metric and modifying the prompt to define your specific task; this typically improves performance.
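As a sketch of what "defining your task in the prompt" might look like, the template below bakes the serial-number task from the example scenario into the judge prompt. The wording and variable names are illustrative only:

```python
# Hypothetical customized judge prompt embedding a specific task definition.
CUSTOM_TASK = "Read the device serial number and return it exactly."

CUSTOM_JUDGE_PROMPT = (
    "You are judging image quality for this specific task:\n"
    f"{CUSTOM_TASK}\n"
    "Focus on the serial-number region: is it in focus, unoccluded, and legible?\n"
    "Answer TRUE if the task is feasible, FALSE otherwise."
)

print(CUSTOM_TASK in CUSTOM_JUDGE_PROMPT)  # True
```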
### Use the metric as a diagnostic tool

Visual Quality helps you diagnose why an agent failed to complete an action by indicating whether image quality was the cause.
### Combine with Action Completion

Use Action Completion to check whether your agent completes the user's requests, then use Visual Quality to diagnose cases where Action Completion is low.
## Performance Benchmarks

We evaluated Visual Quality against human expert labels on an internal dataset of varied samples using top frontier models.

| Model | F1 (True) |
|---|---|
| GPT-4.1 | 0.84 |
| Claude Sonnet 4.6 | 0.81 |
| Gemini 3.1 Flash | 0.85 |
## Related Resources

If you would like to dive deeper or start implementing Visual Quality, check out the following resources:

### Examples

- Visual Quality Examples: Log in and explore the “Visual Quality” Log Stream in the “Preset Metric Examples” Project to see this metric in action.