Visual Quality is a binary metric that judges whether the quality of an input image / PDF in an LLM span is sufficient for the task described in the adjoining text prompt to be reliably performed.

The Visual Quality metric is task-grounded rather than purely aesthetic: an image / PDF is considered high quality only if the specific task can be completed reliably from the visual evidence. An image / PDF is considered low quality when blur, glare, compression artifacts, darkness, occlusion, crop, distortion, or other degradation makes task-critical information unprocessable.

## Visual Quality at a glance

| Property                       | Description        |
| :----------------------------- | :----------------- |
| **Name**                       | Visual Quality     |
| **Category**                   | Multimodal Quality |
| **Metric Level**               | LLM Span           |
| **LLM-as-a-judge Support**     | ✅                 |
| **Luna Support**               | ❌                 |
| **Protect Runtime Protection** | ❌                 |
| **Value Type**                 | boolean            |

## Score interpretation

| Score     | Label        | Meaning                                                                                                 |
| :-------- | :----------- | :------------------------------------------------------------------------------------------------------ |
| **False** | Low Quality  | The quality of the image / PDF is low and information required for performing the task is unprocessable |
| **True**  | High Quality | The quality of the image / PDF is high and performing the task is feasible                              |

## When to use this metric

## Example scenario

Reading a serial number from a photo

Task prompt: “Read the device serial number and return it exactly.”

High Quality (True): The serial number region is in focus, not occluded, and the characters are legible.

Low Quality (False): Motion blur or glare obscures the serial number region, making characters unreadable and the task infeasible.
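
The task-grounded rule behind this scenario can be pictured with a small sketch. It is an illustration only, not the evaluator's actual implementation (the real metric relies on an LLM judge): the `Region` type, its field names, and the simplification that any listed degradation makes a region unprocessable are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """Hypothetical image region plus the degradations observed in it."""
    name: str
    task_critical: bool
    degradations: list[str] = field(default_factory=list)


def visual_quality(regions: list[Region]) -> bool:
    """Return True (High Quality) unless a task-critical region is degraded.

    Simplifying assumption: any degradation listed on a region is severe
    enough to make that region unprocessable.
    """
    return not any(r.task_critical and r.degradations for r in regions)


# Background blur does not matter because the serial number is clean.
sharp_photo = [
    Region("serial_number", task_critical=True),
    Region("background", task_critical=False, degradations=["blur"]),
]

# Glare on the serial number makes the task infeasible.
glare_photo = [
    Region("serial_number", task_critical=True, degradations=["glare"]),
    Region("background", task_critical=False),
]

print(visual_quality(sharp_photo))
print(visual_quality(glare_photo))
```

The point of the sketch is that degradation outside the task-critical region does not lower the score; only degradation that blocks the task does.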

## Inputs considered

The evaluator examines the following when available:

* The raw input image provided to the LLM span
* The text prompt that defines the task the model is expected to perform on that image

Quality is evaluated strictly with respect to whether the task described in the prompt can be completed; a low-resolution image that is perfectly adequate for one task may be insufficient for another.

## Calculation method

Visual Quality is computed through a multi-step process:

1. The evaluator reads the adjoining text prompt to identify what information in the image is required to complete the task.
2. The evaluator inspects the image for quality issues (blur, glare, compression, darkness, occlusion, crop, distortion) that affect task-critical regions.
3. The evaluator returns a binary label: True if the task is feasible from the visual evidence, otherwise False.

This metric is typically computed by prompting an LLM with access to the image and the text prompt, which may require additional LLM calls and can affect usage and billing.

## Best practices

* We recommend duplicating this metric, modifying the prompt, and defining your task in the prompt for better performance.
* Visual Quality helps you diagnose why an agent did not complete an action by indicating whether image quality was the cause.
* Use Action Completion to check whether your agent is completing user requests, then use Visual Quality to diagnose the cases where action completion is low.

## Performance Benchmarks

We evaluated Visual Quality against human expert labels on an internal dataset of varied samples using top frontier models.

| Model             | F1 (True) |
| :---------------- | :-------: |
| GPT-4.1           |   0.84    |
| Claude Sonnet 4.6 |   0.81    |
| Gemini 3.1 Flash  |   0.85    |

## Related Resources

If you would like to dive deeper or start implementing Visual Quality, check out the following resources:

### Examples

* [Visual Quality Examples](https://app.galileo.ai) - Log in and explore the "Visual Quality" Log Stream in the "Preset Metric Examples" Project to see this metric in action.

### Related Concepts

* [Action Completion](/concepts/metrics/agentic/action-advancement)
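
As a rough sketch of how Visual Quality and Action Completion can be combined for diagnosis, the snippet below counts how many failed actions coincide with a low Visual Quality score. The span records and the field names `action_completion` and `visual_quality` are hypothetical stand-ins for exported metric results, not a documented export format.

```python
from collections import Counter

# Hypothetical exported results, one record per LLM span.
# Field names are assumptions made for this illustration.
spans = [
    {"id": "a", "action_completion": True, "visual_quality": True},
    {"id": "b", "action_completion": False, "visual_quality": False},
    {"id": "c", "action_completion": False, "visual_quality": True},
    {"id": "d", "action_completion": False, "visual_quality": False},
]


def triage(records):
    """Bucket failed actions by whether low visual quality co-occurred."""
    failed = [r for r in records if not r["action_completion"]]
    return Counter(
        "low_visual_quality" if not r["visual_quality"] else "other"
        for r in failed
    )


summary = triage(spans)
print(summary)
```

Failures in the `low_visual_quality` bucket point to input quality as a likely cause, while those in `other` need a different explanation.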