Multimodal Quality Metrics

Multimodal Quality metrics help you measure whether multimodal inputs and outputs (such as images and audio conversations) are usable and compliant for the task at hand. Use Multimodal Quality metrics when you want to:

Validate that input images are clear enough for reliable task completion.
Enforce explicit brand or content rules on generated images using only visible evidence.
Detect turn-taking issues in audio-based conversations (overlap and barge-in).

To start using these metrics, first log your images, audio, or documents with Multimodal Observability. Below is a quick reference table of all multimodal quality metrics:

Name	Description	Supported Nodes	When to Use	Example Use Case
Visual Quality	Judges whether an input image or PDF is of sufficient quality to reliably complete the task in the accompanying prompt.	LLM span	When user-supplied images might be blurry, occluded, cropped, or poorly lit, and those defects can make the task infeasible.	A document capture flow where you need to know if a photo is readable enough to extract a serial number.
Visual Fidelity	Checks whether a generated image satisfies every applicable provided brand rule, based only on visible evidence.	LLM span	When generated images must comply with explicit brand, style, layout, or content rules.	A marketing image generator where logos, colors, and prohibited elements must always comply with a brand rule set.
Interruption Detection	Detects turn-taking violations in audio conversations, including overlap and barge-in events.	Session (trace inputs/outputs only)	When evaluating voice agents where smooth turn-taking and endpointing are critical to user experience.	A voice assistant where the agent must avoid speaking over users or cutting them off mid-utterance.
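
To make the overlap and barge-in terminology concrete, here is a minimal Python sketch that flags moments where two different speakers talk at the same time, given turn timestamps. It is an illustration only and not how the Interruption Detection metric is computed: the `Turn` type, the `find_overlaps` helper, and the 0.2-second `min_overlap` threshold are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One speaker turn with start/end times in seconds (hypothetical type)."""
    speaker: str
    start: float
    end: float


def find_overlaps(turns, min_overlap=0.2):
    """Return (earlier_turn, interrupting_turn, overlap_seconds) tuples.

    A pair is reported when a different speaker starts talking before the
    earlier turn has ended and the shared interval lasts at least
    `min_overlap` seconds (an assumed threshold, not a product default).
    """
    ordered = sorted(turns, key=lambda t: t.start)
    events = []
    for i, current in enumerate(ordered):
        for later in ordered[i + 1:]:
            # Turns are sorted by start time, so nothing after this one
            # can overlap `current` either.
            if later.start >= current.end:
                break
            if later.speaker == current.speaker:
                continue
            overlap = min(current.end, later.end) - later.start
            if overlap >= min_overlap:
                events.append((current, later, overlap))
    return events


turns = [
    Turn("user", 0.0, 3.0),
    Turn("agent", 2.5, 5.0),  # agent starts before the user has finished
    Turn("user", 5.1, 6.0),
]
for earlier, interrupter, seconds in find_overlaps(turns):
    print(f"{interrupter.speaker} overlapped {earlier.speaker} for {seconds:.1f}s")
```

In this sketch the agent's turn beginning at 2.5 s is reported as overlapping the user's turn that ends at 3.0 s. A real evaluation would work from the audio itself rather than from pre-segmented timestamps.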
