Multimodal Quality metrics help you measure whether multimodal inputs and outputs (such as images and audio conversations) are usable and compliant for the task at hand. Use Multimodal Quality metrics when you want to:
  • Validate that input images are clear enough for reliable task completion.
  • Enforce explicit brand or content rules on generated images using only visible evidence.
  • Detect turn-taking issues in audio-based conversations (overlap and barge-in).
To start using these metrics, first log your images, audio, or documents with Multimodal Observability. Below is a quick reference table of all multimodal quality metrics:
| Name | Description | Supported Nodes | When to Use | Example Use Case |
| --- | --- | --- | --- | --- |
| Visual Quality | Judges whether the quality of an input image or PDF is sufficient to reliably complete the task in the adjoining prompt. | LLM span | When user-supplied images might be blurry, occluded, cropped, or poorly lit, and those artifacts can make the task infeasible. | A document capture flow where you need to know if a photo is readable enough to extract a serial number. |
| Visual Fidelity | Checks whether a generated image satisfies every applicable provided brand rule, based only on visible evidence. | LLM span | When generated images must comply with explicit brand, style, layout, or content rules. | A marketing image generator where logos, colors, and prohibited elements must always comply with a brand rule set. |
| Interruption Detection | Detects turn-taking violations in audio conversations, including overlap and barge-in events. | Session (trace inputs/outputs only) | When evaluating voice agents where smooth turn-taking and endpointing are critical to user experience. | A voice assistant where the agent must avoid speaking over users or cutting them off mid-utterance. |
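
To make the overlap and barge-in terminology concrete, here is a minimal Python sketch of how such events could be found in timestamped speech segments. It is an illustration only, not how the Interruption Detection metric is implemented. It assumes the audio has already been split into per-speaker segments with start and end times. The names `Segment` and `find_turn_taking_events`, the 0.2-second threshold, and the rule separating "barge_in" from "overlap" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    speaker: str  # e.g. "user" or "agent"
    start: float  # seconds from the beginning of the recording
    end: float


@dataclass(frozen=True)
class TurnTakingEvent:
    kind: str         # "barge_in" or "overlap" (labels are assumptions)
    interrupter: str  # the speaker who started talking second
    seconds: float    # how long both speakers were talking at once


def find_turn_taking_events(segments, min_overlap=0.2):
    """Find cross-speaker overlaps in a list of timestamped segments.

    Assumed rule: if the second speaker keeps talking after the first one
    stops, the event is labeled "barge_in"; otherwise it is "overlap".
    """
    ordered = sorted(segments, key=lambda s: s.start)
    events = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second.start >= first.end:
                break  # sorted by start, so nothing later overlaps `first`
            if second.speaker == first.speaker:
                continue
            shared = min(first.end, second.end) - second.start
            if shared < min_overlap:
                continue  # ignore very short overlaps
            kind = "barge_in" if second.end > first.end else "overlap"
            events.append(TurnTakingEvent(kind, second.speaker, shared))
    return events


if __name__ == "__main__":
    conversation = [
        Segment("user", 0.0, 3.0),
        Segment("agent", 2.5, 5.0),  # starts before the user finishes
        Segment("user", 6.0, 8.0),
        Segment("agent", 6.5, 7.0),  # short speech inside the user's turn
    ]
    for event in find_turn_taking_events(conversation):
        print(event)
```

The minimum-overlap threshold is there so that small timestamp jitter between adjacent turns is not reported as a violation; a real evaluation would need to tune it, and would likely also consider which party is the agent.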

Next steps