Precision	Recall	F1-Score
{negativeLabel}	{negClass.precision.toFixed(2)}	{negClass.recall.toFixed(2)}	{negClass.f1.toFixed(2)}
{positiveLabel}	{posClass.precision.toFixed(2)}	{posClass.recall.toFixed(2)}	{posClass.f1.toFixed(2)}

{}

{titlePrefix}Confusion Matrix (Normalized)

{}

Predicted

{}

{displayPredictedLabels.left}

{displayPredictedLabels.right}

{}

Actual

{displayActualLabels.top}

{showCounts &&

{displayMatrix.tl.count}

}

{formatValue(displayMatrix.tl.pct)}

{showCounts &&

{displayMatrix.tr.count}

}

{formatValue(displayMatrix.tr.pct)}

{}

{displayActualLabels.bottom}

{showCounts &&

{displayMatrix.bl.count}

}

{formatValue(displayMatrix.bl.pct)}

{showCounts &&

{displayMatrix.br.count}

}

{formatValue(displayMatrix.br.pct)}

{}

{displayFormat === "fraction" ? "0.0" : "0%"}

{palette.map((color, idx) =>

)}

{displayFormat === "fraction" ? "1.0" : "100%"}

; }; export const DefinitionCard = ({children}) => { return

{children}

; }; export const Scale = ({low, mid, high, lowLabel = "Low", midLabel = "Mid", highLabel = "High", lowDescription, midDescription, highDescription, midColor = "yellow", inverted = false}) => { const lowColor = inverted ? "green" : "red"; const highColor = inverted ? "red" : "green"; const gradientId = inverted ? "greenToRed" : "redToGreen"; return

{low}

{mid &&

{mid}

}

{high}

{lowLabel}

{lowDescription &&

{lowDescription}

}

{mid &&

{midLabel}

{midDescription &&

{midDescription}

}

{highLabel}

{highDescription &&

{highDescription}

}

; }; Reasoning Coherence assesses whether an agent’s reasoning steps are logically consistent, non-contradictory, and aligned with the intended plan. ## Metric definition Reasoning Coherence — A binary metric that evaluates internal logical consistency within a single LLM call, with respect to the latest user input. * Type: Binary * 1 (Coherent): Intermediate reasoning events/summaries are mutually consistent and causally support the LLM input. * 0 (Incoherent): Contradictions, conflicting premises, circular logic, or unjustified reversals/jumps exist among the reasoning events. This metric is primarily used for agentic workflows that involve multi-step planning, tool usage, and intermediate reasoning traces. It helps validate that the steps an agent takes (or proposes) form a coherent path from problem to solution. Here's a scale that shows the relationship between Reasoning Coherence and potential impact on your AI system: Scale is 0-100 and is derived from binary judgments converted into a confidence score. ## Calculation method Reasoning Coherence is computed through a multi-step process: One or more evaluation requests are sent to an LLM evaluator to analyze the agent’s reasoning steps and plan alignment. A chain-of-thought style judge prompt guides the evaluator to check for logical consistency, contradictions, and adherence to the plan.
Evaluation rubric (summary): - Intermediate reasoning summaries should support the LLM’s input and each other logically. - No event should invalidate or contradict an earlier inference without explicit, justified retraction. - Explanations and planned actions/tool selections must be mutually reinforcing and consistent with the input. - Web search: The need for a search should be justified by the input/reasoning, and the query/parameters should be appropriate. The system can request multiple judgments to improve robustness and reduce variance. Each evaluation produces a binary decision (coherent / not coherent) and an explanation. Each evaluation produces a binary outcome, where coherent = 1 and not coherent = 0, along with an explanation. This metric is computed by prompting an LLM and may require multiple LLM calls to compute, which can impact usage and billing. ## Supported nodes * LLM span Inputs considered (when available): * Latest user input and current system prompt * Intermediate reasoning events and summaries (including plan/steps) * Tool-selection thoughts and invoked tool calls (including arguments) * Final in-span conclusion/output Empty or missing reasoning summaries should not be penalized; assess coherence only when there is evidence of incoherence. ## What constitutes coherent reasoning (1) * Intermediate reasoning summaries support the LLM’s input and each other logically. * No unjustified contradictions: any retractions are explicit and justified. * Explanations, planned actions, and tool selections are consistent with the input and mutually reinforcing. * Web search is justified by the input/reasoning and uses appropriate parameters (e.g., search query). ## What constitutes incoherent reasoning (0) * Explicit contradictions without justification within the reasoning chain. * Final (in-span) conclusions or planned actions don’t follow from prior steps. * Circular reasoning or unjustified reversals of stance. * Tool-selection reasoning conflicts with the recorded input or earlier reasoning steps. * The reasoning process deviates from the latest user or system instructions. * Web search is unjustified for common-knowledge queries (if unsure, treat as justified), or web search is used when an available specialized tool (e.g., get\_weather) is clearly more appropriate for the user’s query. ## Interpreting the score * 0-30: Low coherence — reasoning likely contains contradictions or misaligned steps. * 31-69: Mixed coherence — review critical steps and provide additional guidance. * 70-100: Strong coherence — reasoning appears consistent and aligned. > Consider setting thresholds for alerting or human review based on your domain’s risk tolerance (e.g., flag \< 50 for review). ## Example use cases * Validating multi-step “plan → execute” agents. * Auditing tool-augmented reasoning chains for consistency. * Comparing agent versions for planning quality regressions. * Example: A financial planning agent develops a step-by-step investment plan, ensuring each recommendation logically follows from prior steps and aligns with the user’s goals. ## Usage Enable this metric in experiments or Log Streams by selecting the Reasoning Coherence scorer. ```python Python theme={null} from galileo import GalileoMetrics metric = GalileoMetrics.reasoning_coherence ``` ## Best practices Ensure the agent records its plan and intermediate steps so coherence can be evaluated meaningfully. Calibrate the judge rubric with domain examples to reduce false positives/negatives. Define minimum acceptable coherence scores and trigger human review below that threshold. Use continuous learning via human feedback to improve the judge prompt and rubric over time. ## Performance Benchmarks We evaluated Reasoning Coherence against human expert labels on an internal dataset of agentic conversation samples using top frontier models. | Model | F1 (True) | | :---------------------- | :-------: | | GPT-4.1 | 0.88 | | GPT-4.1-mini (judges=3) | 0.87 | | Claude Sonnet 4.5 | 0.79 | | Gemini 3 Flash | 0.88 | ### GPT-4.1 Classification Report Benchmarks based on internal evaluation dataset. Performance may vary by use case. ## Related Resources If you would like to dive deeper or start implementing Reasoning Coherence, check out the following resources: ### Examples * [Reasoning Coherence Examples](https://app.galileo.ai) - Log in and explore the "Reasoning Coherence" Log Stream in the "Preset Metric Examples" Project to see this metric in action. ### How-to guides * [Agentic AI Basic Example](/how-to-guides/agentic-ai/basic-example) * [Creating Custom Metrics](/how-to-guides/metrics/create-local-metric/create-local-metric) ### Related Concepts * [Agentic Metrics Overview](/concepts/metrics/agentic/agentic-overview) * [Action Completion](/concepts/metrics/agentic/action-completion) * [Action Advancement](/concepts/metrics/agentic/action-advancement)