Agent Flow is a binary metric that checks whether an agent's behavior satisfies all user-defined natural language conditions.

Agent Flow is a binary evaluation metric that measures the correctness and coherence of an agentic trajectory by validating it against user-specified natural language tests. A trajectory passes the Agent Flow metric if and only if the agent's realized behavior or output satisfies every user-defined natural language condition.

To use this metric, you need to create a copy and edit the prompt to provide your natural language tests.

This is a **boolean** metric that returns a confidence score that the agent flow satisfies all conditions. The score ranges from 0% (no confidence that the agent flow satisfies all conditions) to 100% (complete confidence that the agent flow satisfies all conditions).

## Agent Flow at a glance

| Property                       | Description                                    |
| :----------------------------- | :--------------------------------------------- |
| **Name**                       | Agent Flow                                     |
| **Category**                   | Agentic AI                                     |
| **Can be applied to**          | Session                                        |
| **LLM-as-a-judge Support**     | ✅                                              |
| **Luna Support**               | ❌                                              |
| **Protect Runtime Protection** | ❌                                              |
| **Value Type**                 | Boolean shown as a percentage confidence score |

## When to use this metric

## Score interpretation

**Expected Score:** 80%-100%.

## Configure Agent Flow

This metric needs to be manually customized to include your own natural language tests.

From the **Metrics Hub**, select the **Agent Flow** metric. A popup asks you to duplicate the metric. Select **Duplicate metric** to create a copy.

Figure: The Agent Flow metric with the duplicate metric popup.

Locate the user-defined tests section in the prompt.

```xml theme={null}
{{ Add your tests here }}
```

This prompt needs to be customized based on your application and the inputs and outputs you expect. Replace `{{ Add your tests here }}` with a numbered list of natural language tests that can be used to evaluate the agent's trajectory. These can include:

* Expected tool or agent calls, using the tool or agent names
* Conditions on tool or agent calling (e.g. if tool x is called, don't call agent y)
* Expectations around the input or output parameters to tools and agents
* Limitations on the number of tool or agent calls

For example, imagine you are creating an agent that provides advice on exercises for different body parts, such as for a physical therapy application. It has multiple tools, including `list_by_target_muscle_for_exercised`, `list_by_body_part_for_exercised`, and `list_of_bodyparts_for_exercised`. Some user tests might be:

```output wrap theme={null}
1. If a call to "list_by_target_muscle_for_exercised" returns an error that contains the text "target not found", the agent should subsequently attempt an alternative lookup by calling either "list_by_body_part_for_exercised" or "list_of_bodyparts_for_exercised"
2. When the user asks for exercises that target leg muscles, the agent must call at least one of the tools ["list_by_target_muscle_for_exercised", "list_by_body_part_for_exercised"] during the conversation
3. After receiving a successful response from "list_by_body_part_for_exercised", the agent's following natural-language message must contain at least one exercise name, the corresponding equipment, and an animated demonstration URL taken from the tool output
4. Every invocation of the tool "list_by_body_part_for_exercised" must include the required parameter "bodypart"
5. After receiving data from list_by_body_part_for_exercised, the agent response must include the exercise id for every exercise it presents to the user
6. No assistant message should include more than one tool invocation
7. The agent should conclude the conversation with a human-readable answer that summarizes the requested leg exercises using data returned from the tools
```

Save the metric, then turn it on for your Log Stream.

## Best practices

Trajectory tests are similar to unit tests for the agent's trajectory: they check whether certain conditions are followed along the agent's path. Write all the tests in a numbered list. For example:

```md theme={null}
1. If X happens then ask the user Y and call tool Z.
2. X tool is always called before Y tool.
3. When user asks X reply with Y
4. The tool Y should be called once in the conversation.
```

Each test should check a single condition. Tests should be logically consistent and well defined.

## Performance Benchmarks

We evaluated Agent Flow against human expert labels on an internal dataset of agentic conversation samples using top frontier models.

| Model                   | F1 (True) |
| :---------------------- | :-------: |
| GPT-4.1                 |   0.93    |
| GPT-4.1-mini (judges=3) |   0.92    |
| Claude Sonnet 4.5       |   0.95    |
| Gemini 3 Flash          |   0.92    |

### GPT-4.1 Classification Report

Benchmarks are based on an internal evaluation dataset. Performance may vary by use case.

## Related Resources

If you would like to dive deeper or start implementing Agent Flow, check out the following resources:

### How-to guides

* [Agentic AI Basic Example](/how-to-guides/agentic-ai/basic-example)

### Related Concepts

* [Agentic Metrics Overview](/concepts/metrics/agentic/agentic-overview)
* [Action Advancement](/concepts/metrics/agentic/action-advancement)
* [Action Completion](/concepts/metrics/agentic/action-completion)
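
If you keep your Agent Flow tests alongside your application code, the numbered list that replaces `{{ Add your tests here }}` can be generated rather than typed by hand. The following is a minimal Python sketch of that idea, not a Galileo API: `format_tests`, `fill_prompt`, and the `template` string are hypothetical names and values introduced for illustration, and the only detail taken from this page is the placeholder text itself. It assumes you have copied the prompt of your duplicated metric into a string.

```python
PLACEHOLDER = "{{ Add your tests here }}"  # placeholder text as shown in the prompt


def format_tests(tests: list[str]) -> str:
    """Render tests as a numbered list, one test per line (hypothetical helper)."""
    cleaned = [t.strip() for t in tests if t.strip()]
    if not cleaned:
        raise ValueError("provide at least one test")
    return "\n".join(f"{i}. {t}" for i, t in enumerate(cleaned, start=1))


def fill_prompt(prompt_template: str, tests: list[str]) -> str:
    """Replace the placeholder in a copied prompt with the numbered tests."""
    if PLACEHOLDER not in prompt_template:
        raise ValueError("placeholder not found in prompt template")
    return prompt_template.replace(PLACEHOLDER, format_tests(tests))


# Assumption: a stand-in for the prompt text of your duplicated metric.
template = "User-defined tests:\n{{ Add your tests here }}\n"

tests = [
    'Every invocation of the tool "list_by_body_part_for_exercised" '
    'must include the required parameter "bodypart"',
    "No assistant message should include more than one tool invocation",
]

filled = fill_prompt(template, tests)
print(filled)
```

The resulting text can be pasted into the prompt of the duplicated metric. Keeping each entry in `tests` to a single condition mirrors the best practice above and makes a failing test easier to trace.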