## Overview

Action Completion determines whether the agent successfully accomplished all of the user's goals.

It addresses common pain points of agent performance by measuring whether AI agents actually help users achieve their end goal rather than just providing responses.

Action Completion is successful when all of the following are true:

* The agent provides a complete answer in the case of a question
* The agent provides a confirmation of successful action in the case of a request
* The response is coherent and factually accurate
* The response comprehensively addresses every aspect of the user's request
* The response avoids contradicting tool outputs
* The response summarizes all relevant parts returned by tools

### Action Completion at a glance

| Property | Description |
| :----------------------------- | :-------------------------------------------------------------------- |
| **Name of Metric** | Action Completion |
| **Metric Category** | Agentic Metrics |
| **Use this metric for** | Measuring whether the agent successfully accomplished the user's goal |
| **Can be applied to** | Session |
| **LLM/Luna Support** | Supported with both LLM + Luna models |
| **Protect Runtime Protection** | No |
| **Constants** | None - Uses dynamic evaluation |
| **Usage Context** | Agentic workflows, multi-step tasks, tool-using assistants |
| **Value Type** | Confidence score denoted as a percentage |
| **Input/Output Requirements** | Requires agent responses and user goals for evaluation |

## Calculation method

If the response does not achieve an Action Completion score of 100%, at least one judge considered the model to have failed to accomplish every user goal.

Multiple requests are sent to an LLM using a carefully designed chain-of-thought prompt that adheres to the definition above.
The LLM generates multiple distinct responses, each containing:

* An explanation
* A final judgment: "Yes" (goal accomplished) or "No" (goal not accomplished)

$$
\text{Action Completion Score} = \frac{\text{Number of "Yes" Responses}}{\text{Total Number of Responses}}
$$

Galileo displays one generated explanation alongside the score, choosing one that aligns with the majority judgment among the responses, to help with troubleshooting.

This metric requires multiple LLM calls to compute, which may impact usage and billing.

## Score interpretation

**Expected Score:** 100%. A perfect score indicates the agent successfully accomplished all user goals with complete, accurate, and comprehensive responses.

### What different scores mean

* **0.0 - 0.3 (Poor):** Agent completely failed to accomplish user goals, provided incomplete answers, or contradicted tool outputs. Common causes include insufficient tool usage, incomplete responses, or factual inaccuracies.
* **0.4 - 0.7 (Fair):** Agent made progress toward user goals but didn't fully address all aspects of the request. Areas for improvement include ensuring comprehensive coverage of all user requirements and better tool utilization.
* **0.8 - 1.0 (Excellent):** Agent successfully accomplished all user goals with complete, accurate, and comprehensive responses. Best practices include thorough tool usage, complete answer provision, and proper confirmation of successful actions.

## How to improve Action Completion scores

To optimize your agent's performance and ensure high Action Completion scores, focus on comprehensive goal accomplishment and complete response generation.
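Before working on improvements, it can help to see how the score described under "Calculation method" comes together. The following is a minimal Python sketch of that rule, not Galileo's implementation: the `JudgeResponse` structure, the function names, and the tie-break toward "No" are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class JudgeResponse:
    """One LLM judge response (hypothetical structure, for illustration only)."""
    explanation: str
    judgment: str  # "Yes" (goal accomplished) or "No" (goal not accomplished)


def action_completion_score(responses: list[JudgeResponse]) -> float:
    """Return the fraction of judge responses whose final judgment is "Yes"."""
    if not responses:
        raise ValueError("at least one judge response is required")
    yes_count = sum(1 for r in responses if r.judgment == "Yes")
    return yes_count / len(responses)


def majority_explanation(responses: list[JudgeResponse]) -> str:
    """Return the first explanation that agrees with the majority judgment.

    Ties are resolved as "No" here; that tie-break is an assumption.
    """
    if not responses:
        raise ValueError("at least one judge response is required")
    counts = Counter(r.judgment for r in responses)
    majority = "Yes" if counts["Yes"] > counts["No"] else "No"
    return next(r.explanation for r in responses if r.judgment == majority)


# Made-up judge outputs for a single session.
responses = [
    JudgeResponse("All three requested bookings were confirmed.", "Yes"),
    JudgeResponse("The refund request was never addressed.", "No"),
    JudgeResponse("Every goal was handled and confirmed.", "Yes"),
]

score = action_completion_score(responses)
print(f"Action Completion: {score:.0%}")  # 2 of 3 judges said "Yes"
print(majority_explanation(responses))
```

With three judges, any single "No" pulls the score below 100%, which is why a score under 100% signals that at least one judge saw an unmet goal.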
### Common issues and solutions

| Issue | Cause | Solution |
| :------------------------- | :-------------------------------------------------- | :------------------------------------------------------------------------------------------------ |
| Incomplete responses | Agent stops before addressing all user requirements | Implement comprehensive response generation and ensure all user goals are explicitly addressed |
| Tool output contradictions | Agent ignores or contradicts information from tools | Ensure agent properly summarizes and incorporates all relevant tool outputs without contradiction |
| Missing confirmations | Agent doesn't confirm successful actions | Add explicit confirmation steps for action-based requests |
| Factual inaccuracies | Agent provides incorrect information | Implement fact-checking mechanisms and ensure responses align with tool outputs |

### Best practices for optimization

* **Track Progress Over Time**: Monitor Action Completion scores across different versions of your agent to identify trends and ensure continuous improvements in task completion capabilities.
* **Analyze Failure Patterns**: When Action Completion scores are low, examine specific steps or scenarios where agents fail to meet user goals. Use this analysis to identify and address systematic issues.
* **Combine with Other Metrics**: Use Action Completion alongside other agentic metrics, such as Action Advancement, to get a comprehensive view of your assistant's effectiveness and identify areas for improvement.
* **Test Edge Cases**: Create evaluation datasets that include complex, multi-step tasks to thoroughly assess your agent's ability to handle challenging scenarios and advance user goals effectively.

When optimizing for Action Completion, ensure you're not sacrificing other important aspects like safety, factual accuracy, or user experience in pursuit of task completion.
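As a rough illustration of tracking progress over time, the sketch below compares per-session scores between two agent versions. The version names, the sample scores, and the `summarize` helper are hypothetical; in practice the scores would come from your own exported evaluation results.

```python
from statistics import mean

# Hypothetical per-session Action Completion scores (0.0 to 1.0) by agent version.
scores_by_version = {
    "v1": [1.0, 0.5, 0.0, 1.0],
    "v2": [1.0, 1.0, 0.5, 1.0],
}


def summarize(scores: list[float]) -> dict[str, float]:
    """Return the mean score and the share of sessions that reached 100%."""
    return {
        "mean": mean(scores),
        "fully_completed": sum(1 for s in scores if s == 1.0) / len(scores),
    }


summaries = {version: summarize(scores) for version, scores in scores_by_version.items()}

for version, summary in summaries.items():
    print(
        f"{version}: mean={summary['mean']:.1%}, "
        f"fully completed={summary['fully_completed']:.0%}"
    )
```

Reporting the share of fully completed sessions next to the mean keeps a handful of partial completions from hiding behind an otherwise healthy average.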
## Comparison to other metrics

| Property | Action Completion | Action Advancement | Tool Selection |
| :----------------------------- | :---------------------------- | :------------------------------ | :-------------------------------- |
| **Metric Category** | Agentic Performance | Agentic Performance | Agentic Performance |
| **Use this metric for** | Measuring goal accomplishment | Measuring progress toward goals | Measuring tool choice quality |
| **Best for** | Final outcome evaluation | Progress tracking | Tool usage optimization |
| **LLM/Luna Support** | Yes | Yes | Yes |
| **Protect Runtime Protection** | No | No | No |
| **Value Type** | Percentage (0%-100%) | Percentage (0%-100%) | Percentage (0%-100%) |
| **Limitations** | Requires multiple LLM calls | May not capture final success | Doesn't measure execution quality |

## Performance Benchmarks

We evaluated Action Completion against human expert labels on an internal dataset of agentic conversation samples using top frontier models.

| Model | F1 (True) |
| :---------------------- | :-------: |
| GPT-4.1 | 0.92 |
| GPT-4.1-mini (judges=3) | 0.79 |
| Claude Sonnet 4.5 | 0.87 |
| Gemini 3 Flash | 0.92 |

### GPT-4.1 Classification Report

Benchmarks based on internal evaluation dataset. Performance may vary by use case.

## Related Resources

If you would like to dive deeper or start implementing Action Completion, check out the following resources:

### Examples

* [Action Completion Examples](https://app.galileo.ai) - Log in and explore the "Action Completion" Log Stream in the "Preset Metric Examples" Project to see this metric in action.

### How-to guides

* [Agentic AI Examples](/how-to-guides/agentic-ai/basic-example)

### Related Concepts

* [Action Advancement](/concepts/metrics/agentic/action-advancement)
* [Tool Selection](/concepts/metrics/agentic/tool-selection-quality)
* [Agentic AI Overview](/concepts/metrics/agentic/agentic-overview)