Tool Selection Quality
Understand Galileo’s Tool Selection Quality Metric
Definition: Determines whether the LLM selected the appropriate tool and provided the correct arguments, including both the correct argument names and values.
If the response does not have a Tool Selection Quality score of 100%, then at least one judge considered that the model chose the wrong Tool(s), or the correct Tool(s) with incorrect parameters.
Calculation: Tool Selection Quality is computed by sending additional requests to an LLM (e.g. OpenAI’s GPT-4o-mini), using a carefully engineered chain-of-thought prompt that asks the model to judge whether or not the tools selected were correct. The metric requests multiple distinct responses to this prompt, each of which produces an explanation along with a final judgment: yes or no. The final Tool Selection Quality score is the number of “yes” responses divided by the total number of responses.
We also surface one of the generated explanations. The surfaced explanation is always chosen to align with the majority judgment among the responses.
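The aggregation described above can be sketched as follows. This is a minimal illustration, assuming the per-judge verdicts and explanations have already been collected; the function name and data shapes are hypothetical, not Galileo’s actual implementation:

```python
from typing import List, Tuple


def tool_selection_quality(judgments: List[Tuple[bool, str]]) -> Tuple[float, str]:
    """Aggregate multiple judge responses into a score and an explanation.

    Each judgment is a (verdict, explanation) pair, where verdict is True
    for "yes" (the tool selection was correct) and False for "no".
    """
    if not judgments:
        raise ValueError("at least one judge response is required")
    yes_count = sum(1 for verdict, _ in judgments if verdict)
    # Score is the fraction of "yes" verdicts among all responses.
    score = yes_count / len(judgments)
    # Surface an explanation that agrees with the majority verdict
    # (ties are resolved toward "yes" here as a simplification).
    majority = yes_count >= len(judgments) - yes_count
    explanation = next(expl for verdict, expl in judgments if verdict == majority)
    return score, explanation
```

For example, if two of three judges answer “yes,” the score is 2/3 and the surfaced explanation comes from one of the “yes” responses.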
Usefulness: This metric is most useful in Agentic Workflows, where an LLM decides the course of action to take by selecting a Tool. This metric helps you detect whether the right course of action was taken by the Agent.