Creating an LLM-as-a-judge metric is straightforward with Galileo's Autogen feature. Simply enter a description of what you want to measure or detect, and Galileo auto-generates the metric for you.

How it works

When you enter a description of your metric (e.g. "detect any toxic language in the inputs"), your description is converted into a prompt and few-shot examples for your metric. These power an LLM-as-a-judge that uses chain-of-thought reasoning and majority voting (see the ChainPoll paper) to calculate the metric.
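The aggregation step can be illustrated with a short sketch. This is not Galileo's implementation; it assumes each judge is prompted to reason step by step and end with a final "Answer: yes" or "Answer: no" line, and the stubbed strings stand in for real LLM responses:

```python
from collections import Counter

def parse_verdict(response: str) -> bool:
    # The judge is asked to end its chain-of-thought with a final
    # "Answer: yes" or "Answer: no" line; we read only that verdict.
    final_line = response.strip().splitlines()[-1].lower()
    return "yes" in final_line

def majority_vote(responses: list[str]) -> tuple[bool, float]:
    # ChainPoll-style aggregation: each judge response casts one
    # yes/no vote; the metric is the majority verdict, and the
    # fraction of agreeing votes serves as a confidence score.
    votes = [parse_verdict(r) for r in responses]
    verdict, count = Counter(votes).most_common(1)[0]
    return verdict, count / len(votes)

# Stubbed judge outputs (assumption: in production, each string
# would come from a separate LLM call with the generated prompt).
responses = [
    "The input calls the user an idiot, which is insulting.\nAnswer: yes",
    "No slurs appear, and the tone is neutral.\nAnswer: no",
    "The word 'idiot' is directed at a person.\nAnswer: yes",
]
verdict, confidence = majority_vote(responses)
print(verdict, round(confidence, 2))  # -> True 0.67
```

Running more judges smooths out noise in any single chain-of-thought, which is why the number of judges is exposed as a knob.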

You can customize the model used as the judge and the number of judges that vote on your metric.

Currently, auto-generated metrics are restricted to binary (yes/no) measurements. Multiple-choice and numerical ratings are coming soon.

How to use it

Editing and iterating on your auto-generated LLM-as-a-judge metric

You can always go back and edit your prompt or few-shot examples. Additionally, you can use Continuous Learning via Human Feedback (CLHF) to improve and adapt your metric over time.