How it works
When you enter a description of your metric (e.g. “detect any toxic language in the inputs”), the description is converted into a prompt and a set of few-shot examples. Together, these power an LLM-as-a-judge that uses chain-of-thought reasoning and majority voting (see Chainpoll paper) to calculate the metric. You can customize which model is used and how many judges are used to calculate your metric.

Currently, auto-generated metrics are restricted to binary (yes/no) measurements. Multiple-choice and numerical ratings are coming soon.
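To make the judging step concrete, here is a minimal Python sketch of majority voting over several binary judge calls. It is an illustration under stated assumptions, not the product's actual implementation: the prompt layout, the names `build_prompt`, `run_metric`, `MetricResult`, and `make_stub_judge`, and the rule that a tie counts as "no" are all hypothetical. The judge is a plain callable so the example runs without any model access; in practice it would wrap a call to the model you selected.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical judge type: takes a prompt, returns (reasoning, yes/no verdict).
Judge = Callable[[str], Tuple[str, bool]]


@dataclass
class MetricResult:
    verdict: bool     # final yes/no answer after majority voting
    yes_votes: int    # how many judges answered "yes"
    num_judges: int   # how many judges were polled


def build_prompt(description: str, few_shot: List[Tuple[str, bool]], text: str) -> str:
    """Assemble an illustrative chain-of-thought prompt (layout is an assumption)."""
    lines = [f"Metric: {description}", "Think step by step, then answer yes or no.", ""]
    for example, label in few_shot:
        lines += [f"Input: {example}", f"Answer: {'yes' if label else 'no'}", ""]
    lines += [f"Input: {text}", "Answer:"]
    return "\n".join(lines)


def run_metric(description: str, few_shot: List[Tuple[str, bool]], text: str,
               judge: Judge, num_judges: int = 3) -> MetricResult:
    """Poll the judge num_judges times and take a strict majority of the verdicts."""
    if num_judges < 1:
        raise ValueError("num_judges must be at least 1")
    prompt = build_prompt(description, few_shot, text)
    verdicts = [judge(prompt)[1] for _ in range(num_judges)]
    yes_votes = sum(verdicts)
    # Assumption: a tie (possible with an even number of judges) counts as "no".
    return MetricResult(yes_votes * 2 > num_judges, yes_votes, num_judges)


def make_stub_judge(answers: List[bool]) -> Judge:
    """Stand-in for an LLM call: replays a fixed sequence of verdicts."""
    remaining = iter(answers)

    def judge(prompt: str) -> Tuple[str, bool]:
        return ("stub reasoning", next(remaining))

    return judge


if __name__ == "__main__":
    examples = [("You are an idiot.", True), ("Have a nice day.", False)]
    result = run_metric("detect any toxic language in the inputs", examples,
                        "Nobody wants to hear from you.",
                        make_stub_judge([True, False, True]), num_judges=3)
    print(result)
```

Using an odd number of judges avoids ties altogether, which is one reason a small odd default such as 3 is a common choice in sketches like this one.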