Galileo Metrics

Galileo has built a menu of Guardrail Metrics for you to choose from. These metrics are tailored to your use case and are designed to help you evaluate your prompts and models.

Galileo’s Guardrail Metrics combine industry-standard metrics (e.g. BLEU, ROUGE-1, Perplexity) with metrics developed by Galileo’s in-house ML Research Team (e.g. Uncertainty, Correctness, Context Adherence).

Here’s a list of the metrics supported today:

Output Quality Metrics:

  • Uncertainty - Measures how uncertain the model is about its generated responses. Uncertainty works at the response level as well as at the token level. It has shown a strong correlation with hallucinations, such as made-up facts, names, or citations.

  • Correctness - Measures whether the claims stated in the response are factually true. This metric requires additional LLM calls. Combined with Uncertainty, Correctness is a good way of uncovering hallucinations.

  • BLEU & ROUGE-1 - These metrics measure n-gram similarity between your generated responses and your target output. They are computed automatically when you add a target column to your dataset.

  • Prompt Perplexity - Measures the perplexity of a prompt. Previous research has shown that as prompt perplexity decreases, generations tend to increase in quality.
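
To make the n-gram and perplexity metrics above more concrete, here is a minimal Python sketch. It is not Galileo’s implementation: the function names `rouge1` and `perplexity`, the lowercase whitespace tokenization, and the assumption that you already have per-token natural-log probabilities for the prompt are all illustrative choices.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def rouge1(generated: str, target: str) -> Dict[str, float]:
    """Unigram-overlap precision, recall, and F1 (illustrative ROUGE-1)."""
    # Assumption: naive lowercase whitespace tokenization.
    gen_tokens = generated.lower().split()
    tgt_tokens = target.lower().split()
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((Counter(gen_tokens) & Counter(tgt_tokens)).values())
    precision = overlap / len(gen_tokens) if gen_tokens else 0.0
    recall = overlap / len(tgt_tokens) if tgt_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


scores = rouge1("the cat sat", "the cat sat on the mat")
# recall is 0.5 here: 3 of the 6 target tokens are matched.

# Four tokens, each predicted with probability 0.5, give a perplexity of about 2.
ppl = perplexity([math.log(0.5)] * 4)
```

A lower perplexity means the model found the prompt more predictable, which is the direction the Prompt Perplexity bullet associates with better generations.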

RAG Quality Metrics:

  • Context Adherence - Measures whether your model’s response was purely based on the context provided. This metric is intended for RAG users. We have two options for this metric: Luna and Plus.

    • Context Adherence Luna is powered by small language models we’ve trained. It’s free of cost.

    • Context Adherence Plus includes an explanation or rationale for the rating. The metric and its explanations are powered by an LLM (e.g. OpenAI GPT-3.5) and thus incur additional costs. Plus has been shown to perform better.

  • Completeness - Measures how thoroughly your model’s response covered relevant information from the context provided. This metric is intended for RAG use cases and is only available if you log your retriever’s output. There are two versions available:

    • Completeness Luna is powered by small language models we’ve trained. It’s free of cost.

    • Completeness Plus includes an explanation or rationale for the rating. The metric and its explanations are powered by an LLM (e.g. OpenAI GPT-3.5) and thus incur additional costs. Plus has been shown to perform better.

  • Chunk Attribution - Measures which individual chunks retrieved in a RAG workflow influenced your model’s response. This metric is intended for RAG use cases and is only available if you log your retriever’s output. There are two versions available:

    • Chunk Attribution Luna is powered by small language models we’ve trained. It’s free of cost.

    • Chunk Attribution Plus is powered by an LLM (e.g. OpenAI GPT-3.5) and thus incurs additional costs. Plus has been shown to perform better.

  • Chunk Utilization - For each chunk retrieved in a RAG workflow, measures the fraction of the chunk text that influenced your model’s response. This metric is intended for RAG use cases and is only available if you log your retriever’s output. There are two versions available:

    • Chunk Utilization Luna is powered by small language models we’ve trained. It’s free of cost.

    • Chunk Utilization Plus is powered by an LLM (e.g. OpenAI GPT-3.5) and thus incurs additional costs. Plus has been shown to perform better.

  • Context Relevance - Measures how relevant the context provided was to the user query. This metric is intended for RAG users. It requires {context} and {query} slots in your data, as well as embeddings for them (i.e. {context_embedding}, {query_embedding}).
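
This page does not say how Context Relevance is computed from the two embeddings. As an assumption-laden illustration only, one common way to compare a query embedding with a context embedding is cosine similarity. In the Python sketch below, the `cosine_similarity` helper and the toy `row` values are hypothetical; only the slot names come from the text above.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Convention chosen here: a zero vector is treated as unrelated.
    return dot / (norm_a * norm_b)


# Hypothetical dataset row with the four slots named above (toy 3-d embeddings).
row = {
    "query": "What is the refund policy?",
    "context": "Refunds are issued within 30 days of purchase.",
    "query_embedding": [1.0, 0.0, 1.0],
    "context_embedding": [1.0, 0.0, 0.0],
}

# Assumed proxy for relevance, not Galileo's documented formula.
similarity = cosine_similarity(row["query_embedding"], row["context_embedding"])
```

Values near 1 mean the two embeddings point in nearly the same direction; values near 0 mean they are unrelated under this proxy.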

Safety Metrics:

  • Private Identifiable Information - This Guardrail Metric surfaces any instances of PII in your model’s responses. We surface whether your text contains any credit card numbers, social security numbers, phone numbers, street addresses, and email addresses.

  • Toxicity - Measures whether the model’s responses contained any abusive, toxic, or foul language.

  • Tone - Classifies the tone of the response into 9 different emotion categories: neutral, joy, love, fear, surprise, sadness, anger, annoyance, and confusion.

  • Sexism - Measures how sexist a comment might be perceived to be, as a score between 0 and 1 (1 being the most sexist).

  • Prompt Injection - Detects and classifies various categories of prompt injection attacks.

  • More coming very soon.
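
To give a feel for what surfacing PII in a response involves, here is a rough regex-based Python sketch. It is not how Galileo’s metric works: the patterns, the `PII_PATTERNS` table, and the `find_pii` helper are hypothetical, they cover only US-style formats, and they skip street addresses, which simple patterns handle poorly.

```python
import re
from typing import Dict, List

# Hypothetical, deliberately simple patterns; real detectors are far more robust.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[-\s]?){3}\d{4}\b"),
}


def find_pii(text: str) -> Dict[str, List[str]]:
    """Return every match per PII category; categories with no match are omitted."""
    hits: Dict[str, List[str]] = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[label] = matches
    return hits


response = "Reach me at jane.doe@example.com or 555-123-4567. SSN 123-45-6789."
found = find_pii(response)
# `found` has entries for "email", "phone", and "ssn", and none for "credit_card".
```

A check like this can serve as a quick local sanity test before a run, but it will miss many formats and produce false positives on arbitrary digit strings.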

A more thorough description of all Guardrail Metrics can be found here.

When creating runs from code, you’ll need to add your Guardrail Metrics as “scorers”. See “Enabling Scorers in Run” to learn how to do so.

If you want to set up your own custom metrics, please see the instructions here.