Prerequisites

Human Labelled Test Dataset

A golden set is a collection of data points that is as representative as possible of your real production data and is human-labelled according to the definition of the metric. A golden test set is crucial for robust model development and evaluation: it serves as the single source of truth against which all model performance is measured.

Required dataset format

Source data can come from a CSV file or a Hugging Face dataset. Each row should represent one example. The required columns depend on metric.input_format. All datasets must also have a label column for the ground-truth label. Read more: Data generation metric input types, Data generation metric output types, Training metric input types, and Training metric output types.

Metrics with 1 input (`single`)

Example: toxicity, sexism, prompt_injection
Dataset must contain exactly one feature column, for example ["input"]

Metrics with 2 or more inputs (`tuple`)

Example: instruction_adherence
Dataset must contain two or more feature columns, for example ["input", "output"]

RAG based metrics (`rag`)

Example: context_adherence, context_relevance, chunk_relevance
Dataset must include a documents column
Dataset must also include at least one of the input or output columns

Agentic metrics (`span_with_tools`)

Example: tool_selection_quality
Dataset columns must be exactly ["tools", "input", "output"]
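
The column rules above can be checked before uploading a dataset. The following is a minimal sketch, not part of Galileo's SDK: the function name `check_columns` and the column name `label` are assumptions made for illustration, and the sketch reads "exactly" as applying to the feature columns, with the label column counted separately.

```python
from typing import Iterable, List

# Assumed name of the ground-truth column; adjust to match your dataset.
LABEL_COLUMN = "label"


def check_columns(input_format: str, columns: Iterable[str]) -> List[str]:
    """Return a list of problems; an empty list means the columns look valid."""
    cols = list(columns)
    problems = []
    if LABEL_COLUMN not in cols:
        problems.append(f"missing '{LABEL_COLUMN}' column")
    # Everything except the label is treated as a feature column.
    features = [c for c in cols if c != LABEL_COLUMN]

    if input_format == "single":
        if len(features) != 1:
            problems.append("expected exactly one feature column")
    elif input_format == "tuple":
        if len(features) < 2:
            problems.append("expected two or more feature columns")
    elif input_format == "rag":
        if "documents" not in features:
            problems.append("missing 'documents' column")
        if not {"input", "output"} & set(features):
            problems.append("need at least one of 'input' or 'output'")
    elif input_format == "span_with_tools":
        if sorted(features) != ["input", "output", "tools"]:
            problems.append("feature columns must be exactly tools, input, output")
    else:
        problems.append(f"format '{input_format}' is not covered by this sketch")
    return problems


if __name__ == "__main__":
    # A RAG dataset that has documents but neither input nor output.
    print(check_columns("rag", ["documents", "label"]))
```

Returning a list of problems rather than raising on the first one lets you report every issue with a dataset header in a single pass.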

Advanced formats

The trace and session formats are supported only in label_only_mode, or when you skip data generation and proceed directly to training. Labels should be assigned manually and should match the exact definition of the metric you want to train or evaluate.

Required dataset size

As a rule of thumb:

300-500 samples is a good minimum size for the test set
Aim for at least 100 examples per class where possible

Try to keep the class distribution reasonably balanced so evaluation results are meaningful across classes.
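
As a quick sanity check of these rules of thumb, the sketch below counts labels and flags a test set that is too small or has under-represented classes. The function name `summarize_test_set` is hypothetical; the default thresholds take the lower bound of the 300-500 range and the 100-per-class guideline stated above.

```python
from collections import Counter
from typing import Any, Dict, Sequence


def summarize_test_set(
    labels: Sequence[Any],
    min_total: int = 300,
    min_per_class: int = 100,
) -> Dict[str, Any]:
    """Summarize test-set size and per-class counts against the rules of thumb."""
    counts = Counter(labels)
    return {
        "total": len(labels),
        "per_class": dict(counts),
        "total_ok": len(labels) >= min_total,
        # Classes that fall below the per-class guideline.
        "small_classes": sorted(str(c) for c, n in counts.items() if n < min_per_class),
    }


if __name__ == "__main__":
    example_labels = ["toxic"] * 120 + ["safe"] * 200
    print(summarize_test_set(example_labels))
```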

Training dataset guidance

If you already have a training dataset (labelled or unlabelled), a strong target for fine-tuning is around 4,000 labelled samples in total. Its class distribution should be similar to that of your test set, but make sure it is not extremely skewed (for example, 99/1). If you do not have enough training data, synthetic generation can help create training examples before fine-tuning.
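
One way to compare the training and test distributions is to compute each class's share and look at the largest gap. This is an illustrative sketch only: the helper names are hypothetical, and the `min_share` cut-off of 0.01 is an assumption that mirrors the 99/1 example rather than a documented threshold.

```python
from collections import Counter
from typing import Any, Dict, Sequence


def class_shares(labels: Sequence[Any]) -> Dict[Any, float]:
    """Fraction of examples belonging to each class."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return {c: n / len(labels) for c, n in counts.items()}


def max_share_gap(train_labels: Sequence[Any], test_labels: Sequence[Any]) -> float:
    """Largest absolute difference in class share between train and test."""
    train, test = class_shares(train_labels), class_shares(test_labels)
    classes = set(train) | set(test)
    return max(abs(train.get(c, 0.0) - test.get(c, 0.0)) for c in classes)


def is_extremely_skewed(labels: Sequence[Any], min_share: float = 0.01) -> bool:
    """True if the rarest class makes up min_share of the data or less (assumed cut-off)."""
    return min(class_shares(labels).values()) <= min_share


if __name__ == "__main__":
    train = ["pos"] * 3000 + ["neg"] * 1000
    test = ["pos"] * 300 + ["neg"] * 100
    print(max_share_gap(train, test), is_extremely_skewed(train))
```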

LLM-as-judge prompt

This can be either a preset metric or a custom metric that you create using Galileo. More can be found here: Custom LLM-as-judge metrics. It is important to ensure that the LLM-as-judge (LLMAJ) has high accuracy on your golden dataset. If it doesn't, tune the prompt (either manually or using Autotune in Galileo) until it reaches high accuracy before creating a Luna metric. This ensures that Luna fine-tuning starts from a good understanding of the metric and avoids garbage-in, garbage-out situations.
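
If you export the judge's predictions alongside the human labels, agreement with the golden set can be measured with a few lines of code. The sketch below assumes two equal-length lists of labels; the function names are hypothetical, and what counts as "high" accuracy is left for you to decide for your metric.

```python
from collections import Counter
from typing import Any, Dict, Sequence


def judge_accuracy(human_labels: Sequence[Any], judge_labels: Sequence[Any]) -> float:
    """Fraction of golden-set examples where the judge matches the human label."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("label lists must not be empty")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


def per_class_accuracy(
    human_labels: Sequence[Any], judge_labels: Sequence[Any]
) -> Dict[Any, float]:
    """Judge accuracy computed separately for each human-assigned class."""
    totals = Counter(human_labels)
    hits = Counter(h for h, j in zip(human_labels, judge_labels) if h == j)
    return {c: hits[c] / totals[c] for c in totals}


if __name__ == "__main__":
    human = ["toxic", "safe", "safe", "toxic"]
    judge = ["toxic", "safe", "toxic", "toxic"]
    print(judge_accuracy(human, judge))      # 0.75
    print(per_class_accuracy(human, judge))  # {'toxic': 1.0, 'safe': 0.5}
```

Looking at per-class accuracy as well as the overall figure helps reveal a prompt that does well on the majority class but poorly on a rarer one.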
