# Prerequisites

> Dataset preparation to get optimal results from Luna fine-tuning

## Human Labelled Test Dataset

A golden set is the set of data points that best represents your real-life production data, with each data point human-labelled according to the definition of the metric.
A golden test set is crucial for robust model development and evaluation. It serves as the single source of truth against which all model performance is measured.

### Required dataset format

Source data can come from a CSV file or a Hugging Face dataset.

Each row should represent one example. The required columns depend on `metric.input_format`. Note that all datasets must have a `label` column for the ground-truth label.

Read more: [Data generation metric input types](/luna-studio/sdk/how-to-train-your-luna-metric/data-generation/config/metric-input-types), [Data generation metric output types](/luna-studio/sdk/how-to-train-your-luna-metric/data-generation/config/metric-output-types), [Training metric input types](/luna-studio/sdk/how-to-train-your-luna-metric/training/config/metric-input-types), and [Training metric output types](/luna-studio/sdk/how-to-train-your-luna-metric/training/config/metric-output-types).

#### Metrics with 1 input (`single`)

* Example: `toxicity`, `sexism`, `prompt_injection`
* Dataset must contain exactly one feature column, for example `["input"]`

#### Metrics with 2 or more inputs (`tuple`)

* Example: `instruction_adherence`
* Dataset must contain two or more feature columns, for example `["input", "output"]`

#### RAG based metrics (`rag`)

* Example: `context_adherence`, `context_relevance`, `chunk_relevance`
* Dataset must include a `documents` column
* Dataset must also include at least one of the `input` or `output` columns

#### Agentic metrics (`span_with_tools`)

* Example: `tool_selection_quality`
* Dataset columns must be exactly `["tools", "input", "output"]`

#### Advanced formats

`trace` and `session` formats are only supported in `label_only_mode` or when you skip data generation and proceed directly to training.

Labels should be manually assigned and should match the exact metric definition you want to train or evaluate.
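
The column rules for the `single`, `tuple`, `rag`, and `span_with_tools` formats can be checked before uploading a dataset. The following is a minimal sketch, not part of the Luna SDK: the function name `validate_columns` is hypothetical, and it assumes that every column other than `label` counts as a feature column and that column order does not matter. The SDK may treat extra columns differently.

```python
def validate_columns(columns, input_format):
    """Return a list of problems; an empty list means the columns look valid.

    Hypothetical helper. Assumes every non-`label` column is a feature column.
    """
    cols = list(columns)
    problems = []

    # All datasets need a ground-truth `label` column.
    if "label" not in cols:
        problems.append("missing required 'label' column")

    features = [c for c in cols if c != "label"]

    if input_format == "single":
        if len(features) != 1:
            problems.append("single format requires exactly one feature column")
    elif input_format == "tuple":
        if len(features) < 2:
            problems.append("tuple format requires two or more feature columns")
    elif input_format == "rag":
        if "documents" not in features:
            problems.append("rag format requires a 'documents' column")
        if not {"input", "output"} & set(features):
            problems.append("rag format requires 'input' and/or 'output'")
    elif input_format == "span_with_tools":
        if sorted(features) != ["input", "output", "tools"]:
            problems.append(
                "span_with_tools format requires exactly 'tools', 'input', 'output'"
            )
    else:
        # `trace` and `session` are not covered by this sketch.
        raise ValueError(f"unsupported input_format: {input_format!r}")

    return problems


if __name__ == "__main__":
    print(validate_columns(["input", "output", "label"], "tuple"))
```

An empty list means no rule was violated; otherwise each string names one rule the dataset breaks.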

### Required dataset size

As a rule of thumb:

* `300-500` samples is a good minimum size for the test set
* Aim for at least `100` examples per class where possible

Try to keep the class distribution reasonably balanced so evaluation results are meaningful across classes.
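
These rules of thumb can be turned into a quick check on the `label` column. This is an illustrative sketch only: the helper name `summarize_test_set` is hypothetical, and the defaults simply mirror the lower bound of `300` samples and the `100` examples per class mentioned above.

```python
from collections import Counter


def summarize_test_set(labels, min_total=300, min_per_class=100):
    """Summarize test-set size and per-class counts (hypothetical helper)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        "total": total,
        "counts": dict(counts),
        # True when the set reaches the suggested minimum size.
        "total_ok": total >= min_total,
        # Classes that fall below the suggested per-class minimum.
        "small_classes": [c for c, n in counts.items() if n < min_per_class],
    }


if __name__ == "__main__":
    labels = ["toxic"] * 120 + ["not_toxic"] * 200
    print(summarize_test_set(labels))
```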

### Training dataset guidance

If you already have a training dataset (labelled or unlabelled), a strong target for fine-tuning is around `4,000` labelled samples in total. The class distribution should be similar to that of your test set, but make sure it is not extremely skewed (for example, `99/1`).
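
One way to compare the training and test distributions is to look at per-class shares. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the `min_share=0.05` cutoff for "extremely skewed" is an arbitrary placeholder, since the only guidance given here is the `99/1` example.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def is_extremely_skewed(labels, min_share=0.05):
    """Flag a dataset whose rarest class falls below `min_share` (assumed cutoff)."""
    return min(class_shares(labels).values()) < min_share


def max_share_gap(train_labels, test_labels):
    """Largest absolute difference in class share between train and test."""
    train, test = class_shares(train_labels), class_shares(test_labels)
    classes = set(train) | set(test)
    return max(abs(train.get(c, 0.0) - test.get(c, 0.0)) for c in classes)


if __name__ == "__main__":
    train = ["pos"] * 2400 + ["neg"] * 1600
    test = ["pos"] * 200 + ["neg"] * 200
    print(is_extremely_skewed(train), max_share_gap(train, test))
```

A small gap suggests the training set mirrors the test set; how small is small enough depends on the metric and is left to your judgment.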

If you do not have enough training data, synthetic generation can help create training examples before fine-tuning.

## LLM-as-judge prompt

The LLM-as-judge prompt can be either a preset metric or a custom metric that you create in Galileo. For details, see [Custom LLM-as-judge metrics - Galileo](https://docs.galileo.ai/concepts/metrics/custom-metrics/custom-metrics-ui-llm#custom-llm-as-a-judge-metrics).

It is important that the LLM-as-judge (LLMAJ) achieves high accuracy on your golden dataset. If it doesn't, tune the prompt (either manually or using Autotune in Galileo) until it does, before creating a Luna metric.
This ensures that Luna fine-tuning starts from a good understanding of the metric and avoids garbage-in, garbage-out situations.
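
Measuring the judge against the golden set amounts to comparing its predictions with the human labels. The following is a minimal sketch under the assumption that you have already collected the judge's outputs as a list aligned with the golden labels; the function name `judge_accuracy` is hypothetical and does not call any Galileo API. What counts as "high" accuracy is not specified here and depends on your metric.

```python
def judge_accuracy(golden_labels, judge_labels):
    """Return (accuracy, indices where the judge disagrees with the golden label)."""
    if len(golden_labels) != len(judge_labels):
        raise ValueError("golden and judge label lists must have the same length")
    if not golden_labels:
        raise ValueError("label lists must not be empty")

    # Row indices where the judge's label differs from the human label.
    disagreements = [
        i for i, (g, j) in enumerate(zip(golden_labels, judge_labels)) if g != j
    ]
    accuracy = 1 - len(disagreements) / len(golden_labels)
    return accuracy, disagreements


if __name__ == "__main__":
    golden = [1, 0, 1, 1]
    judged = [1, 0, 0, 1]
    print(judge_accuracy(golden, judged))
```

The list of disagreeing rows is a useful starting point when tuning the prompt, since it shows which examples the judge currently misreads.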
