# Prerequisites

> Dataset preparation to get optimal results from Luna fine-tuning

## Human Labelled Test Dataset

A golden set is the set of data points most representative of your real-life production data, human-labelled according to the definition of the metric.
A golden test set is crucial for robust model development and evaluation. It serves as the single source of truth against which all model performance is measured.

### Required dataset format

Source data can come from a CSV file or a Hugging Face dataset.

Each row should represent one example. The required columns depend on `metric.input_format`. All datasets must also have a `label` column for the ground-truth label.

Read more in [Core concepts](/luna-studio/ui/core-concepts#metrics), [Test sets](/luna-studio/ui/datasets/test-sets), and [Dataset validation](/luna-studio/ui/datasets/validation).

#### Metrics with 1 input

* Example: `toxicity`, `sexism`, `prompt_injection`
* Dataset must contain exactly one feature column, for example `["input"]`

#### Metrics with 2 or more inputs

* Example: `instruction_adherence`
* Dataset must contain two or more feature columns, for example `["input", "output"]`

#### RAG based metrics

* Example: `context_adherence`, `context_relevance`, `chunk_relevance`
* Dataset must include a `documents` column
* Dataset must also include an `input` column; some metrics, such as context adherence, also require `output`

#### Agentic metrics

* Example: `tool_selection_quality`
* Dataset columns must be exactly `["tools", "input", "output"]`
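
The column rules above can be checked before upload with a few lines of Python. The following is a minimal sketch, not part of Galileo's tooling: `check_columns` and the metric kind names (`single_input`, `multi_input`, `rag`, `agentic`) are hypothetical, and it assumes that every column other than `label` counts as a feature column.

```python
# Hypothetical helper: the metric kinds below are illustrative names,
# not actual values of `metric.input_format`.
REQUIRED_AGENTIC = {"tools", "input", "output"}


def check_columns(columns, metric_kind):
    """Return a list of problems found in a dataset's column names."""
    problems = []
    if "label" not in columns:
        problems.append("missing 'label' column")

    # Assumption: every column other than `label` is a feature column.
    features = [c for c in columns if c != "label"]

    if metric_kind == "single_input":
        if len(features) != 1:
            problems.append(
                f"expected exactly 1 feature column, found {len(features)}"
            )
    elif metric_kind == "multi_input":
        if len(features) < 2:
            problems.append(
                f"expected 2 or more feature columns, found {len(features)}"
            )
    elif metric_kind == "rag":
        # `output` is only required by some RAG metrics, so it is not checked.
        for required in ("documents", "input"):
            if required not in features:
                problems.append(f"missing '{required}' column")
    elif metric_kind == "agentic":
        if len(features) != 3 or set(features) != REQUIRED_AGENTIC:
            problems.append("feature columns must be exactly tools, input, output")
    else:
        raise ValueError(f"unknown metric kind: {metric_kind}")
    return problems
```

An empty list means the column names satisfy the rules for that kind of metric. For example, `check_columns(["input", "label"], "rag")` reports the missing `documents` column.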

#### Advanced formats

`trace` and `session` formats are only supported in `label_only_mode` or when you skip data generation and proceed directly to training.

Labels should be manually assigned and should match the exact metric definition you want to train or evaluate.

### Required dataset size

As a rule of thumb:

* `300-500` samples is a good minimum size for the test set
* aim for at least `100` examples per class where possible

Try to keep the class distribution reasonably balanced so evaluation results are meaningful across classes.
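
The size guidance above can be turned into a quick sanity check on the `label` column. This is a sketch under assumptions: `summarize_test_set` is a hypothetical helper, and its default thresholds simply restate the rule of thumb (`300` samples overall, `100` per class).

```python
from collections import Counter


def summarize_test_set(labels, min_total=300, min_per_class=100):
    """Count labels per class and collect rule-of-thumb warnings."""
    counts = Counter(labels)
    total = sum(counts.values())
    warnings = []
    if total < min_total:
        warnings.append(f"only {total} samples; aim for at least {min_total}")
    # Sort by class name so the warnings come out in a stable order.
    for cls, n in sorted(counts.items(), key=lambda kv: str(kv[0])):
        if n < min_per_class:
            warnings.append(
                f"class {cls!r} has {n} examples; aim for at least {min_per_class}"
            )
    return counts, warnings


# Illustrative labels only, not real data.
labels = ["toxic"] * 60 + ["not_toxic"] * 260
counts, warnings = summarize_test_set(labels)
```

Here the set is large enough overall (320 samples), but the `toxic` class falls below 100 examples, so one warning is returned.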

### Training dataset guidance

If you already have a training dataset, labelled or unlabelled, a strong target for fine-tuning is around `2,000` labelled samples. Its class distribution should be similar to that of your test set, but it should not be extremely skewed, for example `99/1`.
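
One way to compare the two distributions is to compute each class's share of the data and look at the largest gap between training and test sets. The sketch below uses hypothetical helpers (`class_shares`, `max_share_gap`); what counts as "too skewed" or "too different" is left to your judgment, since no threshold is prescribed here.

```python
from collections import Counter


def class_shares(labels):
    """Return each class's fraction of the dataset."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def max_share_gap(train_labels, test_labels):
    """Largest absolute difference in class share between two datasets."""
    train = class_shares(train_labels)
    test = class_shares(test_labels)
    classes = set(train) | set(test)
    return max(abs(train.get(c, 0.0) - test.get(c, 0.0)) for c in classes)


# Illustrative labels only: a 99/1 training set against a balanced test set.
train_labels = [1] * 1980 + [0] * 20
test_labels = [1] * 250 + [0] * 250
train_shares = class_shares(train_labels)
gap = max_share_gap(train_labels, test_labels)
```

In this example the minority class makes up 1% of the training data and the share gap to the test set is 0.49, both signs that the training set should be rebalanced.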

If you do not have enough training data, synthetic generation can help create training examples before fine-tuning.

## LLM-as-a-Judge Prompt

The LLM-as-a-Judge prompt can be either a preset metric or a custom metric that you create in Galileo. For details, see [Custom LLM-as-a-Judge Metrics - Galileo](https://docs.galileo.ai/concepts/metrics/custom-metrics/custom-metrics-ui-llm#custom-llm-as-a-judge-metrics).

It is important to ensure that the LLM-as-a-Judge (LLMAJ) has high accuracy on your golden dataset. If it doesn’t, tune the prompt (either manually or using Autotune in Galileo) until its accuracy is high before creating a Luna metric.
This ensures that Luna fine-tuning starts with a good understanding of the metric and avoids garbage in, garbage out.
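
Judge accuracy on the golden set is the fraction of examples where the judge's label matches the human label. A minimal sketch, assuming you have exported both label lists in the same row order (`judge_accuracy` is a hypothetical helper, not a Galileo API):

```python
def judge_accuracy(golden_labels, judge_labels):
    """Fraction of golden-set examples where the judge agrees with the human label."""
    if len(golden_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    if not golden_labels:
        raise ValueError("label lists must not be empty")
    matches = sum(g == j for g, j in zip(golden_labels, judge_labels))
    return matches / len(golden_labels)


# Illustrative labels only: the judge disagrees on one of four examples.
golden = [1, 0, 1, 1]
judge = [1, 0, 0, 1]
accuracy = judge_accuracy(golden, judge)
```

With a skewed golden set, overall accuracy can hide poor performance on the minority class, so it is worth computing the same figure per class as well.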