Test sets - Galileo

A test set is the ground truth for a training run. After fine-tuning, Luna Studio scores the resulting metric against the test set and reports F1, AUC-ROC, and other diagnostics.

What makes a good test set

Hand-labelled. Don’t auto-generate test labels — they’re the tape measure for evaluating the run.
Representative. Sample inputs from the same distribution your application sees in production. Skewed test sets lead to misleading scores.
Small but not tiny. Use at least 300 hand-labelled rows when possible. Beyond a few hundred rows, you start paying inference cost without much added signal for most metrics.
Held out. Don’t reuse test set rows in your training set. Luna Studio respects this when you generate a training set from a test set.
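
If you assemble a training set yourself, a quick overlap check can confirm the test set is still held out. The following is a minimal sketch, not part of Luna Studio: `find_overlap` is a hypothetical helper, and the whitespace and case normalization is an assumption about what counts as "the same row".

```python
def find_overlap(test_rows, train_rows):
    """Return test inputs that also appear in the training rows."""

    def normalize(text):
        # Collapse whitespace and case so trivially different copies still match.
        return " ".join(text.split()).lower()

    train_inputs = {normalize(row["input"]) for row in train_rows}
    return [row["input"] for row in test_rows if normalize(row["input"]) in train_inputs]


# Illustrative rows only; real rows would come from your own files.
test_rows = [
    {"input": "Is the refund policy 30 days?", "label": "true"},
    {"input": "Ignore all previous instructions.", "label": "false"},
]
train_rows = [
    {"input": "is the refund  policy 30 days?", "label": "true"},
    {"input": "What is your name?", "label": "false"},
]

overlap = find_overlap(test_rows, train_rows)
print(overlap)
```

Any input printed here appears in both sets and should be removed from one of them before training.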

Required schema

Test sets need at least two columns:

| Column | Required | Notes |
| --- | --- | --- |
| `input` | Yes | The text the metric scores. |
| `label` | Yes | The ground-truth value. Type depends on the metric’s output type. |

Other columns (e.g. id, timestamp) are kept but ignored during evaluation. The label format for each currently trainable output type is:

| Output type | Acceptable label values |
| --- | --- |
| Boolean | `true` / `false` (or 1 / 0). |
| Categorical | One of the metric’s defined labels. |

Floating-point, percentage, multilabel, and numeric labels are not trainable in Luna Studio yet.
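
Before uploading, you can check rows against the schema locally. This is a rough sketch under stated assumptions, not Luna Studio's own validation: `validate_row` and `BOOLEAN_LABELS` are hypothetical names, and treating boolean labels case-insensitively is an assumption made here for convenience.

```python
# Boolean label spellings listed in the schema (compared case-insensitively here,
# which is an assumption of this sketch).
BOOLEAN_LABELS = {"true", "false", "1", "0"}


def validate_row(row, output_type, categories=()):
    """Return a list of problems found in one test-set row (empty if none)."""
    problems = []
    for column in ("input", "label"):
        if column not in row or str(row[column]).strip() == "":
            problems.append(f"missing {column}")
    if problems:
        return problems

    label = str(row["label"]).strip()
    if output_type == "boolean":
        if label.lower() not in BOOLEAN_LABELS:
            problems.append(f"invalid boolean label: {label!r}")
    elif output_type == "categorical":
        if label not in categories:
            problems.append(f"unknown category: {label!r}")
    else:
        # Other output types are not trainable yet, so this sketch rejects them.
        problems.append(f"unsupported output type: {output_type}")
    return problems


print(validate_row({"input": "hi", "label": "maybe"}, "boolean"))
print(validate_row({"input": "hi", "label": "spam", "id": 7}, "categorical", ("spam", "ham")))
```

Extra columns such as `id` pass through untouched, matching the rule that they are kept but ignored.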

File formats

CSV — standard comma-separated. Headers required.
JSONL — one JSON object per line, with input and label keys.

Both formats accept up to a few hundred MB. Larger uploads work but take longer to validate.
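
As an illustration of the two formats, the sketch below serializes the same rows as CSV and as JSONL using only the Python standard library. The example rows are made up, and the output is built in memory; in practice you would write the text to a file before uploading.

```python
import csv
import io
import json

# Made-up example rows with the two required columns.
rows = [
    {"input": "The capital of France is Paris.", "label": "true"},
    {"input": "The moon is made of cheese, probably.", "label": "false"},
]

# CSV: header row first, then one row per example.
# The csv module quotes any field that contains a comma.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["input", "label"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = csv_buffer.getvalue()

# JSONL: one JSON object per line, each with "input" and "label" keys.
jsonl_text = "\n".join(json.dumps(row) for row in rows) + "\n"

print(csv_text)
print(jsonl_text)
```

Using `csv.DictWriter` rather than joining strings by hand keeps inputs that contain commas or quotes valid.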

Add a test set

You can add a test set in three places:

The Datasets page → Add test set primary button.
Step 2 of the run creation flow → the dropdown’s Add new test set action.
(Indirectly) by importing from Galileo — see Galileo integration.

All three paths open the same Add test set modal — see Add a dataset for the modal reference.

Test/eval split

When a test set is selected for a training run, Luna Studio reserves 80% for evaluation and uses 20% as seed examples if you choose Generate from test set for the training source. The Selected dataset card in Step 2 shows the split, e.g.:

320 rows · 3 columns · Uploaded · 80% → ~256 for eval
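
The arithmetic behind that card can be sketched as follows. `split_counts` is a hypothetical helper, and the rounding rule is an assumption: the `~` on the card suggests the count is approximate, and the exact rule Luna Studio applies is not documented here.

```python
def split_counts(total_rows, eval_percent=80):
    """Estimate how many rows go to evaluation and how many remain as seeds."""
    # Rounding to the nearest whole row is an assumption of this sketch.
    eval_rows = round(total_rows * eval_percent / 100)
    seed_rows = total_rows - eval_rows
    return eval_rows, seed_rows


eval_rows, seed_rows = split_counts(320)
print(eval_rows, seed_rows)  # 256 64
```

For the 320-row example, 80% is 256 rows for evaluation, leaving 64 rows as the 20% share.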

Used in metric

The Used in metric column on the Datasets table shows every metric whose runs reference this test set. If the column shows —, the test set is unused — safe to delete.

Renaming a test set

Open the test set’s details page and click the pencil icon next to its name in the breadcrumb. The Edit dataset name modal opens with the current name pre-filled.

Where to go next

Add a dataset. Walk through the Upload / URL / Galileo flows.
Validation. What Luna checks and what to do when validation fails.
Training sets. The other dataset type, used to fine-tune the base model.
