# Training sets

> The dataset used to fine-tune the base model during a run.

A **training set** is the dataset that fine-tunes your [base model](/luna-studio/ui/core-concepts#base-models) during a [training run](/luna-studio/ui/runs/lifecycle). Training sets are typically much larger than test sets; generated training sets contain 2,000 labelled examples.

## Sources

In the run creation flow, you choose one of three top-level paths:

* Luna uses 20% of your test set as seeds and synthesizes 2,000 labelled rows via an LLM-as-judge prompt.
* Upload or import production logs. Luna Studio labels unlabelled logs before training.
* Reuse a generated, labelled, or uploaded training dataset from your workspace.

The **Add training logs** path lets you upload a `.csv` or `.jsonl` file, fetch a file from URL, or import a dataset from Galileo. Those same import methods are also available from the [Datasets page](/luna-studio/ui/datasets/overview) **Add training set** button.

## Schema

Training sets need at least one column:

| Column  | Required  | Notes                                                                             |
| ------- | --------- | --------------------------------------------------------------------------------- |
| `input` | Yes       | The text the metric will be trained on.                                           |
| `label` | Sometimes | Required for labelled training. Unlabelled logs must be labelled before training. |

### Labelled vs. unlabelled

* **Labelled**: every row has a `label` column matching the metric's output type. Required if you want supervised fine-tuning.
* **Unlabelled**: rows have an `input` only. Luna Studio labels the logs with your LLM-as-judge prompt first, saves a labelled training dataset, and then uses that labelled dataset for training.

When you add training logs, Luna Studio validates whether the dataset already has labels:

* A green check when the label column is present.
* A label-only flow when the label column is missing.

## Generated training sets

The most common path for a first run is **Generate from test set**. The flow:

1. Luna uses 20% of your test set as seed examples.
2. The model you pick synthesizes 50 sample rows following the metric prompt.
3. You review the sample rows and optionally regenerate with feedback.
4. Luna generates the full 2,000-example training set.

See [Step 3: Training set](/luna-studio/ui/runs/new-run/step-3-training-set#generate-from-test-set) for the full reference.

The resulting dataset shows up on the [Datasets page](/luna-studio/ui/datasets/overview) with source **Generated** and a subtitle like "Generated from rag-eval-v2". Each row carries an **Origin** marker so you can trace it back: rows synthesized by the generator render as "Generated", while rows seeded from a test set render as a chip with that test set's name.

## File formats

For uploads and URL fetches:

* **CSV**: standard comma-separated. Headers required.
* **JSONL**: one JSON object per line, with `input` and (optionally) `label` keys.

## Used in metric

The **Used in metric** column shows every metric whose runs reference this training set. If the column is empty, no metric uses the training set, so it is safe to delete.

## Where to go next

* The most common path for first runs.
* Walk through the Upload / URL / Galileo flows.
* The other dataset type, used to evaluate the metric.
* Schema and content checks Luna runs.
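
## Example: checking a JSONL file before upload

The schema and file-format rules on this page can be sketched as a local pre-upload check. This is a minimal illustration, not Luna Studio's own validation: the function name `check_training_jsonl` is hypothetical, and the handling of blank lines and of partially labelled files is an assumption, since this page does not say how Luna Studio treats either case. The sketch also does not check that each `label` matches the metric's output type.

```python
import json


def check_training_jsonl(text):
    """Classify JSONL text as 'labelled', 'unlabelled', or 'mixed'.

    Hypothetical helper: raises ValueError when a row breaks the schema
    described on this page (a JSON object with a non-empty `input`).
    """
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # Assumption: blank lines are skipped rather than rejected.
        row = json.loads(line)  # Invalid JSON raises json.JSONDecodeError.
        if not isinstance(row, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        if not isinstance(row.get("input"), str) or not row["input"]:
            raise ValueError(f"line {line_no}: missing or empty 'input'")
        rows.append(row)

    if not rows:
        raise ValueError("no rows found")

    labelled = sum(1 for row in rows if row.get("label") is not None)
    if labelled == len(rows):
        return "labelled"    # Every row carries a label.
    if labelled == 0:
        return "unlabelled"  # Input-only rows; these need labelling before training.
    return "mixed"           # Assumption: surfaced to the caller, behaviour in Luna Studio unknown.


sample = (
    '{"input": "Is the answer grounded?", "label": "yes"}\n'
    '{"input": "Does it cite sources?", "label": "no"}\n'
)
print(check_training_jsonl(sample))  # labelled
```

Running a check like this before uploading gives an early hint about whether the file is likely to get the green check or the label-only flow.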