Human Labelled Test Dataset
A golden set is a set of data points that is highly representative of your real production data and has been human-labelled according to the definition of the metric. A golden test set is crucial for robust model development and evaluation: it serves as the single source of truth against which all model performance is measured.
Required dataset format
Source data can come from a CSV file or a Hugging Face dataset. Each row should represent one example. The required columns depend on metric.input_format. Note that all datasets must have a label column for the ground-truth label.
Read more in Core concepts, Test sets, and Dataset validation.
Metrics with 1 input (single)
- Examples: toxicity, sexism, prompt_injection
- Dataset must contain exactly one feature column, for example ["input"]
Metrics with 2 or more inputs (tuple)
- Example: instruction_adherence
- Dataset must contain two or more feature columns, for example ["input", "output"]
RAG based metrics (rag)
- Examples: context_adherence, context_relevance, chunk_relevance
- Dataset must include a documents column
- Dataset must also include an input column; some metrics, such as context adherence, also require output
Agentic metrics (span_with_tools)
- Example: tool_selection_quality
- Dataset columns must be exactly ["tools", "input", "output"]
Advanced formats
The trace and session formats are supported only in label_only_mode or when you skip data generation and proceed directly to training.
Labels should be manually assigned and should match the exact metric definition you want to train or evaluate.
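The column rules for the single, tuple, rag, and span_with_tools formats can be checked before a dataset is uploaded. The following is a minimal sketch and not part of any Galileo SDK: the helper check_columns, the requires_output flag, and the column name "label" for the ground-truth label are all assumptions made for illustration, and the label column is treated as separate from the feature columns.

```python
# Sketch only: the helper name and the "label" column name are assumptions,
# not part of any Galileo API.
LABEL_COLUMN = "label"


def check_columns(columns, input_format, requires_output=False):
    """Return a list of problems; an empty list means the columns fit."""
    columns = list(columns)
    problems = []
    if LABEL_COLUMN not in columns:
        problems.append("missing label column")
    # Everything except the label is treated as a feature column.
    features = [c for c in columns if c != LABEL_COLUMN]

    if input_format == "single":
        if len(features) != 1:
            problems.append("single expects exactly one feature column")
    elif input_format == "tuple":
        if len(features) < 2:
            problems.append("tuple expects two or more feature columns")
    elif input_format == "rag":
        required = ["documents", "input"]
        if requires_output:  # for example, context adherence
            required.append("output")
        for name in required:
            if name not in features:
                problems.append(f"rag is missing column: {name}")
    elif input_format == "span_with_tools":
        if sorted(features) != ["input", "output", "tools"]:
            problems.append("span_with_tools expects exactly tools, input, output")
    else:
        # trace and session formats are not covered by this sketch.
        problems.append(f"unhandled input_format: {input_format}")
    return problems


print(check_columns(["input", "label"], "single"))
# []
print(check_columns(["documents", "input", "label"], "rag", requires_output=True))
# ['rag is missing column: output']
```

Returning a list of problems rather than raising on the first one lets you report every column issue in a single pass.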
Required dataset size
As a rule of thumb:
- 300-500 samples is a good minimal size for the test set
- aim for at least 100 examples per class where possible
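These size targets are easy to check from the label column alone. The sketch below is illustrative: size_report is a hypothetical helper, and the two constants simply restate the rule of thumb above (the lower bound of 300 samples and 100 examples per class).

```python
from collections import Counter

# Thresholds taken from the rule of thumb above.
MIN_TEST_SIZE = 300
MIN_PER_CLASS = 100


def size_report(labels):
    """Summarise test-set size and list classes below the per-class target."""
    counts = Counter(labels)
    total = sum(counts.values())
    below = sorted(cls for cls, n in counts.items() if n < MIN_PER_CLASS)
    return {
        "total": total,
        "meets_min_size": total >= MIN_TEST_SIZE,
        "classes_below_target": below,
    }


# Hypothetical label column with two classes.
labels = ["toxic"] * 120 + ["safe"] * 200
print(size_report(labels))
# {'total': 320, 'meets_min_size': True, 'classes_below_target': []}
```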
Training dataset guidance
If you already have a labelled or unlabelled training dataset, a strong target for fine-tuning is around 2,000 labelled samples. The class distribution should be similar to that of your test set, but make sure it is not extremely skewed, for example 99/1.
If you do not have enough training data, synthetic generation can help create training examples before fine-tuning.
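To compare the class distribution of a training set against the test set, you can compute per-class shares for both and look at the largest gap. This is a rough sketch under stated assumptions: compare_distributions is a hypothetical helper, and the min_share default of 0.05 is an arbitrary illustrative cutoff for "extremely skewed", since the guidance above only names 99/1 as an example.

```python
from collections import Counter


def class_shares(labels):
    """Map each class to its fraction of the dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    return {cls: n / total for cls, n in counts.items()}


def compare_distributions(train_labels, test_labels, min_share=0.05):
    """Report the largest train/test share gap and flag extreme skew.

    min_share is an assumed cutoff, not a documented threshold.
    """
    train = class_shares(train_labels)
    test = class_shares(test_labels)
    classes = set(train) | set(test)
    max_gap = max(abs(train.get(c, 0.0) - test.get(c, 0.0)) for c in classes)
    minority_share = min(train.values())
    return {
        "max_gap": max_gap,
        "minority_share": minority_share,
        "extremely_skewed": minority_share < min_share,
    }


# Hypothetical 99/1 training split compared with a balanced test set.
report = compare_distributions(["pos"] * 99 + ["neg"], ["pos"] * 50 + ["neg"] * 50)
print(report["extremely_skewed"])
# True
```

A large max_gap suggests resampling or collecting more labelled data for the under-represented classes before fine-tuning.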