Inputs Required
- Config
- Dataset
- LLM provider integration
What it does
- Reads your data_generation config section
- Loads the source dataset (CSV or Hugging Face)
- Consumes a small portion of your test set to create synthetic training data (defaults to 20%, can be configured with source_data.sampling.enhancement_fraction)
- Generates synthetic examples using your configured LLM
- Writes a dataset artifact locally and/or pushes to Hugging Face (depending on output.push_to_hub)
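As a rough illustration of how the two settings named above might be looked up, here is a minimal sketch. Only the dotted key names (source_data.sampling.enhancement_fraction and output.push_to_hub) come from this page. The nested-dict layout, the example values, the False fallback for push_to_hub, and the get_setting helper are assumptions made for the example, not the tool's actual config loader.

```python
from typing import Any

# Hypothetical config layout: only the dotted key names come from this page.
# The nesting in a plain dict and the example values are assumptions.
config = {
    "source_data": {"sampling": {"enhancement_fraction": 0.2}},
    "output": {"push_to_hub": False},
}


def get_setting(cfg: dict, dotted_key: str, default: Any = None) -> Any:
    """Walk a nested dict along a dotted path; return default if any part is missing."""
    node: Any = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


# Fall back to the documented 20% default when the key is absent.
fraction = get_setting(config, "source_data.sampling.enhancement_fraction", default=0.2)
# The False fallback here is an assumption, not a documented default.
push_to_hub = get_setting(config, "output.push_to_hub", default=False)
```

A helper like this keeps the default in one place, so leaving enhancement_fraction out of the config still yields 0.2.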
Key concepts
Output Dataset
The output dataset has 2 splits:
- Train: The training data for the Luna metric
- Test: Your original test set minus the consumed portion (defaults to 80% of your original test set)
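To make the split arithmetic concrete, the sketch below partitions a test set into a consumed portion and a remaining portion. It assumes the consumed rows are chosen by a seeded random sample, which this page does not state; the function name split_test_set and its parameters are hypothetical. The consumed rows are only the seed material for generation, so the Train split would hold the synthetic examples produced from them, not these rows themselves.

```python
import random


def split_test_set(rows, enhancement_fraction=0.2, seed=0):
    """Return (consumed, remaining) partitions of rows.

    Assumption: rows are picked by a seeded random sample. The real tool
    may select the consumed portion differently.
    """
    if not 0.0 <= enhancement_fraction <= 1.0:
        raise ValueError("enhancement_fraction must be between 0 and 1")
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_consumed = round(len(rows) * enhancement_fraction)
    consumed_idx = set(indices[:n_consumed])
    # Preserve the original row order within each partition.
    consumed = [row for i, row in enumerate(rows) if i in consumed_idx]
    remaining = [row for i, row in enumerate(rows) if i not in consumed_idx]
    return consumed, remaining


# 100 placeholder test rows with the default fraction.
test_rows = [f"example-{i}" for i in range(100)]
consumed, remaining = split_test_set(test_rows)
```

With 100 rows and the default fraction, 20 rows are consumed and 80 remain as the Test split, which matches the 20% / 80% defaults described above.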