This page explains every field in the data_generation section of the YAML config.
The SDK reads a single YAML file that contains both the data_generation and training sections. You typically run:
  • run_data_generation(config_path=...)
  • run_training(config_path=...)

File format

Your YAML file is a run config that includes top-level keys and a nested data_generation section:
run_steps: ["data_generation", "training"]
pipeline_provider: "local"
metric_name: "custom"

data_generation:
  # described below
  metric: {}
  source_data: {}
  llm: {}
  generation: {}
  output: {}
  labelling: {}
  data_quality_metrics: {}

Configuration structure

The data_generation section has seven subsections:
  • metric: what you’re generating data for (classes, rubrics, input format)
  • source_data: the seed dataset (CSV or Hugging Face) and how it is sampled
  • llm: the primary LLM used for generation
  • generation: how many examples to generate, their label distribution, and concurrency
  • output: where the generated dataset is written and published
  • labelling: optional LLM-as-a-judge labeling step
  • data_quality_metrics: optional UMAP and drift diagnostics

metric

Defines the classification metric you are generating data for.
| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| metric.name | string | Yes | | Metric name used in prompts and artifacts. |
| metric.description | string \| null | Conditionally | null | Required when llmaj_source_prompt is empty. |
| metric.type | string | Yes | | One of binary, multi-class. |
| metric.input_format | string | Yes | | One of single, tuple, rag. |
| metric.class_labels | list[class_label] \| null | Conditionally | null | Required when llmaj_source_prompt is empty. |
| metric.llmaj_source_prompt | string \| null | No | null | If set, description and class labels may be extracted automatically. We recommend setting this. |
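
As an illustration of the recommended form, here is a minimal sketch of a metric block that relies on llmaj_source_prompt. It assumes the dotted field names above map to nested YAML keys under data_generation; the metric name and the prompt text are hypothetical.

```yaml
data_generation:
  metric:
    name: "toxicity"            # hypothetical metric name
    type: "binary"              # one of: binary, multi-class
    input_format: "single"      # one of: single, tuple, rag
    # Hypothetical judge prompt. With this set, description and class_labels
    # may be extracted automatically, so they are left out here.
    llmaj_source_prompt: |
      You are a judge. Label the text as toxic (true) or not toxic (false).
```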

metric.class_label

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| name | string | Yes | Human-readable label name (e.g. positive). |
| label | int \| string \| bool | Yes | The label value used in your dataset. Must be unique. |
| rubric | string | Conditionally | Required when llmaj_source_prompt is empty. |
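
When llmaj_source_prompt is empty, the description, class labels, and rubrics have to be written out by hand. The following is a sketch of what that could look like, assuming class_labels is a YAML list of objects with the three fields above; the metric name, description, and rubric wording are hypothetical.

```yaml
data_generation:
  metric:
    name: "sentiment"                      # hypothetical metric name
    description: "Whether a review expresses positive sentiment."  # hypothetical
    type: "binary"
    input_format: "single"
    llmaj_source_prompt: null              # empty, so the fields below are required
    class_labels:
      - name: "positive"
        label: 1                           # must be unique across labels
        rubric: "The review clearly praises the product."          # hypothetical
      - name: "negative"
        label: 0
        rubric: "The review criticizes or rejects the product."    # hypothetical
```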

source_data

Defines where seed examples come from and how they’re sampled.

source_data.dataset

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| source_type | string | Yes | | huggingface or csv. |
| columns.features | list[string] | Yes | | Feature column names. Rules depend on metric.input_format. |
| columns.label | string | Yes* | "label" (base config) | Required unless labelling.label_only_mode is true. |
| huggingface.name | string | If HF | | Dataset name like org/repo. |
| huggingface.split | string | If HF | | Split name (e.g. train, test). |
| huggingface.revision | string | No | "" | Optional Git revision/branch. |
| csv.file_path | string | If CSV | | Path to a CSV file on disk. |
| csv.train_file_path | string | No | "" | Used for label-only workflows. |
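
A possible dataset block for a Hugging Face source is sketched below, with the CSV alternative shown as comments. It assumes dotted names such as columns.features nest as YAML mappings under source_data.dataset. The column name "text" and the CSV path are hypothetical, and using a single feature column for the single input format is an assumption.

```yaml
data_generation:
  source_data:
    dataset:
      source_type: "huggingface"     # or "csv"
      columns:
        features: ["text"]           # hypothetical feature column
        label: "label"               # base-config default
      huggingface:
        name: "org/repo"             # placeholder dataset name
        split: "train"
        revision: ""                 # optional Git revision/branch
      # CSV alternative (set source_type: "csv" and drop the huggingface block):
      # csv:
      #   file_path: "./data/seed.csv"   # hypothetical path
```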

source_data.sampling

| Field | Type | Required | Default (base config) | Notes |
| --- | --- | --- | --- | --- |
| enhancement_fraction | float | Yes | 0.2 | Must be between 0 and 1 (exclusive). |
| max_enhancement_samples_per_class | int | Yes | 50000 | Upper bound for seed examples per class. |
| include_enhancement_in_train | bool | Yes | true | If true, seed examples are included in the final train split. |
| seed | int | Yes | 42 | Controls sampling reproducibility. |
| pinned_enhancement_row_ids | list[int] | No | [] | Force-include specific rows in the seed set. |
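
The sampling fields could be written out as follows. This sketch uses the base-config defaults from the table and assumes they nest under source_data.sampling; the pinned row ids are hypothetical.

```yaml
data_generation:
  source_data:
    sampling:
      enhancement_fraction: 0.2                  # strictly between 0 and 1
      max_enhancement_samples_per_class: 50000   # cap on seed examples per class
      include_enhancement_in_train: true         # keep seed rows in the train split
      seed: 42                                   # fixes the sampling for reproducibility
      pinned_enhancement_row_ids: [3, 17]        # hypothetical rows to force-include
```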

llm

Primary LLM used for generation.
| Field | Type | Required | Default (base config) | Notes |
| --- | --- | --- | --- | --- |
| provider | string | Yes | "openai" | One of openai, azure, vegas, gemini. |
| model | string | Yes | "gpt-4.1" | Provider model identifier. |
| embedding_model | string | Yes | "text-embedding-3-small" | Used for optional quality metrics. |
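
Written out with the base-config defaults, and assuming the three fields sit directly under data_generation.llm, the block might look like this:

```yaml
data_generation:
  llm:
    provider: "openai"                          # one of: openai, azure, vegas, gemini
    model: "gpt-4.1"                            # provider model identifier
    embedding_model: "text-embedding-3-small"   # used by the optional quality metrics
```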

generation

Controls how many examples you generate and how fast.
| Field | Type | Required | Default (base config) | Notes |
| --- | --- | --- | --- | --- |
| context_examples | int | Yes | 5 | Number of seed examples placed in the prompt per call. For rag, must be 1. |
| examples_per_call | int | Yes | 10 | Examples requested per LLM call. |
| total_examples | int | Yes | 2000 | Total synthetic examples across all classes. |
| label_distribution | map[string,float] | Yes | {} | Empty means auto-infer from the holdout/test split; otherwise the values should sum to ~1.0. |
| async_config.max_concurrent_requests | int | Yes | 5 | Maximum number of concurrent LLM requests. |
| debug_save_prompts | bool | No | false | Save prompts/responses for debugging. |
| special_instructions | string \| null | No | null | Extra guidance appended to prompts. |
| llm_provider | string \| null | No | null | Optional override; must be set together with llm_model. |
| llm_model | string \| null | No | null | Optional override; must be set together with llm_provider. |
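
Below is a sketch of a generation block with an explicit label distribution and an LLM override. It assumes async_config nests as a mapping and that label_distribution is keyed by class label name; the table does not say whether keys are label names or label values, so treat that as a guess. The class names and the override model are hypothetical.

```yaml
data_generation:
  generation:
    context_examples: 5          # must be 1 when metric.input_format is rag
    examples_per_call: 10
    total_examples: 2000
    label_distribution:          # use {} to auto-infer from the holdout/test split
      "positive": 0.5            # hypothetical class key
      "negative": 0.5            # values sum to 1.0
    async_config:
      max_concurrent_requests: 5
    debug_save_prompts: false
    special_instructions: null
    # Optional override: both fields must be set together.
    llm_provider: "openai"
    llm_model: "gpt-4.1"         # hypothetical override value
```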

output

Controls where the generated dataset goes.
| Field | Type | Required | Default (base config) | Notes |
| --- | --- | --- | --- | --- |
| push_to_hub | bool | Yes | true | Push to Hugging Face Hub. |
| push_to_object_store | bool | No | false | Upload a compressed dataset artifact to the configured object store. |
| local_path | string | Yes | "./data/generated_data" | Local artifact directory. |
| object_store_bucket | string \| null | If object store | "${LUNA_OBJECT_STORE_BUCKET:-luna-ft}" | Required when push_to_object_store is true. |
| object_store_blob | string \| null | If object store | "generated_data" | Required when push_to_object_store is true. |
| dataset.repo_name | string | Yes | | Local folder name and/or hub repo name. |
| dataset.private | bool | Yes | true | Hub visibility. |
| dataset.splits.train | string | Yes | "train" | Output split name for training data. |
| dataset.splits.test | string | Yes | "test" | Output split name for holdout/test data. |
| hub.organization | string | Yes | "rungalileo" | Required if push_to_hub is true. |
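
One way the output block could be laid out is sketched here, assuming dataset, dataset.splits, and hub nest as YAML mappings. The repo name and organization are hypothetical; the other values are the base-config defaults from the table.

```yaml
data_generation:
  output:
    push_to_hub: true
    push_to_object_store: false
    local_path: "./data/generated_data"
    # Only needed when push_to_object_store is true:
    object_store_bucket: "${LUNA_OBJECT_STORE_BUCKET:-luna-ft}"
    object_store_blob: "generated_data"
    dataset:
      repo_name: "sentiment-synthetic"   # hypothetical folder / hub repo name
      private: true
      splits:
        train: "train"
        test: "test"
    hub:
      organization: "my-org"             # hypothetical; required when push_to_hub is true
```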

labelling

Optional: run LLM-as-a-judge labeling workflows. If enabled, metric.llmaj_source_prompt is required.
| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| enabled | bool | No | false | Enables labeling. |
| label_only_mode | bool | No | false | Label an existing train file without generating new examples. |
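
Because enabling labelling makes metric.llmaj_source_prompt required, the two settings are shown together in this sketch. The prompt text is hypothetical, and other required metric fields are omitted for brevity.

```yaml
data_generation:
  metric:
    # Required once labelling.enabled is true. Hypothetical prompt text.
    llmaj_source_prompt: |
      You are a judge. Label the text as positive or negative.
  labelling:
    enabled: true
    label_only_mode: false   # true labels an existing train file without generating new examples
```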

data_quality_metrics

Optional: compute UMAP and drift-based diagnostics.
| Field | Type | Required | Default (base config) | Notes |
| --- | --- | --- | --- | --- |
| enabled | bool | No | true | Master switch. |
| random_seed | int | No | 42 | Reproducibility. |
| artifact_subdir | string | No | "artifacts" | Folder under output.local_path. |
| umap.enabled | bool | No | false | Enable UMAP projections. |
| umap.mode | string | No | "holdout_anchor" | joint or holdout_anchor. |
| umap.umap_neighbors | int | No | 15 | Neighbor count for UMAP. |
| umap.umap_min_dist | float | No | 0.1 | Minimum distance for UMAP. |
| drift.enabled | bool | No | true | Enable drift scoring. |
| drift.k_neighbors | int | No | 30 | k-NN size. |
| drift.drift_threshold | float | No | 0.95 | Percentile threshold in [0,1]. |
| drift.index_dir | string | No | "/tmp/faiss" | Directory for the FAISS index. |
| drift.output_filename | string | No | "drift_scores.parquet" | Output file for drift scores. |
| drift.sort_by_score_desc | bool | No | true | Sort drift results descending. |
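
For reference, here is a sketch of the full data_quality_metrics block with UMAP switched on. It assumes umap and drift nest as YAML mappings; every value is the base-config default from the table except umap.enabled, which is flipped to true for illustration.

```yaml
data_generation:
  data_quality_metrics:
    enabled: true                  # master switch
    random_seed: 42
    artifact_subdir: "artifacts"   # folder under output.local_path
    umap:
      enabled: true                # default is false
      mode: "holdout_anchor"       # or "joint"
      umap_neighbors: 15
      umap_min_dist: 0.1
    drift:
      enabled: true
      k_neighbors: 30
      drift_threshold: 0.95        # percentile threshold in [0, 1]
      index_dir: "/tmp/faiss"
      output_filename: "drift_scores.parquet"
      sort_by_score_desc: true
```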