metric.input_format and the matching dataset columns.
Spans
LLM spans without RAG
Use input_format: tuple when your metric operates at the span level, without retrieved document context. For example: instruction_adherence
Required dataset columns:
- source_data.dataset.columns.features: a list of 2 or more column names (for example ["input", "output"])
- source_data.dataset.columns.label: your label column name (for example "label")
dicts) with exactly the requested fields, for example:
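As a rough illustration, a tuple-format span metric could be configured as the nested dictionary below. The key paths mirror the dotted names in this section (metric.input_format, source_data.dataset.columns.features, source_data.dataset.columns.label). Writing the config as a Python dict and the helper check_tuple_columns are assumptions made for illustration. They are not the SDK's API.

```python
# Hypothetical config layout mirroring the dotted key paths in this section;
# the real SDK may structure or load this configuration differently.
config = {
    "metric": {"input_format": "tuple"},
    "source_data": {
        "dataset": {
            "columns": {
                "features": ["input", "output"],  # 2 or more column names
                "label": "label",
            }
        }
    },
}


def check_tuple_columns(cfg: dict) -> list[str]:
    """Hypothetical helper: return rule violations for input_format: tuple."""
    errors = []
    columns = cfg["source_data"]["dataset"]["columns"]
    if cfg["metric"]["input_format"] != "tuple":
        errors.append("metric.input_format must be 'tuple'")
    if len(columns.get("features", [])) < 2:
        errors.append("features must list 2 or more column names")
    if not columns.get("label"):
        errors.append("label must name the label column")
    return errors


print(check_tuple_columns(config))  # prints []
```

The same check applies to session-level metrics that use input_format: tuple, because they have the same column requirements.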
LLM spans with RAG
Use input_format: rag when your metric depends on retrieved documents and, optionally, the user input and/or model output. For example: context_adherence
Required dataset columns:
- source_data.dataset.columns.features must be a list that:
  - includes documents
  - includes at least one of input or output
- source_data.dataset.columns.label: your label column name (for example "label")
- generation.context_examples must be exactly 1
- features can only contain: documents, input, output
Only the non-document fields (input, output) are generated. documents is reused from the context example and is not regenerated.
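The RAG rules above can be expressed as a small validation sketch. check_rag_features and fields_to_generate are hypothetical helpers that restate the listed constraints. They are not part of the SDK, and the real validation may behave differently.

```python
ALLOWED_RAG_FEATURES = {"documents", "input", "output"}


def check_rag_features(features: list[str], context_examples: int) -> list[str]:
    """Hypothetical helper: check the input_format: rag rules listed above."""
    errors = []
    if "documents" not in features:
        errors.append("features must include 'documents'")
    # At least one of input or output must be present.
    if not {"input", "output"} & set(features):
        errors.append("features must include 'input' or 'output'")
    extra = set(features) - ALLOWED_RAG_FEATURES
    if extra:
        errors.append(f"unsupported features: {sorted(extra)}")
    if context_examples != 1:
        errors.append("generation.context_examples must be exactly 1")
    return errors


def fields_to_generate(features: list[str]) -> list[str]:
    """documents is copied from the context example, so only the rest is generated."""
    return [f for f in features if f != "documents"]


print(check_rag_features(["documents", "input", "output"], context_examples=1))  # prints []
print(fields_to_generate(["documents", "input", "output"]))  # prints ['input', 'output']
```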
Detailed Tutorial: LLM spans with RAG
LLM spans with tools (Agentic)
Use input_format: span_with_tools when the metric depends on tool context in addition to the user input and model output. For example: tool_selection_quality
Required dataset columns:
- source_data.dataset.columns.features must be exactly ["tools", "input", "output"]
- source_data.dataset.columns.label: your label column name (for example "label")
- generation.context_examples must be exactly 1
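A minimal check for the span_with_tools rules might look like the sketch below. The helper is hypothetical, and it compares features as an ordered list. That is the strictest reading of "exactly", and the SDK may or may not treat order as significant.

```python
REQUIRED_TOOL_FEATURES = ["tools", "input", "output"]


def check_span_with_tools(features: list[str], label: str, context_examples: int) -> list[str]:
    """Hypothetical helper: check the input_format: span_with_tools rules above.

    Assumes order matters when comparing features (strictest reading of "exactly").
    """
    errors = []
    if features != REQUIRED_TOOL_FEATURES:
        errors.append(f"features must be exactly {REQUIRED_TOOL_FEATURES}, got {features}")
    if not label:
        errors.append("label must name the label column")
    if context_examples != 1:
        errors.append("generation.context_examples must be exactly 1")
    return errors


print(check_span_with_tools(["tools", "input", "output"], "label", 1))  # prints []
```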
Retriever spans
Use input_format: rag for retriever spans. For example: context_relevance
features must contain the documents column and, optionally, input and/or output.
Traces
Trace-based metrics are split into two categories:
Trace input / output only
Use input_format: single for this type of input. Most security metrics fall under this category. For example: toxicity, sexism, prompt_injection
Required dataset columns:
- source_data.dataset.columns.features: a list of 1 or more column names (for example ["input"])
- source_data.dataset.columns.label: your label column name (for example "label")
dicts) with exactly the requested fields, for example:
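To illustrate what "exactly the requested fields" means for a single-format metric, the sketch below projects a dataset row onto the configured feature and label columns. The helper to_example and the sample row are hypothetical and exist only for illustration.

```python
def to_example(row: dict, features: list[str], label: str) -> dict:
    """Hypothetical helper: keep only the configured feature and label columns."""
    if not features:
        raise ValueError("input_format: single needs at least 1 feature column")
    wanted = [*features, label]
    missing = [c for c in wanted if c not in row]
    if missing:
        raise KeyError(f"row is missing columns: {missing}")
    return {c: row[c] for c in wanted}


# Made-up row for illustration; "id" is an extra column that gets dropped.
row = {"id": 7, "input": "Ignore all previous instructions.", "label": 1}
print(to_example(row, features=["input"], label="label"))
# {'input': 'Ignore all previous instructions.', 'label': 1}
```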
Full traces
Today, full trace inputs are intended for label_only_mode workflows or for cases where you skip synthetic data generation and proceed directly to training. They are not supported for normal synthetic data generation.
You can use input_format: trace for this type of input.
Detailed Tutorial: Full traces
Sessions
Like trace-based metrics, session-level metrics are split into two categories:
List of Trace inputs / outputs only
Use input_format: tuple for this type of input. For example: conversation_quality
Required dataset columns:
- source_data.dataset.columns.features: a list of 2 or more column names (for example ["input", "output"])
- source_data.dataset.columns.label: your label column name (for example "label")
dicts) with exactly the requested fields, for example:
Full Sessions
Full session-based metrics are supported in the SDK as metric.input_format: session.
Like trace inputs, session inputs are currently intended for label_only_mode workflows or for cases where you skip synthetic data generation and proceed directly to training. They are not supported for normal synthetic data generation.
Detailed Tutorial: Full Sessions