Experiment
Object-centric interface for Galileo experiments. An experiment represents a systematic evaluation framework for running controlled tests on datasets to measure and compare AI model performance.
Important Notes
Two-Phase Execution: Experiments are created in two phases:
- Create the experiment metadata (name, dataset, optional prompt)
- Run the experiment by creating a job that executes on the dataset
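The two phases can be pictured with a small stand-in class. This is only a sketch of the pattern: `TwoPhaseSketch` and its return shape are hypothetical and are not the Galileo `Experiment` class. The one behavior taken from this page is that running before creating raises `ValueError`.

```python
class TwoPhaseSketch:
    """Hypothetical stand-in used to illustrate create-then-run."""

    def __init__(self, name: str, dataset: list) -> None:
        self.name = name
        self.dataset = dataset
        self.created = False

    def create(self) -> "TwoPhaseSketch":
        # Phase 1: register metadata only; nothing executes yet.
        self.created = True
        return self

    def run(self) -> dict:
        # Phase 2: start a job over the dataset; requires phase 1.
        if not self.created:
            raise ValueError("experiment has not been created yet")
        return {"status": "started", "rows": len(self.dataset)}


sketch = TwoPhaseSketch("demo", [{"input": "a"}])
result = sketch.create().run()
```

Separating the phases means the metadata can be inspected or adjusted before any job is started.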
add_tag
key (str): Tag key (e.g., "environment", "version", "team")
value (str): Tag value (e.g., "production", "1.0.0", "ml-team")
tag_type (str): Tag category, defaults to "generic". Other options: "rag"
aggregate_metrics
create
dataset
delete
experiment_columns
A galileo.shared.column.ColumnCollection of all columns available
in the experiment comparison table. Scorer-backed metric columns carry UUID-based IDs
of the form "metrics/{scorer-uuid}", which map directly to the keys returned by metric_aggregates.
export_records
record_type: The type of records to export (SPAN, TRACE, or SESSION).
filters: A list of filters to apply to the export.
sort: A sort clause to order the exported records.
export_format: The desired format for the exported data.
column_ids: A list of column IDs to include in the export.
redact: Redact sensitive data from the response.
get
name (str): The experiment name.
project_id (Optional[str]): The project ID. If neither project_id nor project_name is provided, falls back to the GALILEO_PROJECT_ID or GALILEO_PROJECT environment variables.
project_name (Optional[str]): The project name. If neither project_id nor project_name is provided, falls back to the GALILEO_PROJECT environment variable.
get_metric_aggregate
The metric argument may be any of:
- GalileoMetrics enum value (galileo.schema.metrics.GalileoMetrics): its value IS the human-readable label (e.g. GalileoMetrics.correctness → "Correctness").
- Scorer UUID string: direct lookup in metric_aggregates, no column resolution needed.
- Human-readable label string (e.g. "Correctness"): resolved via experiment_columns.
- Legacy metric_key_alias string (e.g. "correctness"): fallback after label matching fails.
Returns None if metric_aggregates is not yet populated
(metrics still computing) or the metric is not found.
Arguments
metric: Any of: a GalileoMetrics enum value, scorer UUID string, human-readable label, or legacy metric_key_alias.
Returns
MetricAggregates | None: Aggregate stats with avg, min_, max_, p50, p90, p95, p99, count, and value_distribution fields; or None if not available.
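The lookup order described above can be sketched as follows. This is a minimal illustration, not the SDK's implementation. `MetricSketch`, `ColumnSketch`, and `resolve_metric` are hypothetical stand-ins. The column attribute names `label` and `metric_key_alias` are assumptions. Plain dicts stand in for MetricAggregates objects.

```python
import uuid
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MetricSketch(Enum):
    # Hypothetical stand-in for GalileoMetrics: the value is the label.
    correctness = "Correctness"


@dataclass
class ColumnSketch:
    # Hypothetical stand-in for one experiment column.
    id: str                                 # "metrics/<scorer-uuid>"
    label: str                              # human-readable label (assumed name)
    metric_key_alias: Optional[str] = None  # legacy name (assumed name)


def _is_uuid(text: str) -> bool:
    try:
        uuid.UUID(text)
        return True
    except ValueError:
        return False


def resolve_metric(metric, aggregates: Optional[dict], columns: list):
    """Return the aggregate entry for `metric`, or None if unavailable."""
    if not aggregates:
        return None  # metrics still computing
    # 1. Enum value: unwrap to its human-readable label.
    name = metric.value if isinstance(metric, Enum) else metric
    # 2. Scorer UUID: direct lookup, no column resolution.
    if _is_uuid(name):
        return aggregates.get(name)
    # 3. Label match across all columns, then 4. legacy alias as a fallback.
    for attr in ("label", "metric_key_alias"):
        for col in columns:
            if getattr(col, attr) == name:
                return aggregates.get(col.id.removeprefix("metrics/"))
    return None


SCORER = "550e8400-e29b-41d4-a716-446655440000"
columns = [ColumnSketch(f"metrics/{SCORER}", "Correctness", "correctness")]
aggregates = {SCORER: {"avg": 0.9}}
by_enum = resolve_metric(MetricSketch.correctness, aggregates, columns)
```

All four input forms resolve to the same entry here; an unknown name or an empty aggregates mapping yields None.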
get_prompt_template_settings
get_sessions
filters: A list of filters to apply to the query.
sort: A sort clause to order the query results.
limit: The maximum number of records to return.
starting_token: The token for the next page of results.
get_spans
filters: A list of filters to apply to the query.
sort: A sort clause to order the query results.
limit: The maximum number of records to return.
starting_token: The token for the next page of results.
get_status
get_traces
filters: A list of filters to apply to the query.
sort: A sort clause to order the query results.
limit: The maximum number of records to return.
starting_token: The token for the next page of results.
has_traces
is_winner
list
project_id (Optional[str]): The project ID. If neither project_id nor project_name is provided, falls back to the GALILEO_PROJECT_ID or GALILEO_PROJECT environment variables.
project_name (Optional[str]): The project name. If neither project_id nor project_name is provided, falls back to the GALILEO_PROJECT environment variable.
metric_aggregates
Key types
- UUID keys (36-char strings, e.g. "550e8400-e29b-41d4-a716-446655440000"): scorer-backed metrics. The UUID matches column.id.removeprefix("metrics/") for the corresponding entry in experiment_columns.
- Raw-string keys (e.g. "cost", "duration_ns"): system metrics computed without a scorer. These do not appear in experiment_columns.
Resolving UUIDs to human labels
Use experiment_columns to look up the display label and legacy metric name for each UUID.
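One way to do this lookup is to build a UUID-to-label map from the columns and re-key the aggregates with it. The sketch below uses stand-in data rather than a live experiment. `ColumnSketch` and `label_aggregates` are hypothetical, the `label` attribute name is an assumption, and only `column.id` with its "metrics/" prefix comes from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnSketch:
    # Hypothetical stand-in for one experiment column.
    id: str                                 # "metrics/<scorer-uuid>"
    label: str                              # display label (assumed name)
    metric_key_alias: Optional[str] = None  # legacy name (assumed name)


def label_aggregates(aggregates: dict, columns: list) -> dict:
    """Re-key aggregates by display label; raw-string keys pass through."""
    labels = {
        col.id.removeprefix("metrics/"): col.label
        for col in columns
        if col.id.startswith("metrics/")
    }
    return {labels.get(key, key): stats for key, stats in aggregates.items()}


SCORER = "550e8400-e29b-41d4-a716-446655440000"
columns = [ColumnSketch(f"metrics/{SCORER}", "Correctness", "correctness")]
aggregates = {SCORER: {"avg": 0.9}, "cost": {"avg": 0.002}}
labeled = label_aggregates(aggregates, columns)
```

Raw-string keys such as "cost" have no matching column, so they keep their original key.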
Each MetricAggregates object exposes: avg, min_, max_, sum_, count,
pct, p50, p90, p95, p99, value_distribution.
For boolean metrics, value_distribution holds {"0": count_false, "1": count_true}.
Note: Call refresh first to get the latest values after experiment completion.
Examples
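As a small illustration of the boolean value_distribution shape described above, the following sketch computes the fraction of records where a boolean metric was true. The helper name `true_rate` is hypothetical, and a plain dict stands in for the value_distribution field.

```python
from typing import Optional


def true_rate(value_distribution: dict) -> Optional[float]:
    """Fraction of records where a boolean metric was true, or None if empty."""
    n_false = value_distribution.get("0", 0)
    n_true = value_distribution.get("1", 0)
    total = n_false + n_true
    return n_true / total if total else None


rate = true_rate({"0": 5, "1": 15})  # 15 true out of 20 records
```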
model
monitor_progress
job_id: Optional job ID to monitor. If not provided, will attempt to find the primary job for this experiment.
playground_name
Returns
str | None: Playground name if this is a playground experiment, None otherwise.
project
prompt
prompt_model
query
record_type: The type of records to query (SPAN, TRACE, or SESSION).
filters: A list of filters to apply to the query.
sort: A sort clause to order the query results.
limit: The maximum number of records to return.
starting_token: The token for the next page of results.
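The limit and starting_token parameters suggest token-based pagination. The sketch below shows one way such a loop might look. The response shape (a "records" list plus a "next_starting_token" field) is an assumption, not something this page documents, and `fake_fetch` is a hypothetical in-memory backend standing in for a real query call.

```python
from typing import Callable, Iterator, Optional


def iter_records(fetch_page: Callable[..., dict], limit: int = 100) -> Iterator[dict]:
    """Yield every record by following page tokens until none remain."""
    token: Optional[str] = None
    while True:
        page = fetch_page(limit=limit, starting_token=token)
        yield from page["records"]
        token = page.get("next_starting_token")  # assumed field name
        if token is None:
            break


# Fake backend for the sketch: serves five records, `limit` at a time.
DATA = [{"id": i} for i in range(5)]


def fake_fetch(limit: int, starting_token: Optional[str] = None) -> dict:
    start = int(starting_token or 0)
    end = start + limit
    next_token = str(end) if end < len(DATA) else None
    return {"records": DATA[start:end], "next_starting_token": next_token}


all_ids = [record["id"] for record in iter_records(fake_fetch, limit=2)]
```

Wrapping the loop in a generator keeps memory use bounded by one page at a time, whatever the real response type turns out to be.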
rank
ranking_score
refresh
run
Returns
ExperimentRunResult: Result object with link and status.
Raises
ValueError: If the experiment has not been created yet.
session_columns
set_prompt
prompt: Prompt object, prompt name, or PromptTemplate object.
prompt_name: Name of the prompt template (alternative to prompt parameter).
prompt_id: ID of the prompt template (alternative to prompt parameter).