Which flow do you need?
Authentication
Every request requires two things: the Galileo-API-Key header and the base URL of your Galileo deployment, for example https://api.yourcompany.galileocloud.io.
If any POST returns a 422 with “already exists”, the resource was already created. Use the corresponding GET endpoint to retrieve it by name.
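As a minimal sketch of the two points above, the helpers below build the auth header and a full URL, and detect the "already exists" case so a caller can fall back to a GET. The function names are hypothetical, and the check assumes the 422 response body contains the text "already exists" somewhere in its payload.

```python
BASE_URL = "https://api.yourcompany.galileocloud.io"  # replace with your deployment


def build_headers(api_key: str) -> dict:
    """Header sent with every request; the API key identifies your org."""
    return {"Galileo-API-Key": api_key}


def build_url(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an endpoint path without doubling slashes."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def is_already_exists(status_code: int, body_text: str) -> bool:
    """True when a POST failed only because the resource was already created.

    Assumption: the error text appears verbatim somewhere in the response body.
    """
    return status_code == 422 and "already exists" in body_text.lower()
```

When is_already_exists returns True, the resource can be retrieved by name with the corresponding GET endpoint instead of treating the POST as a failure.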
How Routing Works
Every trace or evaluation result lands somewhere specific in Galileo. Three pieces of information control exactly where:

| What you provide | Proprietary endpoint | OTel endpoint |
|---|---|---|
| Org | Galileo-API-Key header | Galileo-API-Key header |
| Project | project_id in the URL path | projectid header or galileo.project.id resource attribute |
| Destination | experiment_id or log_stream_id in the request body | experimentid/experiment or logstreamid/logstream header |
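The routing table can be sketched as two small helpers, one per ingestion path. The function names are hypothetical. The proprietary path /v2/projects/{PROJECT_ID}/traces and the OTel header names come from this guide; everything else about the request (other body fields, the OTel endpoint URL) is left out.

```python
from typing import Optional, Tuple


def _check_destination(experiment_id: Optional[str], log_stream_id: Optional[str]) -> None:
    """A trace needs exactly one destination: an experiment or a Log Stream."""
    if (experiment_id is None) == (log_stream_id is None):
        raise ValueError("provide exactly one of experiment_id or log_stream_id")


def proprietary_route(project_id: str, experiment_id: Optional[str] = None,
                      log_stream_id: Optional[str] = None) -> Tuple[str, dict]:
    """Project goes in the URL path, destination goes in the request body."""
    _check_destination(experiment_id, log_stream_id)
    path = f"/v2/projects/{project_id}/traces"
    if experiment_id is not None:
        return path, {"experiment_id": experiment_id}
    return path, {"log_stream_id": log_stream_id}


def otel_headers(api_key: str, project_id: str, experiment_id: Optional[str] = None,
                 log_stream_id: Optional[str] = None) -> dict:
    """Org, project and destination are all carried as headers."""
    _check_destination(experiment_id, log_stream_id)
    headers = {"Galileo-API-Key": api_key, "projectid": project_id}
    if experiment_id is not None:
        headers["experimentid"] = experiment_id
    else:
        headers["logstreamid"] = log_stream_id
    return headers
```

Rejecting a call with zero or two destinations up front avoids sending a trace whose routing is ambiguous.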
Step 1: Create a Project
Projects are the top-level container. type: gen_ai is required for experiments.
Response: take id. This is your <PROJECT_ID>, used in every subsequent call.
If the project already exists: POST /v2/projects/paginated lists all projects. Find yours by name and take the id.
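Finding an existing project by name could look like the sketch below. It assumes the listing has already been parsed into a list of objects that each carry id and name; the exact response envelope of the paginated endpoint is not shown in this guide, so the helper works on that list only.

```python
from typing import Iterable, Optional


def find_project_id(projects: Iterable[dict], name: str) -> Optional[str]:
    """Return the id of the project with this exact name, or None if absent.

    Assumption: each project object exposes "id" and "name" keys.
    """
    for project in projects:
        if project.get("name") == name:
            return project["id"]
    return None


# Example with made-up data shaped like the assumed listing.
listing = [
    {"id": "11111111-1111-4111-8111-111111111111", "name": "support-bot"},
    {"id": "22222222-2222-4222-8222-222222222222", "name": "search-eval"},
]
project_id = find_project_id(listing, "search-eval")
```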
Step 2: Configure Metrics
Metrics are org-level resources: they exist independently of any project and can be referenced across experiments and Log Streams. Configure them once and reuse them anywhere in your org. Metric names are unique per type within your org (you can have an LLM-based and a code-based metric with the same name, but not two LLM-based metrics with the same name). Both id and name uniquely identify a metric within its type, but the experiment and Log Stream APIs require the UUID id, not the name. Use POST /v2/scorers/list to look up a metric’s ID by name (shown at the end of this section).
Skip this section entirely if you only want to use Galileo’s built-in preset metrics. Built-in metrics already have scoreable_node_types configured — you don’t need to set anything when referencing them in experiments or Log Streams.
Custom scorers require scoreable_node_types to be set explicitly at creation time. If omitted, the field is null and the scorer is silently excluded when scoring raw traces (it will still work for dataset+prompt experiments). Always include it in your POST /v2/scorers call.
Galileo has two types of custom metrics:
- LLM-based: an LLM judge that evaluates outputs against a prompt you write
- Code-based: a Python function you upload that returns a scored value
LLM-based metric
Call 1: Create the scorer shell.
model_name uses the same alias format as model_alias in prompt templates: Galileo’s display name, not the provider model ID. See GET /llm_integrations/openai/scorer_models for valid values (scorer-eligible models only; reasoning models like o-series are excluded).
The response is large. The only fields you need are id and name: these are your <LLM_SCORER_ID> and <LLM_SCORER_NAME>.
Call 2: Attach the prompt and model config.
The user_prompt can reference the {input} and {output} variables from each trace.
output_type is set at scorer creation, not here. The version endpoint accepts output_type but only uses it to backfill the scorer’s default if one wasn’t set during POST /v2/scorers. If you set it upfront (as above), it’s ignored here. The version itself never stores output_type: the DB column is always null, and responses show "boolean" as a display placeholder regardless of what the scorer actually uses.
Code-based metric
Code-based metrics require a validation step before upload. The scorer function takes a step_object and **kwargs, and its return annotation is one of -> float, -> bool, -> int, or -> str. **kwargs is required to ensure forward compatibility with additional arguments the platform may pass. The step_object has input, output, metadata, and spans attributes.
The type of output depends on who created the trace. For traces with pre-generated outputs, output at the trace level is a plain string. When Galileo calls the LLM via a prompt template, trace-level output is a Message object; access the text via step_object.output.content.
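A scorer that tolerates both output shapes might look like the sketch below. The function name scorer_fn, the keyword-only step_object parameter, and the scoring rule itself (1.0 for a non-empty answer) are assumptions for illustration; only the step_object attributes, the **kwargs requirement, and the return annotation come from the description above. SimpleNamespace stands in for the platform's trace and Message objects.

```python
from types import SimpleNamespace
from typing import Any


def extract_text(output: Any) -> str:
    """Return the output text whether it is a plain string or a Message-like object."""
    if output is None:
        return ""
    if isinstance(output, str):
        return output
    # Message-like object: the text lives on .content
    return str(getattr(output, "content", ""))


def scorer_fn(*, step_object: Any, **kwargs: Any) -> float:
    """Hypothetical metric: 1.0 when the trace produced a non-empty answer."""
    text = extract_text(step_object.output)
    return 1.0 if text.strip() else 0.0


# Stand-ins for the two trace shapes described above.
plain_trace = SimpleNamespace(input="Capital of France?", output="Paris")
message_trace = SimpleNamespace(input="Capital of France?",
                                output=SimpleNamespace(content="Paris"))
empty_trace = SimpleNamespace(input="Capital of France?", output="   ")
```

Normalizing the output in one helper keeps the same scorer usable for dataset experiments with pre-generated outputs and for prompt-template experiments.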
scoreable_node_types controls which node type the scorer runs on. Valid values: trace, llm, retriever, tool, workflow, agent, session. Always set this explicitly: if omitted, the field is null in the database and the scorer will be silently excluded from experiments that ingest raw traces (it only works for dataset+prompt flows). Use ["trace"] for code scorers and ["llm", "chat"] for LLM scorers to cover all experiment types.
Response: take id and name. These are your <CODE_SCORER_ID> and <CODE_SCORER_NAME>.
Take the task_id from the validation response, then poll until status is completed.
validation_result is the object under the top-level "result" key of the poll response, i.e. {"result": {"result_type": "valid", "score_type": "float", ...}}. Serialize that object as a JSON string and pass it in Call 3. Do not pass the inner "result" object or the full poll response; pass exactly the object one level down from the top.
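The extraction step can be sketched as follows. The helper name is hypothetical, and the sample poll response is made up to match the nesting described above: a top-level "result" key whose value itself contains a "result" object.

```python
import json


def extract_validation_result(poll_response: dict) -> str:
    """Serialize the object under the top-level "result" key as a JSON string."""
    if poll_response.get("status") != "completed":
        raise ValueError("validation task has not completed yet; keep polling")
    # One level down only: not the whole response, not the inner "result".
    return json.dumps(poll_response["result"])


# Made-up poll response with the nesting described above.
sample_poll = {
    "status": "completed",
    "result": {"result": {"result_type": "valid", "score_type": "float"}},
}
validation_result = extract_validation_result(sample_poll)
```

The returned value is a string, not a dict, because Call 3 expects the object already serialized.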
Call 3: Upload the code version.
Looking up metric IDs by name
The experiment and Log Stream APIs require metric UUIDs, not names. If you need to retrieve a metric’s ID by name, query POST /v2/scorers/list. Each returned scorer includes id and name. Use the id when referencing scorers in experiment or Log Stream calls.
To look up preset scorers (e.g. context_adherence, correctness), use the same endpoint with "operator": "one_of" and the preset scorer names.
Datasets
Datasets are independent versioned resources: upload once, reference by ID across any experiment. Flows A and B require a dataset. Flow C and the function-based SDK path do not; traces are posted directly at runtime.
Four file formats are supported. Pass ?format=<value> in the query string (defaults to csv if omitted):

| format value | Description |
|---|---|
| csv | CSV (default). Auto-detects encoding. |
| jsonl | Newline-delimited JSON, one object per line. |
| json | JSON array of objects. Top-level must be an array. |
| feather | Apache Arrow Feather binary format. |

| Column | Purpose |
|---|---|
| input | The input passed to the prompt template (or used as trace input) |
| output | Ground truth / expected answer, used by scorers that compare against a reference answer (e.g. correctness). Not the LLM-generated output. Use output when uploading via API: the alias ground_truth is only normalized by the Galileo UI uploader, not the API. Datasets uploaded via API with a ground_truth column will show an empty “Dataset Ground Truth” column in experiment results. |
| generated_output | Pre-generated LLM output. When present and no prompt template is provided, Galileo scores these directly without calling an LLM (Flow B). |
| metadata | Arbitrary metadata for the row |
Note: This endpoint uses multipart/form-data (file upload), not JSON.
Response: take id. This is your <DATASET_ID>. The first uploaded version always has version_index: 1.
If the dataset already exists: GET /v2/datasets lists all datasets. Find yours by name.
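Preparing the CSV before upload can be sketched with the standard library alone. The helper below is hypothetical: it writes only the recognized columns and renames a ground_truth key to output, since the API does not normalize that alias. The upload request itself (multipart form field names and so on) is not shown because this guide does not specify it.

```python
import csv
import io

RECOGNIZED_COLUMNS = ["input", "output", "generated_output", "metadata"]


def build_dataset_csv(rows: list) -> str:
    """Render rows as CSV text using only the columns Galileo recognizes."""
    normalized = []
    for row in rows:
        row = dict(row)  # copy so the caller's data is not mutated
        if "ground_truth" in row and "output" not in row:
            row["output"] = row.pop("ground_truth")
        normalized.append(row)
    # Keep a column only if at least one row provides it.
    fieldnames = [c for c in RECOGNIZED_COLUMNS if any(c in r for r in normalized)]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(normalized)
    return buffer.getvalue()


csv_text = build_dataset_csv([
    {"input": "What is 2+2?", "ground_truth": "4"},
    {"input": "Capital of France?", "ground_truth": "Paris"},
])
```

Adding a generated_output key to each row produces a dataset suitable for Flow B with no other change.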
Step 3: Run an Experiment
Flow A — Prompt Template Evaluation
Galileo runs your prompt template against each row in a dataset and scores the outputs.
Prerequisite: Flow A requires an LLM integration to be configured in Galileo (Settings → Integrations in the UI). Galileo calls the LLM on your behalf using those credentials.
model_alias is Galileo’s display name for the model; it is not the provider’s model ID. "GPT-4o mini" and "gpt-4o-mini" are different strings and only one is accepted. For legacy models they differ; for newer models they often match. To see valid alias strings for your configured integrations, query the model list for your provider. It returns a flat array of valid alias strings, e.g. ["GPT-4o", "GPT-4o mini", "gpt-4.1", "gpt-4.1-mini", ...]. Use the exact string from this list. In the request path, replace openai with your provider (anthropic, azure, mistral, vertex_ai, aws_bedrock, etc.). The same alias format applies to model_name in LLM scorer creation; it is the same value set.
A1. Create a prompt template
If you already have a prompt template in the Galileo UI, retrieve its version ID. Response: take selected_version_id. This is your <PROMPT_TEMPLATE_VERSION_ID>.
To create a new template, the template and its first version are created in a single call.
model_alias is Galileo’s display name for the model, e.g. GPT-4o mini, not the provider’s model ID (gpt-4o-mini). Use the exact string shown in Settings → Integrations for the model you have configured.
Template variable syntax: Prompt templates use Mustache ({{variable}}), not Python format strings. Use {{input}} to reference the dataset input column. Using {input} (single braces) passes the literal string through unchanged, so the LLM will never see the actual question. Note: LLM scorer user_prompt fields use {input} and {output} (single braces); that is a different substitution system.
Response: take selected_version_id. This is your <PROMPT_TEMPLATE_VERSION_ID>.
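The difference between the two brace styles can be illustrated with a simplified stand-in for Mustache variable substitution. This is not a Mustache implementation (no sections, partials, or HTML escaping) and not Galileo's renderer; it only shows why {input} reaches the model as a literal string while {{input}} is replaced by the dataset value.

```python
import re

_DOUBLE_BRACE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_double_brace(template: str, row: dict) -> str:
    """Replace {{name}} with row[name]; unknown names render as empty text."""
    def substitute(match: re.Match) -> str:
        return str(row.get(match.group(1), ""))
    return _DOUBLE_BRACE.sub(substitute, template)


row = {"input": "What is the capital of France?"}

correct = render_double_brace("Answer concisely: {{input}}", row)
wrong = render_double_brace("Answer concisely: {input}", row)
```

Here correct contains the question from the row, while wrong still contains the characters {input}, which is what the LLM would receive.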
A2. Upload a dataset
Follow the Datasets section above if you haven’t already. Flow A requires a dataset with at least an input column. Include an output column if you want ground-truth-based metrics (e.g. correctness) to run.
Take the <DATASET_ID> from the upload response.
A3. Create and trigger the experiment
This single call uploads the experiment configuration and starts execution immediately (trigger: true). Galileo runs the prompt against each dataset row and scores the outputs.
To use Galileo’s built-in metrics, set "scorer_type": "preset" in the scorers array.
Response: take id — this is your <EXPERIMENT_ID>.
Scoring is async. Poll GET /projects/<PROJECT_ID>/runs/<EXPERIMENT_ID>/jobs until every job in the response has a terminal status. Non-terminal statuses are unstarted and in_progress; keep polling while any job has either. Terminal statuses are completed, processed, error, and failed. Do not filter by ?status=in_progress: jobs start in unstarted before a worker picks them up, so that filter can return empty before any work has started.
Note: completed status is set before metric records finish writing to the database. There is an async queue hop between job completion and metrics being queryable (up to ~5 seconds). If the metrics fetch returns empty results immediately after jobs complete, wait briefly and retry.
If a scorer errors, its column will not appear in the UI or aggregate metrics response. Traces will show status_type: "pending" for that metric in the trace search API because the scorer job failed before writing results. Check for error or failed jobs at GET /projects/<PROJECT_ID>/runs/<EXPERIMENT_ID>/jobs to diagnose the root cause.
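The polling rule above can be sketched independently of any HTTP client. fetch_jobs is a hypothetical callable that performs the unfiltered GET on the jobs endpoint and returns the list of job objects; the sketch assumes each job carries a status field and, as a cautious choice, treats an empty list as "not finished yet".

```python
import time
from typing import Callable, List

NON_TERMINAL = {"unstarted", "in_progress"}
FAILURE = {"error", "failed"}


def all_jobs_terminal(jobs: List[dict]) -> bool:
    """True once at least one job exists and none is unstarted or in_progress."""
    return bool(jobs) and all(job["status"] not in NON_TERMINAL for job in jobs)


def wait_for_jobs(fetch_jobs: Callable[[], List[dict]], interval_s: float = 2.0,
                  max_polls: int = 150,
                  sleep: Callable[[float], None] = time.sleep) -> List[dict]:
    """Poll until every job is terminal, then return the final job list."""
    for _ in range(max_polls):
        jobs = fetch_jobs()
        if all_jobs_terminal(jobs):
            return jobs
        sleep(interval_s)
    raise TimeoutError("jobs did not reach a terminal status in time")


def failed_jobs(jobs: List[dict]) -> List[dict]:
    """Jobs whose scorer results will be missing from the metrics response."""
    return [job for job in jobs if job["status"] in FAILURE]
```

After wait_for_jobs returns, a short pause before fetching metrics accounts for the delay between job completion and metrics becoming queryable.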
Flow B — Dataset with Pre-generated Outputs
Your application has already produced LLM outputs. You have them in a structured dataset (CSV). Upload the dataset with a generated_output column, and Galileo scores the outputs directly without calling an LLM.
This is the cleanest path when your outputs are already tabular. If your outputs aren’t in a CSV — whether they’re generated at runtime or already in memory — see Flow C below.
B1. Upload your dataset
Follow the Datasets section above. Your CSV must include a generated_output column alongside input and output.
Take the <DATASET_ID> from the upload response.
B2. Create and trigger the experiment
Same call as Flow A but without prompt_template_version_id or prompt_settings. The presence of generated_output in the dataset signals Galileo to score those values directly.
Response: take id. This is your <EXPERIMENT_ID>.
Scoring is async. Same as Flow A: poll GET /projects/<PROJECT_ID>/runs/<EXPERIMENT_ID>/jobs until all jobs have a terminal status (completed, processed, error, failed), then allow a brief delay before fetching metrics.
If both prompt_template_version_id and generated_output are present, the prompt template takes priority and Galileo calls the LLM; the generated_output column is ignored.
Flow C — Raw Trace Ingestion
You generate outputs at runtime (e.g. from a live service) and POST them directly as traces. No CSV dataset required.
C1. Create an experiment
Response: take id. This is your <EXPERIMENT_ID>.
C2. Register scorers
Scorers are attached to the experiment via a separate call. This is what tells the platform which metrics to compute when traces arrive.
Order matters: Register scorers before ingesting traces. When traces arrive with is_complete: true, Galileo immediately enqueues scoring jobs — those jobs look up registered scorers at processing time. If scorer registration hasn’t completed yet, custom scorers will be silently skipped and only built-in metrics (cost, latency) will run. If that happens, re-calling this endpoint after traces exist will trigger a recompute automatically for the newly added scorers.
C3. Ingest traces
Trace and span id fields must be valid UUID v4. The API rejects non-UUID4 strings with a 422. Generate a UUID4 per trace/span.
is_complete controls whether scoring is triggered for the traces in that request:
- true → traces stored and scoring triggered immediately for this batch
- false → traces stored only, no scoring yet
To ingest in multiple batches, send the earlier batches with is_complete: false, then send the final batch with is_complete: true. Scoring fires once across all accumulated traces when the final batch lands.
dataset_output is the ground truth — include it if you want ground-truth-based scorers to run.
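The batching pattern and the UUID requirement can be sketched together. The payload keys experiment_id and is_complete come from this guide; the "traces" key and the shape of each trace object are assumptions, since the request body is not reproduced here.

```python
import uuid
from typing import List


def new_id() -> str:
    """Generate a UUID v4 string, as required for trace and span ids."""
    return str(uuid.uuid4())


def batch_payloads(traces: List[dict], experiment_id: str, batch_size: int) -> List[dict]:
    """Split traces into request bodies; only the last one sets is_complete to True."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches = [traces[i:i + batch_size] for i in range(0, len(traces), batch_size)]
    return [
        {
            "experiment_id": experiment_id,
            "traces": batch,  # assumed key name for the trace list
            "is_complete": index == len(batches) - 1,
        }
        for index, batch in enumerate(batches)
    ]


# Five made-up traces, each with a fresh UUID v4.
example_traces = [{"id": new_id(), "input": f"question {n}"} for n in range(5)]
payloads = batch_payloads(example_traces, experiment_id="<EXPERIMENT_ID>", batch_size=2)
```

Posting the payloads in order means scoring is triggered once, when the final batch arrives.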
Log Streams
Experiments are batch evaluation runs. For continuous production monitoring, use a Log Stream: traces stream in from live traffic and are scored as they arrive.
LS1. Create a Log Stream
Response: take id. This is your <LOG_STREAM_ID>.
LS2. Attach scorers
Log streams are runs. Use the same scorer-settings endpoint as experiments, with LOG_STREAM_ID as the run ID.
LS3. Ingest traces
Same proprietary endpoint as Flow C but with log_stream_id instead of experiment_id.
LS4. Query trace results
OTel Ingestion
If you have an OpenTelemetry collector in your infrastructure, you can pipe traces directly to Galileo’s OTel endpoint instead of the proprietary /v2/projects/{PROJECT_ID}/traces. The transformation pipeline is the same: both converge on the same trace processing logic. The difference is just ingestion format and routing mechanism.
Endpoint:
Authenticate with the same Galileo-API-Key header.
Routing headers
Where the trace lands is controlled by headers on the OTel collector. All header names are lowercase, no dashes or underscores:

| Header | Value | Effect |
|---|---|---|
| projectid | UUID | Route to this project |
| experimentid | UUID | Send to an existing experiment |
| experiment | string | Send to experiment by name |
| logstreamid | UUID | Send to an existing Log stream |
| logstream | string | Send to Log stream by name (auto-creates if it doesn’t exist) |
| sessionid | string | Associate traces with a session (Log streams only; ignored for experiments) |

Use experimentid/experiment for offline evaluation. Use logstreamid/logstream for production monitoring.
Resource attributes (alternative to headers)
Embed routing in the OTel resource instead of collector headers:

| Attribute | Effect |
|---|---|
| galileo.project.id | Route to project by UUID |
| galileo.project.name | Route to project by name |
| galileo.logstream.id | Route to Log stream by UUID |
| galileo.logstream.name | Route to Log stream by name |

Note: Resource attributes only support Log stream routing. To route OTel traces to an experiment, use the experimentid or experiment HTTP header; there is no resource attribute equivalent.
Session auto-resolution
Galileo extracts session ID automatically from span attributes. No manual sessionid header is needed if your spans carry any of:
- session.id
- gen_ai.conversation.id
- galileo.session.id
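A client-side check for whether spans already carry one of these attributes could look like the sketch below. It only inspects plain attribute dictionaries; the lookup order is an assumption, since this guide does not say which attribute wins when several are present, and the function names are hypothetical.

```python
from typing import Iterable, Mapping, Optional

# Attribute keys listed above; the order here is an assumed precedence.
SESSION_ATTRIBUTE_KEYS = ("session.id", "gen_ai.conversation.id", "galileo.session.id")


def find_session_id(span_attributes: Mapping[str, object]) -> Optional[str]:
    """Return the first session attribute present on a span, or None."""
    for key in SESSION_ATTRIBUTE_KEYS:
        value = span_attributes.get(key)
        if value:
            return str(value)
    return None


def needs_session_header(spans: Iterable[Mapping[str, object]]) -> bool:
    """True when no span carries a session attribute, so sessionid must be set manually."""
    return not any(find_session_id(attributes) for attributes in spans)
```

Running needs_session_header over a sample of exported spans shows whether the collector needs a sessionid header at all.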
Prerequisite: The experiment or Log stream must exist before traces arrive, unless using the logstream header with a name — that auto-creates the Log stream.
Step 4: Fetch Results
Aggregate metrics
Returns aggregate stats per scorer across all traces in the experiment. Use this to answer “did this experiment pass my quality bar” before drilling into individual traces.
Each entry in metrics represents one scorer. Key fields:

| Field | Meaning |
|---|---|
| name | Scorer name |
| data_type | Output type: percentage, boolean, categorical, etc. |
| average | Mean score across all traces. Populated for numeric/percentage scorers, null for boolean/categorical |
| buckets | Distribution histogram. For numeric scorers: quartile ranges ("0.25", "0.50", "0.75", "1.00", "other") with trace counts. For boolean scorers: {"True": N, "False": N}. For categorical: one key per category. |
| roll_up_method | How to summarize the metric, e.g. percentage_true for boolean scorers. null for numeric. |

To identify your scorer entries in the response, filter for roll_up_method != null. LLM scorers also emit sub-entries (<name>_input_tokens, _output_tokens, _total_tokens, _scorer_version_id). These all have roll_up_method: null and can be discarded.
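Separating scorer entries from the token and version sub-entries can be sketched as below. Because the field table lists roll_up_method as null for numeric scorers, this sketch drops sub-entries by their documented name suffixes rather than by roll_up_method; that is a design choice for illustration, not the documented method, and it would also drop a scorer whose own name happens to end in one of those suffixes. The true_rate helper reads percentage_true as True count over total, which is an interpretation of the buckets field.

```python
from typing import List, Optional

SUB_ENTRY_SUFFIXES = ("_input_tokens", "_output_tokens", "_total_tokens", "_scorer_version_id")


def scorer_entries(metrics: List[dict]) -> List[dict]:
    """Keep one entry per scorer by discarding the LLM scorer sub-entries."""
    return [m for m in metrics if not m["name"].endswith(SUB_ENTRY_SUFFIXES)]


def true_rate(entry: dict) -> Optional[float]:
    """Fraction of traces in the "True" bucket of a boolean scorer, if any were scored."""
    buckets = entry.get("buckets") or {}
    total = buckets.get("True", 0) + buckets.get("False", 0)
    return buckets.get("True", 0) / total if total else None


# Made-up response entries shaped like the fields described above.
sample_metrics = [
    {"name": "quality", "data_type": "boolean", "average": None,
     "roll_up_method": "percentage_true", "buckets": {"True": 3, "False": 1}},
    {"name": "quality_input_tokens", "roll_up_method": None, "average": 412.0},
    {"name": "quality_scorer_version_id", "roll_up_method": None},
    {"name": "length_score", "data_type": "percentage", "average": 0.82,
     "roll_up_method": None},
]
kept = scorer_entries(sample_metrics)
```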
Per-trace results
Returns individual traces with per-trace metric values. Useful for finding which inputs scored below a threshold.

| Field | Notes |
|---|---|
| metric_info | Structured per-metric results; use this for parsing. Keyed by scorer name. |
| metric_info[name].status_type | "success" or "pending". "pending" means the scorer job failed before writing; check GET /projects/<PROJECT_ID>/runs/<EXPERIMENT_ID>/jobs for the error. |
| metric_info[name].value | The numeric score. |
| metric_info[name].explanation / rationale | LLM scorer reasoning (present for LLM scorers only). |
| dataset_input / dataset_output | Ground truth from the dataset (Flow A only). Compare against output to compute your own pass/fail logic. |
| next_starting_token | Pass as starting_token in the next request to paginate. |
| metrics | Flat dict version of all metric values. Same data as metric_info, less structured. Useful for quick ad-hoc access. |
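Pagination and the pending check can be sketched without an HTTP client. fetch_page is a hypothetical callable that sends one search request with the given starting_token and returns the parsed response. The "records" key for the trace list, and the assumption that next_starting_token is null or absent on the last page, are not confirmed by this guide.

```python
from typing import Callable, Iterator, List, Optional


def iter_traces(fetch_page: Callable[[Optional[int]], dict]) -> Iterator[dict]:
    """Yield every trace across pages by following next_starting_token."""
    token = None
    while True:
        page = fetch_page(token)
        yield from page.get("records", [])  # assumed key for the trace list
        token = page.get("next_starting_token")
        if token is None:
            break


def pending_metrics(trace: dict) -> List[str]:
    """Names of scorers whose job failed before writing a result for this trace."""
    return [
        name
        for name, info in trace.get("metric_info", {}).items()
        if info.get("status_type") == "pending"
    ]


def below_threshold(trace: dict, scorer_name: str, threshold: float) -> bool:
    """True when the scorer succeeded on this trace and its value is under the threshold."""
    info = trace.get("metric_info", {}).get(scorer_name, {})
    value = info.get("value")
    return info.get("status_type") == "success" and value is not None and value < threshold
```

Traces that report a pending metric are better diagnosed through the jobs endpoint than retried, since the scorer job has already failed.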
Filtering traces
To filter to traces below a score threshold, use a number filter on the scorer’s metric.
Scorer names are used as-is in filter names. If your scorer is named response-quality-scorer, the filter name is metrics/response-quality-scorer: hyphens are preserved, not converted to underscores.
Sorting by metric value
To surface the worst-scoring traces first, sort by the metric ascending.
Filter operator reference
Number filters ("type": "number"), for metrics/<scorer-name>, duration_ns, cost, num_total_tokens:

| Operator | Meaning |
|---|---|
| eq | Equal to |
| ne | Not equal to |
| gt | Greater than |
| gte | Greater than or equal to |
| lt | Less than |
| lte | Less than or equal to |
| between | Inclusive range: "value": [low, high] |

Text filters ("type": "text"), for input, output, name, external_id:

| Operator | Meaning |
|---|---|
| eq | Exact match |
| ne | Not equal |
| contains | Substring match |
| one_of | Value is in list: "value": ["a", "b"] |
| not_in | Value is not in list |

ID filters ("type": "id"), for id, session_id:

| Operator | Meaning |
|---|---|
| eq | Exact UUID match |
| one_of | Match any UUID in list |

Date filters ("type": "date"), for created_at, updated_at:

| Operator | Meaning |
|---|---|
| gt / gte | After a timestamp |
| lt / lte | Before a timestamp |

Multiple filters in the filters array are combined with AND.
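The operator tables can be turned into a small builder that rejects invalid type and operator pairs before a request is sent. The "type", "operator", and "value" keys appear in this guide; the "name" key for the filter name is an assumption, as are the function names.

```python
from typing import Any, Dict

# Operators allowed per filter type, transcribed from the tables above.
OPERATORS = {
    "number": {"eq", "ne", "gt", "gte", "lt", "lte", "between"},
    "text": {"eq", "ne", "contains", "one_of", "not_in"},
    "id": {"eq", "one_of"},
    "date": {"gt", "gte", "lt", "lte"},
}


def make_filter(name: str, filter_type: str, operator: str, value: Any) -> Dict[str, Any]:
    """Build one entry for the filters array, validating the operator first."""
    allowed = OPERATORS.get(filter_type)
    if allowed is None:
        raise ValueError(f"unknown filter type: {filter_type}")
    if operator not in allowed:
        raise ValueError(f"operator {operator!r} is not valid for {filter_type} filters")
    return {"name": name, "type": filter_type, "operator": operator, "value": value}


def metric_below(scorer_name: str, threshold: float) -> Dict[str, Any]:
    """Filter for traces whose score is under the threshold; the name is used as-is."""
    return make_filter(f"metrics/{scorer_name}", "number", "lt", threshold)


# Two filters in one array are combined with AND.
filters = [
    metric_below("response-quality-scorer", 0.5),
    make_filter("input", "text", "contains", "refund"),
]
```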