# Apply Bulk Annotation
Source: https://docs.galileo.ai/api-reference/annotation/apply-bulk-annotation

https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/annotation/ratings

# Create Annotation Rating
Source: https://docs.galileo.ai/api-reference/annotation/create-annotation-rating

https://api.galileo.ai/public/v2/openapi.json put /v2/projects/{project_id}/annotation/templates/{template_id}/traces/{trace_id}/rating

# Create Annotation Template
Source: https://docs.galileo.ai/api-reference/annotation/create-annotation-template

https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/annotation/templates

# Create Log Record Annotation Rating
Source: https://docs.galileo.ai/api-reference/annotation/create-log-record-annotation-rating

https://api.galileo.ai/public/v2/openapi.json put /v2/projects/{project_id}/annotation/templates/{template_id}/records/{record_id}/rating

# Delete Annotation Rating
Source: https://docs.galileo.ai/api-reference/annotation/delete-annotation-rating

https://api.galileo.ai/public/v2/openapi.json delete /v2/projects/{project_id}/annotation/templates/{template_id}/traces/{trace_id}/rating

# Delete Annotation Template
Source: https://docs.galileo.ai/api-reference/annotation/delete-annotation-template

https://api.galileo.ai/public/v2/openapi.json delete /v2/projects/{project_id}/annotation/templates/{template_id}

# Delete Log Record Annotation Rating
Source: https://docs.galileo.ai/api-reference/annotation/delete-log-record-annotation-rating

https://api.galileo.ai/public/v2/openapi.json delete /v2/projects/{project_id}/annotation/templates/{template_id}/records/{record_id}/rating

# Get Annotation Rating
Source: https://docs.galileo.ai/api-reference/annotation/get-annotation-rating

https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/annotation/templates/{template_id}/traces/{trace_id}/rating

# Get Annotation Template
Source: https://docs.galileo.ai/api-reference/annotation/get-annotation-template

https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/annotation/templates/{template_id}

# Get Log Record Annotation Rating
Source: https://docs.galileo.ai/api-reference/annotation/get-log-record-annotation-rating

https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/annotation/templates/{template_id}/records/{record_id}/rating

# List Annotation Templates
Source: https://docs.galileo.ai/api-reference/annotation/list-annotation-templates

https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/annotation/templates

# Reorder Annotation Templates
Source: https://docs.galileo.ai/api-reference/annotation/reorder-annotation-templates

https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/annotation/templates/reorder

# Update Annotation Template
Source: https://docs.galileo.ai/api-reference/annotation/update-annotation-template

https://api.galileo.ai/public/v2/openapi.json patch /v2/projects/{project_id}/annotation/templates/{template_id}

# Create Api Key
Source: https://docs.galileo.ai/api-reference/api_keys/create-api-key

https://api.galileo.ai/public/v2/openapi.json post /v2/users/api_keys

# Delete Api Key
Source: https://docs.galileo.ai/api-reference/api_keys/delete-api-key

https://api.galileo.ai/public/v2/openapi.json delete /v2/users/api_keys/{api_key_id}

# Get Api Keys
Source: https://docs.galileo.ai/api-reference/api_keys/get-api-keys

https://api.galileo.ai/public/v2/openapi.json get /v2/users/{user_id}/api_keys

# Get Token
Source: https://docs.galileo.ai/api-reference/auth/get-token

https://api.galileo.ai/public/v2/openapi.json get /v2/token

# Login Api Key
Source: https://docs.galileo.ai/api-reference/auth/login-api-key

https://api.galileo.ai/public/v2/openapi.json post /v2/login/api_key

# Login Email
Source: https://docs.galileo.ai/api-reference/auth/login-email

https://api.galileo.ai/public/v2/openapi.json post /v2/login

# Login Social
Source: https://docs.galileo.ai/api-reference/auth/login-social

https://api.galileo.ai/public/v2/openapi.json post /v2/login/social

# Refresh Token
Source: https://docs.galileo.ai/api-reference/auth/refresh-token

https://api.galileo.ai/public/v2/openapi.json post /v2/refresh_token

# Saml Acs
Source: https://docs.galileo.ai/api-reference/auth/saml-acs

https://api.galileo.ai/public/v2/openapi.json post /v2/saml/acs

# Saml Login
Source: https://docs.galileo.ai/api-reference/auth/saml-login

https://api.galileo.ai/public/v2/openapi.json get /v2/saml/login

# Saml Metadata
Source: https://docs.galileo.ai/api-reference/auth/saml-metadata

https://api.galileo.ai/public/v2/openapi.json get /v2/saml/metadata

# Verify Email
Source: https://docs.galileo.ai/api-reference/auth/verify-email

https://api.galileo.ai/public/v2/openapi.json post /v2/verify_email

# Autogen Llm Scorer
Source: https://docs.galileo.ai/api-reference/data/autogen-llm-scorer

https://api.galileo.ai/public/v2/openapi.json post /v2/scorers/llm/autogen

Autogenerate an LLM scorer configuration. Returns a Celery task ID that can be used to poll for the autogeneration results.

# Compute Health Score Endpoint
Source: https://docs.galileo.ai/api-reference/data/compute-health-score-endpoint

https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/metrics-testing/{run_id}/health-score

Compute the health score metric for a metrics testing run.

# Create
Source: https://docs.galileo.ai/api-reference/data/create

https://api.galileo.ai/public/v2/openapi.json post /v2/scorers

# Create Code Scorer Version
Source: https://docs.galileo.ai/api-reference/data/create-code-scorer-version

https://api.galileo.ai/public/v2/openapi.json post /v2/scorers/{scorer_id}/version/code

# Create Llm Scorer Version
Source: https://docs.galileo.ai/api-reference/data/create-llm-scorer-version

https://api.galileo.ai/public/v2/openapi.json post /v2/scorers/{scorer_id}/version/llm

# Create Luna Scorer Version
Source: https://docs.galileo.ai/api-reference/data/create-luna-scorer-version

https://api.galileo.ai/public/v2/openapi.json post /v2/scorers/{scorer_id}/version/luna

# Create Preset Scorer Version
Source: https://docs.galileo.ai/api-reference/data/create-preset-scorer-version

https://api.galileo.ai/public/v2/openapi.json post /v2/scorers/{scorer_id}/version/preset

Create a preset scorer version.

# Delete Scorer
Source: https://docs.galileo.ai/api-reference/data/delete-scorer

https://api.galileo.ai/public/v2/openapi.json delete /v2/scorers/{scorer_id}

# Get Scorer
Source: https://docs.galileo.ai/api-reference/data/get-scorer

https://api.galileo.ai/public/v2/openapi.json get /v2/scorers/{scorer_id}

# Get Scorer Version Code
Source: https://docs.galileo.ai/api-reference/data/get-scorer-version-code

https://api.galileo.ai/public/v2/openapi.json get /v2/scorers/{scorer_id}/version/code

# Get Scorer Version Or Latest
Source: https://docs.galileo.ai/api-reference/data/get-scorer-version-or-latest

https://api.galileo.ai/public/v2/openapi.json get /v2/scorers/{scorer_id}/version

# Get Validate Code Scorer Task Result
Source: https://docs.galileo.ai/api-reference/data/get-validate-code-scorer-task-result

https://api.galileo.ai/public/v2/openapi.json get /v2/scorers/code/validate/{task_id}

Poll for a code-scorer validation task result (returns status/result).
The validation job creates an entry in `registered_scorer_task_results` (pending) and the runner will PATCH the internal task-results endpoint when it finishes. This GET allows clients to poll the current task result.

# List All Versions For Scorer
Source: https://docs.galileo.ai/api-reference/data/list-all-versions-for-scorer

https://api.galileo.ai/public/v2/openapi.json get /v2/scorers/{scorer_id}/versions

# List Projects For Scorer Route
Source: https://docs.galileo.ai/api-reference/data/list-projects-for-scorer-route

https://api.galileo.ai/public/v2/openapi.json get /v2/scorers/{scorer_id}/projects

List all projects associated with a specific scorer.

# List Projects For Scorer Version Route
Source: https://docs.galileo.ai/api-reference/data/list-projects-for-scorer-version-route

https://api.galileo.ai/public/v2/openapi.json get /v2/scorers/versions/{scorer_version_id}/projects

List all projects associated with a specific scorer version.

# List Scorers With Filters
Source: https://docs.galileo.ai/api-reference/data/list-scorers-with-filters

https://api.galileo.ai/public/v2/openapi.json post /v2/scorers/list

# List Tags
Source: https://docs.galileo.ai/api-reference/data/list-tags

https://api.galileo.ai/public/v2/openapi.json get /v2/scorers/tags

# Manual Llm Validate
Source: https://docs.galileo.ai/api-reference/data/manual-llm-validate

https://api.galileo.ai/public/v2/openapi.json post /v2/scorers/llm/validate

# Restore Scorer Version
Source: https://docs.galileo.ai/api-reference/data/restore-scorer-version

https://api.galileo.ai/public/v2/openapi.json post /v2/scorers/{scorer_id}/versions/{version_number}/restore

List all scorers.

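The Get Validate Code Scorer Task Result entry above describes a poll-until-finished pattern. A minimal Python sketch of such a loop follows. It is not taken from any Galileo SDK: `poll_task`, `task_result_url`, and the `fetch` and `is_done` callables are hypothetical names, and the host is assumed to be the one that serves the OpenAPI document. The response shape is not specified in this reference, so the caller supplies the predicate that decides when a result is final.

```python
import time
from typing import Any, Callable

# Assumption: the API is served from the same host as the OpenAPI document.
BASE_URL = "https://api.galileo.ai"


def task_result_url(task_id: str) -> str:
    """Fill the documented path template for the task-result endpoint."""
    return f"{BASE_URL}/v2/scorers/code/validate/{task_id}"


def poll_task(
    fetch: Callable[[], Any],
    is_done: Callable[[Any], bool],
    interval: float = 2.0,
    timeout: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fetch() until is_done(result) is true or the timeout elapses.

    fetch is a hypothetical callable that performs the GET and returns the
    decoded response; is_done encodes the (assumed) completion condition,
    for example "status is no longer pending".
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if is_done(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish before the timeout")
        sleep(interval)
```

Passing `fetch` and `sleep` as parameters keeps the loop independent of any particular HTTP client and makes it easy to exercise without network access.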
# Update
Source: https://docs.galileo.ai/api-reference/data/update

https://api.galileo.ai/public/v2/openapi.json patch /v2/scorers/{scorer_id}

# Validate Code Scorer
Source: https://docs.galileo.ai/api-reference/data/validate-code-scorer

https://api.galileo.ai/public/v2/openapi.json post /v2/scorers/code/validate

Validate a code scorer with optional simple input/output test.

# Validate Code Scorer Dataset
Source: https://docs.galileo.ai/api-reference/data/validate-code-scorer-dataset

https://api.galileo.ai/public/v2/openapi.json post /v2/scorers/code/validate/dataset

Validate a code scorer against dataset rows.

# Validate Code Scorer Log Record
Source: https://docs.galileo.ai/api-reference/data/validate-code-scorer-log-record

https://api.galileo.ai/public/v2/openapi.json post /v2/scorers/code/validate/log_record

Validate a code scorer using actual log records.

# Validate Llm Scorer Dataset
Source: https://docs.galileo.ai/api-reference/data/validate-llm-scorer-dataset

https://api.galileo.ai/public/v2/openapi.json post /v2/scorers/llm/validate/dataset

# Validate Llm Scorer Log Record
Source: https://docs.galileo.ai/api-reference/data/validate-llm-scorer-log-record

https://api.galileo.ai/public/v2/openapi.json post /v2/scorers/llm/validate/log_record

# Bulk Delete Datasets
Source: https://docs.galileo.ai/api-reference/datasets/bulk-delete-datasets

https://api.galileo.ai/public/v2/openapi.json delete /v2/datasets/bulk_delete

Delete multiple datasets in bulk. This endpoint allows efficient deletion of multiple datasets at once. It validates permissions for each dataset in the service and provides detailed feedback about successful and failed deletions for each dataset.


Parameters
----------
delete_request : BulkDeleteDatasetsRequest
    Request containing list of dataset IDs to delete (max 100)
ctx : Context
    Request context including authentication information

Returns
-------
BulkDeleteDatasetsResponse
    Details about the bulk deletion operation including:
    - Number of successfully deleted datasets
    - List of failed deletions with reasons
    - Summary message

# Create Dataset
Source: https://docs.galileo.ai/api-reference/datasets/create-dataset

https://api.galileo.ai/public/v2/openapi.json post /v2/datasets

Creates a standalone dataset.

# Create Group Dataset Collaborators
Source: https://docs.galileo.ai/api-reference/datasets/create-group-dataset-collaborators

https://api.galileo.ai/public/v2/openapi.json post /v2/datasets/{dataset_id}/groups

Share a dataset with groups.

# Create User Dataset Collaborators
Source: https://docs.galileo.ai/api-reference/datasets/create-user-dataset-collaborators

https://api.galileo.ai/public/v2/openapi.json post /v2/datasets/{dataset_id}/users

# Delete Dataset
Source: https://docs.galileo.ai/api-reference/datasets/delete-dataset

https://api.galileo.ai/public/v2/openapi.json delete /v2/datasets/{dataset_id}

# Delete Group Dataset Collaborator
Source: https://docs.galileo.ai/api-reference/datasets/delete-group-dataset-collaborator

https://api.galileo.ai/public/v2/openapi.json delete /v2/datasets/{dataset_id}/groups/{group_id}

Remove a group's access to a dataset.

# Delete Prompt Dataset
Source: https://docs.galileo.ai/api-reference/datasets/delete-prompt-dataset

https://api.galileo.ai/public/v2/openapi.json delete /v2/projects/{project_id}/prompt_datasets/{dataset_id}

# Delete User Dataset Collaborator
Source: https://docs.galileo.ai/api-reference/datasets/delete-user-dataset-collaborator

https://api.galileo.ai/public/v2/openapi.json delete /v2/datasets/{dataset_id}/users/{user_id}

Remove a user's access to a dataset.

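The Bulk Delete Datasets description above caps a request at 100 dataset IDs. A client holding a longer list would need to split it before calling the endpoint. The sketch below shows one way to do that in Python. `batch_ids` is a hypothetical helper, not part of any Galileo client, and the request body field that carries the IDs is not specified in this reference, so the sketch only produces the batches.

```python
from typing import Iterator, List, Sequence

# Limit taken from the endpoint description ("max 100").
MAX_IDS_PER_REQUEST = 100


def batch_ids(
    dataset_ids: Sequence[str], size: int = MAX_IDS_PER_REQUEST
) -> Iterator[List[str]]:
    """Yield consecutive slices of dataset_ids, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(dataset_ids), size):
        yield list(dataset_ids[start:start + size])


# Usage: one bulk-delete request would be issued per batch.
ids = [f"dataset-{i}" for i in range(250)]
batch_sizes = [len(batch) for batch in batch_ids(ids)]
```

Because the response reports failed deletions per dataset, a caller would typically collect the failures from each batch rather than stop at the first one.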
# Download Dataset
Source: https://docs.galileo.ai/api-reference/datasets/download-dataset

https://api.galileo.ai/public/v2/openapi.json get /v2/datasets/{dataset_id}/download

# Download Prompt Dataset
Source: https://docs.galileo.ai/api-reference/datasets/download-prompt-dataset

https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/prompt_datasets/{dataset_id}

# Extend Dataset Content
Source: https://docs.galileo.ai/api-reference/datasets/extend-dataset-content

https://api.galileo.ai/public/v2/openapi.json post /v2/datasets/extend

Extends the dataset content.

# Get Dataset
Source: https://docs.galileo.ai/api-reference/datasets/get-dataset

https://api.galileo.ai/public/v2/openapi.json get /v2/datasets/{dataset_id}

# Get Dataset Content
Source: https://docs.galileo.ai/api-reference/datasets/get-dataset-content

https://api.galileo.ai/public/v2/openapi.json get /v2/datasets/{dataset_id}/content

# Get Dataset Synthetic Extend Status
Source: https://docs.galileo.ai/api-reference/datasets/get-dataset-synthetic-extend-status

https://api.galileo.ai/public/v2/openapi.json get /v2/datasets/extend/{dataset_id}

# Get Dataset Version Content
Source: https://docs.galileo.ai/api-reference/datasets/get-dataset-version-content

https://api.galileo.ai/public/v2/openapi.json get /v2/datasets/{dataset_id}/versions/{version_index}/content

# List Dataset Projects
Source: https://docs.galileo.ai/api-reference/datasets/list-dataset-projects

https://api.galileo.ai/public/v2/openapi.json get /v2/datasets/{dataset_id}/projects

# List Datasets
Source: https://docs.galileo.ai/api-reference/datasets/list-datasets

https://api.galileo.ai/public/v2/openapi.json get /v2/datasets

# List Group Dataset Collaborators
Source: https://docs.galileo.ai/api-reference/datasets/list-group-dataset-collaborators

https://api.galileo.ai/public/v2/openapi.json get /v2/datasets/{dataset_id}/groups

List the groups with which the dataset has been shared.

# List Prompt Datasets
Source: https://docs.galileo.ai/api-reference/datasets/list-prompt-datasets

https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/prompt_datasets

# List User Dataset Collaborators
Source: https://docs.galileo.ai/api-reference/datasets/list-user-dataset-collaborators

https://api.galileo.ai/public/v2/openapi.json get /v2/datasets/{dataset_id}/users

List the users with which the dataset has been shared.

# Preview Dataset
Source: https://docs.galileo.ai/api-reference/datasets/preview-dataset

https://api.galileo.ai/public/v2/openapi.json post /v2/datasets/{dataset_id}/preview

# Query Dataset Content
Source: https://docs.galileo.ai/api-reference/datasets/query-dataset-content

https://api.galileo.ai/public/v2/openapi.json post /v2/datasets/{dataset_id}/content/query

# Query Dataset Versions
Source: https://docs.galileo.ai/api-reference/datasets/query-dataset-versions

https://api.galileo.ai/public/v2/openapi.json post /v2/datasets/{dataset_id}/versions/query

# Query Datasets
Source: https://docs.galileo.ai/api-reference/datasets/query-datasets

https://api.galileo.ai/public/v2/openapi.json post /v2/datasets/query

# Update Dataset
Source: https://docs.galileo.ai/api-reference/datasets/update-dataset

https://api.galileo.ai/public/v2/openapi.json patch /v2/datasets/{dataset_id}

# Update Dataset Content
Source: https://docs.galileo.ai/api-reference/datasets/update-dataset-content

https://api.galileo.ai/public/v2/openapi.json patch /v2/datasets/{dataset_id}/content

Update the content of a dataset. The `index` and `column_name` fields are treated as keys tied to a specific version of the dataset. As such, these values are considered immutable identifiers for the dataset's structure. For example, if an edit operation changes the name of a column, subsequent edit operations in the same request should reference the column using its original name.

The `If-Match` header is used to ensure that updates are only applied if the client's version of the dataset matches the server's version. This prevents conflicts from simultaneous updates. The `ETag` header in the response provides the new version identifier after a successful update.

# Update Dataset Version
Source: https://docs.galileo.ai/api-reference/datasets/update-dataset-version

https://api.galileo.ai/public/v2/openapi.json patch /v2/datasets/{dataset_id}/versions/{version_index}

# Update Group Dataset Collaborator
Source: https://docs.galileo.ai/api-reference/datasets/update-group-dataset-collaborator

https://api.galileo.ai/public/v2/openapi.json patch /v2/datasets/{dataset_id}/groups/{group_id}

Update the sharing permissions of a group on a dataset.

# Update Prompt Dataset
Source: https://docs.galileo.ai/api-reference/datasets/update-prompt-dataset

https://api.galileo.ai/public/v2/openapi.json put /v2/projects/{project_id}/prompt_datasets/{dataset_id}

# Update User Dataset Collaborator
Source: https://docs.galileo.ai/api-reference/datasets/update-user-dataset-collaborator

https://api.galileo.ai/public/v2/openapi.json patch /v2/datasets/{dataset_id}/users/{user_id}

Update the sharing permissions of a user on a dataset.

# Upload Prompt Evaluation Dataset
Source: https://docs.galileo.ai/api-reference/datasets/upload-prompt-evaluation-dataset

https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/prompt_datasets

# Upsert Dataset Content
Source: https://docs.galileo.ai/api-reference/datasets/upsert-dataset-content

https://api.galileo.ai/public/v2/openapi.json put /v2/datasets/{dataset_id}/content

Rollback the content of a dataset to a previous version.

# Create Experiment
Source: https://docs.galileo.ai/api-reference/experiment/create-experiment

https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/experiments

Create a new experiment for a project.

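The Update Dataset Content entry above relies on `If-Match` and `ETag` for optimistic concurrency. The following Python sketch shows how a client might construct such a conditional PATCH with the standard library. Several things here are assumptions rather than documented facts: the host is taken from the OpenAPI URL, the `Galileo-API-Key` header name is assumed, the `payload` shape is a placeholder because the edit schema is not given in this reference, and `build_content_update` and `next_etag` are hypothetical helper names. The request is only built, not sent.

```python
import json
import urllib.request
from typing import Any, Mapping, Optional

# Assumption: the API is served from the same host as the OpenAPI document.
BASE_URL = "https://api.galileo.ai"


def build_content_update(
    dataset_id: str, etag: str, payload: Any, api_key: str
) -> urllib.request.Request:
    """Build a PATCH that the server should apply only if `etag` is current."""
    url = f"{BASE_URL}/v2/datasets/{dataset_id}/content"
    headers = {
        "Content-Type": "application/json",
        "If-Match": etag,  # version identifier the client last saw
        "Galileo-API-Key": api_key,  # assumed authentication header name
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="PATCH")


def next_etag(response_headers: Mapping[str, str]) -> Optional[str]:
    """Read the new version identifier from a successful response's headers."""
    return response_headers.get("ETag")


# Usage with placeholder values; the payload shape is illustrative only.
request = build_content_update("my-dataset-id", '"v3"', {"edits": []}, "my-api-key")
```

After a successful update, the value returned by `next_etag` would be sent as `If-Match` on the next edit, so that each request is checked against the version the client actually holds.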
# Delete Experiment
Source: https://docs.galileo.ai/api-reference/experiment/delete-experiment

https://api.galileo.ai/public/v2/openapi.json delete /v2/projects/{project_id}/experiments/{experiment_id}

Delete a specific experiment.

# Experiments Available Columns
Source: https://docs.galileo.ai/api-reference/experiment/experiments-available-columns

https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/experiments/available_columns

Retrieves the column information for experiments.

# Get Experiment
Source: https://docs.galileo.ai/api-reference/experiment/get-experiment

https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/experiments/{experiment_id}

Retrieve a specific experiment.

# Get Experiment Metrics
Source: https://docs.galileo.ai/api-reference/experiment/get-experiment-metrics

https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/experiments/{experiment_id}/metrics

Retrieve metrics for a specific experiment.

# Get Experiments Metrics
Source: https://docs.galileo.ai/api-reference/experiment/get-experiments-metrics

https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/experiments/metrics

Retrieve metrics for all experiments in a project.

# Get Metric Settings
Source: https://docs.galileo.ai/api-reference/experiment/get-metric-settings

https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/experiments/{experiment_id}/metric_settings

# List Experiments
Source: https://docs.galileo.ai/api-reference/experiment/list-experiments

https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/experiments

Retrieve all experiments for a project.

# List Experiments Paginated
Source: https://docs.galileo.ai/api-reference/experiment/list-experiments-paginated

https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/experiments/paginated

Retrieve all experiments for a project with pagination.

# Search Experiments
Source: https://docs.galileo.ai/api-reference/experiment/search-experiments

https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/experiments/search

Search experiments for a project.

# Update Experiment
Source: https://docs.galileo.ai/api-reference/experiment/update-experiment

https://api.galileo.ai/public/v2/openapi.json put /v2/projects/{project_id}/experiments/{experiment_id}

Update a specific experiment.

# Update Metric Settings
Source: https://docs.galileo.ai/api-reference/experiment/update-metric-settings

https://api.galileo.ai/public/v2/openapi.json patch /v2/projects/{project_id}/experiments/{experiment_id}/metric_settings

# Apply Bulk Feedback V2
Source: https://docs.galileo.ai/api-reference/feedback/apply-bulk-feedback-v2

https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/feedback/ratings

# Create Feedback Rating V2
Source: https://docs.galileo.ai/api-reference/feedback/create-feedback-rating-v2

https://api.galileo.ai/public/v2/openapi.json put /v2/projects/{project_id}/feedback/templates/{template_id}/traces/{trace_id}/rating

# Create Feedback Template V2
Source: https://docs.galileo.ai/api-reference/feedback/create-feedback-template-v2

https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/feedback/templates

# Delete Feedback Rating V2
Source: https://docs.galileo.ai/api-reference/feedback/delete-feedback-rating-v2

https://api.galileo.ai/public/v2/openapi.json delete /v2/projects/{project_id}/feedback/templates/{template_id}/traces/{trace_id}/rating

# Delete Feedback Template
Source: https://docs.galileo.ai/api-reference/feedback/delete-feedback-template

https://api.galileo.ai/public/v2/openapi.json delete /v2/projects/{project_id}/feedback/templates/{template_id}

# Get Feedback Rating V2
Source: https://docs.galileo.ai/api-reference/feedback/get-feedback-rating-v2

https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/feedback/templates/{template_id}/traces/{trace_id}/rating

# Get Feedback Template V2
Source: https://docs.galileo.ai/api-reference/feedback/get-feedback-template-v2

https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/feedback/templates/{template_id}

# List Feedback Templates V2
Source: https://docs.galileo.ai/api-reference/feedback/list-feedback-templates-v2

https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/feedback/templates

# Reorder Feedback Templates
Source: https://docs.galileo.ai/api-reference/feedback/reorder-feedback-templates

https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/feedback/templates/reorder

# Update Feedback Template
Source: https://docs.galileo.ai/api-reference/feedback/update-feedback-template

https://api.galileo.ai/public/v2/openapi.json patch /v2/projects/{project_id}/feedback/templates/{template_id}

# Add User To Group
Source: https://docs.galileo.ai/api-reference/groups/add-user-to-group

https://api.galileo.ai/public/v2/openapi.json post /v2/groups/{group_id}/members

# Create Group
Source: https://docs.galileo.ai/api-reference/groups/create-group

https://api.galileo.ai/public/v2/openapi.json post /v2/groups

# Delete Group
Source: https://docs.galileo.ai/api-reference/groups/delete-group

https://api.galileo.ai/public/v2/openapi.json delete /v2/groups/{group_id}

# Delete Group Member
Source: https://docs.galileo.ai/api-reference/groups/delete-group-member

https://api.galileo.ai/public/v2/openapi.json delete /v2/groups/{group_id}/members/{user_id}

# Get Group
Source: https://docs.galileo.ai/api-reference/groups/get-group

https://api.galileo.ai/public/v2/openapi.json get /v2/groups/{group_id}

# Get Group Roles
Source: https://docs.galileo.ai/api-reference/groups/get-group-roles

https://api.galileo.ai/public/v2/openapi.json get /v2/group_roles

# List Current User Groups
Source: https://docs.galileo.ai/api-reference/groups/list-current-user-groups

https://api.galileo.ai/public/v2/openapi.json get /v2/current_user/groups

# List Group Members
Source: https://docs.galileo.ai/api-reference/groups/list-group-members

https://api.galileo.ai/public/v2/openapi.json get /v2/groups/{group_id}/members

# List Groups
Source: https://docs.galileo.ai/api-reference/groups/list-groups

https://api.galileo.ai/public/v2/openapi.json get /v2/groups

# Update Group
Source: https://docs.galileo.ai/api-reference/groups/update-group

https://api.galileo.ai/public/v2/openapi.json patch /v2/groups/{group_id}

# Update Group Member
Source: https://docs.galileo.ai/api-reference/groups/update-group-member

https://api.galileo.ai/public/v2/openapi.json patch /v2/groups/{group_id}/members/{user_id}

# Healthcheck
Source: https://docs.galileo.ai/api-reference/health/healthcheck

https://api.galileo.ai/public/v2/openapi.json get /v2/healthcheck

# Create Group Integration Collaborators
Source: https://docs.galileo.ai/api-reference/integrations/create-group-integration-collaborators

https://api.galileo.ai/public/v2/openapi.json post /v2/integrations/{integration_id}/groups

Share an integration with groups.

# Create or update Anthropic integration
Source: https://docs.galileo.ai/api-reference/integrations/create-or-update-anthropic-integration

https://api.galileo.ai/public/v2/openapi.json put /v2/integrations/anthropic

Create or update an Anthropic integration for this user from Galileo.

# Create or update AWS Bedrock integration
Source: https://docs.galileo.ai/api-reference/integrations/create-or-update-aws-bedrock-integration

https://api.galileo.ai/public/v2/openapi.json put /v2/integrations/aws_bedrock

Create or update an AWS integration for this user from Galileo.

# Create or update AWS SageMaker integration
Source: https://docs.galileo.ai/api-reference/integrations/create-or-update-aws-sagemaker-integration

https://api.galileo.ai/public/v2/openapi.json put /v2/integrations/aws_sagemaker

Create or update an AWS integration for this user from Galileo.

# Create or update Azure integration
Source: https://docs.galileo.ai/api-reference/integrations/create-or-update-azure-integration

https://api.galileo.ai/public/v2/openapi.json put /v2/integrations/azure

Create or update an Azure integration for this user from Galileo.

# Create or update custom integration
Source: https://docs.galileo.ai/api-reference/integrations/create-or-update-custom-integration

https://api.galileo.ai/public/v2/openapi.json put /v2/integrations/custom

# Create or update Databricks integration
Source: https://docs.galileo.ai/api-reference/integrations/create-or-update-databricks-integration

https://api.galileo.ai/public/v2/openapi.json put /v2/integrations/databricks

Create or update a Databricks integration for this user from Galileo.

# Create or update Databricks integration (legacy)
Source: https://docs.galileo.ai/api-reference/integrations/create-or-update-databricks-integration-legacy

https://api.galileo.ai/public/v2/openapi.json put /v2/integrations/databricks/unity-catalog/sql

Create or update a Databricks integration for this user from Galileo.

# Create Or Update Integration Selection
Source: https://docs.galileo.ai/api-reference/integrations/create-or-update-integration-selection

https://api.galileo.ai/public/v2/openapi.json put /v2/integrations/{integration_id}/select

Create or update an integration selection for this user from Galileo.

# Create or update Mistral integration
Source: https://docs.galileo.ai/api-reference/integrations/create-or-update-mistral-integration

https://api.galileo.ai/public/v2/openapi.json put /v2/integrations/mistral

Create or update a Mistral integration for this user from Galileo.

# Create or update NVIDIA integration
Source: https://docs.galileo.ai/api-reference/integrations/create-or-update-nvidia-integration

https://api.galileo.ai/public/v2/openapi.json put /v2/integrations/nvidia

Create or update an NVIDIA integration for this user from Galileo.

# Create or update OpenAI integration
Source: https://docs.galileo.ai/api-reference/integrations/create-or-update-openai-integration

https://api.galileo.ai/public/v2/openapi.json put /v2/integrations/openai

Create or update an OpenAI integration for this user from Galileo.

# Create or update Vegas Gateway integration
Source: https://docs.galileo.ai/api-reference/integrations/create-or-update-vegas-gateway-integration

https://api.galileo.ai/public/v2/openapi.json put /v2/integrations/vegas_gateway

Create or update a Vegas Gateway integration for this user from Galileo.

# Create or update Vertex AI integration
Source: https://docs.galileo.ai/api-reference/integrations/create-or-update-vertex-ai-integration

https://api.galileo.ai/public/v2/openapi.json put /v2/integrations/vertex_ai

Create or update a Google Vertex AI integration for a user.

# Create or update Writer integration
Source: https://docs.galileo.ai/api-reference/integrations/create-or-update-writer-integration

https://api.galileo.ai/public/v2/openapi.json put /v2/integrations/writer

Create or update a Writer integration for a user.

# Create User Integration Collaborators
Source: https://docs.galileo.ai/api-reference/integrations/create-user-integration-collaborators

https://api.galileo.ai/public/v2/openapi.json post /v2/integrations/{integration_id}/users

# Delete Group Integration Collaborator
Source: https://docs.galileo.ai/api-reference/integrations/delete-group-integration-collaborator

https://api.galileo.ai/public/v2/openapi.json delete /v2/integrations/{integration_id}/groups/{group_id}

Remove a group's access to an integration.

# Delete User Integration Collaborator
Source: https://docs.galileo.ai/api-reference/integrations/delete-user-integration-collaborator

https://api.galileo.ai/public/v2/openapi.json delete /v2/integrations/{integration_id}/users/{user_id}

Remove a user's access to an integration.

# Get Databases For Cluster
Source: https://docs.galileo.ai/api-reference/integrations/get-databases-for-cluster

https://api.galileo.ai/public/v2/openapi.json get /v2/integrations/databricks/databases

# Get Databricks Catalogs
Source: https://docs.galileo.ai/api-reference/integrations/get-databricks-catalogs

https://api.galileo.ai/public/v2/openapi.json get /v2/integrations/databricks/catalogs

# Get Integration
Source: https://docs.galileo.ai/api-reference/integrations/get-integration

https://api.galileo.ai/public/v2/openapi.json get /v2/integrations/{name}

Gets the integration data formatted for the specified integration.

# Get Integration Status
Source: https://docs.galileo.ai/api-reference/integrations/get-integration-status

https://api.galileo.ai/public/v2/openapi.json get /v2/integrations/{name}/status

Checks if the integration status is active or not.

# List Available Integrations
Source: https://docs.galileo.ai/api-reference/integrations/list-available-integrations

https://api.galileo.ai/public/v2/openapi.json get /v2/integrations/available

List all of the available integrations to be created in Galileo.

# List Group Integration Collaborators
Source: https://docs.galileo.ai/api-reference/integrations/list-group-integration-collaborators

https://api.galileo.ai/public/v2/openapi.json get /v2/integrations/{integration_id}/groups

List the groups with which the integration has been shared.

# List User Integration Collaborators
Source: https://docs.galileo.ai/api-reference/integrations/list-user-integration-collaborators

https://api.galileo.ai/public/v2/openapi.json get /v2/integrations/{integration_id}/users

List the users with which the integration has been shared.

# Update Group Integration Collaborator
Source: https://docs.galileo.ai/api-reference/integrations/update-group-integration-collaborator

https://api.galileo.ai/public/v2/openapi.json patch /v2/integrations/{integration_id}/groups/{group_id}

Update the sharing permissions of a group on an integration.

# Update User Integration Collaborator
Source: https://docs.galileo.ai/api-reference/integrations/update-user-integration-collaborator

https://api.galileo.ai/public/v2/openapi.json patch /v2/integrations/{integration_id}/users/{user_id}

Update the sharing permissions of a user on an integration.

# Create Log Stream
Source: https://docs.galileo.ai/api-reference/log_stream/create-log-stream

https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/log_streams

Create a new log stream for a project.

# Delete Log Stream
Source: https://docs.galileo.ai/api-reference/log_stream/delete-log-stream

https://api.galileo.ai/public/v2/openapi.json delete /v2/projects/{project_id}/log_streams/{log_stream_id}

Delete a specific log stream.

# Get Log Stream
Source: https://docs.galileo.ai/api-reference/log_stream/get-log-stream

https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/log_streams/{log_stream_id}

Retrieve a specific log stream.

# Get Metric Settings
Source: https://docs.galileo.ai/api-reference/log_stream/get-metric-settings

https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/log_streams/{log_stream_id}/metric_settings

# List Log Streams
Source: https://docs.galileo.ai/api-reference/log_stream/list-log-streams

https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/log_streams

Retrieve all log streams for a project. DEPRECATED in favor of `list_log_streams_paginated`.

# List Log Streams Paginated Source: https://docs.galileo.ai/api-reference/log_stream/list-log-streams-paginated https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/log_streams/paginated Retrieve all log streams for a project paginated. # Search Log Streams Source: https://docs.galileo.ai/api-reference/log_stream/search-log-streams https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/log_streams/search Search log streams for a project. # Update Log Stream Source: https://docs.galileo.ai/api-reference/log_stream/update-log-stream https://api.galileo.ai/public/v2/openapi.json put /v2/projects/{project_id}/log_streams/{log_stream_id} Update a specific log stream. # Update Metric Settings Source: https://docs.galileo.ai/api-reference/log_stream/update-metric-settings https://api.galileo.ai/public/v2/openapi.json patch /v2/projects/{project_id}/log_streams/{log_stream_id}/metric_settings # Get Logstream Insights Token Usages Source: https://docs.galileo.ai/api-reference/logstream-insights/get-logstream-insights-token-usages https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/log_streams/{log_stream_id}/logstream_insights/token_usage # Delete By Metadata Source: https://docs.galileo.ai/api-reference/organization-jobs/delete-by-metadata https://api.galileo.ai/public/v2/openapi.json post /v2/org-jobs/delete-by-metadata Delete traces/sessions across all projects in the organization by metadata filters. This endpoint allows organization administrators to delete traces or sessions that match specific metadata key-value pairs across all projects in their organization. # Get Org Job Status Source: https://docs.galileo.ai/api-reference/organization-jobs/get-org-job-status https://api.galileo.ai/public/v2/openapi.json get /v2/org-jobs/{job_id} Get the status of an organization-level job. This endpoint retrieves the status of jobs that operate at the organization level, such as org-wide data deletion jobs. 
**Authorization**: The job's organization_id must match the user's organization. # Create Group Project Collaborators Source: https://docs.galileo.ai/api-reference/projects/create-group-project-collaborators https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/groups Share a project with groups. # Create Project Source: https://docs.galileo.ai/api-reference/projects/create-project https://api.galileo.ai/public/v2/openapi.json post /v2/projects Create a new project. # Create User Project Collaborators Source: https://docs.galileo.ai/api-reference/projects/create-user-project-collaborators https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/users Share a project with users. # Delete Group Project Collaborator Source: https://docs.galileo.ai/api-reference/projects/delete-group-project-collaborator https://api.galileo.ai/public/v2/openapi.json delete /v2/projects/{project_id}/groups/{group_id} Remove a group's access to a project. # Delete Project Source: https://docs.galileo.ai/api-reference/projects/delete-project https://api.galileo.ai/public/v2/openapi.json delete /v2/projects/{project_id} Deletes a project and all associated runs and objects. Any user with project access can delete a project. Note that `get_project_by_id` calls `user_can_access_project`. # Delete User Project Collaborator Source: https://docs.galileo.ai/api-reference/projects/delete-user-project-collaborator https://api.galileo.ai/public/v2/openapi.json delete /v2/projects/{project_id}/users/{user_id} Remove a user's access to a project. 
# Get Collaborator Roles Source: https://docs.galileo.ai/api-reference/projects/get-collaborator-roles https://api.galileo.ai/public/v2/openapi.json get /v2/collaborator_roles # Get Project Source: https://docs.galileo.ai/api-reference/projects/get-project https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id} # Get Projects V2 Source: https://docs.galileo.ai/api-reference/projects/get-projects-v2 https://api.galileo.ai/public/v2/openapi.json post /v2/projects/paginated Gets projects optimized for V2 with pagination and server-side run counts. # List Group Project Collaborators Source: https://docs.galileo.ai/api-reference/projects/list-group-project-collaborators https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/groups List the groups with which the project has been shared. # List User Project Collaborators Source: https://docs.galileo.ai/api-reference/projects/list-user-project-collaborators https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/users List the users with which the project has been shared. # Update Group Project Collaborator Source: https://docs.galileo.ai/api-reference/projects/update-group-project-collaborator https://api.galileo.ai/public/v2/openapi.json patch /v2/projects/{project_id}/groups/{group_id} Update the sharing permissions of a group on a project. # Update Project Source: https://docs.galileo.ai/api-reference/projects/update-project https://api.galileo.ai/public/v2/openapi.json put /v2/projects/{project_id} # Update User Project Collaborator Source: https://docs.galileo.ai/api-reference/projects/update-user-project-collaborator https://api.galileo.ai/public/v2/openapi.json patch /v2/projects/{project_id}/users/{user_id} Update the sharing permissions of a user on a project. 
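Each entry in this reference pairs an HTTP method with a path template such as `get /v2/projects/{project_id}`. As a minimal sketch of how such an entry can be turned into a request object, the snippet below fills in the path parameters and attaches the `Galileo-API-Key` header described in the API overview. The helper name `build_request`, the sample project ID, and the sample key are hypothetical and only for illustration; the base URL is the hosted one and would differ for a custom deployment. Nothing is sent over the network.

```python
import urllib.request
from urllib.parse import quote

# Hosted base URL; custom deployments use their own API host (assumption for this sketch).
BASE_URL = "https://api.galileo.ai"


def build_request(method: str, path_template: str, api_key: str, **path_params) -> urllib.request.Request:
    """Hypothetical helper: fill a path template and prepare (but not send) a request."""
    # Percent-encode each path parameter so it is safe to place in the URL path.
    encoded = {name: quote(str(value), safe="") for name, value in path_params.items()}
    path = path_template.format(**encoded)
    return urllib.request.Request(
        BASE_URL + path,
        method=method.upper(),
        headers={"Galileo-API-Key": api_key},
    )


# Example: the "Get Project" entry, `get /v2/projects/{project_id}`.
request = build_request("get", "/v2/projects/{project_id}", "my-api-key", project_id="1234")
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would perform the call; request bodies for the `post`, `put`, and `patch` entries are defined in the OpenAPI document linked from each entry and are not sketched here.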
# Invoke Source: https://docs.galileo.ai/api-reference/protect/invoke https://api.galileo.ai/public/v2/openapi.json post /v2/protect/invoke # Get Settings Source: https://docs.galileo.ai/api-reference/run_insights_settings/get-settings https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/runs/{run_id}/insights-settings # Upsert Insights Config Source: https://docs.galileo.ai/api-reference/run_insights_settings/upsert-insights-config https://api.galileo.ai/public/v2/openapi.json patch /v2/projects/{project_id}/runs/{run_id}/insights-settings # Create Or Verify User Source: https://docs.galileo.ai/api-reference/system_users/create-or-verify-user https://api.galileo.ai/public/v2/openapi.json post /v2/system_users Create a new system user with an email and password. If no admin exists (first user), the user will be created as an admin. Otherwise: - User record was already created when the admin invited the user - We should verify the user's email # Create Or Verify User Social Source: https://docs.galileo.ai/api-reference/system_users/create-or-verify-user-social https://api.galileo.ai/public/v2/openapi.json post /v2/system_users/social Create a user using a social login provider. All social users are created with `email_is_verified=True`, don't need to be invited and are by default read-only (unless they are the first user, in which case they are set to admin). 
# Count Sessions Source: https://docs.galileo.ai/api-reference/trace/count-sessions https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/sessions/count # Count Spans Source: https://docs.galileo.ai/api-reference/trace/count-spans https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/spans/count # Count Traces Source: https://docs.galileo.ai/api-reference/trace/count-traces https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/traces/count This endpoint may return a slightly inaccurate count due to the way records are filtered before deduplication. # Create Session Source: https://docs.galileo.ai/api-reference/trace/create-session https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/sessions # Delete Sessions Source: https://docs.galileo.ai/api-reference/trace/delete-sessions https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/sessions/delete Delete all session records that match the provided filters. # Delete Spans Source: https://docs.galileo.ai/api-reference/trace/delete-spans https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/spans/delete Delete all span records that match the provided filters. # Delete Traces Source: https://docs.galileo.ai/api-reference/trace/delete-traces https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/traces/delete Delete all trace records that match the provided filters. 
# Export Records Source: https://docs.galileo.ai/api-reference/trace/export-records https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/export_records # Get Aggregated Trace View Source: https://docs.galileo.ai/api-reference/trace/get-aggregated-trace-view https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/traces/aggregated # Get Session Source: https://docs.galileo.ai/api-reference/trace/get-session https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/sessions/{session_id} # Get Span Source: https://docs.galileo.ai/api-reference/trace/get-span https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/spans/{span_id} # Get Trace Source: https://docs.galileo.ai/api-reference/trace/get-trace https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/traces/{trace_id} # Log Spans Source: https://docs.galileo.ai/api-reference/trace/log-spans https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/spans # Log Traces Source: https://docs.galileo.ai/api-reference/trace/log-traces https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/traces # Metrics Testing Available Columns Source: https://docs.galileo.ai/api-reference/trace/metrics-testing-available-columns https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/metrics-testing/available_columns # Query Custom Metrics Source: https://docs.galileo.ai/api-reference/trace/query-custom-metrics https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/metrics/custom_search # Query Metrics Source: https://docs.galileo.ai/api-reference/trace/query-metrics https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/metrics/search # Query Metrics V2 Source: https://docs.galileo.ai/api-reference/trace/query-metrics-v2 https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/metrics/search/v2 Same as 
/metrics/search but returns metrics with node-type counts: trace (requests_count), session_count, and span_count in aggregate_metrics and in each bucket, similar to /metrics/custom_search. # Query Partial Sessions Source: https://docs.galileo.ai/api-reference/trace/query-partial-sessions https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/sessions/partial_search # Query Partial Spans Source: https://docs.galileo.ai/api-reference/trace/query-partial-spans https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/spans/partial_search # Query Partial Traces Source: https://docs.galileo.ai/api-reference/trace/query-partial-traces https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/traces/partial_search # Query Sessions Source: https://docs.galileo.ai/api-reference/trace/query-sessions https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/sessions/search # Query Spans Source: https://docs.galileo.ai/api-reference/trace/query-spans https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/spans/search # Query Traces Source: https://docs.galileo.ai/api-reference/trace/query-traces https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/traces/search # Recompute Metrics Source: https://docs.galileo.ai/api-reference/trace/recompute-metrics https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/recompute-metrics # Sessions Available Columns Source: https://docs.galileo.ai/api-reference/trace/sessions-available-columns https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/sessions/available_columns # Spans Available Columns Source: https://docs.galileo.ai/api-reference/trace/spans-available-columns https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/spans/available_columns # Traces Available Columns Source: https://docs.galileo.ai/api-reference/trace/traces-available-columns 
https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/traces/available_columns # Update Span Source: https://docs.galileo.ai/api-reference/trace/update-span https://api.galileo.ai/public/v2/openapi.json patch /v2/projects/{project_id}/spans/{span_id} Update a span with the given ID. # Update Trace Source: https://docs.galileo.ai/api-reference/trace/update-trace https://api.galileo.ai/public/v2/openapi.json patch /v2/projects/{project_id}/traces/{trace_id} Update a trace with the given ID. # Create Section Source: https://docs.galileo.ai/api-reference/trends_dashboard/create-section https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/log_streams/{log_stream_id}/trends/sections # Create Widget Source: https://docs.galileo.ai/api-reference/trends_dashboard/create-widget https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/log_streams/{log_stream_id}/trends/widgets # Delete Dashboard Source: https://docs.galileo.ai/api-reference/trends_dashboard/delete-dashboard https://api.galileo.ai/public/v2/openapi.json delete /v2/projects/{project_id}/log_streams/{log_stream_id}/trends/dashboards/{trends_dashboard_id} # Delete Section Source: https://docs.galileo.ai/api-reference/trends_dashboard/delete-section https://api.galileo.ai/public/v2/openapi.json delete /v2/projects/{project_id}/log_streams/{log_stream_id}/trends/sections/{section_id} Delete section. If ungroup=True, keep widgets by moving them to dashboard top-level (clear section_id). 
# Delete Widget Source: https://docs.galileo.ai/api-reference/trends_dashboard/delete-widget https://api.galileo.ai/public/v2/openapi.json delete /v2/projects/{project_id}/log_streams/{log_stream_id}/trends/widgets/{widget_id} # Duplicate Dashboard Source: https://docs.galileo.ai/api-reference/trends_dashboard/duplicate-dashboard https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/log_streams/{log_stream_id}/trends/dashboards/{trends_dashboard_id}/duplicate # Favorite Dashboard Source: https://docs.galileo.ai/api-reference/trends_dashboard/favorite-dashboard https://api.galileo.ai/public/v2/openapi.json post /v2/projects/{project_id}/log_streams/{log_stream_id}/trends/dashboards/{trends_dashboard_id}/favorite # Get Trends Source: https://docs.galileo.ai/api-reference/trends_dashboard/get-trends https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/log_streams/{log_stream_id}/trends # List Dashboards Source: https://docs.galileo.ai/api-reference/trends_dashboard/list-dashboards https://api.galileo.ai/public/v2/openapi.json get /v2/projects/{project_id}/log_streams/{log_stream_id}/trends/dashboards # Unfavorite Dashboard Source: https://docs.galileo.ai/api-reference/trends_dashboard/unfavorite-dashboard https://api.galileo.ai/public/v2/openapi.json delete /v2/projects/{project_id}/log_streams/{log_stream_id}/trends/dashboards/favorite # Update Section Source: https://docs.galileo.ai/api-reference/trends_dashboard/update-section https://api.galileo.ai/public/v2/openapi.json put /v2/projects/{project_id}/log_streams/{log_stream_id}/trends/sections/{section_id} # Update Trends Source: https://docs.galileo.ai/api-reference/trends_dashboard/update-trends https://api.galileo.ai/public/v2/openapi.json put /v2/projects/{project_id}/log_streams/{log_stream_id}/trends # Update Widget Source: https://docs.galileo.ai/api-reference/trends_dashboard/update-widget https://api.galileo.ai/public/v2/openapi.json put 
/v2/projects/{project_id}/log_streams/{log_stream_id}/trends/widgets/{widget_id} # Current User Source: https://docs.galileo.ai/api-reference/users/current-user https://api.galileo.ai/public/v2/openapi.json get /v2/current_user # Delete User Source: https://docs.galileo.ai/api-reference/users/delete-user https://api.galileo.ai/public/v2/openapi.json delete /v2/users/{user_id} # Get User Source: https://docs.galileo.ai/api-reference/users/get-user https://api.galileo.ai/public/v2/openapi.json get /v2/users/{user_id} # Get User Roles Source: https://docs.galileo.ai/api-reference/users/get-user-roles https://api.galileo.ai/public/v2/openapi.json get /v2/user_roles Get all user roles. # Invite Users Source: https://docs.galileo.ai/api-reference/users/invite-users https://api.galileo.ai/public/v2/openapi.json post /v2/invite_users # List Users Paginated Source: https://docs.galileo.ai/api-reference/users/list-users-paginated https://api.galileo.ai/public/v2/openapi.json post /v2/users/all # Update User Source: https://docs.galileo.ai/api-reference/users/update-user https://api.galileo.ai/public/v2/openapi.json put /v2/users/{user_id} # Overview Source: https://docs.galileo.ai/api/getting-started Learn how to get started with the Galileo REST API Galileo provides a public REST API that you can use to interact with the Galileo platform. This guide will help you get started with the Galileo REST API. ## Base API URL The first thing you need to call the Galileo API is the base URL of your Galileo API instance. ### Free or hosted Galileo version If you are using the free or hosted tier of Galileo at [app.galileo.ai](https://app.galileo.ai), then the base API URL is [https://api.galileo.ai](https://api.galileo.ai). ### Custom deployment For custom deployments, you will need your Galileo console URL. You can then replace `console` in it with `api`. 
For example, if your Galileo console URL is `https://console.galileo.myenterprise.com`, then your base URL for the API is `https://api.galileo.myenterprise.com`.

### Verify the Base URL

To verify the base URL of your Galileo API instance, send a `GET` request to the [`healthcheck` endpoint](/api-reference/health/healthcheck).

```bash
curl -X GET https://api.galileo.ai/v2/healthcheck
```

The API version is reported in the response:

```output
➜ curl -X GET https://api.galileo.ai/v2/healthcheck
{"api_version":"1.0.0","message":"🔭 API","version":"1.844.0"}
```

## Authentication

To interact with the public endpoints, you can authenticate your requests with any of the following methods:

### API Key

To authenticate with your [API key](/references/faqs/find-keys#galileo-api-key), include the key in the HTTP headers of your requests.

```json
{ "Galileo-API-Key": "<your-api-key>" }
```

### HTTP Basic Auth

To authenticate with HTTP Basic Auth, include your Base64-encoded username and password in the HTTP headers of your requests.

```json
{ "Authorization": "Basic <base64(username:password)>" }
```

### JWT Token

To authenticate with a JWT token, include the token in the HTTP headers of your requests.

```json
{ "Authorization": "Bearer <your-jwt-token>" }
```

We recommend this method for high-volume requests because it is more secure (tokens expire after 24 hours) and more scalable than using an API key. To generate a JWT token, send a `GET` request to the [`get-token` endpoint](/api-reference/auth/get-token), authenticating with either the API key or HTTP Basic Auth.
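The base URL rule and the three header formats above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not an official client: the function names are hypothetical, the Basic value follows the standard HTTP Basic scheme (Base64 of `username:password`), and the token passed to `bearer_headers` is assumed to have been obtained from the `get-token` endpoint beforehand.

```python
import base64
from urllib.parse import urlsplit, urlunsplit


def api_base_url(console_url: str) -> str:
    """Derive the API base URL by swapping the `console` host label for `api`."""
    parts = urlsplit(console_url)
    labels = parts.netloc.split(".")
    if "console" not in labels:
        raise ValueError(f"no 'console' label in host: {parts.netloc!r}")
    labels[labels.index("console")] = "api"
    # Keep only scheme and host; the base URL carries no path, query, or fragment.
    return urlunsplit((parts.scheme, ".".join(labels), "", "", ""))


def api_key_headers(api_key: str) -> dict[str, str]:
    """Headers for API key authentication."""
    return {"Galileo-API-Key": api_key}


def basic_auth_headers(username: str, password: str) -> dict[str, str]:
    """Headers for HTTP Basic Auth: Base64 of 'username:password'."""
    credentials = f"{username}:{password}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(credentials).decode("ascii")}


def bearer_headers(jwt_token: str) -> dict[str, str]:
    """Headers for JWT authentication (token assumed to come from the get-token endpoint)."""
    return {"Authorization": "Bearer " + jwt_token}


print(api_base_url("https://console.galileo.myenterprise.com"))
print(basic_auth_headers("user", "pass"))
```

Any HTTP client can take these dictionaries as request headers; only one of the three is needed per request.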
# Access Control

Source: https://docs.galileo.ai/concepts/access-control

Control access to projects via role-based access control and groups in Galileo

For organizations requiring role-based access control (RBAC), Galileo supports fine-grained control over granting users different levels of access to the system, as well as organizing users into groups for easily sharing projects.

Some features are only available to customers on paid Galileo plans.

## System-level Roles

There are four roles that a user can be assigned:

* **Admin** - Full access to the organization, including viewing all projects.
* **Manager** (enterprise only) - Can add and remove users.
* **User** - Can create, update, share, and delete projects and resources within projects.
* **Read-only** - Cannot create, update, share, or delete any projects or resources. Limited to view-only permissions.

*Note:* Free users of Galileo can only use the Admin, User, or Read-only roles. [Contact us](https://galileo.ai/contact-sales) to explore a paid plan and get full RBAC.

In table form:

|                                       | Admin | Manager            | User          | Read-only     |
| ------------------------------------- | ----- | ------------------ | ------------- | ------------- |
| View all projects                     |       |                    |               |               |
| Add/delete users                      |       | (excluding admins) |               |               |
| Create groups, invite users to groups |       |                    |               |               |
| Create/update projects                |       |                    |               |               |
| Share projects                        |       |                    |               |               |
| View projects                         | (all) | (only shared)      | (only shared) | (only shared) |

System-level roles are chosen when users are invited to Galileo:

Image shows the pop up when inviting new users to Galileo and the system role options provided

## Groups (enterprise only)

Users can be organized into groups to streamline sharing projects. Currently, groups are only available to customers on paid plans of Galileo.
There are three types of groups:

* **Public** - Group and members are visible to everyone in the organization. Anyone can join.
* **Private** - Group is visible to everyone in the organization. Members are kept private. Access is granted by a group maintainer.
* **Hidden** - Group and its members are hidden from non-members in the organization. Access is granted by a group maintainer.

Within a group, each member has a group role:

* **Maintainer** - Can add and remove members.
* **Member** - Can view other members and shared projects.

## Share Projects

By default, only the project's creator, managers, and admins have access to a project. Projects can be shared both with individual users and with entire groups. Together, these are called *collaborators*.

How to share a project with collaborators:

Image: Share a project within Galileo

# Annotations Overview

Source: https://docs.galileo.ai/concepts/annotations/overview

**Annotations** allow users to provide human feedback on LLM inputs and outputs through the Galileo Console UI and the [API](/api-reference/annotation/create-annotation-template).

Galileo supports annotations on Sessions, Traces, and Spans.

* The Messages page allows you to submit annotations, and highlights available annotations with a dot indicator.
* The Logs page allows you to view and export annotations as columns.

Image: Annotations

For each project, users with the necessary permissions can configure one or more of these annotation types:

* **Categories**: Enable annotators to select one or more designated categories
* **Score**: Enable annotators to select a number between 0 and a max score
* **Star**: Enable annotators to rate from 1 to 5 stars
* **Text**: Enable annotators to provide freeform text
* **Thumbs Up & Down**: Enable annotators to give a like or dislike

The following roles have permissions to configure annotations: organization admins, project owners, or project editors.
The minimum permissions for creating annotations are: organization users, and project annotators. Read-only or viewer roles cannot submit annotations. ## Annotation Queues (Enterprise Beta) Galileo's Annotation Queues enable teams to organize and scale human feedback by grouping project logs (sessions, traces, and spans) for structured review by subject-matter experts. Annotation Queue in the UI Please contact [support@galileo.ai](mailto:support@galileo.ai) to participate in the enterprise beta. # Integration Costs Source: https://docs.galileo.ai/concepts/costs/integration-costs In the Settings menu, **Integration Costs** allows admins to track project costs for [LLM-as-a-judge metrics](/concepts/metrics/how-llm-as-judge-metrics-are-calculated). In the Galileo console UI, navigate to the [Integration Costs page](https://app.galileo.ai/settings/integration-costs) by opening the user menu on the bottom-left corner, and then selecting **Integration Costs**. Integration Costs beta This Integration Costs page is only visible to Admins in the organization. ## Beta feedback request We're working on clarifying integration costs in various parts of the platform. Please let us know if you encounter unexpected data when using this feature -- or if you have other feedback to share. # Model Pricing Settings Source: https://docs.galileo.ai/concepts/costs/model-pricing-settings Galileo's model pricing allows admins to configure model prices that will be used to calculate app and metric costs. Use this feature to have a more accurate view of how your organization's AI agents are impacting budgets. In the Galileo console UI, navigate to the [Model Pricing settings page](https://app.galileo.ai/settings/model-pricing) by opening the user menu on the bottom-left corner, and then selecting **Model Pricing**. Model Pricing menu item Model Pricing settings are visible only to Admins in the organization. 
Admins can: * Browse and search the models found in Galileo projects * View the current prices used to compute app and metric costs * Provide updated prices for any existing model -- or revert to the default price * Add a new model and price Model Pricing table Usage tips: * Updated prices will apply to new logs and experiments using that model. App and metric cost for historical logs and experiments remain unchanged. * If you're a new Admin in the organization, you may need to Sign Out and Sign In again to view the model pricing data. ## Beta feedback request We're working on expanding support for customers' model prices in various parts of the platform. Please let us know if you encounter unexpected data when using this feature -- or if you have other feedback to share. # Compare Experiments Source: https://docs.galileo.ai/concepts/experiments/compare Learn how to compare multiple experiment runs in Galileo Once you have run some experiments, the next natural step is to compare the results of your experiments, allowing you to optimize prompts, select the best model for your use case, or tune your input data to suit your needs. Galileo allows you to compare up to five different experiments, showing the difference in outputs, metrics, latency, and token usage. ## Prerequisites To compare experiments, you will need: * A project containing two or more experiments, created either using [playgrounds](/concepts/experiments/running-experiments-in-console), or in [code](/sdk-api/experiments/running-experiments) ## Compare experiments Experiments can be compared from the Experiments tab in the [Galileo Console](https://app.galileo.ai/). Experiments are part of a project, so select the relevant project to see the experiments tab. 1. Open the **Experiments** tab The experiments tab for a project 2. Select the experiments you want to compare by checking the box next to each experiment. Select between two and five experiments. 
   Image: The experiments tab with check boxes checked on the left of two experiments

3. Select the **Compare experiments** button to open the comparison page.

   Image: The compare experiments button, above the rows of experiments

4. The experiments appear side by side on the comparison page.

   Image: 2 experiments side by side showing metrics, input prompt and output

### Review the comparison

The comparison shows each experiment's metrics, inputs, and outputs.

1. If your experiment has multiple inputs, you can navigate between inputs using the forward and backward buttons. Experiment inputs are aligned by position: for example, if each experiment has two inputs, input one from experiment one is compared to input one from experiment two, and so on.

   > All the experiments in the comparison should have the same number of inputs. If they do not, you can only navigate through as many inputs as the experiment with the fewest inputs has.

   Image: The navigation buttons to navigate between inputs

2. The **Details** section shows the model used, averages and totals for the cost of each response and of generating the metrics, and averages for the metrics.

   > Averages are calculated for experiments with multiple inputs.

   Image: The details tab showing two experiments, one using GPT 3.5 Turbo, the other using GPT-4o mini. Each detail has averages and totals for costs, and averages for metrics

3. The **Metrics** section shows the metrics for the currently selected input. These include system metrics (latency and the number of input and output tokens) and the metrics selected for the experiment.

   Image: Comparing two sets of metrics with latency, number of tokens, instruction adherence, and validate investment advice

   If you hover over a metric, a pop-up explains the reasoning behind the score, along with details of the LLM used to judge, the cost of the judgment, and the number of judges used.

   Image: Hovering over a metric showing an explanation in a popup

4.
The **Input** and **Output** sections show the input to the experiment, and the output generated by the LLM. The inputs and outputs for an experiment # Run Experiments in Playgrounds Source: https://docs.galileo.ai/concepts/experiments/running-experiments-in-console Learn about running experiments in the Galileo console using playgrounds and datasets This section will guide you through the process of running experiments in the Galileo Console. ## Experiment walkthrough Follow these steps to test and improve your AI projects using [Galileo's Console UI](https://app.galileo.ai). In the [Galileo Console](https://app.galileo.ai), use the **drop down menu in the top-left** to select the project you would like to experiment with. Or, create a new project. Then, click the **"Open Playground"** button to access the Galileo Console. app.galileo.ai In the Galileo Console, **select a model** using the "Model" drop down menu. Some models require that you **enter your corresponding API key**. Visit their respective API platforms to obtain your keys, then add it using the [integrations page](https://app.galileo.ai/settings/integrations) in the Galileo Console. Select a Model Click the settings icon **to the right of your model name** to adjust its behavior: * **Max Length:** Sets the maximum number of tokens the model can generate in its output. * **Temperature:** Controls randomness in output—higher values make responses more creative, lower values make them more focused and deterministic. * **Top P:** Limits sampling to the most likely tokens whose cumulative probability is within this threshold (a form of nucleus sampling). * **Frequency Penalty:** Reduces the likelihood of the model repeating the same tokens by penalizing frequent ones. * **Presence Penalty:** Discourages the model from mentioning tokens that have already appeared, promoting new content. 
Image: Configure Model Settings

There are **two ways** to set the prompt data for your experiment:

* **Option 1**: Add prompts and variables through the console UI. Ideal for quick tests.
* **Option 2**: Use datasets from past experiments or create new ones. Ideal for real, fully-configured experiments.

### Add prompt and variables

In the Editor section, **add your prompt**. In your prompt, you can use **variable names wrapped in double curly brackets** (e.g. `{{variable_name}}`). Add new variable options with the **new tab icon** next to "Variable Set", and fill them in with the different values to substitute for your variable.

You can also use **nested variables** by entering JSON-formatted key-value pairs into the "Variable Set" text input field. Then, refer to their values with either `{{key}}` or `{{input.key}}`:

```json
{ "pepperoni": "pepperoni pizza", "anchovy": "pizza with anchovies" }
```

In this example, both `{{pepperoni}}` and `{{input.pepperoni}}` result in "pepperoni pizza" being used in the prompt. This approach is great for testing how changing individual words in a prompt structure affects outputs.

Add new messages beyond the initial prompt with the "+ Add Message" button **below the prompt field**. New messages can be from the user or the model ("system").

Image: Add Variables

### Add dataset

Click the "Add Dataset" button to **choose a dataset** to be used by your model. The datasets listed are from your past experiments. You can also [add your own](/sdk-api/experiments/datasets) by clicking "Create new dataset".

[Learn more about datasets →](/sdk-api/experiments/datasets)

Image: Add Dataset

Click the "+ Add Metric" button to **choose metrics** by which your experiment's outputs are measured. Filter and select from the preset metrics, or add your own by clicking "+ Create New Metric" in the top-right. Scores are produced for each selected metric after running an experiment.
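To make the `{{key}}` / `{{input.key}}` substitution from "Add prompt and variables" concrete, here is a minimal Python sketch of the described behavior. It is not Galileo's implementation: the function name `render_prompt` is hypothetical, and leaving unknown placeholders untouched is an assumption made for this illustration.

```python
import re

# Matches {{name}} and {{input.name}}, capturing only the variable name.
_PLACEHOLDER = re.compile(r"\{\{\s*(?:input\.)?([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def render_prompt(template: str, variable_set: dict[str, str]) -> str:
    """Hypothetical helper: replace placeholders with values from one variable set."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        # Assumption: a placeholder with no matching key is left as written.
        if key not in variable_set:
            return match.group(0)
        return str(variable_set[key])

    return _PLACEHOLDER.sub(substitute, template)


variable_set = {"pepperoni": "pepperoni pizza", "anchovy": "pizza with anchovies"}
prompt = render_prompt("Describe a {{pepperoni}} and a {{input.anchovy}}.", variable_set)
print(prompt)
```

Rendering the same template once per variable set mirrors how each "Variable Set" in the playground yields its own prompt.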
[Learn more about metrics →](/concepts/metrics/overview)

Add Metrics

Add **additional prompt sections** with the "+ Compare Prompt" button in the top-right. Each new prompt section can have its own distinct configuration of:

* Model
* Model settings
* Prompt
* Message conversation

Add new prompt sections and **customize their settings** as needed for your experiment.

Add More Adjustments

Click the **"Run All" button in the top-right** to run your experiments, generate outputs, and calculate evaluations based on your chosen metrics.

Run Experiments

After the experiment has completed, **scroll down** to view its outputs and evaluations. The more distinct prompts and variable sets you used, the more results there will be.

Review Outputs

Click the "Log as Experiment" button above the outputs to **record all the details of the experiment**. Use a **descriptive name** for your experiment so that it's easy to keep track of your progress. [Learn more about logging →](/sdk-api/logging/logging-basics)

Log Results

That's it! Now, further customize and configure your experiment to meet your testing goals. Log your experiment results, and create new projects to try out different configurations.

If you encounter any errors, visit our [Common Errors guide](/references/faqs/errors).

## Experiment settings and options

Continue Experimenting!

1. **Model Select** - choose the model to be used with the prompt or dataset (and enter API keys if necessary).
2. **Model Settings** - adjust model-specific settings.
3. **Message Originator** - select whether the content of the prompt is from a user or from the model itself ("system").
4. **Prompt Entry** - add your prompt for the experiment. Use variables with curly brackets (e.g. `{{variable_name}}`) and add variable values in the variable entry field (#9). [Learn more about prompts and variables →](#add-prompt-and-variables)
5. **Add Message** - use a multi-prompt conversation in your experiment by adding new messages.
6.
**Dataset Select** - instead of entering prompt(s), select a dataset of prompt data structures from a prior project, or add a new one. [Learn more about datasets →](/sdk-api/experiments/datasets) 7. **Metrics Select** - select metrics by which your experiment is evaluated. Select from Galileo's many presets, or create your own metrics. Scores are produced for each metric after running an experiment. [Learn more about metrics →](/concepts/metrics/overview) 8. **Add Variable Set** - add new groups of values for the variables used in your prompt. This adds a new "VARIABLE SET" section along the bottom of the screen. 9. **Variable Value Entry** - set the values of the variables used in your prompt. 10. **Log Experiment** - record your experiment's prompt data, settings, metric evaluations, and outputs. [Learn more about logging →](/sdk-api/logging/logging-basics) 11. **Run Individual Experiment** - run your experiment using its individual prompt data and settings. When using multiple prompts, a "Run All" button appears in the top-right to run all of your experiments. 12. **Add Prompt Section** - add a new prompt section. Each prompt section can be configured with different models and prompts to compare and contrast their outputs and metric evaluations. [Learn more about metrics →](/concepts/metrics/overview) # Log Stream Metrics Source: https://docs.galileo.ai/concepts/logging/configure-metrics/configure-metrics Learn how to configure metrics for Log streams, including managing sampling rates Once you have traces feeding in to a Log stream, you can configure the metrics that you want to evaluate. Metrics are managed at organizational level, including the creation of custom metrics, then are used to evaluate traces at the Log stream level. ## Configure metrics for a Log stream ### Configure metrics through the console To configure metrics, open your Log stream and select the **Configure Metrics** button. 
You will need at least one session in your Log stream to be able to configure metrics. The configure metrics button on the sessions tab This will load the **Configure metrics** pane. The configure metrics pane with the action advancement metric turned on and the switch highlighted, and the save and close button highlighted From here you can filter and search for metrics, then turn on the relevant ones for your Log stream. Once you have the metrics you need turned on, select the **Save and close** button to save your settings. You can also create new custom metrics from this pane, either using an [LLM as a judge](/concepts/metrics/custom-metrics/custom-metrics-ui-llm), or in [code](/concepts/metrics/custom-metrics/custom-metrics-ui-code), then add them to your Log stream. ### Configure metrics in code You can also configure metrics for a Log stream using the Galileo SDKs. ```python Python theme={null} from galileo import GalileoMetrics from galileo.log_streams import enable_metrics # Enable metrics enable_metrics(project_name="MyProject", log_stream_name="MyLogStream", metrics=[GalileoMetrics.context_adherence]) ``` ```typescript TypeScript theme={null} import { enableMetrics, GalileoMetrics } from "galileo"; // Enable metrics await enableMetrics({ projectName: "MyProject", logStreamName: "MyLogStream", metrics: [GalileoMetrics.contextAdherence] }); ``` Set `MyProject` to your project name, and `MyLogStream` to your Log stream name. You can then pass in either the relevant metric enum, or the name of a custom metric. This function will enable just the metrics specified for the Log stream. If you have any other metrics enabled before calling this function, they will be disabled. ## Metric sampling Every evaluation interacts with an LLM (unless you are only using custom code-based metrics), and therefore has an associated cost. 
When your application is in development you will probably want to evaluate every trace that is captured, but once your application is in production and is scaling to hundreds, thousands, or even millions of users you most likely want to reduce your evaluation costs by only evaluating a small sample of the traces that are captured. You can configure metric sampling at a Log stream level. To configure metric sampling rate rules, select the **Metric Sampling** button from the **Configure metrics** pane. The metrics sampling button From here you can configure the metric sampling rates. These rates can be applied to all metrics (including custom code metrics and Luna-2 metrics), or LLM-as-a-judge metrics only. Set the sampling rate you want, then select the **Save** button. The metrics sampling dialog When you configure the sample rates, all traces are captured and visible in Galileo, but metrics will only be evaluated for those traces based off the sample rates. For example, if you set the sampling to 10% and create 100 traces, then all 100 traces will be visible in Galileo, with metrics evaluated for just 10 of them. ### Metric sampling rates The most basic way to set sampling rates is by a percentage for all incoming logs. When you set a percentage, all traces are stored and available in Galileo, but only that percentage of traces will be evaluated. A trace is either evaluated for all configured metrics, or not evaluated. You can configure sampling at a more granular level by adding additional rules based off metadata set at a trace level. For example, if you are onboarding a new customer and want to evaluate all of their logs during the onboarding process, you can add the customer name to your metadata, and set a rule to evaluate 100% of traces that have that customer name in their metadata. The metric sampling dialog showing 100% sampling if customer is set to important customer, otherwise 10% This metadata is set when you start a trace with the Galileo logger. 
```python Python theme={null} logger.start_trace( name="Conversation step", input=user_input, metadata={"customer": "ImportantCustomer"} ) ``` ```typescript TypeScript theme={null} galileoLogger.startTrace({ name: "Conversation step", input: userInput, metadata: { "customer": "ImportantCustomer" } }); ``` These rules are applied in a top-down approach, so the first rule is evaluated and if the metadata matches, then the percentage is used, if not the next rule is evaluated, and so on. Finally if no rules match, the default sampling rate for all traces is used. ## Metric filters Sometimes metrics only make sense for certain spans. For example, if you have a custom metric for verifying the final response to a user from a multi-agent system with multiple LLM spans, you might only want to calculate the metric on the final LLM span that summarizes the results from all the agents. You can filter the spans that a metric is calculated for, based off the span name or span metadata. Metric filtering is configured at the project level, with filtering applying to all Log streams in a project. To configure metric filters, select **Apply filter** from the menu for the metric you want to filter on the **Configure metrics** pane: The apply filter menu option Use the **Add Condition** button to add a condition based off a span name, or span metadata for the span type that the metric evaluates. * For metadata, set the field, the comparison operator, and the value * For the span name, set the comparison operator and the value You can set multiple conditions, and these are combined with an **And** clause, so condition 1 **And** condition 2. The apply filter dialog with a metadata filter for agent is equal to summary agent ## Next steps Explore Galileo's comprehensive metrics framework for evaluating and improving AI system performance across multiple dimensions. Learn how to create evaluation metrics using LLMs to judge the quality of responses. 
Learn how to create, register, and use custom metrics to evaluate your LLM applications. # Multimodal Observability Source: https://docs.galileo.ai/concepts/logging/multimodal-observability Log, inspect, and evaluate images, audio, and documents alongside text in your traces AI applications increasingly process and generate images, audio, and documents. Text-based logs alone no longer capture enough context to debug or evaluate them effectively. A voice agent's transcription can be perfect while the generated audio sounds robotic. A document extraction can return the right fields but miss a table. An image generation can follow the prompt but produce off-brand visuals. Galileo supports logging multimodal content on trace inputs and outputs, giving teams full visibility into what their models received and produced. With multimodal traces, you can: * Inspect the exact media your model received or generated, not a text summary of it * Evaluate inputs and outputs using multimodal LLM-as-a-judge metrics * Replay and debug issues that would be invisible in a transcript alone *** ## Choose a logging method | Method | Use when... | | :-------------------------------------- | :-------------------------------------------------------------------------- | | **GalileoLogger — log an external URL** | Your content is already hosted externally and accessible via URL | | **GalileoLogger — upload local files** | You're working with files on disk and need to upload them directly | | **LangChain handler** | Your app already uses LangChain — multimodal content converts automatically | *** ## Option 1: Log an external URL Use `DataContentBlock` with the `url` field. No encoding required. 
```python Python theme={null} from galileo.logger import GalileoLogger from galileo.schema.content_blocks import TextContentBlock, DataContentBlock logger = GalileoLogger() logger.start_trace( input=[ TextContentBlock(text="Describe this image"), DataContentBlock(modality="image", url="https://example.com/photo.png"), ], project="my-project", ) logger.add_llm_span( input=[{"role": "user", "content": "Describe this image"}], output={"role": "assistant", "content": "It's a cat."}, model="gpt-5", ) logger.conclude(output="It's a cat.") logger.flush() ``` *** ## Option 2: Upload local files Encode local files as base64 and pass them with the `base64` and `mime_type` fields. This works for images, audio, and documents in a single trace. The example below assumes `photo.png`, `recording.wav`, and `report.pdf` are in the same directory as your script: ```python Python theme={null} import base64 from pathlib import Path from galileo.logger import GalileoLogger from galileo.schema.content_blocks import TextContentBlock, DataContentBlock image_b64_data = base64.b64encode(Path("photo.png").read_bytes()).decode() audio_b64_data = base64.b64encode(Path("recording.wav").read_bytes()).decode() pdf_b64_data = base64.b64encode(Path("report.pdf").read_bytes()).decode() logger = GalileoLogger() logger.start_trace( input=[ TextContentBlock(text="Analyze all of these files"), DataContentBlock(modality="image", base64=image_b64_data, mime_type="image/png"), DataContentBlock(modality="audio", base64=audio_b64_data, mime_type="audio/wav"), DataContentBlock(modality="document", base64=pdf_b64_data, mime_type="application/pdf"), ], project="my-project", ) logger.add_llm_span( input=[{"role": "user", "content": "Analyze all of these files"}], output={ "role": "assistant", "content": "The image is a cat, audio is clear, the PDF is a report.", }, model="gpt-5", ) logger.conclude( output="The image is a cat, audio is clear, the PDF is a report." 
) logger.flush() ``` `DataContentBlock` supports three modalities: `image`, `audio`, and `document`. *** ## Option 3: Log with the LangChain handler The LangChain handler converts multimodal message content to structured content blocks automatically. Pass multimodal messages the same way you normally would with LangChain — no extra setup: ```python Python theme={null} from langchain_openai import ChatOpenAI from langchain_core.messages import HumanMessage from galileo.handlers.langchain import GalileoCallback callback = GalileoCallback() llm = ChatOpenAI(model="gpt-5", callbacks=[callback]) response = llm.invoke([ HumanMessage(content=[ {"type": "text", "text": "What's in this image?"}, {"type": "image_url", "image_url": {"url": "https://example.com/photo.png"}}, ]) ]) ``` Supported content types: `text`, `image_url`, `audio_url`, `document_url`, `input_image`, and `input_audio`. Base64 data URIs are also supported — the handler extracts the payload and MIME type automatically. *** ## View multimodal content in your traces An audio trace in the Galileo Log stream showing an inline waveform player in the user input, a text output from the assistant, and audio quality metrics in the side panel Multimodal content renders inline in the Log stream alongside span inputs and outputs: * **Audio** renders as an inline waveform player you can play back directly, with download support * **Images** display inline and can be downloaded * **PDFs** appear as inline previews and can be downloaded *** ## Evaluate multimodal traces Galileo provides out-of-the-box LLM-as-a-judge metrics for multimodal content. You can also configure custom LLM-as-a-judge metrics on any span, trace, or session that contains multimodal content. 
### Out-of-the-box metrics

| Metric                     | Modality    | What it evaluates                                                              |
| :------------------------- | :---------- | :----------------------------------------------------------------------------- |
| **Visual Quality**         | Image / PDF | Whether input quality is sufficient for the task to be reliably performed      |
| **Visual Fidelity**        | Image / PDF | Whether a generated image complies with brand rules, based on visible evidence |
| **Interruption Detection** | Audio       | Turn-taking violations: agent overlap, premature barge-in, and user barge-in   |

### Custom LLM-as-a-judge metrics

The custom metric editor showing Audio modality selected, an LLM model configured, and a judge prompt for evaluating audio quality

1. Go to **Metrics** and create a new custom LLM metric.
2. Configure a model integration. See [suggested models](#suggested-models) below.
3. Under capabilities, select **Image/PDF** or **Audio**.
4. Enable the metric on your Log stream **before** logging content.

Metrics compute only when the trace contains at least one attachment matching the enabled capability. A metric with **Image/PDF** enabled returns N/A if the trace contains only audio, or no attachments at all. Similarly, a metric with **Audio** enabled returns N/A on image-only traces.

***

## Supported formats and models

### Supported formats

| Modality | Formats       |
| :------- | :------------ |
| Image    | `png`, `jpeg` |
| Audio    | `mp3`, `wav`  |
| Document | `pdf`         |

### Suggested models

For best results, use GPT-5 or later (OpenAI) for image and PDF evaluation, and Gemini 3+ via Vertex AI for audio.

If using Vertex AI, you will also need to configure a separate GCP bucket and credentials for file uploads. See [how to set up a Vertex AI integration](https://v2galileo.mintlify.app/api-reference/integrations/create-or-update-vertex-ai-integration#create-or-update-vertex-ai-integration).
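The supported formats and the capability-matching rule described in this section can be sketched as a small pre-flight check that runs before you upload a file. This is a hypothetical helper, not part of the Galileo SDK: the names `classify_attachment` and `metric_applies` are invented here, treating `.jpg` as JPEG is an assumption, and so is mapping the **Image/PDF** capability to the `image` and `document` modalities.

```python
from pathlib import Path

# File extension -> (modality, MIME type), following the supported-formats table.
# Accepting ".jpg" as an alias for JPEG is an assumption of this sketch.
SUPPORTED_FORMATS = {
    ".png": ("image", "image/png"),
    ".jpeg": ("image", "image/jpeg"),
    ".jpg": ("image", "image/jpeg"),
    ".mp3": ("audio", "audio/mpeg"),
    ".wav": ("audio", "audio/wav"),
    ".pdf": ("document", "application/pdf"),
}

# Assumed mapping from a metric capability to the modalities that satisfy it.
CAPABILITY_MODALITIES = {
    "Image/PDF": {"image", "document"},
    "Audio": {"audio"},
}


def classify_attachment(path):
    """Return (modality, mime_type) for a file path, or raise if unsupported."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported attachment format: {suffix or path}")
    return SUPPORTED_FORMATS[suffix]


def metric_applies(capability, attachment_modalities):
    """True if at least one attachment matches the metric's capability.

    When this returns False, the metric would be expected to report N/A.
    """
    wanted = CAPABILITY_MODALITIES[capability]
    return bool(wanted & set(attachment_modalities))


# Example: a trace carrying one image and one audio recording.
modalities = [classify_attachment(name)[0] for name in ("photo.png", "recording.wav")]
image_metric_runs = metric_applies("Image/PDF", modalities)
audio_metric_runs = metric_applies("Audio", modalities)
```

The returned modality and MIME type line up with the `modality` and `mime_type` fields used with `DataContentBlock` in the upload example, so a check like this can reject unsupported files before any base64 encoding happens.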
*** ## Known limitations * **LangChain handler stores the full message list.** The trace's input and output fields contain the full serialized message structure (e.g., `[{"content": [...blocks...], "role": "user"}]`), not bare content blocks. * **Multimodal attachments are not supported via OpenTelemetry or native callbacks** (e.g., Google ADK, CrewAI). Use GalileoLogger or the LangChain/LangGraph callback instead. * **Multimodal metrics are not supported in playground or prompt experiments.** *** ## Next steps Full reference for logging with GalileoLogger. Complete guide to the Galileo LangChain integration. # Overview Source: https://docs.galileo.ai/concepts/logging/overview Core Observability concepts in Galileo ## What is AI Observability Agentic applications are inherently non-deterministic, meaning their behavior cannot be fully predicted or exhaustively tested before deployment. As a result, traditional monitoring approaches fall short in capturing how these systems behave in production. AI observability provides visibility into the unique runtime behavior of AI applications, allowing teams to understand what is happening under the hood, why it is happening, and how it impacts performance and outcomes. ## Core concepts Once instrumented, Galileo captures every session, trace, and span, producing a structured stream of real-time data. * [Log streams](/sdk-api/logging/logging-basics) and [projects](/concepts/projects) organize the data you send to Galileo for a given application or environment. * [Sessions](/concepts/logging/sessions/sessions-overview) group related traces into a complete multi-turn interaction. * [Traces](/sdk-api/logging/galileo-logger#start-a-trace) represent a single turn, request or AI workflow. * [Spans](/sdk-api/logging/galileo-logger#add-spans) capture the individual steps within a trace, such as LLM calls, tool calls, or a retrieval step. 
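The hierarchy described above (sessions contain traces, traces contain spans) can be pictured as a small nested data model. The classes below are purely illustrative and are not the Galileo SDK's types; the field names and the example conversation are assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step inside a trace, such as an LLM call, a tool call, or a retrieval."""
    kind: str
    name: str


@dataclass
class Trace:
    """A single turn or request, made up of one or more spans."""
    input: str
    spans: list[Span] = field(default_factory=list)


@dataclass
class Session:
    """A multi-turn interaction that groups related traces."""
    name: str
    traces: list[Trace] = field(default_factory=list)

    def span_count(self) -> int:
        # Total number of spans across every trace in the session.
        return sum(len(trace.spans) for trace in self.traces)


# A hypothetical two-turn conversation.
example_session = Session(
    name="Support chat",
    traces=[
        Trace(
            input="Where is my order?",
            spans=[Span("tool", "lookup_order"), Span("llm", "answer")],
        ),
        Trace(input="Thanks!", spans=[Span("llm", "answer")]),
    ],
)
```

Each level answers a different question: a span shows what a single step did, a trace shows how one request was handled, and a session shows how the whole conversation unfolded.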
## Getting started

Start with [Instrumentation](/sdk-api/logging/logging-basics) to understand how data is structured in Galileo and how to send logs from your application.

# Sessions Overview

Source: https://docs.galileo.ai/concepts/logging/sessions/sessions-overview

Learn about log sessions in Galileo

A Session is **a collection of Traces, Events, and Spans emitted by your Application**. A Session groups all Traces for one conversation or evaluation run, giving you a bird's-eye view of that LLM workflow.

Imagine you're building an LLM-powered customer service chat application. During development or in production, you will want to see how a multi-turn conversation flows from user prompt, to tool calling, to model response. **Sessions** solve this problem by bundling Log stream traces into a cohesive unit, so you can observe and evaluate an entire agent interaction from start to finish.

## Core concepts

Let's take a look at the building blocks of a session.

### Span → trace → session

* [**Span**](/sdk-api/logging/galileo-logger#add-spans): The smallest logging unit in the system, typically representing a single operation, function call, or request. Each user message, model API call, or model tool usage generates a *Span*.
* [**Trace**](/sdk-api/logging/galileo-logger#start-a-trace): When multiple spans occur as part of a single logical operation (e.g. a request that triggers several downstream calls) they form a *Trace*. Traces allow you to see parent/child relationships among spans.
* **Session**: A collection of one or more traces that together represent an entire interaction, or multi-step evaluation. A Session bundles related traces so that you can analyze an entire workflow end to end, even if it spans multiple services, threads, or agents.

## How do sessions differ from Log streams?

A [**Log stream**](/sdk-api/logging/logging-basics) is a continuous sequence of log entries emitted over time.
Log streams simply capture everything in chronological order, and can contain a mix of spans, traces, and sessions. A *Session*, on the other hand, is a way to group Traces that are logically connected. With Galileo, every Session is stored in a Log stream that you can specify either explicitly or using environment variables.

## How do sessions differ from workflows?

A **Workflow** is a defined sequence of steps or tasks. It may include branching logic, conditional steps, and dependencies. A *Session* can contain one or more *Workflows* if they are part of the same overall evaluation.

## Where can I find my sessions?

Sessions can be viewed in the [Galileo Console](/concepts/logging/sessions/using-sessions#view-your-session). When you create a session, you will usually select the Log stream where it will be stored. If you don't specify one, Galileo will use your default Log stream.

Head over to the [Galileo Console](https://app.galileo.ai) and log in. On your dashboard, select the Log stream where you were sending your session logs. If you didn't specify a unique or new Log stream name, you will find the logs in your **default** Log stream.

Select your Log stream from the list

Selecting the Log stream will bring you to its event records. All logs will be grouped by *Session*, though you can use the control near the top-left of your screen to change the Log stream's event grouping:

Event-group controls for selecting Session, Trace, or Span granularity

Your session should be visible in the table below the controls, especially if you gave it a recognizable name. Select it to view the traces.

Once you select your session, you can see the Traces you captured from your test run as a flowchart. Any tools that were used will also show up as individual Spans. Select the nodes of the flowchart to see their inputs and outputs on the right edge of your screen.
A flowchart showing the nodes in a session Trace Each message from the user and response from the LLM will form a single trace; you can view the contents here in a familiar format, as well as other details like tool calls. Just select the **Messages** tab (shown in the image below) to see a list of traces in the session, along with their child spans. You can select a span to see metrics and other details on the right edge of the screen (not pictured) View Trace messages Use the **Condense Steps** toggle to show only the most relevant spans in a trace. This will include any tool calls made by the LLM! You can learn more about creating and using sessions [in our using sessions guide](/concepts/logging/sessions/using-sessions). ## Conclusion A Session can collect multiple workflow runs and traces into one cohesive view. By using Sessions in your LLM application, you can: 1. Organize logs and metrics for each customer interaction or batch evaluation run, so debugging and analysis become straightforward. 2. Drill down into any step, inspecting the span for tokenization latency or the trace for scoring logic without losing context. 3. Compare multiple chat sessions to track performance improvements. ## Next steps Learn how to [create and use sessions](/concepts/logging/sessions/using-sessions) in Galileo. ## Related resources * [**Using Sessions**](/concepts/logging/sessions/using-sessions) - Create and view sessions in Galileo * [**Log streams**](/sdk-api/logging/logging-basics) - Learn about Log streams in Galileo. * [**Spans**](/sdk-api/logging/galileo-logger#add-spans) - Learn about the building blocks of Traces in Galileo. * [**Traces**](/sdk-api/logging/galileo-logger#start-a-trace) - Learn about Traces, and different ways to create them. 
# Create and Use Sessions Source: https://docs.galileo.ai/concepts/logging/sessions/using-sessions Learn to create and use Sessions in Galileo ## Overview This tutorial will guide you through creating and using a [Session](/concepts/logging/sessions/sessions-overview) in Galileo, using a simple LLM-driven example that you can expand to multiple agents and data sources. It is a quick way to introduce you to logging sessions. By the end of this guide, you will know how to: 1. Initialize a [logging session](/concepts/logging/sessions/sessions-overview) 2. Add events to your session 3. Inspect the Session in the Galileo Console to see all related Traces and Spans. ```mermaid theme={null} --- config: flowchart: curve: linear --- graph TD; __start__([Initialize a Session]) app_logic(Run LLM logic) log_traces([Galileo captures Traces and Spans]) flush([Flush the context]) __end__([

View Session in Galileo Console

]) __start__ --> app_logic app_logic -.-> log_traces app_logic --> flush flush --> __end__ ``` There will be minor differences around starting and flushing the session context, depending on whether you're using the automatic or manual way. We'll cover both below. ## Prerequisites * **Galileo Account**: Ensure you have signed up for a Galileo account. This should provide you with the following values: * `GALILEO_API_KEY`: Your API key * `GALILEO_PROJECT`: The name of your Galileo Project * `GALILEO_LOG_STREAM`: The Log stream where you will save your sessions * `GALILEO_CONSOLE_URL`: Optional. The URL of your Galileo console for custom deployments. If you are using `app.galileo.ai`, you don't need to set this. * **OpenAI API Key**: This example will use OpenAI as the underlying LLM, so you will need an API key from them. In addition, this tutorial assumes you are familiar with: * Simple LLM Apps, and making simple OpenAI completion calls using Python or TypeScript * The [`GalileoLogger`](/sdk-api/logging/galileo-logger) class from the Python or TypeScript SDK ## Project setup Let's take a moment to prepare the development environment. If you already have a project setup with `Galileo`, `LangChain`, and `LangGraph`, you can skip right to [Manage a Session](#manage-a-session). If not, here's an abbreviated quickstart: We'll need the Galileo [Python](/sdk-api/python/sdk-reference) or [TypeScript](/sdk-api/python/sdk-reference) SDK, LangChain, LangGraph, OpenAI, and `dotenv` to pull in variables from your `.env` file. 
Let's start by installing them:

```bash Python theme={null}
pip install "galileo[openai]" langchain langchain-openai langgraph python-dotenv
```

```bash TypeScript theme={null}
npm i -s galileo openai @langchain/langgraph @langchain/core @langchain/openai dotenv
```

Next, create a `.env` file and add in the following variables:

```ini .env theme={null}
# Your Galileo API key
GALILEO_API_KEY="your-galileo-api-key"

# Your Galileo project name
GALILEO_PROJECT="your-galileo-project-name"

# The name of the Log stream you want to use for logging
GALILEO_LOG_STREAM="your-galileo-log-stream"

# Provide the console url below if you are using a
# custom deployment, and not using the free tier, or app.galileo.ai.
# This will look something like "console.galileo.yourcompany.com".
# GALILEO_CONSOLE_URL="your-galileo-console-url"

# OpenAI properties
OPENAI_API_KEY="your-openai-api-key"

# Optional. The base URL of your OpenAI deployment.
# Leave this commented out if you are using the default OpenAI API.
# OPENAI_BASE_URL="your-openai-base-url-here"

# Optional. Your OpenAI organization.
# OPENAI_ORGANIZATION="your-openai-organization-here"
```

Finally, create a main script file (e.g. `main.py` or `main.ts`) where you'll add and run your application logic.

Now we can dive in.

## Manage a session

Recall our objectives from earlier? We'll build a simple application and use it to work through each step. If you're in a hurry, you can jump to the [full code sample here](#full-code-sample), then return to see how it was put together.

### Steps

In your main script, import the following dependencies.
Let's begin by creating a very simple agent using LangGraph and OpenAI: ```python Python {9,17-22} theme={null} from time import time from dotenv import load_dotenv # Galileo dependencies from galileo import GalileoLogger from galileo.handlers.langchain import GalileoCallback # LangChain and LangGraph dependencies from langchain.agents import create_agent from langchain_core.runnables.config import RunnableConfig # Load `.env` variables load_dotenv() # Create a simple assistant for our test (or import one). You can also provide # your agent with tools: the session will log their usage simple_agent = create_agent( name="simple_agent", model="openai:o3-mini", # you can choose any OpenAI model here system_prompt="You are a friendly assistant that answers the user's questions", tools=[], # (OPTIONAL) provide tools to your agent ) ``` ```typescript TypeScript {7,9,16-22} theme={null} import { configDotenv } from "dotenv"; // Galileo dependencies import { GalileoCallback, GalileoLogger } from "galileo"; // LangChain and LangGraph dependencies import { createReactAgent } from "@langchain/langgraph/prebuilt"; import { RunnableConfig } from "@langchain/core/runnables"; import { ChatOpenAI } from "@langchain/openai"; // Load environment variables configDotenv(); // Create a simple assistant for our test (or import one). You can also provide // your agent with tools: the session will log their usage const simpleAgent = createReactAgent({ name: "simpleAssistant", llm: new ChatOpenAI({ model: "o3-mini" }), prompt: "You are a friendly assistant that answers the user's questions", tools: [] // (OPTIONAL) provide tools to your agent }); ``` We'll see `GalileoCallback` and `RunnableConfig` in action later. For now, let's move on to the next step. We'll be using the `GalileoLogger` to manage our logging session. 
Let's create one next:

```python Python theme={null}
# Create a GalileoLogger instance for our session
logger = GalileoLogger()
```

```typescript TypeScript theme={null}
// Create a GalileoLogger instance for our session
const logger = new GalileoLogger();
```

`GalileoLogger` takes some optional arguments: you don't have to provide any of them, but they are listed below so that you can see what is available.

```python Python theme={null}
GalileoLogger(
    project_name: Optional[str]
    """ name of target project for the logger instance """
    log_stream_name: Optional[str]
    """ name of target logstream for the logger instance """
    project_id: Optional[str]
    """ ID of target project for the logger instance """
    log_stream_id: Optional[str]
    """ ID of target logstream for the logger instance """
    experiment_id: Optional[str]
    """ ID of an experiment to which this log session will be linked """
    session_id: Optional[str]
    """ ID of a previous session to which this log session will be linked """
    local_metrics: Optional[list[LocalMetricConfig]]
    """ Locally-defined metrics that should be used on spans/traces from this session: See 'Custom Metrics' for more information """
)
```

```typescript TypeScript theme={null}
new GalileoLogger({
  /** name of target project for the logger instance */
  projectName?: string;
  /** name of target logstream for the logger instance */
  logStreamName?: string;
  /** ID of target project for the logger instance */
  projectId?: string;
  /** ID of target logstream for the logger instance */
  logStreamId?: string;
  /** ID of an experiment to which this log session will be linked */
  experimentId?: string;
  /** ID of a previous session to which this log session will be linked */
  sessionId?: string;
  /** Locally-defined metrics that should be used on spans/traces from this session: See 'Custom Metrics' for more information */
  localMetrics?: LocalMetricConfig[];
  /** Logger mode: "batch" or "streaming". Defaults to "batch" if not set.
- "batch": Batches traces and sends on flush() (default) - "streaming": Enables streaming tracing with immediate updates to backend */ mode?: string; /** Optional callback invoked on `flush()` in batch mode with the payload that would be sent to the API. When set, it runs instead of the default `ingestTraces` call—use for custom delivery. */ ingestionHook?: (request: LogTracesIngestRequest) => Promise | void; }) ``` Our simple application will have a `main` function where everything happens. The first thing we will do in this function is start up a logging session. This will prepare the logger to group all captured events under a single session. Below, we give the session a unique `name` and `external id`. The name helps us find the session more easily in the Galileo Console. The `external id` is to link this session to external tracing: for example, linking to a conversation ID in your chatbot app by an ID created inside that app. You can also pass an optional `metadata` dictionary of string key-value pairs to attach structured information to the session, such as customer IDs, environment names, or application versions. Metadata keys appear as filterable columns in the Sessions table in the Galileo Console. ```python Python {7-14} theme={null} def main(): """Main application logic""" # start a logging session external_id = f"custom_id-{int(time())}" logger.start_session( name="Logger Session Tutorial", external_id=external_id, metadata={ "brand_id": "acme", "environment": "production", }, ) ``` ```typescript TypeScript {5-12} theme={null} /** Main application logic */ async function main() { // Start a logging session const externalId = `custom_id-${Math.round(Date.now() / 1000)}`; await logger.startSession({ name: "Logger Session Tutorial", externalId, metadata: { brand_id: "acme", environment: "production", }, }); } ``` Treat `logger.start_session` like a lifecycle event, and call it before any code you want to monitor. 
The `name`, `external id`, and `metadata` arguments are all optional; `name` and `external id` are recommended. Now you can interact with your LLM. Our very simple application will invoke the LLM with two questions: each question will be a question/answer exchange that generates a `Trace` with child spans in our session. We will also pass a callback handler, which will be called by `LangChain` after each LLM invocation. Here's our full `main` function: you can make this part as complex as you like! ```python Python {9-12, 17, 19-28} theme={null} def main(): """Main application logic""" # start a logging session external_id = f"custom_id-{int(time())}" logger.start_session(name="Logger Session Tutorial", external_id=external_id) # Here's what we will ask the LLM: prompts = [ "Hello! How many minutes are in a year?", "Hello! How far is an Astronomical Unit in kilometers?", ] # Create a LangChain Runnable config object with a LangGraph callback handler: # We will supply the logger instance to ensure that it generates traces in the # correct session agent_config = RunnableConfig(callbacks=[GalileoCallback(galileo_logger=logger)]) for prompt in prompts: # Invoke the LLM with our question: response = simple_agent.invoke( input={"messages": [{"role": "user", "content": prompt}]}, config=agent_config, # pass the RunnableConfig here ) # Print out the LLM's response to confirm that this code block ran: print("Model response:", response["messages"][-1].content.strip()) ``` ```typescript TypeScript {8-11, 16-18, 20-29} theme={null} /** Main application logic */ async function main() { // Start a logging session const externalId = `custom_id-${Math.round(Date.now() / 1000)}`; await logger.startSession({ name: "Logger Session Tutorial", externalId }); // Here's what we will ask the LLM: const prompts = [ "Hello! How many minutes are in a year?", "Hello! How far is an Astronomical Unit in kilometers?" 
]; // Create a LangChain Runnable config object with a LangGraph callback handler. // We will supply the logger instance to ensure that it generates traces in the // correct session const agentConfig: RunnableConfig = { callbacks: [new GalileoCallback(logger)] }; for (const prompt of prompts) { // Invoke the LLM with our question: const result = await simpleAgent.invoke( { messages: [{ role: "user", content: prompt }] }, agentConfig // pass the RunnableConfig here ); // Print out the LLM's response to confirm that this code block ran: console.log("LLM Reply:", result.messages.at(-1)?.content); } } ``` ### The GalileoCallback handler `GalileoCallback` is a callback handler specifically for `LangChain`. It sends the most-recent captured traces to Galileo Console when it is called behind the scenes: your LLM logic determines what traces are generated and/or captured. `GalileoCallback` has a few optional parameters: ```python Python theme={null} GalileoCallback( galileo_logger: Optional[GalileoLogger] = None, """ A `GalileoLogger` instance. Defaults to a global singleton """ start_new_trace: bool = True, """ Start a new trace on next invocation. Defaults to "true" """ flush_on_chain_end: bool = True, """ Flush captured traces after invocation. Defaults to "true" """ ) ``` ```typescript TypeScript theme={null} new GalileoCallback( /** A `GalileoLogger` instance. Defaults to a global singleton */ galileoLogger?: GalileoLogger, /** Start a new trace on next invocation. Defaults to "true" */ startNewTrace?: boolean, /** Flush captured traces after invocation. 
Defaults to "true" */ flushOnChainEnd?: boolean ) ``` ### Full code sample Here's everything we have done so far: ```python Python theme={null} from time import time from dotenv import load_dotenv # Galileo dependencies from galileo import GalileoLogger from galileo.handlers.langchain import GalileoCallback # LangChain and LangGraph dependencies from langchain.agents import create_agent from langchain_core.runnables.config import RunnableConfig # Load `.env` variables load_dotenv() # Create a simple assistant for our test (or import one). You can also provide # your agent with tools: the session will log their usage simple_agent = create_agent( name="simple_agent", model="openai:o3-mini", # you can choose any OpenAI model here system_prompt="You are a friendly assistant that answers the user's questions", tools=[], # (OPTIONAL) provide tools to your agent ) # Create a GalileoLogger instance logger = GalileoLogger() def main(): """Main application logic""" # start a logging session external_id = f"custom_id-{int(time())}" logger.start_session(name="Logger Session Tutorial", external_id=external_id) # Here's what we will ask the LLM: prompts = [ "Hello! How many minutes are in a year?", "Hello! 
How far is an Astronomical Unit in kilometers?", ] # Create a LangChain Runnable config object with a LangGraph callback handler: # We will supply the logger instance to ensure that it generates traces in the # correct session agent_config = RunnableConfig(callbacks=[GalileoCallback(galileo_logger=logger)]) for prompt in prompts: # Invoke the LLM with our question: response = simple_agent.invoke( input={"messages": [{"role": "user", "content": prompt}]}, config=agent_config, # pass the RunnableConfig here ) # Print out the LLM's response to confirm that this code block ran: print("Model response:", response["messages"][-1].content.strip()) if __name__ == "__main__": main() ``` ```typescript TypeScript theme={null} import { configDotenv } from "dotenv"; // Galileo dependencies import { GalileoCallback, GalileoLogger } from "galileo"; // LangChain and LangGraph dependencies import { createReactAgent } from "@langchain/langgraph/prebuilt"; import { RunnableConfig } from "@langchain/core/runnables"; import { ChatOpenAI } from "@langchain/openai"; // Load environment variables configDotenv(); // Create a simple assistant for our test (or import one). You can also provide // your agent with tools: the session will log their usage const simpleAgent = createReactAgent({ name: "simpleAgent", llm: new ChatOpenAI({ model: "o3-mini" }), prompt: "You are a friendly assistant that answers the user's questions", tools: [] // (OPTIONAL) provide tools to your agent }); // Create a GalileoLogger instance for our session const logger = new GalileoLogger(); /** Main application logic */ async function main() { // Start a logging session const externalId = `custom_id-${Math.round(Date.now() / 1000)}`; await logger.startSession({ name: "Logger Session Tutorial", externalId }); // Here's what we will ask the LLM: const prompts = [ "Hello! How many minutes are in a year?", "Hello! How far is an Astronomical Unit in kilometers?" 
  ];

  // Create a LangChain Runnable config object with a LangGraph callback handler.
  // We will supply the logger instance to ensure that it generates traces in the
  // correct session
  const agentConfig: RunnableConfig = {
    callbacks: [new GalileoCallback(logger)]
  };

  for (const prompt of prompts) {
    // Invoke the LLM with our question:
    const result = await simpleAgent.invoke(
      { messages: [{ role: "user", content: prompt }] },
      agentConfig // pass the RunnableConfig here
    );

    // Print out the LLM's response to confirm that this code block ran:
    console.log("LLM Reply:", result.messages.at(-1)?.content);
  }
}

main();
```

### Run your script

That's all the code: we have now learned to call `logger.start_session` before starting an LLM chat session, and to supply `GalileoCallback` so that your traces are sent to the Galileo Console. Now let's run the script:

```bash Python theme={null}
python main.py
```

```bash TypeScript theme={null}
npx tsx main.ts
```

You should see the LLM's response in your terminal! You can also head to the [Galileo Console](https://app.galileo.ai) to view the newly-created session. (Shown below)

## View your session

Now that you've logged a session, it's time to view results. Head over to the [Galileo Console](https://app.galileo.ai) and log in.

On your dashboard, select the Log stream where you were sending your session logs. If you didn't specify a unique or new Log stream name, you will find the logs in your **default** Log stream.

Select your Log stream from the list

Selecting the Log stream will bring you to its event records. All logs will be grouped by *Session*, though you can use the control near the top-left of your screen to change the Log stream's event grouping:

Event-group controls for selecting Session, Trace, or Span granularity

Your session should be visible in the table below the controls, especially if you gave it a recognizable name. Select it to view the traces.
Once you select your session, you can see the Traces you captured from your test run as a flowchart. Any tools that were used will also show up as individual Spans. Select the nodes of the flowchart to see their inputs and outputs on the right-edge of your screen. A flowchart showing the nodes in a session Trace Each message from the user and response from the LLM will form a single trace; you can view the contents here in a familiar format, as well as other details like tool calls. Just select the **Messages** tab (shown in the image below) to see a list of traces in the session, along with their child spans. You can select a span to see metrics and other details on the right edge of the screen (not pictured) View Trace messages Use the **Condense Steps** toggle to show only the most relevant spans in a trace. This will include any tool calls made by the LLM! ## Additional considerations Remember to always use the same `GalileoLogger` instance across your project. This ensures that all captured events are placed in the same session. You can achieve this in a few ways: 1. Export your `logger` instance from a separate module, so that your application uses a singleton instance. 2. 
Use the TypeScript SDK's `getLogger` function, or the Python SDK's `galileo_context` context manager for a consistent reference: ```python Python theme={null} from galileo import galileo_context # Create a new session (with optional metadata) galileo_context.start_session( name="My Session", metadata={"brand_id": "acme", "environment": "production"}, ) # Application logic follows # Flush the session (if you are not using galileo callback or "with galileo_context()") galileo_context.flush() ``` ```typescript TypeScript theme={null} import { getLogger } from 'galileo'; const logger = getLogger(); // Create a new session (with optional metadata) logger.startSession({ name: "My Session", metadata: { brand_id: "acme", environment: "production" }, }); // Application logic follows // Flush the session (if you are not using GalileoCallback) logger.flush() ``` 3. You can also add `Traces` wherever you see fit. A `Trace` might represent a question asked to your LLM, and the response generated for it — as well as any tools used! Galileo will generate traces for you, but you can also create new ones by using your logger instance: ```python Python theme={null} question = "What is the meaning of plenipotentiary?" logger.start_trace(input=question) logger.add_llm_span( input=question, output="Plenipotentiary means 'Invested with full power'" ) logger.conclude({}) # end the trace ``` ```typescript TypeScript theme={null} const question = "What is the meaning of plenipotentiary?" logger.startTrace({ input: question }) logger.addLlmSpan({ input: question, output: "Plenipotentiary means 'Invested with full power'" }) logger.conclude({}); // end the trace ``` You can learn more about traces and how to use them [in our logging guide](/sdk-api/logging/galileo-logger#start-a-trace). ## Conclusion In this tutorial, you learned how to: 1. Create a logging session with the `GalileoLogger` class 2. Manually start your own session with the `logger.start_session()` method 3. 
View your sessions in the Galileo Console. ## Next steps For a more detailed walkthrough of a multi-agent application, take a look at [Monitoring LangChain Agents with Galileo](/cookbooks/use-cases/agent-langchain). You can also learn more about using [Galileo's metrics](/concepts/metrics/overview) to gain more insight about your AI application. ## Related resources * [Sessions](/concepts/logging/sessions/sessions-overview) - An overview of sessions * [Galileo Context](/sdk-api/logging/galileo-context) - Learn about the Galileo Context Manager * [Monitoring LangChain Agents with Galileo](/cookbooks/use-cases/agent-langchain) - Follow this cookbook recipe to create and evaluate a multi-agent application. # Fine-Tuning Luna-2 Models Source: https://docs.galileo.ai/concepts/luna/fine-tuning Understand the requirements and process for fine-tuning Luna-2 models based off your real-world scenarios One big advantage of the Luna-2 model is the ability for Galileo to fine-tune the model for your specific use case, either for 'out-of-the-box' metrics, or custom metrics. These models are fine-tuned for specific customer use cases by the team at Galileo. Contact us to learn more about fine-tuning Luna-2 for your use case. Due to the complex nature of fine-tuning, the process is performed by the team at Galileo. This is not a self-service capability. ## Requirements for fine-tuning process These are the requirements for the fine-tuning process: 1. **Define the metric objective and use case** Clearly define the desired outcome from your perspective. This objective will guide the selection of the most appropriate flow and approach. 2. **Create a test dataset** The test dataset is the most crucial piece for any fine-tuning work. 300-500 samples is a decent number for a test set with a diverse set of samples, with at least 100 samples of each class. If 300 samples is hard to get to, 150 is a bare minimum. The test set **must be manually labelled** to ensure high quality. 
The format can be a spreadsheet/csv with input, output, label and explanations on the label if possible. If you are already using a Galileo metric on the data, these numbers will also help. 3. **Latency and Load Requirements** Specify the maximum acceptable latency for the given metric and its use case (online observability, run time protection etc.). The latency requirements should include QPS and expected input token size. Both these numbers should be provided for average and peak loads. This requirement will influence the choice of flow and may necessitate trade-offs with other factors. 4. **Constraints** Identify any limitations or restrictions that may impact the design or implementation of the flow. These constraints could include technical, resource, or regulatory limitations. ## Approaches There are 3 different approaches that Galileo takes with fine-tuning, depending on the metric you are interested in, and your dataset. | Approach | When to use | | :--------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | Use a preset SLM metric | The required metric aligns in definition with a metric offered by Galileo. Accuracy on your dataset is good enough. | | Fine-tune preset SLM metric | The required metric aligns in definition with a metric offered by Galileo. LLM-as-a-judge variant of the metric works well (with or without CLHF). Preset SLM variant performance is poor. | | Create a new customized SLM metric | The required metric is not offered by Galileo out-of-the-box. 
| ```mermaid theme={null} flowchart TD A([Is required metric offered by Galileo?]) -->|Yes| B([Is Preset SLM accurate enough?]) A -->|No| C[Use approach 3 - create custom SLM metric] B -->|Yes| D[Use approach 1 - enable preset SLM metric] B -->|No| E[Use approach 2 - fine-tune Preset SLM] ``` ## Fine-tuning requirements After working through the above approaches, if fine-tuning is the best approach, you will need to provide a training dataset. * If you can provide this, we would need around 4,000 total labeled samples, with a 50/50 split amongst the classes (e.g. 2,000 context adherent samples, 2,000 non-context adherent samples). * If you are unable to procure this many labeled samples, these can be synthetically generated via LLMs approved by you. Description of model(s) used for synthetic data generation would then be explained in a generalized model card. ## Turnaround time These are rough estimates for the turnaround time for a fine-tuned metric, based on data analysis and training time. Your Galileo contact can provide more details. ### Model fine-tuning To fine tune your model, ensure: * The objective for the new metric is clearly understood * The test Dataset is quality checked by the applied data science team The estimated timings are: * **Existing metric fine-tuning:** * If your metric has the same definition as our metric, and just needs to work on your data: **2-3 days** * If there is a slight change in definition on the metric: **3-4 days** * **New metric:** **4-5 days** These times can vary based on the data/latency/metric requirements, so would ideally need at least a week (5 days) for any metric fine-tuning after all the prerequisites are fulfilled. 
### Model deployment To deploy your model, ensure: * The new fine-tuned model is approved for use internally * The model is integrated into the Galileo cluster by the applied data science team The estimated timings are: * **Replace existing metric**: Deployment can be done in **1-2 days** * **New custom metric**: This is more involved, the time to completion would be defined on a case-by-case basis # Luna-2 Overview Source: https://docs.galileo.ai/concepts/luna/luna Discover Galileo's Luna-2 Evaluation model, reducing the latency and cost for metric evaluations **Luna-2** is the latest generation of our Luna small language models (SLMs), purpose built for scaling AI evaluations. Luna-2 models are fine tuned to provide low latency and reduced costs for metric evaluations. Luna-2 is designed to be further fine tuned for your specific use cases and custom metrics with the goal of providing scalable, real-time, customizable evaluations for enterprises. Luna-based metrics offer highly accurate and efficient evaluations for AI applications, particularly those with agentic workflows. Luna-2 is only available in the Enterprise tier of Galileo. [Contact us](https://galileo.ai/contact-sales) to learn more and get started. Contact us to learn more about using Luna-2 in your evaluations Learn how Galileo pushes the envelope on GenAI evaluation with our family of fine tuned small language models. ## Overview LLMs are powerful judges for evaluations, but as your application scales up to from tens or hundreds of traces a day, to thousands or millions, they can fall short. Too often, organizations relying solely on LLMs to act as judges incur major inference costs and don't see the low-latency they need to enable real-time evaluations and runtime protection. 
* LLMs are expensive * LLMs don't provide the performance needed, especially for runtime protection * LLMs are general purpose, and even leveraging [Autotune](/concepts/metrics/autotune-llm-as-a-judge-metrics) to enhance the evaluation prompts, can still be less effective for your specific needs. The Luna-2 model mitigates these issues: * Being an SLM, it is an **order of magnitude cheaper** to run than most LLMs * SLMs run an **order of magnitude faster**, allowing for **runtime protection** * Luna-2 is not only **fine-tuned for evaluations**, giving comparable performance out of the box with the top LLMs, but it can be **further fine-tuned using your data** to improve accuracy beyond any general purpose LLM. The Luna-2 model works with most of the [out of the box metrics](/sdk-api/metrics/metrics#luna-metrics), or your [LLM-as-a-judge custom metrics](/concepts/metrics/custom-metrics/custom-metrics-ui-llm). ## Performance and cost comparison ### Comparison with different LLMs and content safety tools | Model | Cost/1M token | Accuracy (F1 score) | Latency (avg) | Max tokens | | :------------------- | ------------: | ------------------: | ------------: | ---------: | | **Luna-2** | **\$0.02** | **0.95** | **152ms** | **128k** | | GPT 4o | \$2.50 | 0.94 | 3,200ms | 128k | | GPT 4o mini | \$0.60 | 0.90 | 2,600ms | 128k | | Azure Content Safety | \$1.52 | 0.62 | 312ms | 3k | ### Latency vs compute requirements These are the measured latencies for Luna-2 across a range of GPUs for different sized requests. 
#### H100/H200 GPU | Model | Small (500 tokens) | Medium (2K tokens) | Large (15K tokens) | Extra Large (100K tokens) | | :-------- | -----------------: | -----------------: | -----------------: | ------------------------: | | Luna-2 3B | 15ms | 15ms | 141ms | 2.8s | | Luna-2 8B | 16ms | 30ms | 277ms | 4.71s | #### RTX PRO 6000 GPU | Model | Small (500 tokens) | Medium (2K tokens) | Large (15K tokens) | Extra Large (100K tokens) | | :-------- | -----------------: | -----------------: | -----------------: | ------------------------: | | Luna-2 3B | 17ms | 32ms | 245ms | 4.8s | | Luna-2 8B | 28ms | 61ms | 514ms | 8.05s | #### B200 GPU | Model | Small (500 tokens) | Medium (2K tokens) | Large (15K tokens) | Extra Large (100K tokens) | | :-------- | -----------------: | -----------------: | -----------------: | ------------------------: | | Luna-2 3B | 15ms | 16ms | 81ms | 1.37s | | Luna-2 8B | 15ms | 19ms | 146ms | 2.24s | #### A100 GPU | Model | Small (500 tokens) | Medium (2K tokens) | Large (15K tokens) | Extra Large (100K tokens) | | :-------- | -----------------: | -----------------: | -----------------: | ------------------------: | | Luna-2 3B | 27ms | 85ms | 750ms | 12.5s | | Luna-2 8B | 51ms | 177ms | 1.51s | 21.2s | #### L40S GPU | Model | Small (500 tokens) | Medium (2K tokens) | Large (15K tokens) | Extra Large (100K tokens) | | :-------- | -----------------: | -----------------: | -----------------: | ------------------------: | | Luna-2 3B | 57ms | 91ms | 491ms | 8.06s | | Luna-2 8B | 86ms | 163ms | 1.01s | 14.03s | #### L4 GPU L4 GPUs are only supported for calculating metrics for Log streams and experiments. These GPUs are not supported for runtime protection. 
| Model | Small (500 tokens) | Medium (2K tokens) | Large (15K tokens) | Extra Large (100K tokens) | | :-------- | -----------------: | -----------------: | -----------------: | ------------------------: | | Luna-2 3B | 51ms | 155ms | 1.66s | 29.45s | | Luna-2 8B | 126ms | 364ms | 3.35s | 50.78s | The actual latencies can vary a lot based upon the load on the system (Eg: QPS). This can be managed with more GPUs, but the cost will increase. ## Technical details Galileo's Luna-2 metrics utilize fine-tuned Llama models (3B and 8B variants) in evaluating generative AI metrics. The technical process involves: * **Fine-Tuning:** Base Llama models are fine-tuned with proprietary data for specific metric needs. * **Classification:** Models output normalized log-probabilities of True/False tokens to determine metric accuracy. * **Optimized Infrastructure:** Metrics are hosted on Galileo's optimized inference engine with modern GPU hardware for low-latency and cost-effective evaluations. You can also self host on-prem or on your cloud infrastructure. * **Adapters for Custom Metrics:** Lightweight adapters on a shared base model enhance scalability and minimize infrastructure overhead for additional metrics. By leveraging fine-tuned Llama models, Luna-2 metrics provide significant enhancements over traditional methods: Luna evaluation models are fine-tuned on open-source base models, including but not limited to Llama and Mistral. Where applicable, third-party license terms apply — for example, Llama is licensed under the [Meta Llama Community License](https://llama.meta.com/llama3/license), Copyright (c) Meta Platforms, Inc. All Rights Reserved. * **Adaptability:** These models are most effective when fine tuned, requiring approximately 4,000 samples for fine-tuning to customer-specific use cases. 
* **Efficiency and Cost-Effectiveness:** Luna-2 models enable simultaneous evaluation of multiple metrics with low latency and reduced costs, ideal for real-time, high-scale deployments. * **Enhanced Accuracy:** Luna-2 demonstrates at least a 10% accuracy increase compared to traditional BERT-based models, perfect for precise monitoring in production environments. ## Get started with Luna-2 If you are using the enterprise tier of Galileo, follow these steps to use Galileo's Luna-based metrics: 1. [Contact Galileo's customer support or account management](https://galileo.ai/contact-sales) to begin onboarding. 2. If you are using a Galileo-hosted instance, request L4 GPUs or higher, necessary for running Luna-2 models. Otherwise you can deploy to your own infrastructure, using L4 or higher GPUs. 3. Review the provided documentation and model cards for details on latency, accuracy, and comparisons to BERT-based metrics. 4. Provide Galileo with relevant labelled sample data to fine tune the model. We can augment this with synthetic data if needed. 5. Galileo will fine tune your model for you, and deploy it. 6. Set up your experiments and Log streams to use [Luna-based metrics](/sdk-api/metrics/metrics#luna-metrics). This is not a one-shot process. Your model can be tuned on a regular basis as required. ## Next steps Contact us to learn more about using Luna-2 in your evaluations Learn how to evaluate metrics cheaper and faster using the Luna-2 model Learn how to use Luna-2 metrics when running experiments in code # Action Advancement Source: https://docs.galileo.ai/concepts/metrics/agentic/action-advancement Understand how to measure and optimize the effectiveness of your AI agent's actions ## Overview Action Advancement measures whether an assistant successfully accomplishes or makes progress toward at least one user goal in a conversation. 
Action Advancement addresses the common pain points of unclear agent performance by measuring whether AI agents are actually helping users achieve their objectives rather than just providing responses.

An assistant successfully advances a user's goal when it:

1. Provides a complete or partial answer to the user's question
2. Requests clarification or additional information to better understand the user's needs
3. Confirms that a requested action has been successfully completed

For an interaction to count as advancing the user's goal, the assistant's response must be:

* Factually accurate
* Directly addressing the user's request
* Consistent with any tool outputs used

### Action Advancement at a glance

| Property | Description |
| :----------------------------- | :---------------------------------------------------------------------------- |
| **Name of Metric** | Action Advancement |
| **Metric Category** | Agentic Metrics |
| **Use this metric for** | Evaluating whether AI agents make progress toward user goals in conversations |
| **Can be applied to** | session, trace, all span types (agent, workflow, retriever, LLM and tool) |
| **LLM/Luna Support** | Supported with both LLM + Luna models |
| **Protect Runtime Protection** | No - Not applicable for this metric |
| **Constants** | None - Uses dynamic evaluation |
| **Usage Context** | Agentic workflows, multi-step tasks, tool-using assistants |
| **Value Type** | Confidence score (0.0 to 1.0) - Confidence that any one action has advanced |
| **Input/Output Requirements** | Requires conversation context, user goals, and assistant responses |

## When to Use This Metric

This metric shines when simple response quality metrics fall short, particularly for complex, multi-step interactions where progress toward goals matters more than individual response quality.

* **Agentic Workflows:** When an AI agent must decide on actions and select appropriate tools.
* **Multi-step Tasks:** When completing a user's request requires multiple steps or decisions.
* **Tool-using Assistants:** When evaluating if the assistant used available tools effectively.
* **Customer Service Agents:** Resolving user issues through multi-step problem-solving.
* **Task-Oriented Assistants:** Completing specific actions like booking flights or processing orders.
* **Research Assistants:** Gathering and synthesizing information across multiple sources.
* **Creative Assistants:** Understanding and building upon user requests iteratively.

### Calculation method

If the Action Advancement score is less than 100%, it means at least one evaluator determined the assistant failed to make progress on any user goal.

Action Advancement is calculated by:

1. Multiple evaluation requests are sent to an LLM evaluator to analyze the assistant's progress toward user goals.
2. A specialized chain-of-thought prompt guides the model to evaluate whether the assistant made progress on user goals based on the metric's definition.
3. Each evaluation analyzes the interaction and produces both a detailed explanation and a binary judgment (yes/no) on goal advancement.
4. The final Action Advancement score is computed as the confidence score or probability that any one user ask is advanced.

We display one of the generated explanations alongside the score, choosing one that aligns with the majority judgment.

This metric requires multiple LLM calls to compute, which may impact usage and billing.

### Score Interpretation

**Expected Score:** 1.0 (Excellent) - The assistant made clear progress toward the booking goal by gathering necessary information and providing options.

### What different scores mean

* **0.0 - 0.3 (Poor):** The assistant completely failed to address the user's request or made no meaningful progress. Common causes include ignoring the user's question, providing irrelevant information, or failing to use available tools when needed.
* **0.4 - 0.7 (Fair):** The assistant made some progress but didn't fully accomplish the user's goal. This might include partial answers, requesting clarification when not needed, or missing key aspects of the request.
* **0.8 - 1.0 (Excellent):** The assistant successfully advanced the user's goal by providing complete answers, making appropriate requests for clarification, or confirming successful task completion.

## How to improve Action Advancement scores

To improve Action Advancement scores, focus on ensuring your AI agents make meaningful progress toward user goals in every interaction.
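
When tracking whether a change actually moves the score, it can help to mirror the aggregation and score bands described above in a small local helper. The sketch below is not Galileo's implementation: it assumes the score is simply the fraction of evaluator judgments that were "yes", and the function names `aggregate_judgments` and `interpret_score` are hypothetical. The bands listed in this guide leave 0.3 to 0.4 and 0.7 to 0.8 unassigned, so the sketch places each gap in the lower band.

```python
from typing import Sequence


def aggregate_judgments(judgments: Sequence[bool]) -> float:
    """Return the share of evaluator judgments that were 'yes'.

    Assumption: the final score is the plain fraction of positive judgments.
    """
    if not judgments:
        raise ValueError("at least one judgment is required")
    return sum(judgments) / len(judgments)


def interpret_score(score: float) -> str:
    """Map a score in [0.0, 1.0] to the Poor / Fair / Excellent bands."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= 0.8:
        return "Excellent"
    if score >= 0.4:
        return "Fair"
    # Everything below 0.4 (including the unassigned 0.3 to 0.4 gap) is Poor
    return "Poor"


# Hypothetical run: four of five evaluators judged that a goal was advanced.
score = aggregate_judgments([True, True, True, False, True])
print(score, interpret_score(score))
```

A helper like this is only useful for local bookkeeping, such as comparing runs before and after a prompt change; the score reported in the Galileo Console remains the source of truth.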
### Common issues and solutions | Issue | Cause | Solution | | :---------------------------------- | :----------------------------------------------- | :----------------------------------------------------------------------------------------------------------- | | **Assistant ignores user requests** | Poor prompt engineering or context understanding | Improve system prompts to emphasize goal-oriented responses and ensure the assistant understands user intent | | **Incomplete responses** | Insufficient context or tool usage | Provide better context and ensure the assistant uses available tools effectively | | **Irrelevant information** | Lack of focus on user goals | Train the assistant to stay focused on the specific user request and avoid tangential information | | **No progress on multi-step tasks** | Poor task breakdown | Implement better task decomposition and ensure the assistant can handle complex, multi-step processes | ### Best practices for optimization * **Clear goal identification:** Ensure your assistant can identify and prioritize user goals * **Progressive disclosure:** Break complex tasks into manageable steps * **Tool integration:** Make sure the assistant effectively uses available tools and APIs * **Context awareness:** Maintain conversation context to build on previous interactions ## Comparison to other metrics | Property | Action Advancement | Instruction Adherence | Completeness | | :----------------------------- | :---------------------------------------- | :----------------------------------------------- | :------------------------------- | | **Metric Category** | Agentic Metrics | Response Quality | Response Quality | | **Use this metric for** | Evaluating goal progress in conversations | Measuring how well responses follow instructions | Assessing response completeness | | **Best for** | Multi-step tasks and agentic workflows | Single-turn instruction following | Ensuring comprehensive responses | | **LLM/Luna Support** | Yes | Yes | Yes | 
| **Protect Runtime Protection** | No | No | No | | **Value Type** | Percentage (0.0-1.0) | Percentage (0.0-1.0) | Percentage (0.0-1.0) | | **Limitations** | Requires conversation context | May not capture goal progress | Doesn't measure goal advancement | ## Best practices To effectively implement and optimize Action Advancement in your AI systems, consider these key practices: ### Track progress over time Monitor Action Advancement scores across different versions of your agent to ensure improvements in task completion capabilities. This helps you identify whether your optimizations are actually improving goal advancement. ### Analyze failure patterns When Action Advancement scores are low, examine the specific steps where agents fail to make progress to identify systematic issues. Look for patterns in where agents get stuck or fail to advance user goals. ### Combine with other metrics Use Action Advancement alongside other agentic metrics to get a comprehensive view of your assistant's effectiveness. This provides a more complete picture of your agent's performance beyond just goal advancement. ### Test edge cases Create evaluation datasets that include complex, multi-step tasks to thoroughly assess your agent's ability to advance user goals. This ensures your agent can handle challenging scenarios that require multiple steps. When optimizing for Action Advancement, ensure you're not sacrificing other important aspects like safety, factual accuracy, or user experience in pursuit of task completion. ## Performance Benchmarks We evaluated Action Advancement against human expert labels on an internal dataset of agentic conversation samples using top frontier models. | Model | F1 (True) | | :---------------------- | :-------: | | GPT-4.1 | 0.87 | | GPT-4.1-mini (judges=3) | 0.78 | | Claude Sonnet 4.5 | 0.89 | | Gemini 3 Flash | 0.85 | ### GPT-4.1 Classification Report Benchmarks based on internal evaluation dataset. Performance may vary by use case. 
## Related Resources If you would like to dive deeper or start implementing Action Advancement, check out the following resources: ### Examples * [Action Advancement Examples](https://app.galileo.ai) - Log in and explore the "Action Advancement" Log Stream in the "Preset Metric Examples" Project to see this metric in action. ### How-to guides * [Agentic AI Basic Example](/how-to-guides/agentic-ai/basic-example) * [Creating Custom Metrics](/how-to-guides/metrics/create-local-metric/create-local-metric) ### Related Concepts * [Agentic Metrics Overview](/concepts/metrics/agentic/agentic-overview) * [Action Completion](/concepts/metrics/agentic/action-completion) * [Agent Efficiency](/concepts/metrics/agentic/agent-efficiency) # Action Completion Source: https://docs.galileo.ai/concepts/metrics/agentic/action-completion Understand how to measure whether your agent actually accomplished a user's goals ## Overview Action Completion determines whether the agent successfully accomplished all of the user’s goals. Action Completion addresses the common pain points of agent performance by measuring whether AI agents are actually helping users achieve their end goal rather than just providing responses. 
Action Completion is successful when all of the below are true:

* The agent provides a complete answer in the case of a question
* The agent provides a confirmation of successful action in the case of a request
* The response is coherent and factually accurate
* The response comprehensively addresses every aspect of the user's request
* The response avoids contradicting tool outputs
* The response summarizes all relevant parts returned by tools

### Action Completion at a glance

| Property                       | Description                                                           |
| :----------------------------- | :-------------------------------------------------------------------- |
| **Name of Metric**             | Action Completion                                                     |
| **Metric Category**            | Agentic Metrics                                                       |
| **Use this metric for**        | Measuring whether the agent successfully accomplished the user's goal |
| **Can be applied to**          | Session                                                               |
| **LLM/Luna Support**           | Supported with both LLM + Luna models                                 |
| **Protect Runtime Protection** | No                                                                    |
| **Constants**                  | None - Uses dynamic evaluation                                        |
| **Usage Context**              | Agentic workflows, multi-step tasks, tool-using assistants            |
| **Value Type**                 | Confidence score denoted as a percentage.                             |
| **Input/Output Requirements**  | Requires agent responses and user goals for evaluation                |

## Calculation method

If the response does not achieve an Action Completion score of 100%, at least one judge considered the model to have failed to accomplish every user goal.

Multiple requests are sent to an LLM using a carefully designed chain-of-thought prompt that adheres to the definition above. The LLM generates multiple distinct responses, each containing:

* An explanation
* A final judgment: "Yes" (goal accomplished) or "No" (goal not accomplished)

Action Completion Score = (Number of "Yes" Responses) / (Total Number of Responses)
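The score formula in this section, together with the choice of a majority-aligned explanation, can be sketched in a few lines of Python. This is a minimal illustration, not Galileo's implementation: the `Judgment` dataclass, the `action_completion_score` function, and the sample judgments are hypothetical, and the tie-breaking rule is an assumption because the documentation does not specify one.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One LLM judge response (hypothetical structure, for illustration only)."""
    verdict: bool      # True = "Yes" (goal accomplished), False = "No"
    explanation: str


def action_completion_score(judgments: list[Judgment]) -> tuple[float, str]:
    """Return (score, explanation), where score = number of "Yes" / total."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    yes_count = sum(j.verdict for j in judgments)
    score = yes_count / len(judgments)
    # Assumption: a tie counts as a "Yes" majority; the docs do not define ties.
    majority = yes_count * 2 >= len(judgments)
    # Surface the first explanation that agrees with the majority verdict.
    explanation = next(j.explanation for j in judgments if j.verdict == majority)
    return score, explanation


# Hypothetical judge outputs for a single session.
judgments = [
    Judgment(True, "The agent confirmed the booking."),
    Judgment(True, "All parts of the request were addressed."),
    Judgment(False, "The agent did not summarize the tool output."),
]
score, explanation = action_completion_score(judgments)
print(f"{score:.0%}")  # 67%
print(explanation)     # The agent confirmed the booking.
```

A score below 100% here simply means at least one of the judges answered "No", which matches the interpretation given at the start of this section.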
Galileo displays a generated explanation alongside the score, choosing the one that aligns with the majority judgement for troubleshooting. This metric requires multiple LLM calls to compute, which may impact usage and billing. ## Score interpretation **Expected Score:** 100% - A perfect score indicates the agent successfully accomplished all user goals with complete, accurate, and comprehensive responses. ### What different scores mean * **0.0 - 0.3 (Poor):** Agent completely failed to accomplish user goals, provided incomplete answers, or contradicted tool outputs. Common causes include insufficient tool usage, incomplete responses, or factual inaccuracies. * **0.4 - 0.7 (Fair):** Agent made progress toward user goals but didn't fully address all aspects of the request. Areas for improvement include ensuring comprehensive coverage of all user requirements and better tool utilization. * **0.8 - 1.0 (Excellent):** Agent successfully accomplished all user goals with complete, accurate, and comprehensive responses. Best practices include thorough tool usage, complete answer provision, and proper confirmation of successful actions. ## How to improve Action Completion scores To optimize your agent's performance and ensure high Action Completion scores, focus on comprehensive goal accomplishment and complete response generation. 
### Common issues and solutions | Issue | Cause | Solution | | :------------------------- | :-------------------------------------------------- | :------------------------------------------------------------------------------------------------ | | Incomplete responses | Agent stops before addressing all user requirements | Implement comprehensive response generation and ensure all user goals are explicitly addressed | | Tool output contradictions | Agent ignores or contradicts information from tools | Ensure agent properly summarizes and incorporates all relevant tool outputs without contradiction | | Missing confirmations | Agent doesn't confirm successful actions | Add explicit confirmation steps for action-based requests | | Factual inaccuracies | Agent provides incorrect information | Implement fact-checking mechanisms and ensure responses align with tool outputs | ### Best practices for optimization * **Track Progress Over Time**: Monitor Action Completion scores across different versions of your agent to identify trends and ensure continuous improvements in task completion capabilities. * **Analyze Failure Patterns**: When Action Completion scores are low, examine specific steps or scenarios where agents fail to meet user goals. Use this analysis to identify and address systematic issues. * **Combine with Other Metrics**: Use Action Completion alongside other agentic metrics, such as Action Advancement, to get a comprehensive view of your assistant's effectiveness and identify areas for improvement. * **Test Edge Cases**: Create evaluation datasets that include complex, multi-step tasks to thoroughly assess your agent's ability to handle challenging scenarios and advance user goals effectively. When optimizing for Action Completion, ensure you're not sacrificing other important aspects like safety, factual accuracy, or user experience in pursuit of task completion. 
## Comparison to other metrics | Property | Action Completion | Action Advancement | Tool Selection | | :----------------------------- | :---------------------------- | :------------------------------ | :-------------------------------- | | **Metric Category** | Agentic Performance | Agentic Performance | Agentic Performance | | **Use this metric for** | Measuring goal accomplishment | Measuring progress toward goals | Measuring tool choice quality | | **Best for** | Final outcome evaluation | Progress tracking | Tool usage optimization | | **LLM/Luna Support** | Yes | Yes | Yes | | **Protect Runtime Protection** | No | No | No | | **Value Type** | Percentage (0%-100%) | Percentage (0%-100%) | Percentage (0%-100%) | | **Limitations** | Requires multiple LLM calls | May not capture final success | Doesn't measure execution quality | ## Performance Benchmarks We evaluated Action Completion against human expert labels on an internal dataset of agentic conversation samples using top frontier models. | Model | F1 (True) | | :---------------------- | :-------: | | GPT-4.1 | 0.92 | | GPT-4.1-mini (judges=3) | 0.79 | | Claude Sonnet 4.5 | 0.87 | | Gemini 3 Flash | 0.92 | ### GPT-4.1 Classification Report Benchmarks based on internal evaluation dataset. Performance may vary by use case. ## Related Resources If you would like to dive deeper or start implementing Action Completion, check out the following resources: ### Examples * [Action Completion Examples](https://app.galileo.ai) - Log in and explore the "Action Completion" Log Stream in the "Preset Metric Examples" Project to see this metric in action. 
### How-to guides

* [Agentic AI Examples](/how-to-guides/agentic-ai/basic-example)

### Related Concepts

* [Action Advancement](/concepts/metrics/agentic/action-advancement)
* [Tool Selection](/concepts/metrics/agentic/tool-selection-quality)
* [Agentic AI Overview](/concepts/metrics/agentic/agentic-overview)

# Agent Efficiency

Source: https://docs.galileo.ai/concepts/metrics/agentic/agent-efficiency

Learn how to measure the efficiency of your agentic workflows

Agent Efficiency is a binary evaluation metric of the efficiency of your agentic workflows. An agentic session is considered efficient or optimal when the agent provides a precise answer or resolution to every user ask, with an efficient path.

An ask could be a question that requires an answer, or a request that requires a resolution through tool usage. Efficiency here means the agent makes no redundant tool calls, asks the user no redundant questions or clarifications, communicates precisely and concisely, and reaches its goal in the minimum number of steps.

This is a **boolean** metric, returning a confidence score that the agent is efficient. The score ranges from 0% (no confidence the agent is efficient) to 100% (complete confidence that the agent is efficient).

## Agent Efficiency at a glance

| Property                       | Description                                    |
| :----------------------------- | :--------------------------------------------- |
| **Name**                       | Agent Efficiency                               |
| **Category**                   | Agentic AI                                     |
| **Can be applied to**          | Session                                        |
| **LLM-as-a-judge Support**     | ✅                                              |
| **Luna Support**               | ❌                                              |
| **Protect Runtime Protection** | ❌                                              |
| **Value Type**                 | Boolean shown as a percentage confidence score |

## When to use this metric

## Score interpretation

**Expected Score:** 80%-100%.
# Agent Flow Source: https://docs.galileo.ai/concepts/metrics/agentic/agent-flow Learn how to measure the correctness and coherence of an agentic trajectory by validating it against user-specified natural language tests Agent Flow is a binary metric that checks if an agent's behavior satisfies all user-defined natural language conditions. Agent Flow is a binary evaluation metric that measures the correctness and coherence of an agentic trajectory by validating it against user-specified natural language tests. A trajectory is said to pass the Agent Flow metric if and only if all the user-defined natural language conditions are successfully satisfied by the agent's realized behavior or output. To use this metric, you will need to create a copy and edit the prompt to provide your natural language tests. This is a **boolean** metric, returning a confidence score that the agent flow satisfies all conditions. The score ranges from 0% (no confidence the agent flow satisfies all conditions) to 100% (complete confidence that the agent flow satisfies all conditions). ## Agent Flow at a glance | Property | Description | | :----------------------------- | :--------------------------------------------- | | **Name** | Agent Flow | | **Category** | Agentic AI | | **Can be applied to** | Session | | **LLM-as-a-judge Support** | ✅ | | **Luna Support** | ❌ | | **Protect Runtime Protection** | ❌ | | **Value Type** | Boolean shown as a percentage confidence score | ## When to use this metric ## Score interpretation **Expected Score:** 80%-100%. ## Configure Agent Flow This metric needs to be manually customized to include your own natural language tests. From the **Metrics Hub**, select the **Agent Flow** metric. You will get a popup asking you to duplicate the metric. Select **Duplicate metric** to create a copy. The agent flow metric with the duplicate metric popup Locate the user defined tests section in the prompt. 
```xml theme={null}
{{ Add your tests here }}
```

This prompt needs to be customized based on your application, and the inputs and outputs you are expecting. Replace `{{ Add your tests here }}` with a numbered list of natural language tests that can be used to evaluate the agent's flow. These can include:

* Expected tool or agent calls, using the tool or agent names
* Conditions on tool or agent calling (e.g. if tool x is called, don't call agent y)
* Expectations around the input or output parameters to tools and agents
* Limitations on the number of tool or agent calls

For example, imagine you were creating an agent to provide advice on exercises for different body parts, such as for a physical therapy application. The agent has multiple tools, including `list_by_target_muscle_for_exercised`, `list_by_body_part_for_exercised`, and `list_of_bodyparts_for_exercised`. Some user tests might be:

```output wrap theme={null}
1. If a call to "list_by_target_muscle_for_exercised" returns an error that contains the text "target not found", the agent should subsequently attempt an alternative lookup by calling either "list_by_body_part_for_exercised" or "list_of_bodyparts_for_exercised"
2. When the user asks for exercises that target leg muscles, the agent must call at least one of the tools ["list_by_target_muscle_for_exercised", "list_by_body_part_for_exercised"] during the conversation
3. After receiving a successful response from "list_by_body_part_for_exercised", the agent's following natural-language message must contain at least one exercise name, the corresponding equipment, and an animated demonstration URL taken from the tool output
4. Every invocation of the tool "list_by_body_part_for_exercised" must include the required parameter "bodypart"
5. After receiving data from list_by_body_part_for_exercised, the agent response must include the exercise id for every exercise it presents to the user
6. No assistant message should include more than one tool invocation
7. The agent should conclude the conversation with a human-readable answer that summarizes the requested leg exercises using data returned from the tools
```

Save the metric, then turn it on for your Log Stream.

## Best practices

Trajectory tests are similar to unit tests for the agent's trajectory: they check whether certain conditions are followed along the agent's path.

Write all the tests in a numbered list. For example:

```md theme={null}
1. If X happens then ask the user Y and call tool Z.
2. X tool is always called before Y tool.
3. When user asks X reply with Y
4. The tool Y should be called once in the conversation.
```

Each test should check a single condition. Tests should be logically consistent and well defined.

## Performance Benchmarks

We evaluated Agent Flow against human expert labels on an internal dataset of agentic conversation samples using top frontier models.

| Model                   | F1 (True) |
| :---------------------- | :-------: |
| GPT-4.1                 |    0.93   |
| GPT-4.1-mini (judges=3) |    0.92   |
| Claude Sonnet 4.5       |    0.95   |
| Gemini 3 Flash          |    0.92   |

### GPT-4.1 Classification Report

Benchmarks based on internal evaluation dataset. Performance may vary by use case.
## Related Resources

If you would like to dive deeper or start implementing Agent Flow, check out the following resources:

### How-to guides

* [Agentic AI Basic Example](/how-to-guides/agentic-ai/basic-example)

### Related Concepts

* [Agentic Metrics Overview](/concepts/metrics/agentic/agentic-overview)
* [Action Advancement](/concepts/metrics/agentic/action-advancement)
* [Action Completion](/concepts/metrics/agentic/action-completion)

# Agentic Metrics

Source: https://docs.galileo.ai/concepts/metrics/agentic/agentic-overview

Understand and evaluate the performance of AI agents using Galileo's agentic metrics

Agentic metrics help you measure how well your AI agents perform complex, multi-step tasks, especially when those agents need to use tools, make decisions, or interact with external systems. These metrics are helpful for anyone building advanced AI assistants, workflow automation, or any system where the AI acts on behalf of a user.

Use agentic metrics when you want to:

* Track whether your agent is making meaningful progress toward its goals.
* Detect and diagnose errors that occur when your agent uses tools or APIs.
* Ensure your agent is choosing the best tools or actions for each situation.
Below is a quick reference table of all agentic performance metrics: | Name | Description | Supported Nodes | When to Use | Example Use Case | | :------------------------------------------------------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------- | :---------------------------------- | :---------------------------------------------------------------------------------------------------------------------------------- | :----------------------------------------------------------------------------------------------------------------------------- | | [Action advancement](/concepts/metrics/agentic/action-advancement) | Measures how effectively each action advances toward the goal. | Trace | When assessing whether an agent is making meaningful progress in multi-step tasks. | A travel planning agent that needs to book flights, hotels, and activities in the correct sequence. | | [Action completion](/concepts/metrics/agentic/action-completion) | Determines whether the agent successfully accomplished all of the user's goals. | Session | To assess whether an agent completed the desired goal. | A coding agent that is seeking to close engineering tickets. | | [Agent efficiency](/concepts/metrics/agentic/agent-efficiency) | Determines if an agent provides a precise answer or resolution to every user ask, with an efficient path. | Session | To assess if an agent is taking the most efficient path to a solution. | A complex multi-agent chatbot that needs a fast response. | | [Agent flow](/concepts/metrics/agentic/agent-flow) | Measures the correctness and coherence of an agentic trajectory by validating it against user-specified natural language tests. | Session | To assess a multi-agent system, or a system with multiple tools. | An internal process agent that needs to follow strict process rules. 
| | [Conversation quality](/concepts/metrics/agentic/conversation-quality) | A binary metric that assesses whether a chatbot interaction left the user feeling satisfied and positive or frustrated and dissatisfied. | Session (trace inputs/outputs only) | When building customer facing chatbots. | A health insurance chatbot. | | [Tool error](/concepts/metrics/agentic/tool-error) | Detects errors or failures during the execution of tools. | Tool span | When implementing AI agents that use tools and want to track error rates. | A coding assistant that uses external APIs to run code and must handle and report execution errors appropriately. | | [Tool selection quality](/concepts/metrics/agentic/tool-selection-quality) | Evaluates whether the agent selected the most appropriate tools for the task. | LLM span | When optimizing agent systems for effective tool usage. | A data analysis agent that must choose the right visualization or statistical method based on the data type and user question. | | [Reasoning Coherence](/concepts/metrics/agentic/reasoning-coherence) | Assesses whether an agent’s reasoning steps are logically consistent and aligned with its plan. | LLM span | When validating multi-step planning and intermediate reasoning quality. | A planning agent that must follow a coherent plan across tool calls. | | [User Intent change](/concepts/metrics/agentic/intent-change) | Measures a significant shift in the user's primary conversational goal or workflow during a session, relative to their initial stated intent. | Session (trace inputs/outputs only) | To analyze a holistic view across an entire user session to understand what capabilities a user interacts with in a single session. | A multi-purpose chatbot for a bank. 
| *** ## Next steps * [See examples of agentic metrics in action](/cookbooks/use-cases/agent-weather-vibes-app) * [Back to Metrics Overview](/concepts/metrics/overview) * [Compare all metrics](/concepts/metrics/metric-comparison) # Conversation Quality Source: https://docs.galileo.ai/concepts/metrics/agentic/conversation-quality Learn how to measure the quality of a conversation that a user has with a chatbot Conversation Quality is a binary metric that assesses whether a chatbot interaction left the user feeling satisfied and positive or frustrated and dissatisfied, based on tone, engagement, and overall experience. The Conversation Quality metric evaluates user satisfaction across an entire chatbot session by analyzing tone, engagement, and sentiment. It classifies each conversation as GOOD or BAD depending on whether the user’s overall experience reflects positive engagement or frustration directed at the bot. The metric focuses on conversational flow rather than task success, emphasizing how naturally and politely the user and bot interact. It excludes non-textual or purely action-based agent outputs (e.g., button clicks). This is a **boolean** metric, returning a confidence score that the conversation quality is good. The score ranges from 0% (no confidence the conversation quality is good) to 100% (complete confidence that the conversation quality is good). ## Conversation Quality at a glance | Property | Description | | :----------------------------- | :--------------------------------------------- | | **Name** | Conversation Quality | | **Category** | Agentic AI | | **Can be applied to** | Session | | **LLM-as-a-judge Support** | ✅ | | **Luna Support** | ❌ | | **Protect Runtime Protection** | ❌ | | **Value Type** | Boolean shown as a percentage confidence score | ## When to use this metric ## Score interpretation **Expected Score:** 80%-100%. 
## How to improve Conversation Quality scores Some techniques to improve Conversation Quality scores are: * Ensure bots provide clear, empathetic, and concise responses * Detect and mitigate repeated clarification loops * Train models to de-escalate external frustration effectively * Log complete sessions to allow accurate tone assessment Common issues that can cause low scores are: * Mislabeling external frustration as bot-directed * Incomplete logs * Abrupt session truncation ## Performance Benchmarks We evaluated Conversation Quality against human expert labels on an internal dataset of agentic conversation samples using top frontier models. | Model | F1 (True) | | :---------------------- | :-------: | | GPT-4.1 | 0.89 | | GPT-4.1-mini (judges=3) | 0.85 | | Claude Sonnet 4.5 | 0.85 | | Gemini 3 Flash | 0.88 | ### GPT-4.1 Classification Report Benchmarks based on internal evaluation dataset. Performance may vary by use case. ## Related Resources If you would like to dive deeper or start implementing Conversation Quality, check out the following resources: ### Examples * [Conversation Quality Examples](https://app.galileo.ai) - Log in and explore the "Conversation Quality" Log Stream in the "Preset Metric Examples" Project to see this metric in action. 
### How-to guides * [Agentic AI Basic Example](/how-to-guides/agentic-ai/basic-example) * [Creating Custom Metrics](/how-to-guides/metrics/create-local-metric/create-local-metric) ### Related Concepts * [Agentic Metrics Overview](/concepts/metrics/agentic/agentic-overview) * [Action Completion](/concepts/metrics/agentic/action-completion) * [Action Advancement](/concepts/metrics/agentic/action-advancement) # User Intent Change Source: https://docs.galileo.ai/concepts/metrics/agentic/intent-change Learn how to measure if users are using your agent system for different intents across multi-turn conversation User Intent Change checks if users are using your agent system for different intents across multi-turn conversations. User Intent Change is a binary evaluation metric, and is defined as a significant shift in the user's primary conversational goal or workflow during a session, relative to their initial stated intent. This is a **boolean** metric, returning a confidence score that the user intent has changed significantly. The score ranges from 0% (no confidence the user intent has changed) to 100% (complete confidence that the user intent has changed). ## User Intent Change at a glance | Property | Description | | :----------------------------- | :--------------------------------------------- | | **Name** | User Intent Change | | **Category** | Agentic AI | | **Can be applied to** | Session | | **LLM-as-a-judge Support** | ✅ | | **Luna Support** | ❌ | | **Protect Runtime Protection** | ❌ | | **Value Type** | Boolean shown as a percentage confidence score | ## When to use this metric ## Score interpretation **Expected Score:** 80%-100%. 
# Reasoning Coherence Source: https://docs.galileo.ai/concepts/metrics/agentic/reasoning-coherence Evaluate whether an agent’s reasoning steps are logically consistent and aligned with its plan Reasoning Coherence assesses whether an agent’s reasoning steps are logically consistent, non-contradictory, and aligned with the intended plan. ## Metric definition Reasoning Coherence — A binary metric that evaluates internal logical consistency within a single LLM call, with respect to the latest user input. * Type: Binary * 1 (Coherent): Intermediate reasoning events/summaries are mutually consistent and causally support the LLM input. * 0 (Incoherent): Contradictions, conflicting premises, circular logic, or unjustified reversals/jumps exist among the reasoning events. This metric is primarily used for agentic workflows that involve multi-step planning, tool usage, and intermediate reasoning traces. It helps validate that the steps an agent takes (or proposes) form a coherent path from problem to solution. Here's a scale that shows the relationship between Reasoning Coherence and potential impact on your AI system: Scale is 0-100 and is derived from binary judgments converted into a confidence score. ## Calculation method Reasoning Coherence is computed through a multi-step process: One or more evaluation requests are sent to an LLM evaluator to analyze the agent’s reasoning steps and plan alignment. A chain-of-thought style judge prompt guides the evaluator to check for logical consistency, contradictions, and adherence to the plan.
Evaluation rubric (summary):

* Intermediate reasoning summaries should support the LLM’s input and each other logically.
* No event should invalidate or contradict an earlier inference without explicit, justified retraction.
* Explanations and planned actions/tool selections must be mutually reinforcing and consistent with the input.
* Web search: The need for a search should be justified by the input/reasoning, and the query/parameters should be appropriate.
The system can request multiple judgments to improve robustness and reduce variance. Each evaluation produces a binary outcome (coherent = 1, not coherent = 0) along with an explanation.
This metric is computed by prompting an LLM and may require multiple LLM calls to compute, which can impact usage and billing. ## Supported nodes * LLM span Inputs considered (when available): * Latest user input and current system prompt * Intermediate reasoning events and summaries (including plan/steps) * Tool-selection thoughts and invoked tool calls (including arguments) * Final in-span conclusion/output Empty or missing reasoning summaries should not be penalized; assess coherence only when there is evidence of incoherence. ## What constitutes coherent reasoning (1) * Intermediate reasoning summaries support the LLM’s input and each other logically. * No unjustified contradictions: any retractions are explicit and justified. * Explanations, planned actions, and tool selections are consistent with the input and mutually reinforcing. * Web search is justified by the input/reasoning and uses appropriate parameters (e.g., search query). ## What constitutes incoherent reasoning (0) * Explicit contradictions without justification within the reasoning chain. * Final (in-span) conclusions or planned actions don’t follow from prior steps. * Circular reasoning or unjustified reversals of stance. * Tool-selection reasoning conflicts with the recorded input or earlier reasoning steps. * The reasoning process deviates from the latest user or system instructions. * Web search is unjustified for common-knowledge queries (if unsure, treat as justified), or web search is used when an available specialized tool (e.g., get\_weather) is clearly more appropriate for the user’s query. ## Interpreting the score * 0-30: Low coherence — reasoning likely contains contradictions or misaligned steps. * 31-69: Mixed coherence — review critical steps and provide additional guidance. * 70-100: Strong coherence — reasoning appears consistent and aligned. > Consider setting thresholds for alerting or human review based on your domain’s risk tolerance (e.g., flag \< 50 for review). 
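The score bands and the review threshold described above could be applied in code roughly as follows. This is a sketch under stated assumptions, not Galileo's implementation: `coherence_score`, `interpret_coherence`, and `REVIEW_THRESHOLD` are hypothetical names, the 0-100 score is assumed to be the share of "coherent" judgments, and the threshold of 50 is only the example value from the note above.

```python
REVIEW_THRESHOLD = 50  # example value only; tune to your domain's risk tolerance


def coherence_score(judgments: list[int]) -> float:
    """Convert binary judgments (1 = coherent, 0 = incoherent) to a 0-100 score.

    Assumption: the confidence score is the mean of the binary judgments.
    """
    if not judgments:
        raise ValueError("at least one judgment is required")
    if any(j not in (0, 1) for j in judgments):
        raise ValueError("judgments must be 0 or 1")
    return 100 * sum(judgments) / len(judgments)


def interpret_coherence(score: float) -> tuple[str, bool]:
    """Return (band, needs_review) for a 0-100 Reasoning Coherence score."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    # Bands follow the 0-30 / 31-69 / 70-100 ranges; fractional scores
    # between 30 and 31 are treated as "mixed" (an assumption).
    if score <= 30:
        band = "low"
    elif score < 70:
        band = "mixed"
    else:
        band = "strong"
    return band, score < REVIEW_THRESHOLD


score = coherence_score([1, 1, 0])   # two of three judges found the span coherent
print(interpret_coherence(score))    # ('mixed', False)
```

Keeping the threshold separate from the band boundaries lets you tighten or relax human review without changing how scores are labeled.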
## Example use cases

* Validating multi-step “plan → execute” agents.
* Auditing tool-augmented reasoning chains for consistency.
* Comparing agent versions for planning quality regressions.
* Example: A financial planning agent develops a step-by-step investment plan, ensuring each recommendation logically follows from prior steps and aligns with the user’s goals.

## Usage

Enable this metric in experiments or Log Streams by selecting the Reasoning Coherence scorer.

```python
from galileo import GalileoMetrics

metric = GalileoMetrics.reasoning_coherence
```

## Best practices

* Ensure the agent records its plan and intermediate steps so coherence can be evaluated meaningfully.
* Calibrate the judge rubric with domain examples to reduce false positives/negatives.
* Define minimum acceptable coherence scores and trigger human review below that threshold.
* Use continuous learning via human feedback to improve the judge prompt and rubric over time.

## Performance Benchmarks

We evaluated Reasoning Coherence against human expert labels on an internal dataset of agentic conversation samples using top frontier models.

| Model                   | F1 (True) |
| :---------------------- | :-------: |
| GPT-4.1                 |    0.88   |
| GPT-4.1-mini (judges=3) |    0.87   |
| Claude Sonnet 4.5       |    0.79   |
| Gemini 3 Flash          |    0.88   |

### GPT-4.1 Classification Report

Benchmarks based on internal evaluation dataset. Performance may vary by use case.

## Related Resources

If you would like to dive deeper or start implementing Reasoning Coherence, check out the following resources:

### Examples

* [Reasoning Coherence Examples](https://app.galileo.ai) - Log in and explore the "Reasoning Coherence" Log Stream in the "Preset Metric Examples" Project to see this metric in action.
### How-to guides

* [Agentic AI Basic Example](/how-to-guides/agentic-ai/basic-example)
* [Creating Custom Metrics](/how-to-guides/metrics/create-local-metric/create-local-metric)

### Related Concepts

* [Agentic Metrics Overview](/concepts/metrics/agentic/agentic-overview)
* [Action Completion](/concepts/metrics/agentic/action-completion)
* [Action Advancement](/concepts/metrics/agentic/action-advancement)

# Tool Error

Source: https://docs.galileo.ai/concepts/metrics/agentic/tool-error

Detect and analyze tool execution errors in AI agents using Galileo Guardrail Metrics to ensure reliable tool usage in agentic workflows

Tool Error detects errors or failures during the execution of Tools.

This metric is particularly valuable for monitoring agentic AI systems where the model uses various tools to complete tasks. Tool execution failures can lead to incomplete or incorrect responses, affecting the overall user experience.

[Figure: scale showing the relationship between Tool Error detection and the potential impact on your AI system]

## Calculation method

Tool Error detection is computed through a multi-step process:

1. Additional evaluation requests are sent to an LLM evaluator (e.g., OpenAI's GPT-4o mini) to analyze tool execution outcomes.
2. A carefully engineered chain-of-thought prompt guides the model to evaluate whether each tool executed successfully without errors.
3. The system performs a detailed analysis of execution logs and outputs from each tool call to identify potential issues.
4. The evaluation process identifies specific errors, exceptions, and unexpected behaviors that occurred during tool execution.
5. A detailed explanation is generated describing the detected errors and their potential impact on the system's functionality.

We also surface a generated explanation that helps you understand the nature of the error and its potential causes.
This metric is computed by prompting an LLM, which requires additional LLM calls to compute, potentially impacting usage and billing.

## Understanding tool error

### Common Types of Tool Errors

Tool Error detection identifies various failure modes:

* API Failures: External services or APIs that tools depend on may be unavailable or return errors.
* Parameter Errors: Tools may receive invalid parameters that cause execution failures.
* Timeout Issues: Tools may take too long to execute and exceed allocated time limits.
* Permission Errors: Tools may lack necessary permissions to access required resources.
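
To track these failure modes in your own logs, it can help to tag each failure with a category before aggregating. The snippet below is a minimal sketch, assuming your tools raise standard Python exceptions; the mapping from exception types to categories and the function name `categorize_tool_error` are illustrative assumptions, and this is not how Galileo computes the metric.

```python
from collections import Counter


def categorize_tool_error(exc: BaseException) -> str:
    """Map an exception raised by a tool to one of the four failure modes (hypothetical mapping)."""
    # TimeoutError, PermissionError, and ConnectionError are all subclasses of
    # OSError, so each is checked individually instead of matching OSError.
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, PermissionError):
        return "permission"
    if isinstance(exc, ConnectionError):
        return "api_failure"
    if isinstance(exc, (ValueError, TypeError, KeyError)):
        return "parameter"
    return "other"


# Hypothetical errors collected from tool executions.
errors = [
    TimeoutError("tool exceeded its time limit"),
    ValueError("invalid city name"),
    ConnectionError("upstream service unavailable"),
    ValueError("invalid date format"),
]
counts = Counter(categorize_tool_error(e) for e in errors)

if __name__ == "__main__":
    print(counts.most_common())
```

Counting by category makes it easier to see which failure mode dominates and to prioritize fixes by frequency.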
## Optimizing your AI system

### Addressing Tool Errors

When your system experiences tool execution errors, consider these improvements:

* Implement robust error handling: Ensure tools can gracefully handle exceptions and provide meaningful error messages.
* Add parameter validation: Validate input parameters before tool execution to prevent runtime errors.
* Monitor external dependencies: Set up monitoring for external services that your tools depend on.
* Implement fallback mechanisms: Design tools with fallback options when primary execution paths fail.
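
Parameter validation, error handling, and a fallback path can be combined in a single wrapper around each tool call. This is a minimal sketch under stated assumptions: `call_tool`, `get_weather`, and `cached_weather` are hypothetical names, and the result dictionary shape is an illustrative choice, not a Galileo API.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("tools")


def call_tool(
    tool: Callable[..., Any],
    params: dict,
    required: set,
    fallback: Optional[Callable[..., Any]] = None,
) -> dict:
    """Run a tool with parameter validation, error handling, and an optional fallback."""
    # Parameter validation: fail fast before the tool runs.
    missing = required - set(params)
    if missing:
        return {"ok": False, "error": f"missing parameters: {sorted(missing)}"}
    try:
        return {"ok": True, "result": tool(**params)}
    except Exception as exc:
        # Error handling: log the failure and return a meaningful message.
        logger.warning("tool %s failed: %r", getattr(tool, "__name__", tool), exc)
        if fallback is None:
            return {"ok": False, "error": str(exc)}
        try:
            # Fallback: try the alternative execution path.
            return {"ok": True, "result": fallback(**params), "used_fallback": True}
        except Exception as fallback_exc:
            return {"ok": False, "error": f"{exc}; fallback failed: {fallback_exc}"}


def get_weather(city: str) -> str:
    """Hypothetical primary tool that simulates an API failure."""
    raise ConnectionError("weather API unavailable")


def cached_weather(city: str) -> str:
    """Hypothetical fallback that serves a cached answer."""
    return f"cached forecast for {city}"


result = call_tool(get_weather, {"city": "Paris"}, {"city"}, fallback=cached_weather)

if __name__ == "__main__":
    print(result)
```

Returning a structured result instead of raising keeps the failure visible in the logged tool output, which is what an evaluator would inspect.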
## Best practices

* Implement detailed logging for all tool executions to facilitate debugging and error analysis.
* Design tools to provide partial results or alternative responses when they encounter errors.
* Categorize different types of errors to identify patterns and prioritize fixes based on frequency and impact.
* Translate technical errors into user-friendly messages that help users understand what went wrong.

This metric helps you detect whether your tools executed correctly. It's most useful in Agentic Workflows where many Tools get called. It helps you detect and understand patterns in your Tool failures, allowing you to improve reliability over time.

## Related Resources

If you would like to dive deeper or start implementing Tool Error detection, check out the following resources:

### How-to guides

* [Agentic AI Basic Example](/how-to-guides/agentic-ai/basic-example)
* [Creating Custom Metrics](/how-to-guides/metrics/create-local-metric/create-local-metric)

### Related Concepts

* [Agentic Metrics Overview](/concepts/metrics/agentic/agentic-overview)
* [Action Completion](/concepts/metrics/agentic/action-completion)
* [Action Advancement](/concepts/metrics/agentic/action-advancement)

# Tool Selection Quality

Source: https://docs.galileo.ai/concepts/metrics/agentic/tool-selection-quality

Evaluate tool selection quality in AI agents using Galileo Guardrail Metrics to ensure agents choose appropriate tools with correct parameters

Tool Selection Quality determines whether the agent selected the correct tool and, for each tool, the correct arguments.

This metric is particularly valuable for evaluating agentic AI systems where the model must decide which tools to use and how to use them correctly. Poor tool selection can lead to ineffective or incorrect responses.
[Figure: scale showing the relationship between Tool Selection Quality and the potential impact on your AI system]

## Calculation method

Tool Selection Quality is computed through a multi-step process:

1. Multiple evaluation requests are sent to an LLM evaluator (e.g., OpenAI's GPT-4o mini) to analyze the agent's tool selection decisions.
2. A carefully engineered chain-of-thought prompt guides the model to evaluate whether the selected tools and their parameters were appropriate for the task.
3. The system requests multiple distinct responses to this prompt to ensure robust evaluation through consensus.
4. Each evaluation generates both an explanation of the reasoning and a binary judgment (yes/no) on tool selection appropriateness.
5. The final Tool Selection Quality score is computed as the ratio of positive ('yes') responses to the total number of evaluation responses.

We also surface one of the generated explanations, always choosing one that aligns with the majority judgment among the responses.

This metric is computed by prompting an LLM multiple times, and thus requires additional LLM calls to compute, which may impact usage and billing.

## Understanding tool selection quality

### When Tool Selection is Evaluated

Tool Selection Quality evaluates different scenarios:

* No Tool Needed: The assistant is not expected to call tools if there are no unanswered user queries, if no tools can help answer any query, or if all the information to answer is contained in the history.
* Tool Needed: When tools should be used, the turn is considered successful if the agent selected the correct tool and provided all required arguments with correct values.
* Unsuccessful Selection: If the agent calls tools when it shouldn't, or selects the wrong tool/arguments when it should call tools, the turn is considered unsuccessful.
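
The aggregation described under Calculation method (ratio of 'yes' judgments, plus an explanation that agrees with the majority) can be sketched as follows. This is an illustration under stated assumptions, not Galileo's implementation: the `Judgment` type and the function name are hypothetical, and resolving ties toward 'yes' is an arbitrary choice made here because the source does not specify tie handling.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One evaluator response (hypothetical structure)."""
    verdict: bool       # True for 'yes', False for 'no'
    explanation: str


def tool_selection_quality(judgments: list[Judgment]) -> tuple[float, str]:
    """Return (score, explanation) from a non-empty list of binary judgments."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    yes_count = sum(j.verdict for j in judgments)
    # Score: ratio of positive responses to the total number of responses.
    score = yes_count / len(judgments)
    # Majority verdict; ties go to 'yes' (an assumption, see lead-in).
    majority = yes_count * 2 >= len(judgments)
    # Surface the first explanation that agrees with the majority verdict.
    explanation = next(j.explanation for j in judgments if j.verdict == majority)
    return score, explanation


judgments = [
    Judgment(True, "Correct tool and arguments."),
    Judgment(False, "The date argument looks wrong."),
    Judgment(True, "Tool choice matches the user request."),
]
score, explanation = tool_selection_quality(judgments)

if __name__ == "__main__":
    print(round(score, 2), explanation)
```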
## Optimizing your AI system

### Addressing Low Tool Selection Quality

When a response has a low Tool Selection Quality score, consider these improvements:

* Analyze error patterns: Identify common mistakes in tool selection or parameter usage.
* Improve tool descriptions: Enhance tool documentation with clearer descriptions of when and how to use each tool.
* Refine system prompts: Update instructions to provide better guidance on tool selection criteria.
* Consider model capabilities: Some models may be better at tool selection than others.
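
One lightweight way to analyze error patterns is to count which tool was selected when a different one was expected. The sketch below assumes you have review records pairing the selected tool with the tool a reviewer expected; the record format, the tool names, and `misselection_counts` are hypothetical.

```python
from collections import Counter

# Hypothetical review records: (tool the agent selected, tool a reviewer expected).
records = [
    ("web_search", "get_weather"),
    ("get_weather", "get_weather"),
    ("web_search", "get_weather"),
    ("calculator", "calculator"),
    ("web_search", "calculator"),
]


def misselection_counts(records: list[tuple[str, str]]) -> Counter:
    """Count (selected, expected) pairs where the agent picked the wrong tool."""
    return Counter((selected, expected) for selected, expected in records if selected != expected)


most_common_mistake = misselection_counts(records).most_common(1)

if __name__ == "__main__":
    print(most_common_mistake)
```

A pair that recurs often, such as a general-purpose tool chosen over a specialized one, points at the tool descriptions or system prompt that most need refinement.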
## Best practices

* Provide detailed descriptions for each tool, including when to use it and what parameters are required.
* Implement validation for tool parameters to prevent incorrect usage and provide helpful error messages.
* Track which tools are frequently misused to identify opportunities for improvement in tool design or documentation.
* Provide examples of correct tool usage in different scenarios to help the agent learn appropriate selection patterns.

Tool Selection Quality is most useful in Agentic Workflows, where an LLM decides the course of action to take by selecting a Tool. This metric helps you detect whether the right course of action was taken by the Agent.

# Improve LLM-as-a-Judge Metrics with Autotune

Source: https://docs.galileo.ai/concepts/metrics/autotune-llm-as-a-judge-metrics

Use Autotune to turn feedback into prompt improvements that make LLM-as-a-judge metrics more accurate for your use case.

LLM-as-a-judge metrics evaluate LLM application outputs at scale, but may not reflect your team's domain-specific standards out of the box. Whether you're adapting a preset metric to a new domain or refining a custom metric that still isn't accurate enough, the metric prompt often needs tuning to capture your specific evaluation criteria, and doing that manually is time-consuming and hard to scale. Teams typically rewrite prompts, test changes, and repeat that cycle across multiple rounds with no guarantee the result is right.

Autotune lets anyone involved in building or reviewing metrics (annotators, product managers, or developers) provide feedback on metric outputs instead of editing prompts directly. Reviewers correct results and explain their reasoning in natural language. Galileo translates that feedback into prompt improvements and shows exactly what changed.
## When to use Autotune

Use Autotune to improve metric performance when:

* A new custom metric isn't accurate enough for your use case
* An existing metric isn't generalizing well to a new domain or use case
* An existing metric is producing inconsistent results with low reviewer agreement in production
* The current prompt isn't handling domain-specific edge cases reliably
* Manual prompt iteration is too time-consuming to scale

## How it works

### See Autotune in action