Registered Scorers
We support registering a scorer so that it can be reused across runs, projects, modules, and users within your organization. Registered Scorers run in the backend in an isolated environment that has access to a predefined set of libraries and packages.

Creating Your Registered Scorer
To define a registered scorer, create a Python file containing at least two functions that follow the signatures described below:

- `scorer_fn`: The scorer function is given the row-wise inputs and is expected to generate an output for each response. `scorer_fn` must accept `**kwargs` as its last parameter so that your registered scorer stays forward-compatible. The first sketch after this list shows the full list of parameters currently supported; it checks the output against the ground truth and returns the absolute difference in length. `node_name`, `node_type`, `node_id`, and `tools` are all specific to workflows/multi-step chains. `dataset_variables` contains key-value pairs of variables passed in from the dataset in prompt evaluation runs, but can also be used to get the target/ground truth in multi-step runs; dataset variables are not available for Evaluate workflows / Observe. The `index` parameter is the index of the row in the dataset, `node_input` is the input to the node, and `node_output` is the output from the node.
- `aggregator_fn`: The aggregator function is only used in Evaluate, not Observe. It takes in an array of the row-wise outputs from your scorer and lets you generate aggregates from them (see the sketches after this list for the expected signature). For aggregated values that you want to output from your scorer, return them as key-value pairs, with the key serving as the label and the value being the aggregated score.
- (Optional, but recommended) `score_type`: The `score_type` function defines the `Type` of the score that your scorer generates (see the sketches after this list for the expected signature). Note that the return value is a `Type` object like `float`, not a value of that type. Defining this function is necessary for sorting and filtering by scores to work correctly. If you don’t define this function, the scorer is assumed to generate `float` scores by default.
- (Optional) `scoreable_node_types_fn`: If you want to restrict your scorer to only run on specific node types, define this function to return the list of node types your scorer should run on (see the sketches after this list for the expected signature). If you don’t define this function, your scorer will run on `llm` and `chat` nodes by default. The sketch also includes an example of a `scoreable_node_types_fn` that restricts the scorer to only run on `retriever` nodes.
- (Optional) `include_llm_credentials`: Set this to `True` if you want access to the LLM credentials of the user who created the Observe project / Evaluate run during the execution of the registered scorer. This is expected to be a boolean value, and OpenAI credentials are the only ones currently supported. By default it is assumed to be `False`, and your scorer will not have access to LLM credentials. If you do enable it, the credentials will be included in calls to `scorer_fn` under the keyword argument `credentials`, as a dictionary whose keys are the integration names, if available, and whose values are the credentials. For example, if the user has an OpenAI integration, the dictionary will contain an entry for that integration.
- (Optional) `chain_aggregation`: Set this field to one of `sum`, `average`, `first`, or `last` to specify how the scores across each sub-node in a chain should be aggregated up to the chain level. This is only applicable if your scorer is used in a multi-step chain. If not specified, the default is no aggregation.
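Putting the two required functions together, here is a minimal sketch of a scorer file based on the parameter list above. The parameter names come from the description above; the keyword-only style, the type hints, and the use of a `target` dataset variable for the ground truth are assumptions:

```python
from typing import Any, Dict, List, Optional

def scorer_fn(
    *,
    index: int,
    node_input: str,
    node_output: str,
    node_name: Optional[str] = None,
    node_type: Optional[str] = None,
    node_id: Optional[str] = None,
    tools: Optional[List[Dict[str, Any]]] = None,
    dataset_variables: Optional[Dict[str, str]] = None,
    **kwargs: Any,
) -> float:
    # Compare the output against the ground truth and return the absolute
    # difference in length (assuming the ground truth is supplied via a
    # "target" dataset variable).
    target = (dataset_variables or {}).get("target", "")
    return abs(len(node_output) - len(target))

def aggregator_fn(*, scores: List[float]) -> Dict[str, float]:
    # Return aggregates as key-value pairs: the key is the label, the value
    # is the aggregated score.
    return {
        "Total Length Difference": sum(scores),
        "Average Length Difference": sum(scores) / len(scores) if scores else 0.0,
    }
```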
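The optional `score_type` and `scoreable_node_types_fn` functions might look like the following sketch (assuming both take no arguments); this variant restricts the scorer to `retriever` nodes:

```python
from typing import List, Type

def score_type() -> Type:
    # Return the Type object of the scores (float here), not a value of that
    # type. If omitted, float is assumed.
    return float

def scoreable_node_types_fn() -> List[str]:
    # Restrict the scorer to retriever nodes; if omitted, the scorer runs on
    # "llm" and "chat" nodes by default.
    return ["retriever"]
```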
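Finally, a sketch of the optional credential and chain-aggregation settings, assuming both are plain module-level assignments in the scorer file; the exact contents of the `credentials` dictionary depend on the user’s integration:

```python
from typing import Any

# Pass the creating user's LLM credentials (currently OpenAI only) to scorer_fn
# via the `credentials` keyword argument. Assumed to be False when not set.
include_llm_credentials = True

# How sub-node scores are rolled up to the chain level in multi-step chains:
# one of "sum", "average", "first" or "last". Default is no aggregation.
chain_aggregation = "average"

def scorer_fn(*, node_output: str, **kwargs: Any) -> float:
    # When include_llm_credentials is True, the credentials dictionary is keyed
    # by integration name (e.g. the user's OpenAI integration).
    credentials = kwargs.get("credentials", {})
    # ... use the credentials to call the LLM as needed ...
    return float(len(node_output))
```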
Registering Your Scorer
Once you’ve created your scorer file, you can register it by providing a name and the path to the scorer file.
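A minimal sketch using the `promptquality` Python client; the `register_scorer` argument names are assumptions based on the client and may differ in your version:

```python
import promptquality as pq

# Register the scorer under a reusable name, pointing at the scorer file.
registered_scorer = pq.register_scorer(
    scorer_name="response_length",
    scorer_file="scorer.py",
)
```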
Using Your Registered Scorer

To use your scorer during a prompt run (or sweep), simply pass it in alongside any of the other scorers.
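For instance, a sketch of a prompt run that includes the registered scorer from the previous step alongside a built-in scorer (the template, dataset, and built-in scorer name are placeholders, and the `pq.run` arguments may differ in your client version):

```python
import promptquality as pq

pq.run(
    template="Explain {topic} in simple terms.",               # placeholder template
    dataset={"topic": ["quantum physics", "photosynthesis"]},  # placeholder dataset
    scorers=[
        pq.Scorers.toxicity,   # a built-in scorer, included as an example
        registered_scorer,     # the scorer returned by pq.register_scorer above
    ],
)
```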
Example

For example, let’s say we want to create a custom metric that measures the length of the response. In our Python environment, we would define a `scorer_fn` function and an `aggregator_fn` function.
- Create a `scorer.py` file.
- Register the scorer.
- Use the scorer in your prompt run (all three steps are sketched together below).
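Here is a sketch of all three steps under the same assumptions as the earlier snippets (the keyword-only signatures and the `pq.register_scorer` / `pq.run` argument names are assumptions, and the aggregate labels are illustrative):

```python
# scorer.py
from typing import Any, Dict, List

def scorer_fn(*, node_output: str, **kwargs: Any) -> float:
    # Row-level metric: the length of the response.
    return float(len(node_output))

def aggregator_fn(*, scores: List[float]) -> Dict[str, float]:
    # Run-level aggregates of the row-level lengths.
    return {
        "Total Response Length": sum(scores),
        "Average Response Length": sum(scores) / len(scores) if scores else 0.0,
    }
```

```python
# In your Python environment: register the scorer, then use it in a prompt run.
import promptquality as pq

response_length = pq.register_scorer(
    scorer_name="response_length",
    scorer_file="scorer.py",
)

pq.run(
    template="Explain {topic} in simple terms.",               # placeholder template
    dataset={"topic": ["quantum physics", "photosynthesis"]},  # placeholder dataset
    scorers=[response_length],
)
```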
Execution Environment
Your scorer will be executed in a Python 3.10 environment. You can add arbitrary additional Python libraries (for example, `openai`) by declaring them in a comment snippet at the top of your scorer file.
What if I need to use other libraries or packages?
If you need to use other libraries or packages, you may use ‘Custom Scorers’. Custom Scorers run in your notebook environment. Because they run locally, they won’t be available for runs created from the UI or for Observe projects.

| | Registered Scorers | Custom Scorers |
|---|---|---|
| Creating the custom metric | Created from the Python client, can be activated through the UI | Created via the Python client |
| Sharing across the organization | Accessible within the Galileo console across different projects and modules | Outside Galileo, accessible only to the current project |
| Accessible modules | Evaluate and Observe | Evaluate |
| Scorer Definition | As an independent Python file | Within the notebook |
| Execution Environment | Server-side | Within your Python environment |
| Python Libraries available | Limited to a Galileo-provided execution environment | Any library within your virtual environment |
| Execution Resources | Restricted by Galileo | Any resources available to your local instance |
How do I create a local “Custom Scorer”?
Custom scorers can be created from two Python functions (an `executor` and an `aggregator` function, as defined below). Common types include:

- Heuristics/custom rules: checking for regex matches or the presence/absence of certain keywords or phrases.
- Model-guided: utilizing a pre-trained model to check for specific entities (e.g. PERSON, ORG), or asking an LLM to grade the quality of the output.
The two functions play the same roles as a registered scorer’s `scorer_fn` and `aggregator_fn`, but are named **executor** and **aggregator** instead.
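A minimal sketch of the two functions, assuming the executor receives a row object exposing the model’s response and returns a float, and the aggregator receives the list of row-level scores plus their row indices (the row type and the `indices` parameter are assumptions based on the promptquality client):

```python
from typing import Dict, List

def executor(row) -> float:
    # Row-level score: the length of the model's response. The `response`
    # attribute on the row object is an assumption.
    return float(len(row.response))

def aggregator(scores: List[float], indices: List[int]) -> Dict[str, float]:
    # Run-level aggregates of the row-level scores.
    return {
        "Total Response Length": sum(scores),
        "Average Response Length": sum(scores) / len(scores) if scores else 0.0,
    }
```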
Once defined, pass your custom scorer to the `scorers` parameter inside `pq.run` or `pq.run_sweep`, `pq.EvaluateRun`, or `pq.GalileoPromptCallback`.
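For example, assuming the client exposes a `CustomScorer` wrapper that bundles the two functions (the wrapper name and its argument names are assumptions; check the promptquality reference for the exact form):

```python
import promptquality as pq

# Wrap the executor and aggregator defined above into a custom scorer.
response_length = pq.CustomScorer(
    name="Response Length",
    executor=executor,
    aggregator=aggregator,
)

pq.run(
    template="Explain {topic} in simple terms.",               # placeholder template
    dataset={"topic": ["quantum physics", "photosynthesis"]},  # placeholder dataset
    scorers=[response_length],
)
```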