Toxicity
Understand Galileo’s Toxicity Metric
Definition: Flags whether a response contains hateful or toxic content. The output is a binary classification: toxic or not toxic.
Calculation: We leverage a Small Language Model (SLM) trained on open-source and internal datasets.
The model's accuracy averages 96% on the validation sets of the open-source datasets used for evaluation.
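Galileo's internal SLM is not public, so as an illustration, here is a minimal sketch of the same binary-classification pattern using an open-source toxicity classifier from Hugging Face. The model name and the 0.5 threshold are assumptions, not Galileo's actual configuration:

```python
# Illustrative only: Galileo's internal SLM is not public. This sketch shows
# the binary-classification pattern with an open-source toxicity model
# ("unitary/toxic-bert" is an assumption; any similar classifier works).
from transformers import pipeline

classifier = pipeline("text-classification", model="unitary/toxic-bert")

def is_toxic(response: str, threshold: float = 0.5) -> bool:
    """Return True when the classifier flags the response as toxic."""
    result = classifier(response)[0]  # e.g. {"label": "toxic", "score": 0.97}
    return result["label"] == "toxic" and result["score"] >= threshold

print(is_toxic("Have a wonderful day!"))  # expected: False
```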
Usefulness: Identify responses that contain toxic comments and take preventative measures, such as fine-tuning the model or implementing guardrails that flag toxic responses, to prevent future occurrences. A minimal guardrail sketch follows below.
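As a sketch of the guardrail idea (the helper names and fallback message here are hypothetical, not a Galileo API): intercept each response, substitute a safe fallback when the toxicity flag fires, and log the flagged response so it can feed later fine-tuning.

```python
# Hypothetical guardrail built on the is_toxic() sketch above: block a flagged
# response and record it for later review or fine-tuning.
FALLBACK = "I can't share that response. Let me try to help another way."

def log_for_review(response: str) -> None:
    # Hypothetical sink; in practice this might write to a dataset or queue.
    print(f"[flagged for review] {response!r}")

def guard_response(response: str) -> str:
    """Return the response unchanged, or a safe fallback if it is toxic."""
    if is_toxic(response):       # binary toxicity flag
        log_for_review(response) # collect examples for future fine-tuning
        return FALLBACK
    return response
```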