Clone this notebook to create this run in your Galileo cluster: https://colab.research.google.com/drive/1LfFEe8MlZuKU_a41z8SEH0to4OOfrWnO

In this example, we will demonstrate how to integrate a topic detection model into a Galileo run through a Galileo CustomMetric.

Setup: Install Library and Set Up Variables

We will use promptquality, the Python client to interact with Galileo’s GenAI Studio: Evaluate.


    !pip install promptquality

Next, we will set our Galileo cluster URL, API key, and project name to define where our results are logged.


    import os
    import promptquality as pq
    from google.colab import userdata

    ### Set variables and env variables ###
    os.environ['GALILEO_API_KEY'] = GALILEO_API_KEY = userdata.get('GALILEO_API_KEY_DEMO')
    os.environ['GALILEO_CONSOLE_URL'] = GALILEO_CONSOLE_URL = 'https://console.demo.rungalileo.io/'
    GALILEO_PROJECT_NAME = 'hotpotqa_topicdetection'

    # 🔭🌕 Logging in to the console
    config = pq.login(os.environ['GALILEO_CONSOLE_URL'])
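
The cell above relies on google.colab.userdata, which is only available inside Colab. When running elsewhere, one option is a small helper like the sketch below. The name get_secret is hypothetical and not part of promptquality; it reads a value from the environment and prompts for it only when it is missing.

```python
import os
from getpass import getpass

def get_secret(name: str) -> str:
    """Return a secret from the environment, prompting only if it is missing.

    Hypothetical helper for running this example outside Colab.
    """
    value = os.environ.get(name)
    if not value:
        # getpass hides the typed value so it does not end up in notebook output
        value = getpass(f"Enter {name}: ")
        os.environ[name] = value
    return value
```

With this in place, `get_secret('GALILEO_API_KEY')` could stand in for the `userdata.get(...)` call, assuming the key has been exported in your shell or is typed in at the prompt.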

Construct Dataset: Subsample of HotpotQA

We will use a subsample of HotpotQA, a public Q&A dataset with questions, contexts, and ground-truth aliases. HotpotQA includes easy, medium, and hard questions that remain challenging even for recent LLM releases.

In lieu of evaluating model responses against the ground truths, we can leverage Galileo’s metrics to gauge hallucinations.


    import urllib.request
    import pandas
    import json

    def parse_context(context):
        parsed_context = ""
        for item in context:
            title = item[0]
            contents = " ".join(item[1])
            parsed_context += f"{title}: {contents}\n"
        return parsed_context.strip()

    url = 'http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_dev_fullwiki_v1.json'
    with urllib.request.urlopen(url) as urlo:
        json_data = json.load(urlo)

    data = pandas.DataFrame(json_data)
    data['parsed_context'] = data['context'].apply(parse_context)

    dataset = {
        'question': data['question'].iloc[0:50].tolist(),
        'context': data['parsed_context'].iloc[0:50].tolist()
    }
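
To make the flattening step concrete, here is a minimal sketch that runs parse_context on a tiny made-up context. The function is repeated so the snippet runs standalone, and the sample titles and sentences are hypothetical, chosen only to mirror the [title, [sentences]] shape the function expects.

```python
def parse_context(context):
    """Flatten a list of [title, [sentences]] pairs into 'title: text' lines."""
    parsed_context = ""
    for item in context:
        title = item[0]
        contents = " ".join(item[1])
        parsed_context += f"{title}: {contents}\n"
    return parsed_context.strip()

# Hypothetical sample in the same shape as a HotpotQA 'context' entry
sample_context = [
    ["Title A", ["Sentence one.", "Sentence two."]],
    ["Title B", ["Only sentence."]],
]

flattened = parse_context(sample_context)
print(flattened)
```

Each source document becomes one line, so the model sees a compact block of text in the {context} placeholder.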

Define our Classification Pipeline

We will use a checkpoint of bart-large that has been trained on the MultiNLI (MNLI) dataset, a collection of sentence pairs annotated with textual entailment information. An entailment model can score whether a text entails a statement built from each candidate label, which makes it well suited as an off-the-shelf zero-shot topic classification model.
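
As a rough illustration of how an entailment model can act as a zero-shot classifier, the sketch below turns each candidate label into a hypothesis sentence and normalizes per-label entailment scores with a softmax. The hypothesis wording and the logit values are assumptions made for illustration; no model is called here, and a real pipeline would produce the logits itself.

```python
import math

def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into a hypothesis sentence (template is an assumption)."""
    return [template.format(label) for label in labels]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["sports", "music", "science"]
hypotheses = build_hypotheses(labels)

# Hypothetical entailment logits, one per hypothesis; an NLI model would supply these
entailment_logits = [0.2, 3.1, -1.0]
scores = softmax(entailment_logits)

# Rank labels from most to least likely
ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
print(ranked[0][0])
```

The label whose hypothesis is most strongly entailed by the text ends up first, which is the "top label" used later in this example.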


    from transformers import pipeline

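    # Hugging Face reads the 'HF_TOKEN' environment variable to authenticate the model download
    os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')
    # Name the task explicitly rather than relying on it being inferred from the model
    pipe = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")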

Implementing our Pipeline as a Galileo CustomMetric

We will define a small space of candidate labels for this zero-shot topic detection task.

We also define an executor and aggregator function. The executor is a row-level calculator, while the aggregator consolidates all of the calculated row values.

In this example, we want to publish both the top topic and its score, so our executor will serialize both values into a JSON string. Our aggregator will then parse that string to extract the numeric label score for aggregation.

When we invoke our run, the executor and aggregator will be computed within your Python runtime / notebook / application.


    import json

    candidate_labels = ["sports", "music", "science", "history", "technology"]

    # The executor is a row-level function: it classifies one response and
    # returns the top label and its score serialized as a JSON string.
    def executor_topicdetect(row) -> str:
        pipe_out = pipe(row.response, candidate_labels=candidate_labels)
        return json.dumps({'top_label': pipe_out['labels'][0], 'top_score': pipe_out['scores'][0]})

    # The aggregator receives the row-level outputs of all rows and computes an
    # aggregate over them (e.g. mean, median, P95); here, the mean top score.
    def aggregator_topicdetect(scores, indices) -> dict:
        scores_parse = [float(json.loads(score)['top_score']) for score in scores]
        return {'Average Topic Score (Top Label)': sum(scores_parse) / len(scores_parse)}
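
Before launching a run, it can help to dry-run an executor and aggregator locally. The sketch below does this with a stand-in classifier so that no model download is needed. fake_pipe, its fixed scores, and the sample responses are all hypothetical; the rows are simple objects with a response attribute, and passing a list of integers as indices is an assumption about how the aggregator is called.

```python
import json
from types import SimpleNamespace

candidate_labels = ["sports", "music", "science", "history", "technology"]

def fake_pipe(text, candidate_labels):
    """Hypothetical stand-in returning labels and scores sorted by descending score."""
    top = "music" if "album" in text else "history"
    rest = [label for label in candidate_labels if label != top]
    return {"sequence": text, "labels": [top] + rest, "scores": [0.8] + [0.05] * len(rest)}

def executor_topicdetect(row) -> str:
    # Row-level step: classify one response and serialize the top label and score
    pipe_out = fake_pipe(row.response, candidate_labels=candidate_labels)
    return json.dumps({"top_label": pipe_out["labels"][0], "top_score": pipe_out["scores"][0]})

def aggregator_topicdetect(scores, indices) -> dict:
    # Aggregate step: parse each serialized row output and average the top scores
    parsed = [float(json.loads(score)["top_score"]) for score in scores]
    return {"Average Topic Score (Top Label)": sum(parsed) / len(parsed)}

# Made-up rows that only carry the attribute the executor reads
rows = [
    SimpleNamespace(response="The album was released in 1999."),
    SimpleNamespace(response="The treaty was signed in 1648."),
]

row_outputs = [executor_topicdetect(row) for row in rows]
summary = aggregator_topicdetect(row_outputs, indices=list(range(len(rows))))
print(row_outputs)
print(summary)
```

A check like this catches serialization mistakes, such as an aggregator that cannot parse what the executor emits, before any rows are sent to the model.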

Galileo Evaluate

Finally, we will define the metrics we are interested in: Galileo’s built-in metrics plus our CustomMetric, built with Galileo’s CustomScorer class, which takes our executor and aggregator as inputs.

We will also define our prompt template with placeholders {context} and {question}. These will be replaced by values with the same keys in our dataset.
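
One way to picture the placeholder substitution is with Python's own str.format, as in the sketch below. This is only an illustration under the assumption that each dataset row supplies one value per placeholder; how Galileo performs the substitution internally is not shown here, and the short template and sample rows are made up.

```python
# Hypothetical, shortened template with the same two placeholders
template = "Context:\n{context}\n\nQuestion: {question}"

# Column-oriented dataset: one list per key, aligned by position
dataset = {
    "question": ["Who wrote the novel?", "Where is the stadium?"],
    "context": ["Novel: It was written by an author.", "Stadium: It is in a city."],
}

# Pivot the columns into one dict per row, then fill the template for each row
rows = [dict(zip(dataset, values)) for values in zip(*dataset.values())]
prompts = [template.format(**row) for row in rows]

print(prompts[0])
```

The key point is that the placeholder names in the template must match the dataset keys exactly, otherwise a value has nowhere to go.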


    metrics = [
        pq.Scorers.context_adherence,
        pq.Scorers.correctness,
        pq.Scorers.latency,
        pq.Scorers.tone,
        pq.Scorers.sexist,
        pq.Scorers.pii,
        pq.Scorers.prompt_perplexity,
        pq.CustomScorer(name='Top Topic', executor=executor_topicdetect, aggregator=aggregator_topicdetect)
    ]

    template = """
        You are a knowledgeable assistant capable of answering a wide range of questions accurately and clearly. Given the following context, provide detailed and informative answers.

        Context:
        {context}

        Question: {question}
        """

    # run our dataset
    pq.run(project_name = GALILEO_PROJECT_NAME,
           template = template,
           dataset = dataset,
           scorers = metrics,
           settings = pq.Settings(model_alias=pq.SupportedModels.chat_gpt))

The run() execution will return a URL for you to inspect your run in the Galileo Evaluate UI.

In the run view below, you can see that the UI publishes both the row-level and aggregate calculations from our CustomMetric.