Datasets serve as inputs to an Evaluate run. Each column in a dataset represents a variable that can be used within a prompt template.

Using Datasets in the Galileo Console

Create a dataset

From the Datasets page, click the “Create Dataset” button.

You can upload a CSV, or enter data directly into the table.

Using a dataset in an evaluation run

When creating a new evaluation run, you can select a dataset to use as input.

Using Datasets in code

Prerequisites

For Python, install the promptquality library.

For TypeScript, install the @rungalileo/galileo package.
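Assuming a standard pip and npm setup, the installs look like:

```shell
# Python client
pip install promptquality

# TypeScript client
npm install @rungalileo/galileo
```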

Create a dataset

You can create a new dataset by running:

import os
import promptquality as pq

pq.login(os.environ["GALILEO_CONSOLE_URL"])

dataset = pq.create_dataset(
    {
        "virtue": ["benevolence", "trustworthiness"],
        "voice": ["Oprah Winfrey", "Barack Obama"],
    }
)

This function accepts the dataset in a few different formats.

  1. A dictionary mapping column names to lists of values (as shown above).

  2. A list of dictionaries, where each dictionary represents a row in the dataset, e.g.

    dataset = pq.create_dataset(
        [
            {"virtue": "benevolence", "voice": "Oprah Winfrey"},
            {"virtue": "trustworthiness", "voice": "Barack Obama"},
        ]
    )
    
  3. A path to a file in CSV, Feather, or JSONL format, e.g.

    dataset = pq.create_dataset("path/to/dataset.csv")
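The first two formats describe the same data, one column-oriented and one row-oriented. A quick sketch in plain Python (no Galileo calls) of converting the column dictionary from format 1 into the list of row dictionaries from format 2:

```python
# Column-oriented: one list of values per column, as in format 1.
columns = {
    "virtue": ["benevolence", "trustworthiness"],
    "voice": ["Oprah Winfrey", "Barack Obama"],
}

# Row-oriented: one dictionary per row, keyed by column name, as in format 2.
# zip(*columns.values()) pairs up the i-th value of every column.
rows = [dict(zip(columns, values)) for values in zip(*columns.values())]

print(rows[0])  # {'virtue': 'benevolence', 'voice': 'Oprah Winfrey'}
```

Either shape can be passed to `pq.create_dataset` and yields the same dataset.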
    

Using a dataset in an evaluation run

To use the dataset in an evaluation run, provide the dataset ID to the run function (Python only).

template = "Explain {virtue} to me in the voice of {voice}"

pq.run(
    project_name="test_dataset_project",
    template=template,
    dataset=dataset.id,
    settings=pq.Settings(
        model_alias="ChatGPT (16K context)", temperature=0.8, max_tokens=400
    ),
)
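Each row in the dataset fills the template's placeholders, so the run above produces one prompt per row. Galileo performs the substitution server-side, but the column-to-placeholder mapping works like standard `str.format` interpolation; a local sketch of what the two rows expand to:

```python
template = "Explain {virtue} to me in the voice of {voice}"

# The two rows of the dataset created earlier.
rows = [
    {"virtue": "benevolence", "voice": "Oprah Winfrey"},
    {"virtue": "trustworthiness", "voice": "Barack Obama"},
]

# One rendered prompt per dataset row.
for row in rows:
    print(template.format(**row))
# Explain benevolence to me in the voice of Oprah Winfrey
# Explain trustworthiness to me in the voice of Barack Obama
```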

Note that the TypeScript client does not currently support creating runs. However, you can use the dataset for logging workflows.

Getting the contents of a dataset

You can list the dataset’s contents like so:

rows = pq.get_dataset_content(dataset.id)
for row in rows:
    print(row)