Eval engineering for AI developers is a 5-part course run as a series of live streams, hosted by Jim Bennett, Principal Developer Advocate at Galileo.

90% of AI agents don't make it successfully to production. The biggest reason is that the AI engineers building these apps don't have a clear way to evaluate whether their agents are doing what they should do, or to use the results of that evaluation to fix them. In this course, you will learn all about evals for AI applications. You'll start with some out-of-the-box metrics and learn what evals are, then move on to understanding observability for AI apps, analyzing failure states, and defining custom metrics, before finally using these across your whole SDLC. This is hands-on, so be prepared to write some code, create some metrics, and do some homework!

Documentation Index
Fetch the complete documentation index at: https://docs.galileo.ai/llms.txt
Use this file to discover all available pages before exploring further.
Prerequisites
- Basic knowledge of Python, and Python 3.10 or higher installed
- An OpenAI API key. Other LLMs are supported, but the code samples use the OpenAI SDK.
- A Galileo account. The free account is fine for these lessons.
- A clone or download of the Eval engineering GitHub repo
Lessons
Lesson 1 - Hello Evals
In this first lesson, you will:
- Learn what evals are
- Learn how you can use simple evals to detect issues in an AI application
- Get hands-on experience adding an eval to an app
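As a taste of what a simple eval looks like, here is a minimal sketch: a pass/fail check that a model's response contains a required phrase, run over a handful of hand-written cases. The function name, example responses, and cases are illustrative, not from the course code or the Galileo SDK.

```python
# A simple eval: pass if the model's response mentions a required phrase.
def contains_required_phrase(response: str, phrase: str) -> bool:
    return phrase.lower() in response.lower()

# Tiny eval run over hand-written cases: (response, phrase, expected result)
cases = [
    ("Our refund policy allows returns within 30 days.", "30 days", True),
    ("Please contact support for details.", "30 days", False),
]

results = [contains_required_phrase(r, p) == expected for r, p, expected in cases]
pass_rate = sum(results) / len(results)
print(f"Pass rate: {pass_rate:.0%}")  # → Pass rate: 100%
```

Even an eval this simple can catch regressions: re-run it after every prompt or model change and watch the pass rate.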
Lesson 2 - Observability in AI apps
In this second lesson, you will:
- Use observability to visualize the components of a typical multi-agent AI application
- Learn about the different components that make up these applications
- Apply some out-of-the-box metrics to start building an understanding of how your application is behaving
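The core idea behind observability for AI apps can be sketched in a few lines: record a span (name and duration) for each component of the pipeline, such as a retrieval step and an LLM call. This is purely illustrative; a platform like Galileo captures far richer traces, and the function names here are made up.

```python
import time
from functools import wraps

spans: list[dict] = []  # collected spans: one per traced component call

def traced(fn):
    """Record a span (component name + duration) around each call."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = fn(*args, **kwargs)
        spans.append({"name": fn.__name__, "duration_s": time.time() - start})
        return result
    return wrapper

@traced
def retrieve(query: str) -> list[str]:
    return ["doc1"]  # stand-in for a real retrieval step

@traced
def generate(query: str, docs: list[str]) -> str:
    return f"Answer based on {docs}"  # stand-in for a real LLM call

generate("refund policy", retrieve("refund policy"))
print([s["name"] for s in spans])  # → ['retrieve', 'generate']
```

With spans like these you can see which components ran, in what order, and where the time went, which is exactly the view you need before attaching metrics to individual steps.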
Lesson 3 - Failure analysis
In this third lesson, you will:
- Learn the process for finding failures in your AI applications
- Build out rubrics for identifying failure cases
- Learn how to group failure cases into themes that can be used for building evals
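One way to make a rubric concrete is to express it as data: each failure theme gets a description and examples, and raw failure notes get tagged against the themes. The themes and the crude keyword tagging below are illustrative assumptions, not the course's rubric.

```python
# A failure-analysis rubric as data: theme → description + examples.
rubric = {
    "hallucinated_policy": {
        "description": "Response states a policy not found in the source docs",
        "examples": ["Claims a 90-day return window when docs say 30 days"],
    },
    "refused_valid_request": {
        "description": "Response declines a request the app should handle",
        "examples": ["'I cannot help with refunds' for a refund question"],
    },
}

def tag_failure(note: str) -> list[str]:
    """Crude keyword tagging to group raw failure notes into themes."""
    tags = []
    if "90-day" in note:
        tags.append("hallucinated_policy")
    if "cannot help" in note.lower():
        tags.append("refused_valid_request")
    return tags

print(tag_failure("Model said 'I cannot help with refunds'"))  # → ['refused_valid_request']
```

In practice the tagging is done by a human (or an LLM judge) reading each failure, but the output is the same: themed groups that each become a candidate eval.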
Lesson 4 - Build custom metrics
In this fourth lesson, you will:
- Build datasets of known inputs and outputs for cases that pass and fail
- Learn how to build custom metrics for your failure cases
- Determine the success of your metrics by measuring true and false positives and negatives
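Measuring a metric against a labeled dataset comes down to counting the four outcomes and deriving precision and recall from them. A minimal sketch, with an illustrative function name and made-up predictions/labels:

```python
def score_metric(predictions: list[bool], labels: list[bool]) -> dict:
    """Compare a metric's pass/fail predictions against ground-truth labels."""
    tp = sum(p and l for p, l in zip(predictions, labels))          # true positives
    tn = sum((not p) and (not l) for p, l in zip(predictions, labels))  # true negatives
    fp = sum(p and (not l) for p, l in zip(predictions, labels))    # false positives
    fn = sum((not p) and l for p, l in zip(predictions, labels))    # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}

# predictions: the metric flagged a failure; labels: it really was a failure
print(score_metric([True, True, False, False], [True, False, False, True]))
```

A metric with high false positives (low precision) creates noise; one with high false negatives (low recall) misses real failures. Both numbers matter when deciding whether a custom metric is trustworthy.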
Lesson 5 - Eval engineering in your SDLC
In this final lesson, you will:
- Learn how evals fit into the SDLC
- Build unit tests using evals that can be run in your CI/CD pipeline
- Learn about using evals as guardrails at runtime
- Add observability and alerts to detect when your application is failing
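The CI/CD idea above can be sketched as an eval wrapped in a plain unit test, runnable with pytest in a pipeline. The app function, eval, and test here are hypothetical stand-ins, not the course's actual code:

```python
def answer_question(question: str) -> str:
    # Stand-in for your real LLM call.
    return "Returns are accepted within 30 days of purchase."

def eval_mentions_policy(response: str) -> bool:
    # The eval: the answer must state the 30-day window.
    return "30 days" in response

def test_refund_answer_mentions_policy():
    # A pytest-style unit test: fails the build if the eval fails.
    response = answer_question("What is your refund policy?")
    assert eval_mentions_policy(response)
```

Because the eval is just a function, the same check can run in three places: as a unit test in CI/CD, as a runtime guardrail before a response is returned, and as a monitored metric that triggers alerts in production.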