Eval engineering for AI developers is a 5-part course run as a series of live streams, hosted by Jim Bennett, Principal Developer Advocate at Galileo.

90% of AI agents don't make it successfully to production. The biggest reason is that the AI engineers building these apps don't have a clear way to evaluate whether their agents are doing what they should do, or to use the results of that evaluation to fix them. In this course, you will learn all about evals for AI applications. You'll start with some out-of-the-box metrics and learn what evals are, then move on to understanding observability for AI apps, analyzing failure states, and defining custom metrics, before finally using these across your whole SDLC. This is hands-on, so be prepared to write some code, create some metrics, and do some homework!

Documentation Index
Fetch the complete documentation index at: https://docs.galileo.ai/llms.txt
Use this file to discover all available pages before exploring further.
Prerequisites
- Basic knowledge of Python, and Python 3.10 or higher installed
- An OpenAI API key. Other LLMs are supported, but the code samples use the OpenAI SDK.
- A Galileo account. The free account is fine for these lessons.
- A clone or download of the Eval engineering GitHub repo
Lessons
Lesson 1 - Hello Evals
In this first lesson, you will:
- Learn what evals are
- Learn how you can use simple evals to detect issues in an AI application
- Get hands-on experience adding an eval to an app
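As a taste of what a simple eval looks like, here is a minimal sketch: a pass/fail check that a model's response contains a required phrase, run over a handful of hand-written cases. The function name, example responses, and cases are illustrative, not from the course code or the Galileo SDK.

```python
# A simple eval: pass if the model's response mentions a required phrase.
def contains_required_phrase(response: str, phrase: str) -> bool:
    return phrase.lower() in response.lower()

# Tiny eval run over hand-written cases: (response, phrase, expected result)
cases = [
    ("Our refund policy allows returns within 30 days.", "30 days", True),
    ("Please contact support for details.", "30 days", False),
]

results = [contains_required_phrase(r, p) == expected for r, p, expected in cases]
pass_rate = sum(results) / len(results)
print(f"Pass rate: {pass_rate:.0%}")  # → Pass rate: 100%
```

Even an eval this simple can catch regressions: re-run it after every prompt or model change and watch the pass rate.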
Lesson 2 - Observability in AI apps
In this second lesson, you will:
- Use observability to visualize the components of a typical multi-agent AI application
- Learn about the different components that make up these applications
- Apply some out-of-the-box metrics to start building an understanding of how your application is behaving
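The core idea behind observability for AI apps can be sketched in a few lines: record a span (name and duration) for each component of the pipeline, such as a retrieval step and an LLM call. This is purely illustrative; a platform like Galileo captures far richer traces, and the function names here are made up.

```python
import time
from functools import wraps

spans: list[dict] = []  # collected spans: one per traced component call

def traced(fn):
    """Record a span (component name + duration) around each call."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = fn(*args, **kwargs)
        spans.append({"name": fn.__name__, "duration_s": time.time() - start})
        return result
    return wrapper

@traced
def retrieve(query: str) -> list[str]:
    return ["doc1"]  # stand-in for a real retrieval step

@traced
def generate(query: str, docs: list[str]) -> str:
    return f"Answer based on {docs}"  # stand-in for a real LLM call

generate("refund policy", retrieve("refund policy"))
print([s["name"] for s in spans])  # → ['retrieve', 'generate']
```

With spans like these you can see which components ran, in what order, and where the time went, which is exactly the view you need before attaching metrics to individual steps.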
Lesson 3 - Failure analysis
In this third lesson, you will:
- Learn the process for finding failures in your AI applications
- Build out rubrics for identifying failure cases
- Learn how to group failure cases into themes that can be used for building evals
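One way to make a rubric concrete is to express it as data: each failure theme gets a description and examples, and raw failure notes get tagged against the themes. The themes and the crude keyword tagging below are illustrative assumptions, not the course's rubric.

```python
# A failure-analysis rubric as data: theme → description + examples.
rubric = {
    "hallucinated_policy": {
        "description": "Response states a policy not found in the source docs",
        "examples": ["Claims a 90-day return window when docs say 30 days"],
    },
    "refused_valid_request": {
        "description": "Response declines a request the app should handle",
        "examples": ["'I cannot help with refunds' for a refund question"],
    },
}

def tag_failure(note: str) -> list[str]:
    """Crude keyword tagging to group raw failure notes into themes."""
    tags = []
    if "90-day" in note:
        tags.append("hallucinated_policy")
    if "cannot help" in note.lower():
        tags.append("refused_valid_request")
    return tags

print(tag_failure("Model said 'I cannot help with refunds'"))  # → ['refused_valid_request']
```

In practice the tagging is done by a human (or an LLM judge) reading each failure, but the output is the same: themed groups that each become a candidate eval.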
Lesson 4 - Build custom metrics
In this fourth lesson, you will:
- Build datasets of known inputs and outputs for cases that pass and fail
- Learn how to build custom metrics for your failure cases
- Determine the success of your metrics by measuring true and false positives and negatives
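Measuring a metric against a labeled dataset comes down to counting the four outcomes and deriving precision and recall from them. A minimal sketch, with an illustrative function name and made-up predictions/labels:

```python
def score_metric(predictions: list[bool], labels: list[bool]) -> dict:
    """Compare a metric's pass/fail predictions against ground-truth labels."""
    tp = sum(p and l for p, l in zip(predictions, labels))          # true positives
    tn = sum((not p) and (not l) for p, l in zip(predictions, labels))  # true negatives
    fp = sum(p and (not l) for p, l in zip(predictions, labels))    # false positives
    fn = sum((not p) and l for p, l in zip(predictions, labels))    # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}

# predictions: the metric flagged a failure; labels: it really was a failure
print(score_metric([True, True, False, False], [True, False, False, True]))
```

A metric with high false positives (low precision) creates noise; one with high false negatives (low recall) misses real failures. Both numbers matter when deciding whether a custom metric is trustworthy.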
Lesson 5 - Eval engineering in your SDLC
In this final lesson, you will:
- Learn how evals fit into the SDLC
- Build unit tests using evals that can be run in your CI/CD pipeline
- Learn about using evals as guardrails at runtime
- Add observability and alerts to detect when your application is failing
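The CI/CD idea above can be sketched as an eval wrapped in a plain unit test, runnable with pytest in a pipeline. The app function, eval, and test here are hypothetical stand-ins, not the course's actual code:

```python
def answer_question(question: str) -> str:
    # Stand-in for your real LLM call.
    return "Returns are accepted within 30 days of purchase."

def eval_mentions_policy(response: str) -> bool:
    # The eval: the answer must state the 30-day window.
    return "30 days" in response

def test_refund_answer_mentions_policy():
    # A pytest-style unit test: fails the build if the eval fails.
    response = answer_question("What is your refund policy?")
    assert eval_mentions_policy(response)
```

Because the eval is just a function, the same check can run in three places: as a unit test in CI/CD, as a runtime guardrail before a response is returned, and as a monitored metric that triggers alerts in production.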