Discover Galileo’s data drift detection methods to monitor AI model performance, identify data changes, and maintain model reliability in production.
Data Drift in Galileo: Detecting data samples that appear to come from a distribution different from the training data distribution
K Core-Distance(x) = cosine distance to x’s kth nearest neighbor
Map the training embedding distribution → K Core-Distance distribution
e.g. Threshold at 95% precision: the K Core-Distance at the 95th percentile of the reference distribution
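The steps above can be sketched in code. This is a minimal illustration, not Galileo's actual implementation: the function name `k_core_distances`, the random stand-in embeddings, and k = 5 are all assumptions.

```python
import numpy as np

def k_core_distances(embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Cosine distance from each row to its k-th nearest neighbor."""
    # Normalize rows so that cosine distance = 1 - dot product.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dists = 1.0 - normed @ normed.T           # pairwise cosine distances
    np.fill_diagonal(dists, np.inf)           # a point is not its own neighbor
    return np.sort(dists, axis=1)[:, k - 1]   # distance to the k-th nearest neighbor

# Stand-in for real training embeddings (hypothetical data).
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 32))

reference = k_core_distances(train, k=5)   # reference K Core-Distance distribution
threshold = np.percentile(reference, 95)   # K Core-Distance at the 95th percentile
```

A query sample whose K Core-Distance to the training set exceeds `threshold` would then be flagged as out of coverage.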
Given a query sample, how does it compare to the reference distribution?

Moreover, by picking a threshold based on a distribution percentile, we remove any dependence on the range of K Core-Distances for a given dataset, i.e. a dataset-agnostic mechanism.

Drift / Out of Coverage Scores: Building on this distributional perspective, we can compute a per-sample score indicating how out of distribution a data sample is.
Drift / Out of Coverage Score: the percentile a sample falls in with respect to the reference K Core-Distance distribution.

Unlike analyzing K Core-Distances directly, our drift / out of coverage score is fully dataset agnostic. For example, consider the example from above: with a K Core-Distance of 0.33 and a threshold of 0.21, we considered q drifted. However, 0.33 on its own has very little meaning without context. In comparison, a drift score of 0.99 captures the necessary distributional context, indicating that q falls in the 99th percentile of the reference distribution and is very likely out of distribution.
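A score of this form can be sketched as the empirical percentile of a query's K Core-Distance within the reference distribution. The helper name `drift_score` and the synthetic reference values are illustrative assumptions, not the library's API:

```python
import numpy as np

def drift_score(query_core_distance: float, reference: np.ndarray) -> float:
    """Empirical percentile of the query's K Core-Distance in the reference."""
    return float(np.mean(reference <= query_core_distance))

# Hypothetical reference K Core-Distance distribution.
rng = np.random.default_rng(0)
reference = rng.exponential(scale=0.05, size=1000)

score = drift_score(0.33, reference)   # query K Core-Distance from the text
flagged = score >= 0.95                # same decision as the percentile threshold
```

Because the score is a percentile rather than a raw distance, the same 0.95 cutoff is meaningful for any dataset, regardless of the scale of its K Core-Distances.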