You can also view your samples in the embedding space of the model. This can help you build a semantic understanding of your dataset. Using features like Color-By DEP, you might discover pockets of problematic data (e.g. decision boundaries that might benefit from more samples, or a cluster of garbage samples).
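To illustrate what this view represents, here is a minimal sketch that projects sample embeddings down to 2D and colors each point by a per-sample DEP-like score. The placeholder arrays, UMAP, and matplotlib are assumptions made for the example; they are not part of the console itself.

```python
# Minimal sketch of an embeddings view colored by a per-sample score.
# Assumes an (n_samples, dim) embedding matrix and a DEP-like score per sample;
# umap-learn and matplotlib are illustrative choices, not the console's internals.
import numpy as np
import umap
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))   # placeholder for real model embeddings
dep_scores = rng.uniform(size=1000)         # placeholder for per-sample DEP scores

# Project the high-dimensional embeddings down to 2D for plotting.
projection = umap.UMAP(n_components=2, random_state=0).fit_transform(embeddings)

# Color each point by its score so problematic pockets stand out visually.
plt.scatter(projection[:, 0], projection[:, 1], c=dep_scores, cmap="viridis", s=5)
plt.colorbar(label="DEP score")
plt.title("Samples in embedding space, colored by DEP")
plt.show()
```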
Your left pane is called the Insights Menu. At the top, you can see your dataset size and choose the metric that guides your exploration (F1 by default). The size and metric value update as you add filters to your dataset.
Your main sources of insights are Alerts, Metrics, and Clusters. Alerts are a distilled list of the different issues we’ve identified in your dataset. Under Metrics, you’ll find different charts to help you debug your data.
Clicking on an Alert will filter the dataset to the subset of data that corresponds to the Alert.
These charts are dynamic and update as you add different filters. They are also interactive - clicking on a class or group of classes will filter the dataset accordingly, allowing you to inspect and fix the samples.
The third tab is for your Clusters. We automatically cluster your dataset, taking into account frequent words and semantic distance. For each Cluster, we show you its average DEP score and its size - factors you can use to decide which clusters are worth looking into.
We also show you the common words in the cluster, and, if you enable your OpenAI integration, we leverage GPT to generate summaries of your clusters (more details here).
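To make these signals concrete, here is a rough, illustrative sketch of how a dataset could be clustered on embeddings and each cluster described by its size, mean DEP score, most frequent words, and an optional GPT-generated one-line summary. KMeans, CountVectorizer, the prompt, and the `gpt-4o-mini` model choice are assumptions made for the example; they are not the console's actual implementation.

```python
# Illustrative sketch only: cluster samples by embedding distance, then report
# each cluster's size, mean DEP score, most frequent words, and (optionally)
# a GPT-generated one-line summary. The console's actual method may differ.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from openai import OpenAI

def gpt_summary(samples, client, max_samples=20):
    # Hypothetical use of the OpenAI chat completions API to describe a cluster.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Summarize the common theme of these dataset samples in one sentence."},
            {"role": "user", "content": "\n".join(samples[:max_samples])},
        ],
    )
    return response.choices[0].message.content

def summarize_clusters(texts, embeddings, dep_scores, n_clusters=10, top_k=5, client=None):
    dep_scores = np.asarray(dep_scores)
    labels = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0).fit_predict(embeddings)
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(texts)
    vocab = np.array(vectorizer.get_feature_names_out())

    summaries = []
    for cluster_id in range(n_clusters):
        mask = labels == cluster_id
        word_totals = np.asarray(counts[mask].sum(axis=0)).ravel()
        cluster_texts = [t for t, m in zip(texts, mask) if m]
        summaries.append({
            "cluster": cluster_id,
            "size": int(mask.sum()),                       # larger clusters may matter more
            "mean_dep": float(np.mean(dep_scores[mask])),  # higher = more problematic samples
            "common_words": list(vocab[word_totals.argsort()[::-1][:top_k]]),
            "summary": gpt_summary(cluster_texts, client) if client else None,
        })
    return summaries
```

Called as `summarize_clusters(texts, embeddings, dep_scores, client=OpenAI())`, this returns one dictionary per cluster, which you could sort by `mean_dep` and `size` to pick which clusters to inspect first.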
Analyzing the various Clusters side-by-side with the embeddings view is often a helpful way to discover interesting pockets of data.
