The fastest way to find data errors in Galileo
dq.auto workflow:
After installing dataquality: pip install dataquality
You simply add your data, then wait for the model to train under the hood and for Galileo to process the data. This processing can take between 5 and 15 minutes, depending on how much data you have.
auto will wait until Galileo is completely done processing your data. At that point, you can go to the Galileo Console and begin inspecting.
auto accepts train_data, val_data and test_data (pandas or huggingface), or a dataset passed through the hf_data parameter.
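As a rough sketch of how the split parameters named above might be used, the snippet below builds three tiny splits and hands them to dq.auto as pandas DataFrames. The columns text and label follow the Text Classification schema described on this page; the sample rows and the helper name run_auto are made up for illustration, and the sketch assumes your Galileo credentials are already configured.

```python
# Toy rows using the Text Classification schema (columns: text, label).
# The rows themselves are invented for illustration only.
TRAIN_ROWS = [
    {"text": "What is the capital of France?", "label": "location"},
    {"text": "Who wrote Hamlet?", "label": "person"},
]
VAL_ROWS = [{"text": "Where is Mount Everest?", "label": "location"}]
TEST_ROWS = [{"text": "Who painted the Mona Lisa?", "label": "person"}]


def run_auto(train_rows, val_rows, test_rows):
    """Hypothetical helper: convert row dicts to DataFrames and call dq.auto.

    Imports are local so the toy data above can be inspected without
    pandas or dataquality installed.
    """
    import pandas as pd
    import dataquality as dq

    # Assumes you are already logged in to Galileo in this session.
    # A HuggingFace dataset could be supplied through hf_data instead
    # of the three split arguments.
    dq.auto(
        train_data=pd.DataFrame(train_rows),
        val_data=pd.DataFrame(val_rows),
        test_data=pd.DataFrame(test_rows),
    )
```

Calling run_auto(TRAIN_ROWS, VAL_ROWS, TEST_ROWS) would start training and block until Galileo finishes processing, so expect it to take several minutes even on small data.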
dq.auto supports both Text Classification and Named Entity Recognition tasks, with Multi-Label support coming soon. dq.auto automatically determines the task type based on the provided data schema.
To see the other available parameters as well as more usage examples, see help(dq.auto)
To learn more about how dq.auto works, and why we suggest this paradigm, see DQ Auto.
auto if:
auto, your newest superpower in the world of Machine Learning!
We now know that more data isn’t the answer; better data is. But how do you find that data? We already know the answer to that: Galileo.
But how do you get started now, and iterate quickly with data-centric techniques?
Enter: dq.auto, the secret sauce to instant data insights. We handle the training, you focus on the data.
dq.auto is a helper function to train the most cutting-edge transformer (or any of your choosing from HuggingFace) on your dataset so it can be processed by Galileo. You provide the data, let Galileo train the model, and you’re off to the races.
The goal of this tool, and Galileo at large, is to build a data-centric view of machine learning. Keep your model static and iterate on the dataset until it’s well-formed and representative of your problem space. This is the path to robust and stable ML models.
auto is not an AutoML tool. It will not perform hyperparameter tuning, and will not search through a gallery of models to optimize every percentage point of F1.
In fact, auto is quite the opposite. It intentionally keeps the model static, forcing you to understand and fix your data to improve performance.
pip install --upgrade dataquality and use!
dq.auto works for:
Text Classification datasets (given columns text and label). Trec6 Example.
Named Entity Recognition datasets (given columns tokens and tags or ner_tags). MIT_movies Example.
auto will automatically figure out your task and start the process for you.
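To make the idea of schema-based task detection concrete, here is a minimal sketch of how a task could be inferred from column names alone. It is not dq.auto's actual implementation; the function name infer_task_type and the returned strings are hypothetical, and only the column names (text, label, tokens, tags, ner_tags) come from this page.

```python
from typing import Iterable

TEXT_CLASSIFICATION = "text_classification"
NER = "ner"


def infer_task_type(columns: Iterable[str]) -> str:
    """Guess the task from column names (illustrative, not dq.auto's own code)."""
    cols = set(columns)
    # NER data carries token lists plus a tag column under either name.
    if "tokens" in cols and ({"tags", "ner_tags"} & cols):
        return NER
    # Text Classification data carries raw text plus one label per row.
    if {"text", "label"} <= cols:
        return TEXT_CLASSIFICATION
    raise ValueError(f"Cannot infer a task from columns: {sorted(cols)}")
```

With a pandas DataFrame you could pass df.columns; with a HuggingFace dataset, its column_names. Raising on an unrecognized schema, rather than guessing, keeps a mislabeled dataset from being silently trained as the wrong task.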
For more docs and examples, see help(dq.auto) in your notebook! Happy data fixing!