You have questions, we have (some) answers!
Make sure you are running dataquality >= 0.8.6.
The first thing to try in this case is to restart your kernel. dataquality uses certain Python packages that require your kernel to be restarted after installation. In Jupyter you can click “Kernel -> Restart”.
When upgrading dataquality, there is a known bug. Solution: pip uninstall -y vaex-core vaex-hdf5 && pip install --upgrade --force-reinstall dataquality
Then restart your Jupyter/Colab kernel.
Make sure you called dq.finish() after the run. If you did, it’s possible that your run is still being processed. To check, run:
dq.wait_for_run()
(you can optionally pass in the project and run name, or the most recent will be used)
This function will wait for your run to finish processing. If it’s completed, check the console again by refreshing.
If that shows an exception, your run failed to be processed. You can see the logs from your model training by running dq.get_dq_log_file()
which will download your log file and return its path. The log may indicate the issue. Feel free to reach out to us for more help!
The spans JSON column for my NER data can’t be loaded with json.loads
JSONDecodeError: Expecting ',' delimiter: line 1 column 84 (char 83)
It’s likely the case that you have some data in your text field that is not valid JSON (extra quotes " or '). Unfortunately, we cannot modify the content of your span text, but we can strip out the text field with some regex. Given a pandas dataframe df with column spans (from a Galileo export), replace df["spans"] = df["spans"].apply(json.loads) with (make sure to import re) df["spans"] = df["spans"].apply(lambda row: json.loads(re.sub(r',\s*"text":.*?}', "}", row)))
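Here is a self-contained sketch of what that regex does. The span string below is made up for illustration; the approach assumes the text value itself contains no closing brace.

```python
import json
import re

# Hypothetical span string as it might appear in an export, where the
# "text" value contains unescaped quotes that break json.loads.
raw = '[{"start": 0, "end": 7, "label": "PER", "text": "say "hi""}]'

# Strip the "text" field entirely before parsing; the remaining fields
# (start, end, label) survive intact. Assumes no "}" inside the text value.
cleaned = re.sub(r',\s*"text":.*?}', "}", raw)
spans = json.loads(cleaned)
print(spans)  # [{'start': 0, 'end': 7, 'label': 'PER'}]
```

The non-greedy `.*?}` stops at the first closing brace, so only the trailing text field of each span object is dropped.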
The span “Anemone apennina” clearly looks like a wrong tag error (correct span boundaries, incorrect class prediction), but is marked as a span shift.
We can dig into the exported data with dq.metrics.get_dataframe. We can see that there are 2 spans with identical character boundaries, one with a label and one without (which is the prediction span).
We can investigate further with dq.metrics by looking at the raw data logged to Galileo. As we can see, at the token level, the span start and end indices do not align, and in fact overlap (ids 21948 and 21950), which is the reason for the span_shift error.
Note: to work with the output of dq.metrics.get_dataframe in pandas, call dq.metrics.get_dataframe(...).to_pandas_df()
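The boundary logic behind that error can be illustrated with a toy check (the indices here are made up, not taken from the export above):

```python
# Toy token ranges: a gold span and a predicted span that overlap
# but whose boundaries do not align exactly -> span_shift, not wrong_tag.
gold = {"span_start": 10, "span_end": 14}  # gold span token range
pred = {"span_start": 12, "span_end": 16}  # predicted span token range

def overlaps(a, b):
    # Two half-open ranges overlap when each starts before the other ends.
    return a["span_start"] < b["span_end"] and b["span_start"] < a["span_end"]

print(overlaps(gold, pred), gold == pred)  # True False
```

Identical boundaries with a different label would instead be the wrong tag case.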
**PermissionError**
dataquality writes files locally under your HOME directory. If you are seeing a PermissionError, it means that your system does not have write access to your current HOME directory. This may happen in an automated CI system like AWS Glue. To overcome this, simply change your HOME environment variable to somewhere accessible, for example the current directory you are in.
Note that this only changes the HOME variable within the current Python runtime, not your system’s HOME directory. Because of that, if you run a new python script in this environment again, you will need to set the HOME variable in each new runtime.
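A minimal sketch of the workaround, assuming the current working directory is writable (set this before importing dataquality):

```python
import os

# Point HOME at a writable location; any directory the process can
# write to works. This only affects the current Python runtime.
os.environ["HOME"] = os.getcwd()

# import dataquality as dq  # dataquality now resolves local storage under the new HOME
```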
Try running xcodebuild -runFirstLaunch, and allow any clang permission requests that pop up.
You can check your CUDA version by running nvcc -V. Do not use nvidia-smi; it reports the CUDA version supported by your driver, not the installed toolkit version. To learn more about this installation or to do it manually, see the installation guide.
If you are training on datasets in the millions, and noticing that the Galileo processing is slowing down at the “Dimensionality Reduction” stage, you can optionally run those steps on the GPU/TPU that you are training your model with.
In order to leverage this feature, simply install dataquality with the [cuda] extra. You must also pass an extra-index-url to the install, because the extra required packages are hosted by Nvidia and live on Nvidia’s own PyPI repository, not the standard PyPI repository.
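Assuming Nvidia’s public package index at pypi.nvidia.com (the index used for RAPIDS packages such as cuml), the install would look like:

```shell
# Install dataquality with the cuda extra, pulling Nvidia-hosted
# dependencies from Nvidia's own package index.
pip install "dataquality[cuda]" --extra-index-url=https://pypi.nvidia.com/
```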
After running that installation, dataquality will automatically pick up on the available libraries, and leverage your GPU/TPU to apply the dimensionality reduction.
Please validate that the installation ran correctly by running import cuml
in your environment. This must complete successfully.
To manually install these packages (at your own risk), you can run