Configuring dq.auto
Automatic Data Insights on your Seq2Seq dataset
While using `auto` with default settings is as simple as running `dq.auto()`, you can also take granular control over dataset settings, training parameters, and generation configuration. The `auto` function takes optional parameters for `dataset_config`, `training_config`, and `generation_config`. If a configuration parameter is omitted, the default values below are used.
Example
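A minimal run, and one that overrides the top-level parameters, might look like the following sketch (the run name and file path are illustrative; `dataquality` must be installed and you must be logged in to Galileo):

```python
import dataquality as dq

# Simplest case: run auto entirely with the defaults described below.
dq.auto()

# Or name the project/run and point auto at local training data.
dq.auto(
    project_name="s2s_auto",
    run_name="flan-t5-baseline",  # illustrative run name
    train_path="train.jsonl",     # illustrative local file (.csv/.json/.jsonl)
    wait=True,                    # block until Galileo finishes processing
)
```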
Parameters

- `project_name` (`Union[str, None]`) — Optional project name. If not set, a default name will be used. Default `"s2s_auto"`.
- `run_name` (`Union[str, None]`) — Optional run name. If not set, a random name will be generated.
- `train_path` (`Union[str, None]`) — Optional training data to use. Must be a path to a local file of type `.csv`, `.json`, or `.jsonl`.
- `dataset_config` (`Union[Seq2SeqDatasetConfig, None]`) — Optional config for loading the dataset. See `Seq2SeqDatasetConfig` for more details.
- `training_config` (`Union[Seq2SeqTrainingConfig, None]`) — Optional config for training the model. See `Seq2SeqTrainingConfig` for more details.
- `generation_config` (`Union[Seq2SeqGenerationConfig, None]`) — Optional config for post-training model generation. See `Seq2SeqGenerationConfig` for more details.
- `wait` (`bool`) — Whether to wait for Galileo to complete processing your run. Default `True`.
Dataset Config
Use the `Seq2SeqDatasetConfig` class to set the dataset for `auto` training.
Given either a pandas dataframe, local file path, or huggingface dataset path, `auto` will load the data, train a huggingface transformer model, and provide Galileo insights via a link to the Galileo Console.
One of `hf_data`, `train_path`, or `train_data` should be provided.
Parameters

- `hf_data` (`Union[DatasetDict, str, None]`) — Use this param if you have huggingface data in the hub or in memory. Otherwise see `train_path` or `train_data`, `val_path` or `val_data`, and `test_path` or `test_data`. If provided, other dataset parameters are ignored.
- `train_path` (`Union[str, None]`) — Optional training data to use. Must be a path to a local file of type `.csv`, `.json`, or `.jsonl`.
- `val_path` (`Union[str, None]`) — Optional validation data to use. Must be a path to a local file of type `.csv`, `.json`, or `.jsonl`. If not provided, but `test_path` is, that will be used as the evaluation set. If neither val nor test are available, the train data will be randomly split 80/20 for use as evaluation data.
- `test_path` (`Union[str, None]`) — Optional test data to use. Must be a path to a local file of type `.csv`, `.json`, or `.jsonl`. The test data, if provided with val, will be used after training is complete as the hold-out set. If no validation data is provided, this will instead be used as the evaluation set.
- `train_data` (`Union[DataFrame, Dataset, None]`) — Optional training data to use. Can be a pandas dataframe, a huggingface dataset, or a huggingface dataset hub path.
- `val_data` (`Union[DataFrame, Dataset, None]`) — Optional validation data to use. The validation data is what is used as the evaluation dataset in huggingface, and for early stopping. If not provided, but `test_data` is, that will be used as the evaluation set. If neither val nor test are available, the train data will be randomly split 80/20 for use as evaluation data. Can be a pandas dataframe, a huggingface dataset, or a huggingface dataset hub path.
- `test_data` (`Union[DataFrame, Dataset, None]`) — Optional test data to use. The test data, if provided with val, will be used after training is complete as the hold-out set. If no validation data is provided, this will instead be used as the evaluation set. Can be a pandas dataframe, a huggingface dataset, or a huggingface dataset hub path.
- `input_col` (`str`) — Column name of the model input in the provided dataset. Default `text`.
- `target_col` (`str`) — Column name of the model target output in the provided dataset. Default `label`.
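As a sketch, a dataset config for a local file with custom column names might look like this (the import path is an assumption; verify it against the `dataquality` version you have installed):

```python
import dataquality as dq
# NOTE: import path is an assumption -- check your dataquality version's docs.
from dataquality.integrations.seq2seq.schema import Seq2SeqDatasetConfig

# Load a local .jsonl file whose columns are named "prompt" and "completion".
dataset_config = Seq2SeqDatasetConfig(
    train_path="train.jsonl",   # illustrative local file
    input_col="prompt",
    target_col="completion",
)

dq.auto(dataset_config=dataset_config)
```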
Training Config
Use the `Seq2SeqTrainingConfig` class to set the training parameters for `auto` training.
Parameters

- `model` (`str`) — The pretrained AutoModel from huggingface that will be used to tokenize and train on the provided data. Default `google/flan-t5-base`.
- `epochs` (`int`) — The number of epochs to train. Defaults to 3. If set to 0, training/fine-tuning will be skipped and `auto` will only do a forward pass with the data to gather all the necessary info to display it in the console.
- `learning_rate` (`float`) — Optional learning rate. Default 3e-4.
- `batch_size` (`int`) — Optional batch size. Default 4.
- `accumulation_steps` (`int`) — Optional accumulation steps. Default 4.
- `max_input_tokens` (`int`) — Optional maximum length, in number of tokens, for the inputs to the transformer model. If not set, the tokenizer default will be used, or 512 if the tokenizer has no default.
- `max_target_tokens` (`int`) — Optional maximum length, in number of tokens, for the target outputs of the transformer model. If not set, the tokenizer default will be used, or 128 if the tokenizer has no default.
- `create_data_embs` (`Optional[bool]`) — Whether to create data embeddings for this run. If True, Sentence-Transformers will be used to generate data embeddings for this dataset and uploaded with this run. You can access these embeddings via `dq.metrics.get_data_embeddings` in the `emb` column or `dq.metrics.get_dataframe(…, include_data_embs=True)` in the `data_emb` col. Default True if a GPU is available, else False.
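A sketch of a training config that shortens training and shrinks the per-step batch while keeping the effective batch the same (parameter names come from the list above; the import path is an assumption):

```python
# NOTE: import path is an assumption -- check your dataquality version's docs.
from dataquality.integrations.seq2seq.schema import Seq2SeqTrainingConfig

training_config = Seq2SeqTrainingConfig(
    model="google/flan-t5-base",  # the default pretrained model
    epochs=1,                     # set to 0 to skip fine-tuning entirely
    learning_rate=3e-4,
    batch_size=2,                 # smaller per-step batch...
    accumulation_steps=8,         # ...but same effective batch of 16
    create_data_embs=False,       # skip data embeddings, e.g. on CPU-only machines
)
```

Pass it through as `dq.auto(training_config=training_config)`.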
Generation Config
Use the `Seq2SeqGenerationConfig` class to set the generation parameters for `auto` training.
Parameters

- `max_new_tokens` (`int`) — The maximum number of tokens to generate, ignoring the number of tokens in the prompt. Default 16.
- `temperature` (`float`) — The value used to modulate the next token probabilities. Default 0.2.
- `do_sample` (`bool`) — Whether or not to use sampling; use greedy decoding otherwise. Default False.
- `top_p` (`float`) — If set to a float < 1, only the smallest set of most probable tokens with probabilities that add up to `top_p` or higher are kept for generation. Default 1.0.
- `top_k` (`int`) — The number of highest probability vocabulary tokens to keep for top-k filtering. Default 50.
- `generation_splits` (`Union[List[str], None]`) — Optional list of splits to perform generation on after training the model. These generated outputs will show up in the console for the specified splits. Default `["test"]`.
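A sketch of a generation config that enables nucleus sampling and generates on more than just the test split (the import path is an assumption):

```python
# NOTE: import path is an assumption -- check your dataquality version's docs.
from dataquality.integrations.seq2seq.schema import Seq2SeqGenerationConfig

generation_config = Seq2SeqGenerationConfig(
    max_new_tokens=32,
    temperature=0.4,
    do_sample=True,                            # sample instead of greedy decoding
    top_p=0.9,                                 # nucleus sampling cutoff
    generation_splits=["validation", "test"],  # generate on both splits
)
```

Pass it through as `dq.auto(generation_config=generation_config)`.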
Examples
An example using `auto` with a hosted huggingface summarization dataset
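A hedged sketch (the hub dataset name and its column names are illustrative, as is the config import path; any hub dataset with input-text and target-text columns works):

```python
import dataquality as dq
# NOTE: import path is an assumption -- check your dataquality version's docs.
from dataquality.integrations.seq2seq.schema import Seq2SeqDatasetConfig

# "billsum" (text -> summary) is just an illustration of a hub dataset path.
dq.auto(
    dataset_config=Seq2SeqDatasetConfig(
        hf_data="billsum",
        input_col="text",
        target_col="summary",
    )
)
```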
An example of using `auto` with a local jsonl file
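A hedged sketch, assuming a local `train.jsonl` with `prompt`/`completion` columns (the config import path is an assumption):

```python
import dataquality as dq
# NOTE: import path is an assumption -- check your dataquality version's docs.
from dataquality.integrations.seq2seq.schema import Seq2SeqDatasetConfig

dq.auto(
    project_name="s2s_auto",
    dataset_config=Seq2SeqDatasetConfig(
        train_path="train.jsonl",
        input_col="prompt",
        target_col="completion",
    ),
)
```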
Where `train.jsonl` might be a file with `prompt` and `completion` columns.
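For illustration, this stdlib-only snippet writes a tiny (hypothetical) `train.jsonl` in that shape, one JSON object per line:

```python
import json

# Two illustrative rows; a real dataset would have many more.
rows = [
    {"prompt": "Summarize: The quick brown fox jumps over the lazy dog.",
     "completion": "A fox jumps over a dog."},
    {"prompt": "Summarize: It was the best of times, it was the worst of times.",
     "completion": "Times were mixed."},
]

# .jsonl format: one JSON object per line.
with open("train.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Reading the file back yields the same records.
with open("train.jsonl") as f:
    loaded = [json.loads(line) for line in f]
```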
Get started with a notebook