AdaSeq uses a configuration file to control model assembling, training and evaluation. The configuration file supports `yaml`, `json` and `jsonline` formats.
Let's take `resume.yaml` as an example. A configuration file usually consists of the following fields:
```yaml
experiment: ...
task: ...
dataset: ...
preprocessor: ...
data_collator: ...
model: ...
train: ...
evaluation: ...
```
Notice: Default = `/` means the parameter is compulsory.
The `experiment` field supports the following parameters:

Parameter | Description | Type | Default |
---|---|---|---|
exp_dir | experiment directory | str | experiments |
exp_name | experiment name, all outputs will be saved to `./${exp_dir}/${exp_name}/${datetime}/` | str | unknown |
seed | random seed | int | 42 |
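
For example, a minimal `experiment` block could look like the sketch below (the experiment name is made up; the other values are the defaults from the table):

```yaml
experiment:
  exp_dir: experiments   # outputs go to ./experiments/resume/${datetime}/
  exp_name: resume       # hypothetical name, echoing the resume.yaml example
  seed: 42
```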
`task` supports the following values (see metainfo):
- word-segmentation
- part-of-speech
- named-entity-recognition
- relation-extraction
- entity-typing
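
Since `task` is just a top-level string, setting it is a one-liner, for example:

```yaml
task: named-entity-recognition
```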
Please refer to Customizing Dataset, as the combination of `dataset` parameters can be complex.
Parameter | Description | Type | Default |
---|---|---|---|
task | task of the dataset | str | None |
name | modelscope dataset name, for example `damo/resume_ner` | str | None |
path | huggingface dataset name, for example `conll2003` | str | None |
data_file | data files; can be a url, a local directory or an archive, or a dict with `train` `valid` `test` entries | str/dict | None |
data_type | used to specify the data loading method | str | None |
transform | dataset post-processing, usually containing `name` `key` `scheme` | dict | None |
labels | label set; can be a list `labels: ['O', 'B-ORG', ...]`, a file or url `labels: PATH_OR_URL`, or a function counting labels from the dataset | str/list/dict | None |
access_token | used to access private repos on modelscope or huggingface | str | None |
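
As a rough sketch (not the only valid combination), a `dataset` block using a modelscope dataset and an explicit label list could look like this; the label set is purely illustrative:

```yaml
dataset:
  name: damo/resume_ner            # modelscope dataset, as in the example above
  # path: conll2003                # alternatively, a huggingface dataset
  labels: ['O', 'B-ORG', 'I-ORG']  # illustrative list; may also be a file/url or a counting function
```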
The `preprocessor` field supports the following parameters:

Parameter | Description | Type | Default |
---|---|---|---|
type | preprocessor type | str | / |
model_dir | tokenizer name or directory | str | / |
is_word2vec | whether to use a lookup table for embeddings (word2vec style) | bool | False |
tokenizer_kwargs | other parameters passed to the tokenizer | dict | None |
max_length | maximum sentence length (in subtokens) | int | 512 |
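
A sketch of a `preprocessor` block is shown below; the `type` value is a hypothetical placeholder that must match a preprocessor registered in metainfo, and the tokenizer name is only an example:

```yaml
preprocessor:
  type: sequence-labeling-preprocessor  # hypothetical; use a type registered in metainfo
  model_dir: bert-base-cased            # tokenizer name or directory (illustrative)
  max_length: 256                       # override the default of 512 if sentences are short
```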
`data_collator` supports the following values (see metainfo):
- DataCollatorWithPadding
- SequenceLabelingDataCollatorWithPadding
- SpanExtractionDataCollatorWithPadding
- MultiLabelSpanTypingDataCollatorWithPadding
- MultiLabelConcatTypingDataCollatorWithPadding
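
Like `task`, `data_collator` is a single string picked from the list above, for example:

```yaml
data_collator: SequenceLabelingDataCollatorWithPadding
```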
The `model` field supports the following parameters:

Parameter | Child-Parameter | Description | Type | Default |
---|---|---|---|---|
type | | model type | str | / |
embedder | | used to embed input ids into vectors, usually a pretrained model | dict | None |
└ | type | embedder type, optional when using a modelscope or huggingface model | str | None |
└ | model_name_or_path | pretrained model name or path, supporting both modelscope and huggingface models | str | / |
encoder | | encodes the sentence vectors, such as an LSTM | dict | None |
└ | type | encoder type | str | / |
decoder | | not available yet, under construction | dict | None |
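
Below is an assumed sketch of a `model` block combining a pretrained embedder with an LSTM encoder; the `type` values are hypothetical placeholders that must match entries in metainfo, and the backbone name is only illustrative:

```yaml
model:
  type: sequence-labeling-model          # hypothetical; use a model type registered in metainfo
  embedder:
    model_name_or_path: bert-base-cased  # illustrative modelscope or huggingface backbone
  encoder:
    type: lstm                           # hypothetical encoder type applied on top of the embeddings
```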
The `train` field supports the following parameters:

Parameter | Child-Parameter | Description | Type | Default |
---|---|---|---|---|
trainer | | trainer type | str | None |
max_epochs | | maximum number of training epochs | int | / |
dataloader | | used to load data | dict | / |
└ | batch_size_per_gpu | batch size per gpu | int | / |
└ | workers_per_gpu | data loading workers per gpu | int | 0 |
optimizer | | optimizer | dict | None |
└ | type | optimizer type | str | / |
└ | lr | learning rate for all parameters except specific param_groups | float | / |
└ | options | options passed to the optimizer, for example `grad_clip: max_norm: 2.0` | dict | None |
└ | param_groups | parameter groups, each of which can have its own learning rate | list | None |
└ | └ regex | regular expression selecting the parameters of the group | str | / |
└ | └ lr | learning rate for the specific parameter group | float | / |
lr_scheduler | | used to adjust the learning rate during training | dict | None |
└ | type | supports all lr_schedulers from pytorch (check whether your pytorch version includes them) | str | / |
└ | options | options passed to the lr_scheduler | dict | None |
hooks | | also known as callbacks, see the ModelScope documentation | list | None |
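
To make the nesting concrete, here is a sketch of a `train` block; the optimizer and scheduler choices and all numbers are assumptions, and the `regex` entry only illustrates how a param_group gets its own learning rate:

```yaml
train:
  max_epochs: 20
  dataloader:
    batch_size_per_gpu: 16
    workers_per_gpu: 0
  optimizer:
    type: AdamW
    lr: 5.0e-5            # default learning rate for all parameters
    options:
      grad_clip:
        max_norm: 2.0     # from the example in the table above
    param_groups:
      - regex: crf        # hypothetical: parameters whose names match "crf"
        lr: 5.0e-2        # get a larger learning rate than the default
  lr_scheduler:
    type: LinearLR        # any scheduler shipped with your pytorch version
```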
The `evaluation` field supports the following parameters:

Parameter | Child-Parameter | Description | Type | Default |
---|---|---|---|---|
dataloader | | used to load data | dict | / |
└ | batch_size_per_gpu | batch size per gpu | int | / |
└ | workers_per_gpu | data loading workers per gpu | int | 0 |
metrics | | evaluation metrics | list | None |
└ | type | metric type | str | / |
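
Finally, a matching `evaluation` sketch; the metric name is a hypothetical placeholder that must match a metric registered in metainfo:

```yaml
evaluation:
  dataloader:
    batch_size_per_gpu: 64
    workers_per_gpu: 0
  metrics:
    - type: ner-metric   # hypothetical metric name
```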