This part of the tutorial shows how to prepare a custom dataset. To load a custom dataset, you need to answer two questions:
- Where is the dataset?
- Is the data format already supported?

In the following sections, we introduce the dataset loading alternatives and the data formats we support (with more on the way).
Currently, 5 dataset loading methods are supported. You can set the loading method in the configuration file.
```yaml
dataset:
  name: ${modelscope_dataset_name}
  access_token: ${access_token}
```

`name` should be one of the datasets uploaded to ModelScope, such as `damo/resume_ner`. `access_token` is not necessary unless the dataset is private.
```yaml
dataset:
  path: ${huggingface_dataset_name}
```

`path` should be one of the datasets uploaded to Hugging Face, such as `conll2003`.
```yaml
dataset:
  path: ${path_to_py_script_or_folder}
```

`path` should be the absolute path to a custom Python script for `datasets.load_dataset`, or a directory containing the script.
```yaml
dataset:
  data_file:
    train: ${path_to_train_file}
    valid: ${path_to_validation_file}
    test: ${path_to_test_file}
  data_type: ${data_format}
```

`train`, `valid` and `test` can be URLs or local paths (absolute paths) to the dataset files. `data_type` should be one of the supported data formats, such as `conll`.
```yaml
dataset:
  data_file: ${path_or_url_to_dir_or_archive}
  data_type: ${data_format}
```

`data_file` can be a URL like `"https://data.deepai.org/conll2003.zip"`, a local directory (absolute path) like `"/home/data/conll2003"`, or a local archive file (absolute path) like `"/home/data/conll2003.zip"`. Again, `data_type` should be one of the supported data formats, such as `conll`.
The supported data formats cover common sequence labeling tasks, for example NER, CWS, POS tagging, etc.
The widely used CoNLL format is a vertical format (like TSV) that represents a tagged dataset. Normally it is a text file with one word per line and sentences separated by an empty line. The first column of a line should be a word and the last column should be the word's tag (usually in the `BIO` or `BIOES` scheme).
Data example:

```
鲁 B-ORG
迅 I-ORG
文 I-ORG
学 I-ORG
院 I-ORG
组 O
织 O
有 O
关 O
专 O
家 O

我 O
是 O
另 O
一 O
句 O
话 O
```
To use the CoNLL format, set `data_type: conll`. Optionally, you can use `delimiter: ${custom_delimiter}` to set a custom delimiter for the CoNLL file. By default, the delimiter is whitespace or a tab.
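To make the layout concrete, here is a minimal sketch of a parser for this format. It is a hypothetical helper for illustration, not AdaSeq's actual loader; it assumes the first column is the word and the last column is the tag, as described above.

```python
def parse_conll(text, delimiter=None):
    """Parse CoNLL-style text into sentences of (word, tag) pairs.

    Each non-empty line holds one token: the first column is the word
    and the last column is its tag. A blank line ends a sentence.
    delimiter=None splits on any whitespace (space or tab), matching
    the default delimiter described above.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if line.strip():
            cols = line.split(delimiter)
            current.append((cols[0], cols[-1]))
        elif current:
            sentences.append(current)
            current = []
    if current:  # flush the last sentence if the file has no trailing blank line
        sentences.append(current)
    return sentences


sample = "鲁 B-ORG\n迅 I-ORG\n\n我 O\n是 O\n"
# parse_conll(sample) yields two sentences of (word, tag) pairs.
```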
The json-tags format is similar to the CoNLL format: each sentence is represented as a JSON object with a `text` field and a `labels` field. `text` and `labels` must have exactly the same length, so that every label can be assigned to its corresponding character.

```json
{
  "text": "鲁迅文学院组织有关专家",
  "labels": ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O", "O", "O", "O", "O"]
}
```
To use the json-tags format, set `data_type: json_tags`.
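Since the format requires `text` and `labels` to align one to one, a small helper that builds a record while enforcing that constraint might look like the following (a hypothetical sketch, assuming character-level tagging):

```python
import json


def make_json_tags(text, labels):
    """Build a json-tags record, enforcing len(text) == len(labels)."""
    if len(text) != len(labels):
        raise ValueError(
            f"text has {len(text)} characters but got {len(labels)} labels"
        )
    # ensure_ascii=False keeps CJK characters readable in the output file.
    return json.dumps({"text": text, "labels": labels}, ensure_ascii=False)
```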
The json-spans format is another widely used format, for both flat NER and nested NER. Each meaningful span is represented as a dict with `start`, `end` and `type` fields, indicating the [start, end) offsets and the type of the span.

```json
{
  "text": "鲁迅文学院组织有关专家",
  "spans": [{"start": 0, "end": 5, "type": "ORG"}, ...]
}
```
What's more, we allow `type` to be a list of labels, which means multi-label tagging is possible.

```json
{
  "text": "人民日报出版社新近出版了王梦奎的短文集《翠微居杂笔》。",
  "spans": [{"start": 0, "end": 7, "type": ["组织", "出版商", "出版社"]}, ...]
}
```
To use the json-spans format, set `data_type: json_spans`.
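To illustrate how tag sequences and spans relate, here is a hypothetical converter from a BIO label sequence to json-spans offsets. It is a sketch, assuming a well-formed BIO sequence, not part of the library's API:

```python
def bio_to_spans(labels):
    """Convert a well-formed BIO tag sequence into [start, end) spans."""
    spans = []
    start, span_type = None, None
    for i, tag in enumerate(labels):
        if tag.startswith("B-"):
            # A new entity begins; close any open span first.
            if start is not None:
                spans.append({"start": start, "end": i, "type": span_type})
            start, span_type = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == span_type:
            continue  # the current entity continues
        else:
            # "O" ends any open span.
            if start is not None:
                spans.append({"start": start, "end": i, "type": span_type})
            start, span_type = None, None
    if start is not None:  # close a span that runs to the end of the sentence
        spans.append({"start": start, "end": len(labels), "type": span_type})
    return spans
```

Running it on the json-tags example above recovers the `{"start": 0, "end": 5, "type": "ORG"}` span from the json-spans example.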
The CLUENER format is the official format used in the CLUENER benchmark, which gathers all entities of the same type into one group.

```json
{
  "text": "鲁迅文学院组织有关专家",
  "label": {"ORG": [[0, 5], ...]}
}
```
To use the CLUENER format, set `data_type: cluener`.
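As a sketch of how the formats relate, the following hypothetical helper converts a CLUENER-style record into the json-spans layout (assuming, as in the example above, that the offset pairs are [start, end) offsets):

```python
def cluener_to_spans(record):
    """Convert a CLUENER-style record into the json-spans layout."""
    spans = [
        {"start": start, "end": end, "type": entity_type}
        for entity_type, offsets in record["label"].items()
        for start, end in offsets
    ]
    spans.sort(key=lambda span: span["start"])  # keep spans in text order
    return {"text": record["text"], "spans": spans}
```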