The source code used for Weakly-Supervised Neural Text Classification, published in CIKM 2018.
Before running, you first need to install the required packages with the following command:
$ pip3 install -r requirements.txt
Python 3.6 is strongly recommended; older Python versions might lead to package incompatibility issues.
The code can then be run by
python main.py --dataset ${dataset} --sup_source ${sup_source} --model ${model}
where ${dataset} is the dataset name, ${sup_source} is the weak supervision type (one of ['labels', 'keywords', 'docs']), and ${model} is the type of neural model to use (one of ['cnn', 'rnn']).
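For instance, an illustrative invocation using the provided agnews dataset with keyword supervision and the CNN model (any of the supported values above may be substituted):

```bash
python main.py --dataset agnews --sup_source keywords --model cnn
```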
An example run is provided in test.sh, which can be executed by
./test.sh
More advanced training and hyperparameter settings are documented in the comments of main.py.
The weak supervision source ${sup_source} can come from any of the following (illustrative examples of the three files are given after this list):
- Label surface names (labels): you need to provide class names for each class in ./${dataset}/classes.txt, where each line begins with the class id (starting from 0), followed by a colon, and then the class label surface name.
- Class-related keywords (keywords): you need to provide class-related keywords for each class in ./${dataset}/keywords.txt, where each line begins with the class id (starting from 0), followed by a colon, and then the class-related keywords separated by commas.
- Labeled documents (docs): you need to provide labeled document ids for each class in ./${dataset}/doc_id.txt, where each line begins with the class id (starting from 0), followed by a colon, and then the document ids in the corpus (starting from 0) of the corresponding class, separated by commas.
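For illustration, the three files for a hypothetical four-class news dataset might look as follows (the class names, keywords, and document ids below are made-up examples of the format, not the shipped files):

./${dataset}/classes.txt:
```
0:politics
1:sports
2:business
3:technology
```

./${dataset}/keywords.txt:
```
0:government,election,president
1:game,team,season
2:market,stocks,profit
3:software,internet,computer
```

./${dataset}/doc_id.txt:
```
0:0,4,12
1:1,7,25
2:3,9,18
3:2,6,30
```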
Examples are given under ./agnews/ and ./yelp/.
The final results (document labels) will be written to ./${dataset}/out.txt, where each line is the class label id for the corresponding document. Intermediate results (e.g. trained network weights, self-training logs) will be saved under ./results/${dataset}/${model}/.
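For example, out.txt for a hypothetical three-document corpus whose documents were assigned to classes 2, 0, and 3 would read:

```
2
0
3
```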
To execute the code on a new dataset, you need to:
- Create a directory named ${dataset}.
- Put the raw corpus (with or without true labels) under ./${dataset}.
- Modify the function read_file in load_data.py so that it returns a list of documents in the variable data and the corresponding true labels in the variable y (if ground truth labels are not available, simply return y = None); a sketch is given after this list.
- Modify main.py to accept the new dataset; you need to add ${dataset} to argparse, and then specify parameter settings (e.g. update_interval, pretrain_epochs) for the new dataset; a sketch is also given after this list.
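A minimal sketch of the read_file change, assuming a corpus stored as ./${dataset}/dataset.csv with one "label,document" row per line (the filename, signature, and column layout here are assumptions for illustration; adapt them to your corpus):

```python
import csv

def read_file(dataset, with_labels=True):
    # Hypothetical corpus layout: ./${dataset}/dataset.csv with rows of
    # the form "label,document text". Adjust the path and parsing to
    # match your own data.
    data, labels = [], []
    with open('./' + dataset + '/dataset.csv', encoding='utf-8') as f:
        for row in csv.reader(f):
            labels.append(int(row[0]))
            data.append(row[1])
    # Return y = None when ground truth labels are not available.
    y = labels if with_labels else None
    return data, y
```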
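And a corresponding sketch for the main.py change (the dataset name 'mydata' and the parameter values below are hypothetical placeholders, not tuned recommendations; the settings of the closest example dataset are a reasonable starting point):

```python
import argparse

parser = argparse.ArgumentParser()
# Add the new dataset name to the accepted choices;
# 'mydata' is a hypothetical placeholder.
parser.add_argument('--dataset', default='agnews',
                    choices=['agnews', 'yelp', 'mydata'])
args = parser.parse_args()

# Specify parameter settings for the new dataset alongside the
# existing ones; these values are illustrative only.
if args.dataset == 'mydata':
    update_interval = 50
    pretrain_epochs = 20
```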
You can always refer to the example datasets when adapting the code for a new dataset.
Please cite the following paper if you find the code helpful for your research.
@inproceedings{meng2018weakly,
title={Weakly-Supervised Neural Text Classification},
author={Meng, Yu and Shen, Jiaming and Zhang, Chao and Han, Jiawei},
booktitle={Proceedings of the 27th ACM International Conference on Information and Knowledge Management},
pages={983--992},
year={2018},
organization={ACM}
}