Framework for benchmarking different document embedding methods, with an implementation of the original HWAN method.
The program was tested on Ubuntu 18.04 LTS.
To run benchmark.py you need Python 3.x (3.7 recommended) and the modules listed in the requirements.txt file.
This step assumes that you have conda installed. To install all dependencies, run the following commands:
# create a virtual environment named "benchmark"; you can change the name
conda create -n benchmark python=3.7
# activate environment
conda activate benchmark
python3 -m pip install -r requirements.txt
To run the benchmark, activate the prepared conda environment and execute a command similar to:
CUDA_VISIBLE_DEVICES=7 python3 benchmark.py --dataset_path datasets/bbcsport/ --models_path models/ --pretrained_path embeddings/glove.6B.100d.txt --dataset_name bbc --hwan_features_algorithm tf --hwan_features_operation mul
where `CUDA_VISIBLE_DEVICES=7` selects one GPU from those available on the machine, `--hwan_features_algorithm` defines the algorithm used to compute statistical features (available: `bow`, `tf` and `tfidf`), and `--hwan_features_operation` defines the operation that combines latent variables with statistical features in HWAN (available: `add`, `mul` and `concat`).
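To illustrate what these two flags control, here is a minimal sketch using scikit-learn and NumPy. The exact feature definitions and the latent variables in HWAN are assumptions; the random `latent` matrix merely stands in for the network's latent document representations.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog barked at the cat"]

# Illustrative versions of the three statistical feature algorithms:
bow = CountVectorizer(binary=True).fit_transform(docs).toarray()  # bow (presence)
tf = CountVectorizer().fit_transform(docs).toarray()              # tf (counts)
tfidf = TfidfVectorizer().fit_transform(docs).toarray()           # tfidf

# Stand-in for the latent document vectors produced by the network
# (random here, matching the feature dimension so add/mul are defined).
rng = np.random.default_rng(0)
latent = rng.standard_normal(tf.shape)

combined_add = latent + tf                  # add
combined_mul = latent * tf                  # mul
combined_concat = np.hstack([latent, tf])   # concat: doubles the feature dimension
```

Note that `add` and `mul` require the latent variables and the statistical features to share a dimensionality, while `concat` does not.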
The abstract class resides in benchmark_model.py. It has two abstract methods:
@abstractmethod
def preprocess_data(self, dataset, y_dataset):
    ...

@abstractmethod
def train(self, x, y=None):
    ...
Since these steps may differ between document vectorization methods, they are implemented in the concrete classes.
The common methods handle the final KNeighborsClassifier and saving and loading pretrained models.
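A minimal sketch of what a concrete class might look like. The base class here is a simplified stand-in for the one in benchmark_model.py, and the `fit_knn`/`predict` helper names are hypothetical, illustrating only the shared KNeighborsClassifier step described above.

```python
from abc import ABC, abstractmethod

import numpy as np
from sklearn.neighbors import KNeighborsClassifier


class BenchmarkModel(ABC):
    """Simplified stand-in for the abstract class in benchmark_model.py."""

    @abstractmethod
    def preprocess_data(self, dataset, y_dataset):
        ...

    @abstractmethod
    def train(self, x, y=None):
        ...

    def fit_knn(self, embeddings, y, n_neighbors=5):
        # Common step: a KNN classifier on top of the document embeddings.
        self.knn = KNeighborsClassifier(n_neighbors=n_neighbors)
        self.knn.fit(embeddings, y)

    def predict(self, embeddings):
        return self.knn.predict(embeddings)


class MeanEmbeddingModel(BenchmarkModel):
    """Toy concrete model: a document is the mean of random word vectors."""

    def __init__(self, dim=100, seed=0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        self.vocab = {}

    def preprocess_data(self, dataset, y_dataset):
        # Tokenize each document; real implementations would do more.
        return [doc.lower().split() for doc in dataset], y_dataset

    def train(self, x, y=None):
        # Build one embedding per document.
        vectors = []
        for tokens in x:
            for tok in tokens:
                if tok not in self.vocab:
                    self.vocab[tok] = self.rng.standard_normal(self.dim)
            vectors.append(np.mean([self.vocab[t] for t in tokens], axis=0))
        return np.vstack(vectors)
```

The split keeps the evaluation pipeline identical across methods: only `preprocess_data` and `train` change per vectorization method, while classification and model persistence stay in the base class.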
To follow the proposed evaluation protocol, you have to download and use GloVe embeddings. You can find them here: glove.6B.zip
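The GloVe files are plain text, one token followed by its vector per line, so they can be parsed with a few lines of code. This is a sketch; it uses a tiny in-memory sample in the same format (the real glove.6B.100d.txt vectors have 100 dimensions, and you would pass `open("embeddings/glove.6B.100d.txt", encoding="utf-8")` instead).

```python
import io

import numpy as np


def load_glove(fileobj):
    """Parse GloVe's text format: one token followed by its vector per line."""
    embeddings = {}
    for line in fileobj:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings


# Tiny in-memory sample in the same format as glove.6B.100d.txt.
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
vectors = load_glove(sample)
```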
Reuters-21578, Ohsumed and 20 Newsgroups used in this benchmark were downloaded from http://disi.unitn.it/moschitti/corpora.htm.
BBC dataset consists of 2 datasets of news articles from BBC News:
- BBC: 2225 articles in 5 classes (business, entertainment, politics, sport, tech)
- BBCSport: 737 articles in 5 classes (athletics, cricket, football, rugby, tennis)
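Assuming the archives unpack to one folder per class (as the `--dataset_path datasets/bbcsport/` flag above suggests), such a layout can be loaded with scikit-learn's `load_files`. This sketch builds a tiny two-class directory to stay self-contained:

```python
import pathlib
import tempfile

from sklearn.datasets import load_files

# Build a tiny category-per-folder layout, mirroring the assumed on-disk
# structure datasets/bbcsport/<class>/<doc>.txt.
root = pathlib.Path(tempfile.mkdtemp())
for cls, text in [("cricket", "a cricket match report"),
                  ("tennis", "a tennis match report")]:
    d = root / cls
    d.mkdir()
    (d / "001.txt").write_text(text)

# Folder names become class labels; file contents become documents.
data = load_files(str(root), encoding="utf-8")
```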
This is a collection of documents that appeared on Reuters newswire in 1987.
Datasets obtained from here: Reuters(90) and Reuters(115)
Includes medical abstracts from the MeSH categories of the year 1991. The specific task was to categorize abstracts into 23 cardiovascular disease categories.
Datasets obtained from here: Ohsumed(20,000) and Ohsumed(All)
This dataset is a collection of 20,000 messages collected from 20 different netnews newsgroups.
Dataset obtained from here: 20 Newsgroup Dataset
Hierarchical Attention Network model was obtained from https://github.com/Hsankesara/DeepResearch and modified to suit proposed architecture.
D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006.