Support for multiclass, word embeddings, configuration file and new datasets #77

cahya-wirawan · 2017-03-08T14:53:01Z

Hi,

I added the following functionality:

- multiclass classification
- pre-trained word embeddings using word2vec and GloVe
- configuration file in YAML format
- new dataset 20newsgroup (loaded using sklearn.datasets)
- loading a multiclass text-based dataset from a local directory

The path to the movie rating dataset has also been moved to the configuration file. Thanks.
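
As a rough illustration of the local-directory loader listed above: the traceback later in this thread shows `get_datasets_localdata` calling `sklearn.datasets.load_files`, which expects one subfolder per class. The sketch below mimics that convention with the standard library only. The function name `scan_local_dataset` is hypothetical and not part of this PR.

```python
from pathlib import Path


def scan_local_dataset(container_path):
    """Return (texts, labels, target_names) for a folder-per-class layout.

    Hypothetical helper for illustration; it is not the PR's loader.
    """
    root = Path(container_path)
    # Sorted subfolder names become the class names; the index is the label.
    target_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    texts, labels = [], []
    for label, name in enumerate(target_names):
        for path in sorted((root / name).iterdir()):
            if path.is_file():
                # errors="replace" keeps the sketch from failing on non-UTF-8 bytes.
                texts.append(path.read_text(encoding="utf-8", errors="replace"))
                labels.append(label)
    return texts, labels, target_names
```

With a directory containing the subfolders `sport` and `tech`, the returned `target_names` would be `["sport", "tech"]` and every file under `sport` would get label 0.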


CMWENLIU · 2017-06-14T20:29:18Z

Hi @cahya-wirawan
Thank you so much for the multiclass classification functionality you added.
I still have issues loading my own local data. Here is what I did:

1. Saved the text files in the folder /data/bbcdata, with the categories as subfolder names. The bbcdata folder contains 5 subfolders with the corresponding txt files: "business", "entertainment", "politics", "sport", "tech".
2. Updated the config.yml file as follows:

line 16: default: localdata
line 52: container_path: "/data/bbcdata"

Did I miss something needed to run ./train.py?
Could you help me with that?
Thank you so much!

Aven

@cahya-wirawan
The following is the error I get when using local data for multi-class classification.
Could you help me with this?
Thanks a lot!

Loading data...
Traceback (most recent call last):
  File "./train.py", line 72, in <module>
    random_state=cfg["datasets"][dataset_name]["random_state"])
  File "/home/xxliu10/repo/cahya_cnn/cnn-text-classification-tf/data_helpers.py", line 93, in get_datasets_localdata
    random_state=random_state)
  File "/home/xxliu10/anaconda3/lib/python3.6/site-packages/sklearn/datasets/base.py", line 232, in load_files
    data = [d.decode(encoding, decode_error) for d in data]
  File "/home/xxliu10/anaconda3/lib/python3.6/site-packages/sklearn/datasets/base.py", line 232, in <listcomp>
    data = [d.decode(encoding, decode_error) for d in data]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 257: invalid start byte


usmaann · 2018-11-07T06:01:03Z

What is the expected training time, and how many steps are needed to get good accuracy?

usmaann · 2018-11-13T00:40:05Z


Hi, were you able to fix this issue? I am facing the same issue.

usmaann · 2018-11-19T02:29:59Z


Can anybody give a solution to this problem?
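
A possible direction, offered as a sketch rather than a confirmed fix: byte 0xa3 is the pound sign in Latin-1 and Windows-1252, so the text files are probably not UTF-8 encoded (an assumption, since the files themselves are not shown). `sklearn.datasets.load_files` accepts `encoding` and `decode_error` arguments, as the `d.decode(encoding, decode_error)` line in the traceback suggests, so passing `encoding="latin-1"` or `decode_error="replace"` may avoid the error. Whether `get_datasets_localdata` exposes those arguments is not visible in this thread. The standard-library helper below, with the hypothetical name `decode_with_fallback`, shows the same idea on raw bytes.

```python
def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Decode bytes by trying each encoding in order.

    Hypothetical helper for illustration; not part of data_helpers.py.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going and mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


# 0xa3 is not a valid UTF-8 start byte, so the Latin-1 fallback is used here.
sample = b"Shares rose by \xa3" + b"5m"
decoded = decode_with_fallback(sample)
```

Files could also be re-saved as UTF-8 once, before training, which leaves the loader untouched.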

cahya-wirawan added 7 commits February 20, 2017 16:48

Following functionalities have been added:

44bbdaf

- multiclass functionality
- 20 newsgroup loader
- yaml configuration file
- using pre trained word2vec and GloVe embedded words

Fixed issues with word2vec loader

073254c

Added the possibility to disable the word embeddings by using the option enable_word_embeddings or set word_embeddings.default to an empty string in config.yml

563c46d

- API changes for TF 1.0

21c04ff

- Added a functionality to load text datasets from local files

Updated info about the parameter of get_datasets_localdata

6b0674c

Merge remote-tracking branch 'upstream/master'

425a21e

Added enable_word_embeddings in README file

8a83ffb

cahya-wirawan force-pushed the master branch 4 times, most recently from 71ef964 to 8a83ffb on March 21, 2017 00:56

cahya-wirawan force-pushed the master branch from f74b6e3 to 8a83ffb on March 25, 2017 09:00

cahya-wirawan force-pushed the master branch from fac93b1 to 8a83ffb on April 3, 2017 09:33

- added softmax function

0f508ee

- calculate the probability from the score using softmax function and save it in prediction.csv file
- fixed an issue with empty datasets["target_names"]
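
For readers unfamiliar with the step named in this commit, here is a minimal sketch of turning raw class scores into probabilities with softmax. It uses only the standard library and is not the code from the commit, which is not shown on this page.

```python
import math


def softmax(scores):
    """Map a list of raw scores to probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow in exp() for large scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Example: three class scores for one document.
probabilities = softmax([1.0, 2.0, 3.0])
```

The class with the highest score keeps the highest probability, so the predicted label does not change; the probabilities only add a confidence value that can be written next to each prediction.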

cahya-wirawan added 3 commits July 19, 2017 16:59

Added a dynamic learning rate with a high value at the beginning to get higher accuracy faster. The learning rate decays exponentially from 0.003 to 0.0001.

4637773

Fixed a comment

36c0aa6

Added a new option decay_coefficient

68560a4
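
The commits above describe a learning rate that decays exponentially from 0.003 to 0.0001. One common way to write such a schedule is sketched below. The exact formula used in the PR and the way `decay_coefficient` enters it are not shown on this page, so the `decay_speed` parameter here is a hypothetical stand-in.

```python
import math


def decayed_learning_rate(step, max_lr=0.003, min_lr=0.0001, decay_speed=1000.0):
    """Exponential decay from max_lr towards min_lr.

    decay_speed is a hypothetical parameter: larger values decay more slowly.
    """
    return min_lr + (max_lr - min_lr) * math.exp(-step / decay_speed)


# Learning rates at a few training steps.
schedule = [decayed_learning_rate(s) for s in (0, 500, 1000, 5000)]
```

The rate starts at `max_lr` for step 0 and approaches `min_lr` as the step count grows, which matches the behavior the commit message describes.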

gyanmittal approved these changes Dec 9, 2018
