Support for multiclass, word embeddings, configuration file and new datasets #77

Open
wants to merge 11 commits into
base: master

cahya-wirawan

Hi,

I added the following functionality:

  • multiclass classification
  • pre-trained word embedding using word2vec and GloVe
  • configuration file in yaml format
  • new dataset 20newsgroup (loaded using sklearn.datasets)
  • loading multiclass text based dataset from local directory

The path to the movie rating dataset has also been moved to the configuration file. Thanks.
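
For readers unfamiliar with the one-subfolder-per-class layout that a local multiclass loader relies on, here is a minimal sketch in plain Python. It is not the code from this PR (the traceback in this thread shows the PR's loader going through sklearn's `load_files`); the names `load_text_folders` and `one_hot`, and the keys of the returned dictionary, are assumptions chosen for illustration.

```python
from pathlib import Path


def load_text_folders(container_path, encoding="utf-8"):
    """Read <container_path>/<category>/<file> into texts, labels, and class names.

    Hypothetical helper: each immediate subfolder is treated as one class.
    """
    root = Path(container_path)
    # Sort the folder names so class indices are stable across runs.
    target_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    data, target = [], []
    for label, name in enumerate(target_names):
        for path in sorted((root / name).iterdir()):
            if path.is_file():
                data.append(path.read_text(encoding=encoding))
                target.append(label)
    return {"data": data, "target": target, "target_names": target_names}


def one_hot(target, num_classes):
    """Turn integer labels into one-hot rows, one column per class."""
    return [[1 if i == t else 0 for i in range(num_classes)] for t in target]
```

With five category folders, `one_hot(dataset["target"], 5)` would give one five-column row per document, which is the label shape a multiclass softmax output is usually trained against.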

- multiclass functionality
- 20 newsgroup loader
- yaml configuration file
- using pre trained word2vec and GloVe embedded words
…ion enable_word_embeddings

or set word_embeddings.default to an empty string in config.yml
- Added a functionality to load text datasets from local files
- calculate the probability from the score using
  softmax function and save it in prediction.csv file
- fixed an issue with empty datasets["target_names"]
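
The commit note about turning scores into probabilities with softmax and saving them to prediction.csv can be sketched as follows. This is an illustration under assumptions, not the PR's code: the function names and the CSV column layout (text, predicted class, then one probability column per class) are made up for the example.

```python
import csv
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtract the largest score first so math.exp cannot overflow.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def save_predictions(path, texts, score_rows, target_names):
    """Write one CSV row per text: text, predicted class, per-class probabilities."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Assumed header layout: one probability column per class name.
        writer.writerow(["text", "prediction"] + list(target_names))
        for text, scores in zip(texts, score_rows):
            probs = softmax(scores)
            best = max(range(len(probs)), key=probs.__getitem__)
            writer.writerow([text, target_names[best]] + [f"{p:.4f}" for p in probs])
```

Subtracting the maximum score before exponentiating does not change the result, because softmax is invariant to adding a constant to every score.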

CMWENLIU commented Jun 14, 2017

Hi @cahya-wirawan
Thank you so much for adding the multiclass classification functionality.
I still have issues loading my own local data. Here is what I did:

1. Saved the text files in the folder /data/bbcdata, with the categories as subfolder names. The bbcdata folder contains 5 folders with the corresponding txt files: "business", "entertainment", "politics", "sport", "tech".
2. Updated the config.yml file as follows:

line 16: default: localdata
line 52: container_path: "/data/bbcdata"

Did I miss something needed to run ./train.py?
Could you help me with that?
Thank you so much!

Aven

@cahya-wirawan
The following is the error I get when using local data for multiclass classification.
Could you help me with this?
Thanks a lot!

Loading data...
Traceback (most recent call last):
  File "./train.py", line 72, in <module>
    random_state=cfg["datasets"][dataset_name]["random_state"])
  File "/home/xxliu10/repo/cahya_cnn/cnn-text-classification-tf/data_helpers.py", line 93, in get_datasets_localdata
    random_state=random_state)
  File "/home/xxliu10/anaconda3/lib/python3.6/site-packages/sklearn/datasets/base.py", line 232, in load_files
    data = [d.decode(encoding, decode_error) for d in data]
  File "/home/xxliu10/anaconda3/lib/python3.6/site-packages/sklearn/datasets/base.py", line 232, in <listcomp>
    data = [d.decode(encoding, decode_error) for d in data]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 257: invalid start byte
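
The traceback says that at least one file is not valid UTF-8. Byte 0xa3 is the pound sign in Latin-1 and Windows-1252, which would fit news text, though that is only a guess about these particular files. The traceback also shows that `load_files` decodes with an `encoding` and a `decode_error` argument; whether this PR's `get_datasets_localdata` or config.yml exposes them cannot be told from the thread. The sketch below shows the underlying idea in plain Python; `decode_text` is a hypothetical helper, not part of the PR or of sklearn.

```python
def decode_text(raw, encodings=("utf-8", "latin-1")):
    """Try each encoding in order and return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding and replace bad bytes with U+FFFD.
    return raw.decode(encodings[0], errors="replace"), encodings[0]
```

Calling `decode_text(b"\xa3")` falls through UTF-8 and succeeds with Latin-1. Latin-1 maps every byte to a character, so it never raises; that makes it a safe fallback but also means it can silently mis-decode files that are really in some other encoding.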


usmaann commented Nov 7, 2018

What is the expected training time, and how many steps are needed to get good accuracy?


usmaann commented Nov 13, 2018

Hi, were you able to fix this issue? I am facing the same issue.


usmaann commented Nov 19, 2018

Can anybody offer a solution to this problem?

7 participants