Possible issue with samples for training #3

Open
fjben opened this issue Aug 24, 2022 · 1 comment

Comments


fjben commented Aug 24, 2022

Hello @TobiWeller,

First of all, thank you for sharing this implementation!
I'm observing some unexpected behaviour, possibly a bug; could you please take a look? Any help would be appreciated. Thank you!

Problem description
I ran the code in main with no problems, but it seems that in the background train() is repeatedly using the same walk from the beginning to the end of the training phase. More concretely, if I have 48475 extracted walks, in one epoch/iteration train() runs 48475 times as expected, but it always uses the first walk of the first entity in the walks list of lists.

I observed the behaviour when checking sample_batched at line 161 of Trainer.py: every sample is some variation of the first walk, as mentioned above. On further inspection, it seems that in data_reader.py the nested for loops in Word2VecDataset only ever use the first line, and the first words of that first line, in data.walks.
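To illustrate the kind of pattern I suspect (just a sketch of my reading, not the actual code in data_reader.py; the class and argument names here are made up), a `__getitem__` whose nested loops return on the first pass will serve up the same first walk no matter which index the DataLoader asks for:

```python
from torch.utils.data import Dataset

class BuggyWalksDataset(Dataset):
    """Sketch of the behaviour I think I'm seeing -- not the repo's actual class."""

    def __init__(self, walks, window_size=2):
        self.walks = walks              # list of lists of tokens (one list per entity)
        self.window_size = window_size

    def __len__(self):
        return len(self.walks)

    def __getitem__(self, idx):
        # `idx` is never used: the nested loops start from the first walk and
        # the return fires on the very first (walk, word) combination, so every
        # call yields a variation of the same first walk.
        for walk in self.walks:
            for i, center in enumerate(walk):
                lo, hi = max(0, i - self.window_size), i + self.window_size + 1
                context = [w for w in walk[lo:hi] if w != center]
                return center, context
```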

Steps to reproduce with minimal code snippet

I haven't changed anything from the original code except batch_size and iterations, plus some print/log debugging statements not shown here.

```python
# Build the walks object from the MUTAG train/test splits and extract walks from the ontology
walks_obj = Word2VecWalks('./data/mutag/train.tsv', './data/mutag/test.tsv', 'label_mutagenic')

walks = walks_obj.get_walks('./data/mutag/mutag.owl',
                            {'http://dl-learner.org/carcinogenesis#isMutagenic'},
                            [['http://dl-learner.org/carcinogenesis#hasBond', 'http://dl-learner.org/carcinogenesis#inBond'],
                             ['http://dl-learner.org/carcinogenesis#hasAtom', 'http://dl-learner.org/carcinogenesis#charge']])

# Train the skip-gram model (batch_size and iterations are the only parameters I changed)
w2v = Word2VecTrainer_Skipgram(walks=walks, batch_size=1, iterations=1, min_count=0)

w2v.train()
```

Environment
Operating system: Windows 10
Python version: 3.10.2
Torch version: 1.11.0

P.S. If it would be of any help, I can send you the debugging output that led to this.


fjben commented Sep 5, 2022

Hello @TobiWeller,

The issue really does seem to be in the Word2VecDataset class. If you confirm the problem, I have a possible solution that seems to be working for me. Let me know if it would be of use to you.
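For reference, the change I have in mind is roughly the following (only a sketch under my assumptions about the class, not a tested patch; the names are made up): use the index that the DataLoader passes to __getitem__ to select the walk, instead of looping over all walks and returning from the first one.

```python
from torch.utils.data import Dataset

class FixedWalksDataset(Dataset):
    """Sketch of a possible fix -- pick the walk by index."""

    def __init__(self, walks, window_size=2):
        self.walks = walks
        self.window_size = window_size

    def __len__(self):
        return len(self.walks)

    def __getitem__(self, idx):
        # Select the walk by `idx`, so every walk is visited once per epoch.
        walk = self.walks[idx]
        pairs = []
        for i, center in enumerate(walk):
            lo, hi = max(0, i - self.window_size), i + self.window_size + 1
            for context in walk[lo:i] + walk[i + 1:hi]:
                pairs.append((center, context))
        return pairs  # a custom collate_fn would flatten these (center, context) pairs into a batch
```

With batch_size=1 this already gives a different walk per step; whether the surrounding Trainer/collate code would need adjusting as well is something I'd have to check against the repo.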
