
Poor Results on Large Corpus #210

Open
KarahanS opened this issue Mar 31, 2023 · 9 comments

@KarahanS

KarahanS commented Mar 31, 2023

Greetings,

I'm trying to train my own GloVe word embeddings for Turkish using a corpus of ~10 GB. I have enough disk capacity on my computer and 16 GB of memory. I created vocab.txt successfully and can confirm there is no problem with it. I believe I also generated the cooccurrence matrix successfully (~35 GB), but afterwards the shuffling step took too long and suddenly terminated. In contrast to the cooccurrence generation step, shuffling seems non-responsive; it doesn't really print anything to the console. So I decided to train my model on the unshuffled cooccurrence matrix directly.
I trained it for 20 iterations. The cost per iteration looked something like this (numbers are not precise, but my point is that the cost increased for the first 3 iterations and then gradually decreased to ~0.11):

itr=1    cost = ~2.5
itr=2    cost = ~10.5
itr=3    cost = ~14.5
itr=4    cost = ~12.5
itr=5    cost = ~10.5
         ...
itr=19   cost = ~0.14
itr=20   cost = ~0.11

Then I loaded the word vectors using the load_word2vec_format function provided by gensim. I tested the vectors with several analogy tasks, and unfortunately the results are terrible. So, here are my questions:

  1. How vital is shuffling? Can such terrible results be explained by the fact that I skipped the shuffling part?
  2. Or, isn't ~0.11 cost enough to produce some reasonable results? Should I have iterated longer?
  3. When I run the shuffling operation, I get an output like this:
Using random seed 1680251209
SHUFFLING COOCCURRENCES
array size: 1020054732
Shuffling by chunks: processed 0 lines.

I tried printing out some local variables and saw that they are increasing, so the program is actually running, but it feels like it will run forever (if it doesn't terminate due to some error first). Is it really supposed to take that long (even longer than cooccurrence matrix generation)? I suspect my memory is not enough. If that's the case, is there any solution other than simply switching to other hardware / a remote server? (Also, it would be really weird that my memory is enough for matrix generation but not for shuffling o.O')

Note: I'm training on Windows using Ubuntu under WSL, FYI.

@KarahanS
Author

Update: I waited for a while for shuffling to terminate and it terminated with the following error:

$ build/shuffle -memory 16.0 -verbose 2 < out/cooccur.bin > out/cooccurrence.shuf.bin
Using random seed 1680251209
SHUFFLING COOCCURRENCES
array size: 1020054732
Shuffling by chunks: processed 1020054732 lines.
./demo.sh: line 45:   355 Killed                  $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
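As an aside, the "array size: 1020054732" line appears to follow directly from the -memory 16.0 flag: shuffle seems to reserve roughly 95% of the requested memory for in-RAM cooccurrence records of 16 bytes each (two 4-byte word indices plus an 8-byte double value). A small sketch, assuming that record layout, reproduces the number in the log:

```python
# Sketch: how shuffle's reported "array size" can be derived from -memory.
# Assumes a 16-byte cooccurrence record (int word1, int word2, double val)
# and a 95% safety factor; these are assumptions, not guaranteed behavior.
GIGABYTE = 1073741824          # 2**30 bytes
CREC_BYTES = 4 + 4 + 8         # two word indices + one cooccurrence value
memory_limit = 16.0            # the -memory 16.0 passed on the command line

array_size = int(0.95 * memory_limit * GIGABYTE / CREC_BYTES)
print(array_size)  # 1020054732, matching the log line above
```

If that is right, asking for -memory 16.0 on a machine with 16 GB total (some of it already in use, and WSL may cap what is available) would explain the out-of-memory kill; a smaller -memory value shrinks the array proportionally.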

@AngledLuffa
Contributor

Killed like that is almost definitely a memory issue.

I've never tried GloVe without shuffling, so I can't advise on whether or not this kind of curve is how it normally goes with unshuffled text. You could always try a smaller version of your dataset and compare shuffled vs. non-shuffled if you're curious.

However, the best place to start is probably with running shuffle, where you should be able to set the array size or memory limit smaller. The default expectation is 2G, though, so it's a little surprising that it's not working when your system has 16G. Perhaps something in the way you are running it is giving it significantly less memory.

@KarahanS
Author

KarahanS commented Apr 4, 2023

I have solved the memory error by decreasing the value of the memory parameter in the script. Now, I have trained my model with the following parameters:

VOCAB_MIN_COUNT=10
VECTOR_SIZE=300
MAX_ITER=100
WINDOW_SIZE=5
X_MAX=100

I expected my model to give better results on syntactic/semantic analogy tasks than Word2Vec (5 epochs, 300-dimensional embeddings), but unfortunately GloVe's results are worse than the Word2Vec results. Is there something wrong with my parameters? My corpus is ~10.5 GB. Overall, I have 1,384,961,747 tokens and 1,573,013 unique words (excluding words occurring less than the minimum frequency).

Some of the possible problems that come to my mind:

  • Is there a problem with the corpus? I compared the resulting vocab.txt file from GloVe with the one I had from Word2Vec; they are almost identical. There doesn't seem to be any problem extracting the vocabulary, so I guess there shouldn't be any technical problem with the corpus. If there were a problem with the corpus, we would see it in vocab.txt, right?
  • Hardware related issues?: I trained models on both my local machine (i7 11390H) and on a remote machine (Intel® Xeon® Gold 6342 Processor) - results are similar.
  • Overfitting? I trained GloVe with 20 iterations as well and again got awful results. (That's why I switched to 100 iterations; it is also the number suggested in the paper for 300 dimensions.)

I'm stuck at this point and can't really see why the GloVe word vectors are performing so poorly. I'm open to suggestions: new ideas, playing with parameters, etc. @AngledLuffa.

Note: Sorry for changing the title. My previous problem with shuffling is solved, thank you for that.

@KarahanS KarahanS changed the title Problems with Shuffling Poor Results on Large Corpus Apr 4, 2023
@AngledLuffa
Contributor

AngledLuffa commented Apr 4, 2023 via email

@KarahanS
Author

KarahanS commented Apr 4, 2023

Let me share some glimpses from the content:
Here is the output of head -1 corpus.txt:

lovecraft'ın türkçe'deki ilk kitabı

Here is the output of head -5 corpus.txt:

lovecraft'ın türkçe'deki ilk kitabı
yazarın ikinci kitabı
lovecraft türkçe'de
cthulhu'nun çağrısı ve ardından deliliğin dağlarında adlı eserleri türkçe'ye çevrilen howard phillips lovecraft korku ve gerilim ustası bir yazar
beş mayıs howard phillips lovecraft'ın yaşamı boyunca yazdığı elli bir öyküden sekizini bir araya getiren cthulhu'nun çağrısı gotik edebiyatın klasik örneklerinden biri sayılıyor

Each example is separated by \n. Examples do not have to be single sentences; they can be a collection of a couple of sentences as well. For example, there is also an example like this:

Beşiktaş Teknik Direktörü Bernd Schuster , kulübeye çektiği İbrahim Üzülmez dışında son haftalardaki tertibiyle sahadaydı . 4'lü defansın önünde Mehmet Aurelio ile Ernst , onlarında önünde Guti , üçlü hücumcu olarak da sağda Tabata , solda Holosko ve ortada Nobre görev yaptı . Oyun anlayışında bir değişiklik düşünülmediğinden alışılagelmiş şablon içerisinde bir futbol vardı . Defans bloku kalenin uzağında kademeleniyor , kazanılan toplar Ernst ve Guti tarafından forvet elemanlarına servis ediliyordu . Dün gece gene Guti'nin ne kadar önemli bir oyuncu olduğu izlendi . Ayağından çıkan topların çoğunluğu arkadaşlarını pozisyona sokuyordu . 79'da Nobre'nin kafasına adeta topu kondurması ustalığının getirisiydi . Sarı-Kırmızılı takım topa daha çok sahip olmasına rağmen ataklarda çoğalamamanın sıkıntısını yaşadı . 2-3 önemli pozisyondan da istifade etmesini bilemediler .

Technically, this is one example composed of several sentences. We used the same corpus for Word2Vec as well, so such examples shouldn't be a problem (unless there is a more specific technical issue).
As you can see, all tokens are separated by spaces.
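Since the corpus is plain whitespace-separated text, the token/type counts quoted earlier can be sanity-checked with a simple whitespace tokenizer, which should mirror what GloVe's vocab_count does. A minimal sketch (the sample lines and min_count below are illustrative, not the real corpus):

```python
from collections import Counter

def vocab_stats(lines, min_count):
    """Count whitespace-separated tokens; return (total tokens, types >= min_count)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    total = sum(counts.values())
    kept = sum(1 for c in counts.values() if c >= min_count)
    return total, kept

# Two lines taken from the head -5 output above, as a tiny sample
sample = [
    "lovecraft'ın türkçe'deki ilk kitabı",
    "yazarın ikinci kitabı",
]
print(vocab_stats(sample, min_count=2))  # (7, 1): only 'kitabı' occurs twice
```

Running the same logic over the full corpus.txt with min_count=10 should roughly reproduce the 1,384,961,747-token / 1,573,013-type figures if the tokenization matches.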

If you can and want to spend more time and effort on it, here is the link to the corpus we are using: https://drive.google.com/file/d/1BhHG8-btnTcfndU5fvsvTG3mD9WGf6L0/view?usp=sharing

Additionally, here is our loss curve:
[loss curve image]

@AngledLuffa

@AngledLuffa
Contributor

AngledLuffa commented Apr 4, 2023 via email

@KarahanS
Author

KarahanS commented Apr 5, 2023

I'd be very grateful for any assistance you could provide. If you have time to train the model as well, please use window size = 5 unless there is an important reason not to. That's what we used in Word2Vec, so to be able to compare the models it's better to stick to the previous window size.
Let me provide an example. This is the direct Turkish counterpart of the classical "man" - "woman" - "king" example in English. Below, you can see how to load the GloVe vectors using gensim and test the analogy task:

from gensim.models import KeyedVectors

# Load GloVe vectors stored in word2vec text format (no header line)
word_vectors = KeyedVectors.load_word2vec_format("path/to/glove/vectors.txt", no_header=True, binary=False)
print(word_vectors.most_similar_cosmul(positive=['kadın', 'kral'], negative=['adam']))

The output is like this:

[('erkek', 0.8252280950546265),
 ('kraliçe', 0.8103123307228088),
 ('bebek', 0.8019083142280579),
 ('kralın', 0.8017817139625549),
 ('aile', 0.7960394024848938),
 ('çocuk', 0.7889254689216614),
 ('afgan', 0.7882615923881531),
 ('annesi', 0.7867284417152405),
 ('kadınların', 0.7853242754936218),
 ('arap', 0.7841709852218628)]

kraliçe means queen in Turkish, and that's the word we would expect as the first recommendation. Word2Vec gives kraliçe as the correct answer approximately 90% of the time.

So if we somehow manage to make our model return kraliçe as the first word, that's progress. You might ask: what is erkek? It can be translated as male. So interestingly, when we subtract man from king and add woman, our current GloVe model suggests the result is most similar to male o.O'.
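For reference, the analogy arithmetic behind this test (here the simpler 3CosAdd variant rather than gensim's most_similar_cosmul) can be sketched with toy vectors. The 4-dimensional values below are made up purely for illustration; real GloVe vectors would be 300-dimensional:

```python
import math

# Toy 4-d embeddings (made-up values for illustration only)
vecs = {
    "kral":    [0.9, 0.1, 0.8, 0.0],   # king
    "adam":    [0.9, 0.1, 0.0, 0.0],   # man
    "kadın":   [0.1, 0.9, 0.0, 0.0],   # woman
    "kraliçe": [0.1, 0.9, 0.8, 0.0],   # queen
    "erkek":   [0.8, 0.2, 0.1, 0.0],   # male
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# 3CosAdd: king - man + woman, ranking all words except the query words
query = ("kral", "adam", "kadın")
target = [k - a + w for k, a, w in zip(vecs["kral"], vecs["adam"], vecs["kadın"])]
ranked = sorted(
    ((w, cosine(target, v)) for w, v in vecs.items() if w not in query),
    key=lambda p: p[1], reverse=True,
)
print(ranked[0][0])  # 'kraliçe' with these toy vectors
```

With well-trained vectors the gender direction (kadın - adam) should dominate the ranking; erkek coming out on top instead suggests the learned vectors barely encode that direction at all.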

@KarahanS
Author

KarahanS commented Apr 6, 2023

I came across some sources suggesting that Word2Vec performs better than GloVe for Turkish.

  • For example, here, in the "About GloVe" section, it is stated that "In the article published by Stanford University, GloVe is showed to be better than Word2Vec. But in our study for Turkish, Word2Vec gave better results".
  • In this paper, the conclusion states that Word2Vec performs better than GloVe on analogy tasks.

So I'm inclined to think there is no technical issue with our results; GloVe simply doesn't perform as well as Word2Vec for agglutinative languages like Turkish. If that's really the case, what would you say is the main reason, @AngledLuffa? (A problem with our setup is still a possibility, but it doesn't seem likely.)

@AngledLuffa
Contributor

AngledLuffa commented Apr 11, 2023 via email
