
How to optimize memory (RAM) usage #6

Open
ncjhughes opened this issue Mar 1, 2022 · 1 comment

@ncjhughes

I'm training a sentiment-analysis model on about 400,000 samples. The text is short, at most 255 characters each, but most are much shorter.

I have 32 GB of RAM on a Linux box running without a GUI and few other processes. PHP's memory_limit is set to -1 (unlimited).

Linux routinely kills my PHP process due to lack of memory. Following this guide, I've set my overcommit_memory value to 2, which lets the script run a little longer, but it eventually runs out of memory and stops.

I've ordered another 32 GB of RAM in the hope that it helps, but are there any other techniques I could use to reduce memory usage? Or is there a way to see/calculate how much memory is needed before even attempting to run the script?

@andrewdalpino commented Mar 16, 2022

Yes, there are. The first thing you can do is use TokenHashingVectorizer instead of WordCountVectorizer. TokenHashingVectorizer is a low-memory implementation of a bag-of-words vectorizer that uses a hashing function rather than a lookup table to "store" the vocabulary. The result is that the vocabulary never needs to be held in memory, at the cost of a non-zero probability of hash collisions between words.

https://docs.rubixml.com/1.0/transformers/token-hashing-vectorizer.html
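
For example, here's a minimal sketch of the swap (assuming the first constructor argument of TokenHashingVectorizer is the number of output dimensions, per the docs above):

```php
use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Transformers\TokenHashingVectorizer;

// $samples is an array of single-column text samples and $labels
// their sentiment labels.
$dataset = new Labeled($samples, $labels);

// Hash each token into a fixed 16,384-dimensional feature space
// instead of building an in-memory vocabulary the way
// WordCountVectorizer does.
$dataset->apply(new TokenHashingVectorizer(16384));
```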

The next thing you can do is make sure the sample arrays use all of the memory PHP allocates for them, i.e. no wasted slack. You can ensure this by fixing the sample arrays to a length that is a power of 2 (ex. 8, 16, 32, 64, 128, ..., 1024), since PHP allocates the underlying array storage in power-of-2 capacities. Set the max vocabulary size on WordCountVectorizer (or the number of dimensions on TokenHashingVectorizer) to the lowest such number that still gives you the vocabulary size you need.
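
For example, a sketch assuming the first constructor argument of WordCountVectorizer is the max vocabulary size (check the docs for your version):

```php
use Rubix\ML\Transformers\WordCountVectorizer;

// Cap the vocabulary at 16,384 (2^14) tokens so each sample vector
// fills a power-of-2 allocation with no wasted slack.
$dataset->apply(new WordCountVectorizer(16384));
```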

The third thing you can do is turn off snapshotting (which holds a copy of the last best network parameters in memory) by setting the size of the internal validation set (the "holdout set") to 0. The downside is that the network may slightly overfit the training data without a validation signal to guide early stopping.
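
As a sketch (this assumes PHP 8 named arguments and that the estimator exposes the holdout ratio as a holdOut constructor parameter - check the docs for the exact signature):

```php
use Rubix\ML\Classifiers\MultilayerPerceptron;
use Rubix\ML\NeuralNet\Layers\Activation;
use Rubix\ML\NeuralNet\Layers\Dense;
use Rubix\ML\NeuralNet\ActivationFunctions\ReLU;

// A holdout ratio of 0 disables the internal validation set and,
// with it, snapshotting of the best parameters seen so far.
$estimator = new MultilayerPerceptron(
    hiddenLayers: [new Dense(128), new Activation(new ReLU())],
    holdOut: 0.0,
);
```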

The fourth thing you can do is use Online instead of Batch learning. The downside here is that you probably won't be able to use WordCountVectorizer, since it builds its vocabulary before training rather than during it. As such, either the full dataset would need to be loaded into memory to build the vocabulary, defeating the purpose of Online learning, or you'd have to fit it on a subset and settle for a suboptimal vocabulary. That said, TokenHashingVectorizer works fine with Online learning.

https://docs.rubixml.com/1.0/training.html#batch-vs-online-learning

To save memory with Online learning, you can load only subsets of your dataset into memory at a time and do partial training from there.
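
A sketch of that loop (loadChunk() is a hypothetical helper that streams the next batch of samples and labels from disk; any estimator that implements the Online interface will work):

```php
use Rubix\ML\Classifiers\SoftmaxClassifier;
use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Transformers\TokenHashingVectorizer;

$estimator = new SoftmaxClassifier();

// Only one chunk of the dataset is ever held in memory at a time.
while ($chunk = loadChunk(10000)) {
    [$samples, $labels] = $chunk;

    $dataset = new Labeled($samples, $labels);

    $dataset->apply(new TokenHashingVectorizer(16384));

    // partial() updates the model without discarding prior training.
    $estimator->partial($dataset);
}
```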

As far as calculating the required memory before training goes - yes, there is a way to do this. You'd just need to know the number of parameters in the neural net and multiply that by 64 bits (8 bytes) or 32 bits (4 bytes) depending on your platform. That gives you a good estimate of the amount of memory needed to store the model. For the dataset, you can use a similar method: multiply the number of features by the number of samples, then multiply that by the average size of the data types used. Continuous variables are 64 or 32 bits each, and categorical variables are usually 8 bits (1 byte) per character of the string. The model size plus the dataset size plus a buffer (say 5%) is a good estimate of the memory required for training.
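
A back-of-the-envelope sketch of that arithmetic (the counts below are made-up placeholders, not measurements):

```php
// Hypothetical numbers for illustration only.
$numParams = 2_000_000;   // parameters in the network
$numSamples = 400_000;    // samples in the dataset
$numFeatures = 16_384;    // features per sample after vectorization

$bytesPerValue = 8;       // 64-bit floats; use 4 on a 32-bit platform

$modelBytes = $numParams * $bytesPerValue;
$datasetBytes = $numSamples * $numFeatures * $bytesPerValue;

// Model + dataset + ~5% buffer.
$estimateBytes = ($modelBytes + $datasetBytes) * 1.05;

echo round($estimateBytes / 1024 ** 3, 1) . ' GiB' . PHP_EOL;
```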

@andrewdalpino changed the title from "Running Out of Memory" to "How to optimize memory usage" on Mar 21, 2022
@andrewdalpino changed the title from "How to optimize memory usage" to "How to optimize memory (RAM) usage" on Mar 21, 2022