Building lexicon when corpus does not fit into memory #266

Open
davidbp opened this issue Mar 23, 2023 · 1 comment

davidbp commented Mar 23, 2023

I looked in the documentation and could not find any tooling to build a lexicon when the corpus can't fit in memory.

Let's say I want to build tf-idf vectors for a given lexicon of 10 million n-grams, but I can't fit into memory all the text files needed to discover those 10 million n-grams in the corpus.

What I would like to do is build the lexicon incrementally from batches of documents as I load them (note that I don't want to keep the documents' text; I just want to tokenize them to learn the lexicon from the data):

for batch_of_documents in folder
    update!(lexicon, batch_of_documents, tokenizer)
end

and then

m = DocumentTermMatrix(["some text here", "here more text"]; lexicon, tokenizer)

Is there a way to do this?
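
For concreteness, below is a rough sketch of the workflow I have in mind, written against pieces I believe TextAnalysis.jl already exposes (StringDocument, ngrams, Corpus, and a DocumentTermMatrix(crps, lex) method); the corpus_dir folder, the use of bigrams, and the one-file-per-batch loop are just placeholders:

    using TextAnalysis

    # Running lexicon: n-gram => corpus-wide count.
    lexicon = Dict{String, Int}()

    # Stream over the files one at a time so the full corpus is never in memory.
    for path in readdir("corpus_dir"; join = true)  # placeholder folder of .txt files
        doc = StringDocument(read(path, String))
        # Merge this document's bigram counts into the running lexicon;
        # the document text itself can then be garbage-collected.
        for (ngram, count) in ngrams(doc, 2)
            lexicon[ngram] = get(lexicon, ngram, 0) + count
        end
    end

    # Later, vectorize a small batch of documents against the precomputed lexicon.
    crps = Corpus([StringDocument("some text here"), StringDocument("here more text")])
    m = DocumentTermMatrix(crps, lexicon)

If DocumentTermMatrix accepts a precomputed lexicon like this, then tf_idf(m) should give the tf-idf weights without the whole corpus ever being resident; the missing piece is an official streaming/incremental API for building the lexicon itself.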

@rssdev10 (Collaborator)
