Building lexicon when corpus does not fit into memory #266

Open
davidbp opened this issue Mar 23, 2023 · 1 comment

davidbp commented Mar 23, 2023

I looked in the documentation and could not find any tooling to build a lexicon when the corpus can't fit in memory.

Let's say I want to build tf-idf vectors for a given lexicon of 10 million n-grams, but I can't fit into memory all the text files needed to discover those 10 million n-grams in the corpus.

What I would like to do is build the lexicon incrementally from batches of documents as I load them (note that I don't want to keep the documents' text; I just want to tokenize them to learn the lexicon from the data):

for batch_of_documents in folder
    update!(lexicon, batch_of_documents, tokenizer)
end

and then

m = DocumentTermMatrix(["some text here", "here more text"]; lexicon, tokenizer)

Is there a way to do this?
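
For concreteness, below is a rough sketch of the workflow I have in mind, written against pieces I believe TextAnalysis.jl already exposes (StringDocument, ngrams, Corpus, and a DocumentTermMatrix(crps, lex) method); the corpus_dir folder, the use of bigrams, and the one-file-per-batch loop are just placeholders:

    using TextAnalysis

    # Running lexicon: n-gram => corpus-wide count.
    lexicon = Dict{String, Int}()

    # Stream over the files one at a time so the full corpus is never in memory.
    for path in readdir("corpus_dir"; join = true)  # placeholder folder of .txt files
        doc = StringDocument(read(path, String))
        # Merge this document's bigram counts into the running lexicon;
        # the document text itself can then be garbage-collected.
        for (ngram, count) in ngrams(doc, 2)
            lexicon[ngram] = get(lexicon, ngram, 0) + count
        end
    end

    # Later, vectorize a small batch of documents against the precomputed lexicon.
    crps = Corpus([StringDocument("some text here"), StringDocument("here more text")])
    m = DocumentTermMatrix(crps, lexicon)

If DocumentTermMatrix accepts a precomputed lexicon like this, then tf_idf(m) should give the tf-idf weights without the whole corpus ever being resident; the missing piece is an official streaming/incremental API for building the lexicon itself.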

@rssdev10 (Collaborator)
