Question: Is it possible to save a Corpus to disk and then append or delete docs from it? #115

gryBox · 2017-06-24T00:20:53Z

Hi - I am hoping to use textacy.corpus.Corpus as an on-disk data container. What I didn't find in the documentation was a way to write a Doc to a corpus on disk. Is this functionality something I overlooked or did not understand?

bdewilde · 2017-06-24T15:37:12Z

Hey @gryBox, unfortunately there's no functionality that does exactly what you have in mind... but it's an interesting idea. I've been meaning to work on the Corpus class for a while now, possibly splitting it into one class that has access to all its documents (either on disk or in memory) and one that is fully streaming à la gensim.

That said, I'm waiting (have been waiting for a while) for spaCy v2.0's release, which significantly changes how its data gets serialized to disk. I'm probably going to punt on this issue until then, but I promise to revisit it when 2.0 is out. If you could be more specific and detailed about the sort of functionality you want, that would be a great help to me!

gryBox · 2017-06-24T17:02:51Z

Hi @bdewilde
Here is some pseudocode for a use case (to get the spec going). In general I am looking at HDF5 as an inspiration. Why not just use HDF5? It does not have the out-of-the-box functionality that Corpus and Doc have.

# NOTE: pseudocode for a proposed API. The calls marked "proposed" are what
# I would like to have; they are not existing textacy functionality.
import textacy

# text to process....
text = 'In data networking and transmission, 64b/66b is a line code that transforms 64-bit data to 66-bit line code to provide enough state changes to allow reasonable clock recovery and facilitate alignment of the data stream at the receiver. It was defined by the IEEE 802.3 working group as part of the IEEE 802.3ae-2002 amendment which introduced 10 Gbit/s Ethernet.'

# initialize empty corpus
corpus = textacy.corpus.Corpus('en', texts=None, docs=None, metadatas=None)

# write empty corpus to disk
corpus.save('~/Desktop', name='64b66b encoding', compression='gzip')

# close the corpus obj
corpus = None

# load text into doc
doc = textacy.doc.Doc(text, metadata=None, lang='en')

# add the document to the corpus on disk (proposed)
doc.append_to_corpus('flpth/to/corpus')

# close doc (now the text is not using memory)
doc = None

# now read selected docs back from the corpus on disk (proposed arguments)
newCorpus = textacy.corpus.Corpus.load(
    'flpth/to/corpus',
    docs_to_load=[0, 1],  # and/or filter on metadata:
    match=lambda metadata: metadata['speaker_name'] == 'Rick Santorum',  # hypothetical parameter name
)
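
To make the requested behavior concrete, here is a minimal standard-library sketch of an on-disk container that supports appending, deleting, and selectively loading documents. It is not textacy functionality: the function names `append_doc`, `iter_docs`, and `delete_docs`, and the gzip-compressed JSON Lines layout (one `{"text", "metadata"}` record per line), are assumptions made purely for illustration. It stores raw text and metadata only, not parsed spaCy data.

```python
import gzip
import json
import os
import tempfile


def append_doc(path, text, metadata=None):
    """Append one record to a gzip-compressed JSON Lines file (hypothetical layout)."""
    record = {"text": text, "metadata": metadata or {}}
    # Mode "at" writes a new gzip member at the end of the file;
    # readers see all members as one continuous stream of lines.
    with gzip.open(path, "at", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def iter_docs(path, indices=None, match=None):
    """Lazily yield (index, record) pairs, one record in memory at a time.

    indices: optional positions to keep; match: optional predicate on the
    metadata dict. If both are given, a record must satisfy both.
    """
    wanted = None if indices is None else set(indices)
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if wanted is not None and i not in wanted:
                continue
            record = json.loads(line)
            if match is None or match(record["metadata"]):
                yield i, record


def delete_docs(path, indices):
    """Remove records by position by rewriting the file without them."""
    drop = set(indices)
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    with gzip.open(path, "rt", encoding="utf-8") as src, \
            gzip.open(tmp_path, "wt", encoding="utf-8") as dst:
        for i, line in enumerate(src):
            if i not in drop:
                dst.write(line)
    # Swap the rewritten file into place; positions of later records shift down.
    os.replace(tmp_path, path)
```

With this layout, appending never reads the existing file, and `iter_docs(path, match=lambda m: m.get("speaker_name") == "Rick Santorum")` streams only the matching records. Deletion is the expensive operation, since it rewrites the whole file, which is one reason a chunked format such as HDF5 is attractive for larger corpora.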

Let me know what you think and what needs more fleshing out conceptually.

Btw: I am still interested in having a chat (in regards to the learning map)

bdewilde added the enhancement label Jun 24, 2017

gryBox closed this as completed Jun 24, 2017

gryBox reopened this Jun 24, 2017
