Hi - I am hoping to use textacy.corpus.Corpus as an on-disk data container. What I didn't find in the documentation was a way to write a Doc to a corpus on disk. Is this functionality something I overlooked or did not understand?
Hey @gryBox, unfortunately there's no functionality that does exactly what you have in mind... but it's an interesting idea. I've been meaning to work on the Corpus class for a while now, possibly splitting it into one class that has access to all its documents (either on disk or in memory) and one that is fully streaming à la gensim.
That said, I've been waiting for a while for spaCy v2.0's release, which significantly changes how its data gets serialized to disk. I'm probably going to punt on this issue until then, but I promise to revisit it when 2.0 is out. If you could be more specific and detailed about the sort of functionality you want, that would be a great help to me!
Hi @bdewilde
Here is some pseudo use-case code (to get the spec going). In general I am looking at HDF5 as an inspiration. Why not just use HDF5? It does not have the out-of-the-box functionality that Corpus and Doc have.
# Pseudocode: append_to_corpus() and the selective loading at the end are proposed, not existing textacy features
import textacy

# text to process....
text = 'In data networking and transmission, 64b/66b is a line code that transforms 64-bit data to 66-bit line code to provide enough state changes to allow reasonable clock recovery and facilitate alignment of the data stream at the receiver. It was defined by the IEEE 802.3 working group as part of the IEEE 802.3ae-2002 amendment which introduced 10 Gbit/s Ethernet.'
# initialize empty corpus
corpus = textacy.corpus.Corpus('en', texts=None, docs=None, metadatas=None)
# write empty corpus to disk
corpus.save('~/Desktop', name='64b66b encoding', compression='gzip')
# close the corpus obj
corpus = None
# load text into doc
doc = textacy.doc.Doc(text, metadata=None, lang='en')
# Add the document to the corpus on disk (proposed method)
doc.append_to_corpus('flpth/to/corpus')
# Close doc (now the text is not using memory)
doc = None
# Now read the text back in from the corpus as a doc,
# selecting by index (proposed docs_to_load argument)
# and/or by metadata, e.g. metadata['speaker_name'] == 'Rick Santorum'
newCorpus = textacy.corpus.Corpus.load('flpth/to/corpus', docs_to_load=[0, 1])
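To make the requested behaviour concrete without relying on anything textacy-specific, here is a minimal standard-library sketch of the same idea: append one document at a time to a file on disk, then stream back only the documents that match an index list and a metadata test. The gzip-compressed JSON Lines layout and the names append_doc, stream_docs, indices and match_metadata are assumptions made up for illustration; none of them are textacy APIs. Note that the two filters are combined with "and" here.

```python
import gzip
import json


def append_doc(corpus_path, text, metadata=None):
    """Append one record to a gzip-compressed JSON Lines file (hypothetical layout)."""
    record = {"text": text, "metadata": metadata or {}}
    # Mode "at" creates the file if it is missing and otherwise appends
    # a new gzip member, so existing records are never rewritten.
    with gzip.open(corpus_path, "at", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def stream_docs(corpus_path, indices=None, match_metadata=None):
    """Lazily yield (index, record) pairs, holding one record in memory at a time."""
    wanted = set(indices) if indices is not None else None
    with gzip.open(corpus_path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            # Skip unwanted indices before paying the cost of parsing.
            if wanted is not None and i not in wanted:
                continue
            record = json.loads(line)
            if match_metadata is not None and not match_metadata(record["metadata"]):
                continue
            yield i, record
```

Each selected record's text and metadata could then be passed to a Doc as needed, so only the documents being worked on are held in memory.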
Let me know what you think and what needs more fleshing out conceptually.
Btw: I am still interested in having a chat (regarding the learning map).