Question: Is it possible to save a Corpus to disk and then append or delete docs from it? #115

Open
gryBox opened this issue Jun 24, 2017 · 2 comments

@gryBox
gryBox commented Jun 24, 2017

Hi - I am hoping to use textacy.corpus.Corpus as an on-disk data container. What I didn't find in the documentation was a way to write a Doc to a corpus on disk. Is this functionality something I overlooked or did not understand?

@bdewilde
Collaborator

Hey @gryBox, unfortunately there's no functionality that does exactly what you have in mind... but it's an interesting idea. I've been meaning to work on the Corpus class for a while now, possibly splitting it into one class that has access to all its documents (either on disk or in memory) and one that is fully streaming à la gensim.

That said, I'm waiting (have been waiting for a while) for spacy v2.0's release, which significantly changes how its data gets serialized to disk. I'm probably going to punt on this issue until then — but promise to revisit when 2.0 is out. If you could be more specific and detailed about the sort of functionality you want, that would be a great help to me!

@gryBox
Author

gryBox commented Jun 24, 2017

Hi @bdewilde
Here is some pseudocode for a use case (to get the spec going). In general I am looking at HDF5 as an inspiration. Why not just use HDF5? It does not have the out-of-the-box functionality that Corpus and Doc have.

import textacy

# text to process
text = 'In data networking and transmission, 64b/66b is a line code that transforms 64-bit data to 66-bit line code to provide enough state changes to allow reasonable clock recovery and facilitate alignment of the data stream at the receiver. It was defined by the IEEE 802.3 working group as part of the IEEE 802.3ae-2002 amendment which introduced 10 Gbit/s Ethernet.'

# initialize an empty corpus
corpus = textacy.corpus.Corpus('en', texts=None, docs=None, metadatas=None)

# write the empty corpus to disk
corpus.save('~/Desktop', name='64b66b encoding', compression='gzip')

# release the corpus object
corpus = None

# load the text into a doc
doc = textacy.doc.Doc(text, metadata=None, lang='en')

# add the document to the corpus on disk
# (append_to_corpus is a proposed method; it does not exist yet)
doc.append_to_corpus('flpth/to/corpus')

# release the doc (now the text is not using memory)
doc = None

# read selected docs back from the corpus on disk
# (docs_to_load and match_func are proposed parameters, not part of the current API)
newCorpus = textacy.corpus.Corpus.load('flpth/to/corpus',
                                       docs_to_load=[0, 1],  # and/or
                                       match_func=lambda doc: doc.metadata['speaker_name'] == 'Rick Santorum')
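
To make the intended semantics a bit more concrete, here is a minimal, self-contained sketch of an on-disk store that supports append, delete, and selective loading. It is only an illustration and makes several assumptions: `DiskCorpus`, its methods, and the `match_func` parameter are hypothetical names, it uses the standard-library `sqlite3` module rather than HDF5 or anything in textacy, and it stores raw text plus JSON metadata instead of parsed Doc objects.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


class DiskCorpus:
    """Hypothetical on-disk document store; NOT part of textacy."""

    def __init__(self, path):
        # Opening the file creates it if needed, so an "empty corpus on disk"
        # exists as soon as the object is constructed.
        self.conn = sqlite3.connect(str(path))
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS docs ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, text TEXT, metadata TEXT)"
        )
        self.conn.commit()

    def append(self, text, metadata=None):
        """Write one document to disk and return its id."""
        cur = self.conn.execute(
            "INSERT INTO docs (text, metadata) VALUES (?, ?)",
            (text, json.dumps(metadata or {})),
        )
        self.conn.commit()
        return cur.lastrowid

    def delete(self, doc_id):
        """Remove one document from disk by id."""
        self.conn.execute("DELETE FROM docs WHERE id = ?", (doc_id,))
        self.conn.commit()

    def load(self, ids=None, match_func=None):
        """Yield (id, text, metadata), filtered by ids and/or a metadata predicate."""
        rows = self.conn.execute("SELECT id, text, metadata FROM docs ORDER BY id")
        for doc_id, text, raw in rows:
            metadata = json.loads(raw)
            if ids is not None and doc_id not in ids:
                continue
            if match_func is not None and not match_func(metadata):
                continue
            yield doc_id, text, metadata

    def close(self):
        self.conn.close()


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "corpus.sqlite"

    # Append two docs, then drop the in-memory handle entirely.
    corpus = DiskCorpus(path)
    first = corpus.append("64b/66b is a line code.", {"speaker_name": "Rick Santorum"})
    second = corpus.append("It was defined by the IEEE 802.3 working group.", {"speaker_name": "someone else"})
    corpus.close()

    # Reopen from disk and load only what is needed.
    corpus = DiskCorpus(path)
    matched = list(corpus.load(match_func=lambda m: m.get("speaker_name") == "Rick Santorum"))
    corpus.delete(second)
    remaining = [doc_id for doc_id, _, _ in corpus.load()]
    corpus.close()
```

The point of the sketch is that only the documents being appended or loaded are ever held in memory; a real implementation inside textacy would also need to serialize the parsed spaCy data, which this example deliberately leaves out.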

Let me know what you think and what needs more fleshing out conceptually.

Btw: I am still interested in having a chat (in regards to the learning map)

@gryBox gryBox closed this as completed Jun 24, 2017
@gryBox gryBox reopened this Jun 24, 2017