Feature Request: Stop truncating text in project datasets #1657
Comments
Hey @Ulipenitz 👋 I've passed this feature request on to the product team for consideration and will keep the thread updated.

Meanwhile, as a workaround, can you upload the dataset to Neptune as a serialized object? Given the size of the dataset, I'm assuming you don't need it to be human-readable on Neptune (but please correct me if I'm wrong).

You can upload the dataset as a pickle directly from memory:

```python
import neptune
from neptune.types import File

DATA_PATH = "data/train"
data = {
    "tokens": ["text", ..., "text"],
    "ner_tags": ["tag", ..., "tag"],
}

project = neptune.init_project()

for i in range(10):
    project[DATA_PATH][i].upload(File.as_pickle(data))
```

To download and use the dataset, fetch it from the project and load it with pickle:

```python
import pickle as pkl

# Download the pickle for a given index i; DOWNLOADED_FILE_PATH is the
# local path that download() saved it to
project[DATA_PATH][i].download()

with open(DOWNLOADED_FILE_PATH, "rb") as f:
    downloaded_dataset = pkl.load(f)
```

Please let me know if this would work for you 🙏
Thank you for the quick reply! I already tried this, but unfortunately I get an error like this:
I used this code:
The project folder exists, but
This was a bug in an older version of the client; it has been fixed in the latest release.
Sorry, I did not realize that I was not running on the newest version. It works now!
Perfect 🎉 I'll keep the thread open in case the product team needs further details 🚀
Quick update:
I will try to chunk the data so that I don't exceed this limit, but this workaround adds more complexity to our project.
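What I have in mind is roughly this (a minimal sketch; the chunk size, paths, and the stand-in dataset are all illustrative, not taken from our actual code):

```python
import neptune
from neptune.types import File

DATA_PATH = "data/train"
CHUNK_SIZE = 1000  # rows per uploaded pickle; illustrative value, chosen to stay under the limit

# Illustrative stand-in for the real dataset: one dict per row
rows = [{"tokens": ["text"], "ner_tags": ["tag"]} for _ in range(70_000)]

project = neptune.init_project()

# Upload the dataset as a series of smaller pickles instead of one large object
for chunk_idx, start in enumerate(range(0, len(rows), CHUNK_SIZE)):
    chunk = rows[start:start + CHUNK_SIZE]
    project[DATA_PATH][chunk_idx].upload(File.as_pickle(chunk))
```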
Is your feature request related to a problem? Please describe.
Related to this issue: #653
Describe the solution you'd like
As my dataset does not fit on my local disk, I am uploading it to the project in a loop, roughly like this:
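A minimal sketch of such a loop (the field names and the stand-in rows are illustrative; the point is that the values end up as plain string fields):

```python
import neptune

DATA_PATH = "data/train"

# Illustrative stand-in for the real (streamed) dataset
rows = [{"tokens": ["text"] * 500, "ner_tags": ["tag"] * 500} for _ in range(10)]

project = neptune.init_project()

# Each value is assigned as a plain string field, which is what gets
# truncated to 1000 characters
for i, row in enumerate(rows):
    project[DATA_PATH][i]["tokens"] = str(row["tokens"])
    project[DATA_PATH][i]["ner_tags"] = str(row["ner_tags"])
```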
Truncation to 1000 characters destroys my dataset.
To my knowledge, there is no other way to upload a dataset directly from memory (without saving it to a local file first), so this feature would be great!
Describe alternatives you've considered
I am thinking about saving these dicts {"tokens": ["text", ..., "text"], "ner_tags": ["tag", ..., "tag"]} to a file in each iteration and uploading each one as a file (e.g. data/train/0.pkl, data/train/1.pkl, ..., data/train/70000.pkl).
My dataset has 70,000 rows, so this is not a nice solution: I would have to write a file, upload it to Neptune, and delete it from local disk 70,000 times (roughly the sketch below). Downloading the data would get just as messy.
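For illustration, that file-based workaround would look roughly like this (a sketch only; the paths and the per-row dict are placeholders):

```python
import os
import pickle

import neptune

project = neptune.init_project()

# For every single row: write a pickle to disk, upload it, then delete the local copy
for i in range(70_000):
    row = {"tokens": ["text"], "ner_tags": ["tag"]}  # placeholder for the real row
    local_path = f"{i}.pkl"
    with open(local_path, "wb") as f:
        pickle.dump(row, f)
    # wait=True so the file is fully sent before it is removed from disk
    project[f"data/train/{i}"].upload(local_path, wait=True)
    os.remove(local_path)
```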