
Feature Request: Stop truncating text in project datasets #1657

Open
Ulipenitz opened this issue Feb 18, 2024 · 6 comments

@Ulipenitz

Is your feature request related to a problem? Please describe.

Related to this issue: #653

Describe the solution you'd like

Since my dataset does not fit on my local disk, I am uploading it to the project in a loop like this:

project[DATA_PATH].append(
    stringify_unsupported(
        {
            "tokens": ["text", ..., "text"],
            "ner_tags": ["tag", ..., "tag"],
        }
    )
)

Truncation to 1,000 characters destroys my dataset.
To my knowledge, there is no other way to upload a dataset directly from memory (without saving it to a local file first), so this feature would be great!

Describe alternatives you've considered

I am thinking about saving these dicts {"tokens": ["text",...., "text"], "ner_tags": ["tag",...,"tag"]} to a file in each iteration and uploading each one as a file (e.g. data/train/0.pkl, data/train/1.pkl ... data/train/70000.pkl).
My dataset has 70,000 rows, so this is not a nice solution: I would have to create a file, upload it to Neptune, and delete it from local disk 70,000 times. Downloading the data would get messy as well.
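
For illustration, a minimal sketch of this file-per-row alternative (the temporary file handling, the wait() call to flush the upload queue before deleting, and the placeholder rows are assumptions, not tested code):

import os
import pickle
import neptune

DATA_PATH = "data/train"
rows = [{"tokens": ["text"], "ner_tags": ["tag"]}]  # stand-in for the real 70,000 rows

project = neptune.init_project()

for i, row in enumerate(rows):
    tmp_file = f"{i}.pkl"
    with open(tmp_file, "wb") as f:
        pickle.dump(row, f)                       # serialize one row to disk
    project[f"{DATA_PATH}/{i}"].upload(tmp_file)  # upload the temporary file to the project
    project.wait()                                # wait for the async upload before deleting the file
    os.remove(tmp_file)                           # free local disk space again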

@SiddhantSadangi
Member

Hey @Ulipenitz 👋

I've passed on this feature request to the product team for consideration and will keep the thread updated.

Meanwhile, as a workaround, could you upload the dataset to Neptune as a serialized object? Given the size of the dataset, I am assuming you don't need it to be human-readable on Neptune (but please correct me if I am wrong).

You can upload the dataset as a pickle directly from memory using neptune.types.File.as_pickle(). It would look like this:

import neptune
from neptune.types import File

DATA_PATH = "data/train"

data = {
    "tokens": ["text", ..., "text"],
    "ner_tags": ["tag", ..., "tag"],
}

project = neptune.init_project()

for i in range(10):
    # upload each piece of the dataset as a pickled object under data/train/<i>
    project[DATA_PATH][i].upload(File.as_pickle(data))

To download and use the dataset, you can download it from the project and load it using pickle:

import pickle as pkl

# downloads the pickled file for index i to the current working directory
project[DATA_PATH][i].download()

# DOWNLOADED_FILE_PATH is the local path the file was downloaded to
with open(DOWNLOADED_FILE_PATH, "rb") as f:
    downloaded_dataset = pkl.load(f)

Please let me know if this would work for you 🙏

@Ulipenitz
Author

Thank you for the quick reply!

I already tried this, but unfortunately I get an error like this:

FileNotFoundError: [Errno 2] No such file or directory: 'ABSOLUTEPATH\\.neptune\\async\\project__9701b6a4-d310-4f5f-a6e0-7827a05c1e78\\exec-1708349077.259059-2024-02-19_14.24.37.259059-5884\\upload_path\\data_dummy_data-1708349077.32419-2024-02-19_14.24.37.324190.pkl'

I used this code:

project = neptune.init_project()
data = {"a": 0, "b": 1}
project["data/dummy_data"].upload(File.as_pickle(data))

The project folder exists, but exec-1708349077 does not.

@SiddhantSadangi
Member

This was a bug in neptune<0.19. Could you update neptune to the latest version using pip install -U neptune and try again?
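
For reference, a quick way to confirm which client version is installed from Python (this snippet is just a sanity check, not part of the original reply):

import neptune

# print the installed neptune client version; the fix mentioned above requires a recent release
print(neptune.__version__)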

@Ulipenitz
Author

Sorry, I did not realize that I was not running on the newest version. It works now!
Also, your proposed solution works! Thanks for the help! :-)

@SiddhantSadangi
Member

Perfect 🎉

I'll keep the thread open in case the product team needs further details 🚀

@Ulipenitz
Author

Quick update:
Initially I tested with a subset of the data, but with the big dataset I get this error:

NeptuneFieldCountLimitExceedException

There are too many fields (more than 9000) in the [PROJECTNAME] project.
We have stopped the synchronization to the Neptune server and stored the data locally.

I will try to chunk the data so that I won't exceed this limit, but this workaround adds more complexity to our project.
It would be great to have bigger limits for bigger datasets.
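
To make that concrete, here is a rough sketch of the chunking idea (CHUNK_SIZE, the placeholder rows list, and the field naming are illustrative assumptions, not code from this thread):

import neptune
from neptune.types import File

CHUNK_SIZE = 1000  # rows per uploaded pickle; chosen so the total field count stays well below 9000

# stand-in for the real 70,000-row dataset
rows = [{"tokens": ["text"], "ner_tags": ["tag"]}] * 70_000

project = neptune.init_project()

# one field per chunk instead of one field per row keeps the project under the field limit
for chunk_idx in range(0, len(rows), CHUNK_SIZE):
    chunk = rows[chunk_idx:chunk_idx + CHUNK_SIZE]
    project[f"data/train/chunk_{chunk_idx // CHUNK_SIZE}"].upload(File.as_pickle(chunk))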
