Feature Request: Stop truncating text in project datasets #1657
Comments
Hey @Ulipenitz 👋 I've passed this feature request on to the product team for consideration and will keep the thread updated.

Meanwhile, as a workaround, can you upload the dataset to Neptune as a serialized object? Given the size of the dataset, I'm assuming you don't need it to be human-readable on Neptune (but please correct me if I'm wrong).

You can upload the dataset as a pickle directly from memory:

```python
import neptune
from neptune.types import File

DATA_PATH = "data/train"
data = {
    "tokens": ["text", ..., "text"],
    "ner_tags": ["tag", ..., "tag"],
}

project = neptune.init_project()

for i in range(10):
    project[DATA_PATH][i].upload(File.as_pickle(data))
```

To download and use the dataset, fetch it from the project and load it with pickle:

```python
import pickle as pkl

# Download the pickle for a given index i; DOWNLOADED_FILE_PATH is the
# local path that download() saved it to
project[DATA_PATH][i].download()

with open(DOWNLOADED_FILE_PATH, "rb") as f:
    downloaded_dataset = pkl.load(f)
```

Please let me know if this would work for you 🙏
Thank you for the quick reply! I already tried this, but unfortunately I get an error like this:
I used this code:
The project folder exists, but
This was a bug in an older version of the client; it has been fixed in the latest release.
Sorry, I did not realize that I was not running on the newest version. It works now!
Perfect 🎉 I'll keep the thread open in case the product team needs further details 🚀
Quick update:
I will try to chunk the data so that I don't exceed this limit, but this workaround adds more complexity to our project.
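What I have in mind is roughly this (a minimal sketch; the chunk size, paths, and the stand-in dataset are all illustrative, not taken from our actual code):

```python
import neptune
from neptune.types import File

DATA_PATH = "data/train"
CHUNK_SIZE = 1000  # rows per uploaded pickle; illustrative value, chosen to stay under the limit

# Illustrative stand-in for the real dataset: one dict per row
rows = [{"tokens": ["text"], "ner_tags": ["tag"]} for _ in range(70_000)]

project = neptune.init_project()

# Upload the dataset as a series of smaller pickles instead of one large object
for chunk_idx, start in enumerate(range(0, len(rows), CHUNK_SIZE)):
    chunk = rows[start:start + CHUNK_SIZE]
    project[DATA_PATH][chunk_idx].upload(File.as_pickle(chunk))
```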
Is your feature request related to a problem? Please describe.
Related to this issue: #653
Describe the solution you'd like
As my dataset does not fit on my local disk, I am uploading it to the project in a loop, roughly like this:
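A minimal sketch of such a loop (the field names and the stand-in rows are illustrative; the point is that the values end up as plain string fields):

```python
import neptune

DATA_PATH = "data/train"

# Illustrative stand-in for the real (streamed) dataset
rows = [{"tokens": ["text"] * 500, "ner_tags": ["tag"] * 500} for _ in range(10)]

project = neptune.init_project()

# Each value is assigned as a plain string field, which is what gets
# truncated to 1000 characters
for i, row in enumerate(rows):
    project[DATA_PATH][i]["tokens"] = str(row["tokens"])
    project[DATA_PATH][i]["ner_tags"] = str(row["ner_tags"])
```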
Truncation to 1000 characters destroys my dataset.
To my knowledge, there is no other way to upload a dataset directly from memory (without saving it to a local file first), so this feature would be great!
Describe alternatives you've considered
I am thinking about saving these dicts {"tokens": ["text", ..., "text"], "ner_tags": ["tag", ..., "tag"]} to a file in each iteration and uploading each one as a file (e.g. data/train/0.pkl, data/train/1.pkl, ..., data/train/70000.pkl).
My dataset has 70,000 rows, so this is not a nice solution: I would have to write a file, upload it to Neptune, and delete it from local disk 70,000 times (roughly the sketch below). Downloading the data would get just as messy.
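For illustration, that file-based workaround would look roughly like this (a sketch only; the paths and the per-row dict are placeholders):

```python
import os
import pickle

import neptune

project = neptune.init_project()

# For every single row: write a pickle to disk, upload it, then delete the local copy
for i in range(70_000):
    row = {"tokens": ["text"], "ner_tags": ["tag"]}  # placeholder for the real row
    local_path = f"{i}.pkl"
    with open(local_path, "wb") as f:
        pickle.dump(row, f)
    # wait=True so the file is fully sent before it is removed from disk
    project[f"data/train/{i}"].upload(local_path, wait=True)
    os.remove(local_path)
```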