Add data retention policy #188

Open: wants to merge 4 commits into base: main

@asmacdo (Member) commented Aug 6, 2024

Here's a sketch of a possible data retention policy. Let's iron out what we want here prior to implementation.

Fixes: #182

Based on Yarik's initial thoughts: #177 (comment)

@asmacdo asmacdo requested a review from yarikoptic August 9, 2024 15:46
Remove unnecessary (and unclosed paren
@yarikoptic (Member) left a comment

I think this is a good starting point. Once it is implemented and deployed, we will see how it could be improved.

doc/design/data-retention-policy.md (outdated)
- `nwb_cache`
- Yarn Cache
- `__pycache__`
- pip cache

If the user is still active, I think it would be useful to notify long-running users once any of those folders reaches some threshold (e.g. 50 MB), asking them to clean it up.

@kabilar (Member) commented Sep 17, 2024

Hi @asmacdo, should we add a separate point here about monitoring and reporting the quotas of cache directories for active users?

doc/design/data-retention-policy.md
- large file list
- summarized data retention policy
- Notice number
- request to cleanup

Meanwhile, it might be worth creating a simple data record schema to store those records as well, so that tools could reuse them to assemble higher-level stats, etc.
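
To make the schema suggestion concrete, here is a minimal sketch of what one audit record could look like. Every field name below is hypothetical (nothing in the policy draft fixes them), and JSON is only an assumed storage format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AuditRecord:
    """One audit result for one user. All field names are hypothetical."""

    username: str          # GitHub username, as used for the EFS directory
    audited_at: str        # ISO 8601 timestamp of the audit run
    total_bytes: int       # total home directory disk usage
    large_file_count: int  # number of files over the size/age thresholds
    large_file_bytes: int  # combined size of those files
    cache_bytes: int       # combined size of outdated cache folders
    notice_number: int = 0 # how many notices have been sent so far

    def to_json(self) -> str:
        # Sorted keys keep stored records stable and easy to diff.
        return json.dumps(asdict(self), sort_keys=True)


def total_usage(records):
    """Example of a higher-level stat assembled from stored records."""
    return sum(r.total_bytes for r in records)
```

Keeping the record flat and serializable is the point: the notifier and any reporting tool could then read the same data.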

Co-authored-by: Yaroslav Halchenko <[email protected]>
Comment on lines +3 to +4
Dandihub data storage on AWS EFS is expensive, and we suppose that significant portions of the data
currently stored are no longer used. Data migration is where the cost becomes extreme.

Since S3 buckets can now mount to EC2 instances (reference: August 2023 blog post) and S3 costs are ~10X cheaper than EFS, as part of this data retention work perhaps we should also look into what it would take to move to S3 storage (and discuss any features that would not be available with this migration)?


Filed #198

@dandi dandi locked and limited conversation to collaborators Sep 17, 2024
@kabilar kabilar changed the title Initial commit for data retention policy discussion Add data retention policy Sep 17, 2024
- dandiarchive login information

## Automated Data Audit

@kabilar (Member) commented Sep 17, 2024

Suggested change
At an interval of 7 days:
- Calculate home directory disk usage


## Automated Data Audit

At some interval (30 days with no login?):

Suggested change
At some interval (30 days with no login?):
At an interval of 30 days with no login to JupyterHub:

Dandihub data storage on AWS EFS is expensive, and we suppose that significant portions of the data
currently stored are no longer used. Data migration is where the cost becomes extreme.

## Persistent Data locations

Suggested change
## Persistent Data locations
## Persistent Data Locations

Within jupyterhub `/home/{user}` the user always sees `/home/jovyan`, but is stored in EFS as their GitHub
username.

## Known cache file cleanup

Suggested change
## Known cache file cleanup
## Known Cache File Cleanup

Comment on lines +8 to +11
Each user has access to 2 locations: `/home/{user}` and `/shared/`.

Within jupyterhub `/home/{user}` the user always sees `/home/jovyan`, but is stored in EFS as their GitHub
username.

We were previously considering providing a /scratch directory for each user that is automatically cleaned up after 30 days. In addition to the policy for the /home/jovyan directory, do we also want to implement a /scratch directory with a 30 day clean up policy?

Comment on lines +34 to +36
- find files larger than 100 (?) GB and mtime > 10 (?) days -- get total size and count
- find files larger than 1 (?) GB and mtime > 30 (?) days -- get total size and count
- find _pycache_ and nwb-cache folders and pip cache and mtime > 30? days -- total sizes and list of them

Suggested change
- find files larger than 100 (?) GB and mtime > 10 (?) days -- get total size and count
- find files larger than 1 (?) GB and mtime > 30 (?) days -- get total size and count
- find _pycache_ and nwb-cache folders and pip cache and mtime > 30? days -- total sizes and list of them
- find files larger than 100 GB and mtime > 10 days -- get total size and count
- find files larger than 1 GB and mtime > 30 days -- get total size and count
- find _pycache_ and nwb-cache folders and pip cache and mtime > 30 days -- total sizes and list of them
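
For reference, the large-file checks above could be sketched roughly as follows. This is only an illustration under assumptions: the function name and parameters are hypothetical, "mtime > N days" is read as "last modified more than N days ago", and GB is taken as decimal gigabytes.

```python
import os
import stat
import time

SECONDS_PER_DAY = 86400
GB = 10 ** 9  # assumption: decimal gigabytes


def find_stale_files(root, min_size_bytes, min_age_days, now=None):
    """Return (total_size, count) for regular files under `root` that are
    larger than `min_size_bytes` and were last modified more than
    `min_age_days` days ago."""
    now = time.time() if now is None else now
    cutoff = now - min_age_days * SECONDS_PER_DAY
    total_size = 0
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(info.st_mode):
                continue
            if info.st_size > min_size_bytes and info.st_mtime < cutoff:
                total_size += info.st_size
                count += 1
    return total_size, count
```

With this helper, the first two checks would be `find_stale_files(home, 100 * GB, 10)` and `find_stale_files(home, 1 * GB, 30)`, where `home` is the user's directory on EFS.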


Notify user if:
- any of the above listed thresholds were reached
- total du exceeds some threshold (e.g. 100G)

Suggested change
- total du exceeds some threshold (e.g. 100G)
- total home directory disk usage exceeds 1 TB


I suggested a quota of 1 TB for home directories as many datasets are getting to be quite large. This would provide temporary, high-capacity storage, but hopefully users won't get anywhere near this threshold. This would cost $300/user/month for standard EFS, and $23/user/month if we move to Standard S3.

If we implement a scratch directory, then perhaps the home directory can have a much smaller quota.
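
The arithmetic behind those two figures can be checked with a short sketch. The per-GB rates are not quoted in this thread; they are back-derived from the $300 and $23 numbers above, assuming 1 TB = 1000 GB, and actual AWS pricing varies by region and storage class.

```python
# Assumed rates in USD per GB-month, implied by the figures in the comment
# above ($300 and $23 per 1 TB per month, with 1 TB taken as 1000 GB).
EFS_USD_PER_GB_MONTH = 0.30
S3_USD_PER_GB_MONTH = 0.023


def monthly_cost_usd(size_gb, usd_per_gb_month):
    """Monthly storage cost for `size_gb` gigabytes at a flat per-GB rate."""
    return size_gb * usd_per_gb_month


quota_gb = 1000  # the proposed 1 TB home directory quota
efs_cost = monthly_cost_usd(quota_gb, EFS_USD_PER_GB_MONTH)
s3_cost = monthly_cost_usd(quota_gb, S3_USD_PER_GB_MONTH)
print(f"EFS: ${efs_cost:.2f}/user/month, S3: ${s3_cost:.2f}/user/month")
```

The same function can be used to compare a much smaller home directory quota if a scratch directory is adopted.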

Notify user if:
- any of the above listed thresholds were reached
- total du exceeds some threshold (e.g. 100G)
- total outdated caches size exceeds some threshold (e.g. 1G)

Suggested change
- total outdated caches size exceeds some threshold (e.g. 1G)
- total outdated caches size exceeds 1 GB

- prior notification was sent more than a week ago
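
One possible reading of these conditions, sketched below: notify when at least one threshold is reached and the last notice (if any) is more than a week old. The draft does not say how the bullets combine, so this combination, like the function name and its parameters, is an assumption.

```python
from datetime import datetime, timedelta


def should_notify(threshold_hit, last_notified, now, min_gap=timedelta(days=7)):
    """Decide whether to send a notice.

    threshold_hit: True if any audit threshold was reached.
    last_notified: datetime of the previous notice, or None if never sent.
    now:           datetime of the current audit run.
    """
    if not threshold_hit:
        return False
    if last_notified is None:
        return True  # never notified before
    return now - last_notified > min_gap
```

Treating "never notified" as eligible keeps the first notice from being blocked by the one-week rule.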

Notification information:
- large file list

Suggested change
- large file list
- summarized audit data (total size and count for each of the above thresholds)
- large file list

@kabilar (Member) left a comment

Thank you, @asmacdo. This is great. A few suggestions are listed above.

@kabilar (Member) commented Oct 14, 2024

Hi @asmacdo, please let me know when this is ready for review. And then we can update the DANDI Terms and Policies as needed.

@kabilar (Member) commented Jan 23, 2025

@asmacdo Continuing the discussion from Slack.

As we work toward ephemeral environments, and given our current strategy of notifying users monthly, perhaps we should just have a policy that users with data totaling more than 10 GB get an email notice?

Proposed updated email template:

Hi <github/dandi username>,

The DANDI team is working to reduce our DANDI Hub costs. A large portion of our costs comes from data stored on [DANDI Hub](https://hub.dandiarchive.org/) (not DANDI Archive).

There is currently about X GB stored under your user directory on DANDI Hub.

The data storage available on DANDI Hub is meant for environment management and should not exceed 10 GB. Data files should be uploaded to DANDI Archive. Please email [email protected] if you need to store more than 10 GB. We will review each request individually and work with you to find a solution for your compute requirements.

Can you please review your files stored on DANDI Hub, upload any relevant files to your respective Dandisets on DANDI Archive, and delete any unused files on DANDI Hub?

Thank you.

DANDI Team
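
A rough sketch of how the 10 GB rule and the template could be wired together is below. The names `LIMIT_GB`, `NOTICE`, and `build_notice` are hypothetical, the template text is abbreviated to two lines, and GB is assumed to be decimal.

```python
from string import Template

LIMIT_GB = 10  # the threshold proposed above

# Abbreviated version of the proposed email; only the variable parts matter here.
NOTICE = Template(
    "Hi $username,\n\n"
    "There is currently about $usage_gb GB stored under your user directory "
    "on DANDI Hub.\n"
)


def build_notice(username, usage_bytes, limit_gb=LIMIT_GB):
    """Return the notice text, or None if the user is at or under the limit."""
    usage_gb = usage_bytes / 1e9  # assumption: decimal gigabytes
    if usage_gb <= limit_gb:
        return None
    return NOTICE.substitute(username=username, usage_gb=round(usage_gb))
```

Returning `None` for users under the limit lets the caller loop over all users and send only the non-empty results.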
