Add data retention policy #188
base: main
Remove unnecessary (and unclosed paren
I think it is good as a starting point. Once it is implemented and deployed, we will see how it could be improved.
- `nwb_cache`
- Yarn Cache
- `__pycache__`
- pip cache
If the user is still active, I think it would be useful to notify long-running users once any of those folders reaches some threshold (e.g. 50 MB), asking them to clean it up.
Hi @asmacdo, should we add a separate point here about monitoring and reporting the quotas of cache directories for active users?
- large file list
- summarized data retention policy
- Notice number
- request to cleanup
Meanwhile, it might be worth creating a simple data record schema to store those records as well, so they could be reused by tools to assemble higher-level stats, etc.
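To make the schema idea concrete, here is a minimal sketch of what one stored record could look like. Everything in it is an assumption for discussion: the class name `AuditRecord`, the field names, and the choice of JSON as the storage format do not come from the PR.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AuditRecord:
    """One audit result for one user (hypothetical schema, not from the PR)."""

    username: str
    audited_at: str        # ISO 8601 timestamp of the audit run
    total_bytes: int       # total home directory usage
    large_file_count: int  # files matching the size/mtime thresholds
    large_file_bytes: int
    cache_bytes: int       # combined size of known cache directories
    notice_number: int = 0 # how many notices have been sent so far

    def to_json(self) -> str:
        # Sorted keys keep the stored lines stable and easy to diff.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "AuditRecord":
        return cls(**json.loads(text))


record = AuditRecord(
    username="example-user",
    audited_at="2024-01-01T00:00:00+00:00",
    total_bytes=120 * 10**9,
    large_file_count=3,
    large_file_bytes=90 * 10**9,
    cache_bytes=2 * 10**9,
)
line = record.to_json()
```

Writing one JSON line per audit run would let later tooling aggregate records per user or per month without re-scanning the filesystem.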
Co-authored-by: Yaroslav Halchenko <[email protected]>
Dandihub data storage on AWS EFS is expensive, and we suppose that significant portions of the data
currently stored are no longer used. Data migration is where the cost becomes extreme.
Since S3 buckets can now mount to EC2 instances (reference: August 2023 blog post) and S3 costs are ~10X cheaper than EFS, as part of this data retention work perhaps we should also look into what it would take to move to S3 storage (and discuss any features that would not be available with this migration)?
Filed #198
- dandiarchive login information

## Automated Data Audit
At an interval of 7 days:
- Calculate home directory disk usage
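As a rough illustration of the "calculate home directory disk usage" step, the sketch below sums file sizes under a directory. The function name `directory_usage_bytes` is hypothetical, and it reports apparent file sizes rather than allocated blocks, so its result can differ from what `du` prints.

```python
import os
import stat


def directory_usage_bytes(root: str) -> int:
    """Sum the apparent sizes of regular files under root.

    Symlinks are not followed and are not counted.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(info.st_mode):
                total += info.st_size
    return total
```

On a large EFS mount a full walk like this can be slow, so a weekly schedule as suggested above seems like a reasonable fit.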
## Automated Data Audit

At some interval (30 days with no login?):
At some interval (30 days with no login?):
At an interval of 30 days with no login to JupyterHub:
Dandihub data storage on AWS EFS is expensive, and we suppose that significant portions of the data
currently stored are no longer used. Data migration is where the cost becomes extreme.

## Persistent Data locations
## Persistent Data locations
## Persistent Data Locations
Within jupyterhub `/home/{user}` the user always sees `/home/jovyan`, but is stored in EFS as their GitHub
username.

## Known cache file cleanup
## Known cache file cleanup
## Known Cache File Cleanup
Each user has access to 2 locations: `/home/{user}` and `/shared/`.

Within jupyterhub `/home/{user}` the user always sees `/home/jovyan`, but is stored in EFS as their GitHub
username.
We were previously considering providing a `/scratch` directory for each user that is automatically cleaned up after 30 days. In addition to the policy for the `/home/jovyan` directory, do we also want to implement a `/scratch` directory with a 30-day cleanup policy?
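If we do go with a scratch directory, the 30-day cleanup could be sketched roughly as below. This is only an illustration under assumptions: the function name `clean_scratch` is made up, "age" is taken to mean the file's mtime, and the default is a dry run so nothing is deleted unless explicitly requested.

```python
import os
import stat
import time


def clean_scratch(scratch_root, max_age_days=30, dry_run=True, now=None):
    """Return regular files under scratch_root whose mtime is older than
    max_age_days, deleting them only when dry_run is False."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(scratch_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(info.st_mode) and info.st_mtime < cutoff:
                stale.append(path)
    if not dry_run:
        for path in stale:
            os.remove(path)
    return stale
```

Running it with `dry_run=True` first would give us a list of what would be removed, which could also feed the user notification.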
- find files larger than 100 (?) GB and mtime > 10 (?) days -- get total size and count
- find files larger than 1 (?) GB and mtime > 30 (?) days -- get total size and count
- find _pycache_ and nwb-cache folders and pip cache and mtime > 30? days -- total sizes and list of them
- find files larger than 100 (?) GB and mtime > 10 (?) days -- get total size and count
- find files larger than 1 (?) GB and mtime > 30 (?) days -- get total size and count
- find _pycache_ and nwb-cache folders and pip cache and mtime > 30? days -- total sizes and list of them
- find files larger than 100 GB and mtime > 10 days -- get total size and count
- find files larger than 1 GB and mtime > 30 days -- get total size and count
- find _pycache_ and nwb-cache folders and pip cache and mtime > 30 days -- total sizes and list of them
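The size and mtime checks in the bullets above could be sketched as follows. The function names `find_large_old_files` and `summarize` are hypothetical, and "mtime > N days" is read here as "last modified more than N days ago"; the cache-directory bullet is not covered by this sketch.

```python
import os
import stat
import time


def find_large_old_files(root, min_bytes, min_age_days, now=None):
    """Return (path, size) pairs for regular files larger than min_bytes
    whose mtime is more than min_age_days in the past."""
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if (stat.S_ISREG(info.st_mode)
                    and info.st_size > min_bytes
                    and info.st_mtime < cutoff):
                matches.append((path, info.st_size))
    return matches


def summarize(matches):
    """Total size and count, as requested for each threshold."""
    return {"count": len(matches),
            "total_bytes": sum(size for _, size in matches)}
```

For the two suggested thresholds this would be called as `find_large_old_files(home, 100 * 10**9, 10)` and `find_large_old_files(home, 10**9, 30)`, where `home` stands for the user's home directory on EFS.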

Notify user if:
- any of the above listed thresholds were reached
- total du exceeds some threshold (e.g. 100G)
- total du exceeds some threshold (e.g. 100G)
- total home directory disk usage exceeds 1 TB
I suggested a quota of 1 TB for home directories as many datasets are getting to be quite large. This would provide temporary, high-capacity storage, but hopefully users won't get anywhere near this threshold. This would cost $300/user/month for standard EFS, and $23/user/month if we move to Standard S3.
If we implement a scratch directory, then perhaps the home directory can have a much smaller quota.
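For reference, the arithmetic behind those two figures can be written out as below. The per-GB rates are simply the ones implied by the numbers quoted in this comment, assuming 1 TB = 1000 GB; actual AWS pricing varies by region and storage class, so treat them as placeholders.

```python
GB_PER_TB = 1000  # assumption: decimal terabyte

# Rates implied by the figures in this comment (USD per GB per month).
EFS_STANDARD_RATE = 0.30   # $300 / 1000 GB
S3_STANDARD_RATE = 0.023   # $23 / 1000 GB


def monthly_cost_usd(quota_gb: float, rate_per_gb_month: float) -> float:
    """Monthly storage cost for a fully used quota."""
    return quota_gb * rate_per_gb_month


efs_cost = monthly_cost_usd(1 * GB_PER_TB, EFS_STANDARD_RATE)
s3_cost = monthly_cost_usd(1 * GB_PER_TB, S3_STANDARD_RATE)
```

The same function makes it easy to see how a smaller home quota would change the per-user ceiling, for example `monthly_cost_usd(100, EFS_STANDARD_RATE)` for a 100 GB quota.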
Notify user if:
- any of the above listed thresholds were reached
- total du exceeds some threshold (e.g. 100G)
- total outdated caches size exceeds some threshold (e.g. 1G)
- total outdated caches size exceeds some threshold (e.g. 1G)
- total outdated caches size exceeds 1 GB
- prior notification was sent more than a week ago

Notification information:
- large file list
- large file list
- summarized audit data (total size and count for each of the above thresholds)
- large file list
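One possible reading of the "Notify user if" criteria, sketched for discussion. It assumes the 1 TB and 1 GB limits from the suggestions in this review, and it treats the "prior notification was sent more than a week ago" bullet as a rate limit applied on top of the other conditions, which the policy text does not state explicitly. The names `AuditSummary` and `should_notify` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class AuditSummary:
    large_file_count: int   # files matching any size/mtime threshold
    total_bytes: int        # total home directory usage
    stale_cache_bytes: int  # combined size of outdated caches


def should_notify(
    summary: AuditSummary,
    last_notified: Optional[datetime],
    now: datetime,
    total_limit: int = 10**12,              # 1 TB, from the suggestion
    cache_limit: int = 10**9,               # 1 GB, from the suggestion
    min_gap: timedelta = timedelta(days=7), # assumed rate limit
) -> bool:
    threshold_hit = (
        summary.large_file_count > 0
        or summary.total_bytes > total_limit
        or summary.stale_cache_bytes > cache_limit
    )
    if not threshold_hit:
        return False
    # Never notified before, or the last notice is old enough.
    return last_notified is None or now - last_notified > min_gap
```

If the week-ago bullet was instead meant as an independent trigger, the final line would need to change, so this is worth confirming before implementation.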
Thank you, @asmacdo. This is great. A few suggestions are listed above.
Hi @asmacdo, please let me know when this is ready for review. And then we can update the DANDI Terms and Policies as needed.
@asmacdo Continuing the discussion from Slack. As we work toward ephemeral environments, and given our current strategy of notifying users monthly, perhaps we should just have a policy that users with data totaling more than 10 GB get an email notice? Proposed updated email template:
Here's a sketch of a possible data retention policy. Let's iron out what we want here prior to implementation.
Fixes: #182
From Yarik's initial thoughts: #177 (comment)