Add data retention policy #188

Open · wants to merge 4 commits into base: main
58 changes: 58 additions & 0 deletions doc/design/data-retention-policy.md
@@ -0,0 +1,58 @@
# Data Retention Policy

DANDI Hub data storage on AWS EFS is expensive, and we suspect that significant portions of the currently stored data are no longer used. Data migration is where the cost becomes extreme.

## Persistent Data locations
Member:
Suggested change
## Persistent Data locations
## Persistent Data Locations


Each user has access to 2 locations: `/home/{user}` and `/shared/`.

Within JupyterHub, the user always sees `/home/{user}` as `/home/jovyan`, but it is stored in EFS under their GitHub username.

## Known cache file cleanup

We should be able to safely remove the following (a cleanup sketch follows the list):
- `/home/{user}/.cache`
- `nwb_cache`
- Yarn Cache
- `__pycache__`
- pip cache
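
As a rough illustration, a minimal sketch of such a cleanup is below. The EFS mount point, the exact cache directory names, and the report-only default are assumptions, not part of the policy; pip and yarn caches are assumed to live under `~/.cache`, so removing `.cache` covers them.

```python
#!/usr/bin/env python3
"""Sketch: report (and optionally delete) known cache directories in each home.

Assumption: home directories live under EFS_HOME, named by GitHub username.
"""
import argparse
import os
import shutil
from pathlib import Path

EFS_HOME = Path("/mnt/efs/home")                      # assumption: EFS mount point
CACHE_NAMES = {".cache", "__pycache__", "nwb_cache", "nwb-cache"}

def dir_size(path: Path) -> int:
    """Total size in bytes of regular files under path."""
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())

def find_caches(home: Path):
    """Yield cache directories, without descending into ones already found."""
    for root, dirs, _files in os.walk(home):
        for name in list(dirs):
            if name in CACHE_NAMES:
                yield Path(root) / name
                dirs.remove(name)                     # prune: do not walk into it

def main() -> None:
    parser = argparse.ArgumentParser(description="Clean known caches on EFS")
    parser.add_argument("--delete", action="store_true",
                        help="actually remove the caches (default: report only)")
    args = parser.parse_args()

    for home in sorted(EFS_HOME.iterdir()):
        if not home.is_dir():
            continue
        for cache in find_caches(home):
            print(f"{home.name}: {cache} ({dir_size(cache) / 1e6:.1f} MB)")
            if args.delete:
                shutil.rmtree(cache, ignore_errors=True)

if __name__ == "__main__":
    main()
```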
Member:
In case the user is still active, I think it would be useful to report to long-running users once any of those folders reaches some threshold (e.g. 50 MB), asking them to clean it up.

@kabilar (Sep 17, 2024):
Hi @asmacdo, should we add a separate point here about monitoring and reporting the quotas of cache directories for active users?

Comment on lines +13 to +20
Member:
Suggested change
## Known cache file cleanup
We should be able to safely remove the following:
- `/home/{user}/.cache`
- `nwb_cache`
- Yarn Cache
- `__pycache__`
- pip cache



## Determining Last Access

EFS does not store last-access metadata for the data. (Though AWS must track access somehow in order to move data to `Infrequent Access`.)

Alternatives:
- use the [jupyterhub REST API](https://jupyterhub.readthedocs.io/en/stable/reference/rest-api.html#operation/get-users) to check when a user last used/logged in to the hub (a sketch follows this list)
- dandiarchive login information
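
For the first alternative, a minimal sketch of querying the hub for each user's `last_activity` might look like the following; the hub URL, the token environment variable, and the 30-day threshold are assumptions for illustration.

```python
"""Sketch: list users whose last hub activity is older than a threshold.

Assumptions: JUPYTERHUB_API_TOKEN holds an admin-scoped token, and the hub
API is reachable at HUB_API_URL; both would need to match the deployment.
"""
import os
from datetime import datetime, timedelta, timezone

import requests

HUB_API_URL = "https://hub.dandiarchive.org/hub/api"   # assumption
THRESHOLD = timedelta(days=30)                          # assumption

def inactive_users():
    resp = requests.get(
        f"{HUB_API_URL}/users",
        headers={"Authorization": f"token {os.environ['JUPYTERHUB_API_TOKEN']}"},
        timeout=30,
    )
    resp.raise_for_status()
    now = datetime.now(timezone.utc)
    for user in resp.json():
        last = user.get("last_activity")                # ISO 8601 string or None
        if last is None:
            continue
        last_dt = datetime.fromisoformat(last.replace("Z", "+00:00"))
        if now - last_dt > THRESHOLD:
            yield user["name"], now - last_dt

if __name__ == "__main__":
    for name, age in inactive_users():
        print(f"{name}: inactive for {age.days} days")
```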
Comment on lines +25 to +29
@kabilar (Jan 30, 2025):
Suggested change
EFS does not store metadata for the last access of the data. (Though they must track somehow to move to `Infrequent Access`)
Alternatives:
- use the [jupyterhub REST API](https://jupyterhub.readthedocs.io/en/stable/reference/rest-api.html#operation/get-users) check when user last used/logged in to hub.
- dandiarchive login information
- Use the [JupyterHub REST API](https://jupyterhub.readthedocs.io/en/stable/reference/rest-api.html#operation/get-users) to check when user last logged in to the hub.
- On a daily basis determine if any users had last logged in 30 or 45 days ago. If so, send the emails noted in the #reset-home-directories-after-45-days-of-inactivity section.


## Automated Data Audit

@kabilar (Sep 17, 2024):
Suggested change
At an interval of every 7 days, calculate home directory disk usage.

Member:
This can be for our internal bookkeeping.

At some interval (30 days with no login?), do the following (a sketch follows this list):
- find files larger than 100 (?) GB and mtime > 10 (?) days -- get total size and count
- find files larger than 1 (?) GB and mtime > 30 (?) days -- get total size and count
- find `__pycache__`, `nwb-cache`, and pip cache folders with mtime > 30 (?) days -- get total sizes and a list of them
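
A minimal sketch of such an audit is below; the thresholds, cache names, and EFS mount point are placeholders rather than agreed values.

```python
"""Sketch: per-user audit of large/stale files and stale cache directories.

Thresholds and the EFS mount path below are placeholders for illustration.
"""
import time
from pathlib import Path

EFS_HOME = Path("/mnt/efs/home")       # assumption: EFS mount point
CACHE_NAMES = {"__pycache__", "nwb-cache", "nwb_cache", ".cache"}
RULES = [                              # (min size in bytes, min age in days)
    (100 * 1024**3, 10),               # files > 100 GB untouched for 10 days
    (1 * 1024**3, 30),                 # files > 1 GB untouched for 30 days
]

def audit_home(home: Path) -> dict:
    now = time.time()
    report = {"large_files": [], "stale_caches": [], "total_bytes": 0}
    for path in home.rglob("*"):
        try:
            st = path.stat()
        except OSError:
            continue                   # skip broken symlinks, races, etc.
        age_days = (now - st.st_mtime) / 86400
        if path.is_file():
            report["total_bytes"] += st.st_size
            if any(st.st_size > size and age_days > days for size, days in RULES):
                report["large_files"].append((str(path), st.st_size, int(age_days)))
        elif path.is_dir() and path.name in CACHE_NAMES and age_days > 30:
            report["stale_caches"].append(str(path))
    return report

if __name__ == "__main__":
    for home in sorted(EFS_HOME.iterdir()):
        if home.is_dir():
            r = audit_home(home)
            print(home.name, len(r["large_files"]), "large files,",
                  len(r["stale_caches"]), "stale caches,",
                  f"{r['total_bytes'] / 1024**3:.1f} GiB total")
```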
Comment on lines +33 to +36
Member:
Suggested change
At some interval (30 days with no login?):
- find files larger than 100 (?) GB and mtime > 10 (?) days -- get total size and count
- find files larger than 1 (?) GB and mtime > 30 (?) days -- get total size and count
- find _pycache_ and nwb-cache folders and pip cache and mtime > 30? days -- total sizes and list of them


Notify user if (a decision sketch follows these lists):
- any of the above listed thresholds were reached
- total du exceeds some threshold (e.g. 100G)
- total outdated caches size exceeds some threshold (e.g. 1G)
- prior notification was sent more than a week ago

Notification information:
- large file list
- summarized data retention policy
- Notice number
- request to cleanup
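
For illustration, the notification decision could look like the sketch below; the thresholds and the inputs (large-file list, total usage, stale-cache size, time of the prior notice) are placeholders, not agreed values.

```python
"""Sketch: decide whether to send a cleanup notification to a user."""
from datetime import datetime, timedelta, timezone
from typing import Optional

TOTAL_DU_THRESHOLD = 100 * 1024**3      # e.g. 100 GB of total disk usage
CACHE_THRESHOLD = 1 * 1024**3           # e.g. 1 GB of outdated caches
RENOTIFY_AFTER = timedelta(days=7)      # re-send at most weekly

def should_notify(
    large_files: list,
    total_bytes: int,
    stale_cache_bytes: int,
    last_notice: Optional[datetime],
) -> bool:
    """Return True if a (new) notification should be sent."""
    if last_notice is not None and datetime.now(timezone.utc) - last_notice < RENOTIFY_AFTER:
        return False                               # prior notice is less than a week old
    return (
        bool(large_files)                          # any per-file threshold reached
        or total_bytes > TOTAL_DU_THRESHOLD        # total du too large
        or stale_cache_bytes > CACHE_THRESHOLD     # outdated caches too large
    )
```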
Member:
Meanwhile, it might be worth creating a simple data record schema to store those records as well, so they could be reused by tools to assemble higher-level stats, etc.
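
For illustration only, one possible shape for such a record is sketched below; the field names are hypothetical, not an agreed schema.

```python
"""Sketch: one possible shape for a per-user audit record (hypothetical fields)."""
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime

@dataclass
class AuditRecord:
    username: str                 # GitHub username (EFS directory name)
    audited_at: datetime
    total_bytes: int
    large_files: list = field(default_factory=list)
    stale_cache_bytes: int = 0
    notices_sent: int = 0         # how many cleanup notices this user has received

    def to_json(self) -> str:
        d = asdict(self)
        d["audited_at"] = self.audited_at.isoformat()
        return json.dumps(d)
```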

Comment on lines +38 to +48
Member:
Based on our discussion this week, for now let's just reset home directories if users haven't accessed the hub in 45 days. If we have many users storing more than 10 GB over the long run, we can add pieces of this back.

Suggested change
Notify user if:
- any of the above listed thresholds were reached
- total du exceeds some threshold (e.g. 100G)
- total outdated caches size exceeds some threshold (e.g. 1G)
- prior notification was sent more than a week ago
Notification information:
- large file list
- summarized data retention policy
- Notice number
- request to cleanup


### Non-response cleanup

If a user has not logged in for 60 days (30 days initial + 30 days following audit), send a warning:
`In 10 days the following files will be cleaned up`

If the user has not logged in for 70 days (30 initial + 30 after audit + 10 warning):
`The following files were removed`

Reset timer.
Comment on lines +50 to +58
@kabilar (Jan 30, 2025):
Suggested change
### Non-response cleanup
If a user has not logged in for 60 days (30 days initial + 30 days following audit), send a warning:
`In 10 days the following files will be cleaned up`
If the user has not logged in for 70 days (30 initial + 30 after audit + 10 warning):
`The following files were removed`
Reset timer.
### Reset home directories after 45 days of inactivity
If a user has not logged in for 30 days, send a warning:
`In 15 days, the files in your home directory on DANDI Hub will be deleted. Please review your files stored on DANDI Hub and upload any relevant files to your respective Dandisets on DANDI Archive. If you would like to keep the files on DANDI Hub, please log into the Hub within the next 15 days.`
If the user has not logged in for 45 days, send a confirmation:
`The files in your home directory on DANDI Hub were deleted.`
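
A minimal sketch of that schedule, assuming the 30/45-day thresholds from the suggestion above; how last-login times are obtained and how the emails and deletion are actually performed is left out.

```python
"""Sketch: pick the action for a user based on days since last hub login.

The 30/45-day thresholds follow the suggested policy above; the warning and
deletion steps named in the comments are placeholders for the real mechanism.
"""
from datetime import datetime, timezone

WARN_AFTER_DAYS = 30      # warning email after 30 days of inactivity
DELETE_AFTER_DAYS = 45    # home directory reset after 45 days

def action_for(last_login: datetime) -> str:
    """Return 'delete', 'warn', or 'none' for a user's last login time."""
    idle_days = (datetime.now(timezone.utc) - last_login).days
    if idle_days >= DELETE_AFTER_DAYS:
        return "delete"   # reset home directory and send the confirmation email
    if idle_days >= WARN_AFTER_DAYS:
        return "warn"     # send the 15-day warning email
    return "none"
```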