script to generate report for user resource usage #199

Open. Wants to merge 132 commits into base: main.

Commits
30e9cbf
Inital commit to add GH action to generate report
asmacdo Sep 25, 2024
3bcba91
Assume Jupyterhub Provisioning Role
asmacdo Sep 25, 2024
b5cdcf3
Fixup: indent
asmacdo Sep 25, 2024
6e118cc
Rename job
asmacdo Sep 25, 2024
5062f08
Add assumed role to update-kubeconfig
asmacdo Sep 25, 2024
d21a3a9
No need to add ProvisioningRole to masters
asmacdo Sep 25, 2024
403028f
Deploy a pod to the cluster, and schedule with Karpenter
asmacdo Sep 25, 2024
92b9925
Fixup: correct path to pod manifest
asmacdo Sep 25, 2024
478a31f
Fixup again ugh, rename file
asmacdo Sep 25, 2024
9db914e
Delete Pod even if previous step times out
asmacdo Sep 25, 2024
8458d01
Hack out initial du
asmacdo Oct 11, 2024
7999455
tmp comment out job deployment, test dockerhub build
asmacdo Nov 8, 2024
d2e65de
Fixup hyphens for image name
asmacdo Nov 8, 2024
5e9e7df
Write file to output location
asmacdo Nov 8, 2024
d33973c
use kubectl cp to retrieve report
asmacdo Nov 8, 2024
98fecbc
Combine run blocks to use vars
asmacdo Nov 8, 2024
40ae0e8
Mount efs and pass arg to du script
asmacdo Nov 8, 2024
4c978f7
Comment out repo pushing, lets see if the report runs
asmacdo Nov 8, 2024
6bd7b82
Restrict job to asmacdo for testing
asmacdo Nov 8, 2024
73c3e80
Sanity check. Just list the directories
asmacdo Nov 8, 2024
685dfb1
Job was deployed, but never assigned to node, back to sanity check
asmacdo Nov 8, 2024
f6afefc
change from job to pod
asmacdo Nov 8, 2024
6dad759
deploy pod to same namespace as pvc
asmacdo Nov 8, 2024
3a33937
Use ns in action
asmacdo Nov 8, 2024
1ffb1c9
increase timeout to 60s
asmacdo Nov 8, 2024
58e0753
fixup: image name in manifest
asmacdo Nov 8, 2024
6767755
increase timeout to 150
asmacdo Nov 8, 2024
cbf951e
override entrypoint so i can debug with exec
asmacdo Nov 8, 2024
59eb045
bound /home actually meant path was /home/home/asmacdo
asmacdo Nov 8, 2024
db140d5
Create output dir prior to writing report
asmacdo Nov 8, 2024
f90176a
pod back to job
asmacdo Nov 11, 2024
c31ccdd
Fixup use the correct job api
asmacdo Nov 11, 2024
3ee9d9f
Add namespace to pod retrieval
asmacdo Nov 11, 2024
d7f81ba
write directly to pv to test job
asmacdo Nov 11, 2024
0856baa
fixup script fstring
asmacdo Nov 11, 2024
5301b1b
no retry on failure, we were spinning up 5 pods, lets just fail 1 time
asmacdo Nov 11, 2024
7384274
Fixup backup limit job not template
asmacdo Nov 11, 2024
8e81e38
Initial report
asmacdo Nov 11, 2024
cb5db49
disable report
asmacdo Nov 11, 2024
5d188a7
deploy ec2 instance directly
asmacdo Dec 2, 2024
2f39e9c
Update AMI image
asmacdo Dec 2, 2024
3a21106
update sg and subnet
asmacdo Dec 2, 2024
6a54da0
terminate even if job fails
asmacdo Dec 2, 2024
87075fb
debug: print public ip
asmacdo Dec 2, 2024
48c7f35
explicitly allocate public ip for ec2 instance
asmacdo Dec 2, 2024
743359e
Add WIP scripts
asmacdo Dec 6, 2024
0ba12f2
rm old unused
asmacdo Dec 6, 2024
2893ab2
initial commit of scripts
asmacdo Dec 6, 2024
5ef8f80
clean up launch script
asmacdo Dec 6, 2024
b02720e
make scripe executable
asmacdo Dec 6, 2024
ae98909
fixup cleanup script
asmacdo Dec 6, 2024
7e80e4a
add a name to elastic ip (for easier manual cleanup)
asmacdo Dec 6, 2024
f2a4116
Exit on fail
asmacdo Dec 6, 2024
6ffef17
Add permission for aws ec2 wait instance-status-ok
asmacdo Dec 6, 2024
20cc085
Upload scripts to instance
asmacdo Dec 6, 2024
76477df
explicitly return
asmacdo Dec 6, 2024
b38ded1
output session variables to file
asmacdo Dec 11, 2024
f795570
modify cleanup script to retrieve instance from temporary file
asmacdo Dec 11, 2024
f8a92b2
All ec2 persmissions granted
asmacdo Dec 11, 2024
e9726df
Add EFS mount (hardcoded)
asmacdo Dec 11, 2024
c6e92f9
No pager for termination
asmacdo Dec 11, 2024
17d77cd
force pseudo-terminal, otherwise hangs after yum install
asmacdo Dec 11, 2024
2246af5
Add doublequotes to variable usage for proper expansion
asmacdo Dec 11, 2024
b49b7b5
Fixup -t goes on ssh, not scp
asmacdo Dec 11, 2024
584ac4d
Mount as a single command, since we dont have access to pty
asmacdo Dec 11, 2024
4a700e5
add todos for manual steps
asmacdo Dec 11, 2024
6339924
Disable job for now
asmacdo Dec 11, 2024
17130ef
Update AMI to ubuntu
asmacdo Dec 12, 2024
cc29df5
Roll back to AL 2023
asmacdo Dec 12, 2024
295361c
drop gzip, just write json
asmacdo Dec 13, 2024
a667c04
include target dir in relative paths
asmacdo Dec 13, 2024
a91beb0
Second script will not produce user report, but directory stats json
asmacdo Dec 13, 2024
9371982
inital algorithm hackout
asmacdo Dec 13, 2024
8cead5a
Clean up and refactor for simplicity
asmacdo Dec 13, 2024
86a7c72
Add basic tests
asmacdo Dec 13, 2024
fc1cab1
test multiple directories in root
asmacdo Dec 13, 2024
2308aed
comment about [:-1]
asmacdo Dec 13, 2024
84754fe
support abspaths
asmacdo Dec 14, 2024
a1427ac
[DATALAD RUNCMD] blacken
asmacdo Dec 14, 2024
16e4890
test propagation with files in all dirs
asmacdo Dec 14, 2024
528833d
Write files to disk as they are inspected
asmacdo Dec 15, 2024
3c0e7f7
Comment out column headers in output
asmacdo Dec 15, 2024
260c69d
Write all fields for every file
asmacdo Dec 15, 2024
87dd8ca
Convert to reading tsv
asmacdo Dec 15, 2024
e0e0a32
Fixup: update test to match tsv-read data
asmacdo Dec 15, 2024
41aaa2a
update for renamed script
asmacdo Dec 15, 2024
25e27eb
install pip
asmacdo Dec 15, 2024
204b70e
install parallel
asmacdo Dec 15, 2024
64d653e
install dependencies in launch script
asmacdo Dec 15, 2024
6475f11
Output to tmp, accept only 1 arg, target dir
asmacdo Dec 15, 2024
b67c063
add up sizes
asmacdo Dec 16, 2024
3241473
print useful info as index is created
asmacdo Dec 16, 2024
f4eb101
dont fail if output dir exists
asmacdo Dec 16, 2024
13e0e75
Create a report dict with only relevant stats
asmacdo Dec 16, 2024
a7e6991
output data reports
asmacdo Dec 20, 2024
845df00
Remove unused
asmacdo Jan 17, 2025
f283ad4
rm redundant
asmacdo Jan 23, 2025
2311c92
WIP: run against all users, construct a totals tsv
asmacdo Jan 23, 2025
b65df42
use constants better
asmacdo Jan 23, 2025
474bc18
keep track of errors separately so we can depend on file data to be c…
asmacdo Jan 23, 2025
fd44a50
stats[username] is total, lets keep that and not worry about others f…
asmacdo Jan 27, 2025
a7bafd1
fixup tests
asmacdo Jan 27, 2025
b712d8c
Drop (comment) full directory counting
asmacdo Jan 27, 2025
8ef8918
Add success message
asmacdo Jan 28, 2025
5c3c858
add report generation readme
asmacdo Jan 28, 2025
147d947
remove development artifacts
asmacdo Jan 28, 2025
5f5a1ac
Update .github/scripts/launch-ec2.sh
asmacdo Jan 28, 2025
31969a2
rm todo
asmacdo Jan 28, 2025
a42e1c4
Add report generation logic back in
asmacdo Jan 28, 2025
8d4f014
enh: WIP count nwb_files, bids_datasets, zarr_files
asmacdo Jan 29, 2025
2fd7afe
fixup: rename generate statistics
asmacdo Jan 29, 2025
1399bc4
fixup: tests pass
asmacdo Jan 29, 2025
25ebaee
fixup
asmacdo Jan 29, 2025
6197b81
propegate bids_datasets
asmacdo Jan 29, 2025
ca63cdf
test bids_datasets propegate
asmacdo Jan 29, 2025
aae6d1b
WIP: wishful refactor
asmacdo Jan 29, 2025
9d693f6
Refactor: DirectoryStats class
asmacdo Jan 29, 2025
27981b7
add user cache counter
asmacdo Jan 29, 2025
0ed52d3
fixup tests for OO refactor
asmacdo Jan 29, 2025
1ce1ca9
Add usercache count test
asmacdo Jan 29, 2025
f07eea0
Fixup clean and comment
asmacdo Jan 29, 2025
b076840
Write to json + minor refactor
asmacdo Jan 29, 2025
4d143bb
Add zarr and nwb counters
asmacdo Jan 29, 2025
e34c14f
blacken
asmacdo Jan 29, 2025
2391fd5
bf: handle nested username/.cache
asmacdo Jan 29, 2025
c1db06b
Only write summary
asmacdo Jan 29, 2025
f0efbdb
Use increment consistently
asmacdo Jan 29, 2025
a5d909b
bf: accept abspaths, but output relative paths
asmacdo Jan 29, 2025
cf1ef5c
Set paths for real usage
asmacdo Jan 29, 2025
1e3e6f5
bf: handle case where there are no files
asmacdo Jan 29, 2025
7664cac
fixup: set paths for real usage 2
asmacdo Jan 29, 2025
b4cc442
Print user starting for usage sanity
asmacdo Jan 29, 2025
64 changes: 1 addition & 63 deletions .aws/terraform-jupyterhub-provisioning-policies.json
@@ -4,69 +4,7 @@
{
"Effect": "Allow",
"Action": [
"ec2:AllocateAddress",
"ec2:AssociateAddress",
"ec2:AssociateRouteTable",
"ec2:AssociateVpcCidrBlock",
"ec2:AttachInternetGateway",
"ec2:AttachNetworkInterface",
"ec2:AuthorizeSecurityGroupEgress",
"ec2:AuthorizeSecurityGroupIngress",
"ec2:CreateInternetGateway",
"ec2:CreateLaunchTemplate",
"ec2:CreateLaunchTemplateVersion",
"ec2:CreateNatGateway",
"ec2:CreateNetworkAcl",
"ec2:CreateNetworkAclEntry",
"ec2:CreateNetworkInterface",
"ec2:CreateNetworkInterfacePermission",
"ec2:CreateRoute",
"ec2:CreateRouteTable",
"ec2:CreateSecurityGroup",
"ec2:CreateSubnet",
"ec2:CreateTags",
"ec2:CreateVpc",
"ec2:DeleteInternetGateway",
"ec2:DeleteLaunchTemplate",
"ec2:DeleteLaunchTemplateVersions",
"ec2:DeleteNatGateway",
"ec2:DeleteNetworkAcl",
"ec2:DeleteNetworkAclEntry",
"ec2:DeleteNetworkInterface",
"ec2:DeleteRoute",
"ec2:DeleteRouteTable",
"ec2:DeleteSecurityGroup",
"ec2:DeleteSubnet",
"ec2:DeleteTags",
"ec2:DeleteVpc",
"ec2:DescribeAddresses",
"ec2:DescribeAddressesAttribute",
"ec2:DescribeAvailabilityZones",
"ec2:DescribeInternetGateways",
"ec2:DescribeLaunchTemplateVersions",
"ec2:DescribeLaunchTemplates",
"ec2:DescribeNatGateways",
"ec2:DescribeNetworkAcls",
"ec2:DescribeNetworkInterfacePermissions",
"ec2:DescribeNetworkInterfaces",
"ec2:DescribeRouteTables",
"ec2:DescribeSecurityGroupRules",
"ec2:DescribeSecurityGroups",
"ec2:DescribeSubnets",
"ec2:DescribeVpcAttribute",
"ec2:DescribeVpcs",
"ec2:DetachInternetGateway",
"ec2:DetachNetworkInterface",
"ec2:DisassociateAddress",
"ec2:DisassociateRouteTable",
"ec2:DisassociateVpcCidrBlock",
"ec2:ModifyNetworkInterfaceAttribute",
"ec2:ModifyVpcAttribute",
"ec2:ReleaseAddress",
"ec2:ReplaceRoute",
"ec2:RevokeSecurityGroupEgress",
"ec2:RevokeSecurityGroupIngress",
"ec2:RunInstances",
"ec2:*",
"ecr-public:GetAuthorizationToken",
"eks:*",
"elasticfilesystem:CreateFileSystem",
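With the individually enumerated EC2 actions above collapsed into a wildcard, the statement in this policy file presumably reduces to something like the following. This is a reconstruction from the diff excerpt, not the full file; the action list continues beyond what the excerpt shows:

```json
{
  "Effect": "Allow",
  "Action": [
    "ec2:*",
    "ecr-public:GetAuthorizationToken",
    "eks:*",
    "elasticfilesystem:CreateFileSystem"
  ]
}
```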
275 changes: 275 additions & 0 deletions .github/scripts/calculate-directory-stats.py
@@ -0,0 +1,275 @@
#!/usr/bin/env python3

import glob
import os
import csv
import json
import sys
import unittest
from collections import Counter, defaultdict
from pathlib import Path
from pprint import pprint
from typing import Iterable, Tuple

TOTALS_OUTPUT_FILE = "all_users_total.json"
OUTPUT_DIR = "/home/ec2-user/hub-user-reports/"
INPUT_DIR = "/home/ec2-user/hub-user-indexes"


csv.field_size_limit(sys.maxsize)


class DirectoryStats(defaultdict):
COUNTED_FIELDS = [
"total_size",
"file_count",
"nwb_files",
"nwb_size",
"bids_datasets",
"zarr_files",
"zarr_size",
"user_cache_file_count",
"user_cache_size",
]
    root: str

def __init__(self, root):
super().__init__(lambda: Counter({key: 0 for key in self.COUNTED_FIELDS}))
self.root = root

def increment(self, path: str, field: str, amount: int = 1):
if field not in self.COUNTED_FIELDS:
raise KeyError(
f"Invalid field '{field}'. Allowed fields: {self.COUNTED_FIELDS}"
)
self[path][field] += amount

def propagate_dir(self, current_parent: str, previous_parent: str):
"""Propagate counts up the directory tree."""
assert os.path.isabs(current_parent) == os.path.isabs(
previous_parent
), "Both must be absolute or both relative"

highest_common = os.path.commonpath([current_parent, previous_parent])
assert highest_common, "highest_common must either be a target directory or /"

path_to_propagate = os.path.relpath(previous_parent, highest_common)
nested_dir_list = path_to_propagate.split(os.sep)[:-1] # Exclude last directory

while nested_dir_list:
working_dir = os.path.join(highest_common, *nested_dir_list)
for field in self.COUNTED_FIELDS:
self[working_dir][field] += self[previous_parent][field]
nested_dir_list.pop()
previous_parent = working_dir

# Final propagation to the common root
for field in self.COUNTED_FIELDS:
self[highest_common][field] += self[previous_parent][field]

def inc_if_bids(self, parent: str, path: str):
"""Check if a file indicates a BIDS dataset and increment the count."""
if path.endswith("dataset_description.json"):
self.increment(parent, "bids_datasets")

def inc_if_usercache(self, parent: str, filepath: str, size: int):
if filepath.startswith(f"{self.root}/.cache"):
self.increment(parent, "user_cache_file_count")
self.increment(parent, "user_cache_size", size)

def inc_if_nwb(self, parent: str, path: str, size: int):
if path.lower().endswith(".nwb"):
self.increment(parent, "nwb_files")
self.increment(parent, "nwb_size", size)

def inc_if_zarr(self, parent: str, path: str, size: int):
if path.lower().endswith(".zarr"):
self.increment(parent, "zarr_files")
self.increment(parent, "zarr_size", size)

@classmethod
def from_index(cls, username, user_tsv_file):
"""Separated from from_data for easier testing"""
data = cls._iter_file_metadata(user_tsv_file)
return cls.from_data(username, data)

@classmethod
def from_data(cls, root, data: Iterable[Tuple[str, str, str, str]]):
"""
Build DirectoryStats from an iterable of (filepath, size, modified, created).
Assumes depth-first listing.
"""
instance = cls(root=root)
previous_parent = ""

for filepath, size, _, _ in data:
parent = os.path.dirname(filepath)

instance.increment(parent, "file_count")
instance.increment(parent, "total_size", int(size))
instance.inc_if_bids(parent, filepath)
instance.inc_if_nwb(parent, filepath, int(size))
instance.inc_if_zarr(parent, filepath, int(size))
instance.inc_if_usercache(parent, filepath, int(size))

if previous_parent == parent:
continue
# Going deeper
elif not previous_parent or os.path.dirname(parent) == previous_parent:
previous_parent = parent
continue
else: # Done with this directory
instance.propagate_dir(parent, previous_parent)
previous_parent = parent

# Final propagation to ensure root directory gets counts
if previous_parent: # No previous_parent means no data
leading_dir = previous_parent.split(os.sep)[0] or "/"
instance.propagate_dir(leading_dir, previous_parent)

return instance

    @staticmethod
    def _iter_file_metadata(file_path):
        """
        Read a TSV and yield one row of file metadata at a time,
        skipping blank lines and comment lines.
        """
file_path = Path(file_path)
with file_path.open(mode="r", newline="", encoding="utf-8") as file:
reader = csv.reader(file, delimiter="\t")
for row in reader:
# Skip empty lines or lines starting with '#'
if not row or row[0].startswith("#"):
continue
yield row

@property
def summary(self):
return self[self.root]

def __repr__(self):
"""Cleaner representation for debugging."""
return "\n".join([f"{path}: {dict(counts)}" for path, counts in self.items()])


def main():
os.makedirs(OUTPUT_DIR, exist_ok=True)
pattern = f"{INPUT_DIR}/*-index.tsv"
outfile_path = Path(OUTPUT_DIR, TOTALS_OUTPUT_FILE)
output_stats = {}
for user_index_path in glob.iglob(pattern):
filename = os.path.basename(user_index_path)
username = filename.removesuffix("-index.tsv")
print(f"Starting {username}")
full_stats = DirectoryStats.from_index(username, user_index_path)
output_stats[username] = full_stats.summary

with outfile_path.open(mode="w", encoding="utf-8") as totals_file:
json.dump(output_stats, totals_file, indent=2)

print(f"Success: report written to {outfile_path}")


class TestDirectoryStatistics(unittest.TestCase):
def test_propagate_dir(self):
stats = DirectoryStats(root="a")
stats["a/b/c"].update({"total_size": 100, "file_count": 3})
stats["a/b"].update({"total_size": 10, "file_count": 0})
stats["a"].update({"total_size": 1, "file_count": 0})

stats.propagate_dir("a", "a/b/c")
self.assertEqual(stats["a"]["file_count"], 3)
self.assertEqual(stats["a/b"]["file_count"], 3)
self.assertEqual(stats["a"]["total_size"], 111)

def test_propagate_dir_abs_path(self):
stats = DirectoryStats(root="/a")
stats["/a/b/c"].update({"file_count": 3})

stats.propagate_dir("/a", "/a/b/c")
self.assertEqual(stats["/a"]["file_count"], 3)
self.assertEqual(stats["/a/b"]["file_count"], 3)

def test_propagate_dir_files_in_all(self):
stats = DirectoryStats(root="a")
stats["a/b/c"].update({"file_count": 3})
stats["a/b"].update({"file_count": 2})
stats["a"].update({"file_count": 1})

stats.propagate_dir("a", "a/b/c")
self.assertEqual(stats["a"]["file_count"], 6)
self.assertEqual(stats["a/b"]["file_count"], 5)

def test_from_data_empty(self):
sample_data = []
stats = DirectoryStats.from_data("a", sample_data)
self.assertEqual(stats["a"]["file_count"], 0)

def test_generate_statistics_inc_bids_0(self):
sample_data = [("a/b/file3.txt", 3456, "2024-12-01", "2024-12-02")]
stats = DirectoryStats.from_data("a", sample_data)
self.assertEqual(stats["a/b"]["bids_datasets"], 0)
self.assertEqual(stats["a"]["bids_datasets"], 0)

def test_generate_statistics_inc_bids_subdatasets(self):
sample_data = [
("a/b/c/subdir_of_bids", 3456, "2024-12-01", "2024-12-02"),
("a/b/dataset_description.json", 3456, "2024-12-01", "2024-12-02"),
("a/d/dataset_description.json", 3456, "2024-12-01", "2024-12-02"),
(
"a/d/subdataset/dataset_description.json",
3456,
"2024-12-01",
"2024-12-02",
),
]
stats = DirectoryStats.from_data("a", sample_data)
self.assertEqual(stats["a/b/c"]["bids_datasets"], 0)
self.assertEqual(stats["a/b"]["bids_datasets"], 1)
self.assertEqual(stats["a/d/subdataset"]["bids_datasets"], 1)
self.assertEqual(stats["a/d"]["bids_datasets"], 2)
self.assertEqual(stats["a"]["bids_datasets"], 3)

def test_generate_statistics_inc_usercache(self):
sample_data = [
("a/.cache/x", 3456, "2024-12-01", "2024-12-02"),
("a/.cache/y", 3456, "2024-12-01", "2024-12-02"),
("a/.cache/nested/y", 3456, "2024-12-01", "2024-12-02"),
("a/b/notcache", 3456, "2024-12-01", "2024-12-02"),
]
stats = DirectoryStats.from_data("a", sample_data)
self.assertEqual(stats["a"]["user_cache_file_count"], 3)
self.assertEqual(stats["a"]["user_cache_size"], 3456 * 3)
self.assertEqual(stats["a/.cache"]["user_cache_file_count"], 3)
self.assertEqual(stats["a/.cache"]["user_cache_size"], 3456 * 3)
self.assertEqual(stats["a/b"]["user_cache_file_count"], 0)
self.assertEqual(stats["a/b"]["user_cache_size"], 0)

def test_generate_statistics(self):
sample_data = [
("a/b/file3.txt", 3456, "2024-12-01", "2024-12-02"),
("a/b/c/file1.txt", 1234, "2024-12-01", "2024-12-02"),
("a/b/c/file2.txt", 2345, "2024-12-01", "2024-12-02"),
("a/b/c/d/file4.txt", 4567, "2024-12-01", "2024-12-02"),
("a/e/file3.txt", 5678, "2024-12-01", "2024-12-02"),
("a/e/f/file1.txt", 6789, "2024-12-01", "2024-12-02"),
("a/e/f/file2.txt", 7890, "2024-12-01", "2024-12-02"),
("a/e/f/g/file4.txt", 8901, "2024-12-01", "2024-12-02"),
]
stats = DirectoryStats.from_data("a", sample_data)
self.assertEqual(stats["a/b/c/d"]["file_count"], 1)
self.assertEqual(stats["a/b/c"]["file_count"], 3)
self.assertEqual(stats["a/b"]["file_count"], 4)
self.assertEqual(stats["a/e/f/g"]["file_count"], 1)
self.assertEqual(stats["a/e/f"]["file_count"], 3)
self.assertEqual(stats["a/e"]["file_count"], 4)
self.assertEqual(stats["a"]["file_count"], 8)


if __name__ == "__main__":
if len(sys.argv) > 1 and sys.argv[1] == "test":
unittest.main(
argv=sys.argv[:1]
) # Run tests if "test" is provided as an argument
else:
main()
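The depth-first rollup that `propagate_dir` performs can be illustrated with a much simpler, hypothetical sketch that attributes each file directly to every ancestor directory. This is less efficient than the script's single-pass propagation, but it makes the intended per-directory totals easy to see:

```python
# Hypothetical, simplified illustration of the aggregation the script computes:
# each file's size and count are credited to every ancestor directory.
import os
from collections import Counter, defaultdict

# Sample (filepath, size) rows, in the depth-first order the index would use.
rows = [
    ("a/b/c/file1.txt", 100),
    ("a/b/c/file2.txt", 200),
    ("a/b/file3.txt", 50),
]

stats = defaultdict(Counter)
for path, size in rows:
    d = os.path.dirname(path)
    # Walk up the tree, crediting each ancestor until the path is exhausted.
    while d:
        stats[d]["file_count"] += 1
        stats[d]["total_size"] += size
        d = os.path.dirname(d)

print(stats["a"]["file_count"])    # 3
print(stats["a/b"]["total_size"])  # 350
```

The script's `propagate_dir` reaches the same totals in one pass by exploiting the depth-first ordering of the index: a directory's counters are rolled up to its ancestors only once the listing has moved out of that subtree.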