feat: add large-file-script. #10
This looks like a useful tool! Could you please update the README too? Thanks for your contribution 👍
This script scans the entire Git history and identifies the largest files that
have ever existed in the repository. It uses 'git rev-list' to extract object
Considering all files that have ever existed significantly diminishes the utility of this tool. Consider the following output of refs to the same file:
(Side note: If we are going to claim 'to have ever existed', we should probably do an explicit fetch at the beginning.)
I can think of a couple of ways to make this script more useful:
- explicitly limit scope: e.g., only consider files that exist at the HEAD of the current branch
- refine the output: e.g., add distinguishing/informative column(s) to the output
- parametrize the command: e.g., add more options to filter or sort output by different fields
@yuting1214, what do you think? Do any of the above (alone or in conjunction) sound more or less appealing to you?
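As a rough illustration of the first option (limiting scope to HEAD), the sketch below lists only files present in the current commit. It is not the PR's implementation: the function name largest_at_head is hypothetical, and it reports the blob size from git ls-tree -l rather than the on-disk size used by the script under review.

```shell
# Sketch only: largest files that exist at HEAD (hypothetical helper name).
# `git ls-tree -r -l HEAD` prints "<mode> <type> <object> <size><TAB><path>".
largest_at_head() {
    n="${1:-20}"   # number of rows to show; 20 mirrors the script's default
    git ls-tree -r -l HEAD |
        awk -F '\t' '{
            # Split the metadata part on whitespace; field 4 is the blob size.
            split($1, meta, " ")
            # Skip non-blob entries such as submodules, whose size is "-".
            if (meta[2] == "blob") printf "%s\t%s\n", meta[4], $2
        }' |
        sort -rn |
        head -n "$n"
}
```

Calling largest_at_head 5 inside a repository would print the five largest files in the current commit as size, a tab, then the path, so each path appears at most once.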
Hi Michael,
I'm more than happy to contribute more to this repo; I'm just a little occupied at the moment.
Once I've settled a few things, I'll come back and write more code with you. Stay tuned.
for more information, see https://pre-commit.ci
This PR is being reviewed by Cursor Bugbot
large-files
Outdated
large-files # Show top 20 largest files
large-files 50 # Show top 50 largest files
EOF
exit 0
Bug: Error conditions exit with success code zero
The usage() function always exits with code 0, but it's called in error scenarios: when too many arguments are provided (lines 35-36) and when the argument isn't a valid integer (lines 40-41). Error conditions incorrectly report success to any scripts or automation checking the exit code. The "Error:" message is printed but then usage() exits with 0 instead of a non-zero error code.
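One possible fix, sketched below, is to let usage() take the exit status as an optional argument so that help requests still exit 0 while error paths exit non-zero. The parse_args helper, the "Usage:" line, and the exact error messages are assumptions made for illustration; only the two example lines and the NUM_FILES variable come from the diff.

```shell
# Sketch only: usage() accepts an optional exit status (defaults to 0).
usage() {
    cat <<EOF
Usage: large-files [N]

Examples:
  large-files       # Show top 20 largest files
  large-files 50    # Show top 50 largest files
EOF
    exit "${1:-0}"
}

# Hypothetical argument check covering the two error paths from the review.
parse_args() {
    if [ "$#" -gt 1 ]; then
        echo "Error: too many arguments" >&2
        usage 1
    fi
    if [ "$#" -eq 1 ]; then
        case "$1" in
            # Reject empty strings and anything containing a non-digit.
            '' | *[!0-9]*)
                echo "Error: N must be a non-negative integer" >&2
                usage 1
                ;;
        esac
    fi
    NUM_FILES="${1:-20}"
}
```

With this shape, a caller such as a CI step can rely on the exit status: running the script with two arguments or with a non-numeric argument would return 1, while a plain help request would still return 0.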
large-files
Outdated
git cat-file --batch-check='%(objectsize:disk) %(rest)' |
sort -rh |
head -n "$NUM_FILES" |
awk '{ printf "%10s %s\n", $1, $2 }' |
Bug: Filenames with spaces are truncated in output
The awk command uses $2 to print the filename, but awk splits fields on whitespace by default. For files with spaces in their paths (e.g., my document.txt), only the portion before the first space is printed, and the rest of the filename is silently dropped. The command needs to print all fields from $2 onward to preserve the complete file path.
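A minimal sketch of one way to keep the whole path: save the size, strip the first field and the spaces after it from the record, and print what remains. The function name format_sizes is hypothetical, and the sketch assumes each input line has the form "<size> <path>" as produced by the batch-check format above.

```shell
# Sketch only: right-align the size and print the full path, spaces included.
format_sizes() {
    awk '{
        size = $1
        # Drop the leading size field and the spaces that follow it;
        # the rest of the record is the complete path.
        sub(/^[^ ]+ +/, "")
        printf "%10s %s\n", size, $0
    }'
}
```

Feeding it a line such as "4096 docs/my document.txt" keeps "docs/my document.txt" intact, whereas printing only $2 would stop at "docs/my".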
Hi @michen00,
I'm submitting this PR in response to your recent LinkedIn post about the opportunity for an entry-level data scientist at Aicadium.
I'm thrilled by the chance and, as per your instructions, decided to contribute to one of your repositories.
What This PR Does:
This PR introduces a new utility script, large-files, which helps identify large files in a Git repository. By scanning the full repository history, this script makes it easier to locate oversized files that may impact performance, slow down cloning, or unnecessarily bloat the repository.
Here is my LinkedIn profile; I look forward to discussing this further with you.
https://www.linkedin.com/in/yu-ting-chen/
Note
Adds a large-files Bash utility to display the top N largest files ever committed in a Git repo, with validation and human-readable output.
- large-files Bash script
- N largest objects/files (default 20).
- usage().
- git rev-list --objects --all -> git cat-file --batch-check -> sort -rh -> head -n N -> awk -> numfmt for human-readable sizes.
Written by Cursor Bugbot for commit 446cb54.