Conversation

yuting1214 commented Feb 25, 2025

Hi @michen00,

I'm submitting this PR in response to your recent LinkedIn post about the entry-level data scientist opening at Aicadium.
I'm thrilled by the chance and, as per your instructions, decided to contribute to one of your repositories.

What This PR Does:

This PR introduces a new utility script, large-files, which helps identify large files in a Git repository. By scanning the full repository history, this script makes it easier to locate oversized files that may impact performance, slow down cloning, or unnecessarily bloat the repository.

Here is my LinkedIn profile; I look forward to discussing this further with you.
https://www.linkedin.com/in/yu-ting-chen/


Note

Adds a large-files Bash utility to display the top N largest files ever committed in a Git repo, with validation and human-readable output.

  • CLI Utility: Add large-files Bash script
    • Scans full Git history to list top N largest objects/files (default 20).
    • Validates single numeric argument; shows usage/help via usage().
    • Ensures execution inside a Git repository before running.
    • Pipeline: git rev-list --objects --all -> git cat-file --batch-check -> sort -rh -> head -n N -> awk -> numfmt for human-readable sizes.
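
For orientation, the pipeline summarized above could look roughly like the following. This is a minimal sketch and not the script from this PR: the function names `list_blobs` and `top_n` are hypothetical, filtering to blob objects is an addition made for this sketch, and `numfmt` assumes GNU coreutils is available.

```shell
#!/usr/bin/env bash

# Hypothetical helper: emit "<size-on-disk> <path>" for every blob reachable
# from any ref. Must be run inside a Git repository.
list_blobs() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize:disk) %(rest)' |
        awk '$1 == "blob" { sub(/^blob /, ""); print }'
}

# Hypothetical helper: keep the N largest entries (default 20) read from
# stdin and convert the size column to a human-readable form.
top_n() {
    local n="${1:-20}"
    sort -rn | head -n "$n" | numfmt --to=iec --field=1
}

# Intended use inside a repository:
#   list_blobs | top_n 20
```

Splitting listing from ranking keeps the ranking stage testable on plain text, without needing a repository.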

Written by Cursor Bugbot for commit 446cb54.

Owner

michen00 commented Mar 1, 2025

This looks like a useful tool! Could you please update the README too? Thanks for your contribution 👍

Comment on lines +16 to +17
This script scans the entire Git history and identifies the largest files that
have ever existed in the repository. It uses 'git rev-list' to extract object
Owner

Considering all files that have ever existed significantly diminishes the utility of this tool. Consider the following output of refs to the same file:

[Image: output listing multiple refs to the same file]

(Side note: If we are going to claim 'to have ever existed', we should probably do an explicit fetch at the beginning.)

I can think of a couple of ways to make this script more useful:

  • explicitly limit scope: e.g., only consider files that exist at the HEAD of the current branch
  • refine the output: e.g., add distinguishing/informative column(s) to the output
  • parametrize the command: e.g., add more options to filter or sort output by different fields

@yuting1214, what do you think? Do any of the above (alone or in conjunction) sound more or less appealing to you?
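
As one illustration of the first option (limiting scope to files that exist at HEAD), something along these lines might work. It is only a sketch under assumptions: `rank_tree_entries` and `largest_at_head` are hypothetical names, not part of this PR, and the parsing relies on the `<mode> <type> <object> <size><TAB><path>` layout that `git ls-tree -r -l` prints.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read `git ls-tree -r -l` output on stdin and print the
# N largest blobs (default 20) as "<size-in-bytes><TAB><path>".
rank_tree_entries() {
    local n="${1:-20}"
    # Everything before the tab is metadata; everything after it is the path,
    # so paths containing spaces stay intact. Non-blob entries (for example
    # submodules, whose size is "-") are skipped.
    awk -F'\t' '{
        split($1, meta, " ")
        if (meta[2] == "blob") print meta[4] "\t" $2
    }' | sort -rn | head -n "$n"
}

# Hypothetical wrapper: only consider files present at HEAD of the current
# branch. Must be run inside a Git repository.
largest_at_head() {
    git ls-tree -r -l HEAD | rank_tree_entries "${1:-20}"
}
```

Because each path appears once in the tree at HEAD, this avoids repeated rows for the same file, at the cost of no longer seeing files that were deleted in earlier commits.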

Author

Hi Michael,
I'm more than happy to contribute more to this repo; I'm just a little occupied at the moment.
Once I've settled a few things, I'll come back and write more code with you. Stay tuned.

cursor bot left a comment

This PR is being reviewed by Cursor Bugbot

large-files Outdated
large-files # Show top 20 largest files
large-files 50 # Show top 50 largest files
EOF
exit 0

Bug: Error conditions exit with success code zero

The usage() function always exits with code 0, but it is also called in error scenarios: when too many arguments are provided (lines 35-36) and when the argument isn't a valid integer (lines 40-41). As a result, error conditions report success to any script or automation that checks the exit code: the "Error:" message is printed, but usage() then exits with 0 instead of a non-zero code.
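
One possible way to address this is to let usage() take the exit status as an optional argument. The following is a sketch rather than the actual fix: the help text is abbreviated from the excerpt above, and sending the text to stderr on errors is a choice made for this sketch, not something the original script does.

```shell
#!/usr/bin/env bash

# Print help and exit with the given status (default 0).
# On a non-zero status the text goes to stderr instead of stdout.
usage() {
    local status="${1:-0}"
    local fd=1
    if [ "$status" -ne 0 ]; then
        fd=2
    fi
    cat >&"$fd" <<EOF
Usage: large-files [N]

  large-files       # Show top 20 largest files
  large-files 50    # Show top 50 largest files
EOF
    exit "$status"
}

# Intended call sites (illustrative only):
#   usage      -> help was requested, exit 0
#   usage 1    -> called after printing an "Error:" message, exit 1
```

With this shape, the help path keeps exiting 0 while both error paths can call `usage 1`.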

large-files Outdated
git cat-file --batch-check='%(objectsize:disk) %(rest)' |
sort -rh |
head -n "$NUM_FILES" |
awk '{ printf "%10s %s\n", $1, $2 }' |

Bug: Filenames with spaces are truncated in output

The awk command uses $2 to print the filename, but awk splits fields on whitespace by default. For files with spaces in their paths (e.g., my document.txt), only the portion before the first space is printed, and the rest of the filename is silently dropped. The command needs to print all fields from $2 onward to preserve the complete file path.
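
A minimal sketch of one way to keep the full path: save the size, strip the first field from the record, and print whatever remains. The function name `format_entries` is hypothetical, and the input is assumed to be `<size> <path>` lines as produced by the `--batch-check` format above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: right-align the size in a 10-character column and
# print the rest of the line unchanged, so spaces in paths are preserved.
format_entries() {
    awk '{
        size = $1
        # Remove the size field and the whitespace after it; what is left
        # in $0 is the complete path.
        sub(/^[ \t]*[^ \t]+[ \t]+/, "")
        printf "%10s %s\n", size, $0
    }'
}
```

For example, `printf '2048 my document.txt\n' | format_entries` keeps `my document.txt` whole instead of cutting it at the first space.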
