feat: add large-file-script. #10

yuting1214 · 2025-02-25T16:04:08Z

I'm submitting this PR in response to your recent LinkedIn post about an entry-level data scientist opportunity at Aicadium.
I'm thrilled by the chance and, as per your instructions, decided to contribute to one of your repositories.

What This PR Does:

This PR introduces a new utility script, large-files, which helps identify large files in a Git repository. By scanning the full repository history, this script makes it easier to locate oversized files that may impact performance, slow down cloning, or unnecessarily bloat the repository.

Here is my LinkedIn profile; I look forward to discussing this further with you.
https://www.linkedin.com/in/yu-ting-chen/

Note

Adds a large-files Bash utility to display the top N largest files ever committed in a Git repo, with validation and human-readable output.

CLI Utility: Add large-files Bash script
- Scans full Git history to list top N largest objects/files (default 20).
- Validates single numeric argument; shows usage/help via usage().
- Ensures execution inside a Git repository before running.
- Pipeline: git rev-list --objects --all -> git cat-file --batch-check -> sort -rh -> head -n N -> awk -> numfmt for human-readable sizes.

Written by Cursor Bugbot for commit 446cb54.
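The pipeline summarized in the note can be sketched roughly as follows. This is a minimal illustration and not the script from the PR: the function names top_n_by_size and largest_objects are hypothetical, the awk column-formatting stage is left out for brevity, and numfmt is assumed to be the GNU coreutils version.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not taken from the PR.

# Read "<size-in-bytes> <path>" lines on stdin and keep the N largest.
top_n_by_size() {
    local n="${1:-20}"
    sort -rh | head -n "$n"
}

# List the N largest objects reachable from any ref, sizes in IEC units.
# Must be run inside a Git repository; assumes GNU numfmt is installed.
largest_objects() {
    local n="${1:-20}"
    git rev-list --objects --all |
        git cat-file --batch-check='%(objectsize:disk) %(rest)' |
        top_n_by_size "$n" |
        numfmt --field=1 --to=iec
}
```

Calling largest_objects 5 inside a repository would print the five largest entries. Keeping the sort/head stage in its own function makes it possible to check the ranking without a repository.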

michen00 · 2025-03-01T09:23:55Z

This looks like a useful tool! Could you please update the README too? Thanks for your contribution 👍

michen00 · 2025-03-06T07:38:46Z

large-files

+  This script scans the entire Git history and identifies the largest files that
+  have ever existed in the repository. It uses 'git rev-list' to extract object


Considering all files that have ever existed significantly diminishes the utility of this tool. Consider the following output of refs to the same file:

(Side note: If we are going to claim 'to have ever existed', we should probably do an explicit fetch at the beginning.)

I can think of a couple of ways to make this script more useful:

- explicitly limit scope: e.g., only consider files that exist at the HEAD of the current branch
- refine the output: e.g., add distinguishing/informative column(s) to the output
- parametrize the command: e.g., add more options to filter or sort output by different fields

@yuting1214, what do you think? Do any of the above (alone or in conjunction) sound more or less appealing to you?
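To make the first option concrete, here is one possible sketch of limiting the scope to files present at HEAD. It relies on git ls-tree -r -l, which prints mode, type, object id, and size, followed by a tab and the path. The function names are hypothetical, and the sizes here are uncompressed blob sizes rather than the on-disk sizes the current script reports.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not taken from the PR.

# Convert `git ls-tree -r -l` lines to "<size><TAB><path>", blobs only.
# Submodule entries (type "commit") report "-" as size and are skipped.
ls_tree_to_size_path() {
    awk -F'\t' '{
        split($1, meta, " ")    # meta: mode, type, object id, size
        if (meta[2] == "blob") print meta[4] "\t" $2
    }'
}

# List the N largest files that exist at HEAD of the current branch.
# Must be run inside a Git repository with at least one commit.
largest_at_head() {
    local n="${1:-20}"
    git ls-tree -r -l HEAD | ls_tree_to_size_path | sort -rn | head -n "$n"
}
```

Because the path is everything after the tab, paths containing spaces come through intact.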

yuting1214 · 2025-03-10

Hi Michael,
I'm more than happy to contribute more to this repo; I'm just a little occupied at the moment.
Once I've settled a few things, I'll come back and write more code with you. Stay tuned.

cursor

This PR is being reviewed by Cursor Bugbot

cursor · 2025-11-26T10:45:08Z

large-files

+  large-files        # Show top 20 largest files
+  large-files 50     # Show top 50 largest files
+EOF
+    exit 0


Bug: Error conditions exit with success code zero

The usage() function always exits with code 0, but it is also called in error scenarios: when too many arguments are provided (lines 35-36) and when the argument isn't a valid integer (lines 40-41). As a result, error conditions report success to any script or automation that checks the exit code. The "Error:" message is printed, but usage() then exits with 0 instead of a non-zero error code.

Additional Locations (2)

large-files#L34-L36

large-files#L39-L41
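One way to address this, sketched under assumptions, is to let usage() take the exit status as an optional argument so the help path still exits with 0 while the error paths pass a non-zero code. The parse_args helper, the "Usage:" line, and the error messages below are hypothetical; only usage() and NUM_FILES appear in the reviewed script.

```shell
#!/usr/bin/env bash
# Sketch only: parse_args and the message wording are hypothetical.

# Print help and exit with the given status (default 0 for plain help).
usage() {
    local status="${1:-0}"
    cat <<EOF
Usage: large-files [N]

  large-files        # Show top 20 largest files
  large-files 50     # Show top 50 largest files
EOF
    exit "$status"
}

# Validate the arguments and set NUM_FILES; errors exit non-zero.
parse_args() {
    if [ "$#" -gt 1 ]; then
        echo "Error: too many arguments" >&2
        usage 1
    fi
    if [ "$#" -eq 1 ]; then
        case "$1" in
            '' | *[!0-9]*)
                echo "Error: N must be a non-negative integer" >&2
                usage 1
                ;;
        esac
    fi
    NUM_FILES="${1:-20}"
}
```

With this shape, a caller can rely on the exit code: the error branches exit with 1, while requesting help through usage with no argument still exits with 0.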

cursor · 2025-11-26T10:45:08Z

large-files

+    git cat-file --batch-check='%(objectsize:disk) %(rest)' |
+    sort -rh |
+    head -n "$NUM_FILES" |
+    awk '{ printf "%10s %s\n", $1, $2 }' |


Bug: Filenames with spaces are truncated in output

The awk command uses $2 to print the filename, but awk splits fields on whitespace by default. For files with spaces in their paths (e.g., my document.txt), only the portion before the first space is printed, and the rest of the filename is silently dropped. The command needs to print all fields from $2 onward to preserve the complete file path.
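A possible fix, as a minimal sketch: save the size, strip it and the single separating space from the record, and print whatever remains as the path. The function name format_size_path is hypothetical; the printf format matches the one in the reviewed line.

```shell
#!/usr/bin/env bash
# Sketch only: format_size_path is a hypothetical name.

# Read "<size> <path>" lines and right-align the size in 10 columns,
# keeping the whole path even when it contains spaces.
format_size_path() {
    awk '{
        size = $1
        sub(/^[^ ]+ /, "")      # drop the size and one separating space
        printf "%10s %s\n", size, $0
    }'
}
```

Stripping the prefix instead of re-joining fields also preserves runs of consecutive spaces inside a path.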

feat: add large-file-script.

0da9b06

michen00 requested changes Mar 6, 2025

michen00 and others added 19 commits March 9, 2025 11:39

Merge branch 'main' into mark-chen-pr

268e0bd

Merge branch 'main' into mark-chen-pr

fea6565

Merge branch 'main' into mark-chen-pr

42f2b3e

Merge branch 'main' into mark-chen-pr

1229cfb

Merge branch 'main' into mark-chen-pr

e53e0cb

Merge branch 'main' into mark-chen-pr

4113f8f

Merge branch 'main' into mark-chen-pr

e564c47

Merge branch 'main' into mark-chen-pr

d5c04a3

Merge branch 'main' into mark-chen-pr

4c6db2f

Merge branch 'main' into mark-chen-pr

dc7373b

Merge branch 'main' into mark-chen-pr

97c74bf

Merge branch 'main' into mark-chen-pr

c9e29ce

Merge branch 'main' into mark-chen-pr

4e8e66f

Merge branch 'main' into mark-chen-pr

d65f6d2

Merge branch 'main' into mark-chen-pr

a3be28b

chore: autofix via pre-commit hooks

9b5f72c

for more information, see https://pre-commit.ci

Merge branch 'main' into mark-chen-pr

71c247d

Merge branch 'main' into mark-chen-pr

18acd2c

Merge branch 'main' into mark-chen-pr

446cb54

cursor bot reviewed Nov 26, 2025

michen00 and others added 7 commits November 26, 2025 05:56

fix(large-files): make the script executable

8f3cab6

Merge branch 'main' into mark-chen-pr

b66c0b0

Merge branch 'main' into mark-chen-pr

9baab2a

Merge branch 'main' into mark-chen-pr

f585fd5

Merge branch 'main' into mark-chen-pr

4811f24

Merge branch 'main' into mark-chen-pr

51b7259

Merge branch 'main' into mark-chen-pr

a09b445

Merge branch 'main' into mark-chen-pr

95a6e85
