diff --git a/README.md b/README.md
index 1aa8e60..c678146 100644
--- a/README.md
+++ b/README.md
@@ -1,83 +1,372 @@
-# GitHub-ForceLargeFiles
+# GitHub Force Large Files (GFL)
-This package is a simple work around for pushing large files to a GitHub repo.
+This tool provides a workaround for pushing large files to a GitHub repository without using Git LFS. It finds large files, compresses them, and splits them into smaller chunks that comply with GitHub's file size limits.
+It also provides a fully integrated Git workflow to automatically add, commit, and push the generated chunks in batches. Together, these features work around both of GitHub's constraints: the 100 MiB limit on individual files and the 2 GiB limit on a single `git push`.
-Since GitHub only allows pushing files up to 100 MB, a different service (such as [LFS](https://git-lfs.github.com/)) has to be used for larger files. This package compresses and splits large files that can be pushed to a GitHub repo without LFS.
+
+## Features
+
-It starts off at a root directory and traverses down subdirectories, and scans every file contained. If any file has a size that is above `threshold_size`, then they are compressed and split to multiple archives, each having a maximum size of `partition_size`. Compressing/Splitting works for any file extension.
+- **Cross-Platform**: Pure Python implementation; works on Windows, macOS, and Linux.
+- **Smart Splitting**: Automatically scans a directory for files exceeding a size threshold.
+- **Integrity Check**: Uses SHA256 hashes for both the original file and each individual chunk to ensure that restored files are bit-for-bit identical to the originals.
+- **Manifest System**: Tracks all split files and their chunks in a `.gfl_manifest.json` file, preventing accidental data loss and enabling robust tracking.
+- **Integrated Git Workflow**: Provides commands to automate the entire process: `split` -> `commit` -> `push`.
+- **Batch Committing & Pushing**: Automatically commits and pushes changes in batches to work around repository size limits.
-After compression/split, files can be pushed the usual way, using `git push`.
+
+## Requirements
+
+- Python 3.9+
+- Git installed and available in your system's PATH.
+- Required packages can be installed via `pip install -r requirements.txt`.
-## Parallelization
-- Although traversing directories in `src/main.py` is serial, compressing/splitting each file through 7z is parallelized by default.
-- Reversing with `src/reverse.py` is entirely serial. (TODO: Parallelize this too)
+
+## Installation
+
+1. Clone the repository:
-## Requirements
-- Python 3.x.x.
-- You need to have 7z installed. Visit the [7z Download](https://www.7-zip.org/download.html) page for more information.
-- Folders/Files in the traversed directories should have appropriate read/write permissions.
+
+   ```sh
+   git clone https://github.com/your-username/GitHub-ForceLargeFiles.git
+   cd GitHub-ForceLargeFiles
+   ```
+
+2. Install the required Python packages:
+
+   ```sh
+   pip install -r requirements.txt
+   ```
+
+---
-## Example Usage
-Run with the default parameters:
+
+## Command Reference
+
+The primary entry point is the `gfl.py` script. The tool is designed to be run from anywhere, as long as you specify the target repository using `--root_dir`.
+
+```sh
+python gfl.py [COMMAND] [OPTIONS]
+```
-```
-$ python3 src/main.py --root_dir ~/MyFolder
+
+### Common Options
+
+These options are available for all commands.
+
+- `--root_dir <path>`: Specifies the path to the Git repository you want to operate on. If not provided, it defaults to the current directory or automatically finds the Git repository root from parent directories.
+- `--no-status-check`: Bypasses the safety check that ensures the Git repository is in a clean state before running a command. **Use with caution**, as running `gfl` on a repository with uncommitted changes or in the middle of a merge/rebase can lead to unexpected results.
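When `--root_dir` is omitted, the tool falls back to Git's own parent-directory search (internally it relies on GitPython's `search_parent_directories`). The idea can be pictured with nothing but the standard library; `find_git_root` below is an illustrative stand-in, not the tool's actual API:

```python
import os

def find_git_root(start: str):
    """Walk upward from `start` until a directory containing `.git` is found.

    Returns the repository root, or None when `start` is not inside a Git
    repository. Illustrative sketch only; the real tool uses GitPython.
    """
    path = os.path.abspath(start)
    while True:
        # `.git` is normally a directory, but can be a file for worktrees
        if os.path.exists(os.path.join(path, ".git")):
            return path
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root: no repository found
            return None
        path = parent
```

A `None` result here corresponds to the documented fallback: when no enclosing repository is found, the tool defaults to the current directory.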
+ +--- + +### `split` + +Scans for files larger than a given threshold, then compresses and splits them into smaller chunks. It records all information into the `.gfl_manifest.json` file, including hashes for integrity checks. + +- **Usage:** + + ```sh + python gfl.py split [OPTIONS] + ``` + +- **Key Options:** + - `--threshold-size `: Sets the file size threshold in megabytes. Files larger than this will be split. Default: `98`. + - `--partition-size `: Sets the maximum size for each chunk. Default: `95`. + - `--no-delete-original`: Use this flag to keep the original large file after splitting. By default, it is deleted. +- **Examples:** + + ```sh + # Split files larger than 80MB in the current repo + python gfl.py split --threshold-size 80 + + # Run split on a different repository without deleting the original file + python gfl.py split --root_dir /path/to/another/repo --no-delete-original + ``` + +--- + +### `reverse` + +Reassembles the original large files from their chunks using the `.gfl_manifest.json` file. It verifies the integrity of every chunk before assembly, and the final file after assembly. + +- **Usage:** + + ```sh + python gfl.py reverse [OPTIONS] + ``` + +- **Key Options:** + - `--no-delete-partitions`: Use this flag to keep the chunk files after the original file has been restored. By default, they are deleted. +- **Safety Feature:** This command will refuse to overwrite an existing file if its content does not match the original file recorded in the manifest. This prevents accidental data loss. +- **Examples:** + + ```sh + # Restore all large files in the current repo + python gfl.py reverse + + # Restore files in a different repo and keep the chunk files + python gfl.py reverse --root_dir /path/to/another/repo --no-delete-partitions + ``` + +--- + +### `commit` + +A smart commit tool that stages all changes and creates one or more clean, size-aware commits. This command replaces the need for manual `git add` and `git commit`. 
+ +- **Usage:** + + ```sh + python gfl.py commit [OPTIONS] + ``` + +- **Key Logic:** + 1. First, it creates a separate commit for any file deletions. + 2. Then, it intelligently batches all other changes (new files, modified chunks, etc.) into commits that respect a given size limit. +- **Key Options:** + - `--commit-limit `: The size limit in gigabytes for a single commit batch. Default: `0.5`. +- **Examples:** + + ```sh + # Commit all changes in the current repo + python gfl.py commit + + # Commit changes in another repo, with a batch limit of 500MB + python gfl.py commit --root_dir /path/to/another/repo --commit-limit 0.5 + ``` + +--- + +### `push` + +Pushes your local commits to the remote repository (`origin`) one by one. This is essential for repositories with many GFL commits, as it bypasses the per-push size limits imposed by services like GitHub. + +- **Usage:** + + ```sh + python gfl.py push [OPTIONS] + ``` + +- **Example:** + + ```sh + # Push commits for the repository located at ../my-repo + python gfl.py push --root_dir ../my-repo + ``` + +--- + +### `auto-push` + +A convenience command that runs the entire authoring workflow for you: `split` -> `commit` -> `push`. + +- **Usage:** + + ```sh + python gfl.py auto-push [OPTIONS] + ``` + +- **Details:** This command accepts the options from all the commands it wraps, such as `--threshold-size` and `--commit-limit`. +- **Example:** + + ```sh + # Run the full workflow on a different repo, with a 150MB split threshold and 1GB commit limit + python gfl.py auto-push --root_dir ../my-other-repo --threshold-size 150 --commit-limit 1.0 + ``` + +--- + +### `download` + +Downloads a repository by fetching its commits one by one. This is useful for cloning very large repositories that might fail with a standard `git clone` due to network issues. It automatically resumes from the last downloaded commit if the process is interrupted. 
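The resume behaviour of `download` amounts to set arithmetic over the commit history: fetch the full list of SHAs, drop those already present locally, and process the remainder oldest-first. A minimal sketch of that selection step (the function name and inputs are invented for illustration, not the tool's actual API):

```python
def commits_to_fetch(history, local_shas):
    """Given the full commit history ordered oldest -> newest and the set of
    commits already downloaded, return the commits still to fetch, preserving
    order. Illustrative only; the real tool derives `local_shas` from git.
    """
    have = set(local_shas)
    return [sha for sha in history if sha not in have]
```

Each remaining SHA can then be fetched individually (e.g. with `git fetch origin <sha>`), so an interrupted run simply recomputes the remainder and continues where it left off.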
+ +- **Usage:** + + ```sh + python gfl.py download [OPTIONS] + ``` + +- **Key Options:** + - `repo_url`: The full URL of the repository to download (HTTPS or SSH). + - `--output `: (Optional) Directory to download the repository into. Defaults to a new folder in the current directory named after the repository. + - `--proxy `: (Optional) Sets a proxy for the GitHub API requests needed to fetch the commit list. Example: `--proxy 'socks5h://127.0.0.1:9050'`. + - `--token `: (Optional) A GitHub Personal Access Token (PAT) to authenticate API requests and increase rate limits. The script will automatically use the `GITHUB_TOKEN` environment variable if it is set. +- **Examples:** + + ```sh + # Download a repository to a specific directory + python gfl.py download https://github.com/user/big-repo.git --output /path/to/my-downloads/big-repo + + # Download a repository using SSH and a SOCKS5 proxy for the API calls + python gfl.py download git@github.com:user/big-repo.git --proxy 'socks5h://127.0.0.1:9050' + ``` + +--- + +## Workflows + +Here are the three primary workflows for using GFL. + +### Workflow 1: Creating a New Repository with Large Files + +**Goal:** To create a brand new repository on GitHub for a project that contains large files. + +**Step 1: Create an Empty Repository on GitHub** + +1. Go to [GitHub.com](https://github.com) and create a new repository. +2. **Important:** Do **not** initialize the repository with a `README`, `.gitignore`, or `license`. You need a completely empty repository to begin. +3. Copy the repository's URL (e.g., `https://github.com/your-username/your-big-repo.git`). + +**Step 2: Prepare Your Local Project** + +1. Create a folder for your project on your computer and navigate into it. + + ```sh + mkdir my-big-project + cd my-big-project + ``` + +2. Initialize a Git repository. + + ```sh + git init + ``` + +3. Add your GitHub repository as the remote origin (paste the URL from Step 1). 
+ + ```sh + git remote add origin https://github.com/your-username/your-big-repo.git + ``` + +4. Copy all your project files, including the very large ones, into this folder. +5. (Optional) Place the `gfl.py` script in this folder so you can run it easily. + +**Step 3: Process and Push Your Large Files** + +Use the all-in-one `auto-push` command. This single command will scan for large files, split them, create size-aware commits, and push everything to GitHub sequentially. + +```sh +# This command handles the entire workflow for you. +# (Optional) Add --manage-gitignore to have GFL automatically ignore the original large files. +python gfl.py auto-push --manage-gitignore +``` + +### Workflow 2: Adding Large Files to an Existing Repository + +**Goal:** To add new large files to a project that is already in a GitHub repository. + +**Step 1: Add Your Files to Your Local Repository** + +1. Make sure you have the latest version of your project by running `git pull`. +2. Copy or save your new large files into your project's folder (e.g., into a `data` or `assets` sub-folder). + +**Step 2: Process and Push the New Files** + +From inside your repository's directory, run the `auto-push` command. It will ignore all the existing files, find only the new large files you just added, and run the full split, commit, and push process for them. + +```sh +# This command finds the new large files and handles the full workflow. +python gfl.py auto-push +``` + +If you want more control, you can also run the commands manually: +`python gfl.py split` -> `python gfl.py commit` -> `python gfl.py push`. + +### Workflow 3: Cloning and Restoring a GFL Repository + +**Goal:** To download a repository that was created with GFL and restore the original large files. + +**Step 1: Clone the Repository** + +Standard `git clone` may fail on very large repositories. It's highly recommended to use the GFL `download` command, which fetches commits one by one for better reliability. 
+ +```sh +# Replace the URL with your repository's URL +python gfl.py download https://github.com/your-username/your-big-repo.git + +# Navigate into the newly created directory +cd your-big-repo ``` -which will traverse down every subdirectory starting from `~/MyFolder`, and reduce all files over 100 MB to smaller archives with maximum size of approximately 95 MB. The default option is to delete the original (large) files afterwards, but this can be turned off. -The comparison below describes the use of this package more clearly. +**Step 2: Restore the Original Large Files** -Before: +After cloning, the repository only contains the small chunk files. Run the `reverse` command to reassemble the original large files. + +```sh +# This reads the manifest, rebuilds the large files, and deletes the chunks. +python gfl.py reverse ``` -$ tree --du -h ~/MyFolder - -└── [415M] My Datasets -│ ├── [6.3K] Readme.txt -│   └── [415M] Data on Leaf-Tailed Gecko -│   ├── [ 35M] DatasetA.zip -│   ├── [ 90M] DatasetB.zip -│   ├── [130M] DatasetC.zip -│    └── [160M] Books -│    ├── [ 15M] RegularBook.pdf -│    └── [145M] BookWithPictures.pdf -└── [818M] Video Conference Meetings - ├── [817M] Discussion_on_Fermi_Paradox.mp4 - └── [1.1M] Notes_on_Discussion.pdf + +Your local copy of the project is now complete, and all the large files are restored to their original state. + +## Advanced Usage + +### Bypassing API Rate Limits + +When using the `download` command, you might encounter GitHub's API rate limit, especially if you are on a shared network or using a proxy/VPN. Unauthenticated requests are limited to 60 per hour per IP. To solve this, you can use a GitHub Personal Access Token (PAT) to raise the limit to 5,000 requests per hour. + +1. **Create a Personal Access Token** on GitHub. You can follow the [official GitHub documentation](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token). 
The only permission (scope) required for this script to read commit history from a public repository is `public_repo`.
+
+2. **Provide the Token to the Script**. You have two options:
+
+   - **(Recommended)** Set an environment variable named `GITHUB_TOKEN`. The script will automatically detect and use it.
+
+     ```sh
+     # On Linux/macOS
+     export GITHUB_TOKEN="your_github_token_here"
+
+     # On Windows (Command Prompt; no quotes, or they become part of the value)
+     set GITHUB_TOKEN=your_github_token_here
+
+     # On Windows (PowerShell)
+     $env:GITHUB_TOKEN="your_github_token_here"
+     ```
+
+   - Use the `--token` command-line flag. **Note:** This may expose the token in your shell's history.
+
+     ```sh
+     python gfl.py download git@github.com:user/big-repo.git --token "your_github_token_here"
+     ```
+
+### Using a Proxy (e.g., Tor)
+
+The `download` command makes two types of network requests: API calls to GitHub and Git operations. To route all traffic through a proxy like Tor, you must configure both.
+
+#### 1. Configure API Requests (Script-Side)
+
+Use the `--proxy` flag on the `download` command. This tells the script to use a proxy for its GitHub API requests.
+
+```sh
+# Example using Tor's default proxy address
+python gfl.py download git@github.com:user/big-repo.git --proxy 'socks5h://127.0.0.1:9050' --token "your_github_token_here" --output ..\outputdir
+```
+
+#### 2. Configure Git Operations (Client-Side)
+
+You also need to configure your local Git client to use the proxy. This is a one-time setup on your machine.
+
+**For HTTPS URLs:**
+
+Run the following command to tell Git to use the proxy for all HTTPS operations:
+
+```sh
+git config --global http.proxy 'socks5h://127.0.0.1:9050'
+```
-After:
+
+**For SSH URLs:**
+
+Edit your SSH configuration file (usually at `~/.ssh/config` on Linux/macOS or `C:\Users\<username>\.ssh\config` on Windows) and add the following `Host` block. This tells SSH to route connections to `github.com` through the proxy.
``` -$ tree --du -h ~/MyFolder - -└── [371M] My Datasets -│ ├── [6.3K] Readme.txt -│   └── [371M] Data on Leaf-Tailed Gecko -│   ├── [ 35M] DatasetA.zip -│   ├── [ 90M] DatasetB.zip -│   ├── [ 95M] DatasetC.zip.7z.001 -│   ├── [ 18M] DatasetC.zip.7z.002 -│    └── [133M] Books -│    ├── [ 15M] RegularBook.pdf -│    ├── [ 95M] BookWithPictures.pdf.7z.001 -│    └── [ 23M] BookWithPictures.pdf.7z.002 -└── [794M] Video Conference Meetings - ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.001 - ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.002 - ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.003 - ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.004 - ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.005 - ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.006 - ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.007 - ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.008 - ├── [ 33M] Discussion_on_Fermi_Paradox.mp4.7z.009 - └── [1.1M] Notes_on_Discussion.pdf +Host github.com + User git + Hostname github.com + Port 22 + ProxyCommand nc -X 5 -x 127.0.0.1:9050 %h %p ``` -To revert back to the original files, run: + +*Note: This method requires `nc` (netcat) to be installed on your system.* + +``` +Host github.com + User git + Hostname github.com + Port 22 + ProxyCommand C:/Users//scoop/apps/git/current/mingw64/bin/connect.exe -a none -S 127.0.0.1:9050 %h %p -a none -S 127.0.0.1:9050 %h %p ``` -$ python3 src/reverse.py --root_dir ~/MyFolder -``` + +*Note: This method requires `connect.exe` (from git-for-windows) on your system.* diff --git a/gfl.py b/gfl.py new file mode 100644 index 0000000..9fa384f --- /dev/null +++ b/gfl.py @@ -0,0 +1,932 @@ +import argparse +import sys +import os +import git +import json +import hashlib +import lzma +import io +import multiprocessing +import subprocess +import requests +import re + +# --- Constants --- +DEFAULT_THRESHOLD_SIZE_MB = 98 +DEFAULT_PARTITION_SIZE_MB = 95 +DEFAULT_COMMIT_LIMIT_GB = 0.5 + + +def get_git_root(path): + """Find the git repository root.""" 
+ try: + repo = git.Repo(path, search_parent_directories=True) + return repo.working_dir + except git.InvalidGitRepositoryError: + return None + + +def check_git_status(git_root, command): + """Checks if the Git repository is in a clean state before proceeding.""" + if not git_root: + print("Error: Not a Git repository. Please run in a git repository.") + sys.exit(1) + repo = git.Repo(git_root) + try: + # 1. Critical states that should always cause an exit + if repo.head.is_detached: + print( + "Error: HEAD is detached. Please check out a branch before running gfl." + ) + sys.exit(1) + git_dir = repo.git_dir + if os.path.exists(os.path.join(git_dir, "MERGE_HEAD")): + print( + "Error: A merge is in progress. Please resolve the conflict and commit before running gfl." + ) + sys.exit(1) + if os.path.exists(os.path.join(git_dir, "rebase-apply")) or os.path.exists( + os.path.join(git_dir, "rebase-merge") + ): + print( + "Error: A rebase is in progress. Please complete or abort it before running gfl." + ) + sys.exit(1) + + # 2. Command-specific checks for working directory changes + status_output = repo.git.status(porcelain=True) + if command in ["split", "auto-push", "reverse"]: + # For these commands, we allow untracked files but nothing else. + non_untracked_changes = [ + line for line in status_output.splitlines() if not line.startswith("??") + ] + if non_untracked_changes: + print( + f"Error: Your repository has staged or modified files.", + file=sys.stderr, + ) + print( + f"Please commit or stash them before running '{command}'.", + file=sys.stderr, + ) + print( + "Git status details:\n" + "\n".join(non_untracked_changes), + file=sys.stderr, + ) + sys.exit(1) + + # For 'commit' and 'push', no working directory checks are needed. + print("Git status is clean for this operation. Proceeding...") + finally: + # Ensure the repo object is closed to release git handles. 
+ repo.close() + + +def main(): + """Main function to parse arguments and call appropriate sub-commands.""" + + parser = argparse.ArgumentParser( + description="GitHub Force Large Files (GFL): A tool to handle large files in Git repositories." + ) + + subparsers = parser.add_subparsers( + dest="command", required=True, help="Available commands" + ) + + git_root = get_git_root(os.getcwd()) + + default_dir = git_root if git_root else os.getcwd() + + base_parser = argparse.ArgumentParser(add_help=False) + + base_parser.add_argument( + "--root_dir", + type=str, + default=default_dir, + help="Root directory. Defaults to Git repo root or current directory.", + ) + base_parser.add_argument( + "--no-status-check", + action="store_true", + help="Bypass the git status safety check before running a command.", + ) + base_parser.add_argument( + "--manage-gitignore", + action="store_true", + help="Enable automatic management of .gitignore for large files.", + ) + + commit_parser = argparse.ArgumentParser(add_help=False) + + commit_parser.add_argument( + "--commit-limit", + type=float, + default=DEFAULT_COMMIT_LIMIT_GB, + help=f"Size limit in GB for a single commit (default: {DEFAULT_COMMIT_LIMIT_GB}GB).", + ) + + parser_split = subparsers.add_parser( + "split", + help="Scan for large files, compress and split them into chunks.", + parents=[base_parser], + ) + + parser_split.add_argument( + "--delete-original", + action=argparse.BooleanOptionalAction, + default=True, + help="Delete the original file after splitting (default: True).", + ) + + parser_split.add_argument( + "--threshold-size", + type=int, + default=DEFAULT_THRESHOLD_SIZE_MB, + help=f"Max threshold of the original file size to split into archive in MB (default: {DEFAULT_THRESHOLD_SIZE_MB}MB).", + ) + + parser_split.add_argument( + "--partition-size", + type=int, + default=DEFAULT_PARTITION_SIZE_MB, + help=f"Max size of an individual archive in MB (default: {DEFAULT_PARTITION_SIZE_MB}MB).", + ) + + 
parser_split.set_defaults(func=handle_split)
+
+    parser_reverse = subparsers.add_parser(
+        "reverse",
+        help="Recreate original large files from their chunks.",
+        parents=[base_parser],
+    )
+
+    parser_reverse.add_argument(
+        "--delete-partitions",
+        action=argparse.BooleanOptionalAction,
+        default=True,
+        help="Delete the partition archives after restoring the original files (default: True).",
+    )
+
+    parser_reverse.set_defaults(func=handle_reverse)
+
+    parser_commit = subparsers.add_parser(
+        "commit",
+        help="Batch and commit all changes by size and type.",
+        parents=[base_parser, commit_parser],
+    )
+
+    parser_commit.set_defaults(func=handle_commit)
+
+    parser_push = subparsers.add_parser(
+        "push",
+        help="Push commits in batches to the remote repository.",
+        parents=[base_parser],
+    )
+
+    parser_push.set_defaults(func=handle_push)
+
+    parser_auto_push = subparsers.add_parser(
+        "auto-push",
+        help="Fully automated workflow: split -> commit -> push.",
+        parents=[base_parser, commit_parser],
+    )
+
+    parser_auto_push.set_defaults(func=handle_auto_push)
+
+    parser_download = subparsers.add_parser(
+        "download",
+        help="Download a repository commit by commit.",
+        parents=[base_parser],
+    )
+    parser_download.add_argument(
+        "repo_url", type=str, help="The URL of the git repository to download."
+    )
+    parser_download.add_argument(
+        "--proxy",
+        type=str,
+        help="Proxy for API requests (e.g., 'socks5h://127.0.0.1:9050').",
+    )
+    parser_download.add_argument(
+        "--token", type=str, help="GitHub Personal Access Token for authentication."
+    )
+    parser_download.add_argument(
+        "--output",
+        type=str,
+        help="Directory to download the repository into. Defaults to a new folder named after the repo.",
+    )
+    parser_download.set_defaults(func=handle_download)
+
+    args = parser.parse_args()
+
+    # Perform git status check for commands that need a specific repo state
+    if not args.no_status_check and args.command in ["split", "reverse", "auto-push"]:
+        current_git_root = get_git_root(args.root_dir)
+        check_git_status(current_git_root, args.command)
+
+    args.func(args)
+
+
+def get_file_hash(file_path):
+    sha256 = hashlib.sha256()
+    with open(file_path, "rb") as f:
+        while chunk := f.read(8192):
+            sha256.update(chunk)
+    return sha256.hexdigest()
+
+
+def is_over_threshold(f_full_dir, args):
+    threshold_bytes = args.threshold_size * (1024**2)
+    return os.stat(f_full_dir).st_size > threshold_bytes
+
+
+def process_file_worker(file_info):
+    f_full_dir, args, git_root = file_info
+    relative_path = os.path.relpath(f_full_dir, git_root).replace(os.sep, "/")
+    print(f"Processing '{relative_path}'...")
+    current_hash = get_file_hash(f_full_dir)
+    with open(f_full_dir, "rb") as f_in:
+        compressed_data = lzma.compress(f_in.read())
+    partition_size_bytes = args.partition_size * (1024**2)
+    f_full_dir_noext, ext = os.path.splitext(f_full_dir)
+    archive_base_name = f_full_dir_noext + "."
+ ext[1:] + ".xz" + chunks = [] + chunk_num = 1 + with io.BytesIO(compressed_data) as compressed_stream: + while True: + chunk_data = compressed_stream.read(partition_size_bytes) + if not chunk_data: + break + chunk_filename_abs = f"{archive_base_name}.{str(chunk_num).zfill(3)}" + with open(chunk_filename_abs, "wb") as f_chunk: + f_chunk.write(chunk_data) + chunk_relative_path = os.path.relpath(chunk_filename_abs, git_root).replace( + os.sep, "/" + ) + chunk_hash = get_file_hash(chunk_filename_abs) + chunks.append({"path": chunk_relative_path, "sha256": chunk_hash}) + chunk_num += 1 + if args.delete_original: + os.remove(f_full_dir) + return relative_path, {"original_sha256": current_hash, "chunks": chunks} + + +def traverse_and_find_files(args, manifest, git_root): + files_to_process = [] + for root, dirs, files in os.walk(args.root_dir): + if ".git" in dirs: + dirs.remove(".git") + for f in files: + f_full_dir = os.path.join(root, f) + if f == ".gfl_manifest.json" or not os.path.isfile(f_full_dir): + continue + if not is_over_threshold(f_full_dir, args): + continue + relative_path = os.path.relpath(f_full_dir, git_root).replace(os.sep, "/") + current_hash = get_file_hash(f_full_dir) + if relative_path in manifest: + if manifest[relative_path].get("original_sha256") == current_hash: + continue # Skip unchanged files + else: + # File has changed, warn and abort + print( + f"ERROR: Large file '{relative_path}' has changed since it was last split." + ) + print( + "The original file hash in the manifest does not match the current file hash." + ) + print( + "To proceed, please run 'gfl reverse' to restore the old version, manage the conflict, then re-run 'gfl split'." 
+ ) + sys.exit(1) + files_to_process.append((f_full_dir, args, git_root)) + return files_to_process + + +def batch_commit( + repo, files_to_commit, commit_message_prefix, commit_limit_bytes, author, committer +): + """Generic function to commit a list of files in size-aware batches.""" + current_batch_files, current_batch_size, commit_batches = [], 0, [] + # Sort files to make batching deterministic + sorted_files = sorted(files_to_commit.items()) + + for file_path, file_size in sorted_files: + if current_batch_size + file_size > commit_limit_bytes and current_batch_files: + commit_batches.append(current_batch_files) + current_batch_files, current_batch_size = [], 0 + current_batch_files.append(file_path) + current_batch_size += file_size + if current_batch_files: + commit_batches.append(current_batch_files) + + total_batches = len(commit_batches) + if total_batches == 0: + return False + + for i, batch in enumerate(commit_batches): + batch_num = i + 1 + commit_message = ( + f"{commit_message_prefix} ({batch_num}/{total_batches})" + if total_batches > 1 + else commit_message_prefix + ) + print( + f"Creating commit {batch_num}/{total_batches} with {len(batch)} file(s)..." 
+ ) + repo.index.add(batch) + repo.index.commit(commit_message, author=author, committer=committer) + print(f'Successfully committed with message: "{commit_message}"') + return True + + +def handle_split(args): + print("Executing: split") + git_root = get_git_root(args.root_dir) + if not git_root: + print("Error: Not a Git repository.") + sys.exit(1) + manifest_path = os.path.join(git_root, ".gfl_manifest.json") + manifest = {} + if os.path.exists(manifest_path): + with open(manifest_path, "r") as f: + manifest = json.load(f) + files_to_process = traverse_and_find_files(args, manifest, git_root) + if not files_to_process: + print("No new or modified large files to split.") + return + + print(f"Found {len(files_to_process)} file(s) to process in parallel...") + with multiprocessing.Pool() as pool: + results = pool.map(process_file_worker, files_to_process) + + processed_paths = [] + for relative_path, manifest_entry in results: + manifest[relative_path] = manifest_entry + processed_paths.append(relative_path) + print(f"Updated manifest for '{relative_path}'.") + + with open(manifest_path, "w") as f: + json.dump(manifest, f, indent=4) + print(f"Manifest file '{manifest_path}' has been updated.") + + if processed_paths and args.manage_gitignore: + gitignore_path = os.path.join(git_root, ".gitignore") + print( + f"Info: Updating .gitignore for {len(processed_paths)} new large file(s)..." 
+ ) + + existing_ignores = [] + if os.path.exists(gitignore_path): + with open(gitignore_path, "r") as f: + existing_ignores = [line.strip() for line in f.readlines()] + + with open(gitignore_path, "a") as f: + for path in processed_paths: + path_as_ignore = path.replace(os.sep, "/") + if path_as_ignore not in existing_ignores: + f.write(f"\n# GFL: Tracked large file\n{path_as_ignore}\n") + print(".gitignore has been updated.") + + +def handle_commit(args): + print("Executing: batch commit") + git_root = get_git_root(args.root_dir) + if not git_root: + print("Error: Not a Git repository.") + sys.exit(1) + repo = git.Repo(git_root) + try: + # Proactively clean up manifest and .gitignore if ALL chunks for an entry are deleted. + manifest_path = os.path.join(git_root, ".gfl_manifest.json") + if os.path.exists(manifest_path): + with open(manifest_path, "r") as f: + manifest = json.load(f) + + original_manifest = manifest.copy() + manifest_updated = False + paths_to_unignore = [] + + for original_path, data in original_manifest.items(): + chunks = data.get("chunks", []) + if not chunks: + continue + + all_chunks_deleted = True + for chunk_info in chunks: + chunk_abs_path = os.path.join( + git_root, chunk_info["path"].replace("/", os.sep) + ) + if os.path.exists(chunk_abs_path): + all_chunks_deleted = False + break + + if all_chunks_deleted: + print( + f"Info: All chunks for '{original_path}' are deleted. Removing from manifest." 
+ ) + del manifest[original_path] + paths_to_unignore.append(original_path) + manifest_updated = True + + if manifest_updated: + with open(manifest_path, "w") as f: + json.dump(manifest, f, indent=4) + print("Manifest update complete.") + + if paths_to_unignore and args.manage_gitignore: + print("Info: Removing GFL entries from .gitignore...") + gitignore_path = os.path.join(git_root, ".gitignore") + if os.path.exists(gitignore_path): + with open(gitignore_path, "r") as f: + lines = f.readlines() + + paths_to_unignore_slashed = { + p.replace(os.sep, "/") for p in paths_to_unignore + } + with open(gitignore_path, "w") as f: + for line in lines: + if ( + line.strip() not in paths_to_unignore_slashed + and not line.strip().startswith("# GFL") + ): + f.write(line) + print(".gitignore update complete.") + + author, committer = getattr(args, "author", None), getattr( + args, "committer", None + ) + + # 1. Find tracked files that have been deleted from the working tree + deleted_files = [ + item.a_path for item in repo.index.diff(None) if item.deleted_file + ] + + # 2. Stage everything to discover all other changes + repo.git.add(all=True) + + # 3. Get a list of all other staged changes (additions/modifications) + staged_items = repo.index.diff("HEAD") + modified_files = [ + item.a_path for item in staged_items if item.a_path not in deleted_files + ] + + if not deleted_files and not modified_files: + print("No changes to commit.") + repo.git.reset( + "HEAD", "--" + ) # Clean up index if we staged but found no diff + return + + # 4. Unstage everything to handle changes in batches + repo.git.reset("HEAD", "--", ".") + committed_something = False + + # 5. 
Handle and commit deleted files first + if deleted_files: + print("--- Committing deleted files ---") + repo.index.remove(deleted_files, working_tree=False) + repo.index.commit( + "Remove deleted files", author=author, committer=committer + ) + print(f"Successfully committed {len(deleted_files)} deleted file(s).") + committed_something = True + + # 6. Proceed with batch-committing modified/new files + if modified_files: + manifest_rel_path = os.path.relpath(manifest_path, git_root).replace( + os.sep, "/" + ) + all_chunk_files = set() + if os.path.exists(manifest_path): + with open(manifest_path, "r") as f: + current_manifest_data = json.load(f) + for item in current_manifest_data.values(): + for chunk_info in item.get("chunks", []): + all_chunk_files.add(chunk_info["path"]) + + normal_files = { + f + for f in modified_files + if f not in all_chunk_files and f != manifest_rel_path + } + gfl_files = set(modified_files) - normal_files + commit_limit_bytes = args.commit_limit * (1024**3) + + if normal_files: + print("--- Committing normal files ---") + files_to_commit = { + f: os.path.getsize(os.path.join(git_root, f)) + for f in normal_files + if os.path.exists(os.path.join(git_root, f)) + } + if batch_commit( + repo, + files_to_commit, + "Update regular files", + commit_limit_bytes, + author, + committer, + ): + committed_something = True + + if gfl_files: + print("--- Committing GFL files ---") + files_to_commit = { + f: os.path.getsize(os.path.join(git_root, f)) + for f in gfl_files + if os.path.exists(os.path.join(git_root, f)) + } + if batch_commit( + repo, + files_to_commit, + "Update large file chunks", + commit_limit_bytes, + author, + committer, + ): + committed_something = True + + if not committed_something: + print("No changes to commit.") + else: + print("All changes committed.") + finally: + # Ensure the repo object is closed to release git handles. 
+ repo.close() + + +def handle_reverse(args): + print("Executing: reverse") + git_root = get_git_root(args.root_dir) + if not git_root: + print("Error: Not a Git repository.") + sys.exit(1) + manifest_path = os.path.join(git_root, ".gfl_manifest.json") + if not os.path.exists(manifest_path): + print("Manifest file not found. Nothing to reverse.") + return + with open(manifest_path, "r") as f: + manifest = json.load(f) + entries_to_delete_from_manifest = [] + + # Iterate over a static copy of keys to prevent mutation issues + for relative_path in list(manifest.keys()): + data = manifest[relative_path] + original_file_abs_path = os.path.join( + git_root, relative_path.replace("/", os.sep) + ) + original_hash_in_manifest = data.get("original_sha256") + chunks = data.get("chunks", []) + + # Pre-check to prevent overwriting an existing, different file + if os.path.exists(original_file_abs_path): + existing_file_hash = get_file_hash(original_file_abs_path) + if existing_file_hash == original_hash_in_manifest: + print( + f"Skipping reversal for '{relative_path}': Correct file already exists." + ) + if args.delete_partitions: + print("Deleting chunk files...") + for chunk_info in chunks: + chunk_abs_path = os.path.join( + git_root, chunk_info["path"].replace("/", os.sep) + ) + if os.path.exists(chunk_abs_path): + os.remove(chunk_abs_path) + entries_to_delete_from_manifest.append(relative_path) + continue # Proceed to the next file in the manifest + else: + print( + f"ERROR: A different file already exists at '{relative_path}'. Will not overwrite." 
+ ) + print("Please move or delete this file and run 'reverse' again.") + continue # Proceed to the next file in the manifest + + print(f"Reversing '{relative_path}'...") + if not chunks: + continue + + compressed_data = io.BytesIO() + all_chunks_valid = True + for chunk_info in chunks: + chunk_relative_path = chunk_info["path"] + chunk_expected_hash = chunk_info["sha256"] + chunk_abs_path = os.path.join( + git_root, chunk_relative_path.replace("/", os.sep) + ) + + if not os.path.exists(chunk_abs_path): + print( + f"ERROR: Chunk file {chunk_relative_path} not found! Cannot reverse '{relative_path}'." + ) + all_chunks_valid = False + break + + chunk_current_hash = get_file_hash(chunk_abs_path) + if chunk_current_hash != chunk_expected_hash: + print( + f"ERROR: Chunk file {chunk_relative_path} is corrupt! Hash mismatch." + ) + print(f"Expected: {chunk_expected_hash}, Got: {chunk_current_hash}") + print(f"Cannot reverse '{relative_path}'.") + all_chunks_valid = False + break + + with open(chunk_abs_path, "rb") as f_chunk: + compressed_data.write(f_chunk.read()) + + if all_chunks_valid: + decompressed_data = lzma.decompress(compressed_data.getvalue()) + with open(original_file_abs_path, "wb") as f_out: + f_out.write(decompressed_data) + print(f"Successfully recreated '{relative_path}'. Verifying integrity...") + recreated_hash = get_file_hash(original_file_abs_path) + if recreated_hash == original_hash_in_manifest: + print("Integrity check passed.") + if args.delete_partitions: + print("Deleting chunk files...") + for chunk_info in chunks: + chunk_abs_path = os.path.join( + git_root, chunk_info["path"].replace("/", os.sep) + ) + if os.path.exists(chunk_abs_path): + os.remove(chunk_abs_path) + entries_to_delete_from_manifest.append(relative_path) + else: + print( + f"ERROR: Integrity check failed for '{relative_path}'. The reversed file is corrupt." 
+ ) + + if entries_to_delete_from_manifest: + for entry in entries_to_delete_from_manifest: + del manifest[entry] + with open(manifest_path, "w") as f: + json.dump(manifest, f, indent=4) + print(f"Manifest file '{manifest_path}' has been updated.") + + +def handle_push(args): + print("Executing: git push") + git_root = get_git_root(args.root_dir) + if not git_root: + print("Error: Not a Git repository.", file=sys.stderr) + sys.exit(1) + + repo = git.Repo(git_root) + try: + remote_name = "origin" + try: + remote = repo.remote(name=remote_name) + except ValueError: + print( + f"Error: Remote '{remote_name}' not found in repository.", + file=sys.stderr, + ) + sys.exit(1) + + current_branch = repo.active_branch + + try: + remote_branch = remote.refs[current_branch.name] + commit_range = f"{remote_branch.commit}..{current_branch.commit}" + commits_to_push = list(repo.iter_commits(commit_range)) + except IndexError: + # Remote branch doesn't exist, push all commits on the current branch + commits_to_push = list(repo.iter_commits(current_branch.name)) + + if not commits_to_push: + print("No new commits to push.") + return + + n_commits = len(commits_to_push) + print(f"Found {n_commits} commit(s) to push.") + + for i, commit in enumerate(reversed(commits_to_push)): + print(f"--- Pushing commit {i + 1}/{n_commits} ({commit.hexsha[:7]}) ---") + refspec = f"{commit.hexsha}:refs/heads/{current_branch.name}" + cmd = ["git", "push", "--progress", remote_name, refspec] + + try: + process = subprocess.Popen( + cmd, + cwd=git_root, + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT, # Redirect stderr to stdout to capture all progress + text=True, + encoding="utf-8", + bufsize=1, # Line-buffered + universal_newlines=True, + ) + + for line in iter(process.stdout.readline, ""): + print(line, end="") + + process.stdout.close() + return_code = process.wait() + + if return_code != 0: + print( + f"\nERROR: Git push failed for commit {commit.hexsha[:7]} with exit code {return_code}", + 
file=sys.stderr, + ) + sys.exit(1) + + print(f"--- Successfully pushed commit {commit.hexsha[:7]} ---") + + except Exception as e: + print( + f"\nERROR: Failed to execute git push for commit {commit.hexsha[:7]}." + ) + print(e, file=sys.stderr) + sys.exit(1) + + print("\n--- Updating remote HEAD to the latest commit ---") + try: + final_push_cmd = ["git", "push", remote_name, f"HEAD:{current_branch.name}"] + subprocess.run( + final_push_cmd, cwd=git_root, check=True, capture_output=True, text=True + ) + print("Remote branch updated successfully.") + except subprocess.CalledProcessError as e: + print("Error updating remote HEAD:", file=sys.stderr) + print(e.stderr, file=sys.stderr) + sys.exit(1) + + print("\nAll pushes complete.") + finally: + # Ensure the repo object is closed to release git handles. + repo.close() + + +def handle_auto_push(args): + print("Executing: auto-push workflow") + handle_split(args) + handle_commit(args) + handle_push(args) + + +def handle_download(args): + print("Executing: download") + repo_url = args.repo_url + + # 1. Parse GitHub URL + match = re.search(r"github\.com[/:]([^/]+)/([^/]+?)(?:\.git)?$", repo_url) + if not match: + print( + f"Error: Could not parse owner and repo from URL: {repo_url}", + file=sys.stderr, + ) + print( + "Please use a standard GitHub URL format (e.g., https://github.com/owner/repo.git)", + file=sys.stderr, + ) + sys.exit(1) + + owner, repo_name = match.groups() + + # 2. Set up local directory + if args.output: + download_dir = args.output + else: + download_dir = os.path.join(args.root_dir, repo_name) + download_dir = os.path.abspath(download_dir) + print(f"Target directory: {download_dir}") + if not os.path.exists(download_dir): + os.makedirs(download_dir) + + # 3. 
Prepare headers and proxies for API request + headers = {} + token = args.token or os.environ.get("GITHUB_TOKEN") + if token: + print("Using GitHub token for authentication.") + headers["Authorization"] = f"token {token}" + + proxies = None + if args.proxy: + print(f"Using proxy for API requests: {args.proxy}") + proxies = {"http": args.proxy, "https": args.proxy} + + # 4. Get all commit SHAs from GitHub API + print(f"Fetching commit list for {owner}/{repo_name}...") + all_commits = [] + api_url = f"https://api.github.com/repos/{owner}/{repo_name}/commits" + page_num = 0 + + while api_url: + try: + page_num += 1 + response = requests.get( + api_url, params={"per_page": 100}, proxies=proxies, headers=headers + ) + response.raise_for_status() + commits = response.json() + if not commits: + break + + num_commits_on_page = len(commits) + print( + f"Fetched page {page_num}, containing {num_commits_on_page} commit(s)..." + ) + + all_commits.extend([c["sha"] for c in commits]) + if "next" in response.links: + api_url = response.links["next"]["url"] + else: + api_url = None + except requests.exceptions.RequestException as e: + print(f"Error fetching commits from GitHub API: {e}", file=sys.stderr) + sys.exit(1) + + if not all_commits: + print("No commits found in the repository.") + return + + print( + f"Finished fetching. Total pages: {page_num}, Total commits: {len(all_commits)}" + ) + all_commits.reverse() # Order from oldest to newest + + # 5. 
Execute Git commands + def run_git_command(cmd, cwd, check=True, capture=True): + try: + print(f"Running: {' '.join(cmd)}") + result = subprocess.run( + cmd, + cwd=cwd, + check=check, + capture_output=capture, + text=True, + encoding="utf-8", + ) + if capture: + if result.stdout: + print(result.stdout.strip()) + if result.stderr: + print(result.stderr.strip(), file=sys.stderr) + return result + except subprocess.CalledProcessError as e: + print(f"Error executing command: {' '.join(e.cmd)}", file=sys.stderr) + if capture: + print(f"Exit Code: {e.returncode}", file=sys.stderr) + print(f"Stdout: {e.stdout}", file=sys.stderr) + print(f"Stderr: {e.stderr}", file=sys.stderr) + sys.exit(1) + except FileNotFoundError: + print( + "Error: 'git' command not found. Please ensure Git is installed and in your PATH.", + file=sys.stderr, + ) + sys.exit(1) + + # Initialize repo if it doesn't exist + if not os.path.exists(os.path.join(download_dir, ".git")): + run_git_command(["git", "init"], cwd=download_dir) + run_git_command(["git", "remote", "add", "origin", repo_url], cwd=download_dir) + last_downloaded_sha = None + else: + print("Existing git repository found.") + # Check for the last commit in the existing repo + result = run_git_command( + ["git", "rev-parse", "HEAD"], cwd=download_dir, check=False + ) + if result.returncode == 0: + last_downloaded_sha = result.stdout.strip() + print(f"Last downloaded commit: {last_downloaded_sha}") + else: + print("Could not determine last commit. Starting fresh download.") + last_downloaded_sha = None + + # Determine which commits to fetch + commits_to_fetch = all_commits + if last_downloaded_sha: + try: + last_commit_index = all_commits.index(last_downloaded_sha) + commits_to_fetch = all_commits[last_commit_index + 1 :] + print(f"Resuming download from commit after {last_downloaded_sha[:7]}...") + except ValueError: + print( + f"Warning: Last local commit {last_downloaded_sha[:7]} not found in remote history. Starting fresh download." 
+ ) + + if not commits_to_fetch: + print("Repository is already up to date.") + return + + total_to_fetch = len(commits_to_fetch) + print(f"Found {total_to_fetch} new commit(s) to download.") + + for i, commit_sha in enumerate(commits_to_fetch): + print(f"--- [{(i + 1)}/{total_to_fetch}] Fetching commit: {commit_sha} ---") + run_git_command(["git", "fetch", "origin", commit_sha], cwd=download_dir) + run_git_command(["git", "reset", "--hard", "FETCH_HEAD"], cwd=download_dir) + + print("\nDownload complete!") + print(f"Repository '{repo_name}' is now at the latest commit in '{download_dir}'.") + + +if __name__ == "__main__": + main() diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..19008fd --- /dev/null +++ b/requirements.txt @@ -0,0 +1,3 @@ +gitpython +requests +PySocks \ No newline at end of file diff --git a/src/main.py b/src/main.py deleted file mode 100644 index eeffd57..0000000 --- a/src/main.py +++ /dev/null @@ -1,61 +0,0 @@ -import sys -import os -import shutil -import subprocess -import argparse - - -def parse_arguments(): - parser = argparse.ArgumentParser(description='GitHub-ForceLargeFiles') - - parser.add_argument('--root_dir', type=str, default=os.getcwd(), help="Root directory to start traversing. Defaults to current working directory.") - parser.add_argument('--delete_original', type=bool, default=True, help="Do you want to delete the original (large) file after compressing to archives?") - - parser.add_argument('--partition_ext', type=str, default="7z", choices=["7z", "xz", "bzip2", "gzip", "tar", "zip", "wim"], help="Extension of the partitions. Recommended: 7z due to compression ratio and inter-OS compability.") - parser.add_argument('--cmds_into_7z', type=str, default="a", help="Commands to pass in to 7z.") - - # The two arguments below default to compressing files only if they are over 100 MB. 
- parser.add_argument('--threshold_size', type=int, default=100, help="Max threshold of the original file size to split into archive. I.e. files with sizes below this arg are ignored.") - parser.add_argument('--threshold_size_unit', type=str, default='m', choices=['b', 'k', 'm', 'g'], help="Unit of the threshold size specified (bytes, kilobytes, megabytes, gigabytes).") - - # The two arguments below default to creating archives with a maximum size of 95 MB. - parser.add_argument('--partition_size', type=int, default=95, help="Max size of an individual archive. May result in actual partition size to be higher than this value due to disk formatting. In that case, reduce this arg value.") - parser.add_argument('--partition_size_unit', type=str, default='m', choices=['b', 'k', 'm', 'g'], help="Unit of the partition size specified (bytes, kilobytes, megabytes, gigabytes).") - - args = parser.parse_args() - return args - - -def check_7z_install(): - if shutil.which("7z"): - return True - else: - sys.exit("ABORTED. You do not have 7z properly installed at this time. Make sure it is added to PATH.") - - -def is_over_threshold(f_full_dir, args): - size_dict = { - "b": 1e-0, - "k": 1e-3, - "m": 1e-6, - "g": 1e-9 - } - return os.stat(f_full_dir).st_size * size_dict[args.threshold_size_unit] >= args.threshold_size - - -def traverse_root_dir(args): - for root, _, files in os.walk(args.root_dir): - for f in files: - f_full_dir = os.path.join(root, f) - - if is_over_threshold(f_full_dir, args): - f_full_dir_noext, ext = os.path.splitext(f_full_dir) - prc = subprocess.run(["7z", "-v" + str(args.partition_size) + args.partition_size_unit, args.cmds_into_7z, f_full_dir_noext + "." + ext[1:] + "." 
+ args.partition_ext, f_full_dir]) - - if args.delete_original and prc.returncode == 0: - os.remove(f_full_dir) - - -if __name__ == '__main__': - check_7z_install() - traverse_root_dir(parse_arguments()) \ No newline at end of file diff --git a/src/reverse.py b/src/reverse.py deleted file mode 100644 index a45bb4a..0000000 --- a/src/reverse.py +++ /dev/null @@ -1,45 +0,0 @@ -import sys -import os -import shutil -import subprocess -import argparse - - -def parse_arguments(): - parser = argparse.ArgumentParser(description='GitHub-ForceLargeFiles_reverse') - - parser.add_argument('--root_dir', type=str, default=os.getcwd(), help="Root directory to start traversing. Defaults to current working directory.") - parser.add_argument('--delete_partitions', type=bool, default=True, help="Do you want to delete the partition archives after extracting the original files?") - - args = parser.parse_args() - return args - - -def check_7z_install(): - if shutil.which("7z"): - return True - else: - sys.exit("ABORTED. You do not have 7z properly installed at this time. 
Make sure it is added to PATH.") - - -def is_partition(f_full_dir): - return any(f_full_dir.endswith(ext) for ext in [".7z.001", ".xz.001", ".bzip2.001", ".gzip.001", ".tar.001", ".zip.001", ".wim.001"]) - - -def reverse_root_dir(args): - for root, _, files in os.walk(args.root_dir): - for f in files: - f_full_dir = os.path.join(root, f) - - if is_partition(f_full_dir): - prc = subprocess.run(["7z", "e", f_full_dir, "-o" + root]) - - if args.delete_partitions and prc.returncode == 0: - f_noext, _ = os.path.splitext(f) - os.chdir(root) - os.system("rm" + " \"" + f_noext + "\"*") - - -if __name__ == '__main__': - check_7z_install() - reverse_root_dir(parse_arguments()) \ No newline at end of file diff --git a/test_gfl.py b/test_gfl.py new file mode 100644 index 0000000..c04b520 --- /dev/null +++ b/test_gfl.py @@ -0,0 +1,265 @@ +import unittest +import os +import sys +import shutil +import hashlib +import argparse +import git +import stat +import random +import subprocess + +# --- Global Helper Functions for Manual Testing --- + + +def get_file_hash(file_path): + """Calculates the SHA256 hash of a file.""" + sha256 = hashlib.sha256() + with open(file_path, "rb") as f: + while chunk := f.read(8192): + sha256.update(chunk) + return sha256.hexdigest() + + +def create_dummy_file(file_path, size_mb): + """Create a dummy file of a given size in MB and return its hash.""" + print(f"Creating dummy file: {file_path} ({size_mb:.2f}MB)") + os.makedirs(os.path.dirname(file_path), exist_ok=True) + with open(file_path, "wb") as f: + f.write(os.urandom(int(size_mb * 1024 * 1024))) + print("File created.") + return get_file_hash(file_path) + + +def handle_generate_random_files(args, root_dir): + """Handler for the --generate-random-files command.""" + total_size_mb = args.generate_random_files + num_files = args.num_files + print( + f"Generating {num_files} random files with a total size of {total_size_mb}MB in '{root_dir}'..." 
+ ) + + if num_files > 1: + split_points = sorted( + [0] + + [random.uniform(0, total_size_mb) for _ in range(num_files - 1)] + + [total_size_mb] + ) + file_sizes = [split_points[i + 1] - split_points[i] for i in range(num_files)] + else: + file_sizes = [total_size_mb] + + subdirs = ["", "movies", "data/logs", "assets/audio"] + for i, size in enumerate(file_sizes): + if size < 0.01: + continue + + chosen_subdir_rel = random.choice(subdirs) + file_name = f"random_file_{i+1}.{random.choice(['dat', 'bin', 'iso', 'zip'])}" + file_path = os.path.join(root_dir, chosen_subdir_rel, file_name) + create_dummy_file(file_path, size) + print("\nGeneration complete.") + + +class TestGFL(unittest.TestCase): + + def setUp(self): + """Set up a temporary test environment outside the project folder.""" + project_root = os.path.dirname(os.path.abspath(__file__)) + self.test_dir = os.path.join(os.path.dirname(project_root), "temp_gfl_test_dir") + + if os.path.exists(self.test_dir): + self.robust_rmtree(self.test_dir) + + os.makedirs(self.test_dir, exist_ok=True) + self.repo = git.Repo.init(self.test_dir) + self.dummy_author = git.Actor("GFL Test Bot", "gfl-bot@example.com") + + def tearDown(self): + """Clean up the test environment unless --no-cleanup is passed.""" + self.repo.close() + if not NO_CLEANUP and os.path.exists(self.test_dir): + self.robust_rmtree(self.test_dir) + + def robust_rmtree(self, path): + """Robustly remove a directory, handling read-only files from .git.""" + + def on_rm_error(func, path, exc_info): + # Handle read-only files, especially in .git directories + os.chmod(path, stat.S_IWRITE) + func(path) + + shutil.rmtree(path, onerror=on_rm_error) + + def run_gfl_command(self, args, check=True): + """Helper to run gfl.py as a subprocess from the test directory.""" + gfl_script_path = os.path.abspath( + os.path.join(os.path.dirname(__file__), "gfl.py") + ) + # Pass --no-status-check to bypass git status checks which are not relevant for these tests + cmd = 
[sys.executable, gfl_script_path] + args + ["--no-status-check"] + print(f"\nExecuting command: {' '.join(cmd)}") + result = subprocess.run( + cmd, check=check, capture_output=True, text=True, cwd=self.test_dir + ) + if result.stdout: + print("STDOUT:\n" + result.stdout) + if result.stderr: + print("STDERR:\n" + result.stderr, file=sys.stderr) + return result + + def test_01_split_and_reverse_integrity(self): + """Test that a file is split and then reversed to its original state via CLI.""" + print("\n--- Running test: test_01_split_and_reverse_integrity ---") + large_file_path = os.path.join(self.test_dir, "large_file.bin") + original_hash = create_dummy_file(large_file_path, 50) + + self.run_gfl_command( + ["split", "--threshold-size", "40", "--partition-size", "20"] + ) + self.assertFalse(os.path.exists(large_file_path)) + + self.run_gfl_command(["reverse"]) + self.assertTrue(os.path.exists(large_file_path)) + self.assertEqual(original_hash, get_file_hash(large_file_path)) + print("Integrity check passed!") + + def test_02_full_git_workflow_multi_file(self): + """Test split -> commit with multiple files of varying sizes via CLI.""" + print("\n--- Running test: test_02_full_git_workflow_multi_file ---") + with open(os.path.join(self.test_dir, "dummy.txt"), "w") as f: + f.write("hello") + self.repo.git.add(".") + self.repo.index.commit("Initial commit", author=self.dummy_author) + + create_dummy_file(os.path.join(self.test_dir, "videos", "video_large.mp4"), 60) + create_dummy_file(os.path.join(self.test_dir, "image_small.jpg"), 5) + + self.run_gfl_command( + ["split", "--threshold-size", "50", "--partition-size", "25"] + ) + self.run_gfl_command(["commit"]) + + repo = git.Repo(self.test_dir) + commits = list(repo.iter_commits("HEAD", max_count=3)) + + self.assertGreaterEqual(len(commits), 2) + commit_messages = [c.message for c in commits] + self.assertTrue( + any("Update large file chunks" in msg for msg in commit_messages) + ) + self.assertTrue(any("Update 
regular files" in msg for msg in commit_messages)) + + gfl_commit = next(c for c in commits if "Update large file chunks" in c.message) + gfl_files = list(gfl_commit.stats.files.keys()) + self.assertIn(".gfl_manifest.json", gfl_files) + self.assertTrue( + any( + f.replace("\\", "/").startswith("videos/video_large.mp4.xz.") + for f in gfl_files + ) + ) + + normal_commit = next(c for c in commits if "Update regular files" in c.message) + self.assertIn("image_small.jpg", normal_commit.stats.files) + print("Git prioritized commit test passed.") + + +def run_manual_step(step, root_dir): + print(f"\n--- Running manual step: {step} in directory: {root_dir} ---") + if step == "setup": + if os.path.exists(root_dir) and os.listdir(root_dir): + print( + "Error: Manual test directory is not empty. Please clean it up first." + ) + return + repo = git.Repo.init(root_dir) + dummy_author = git.Actor("GFL Manual Test", "manual@example.com") + with open(os.path.join(root_dir, "dummy.txt"), "w") as f: + f.write("hello") + repo.git.add(".") + repo.index.commit("Initial commit", author=dummy_author, committer=dummy_author) + print("Setup complete. Dummy repo created.") + repo.close() + return + + gfl_script_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "gfl.py") + + cmd = [sys.executable, gfl_script_path, step, "--root_dir", root_dir] + + print(f"Executing command: {' '.join(cmd)}") + try: + subprocess.run(cmd, check=True) + except subprocess.CalledProcessError as e: + print( + f"\n--- Step '{step}' failed with exit code {e.returncode} ---", + file=sys.stderr, + ) + except FileNotFoundError: + print(f"Error: Could not find the script at {gfl_script_path}", file=sys.stderr) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser( + description="GFL Test Runner. Use --manual for manual testing steps." 
+    )
+    parser.add_argument(
+        "--no-cleanup",
+        action="store_true",
+        help="Do not clean up the automated test directory after tests.",
+    )
+    parser.add_argument(
+        "--manual", action="store_true", help="Enable manual testing mode."
+    )
+    parser.add_argument(
+        "--step", help="[Manual Mode] The step to run (setup, split, commit, reverse)."
+    )
+    parser.add_argument(
+        "--create-file",
+        help="[Manual Mode] Create a dummy file. Provide relative path.",
+    )
+    parser.add_argument(
+        "--size",
+        type=int,
+        default=100,
+        help="[Manual Mode] Size of the dummy file in MB.",
+    )
+    parser.add_argument(
+        "--generate-random-files",
+        type=int,
+        metavar="TOTAL_MB",
+        help="[Manual Mode] Generate multiple random files that sum to TOTAL_MB.",
+    )
+    parser.add_argument(
+        "--num-files",
+        type=int,
+        default=10,
+        help="[Manual Mode] Number of files to generate for --generate-random-files.",
+    )
+
+    args, unknown = parser.parse_known_args()
+
+    if args.manual:
+        project_root = os.path.dirname(os.path.abspath(__file__))
+        manual_test_dir = os.path.join(
+            os.path.dirname(project_root), "temp_gfl_manual_test_dir"
+        )
+        os.makedirs(manual_test_dir, exist_ok=True)
+
+        if args.generate_random_files:
+            handle_generate_random_files(args, manual_test_dir)
+        elif args.create_file:
+            create_dummy_file(
+                os.path.join(manual_test_dir, args.create_file), args.size
+            )
+        elif args.step:
+            run_manual_step(args.step, manual_test_dir)
+        else:
+            print("Error: Manual mode requires a command. See --help.")
+    elif any([args.step, args.create_file, args.generate_random_files]):
+        print("Error: Manual testing flags can only be used with the --manual flag.")
+    else:
+        print("--- Running Automated Tests ---")
+        NO_CLEANUP = args.no_cleanup
+        # Pass remaining args to unittest
+        unittest.main(argv=[sys.argv[0]] + unknown)