Jim McMillan edited this page Jan 6, 2025 · 1 revision

Overview

1FileLLM is a command-line and Flask-based web tool that consolidates text from various sources—such as GitHub repositories, GitHub pull requests/issues, local folders, academic papers, YouTube transcripts, and web pages—into a single, LLM-ready text file. The main goal is to enable fast and seamless creation of information-dense prompts for Large Language Models (LLMs). Text is automatically preprocessed, optionally compressed, and copied to the clipboard for immediate use in LLMs.

Key Objectives

  • Streamline text ingestion from multiple external data sources (local files, GitHub repos, YouTube transcripts, academic PDFs, etc.).
  • Provide both a command-line and a web-based interface to process and download results.
  • Preprocess text by removing stopwords and unnecessary characters, and optionally encapsulate it in XML tags for better LLM performance.
  • Report token counts for both compressed and uncompressed text, simplifying prompt-size management.

Features

  • Automatic detection of source type (local, GitHub, ArXiv, etc.).
  • Support for multiple file formats: .py, .ipynb, .txt, .md, PDFs, etc.
  • Web crawling with user-defined link depth.
  • Optional retrieval of research papers from Sci-Hub via DOI/PMID.
  • Clipboard copy of the processed text.
  • Token count reporting for immediate insight into LLM prompt sizing.

Architecture

1FileLLM’s architecture comprises:

  • Command-Line Interface (CLI) (onefilellm.py):

    • Entry point via main() to process URLs/paths.
    • Detects source type and dispatches to the relevant processing function.
    • Encapsulates logic for reading/writing output files, preprocessing text, and handling environment variables.
  • Web Interface (web_app.py):

    • A Flask-based HTTP server providing a front-end form for users to submit paths/URLs.
    • Mirrors CLI functionality by calling the same underlying processing functions from onefilellm.py.
    • Returns processed outputs as downloadable files and token counts in a web page.
  • Processing Modules (all in onefilellm.py but grouped conceptually):

    • GitHub Operations
      • process_github_repo: Recursively fetches files from a repository, downloading only allowed file types.
      • process_github_pull_request & process_github_issue: Gather PR or issue details, plus full repo content.
    • PDF/Text Retrieval
      • process_arxiv_pdf: Retrieves and extracts text from ArXiv PDFs.
      • process_doi_or_pmid: Pulls PDFs via Sci-Hub for a given DOI/PMID, then extracts text.
      • process_local_folder: Reads local directories, processing recognized file types.
      • fetch_youtube_transcript: Uses the YouTubeTranscriptApi to retrieve transcripts.
    • Web Crawling
      • crawl_and_extract_text: Recursively scrapes web pages (HTML and optional PDFs), collecting text up to a user-defined depth.
    • Preprocessing & Token Counting
      • preprocess_text: Normalizes text, removing punctuation and stopwords, optionally preserving an XML structure.
      • get_token_count: Tokenizes text for reporting, using tiktoken.
  • Shared Utilities

    • Token & Stopword Handling (nltk, tiktoken): For text tokenization, compression, and cleaning.
    • Clipboard Integration (pyperclip): Copies final uncompressed text.
    • PDF Libraries (PyPDF2): Extracts text from PDFs.
    • HTML Parsing (BeautifulSoup): Extracts text from fetched HTML pages during web crawling.
    • Environment Variable Handling: Leverages GITHUB_TOKEN for private GitHub repo access.
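
The dispatch described above can be sketched as a simple source-type classifier. This is an illustrative simplification, not the actual code in onefilellm.py; the function name and return labels are assumptions:

```python
import re

def detect_source_type(source: str) -> str:
    """Simplified sketch of 1FileLLM's source detection (names are
    illustrative, not the real implementation)."""
    if "github.com" in source:
        if "/pull/" in source:
            return "github_pr"
        if "/issues/" in source:
            return "github_issue"
        return "github_repo"
    if "arxiv.org" in source:
        return "arxiv"
    if "youtube.com" in source or "youtu.be" in source:
        return "youtube"
    if re.match(r"^10\.\d{4,9}/", source):  # bare DOI, e.g. 10.1000/xyz
        return "doi"
    if source.startswith(("http://", "https://")):
        return "webpage"
    return "local_path"
```

Each label would then map to one of the processing functions listed above (process_github_repo, process_arxiv_pdf, and so on).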

Data Flow

  1. User Input
    • Provided via command-line arguments or via a Flask web form.
  2. Dispatch
    • main() or web_app.py identifies the source and calls the corresponding processing function.
  3. Fetch & Parse
    • PDF text extraction, web crawling, GitHub file downloads, or local file reads.
  4. Preprocess
    • Normalizes text by stripping stopwords and extraneous characters.
    • Optionally wraps output in XML tags for structured LLM input.
  5. Output
    • Generates uncompressed_output.txt and compressed_output.txt.
    • Copies uncompressed_output.txt to clipboard.
    • Provides optional download from the web interface.
    • Shows token counts for each output.
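
The preprocessing step (stage 4 above) can be sketched as follows. This is a minimal illustration using a tiny hardcoded stopword set; the actual preprocess_text in onefilellm.py uses NLTK's full English stopword list and may differ in detail:

```python
import re

# Tiny illustrative stopword set; the real tool loads NLTK's English list.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def preprocess_text(text: str) -> str:
    """Sketch of the compression step: lowercase, strip punctuation,
    drop stopwords."""
    text = re.sub(r"[^\w\s]", "", text.lower())  # remove punctuation
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)

print(preprocess_text("The quick, brown fox is in the yard."))
# → quick brown fox yard
```

This is why compressed_output.txt carries fewer tokens than uncompressed_output.txt.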

Setup and Installation

1. Clone or Download

git clone https://github.com/jimmc414/1filellm.git
cd 1filellm

2. Install Dependencies

Use the provided requirements.txt:

pip install -r requirements.txt

(Optionally, create a virtual environment before installing.)

3. Configure GitHub Token (Optional)

For private GitHub repository access, set a GITHUB_TOKEN environment variable:

Windows

setx GITHUB_TOKEN "YourGitHubToken"

Linux/macOS

echo 'export GITHUB_TOKEN="YourGitHubToken"' >> ~/.bashrc
source ~/.bashrc
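
Once set, the token is read from the environment at runtime and attached to GitHub API requests. A minimal sketch (the helper name is hypothetical, not from onefilellm.py):

```python
import os

def github_headers() -> dict:
    """Build GitHub API request headers, attaching the token when present
    (illustrative helper; not the actual function in onefilellm.py)."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        headers["Authorization"] = f"token {token}"
    return headers
```

Without the variable, requests go out unauthenticated, which is sufficient for public repositories but fails for private ones.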

Usage

You can run 1FileLLM in two ways:

1. Command-Line Interface (CLI)

python onefilellm.py
  • You will be prompted to enter a local path, GitHub URL, YouTube link, etc.

Or directly provide an argument:

python onefilellm.py https://github.com/jimmc414/onefilellm
  • The script detects the source type (e.g., GitHub repo) and processes accordingly.
  • Output files:
    • uncompressed_output.txt: Full text (copied to clipboard).
    • compressed_output.txt: Preprocessed text (cleaner, fewer tokens).
    • processed_urls.txt: (For web crawls) Contains each visited URL.

2. Web Interface (Flask)

  1. Launch the web server:

    python web_app.py
  2. Access http://localhost:5000 in your browser.

  3. Provide the input path/URL in the text field, click Process, and view/download the output.


Configuration

  • File Extensions:
    • Editable in onefilellm.py under is_allowed_filetype(). By default, includes .py, .txt, .md, .ipynb, etc.
  • Max Depth for Crawling:
    • In crawl_and_extract_text, the max_depth argument controls how deeply linked pages are followed. The default is 2.
  • ArXiv & Sci-Hub:
    • The tool constructs standard PDF URLs from ArXiv links.
    • Sci-Hub domain is hardcoded (sci-hub.se). Modify code if needed.
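
The extension filter works along these lines; the set shown here is illustrative, and the authoritative default list lives in is_allowed_filetype() in onefilellm.py:

```python
# Illustrative extension filter; edit the set in onefilellm.py's
# is_allowed_filetype() to change which files are ingested.
ALLOWED_EXTENSIONS = {".py", ".txt", ".md", ".ipynb"}

def is_allowed_filetype(filename: str) -> bool:
    """Return True if the file's extension is in the allowed set."""
    return any(filename.endswith(ext) for ext in ALLOWED_EXTENSIONS)
```

Files that fail this check are skipped during repository and local-folder traversal.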

Contributing Guidelines

  1. Fork the repository and create your own feature branch from main.
  2. Add/modify tests in test_onefilellm.py to cover any new functionality.
  3. Run the test suite to ensure everything passes:
    python -m unittest test_onefilellm.py
  4. Submit a pull request with a detailed description of your changes.

Preferred Contributions

  • Improvements to text preprocessing or token counting.
  • Additional source type integrations (e.g., other API endpoints).
  • Performance optimizations or caching.
  • Security enhancements (token encryption, better error handling).

FAQ / Troubleshooting

Q1: Why am I getting an error about GITHUB_TOKEN?
A1: You must set a valid GITHUB_TOKEN if you want to access private repos. Public repos will still work without it.

Q2: The script can’t find my local PDF.
A2: Confirm the file’s absolute path or current directory context. Also check allowed file extensions.

Q3: Web crawling is slow or fails unexpectedly.
A3: Large or complex sites can cause performance or request-limit issues. Consider reducing max_depth or limiting the domain.

Q4: Sci-Hub isn’t returning a PDF.
A4: Sci-Hub might be unavailable or blocking requests from your region. Try again later, or update the Sci-Hub domain in process_doi_or_pmid().

Q5: Token counts differ from my LLM’s actual usage.
A5: Different LLMs and tokenizers can yield slightly different counts. The included tiktoken library is an approximation for certain model families.
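
To see why counts diverge, compare a naive whitespace count against a subword tokenizer: BPE tokenizers such as tiktoken typically split long or rare words into multiple pieces, so they usually report more tokens than a word count suggests. A stdlib-only sketch of the naive side:

```python
def whitespace_token_count(text: str) -> int:
    """Naive word count; BPE tokenizers (tiktoken, etc.) often report more
    tokens because long words split into subword pieces."""
    return len(text.split())

sample = "Tokenization granularity varies between model families."
print(whitespace_token_count(sample))  # → 6 words; a BPE count may be higher
```

Treat any single tokenizer's count as an estimate and leave headroom against your model's context limit.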

