Home
1FileLLM is a command-line and Flask-based web tool that consolidates text from various sources—such as GitHub repositories, GitHub pull requests/issues, local folders, academic papers, YouTube transcripts, and web pages—into a single, LLM-ready text file. The main goal is to enable fast and seamless creation of information-dense prompts for Large Language Models (LLMs). Text is automatically preprocessed, optionally compressed, and copied to the clipboard for immediate use in LLMs.
Key Objectives
- Streamline text ingestion from multiple external data sources (local files, GitHub repos, YouTube transcripts, academic PDFs, etc.).
- Provide both a command-line and a web-based interface to process and download results.
- Preprocess text by removing stopwords and unnecessary characters, and optionally encapsulate it in XML tags for better LLM performance.
- Report token counts for both compressed and uncompressed text, simplifying prompt-size management.
Features
- Automatic detection of source type (local, GitHub, ArXiv, etc.).
- Support for multiple file formats: `.py`, `.ipynb`, `.txt`, `.md`, PDFs, etc.
- Web crawling with user-defined link depth.
- Optionally retrieve research papers from Sci-Hub via DOI/PMID.
- Clipboard copy of the processed text.
- Token count reporting for immediate insight into LLM prompt sizing.
Architecture
1FileLLM's architecture comprises:

- Command-Line Interface (CLI) (`onefilellm.py`):
  - Entry point via `main()` to process URLs/paths.
  - Detects source type and dispatches to the relevant processing function.
  - Encapsulates logic for reading/writing output files, preprocessing text, and handling environment variables.
- Web Interface (`web_app.py`):
  - A Flask-based HTTP server providing a front-end form for users to submit paths/URLs.
  - Mirrors CLI functionality by calling the same underlying processing functions from `onefilellm.py`.
  - Returns processed outputs as downloadable files and token counts in a web page.
- Processing Modules (all in `onefilellm.py` but grouped conceptually):
  - GitHub Operations
    - `process_github_repo`: Recursively fetches files from a repository, downloading only allowed file types.
    - `process_github_pull_request` & `process_github_issue`: Gather PR or issue details, plus full repo content.
  - PDF/Text Retrieval
    - `process_arxiv_pdf`: Retrieves and extracts text from ArXiv PDFs.
    - `process_doi_or_pmid`: Pulls PDFs via Sci-Hub for a given DOI/PMID, then extracts text.
    - `process_local_folder`: Reads local directories, processing recognized file types.
    - `fetch_youtube_transcript`: Uses the YouTubeTranscriptApi to retrieve transcripts.
  - Web Crawling
    - `crawl_and_extract_text`: Recursively scrapes web pages (HTML and optional PDFs), collecting text up to a user-defined depth.
  - Preprocessing & Token Counting
    - `preprocess_text`: Normalizes text, removing punctuation and stopwords, optionally preserving an XML structure.
    - `get_token_count`: Tokenizes text for reporting, using `tiktoken`.
- Shared Utilities
  - Token & Stopword Handling (`nltk`, `tiktoken`): For text tokenization, compression, and cleaning.
  - Clipboard Integration (`pyperclip`): Copies final uncompressed text.
  - PDF Libraries (`PyPDF2`): Extracts text from PDFs.
  - HTML Parsing (`BeautifulSoup`): Used in web crawling and GitHub API JSON parsing.
  - Environment Variable Handling: Leverages `GITHUB_TOKEN` for private GitHub repo access.
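As a rough illustration of the Preprocessing & Token Counting module, here is a minimal stand-in for `preprocess_text`. This is a sketch only: the real function uses NLTK's full English stopword list, while this version hardcodes a tiny set for brevity.

```python
import string

# Tiny stand-in for NLTK's English stopword list (illustrative only).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def preprocess_text(text: str) -> str:
    # Lowercase the text, strip ASCII punctuation, and drop stopwords.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in cleaned.split() if w not in STOPWORDS)
```

Dropping stopwords and punctuation is what makes `compressed_output.txt` noticeably smaller in tokens than the uncompressed output.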
Data Flow
1. User Input
   - Provided via command-line arguments or via a Flask web form.
2. Dispatch
   - `main()` or `web_app.py` identifies the source and calls the corresponding processing function.
3. Fetch & Parse
   - PDF text extraction, web crawling, GitHub file downloads, or local file reads.
4. Preprocess
   - Normalizes text by stripping stopwords and extraneous characters.
   - Optionally wraps output in XML tags for structured LLM input.
5. Output
   - Generates `uncompressed_output.txt` and `compressed_output.txt`.
   - Copies `uncompressed_output.txt` to the clipboard.
   - Provides optional download from the web interface.
   - Shows token counts for each output.
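The Dispatch step can be pictured as a small classifier over the input string. This is a hypothetical sketch, not the actual detection code in `onefilellm.py`, which handles more cases:

```python
from pathlib import Path
from urllib.parse import urlparse

def detect_source_type(source: str) -> str:
    # Local paths take precedence over URLs.
    if Path(source).exists():
        return "local"
    host = urlparse(source).netloc.lower()
    if "github.com" in host:
        # Distinguish PRs and issues from plain repository URLs.
        if "/pull/" in source:
            return "github_pr"
        if "/issues/" in source:
            return "github_issue"
        return "github_repo"
    if "arxiv.org" in host:
        return "arxiv"
    if "youtube.com" in host or "youtu.be" in host:
        return "youtube"
    return "webpage"
```

Each returned label maps to one processing function (e.g., `"github_repo"` to `process_github_repo`).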
Installation
Clone the repository:

```shell
git clone https://github.com/jimmc414/1filellm.git
cd 1filellm
```

Install dependencies using the provided `requirements.txt`:

```shell
pip install -r requirements.txt
```

(Optionally, create a virtual environment before installing.)

For private GitHub repository access, set a `GITHUB_TOKEN` environment variable:

Windows:

```shell
setx GITHUB_TOKEN "YourGitHubToken"
```

Linux/macOS:

```shell
echo 'export GITHUB_TOKEN="YourGitHubToken"' >> ~/.bashrc
source ~/.bashrc
```
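To show how the variable is consumed, here is a sketch of attaching `GITHUB_TOKEN` to a GitHub API request. `build_github_request` is an illustrative helper, not a function from `onefilellm.py`:

```python
import os
import urllib.request

def build_github_request(path: str) -> urllib.request.Request:
    # Construct a GitHub REST API request for the given path.
    req = urllib.request.Request(f"https://api.github.com{path}")
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        # Authenticated requests get higher rate limits and private-repo access.
        req.add_header("Authorization", f"token {token}")
    return req
```

Without the token, the request is still valid for public repositories, only subject to stricter rate limits.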
Usage
You can run 1FileLLM in two ways.

Command line:

```shell
python onefilellm.py
```

- You will be prompted to enter a local path, GitHub URL, YouTube link, etc.

Or directly provide an argument:

```shell
python onefilellm.py https://github.com/jimmc414/onefilellm
```

- The script detects the source type (e.g., GitHub repo) and processes accordingly.
- Output files:
  - `uncompressed_output.txt`: Full text (copied to clipboard).
  - `compressed_output.txt`: Preprocessed text (cleaner, fewer tokens).
  - `processed_urls.txt`: (For web crawls) Contains each visited URL.

Web interface:

- Launch the web server:

  ```shell
  python web_app.py
  ```

- Access `http://localhost:5000` in your browser.
- Provide the input path/URL in the text field, click Process, and view/download the output.
Configuration
- File Extensions:
  - Editable in `onefilellm.py` under `is_allowed_filetype()`. By default, includes `.py`, `.txt`, `.md`, `.ipynb`, etc.
- Max Depth for Crawling:
  - In `crawl_and_extract_text`, the `max_depth` argument controls how deeply linked pages are followed. The default is `2`.
- ArXiv & Sci-Hub:
  - The tool constructs standard PDF URLs from ArXiv links.
  - The Sci-Hub domain is hardcoded (`sci-hub.se`). Modify the code if needed.
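The extension filter can be sketched as follows; `ALLOWED_EXTENSIONS` here lists only the defaults named above, while the real `is_allowed_filetype()` in `onefilellm.py` checks a longer list:

```python
from pathlib import Path

# Defaults mentioned above; the actual allow-list in onefilellm.py is longer.
ALLOWED_EXTENSIONS = {".py", ".txt", ".md", ".ipynb"}

def is_allowed_filetype(filename: str) -> bool:
    # Compare the file's suffix against the allow-list, case-insensitively.
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS
```

Adding a new file type is then a one-line change to the allow-list.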
Contributing
- Fork the repository and create your own feature branch from `main`.
- Add/modify tests in `test_onefilellm.py` to cover any new functionality.
- Run the test suite to ensure everything passes:

  ```shell
  python -m unittest test_onefilellm.py
  ```

- Submit a pull request with a detailed description of your changes.
Preferred Contributions
- Improvements to text preprocessing or token counting.
- Additional source type integrations (e.g., other API endpoints).
- Performance optimizations or caching.
- Security enhancements (token encryption, better error handling).
FAQ
Q1: Why am I getting an error about `GITHUB_TOKEN`?
A1: You must set a valid `GITHUB_TOKEN` if you want to access private repos. Public repos will still work without it.
Q2: The script can’t find my local PDF.
A2: Confirm the file’s absolute path or current directory context. Also check allowed file extensions.
Q3: Web crawling is slow or fails unexpectedly.
A3: Large or complex sites can cause performance or request-limit issues. Consider reducing `max_depth` or limiting the domain.
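Bounding a crawl by depth and domain, as suggested above, can be sketched like this. `get_links` is an assumed helper returning the hrefs found on a page; the real `crawl_and_extract_text` also downloads each page and extracts its text:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl_urls(start_url, get_links, max_depth=2):
    # Breadth-first crawl: stop expanding at max_depth, stay on one domain.
    domain = urlparse(start_url).netloc
    seen, order = {start_url}, [start_url]
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # depth cap: do not follow links from this page
        for href in get_links(url):
            link = urljoin(url, href)
            # Domain filter keeps the crawl from wandering off-site.
            if link not in seen and urlparse(link).netloc == domain:
                seen.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order
```

Lowering `max_depth` or tightening the domain check are the two levers the answer above refers to.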
Q4: Sci-Hub isn’t returning a PDF.
A4: Sci-Hub might be unavailable or blocking requests from your region. Try again later, or update the Sci-Hub domain in `process_doi_or_pmid()`.
Q5: Token counts differ from my LLM’s actual usage.
A5: Different LLMs and tokenizers can yield slightly different counts. The included `tiktoken` library is an approximation for certain model families.