2 changes: 1 addition & 1 deletion .devcontainer/Dockerfile
@@ -34,4 +34,4 @@ RUN curl -sSL https://install.python-poetry.org | python3 -

# Verify installations
RUN exiftool -ver && \
ffmpeg -version
ffmpeg -version
5 changes: 1 addition & 4 deletions .devcontainer/devcontainer.json
@@ -8,7 +8,6 @@
}
},

// Configure tool-specific properties
"customizations": {
"vscode": {
"settings": {
@@ -46,15 +45,13 @@
}
},

// Install project dependencies and dev tools
"postCreateCommand": "pip install --user -e '.[dev]' && pip install hatch pre-commit pytest mypy black isort pytest-mock",

// Comment out to connect as root instead
"remoteUser": "vscode",

"features": {
"ghcr.io/devcontainers/features/git:1": {},
"ghcr.io/devcontainers/features/github-cli:1": {},
"ghcr.io/devcontainers-contrib/features/hatch:2": {}
}
}
}
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -21,3 +21,4 @@ repos:
- id: trailing-whitespace
- id: check-yaml
- id: check-json
exclude: "^.devcontainer/"
90 changes: 89 additions & 1 deletion README.md
@@ -1,6 +1,6 @@
# 📚 Readium

A powerful Python tool for extracting, analyzing, and converting documentation from repositories and directories into accessible formats.
A powerful Python tool for extracting, analyzing, and converting documentation from repositories, directories, and URLs into accessible formats.

<p align="center">
<img src="logo.webp" alt="Readium" width="80%">
@@ -12,13 +12,18 @@ A powerful Python tool for extracting, analyzing, and converting documentation f
- Support for private repositories using tokens
- Branch selection for Git repositories
- Secure token handling and masking
- 🌐 **Process webpages and URLs** to convert directly to Markdown
- Extract main content from documentation websites
- Convert HTML to well-formatted Markdown
- Support for tables, links, and images in converted content
- 🔄 **Convert multiple document formats** to Markdown using MarkItDown integration
- 🎯 **Target specific subdirectories** for focused analysis
- ⚡ **Process a wide range of file types**:
- Documentation files (`.md`, `.mdx`, `.rst`, `.txt`)
- Code files (`.py`, `.js`, `.java`, etc.)
- Configuration files (`.yml`, `.toml`, `.json`, etc.)
- Office documents with MarkItDown (`.pdf`, `.docx`, `.xlsx`, `.pptx`)
- Webpages and HTML via direct URL processing
- 🎛️ **Highly configurable**:
- Customizable file size limits
- Flexible file extension filtering
@@ -59,6 +64,9 @@ readium https://github.com/username/repository -b feature-branch
# Process a private Git repository with token
readium https://token@github.com/username/repository

# Process a webpage and convert to Markdown
readium https://example.com/documentation

# Save output to a file
readium /path/to/directory -o output.md

@@ -85,6 +93,12 @@ readium /path/to/directory --debug

# Generate split files for fine-tuning
readium /path/to/directory --split-output ./training-data/

# Process URL with content preservation mode
readium https://example.com/docs --url-mode full

# Process URL with main content extraction (default)
readium https://example.com/docs --url-mode clean
```

### Python API
@@ -118,12 +132,68 @@ summary, tree, content = reader.read_docs(
# Process private Git repository with token
summary, tree, content = reader.read_docs('https://token@github.com/username/repo')

# Process a webpage and convert to Markdown
summary, tree, content = reader.read_docs('https://example.com/documentation')

# Access results
print("Summary:", summary)
print("\nFile Tree:", tree)
print("\nContent:", content)
```
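
The examples above pass local directories, Git repository URLs, and plain webpage URLs to the same `read_docs` call. How Readium tells them apart is not shown here; the sketch below is one plausible way to classify an input, using only the standard library. The `classify_source` helper and the `GIT_HOSTS` list are hypothetical names introduced for illustration, not part of Readium's API.

```python
from urllib.parse import urlparse

# Hosts treated as Git providers in this sketch (an assumed, incomplete list).
GIT_HOSTS = {"github.com", "gitlab.com", "bitbucket.org"}


def classify_source(source: str) -> str:
    """Return 'git', 'url', or 'directory' for a read_docs-style input.

    Hypothetical helper: Readium's real detection logic may differ.
    """
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # .hostname drops any 'token@' userinfo prefix and the port.
        host = (parsed.hostname or "").lower()
        if host in GIT_HOSTS or parsed.path.endswith(".git"):
            return "git"
        return "url"
    # Anything without an http(s) scheme is treated as a local path.
    return "directory"
```

Under these assumptions, a tokenized GitHub URL is still classified as a Git repository, because the token lives in the URL's userinfo part rather than in the hostname.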

## 🌐 URL to Markdown

Readium can process web pages and convert them directly to Markdown:

```bash
# Process a webpage
readium https://example.com/documentation

# Save the output to a file
readium https://example.com/documentation -o docs.md

# Process URL preserving more content
readium https://example.com/documentation --url-mode full

# Process URL extracting only main content (default)
readium https://example.com/documentation --url-mode clean
```

### URL Conversion Configuration

The URL-to-Markdown conversion is configured with the following option:

- `--url-mode`: Processing mode (`clean` or `full`)
- `clean` (default): Extracts only the main content, ignoring menus, ads, etc.
- `full`: Attempts to preserve most of the page content
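
To make the difference between the two modes concrete, here is a minimal sketch of how `clean` and `full` could translate into extraction options. The keyword names mirror parameters of trafilatura's `extract()` function (`favor_precision`, `favor_recall`, `include_tables`, and so on), but the mapping itself and the `extraction_options` helper are assumptions made for illustration, not Readium's actual implementation.

```python
from typing import Any, Dict


def extraction_options(url_mode: str = "clean", **overrides: Any) -> Dict[str, Any]:
    """Build keyword arguments for a trafilatura-style extract() call.

    Hypothetical helper: the mode-to-flag mapping is an assumption.
    """
    if url_mode not in ("clean", "full"):
        raise ValueError(f"url_mode must be 'clean' or 'full', got {url_mode!r}")

    options: Dict[str, Any] = {
        "output_format": "markdown",
        "include_tables": True,
        "include_images": True,
        "include_links": True,
        "include_comments": False,
        # 'clean' favors precision (drop menus, ads); 'full' favors recall.
        "favor_precision": url_mode == "clean",
        "favor_recall": url_mode == "full",
    }
    # Caller-supplied values win over the defaults above.
    options.update(overrides)
    return options
```

With trafilatura installed, the resulting dictionary could be passed along as `trafilatura.extract(html, **extraction_options("full"))`, assuming a release that supports Markdown output.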

### Python API for URLs

```python
from readium import Readium, ReadConfig

# Configure with URL options
config = ReadConfig(
url_mode="clean", # 'clean' or 'full'
include_tables=True,
include_images=True,
include_links=True,
include_comments=False,
debug=True
)

reader = Readium(config)

# Process a URL
summary, tree, content = reader.read_docs('https://example.com/documentation')

# Save the content
with open('documentation.md', 'w', encoding='utf-8') as f:
f.write(content)
```

Readium uses [trafilatura](https://github.com/adbar/trafilatura) for web content extraction and conversion. Trafilatura is especially effective at extracting the main content from technical documentation, tutorials, and other web resources.

## 🔧 Configuration

The `ReadConfig` class supports the following options:
@@ -151,6 +221,15 @@ config = ReadConfig(
# Specify extensions for MarkItDown processing
markitdown_extensions={'.pdf', '.docx', '.xlsx'},

# URL processing mode: 'clean' or 'full'
url_mode='clean',

# URL content options
include_tables=True,
include_images=True,
include_links=True,
include_comments=False,

# Enable debug mode
debug=False
)
@@ -268,6 +347,11 @@ readium /path/to/repository \
--target-dir docs \
--use-markitdown \
--debug

# Process a URL and create split files
readium https://example.com/docs \
--split-output ./training-data/ \
--url-mode clean
```

Python API:
@@ -286,6 +370,9 @@ reader.split_output_dir = "./training-data/"

# Process and generate split files
summary, tree, content = reader.read_docs('/path/to/repository')

# Process a URL and generate split files
summary, tree, content = reader.read_docs('https://example.com/docs')
```

## 🛠️ Development
@@ -340,5 +427,6 @@ This project is licensed under the MIT License - see the LICENSE file for detail
## 🙏 Acknowledgments

- Microsoft and MarkItDown for their powerful document conversion tool
- Trafilatura for excellent web content extraction capabilities
- Rich library for beautiful console output
- Click for the powerful CLI interface