Merged
141 changes: 23 additions & 118 deletions README.md
@@ -10,6 +10,7 @@ A powerful Python tool for extracting, analyzing, and converting documentation f

- 📂 Extract documentation from local directories or Git repositories
- Support for private repositories using tokens
- Branch selection for Git repositories
- Secure token handling and masking
- 🔄 Convert multiple document formats to Markdown using MarkItDown integration
- 🎯 Target specific subdirectories for focused analysis
@@ -50,9 +51,15 @@ readium /path/to/directory
# Process a public Git repository
readium https://github.com/username/repository

# Process a specific branch of a Git repository
readium https://github.com/username/repository -b feature-branch

# Process a private Git repository with token
readium https://token@github.com/username/repository

# Process a specific branch of a private repository
readium https://token@github.com/username/repository -b feature-branch

# Save output to a file
readium /path/to/directory -o output.md

@@ -76,6 +83,9 @@ readium /path/to/directory --include-ext .cfg --include-ext .conf

# Enable debug mode for detailed processing information
readium /path/to/directory --debug

# Process specific branch with debug information
readium https://github.com/username/repository -b develop --debug
```

### Python API
@@ -100,9 +110,21 @@ summary, tree, content = reader.read_docs('/path/to/directory')
# Process public Git repository
summary, tree, content = reader.read_docs('https://github.com/username/repo')

# Process specific branch of a Git repository
summary, tree, content = reader.read_docs(
'https://github.com/username/repo',
branch='feature-branch'
)

# Process private Git repository with token
summary, tree, content = reader.read_docs('https://token@github.com/username/repo')

# Process specific branch of a private repository
summary, tree, content = reader.read_docs(
'https://token@github.com/username/repo',
branch='feature-branch'
)

# Access results
print("Summary:", summary)
print("\nFile Tree:", tree)
@@ -141,121 +163,4 @@ config = ReadConfig(
)
```

### Default Configuration

#### Default Excluded Directories
```python
DEFAULT_EXCLUDE_DIRS = {
".git", "node_modules", "__pycache__", "assets",
"img", "images", "dist", "build", ".next",
".vscode", ".idea", "bin", "obj", "target",
"out", ".venv", "venv", ".gradle",
".pytest_cache", ".mypy_cache", "htmlcov",
"coverage", ".vs", "Pods"
}
```

#### Default Excluded Files
```python
DEFAULT_EXCLUDE_FILES = {
".pyc", ".pyo", ".pyd", ".DS_Store",
".gitignore", ".env", "Thumbs.db",
"desktop.ini", "npm-debug.log",
"yarn-error.log", "pnpm-debug.log",
"*.log", "*.lock"
}
```

#### Default MarkItDown Extensions
```python
MARKITDOWN_EXTENSIONS = {
".pdf", ".docx", ".xlsx", ".xls",
".pptx", ".html", ".htm", ".msg"
}
```
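The README lists these defaults but does not say how they are matched against a path. The sketch below shows one plausible reading, assuming directory entries match any parent folder name and file entries are exact names, suffixes, or glob patterns. The `is_excluded` helper and the shortened sets are hypothetical and are not taken from Readium's source.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Small subsets of the defaults listed above, for illustration only
EXCLUDE_DIRS = {".git", "node_modules", "__pycache__"}
EXCLUDE_FILES = {".pyc", ".DS_Store", "*.log"}


def is_excluded(path: str) -> bool:
    """Hypothetical check; Readium's real matching logic may differ."""
    p = PurePosixPath(path)

    # Skip anything located under an excluded directory
    if any(part in EXCLUDE_DIRS for part in p.parts[:-1]):
        return True

    for pattern in EXCLUDE_FILES:
        if "*" in pattern:
            # Glob-style entries such as "*.log"
            if fnmatch(p.name, pattern):
                return True
        elif p.name == pattern or p.name.endswith(pattern):
            # Exact names (".DS_Store") or suffixes (".pyc")
            return True
    return False


for candidate in ["src/node_modules/x.js", "logs/app.log", "docs/guide.md"]:
    print(candidate, is_excluded(candidate))
```

Mixing suffixes, exact names, and globs in one set means each entry has to be classified before matching, which is why the sketch branches on the presence of `*`.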

## 📜 Output Format

Readium generates three types of output:

1. **Summary**: Overview of the processing results
```
Path analyzed: /path/to/directory
Files processed: 42
Target directory: docs
Using MarkItDown for compatible files
MarkItDown extensions: .pdf, .docx, .xlsx, ...
```

2. **Tree**: Visual representation of processed files
```
Documentation Structure:
└── README.md
└── docs/guide.md
└── src/example.py
```

3. **Content**: Full content of processed files
```
================================================
File: README.md
================================================
[File content here]

================================================
File: docs/guide.md
================================================
[File content here]
```
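The content layout shown above is simple enough to reproduce by hand. The sketch below is a minimal illustration, assuming a hypothetical `format_content` function and a separator width read off the sample; Readium's own implementation is not shown in this diff and may differ.

```python
from typing import Dict

# Separator width is an assumption based on the README sample
SEPARATOR = "=" * 48


def format_content(files: Dict[str, str]) -> str:
    """Join file bodies under a 'File: <name>' banner (hypothetical helper)."""
    sections = []
    for name, text in files.items():
        # Each section is: separator, file name, separator, body
        sections.append(f"{SEPARATOR}\nFile: {name}\n{SEPARATOR}\n{text}\n")
    # A blank line separates consecutive sections
    return "\n".join(sections)


print(format_content({"README.md": "# Title", "docs/guide.md": "Guide text"}))
```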

### Error Handling

Readium provides robust error handling through the `error_handling` module:

- Console-based error reporting with Rich formatting
- Fallback handling for markup-containing error messages
- Automatic handling of unprintable content with file output
- Debug logging for troubleshooting issues

## 🛠️ Development

1. Clone the repository
2. Install development dependencies:
```bash
# Using pip
pip install -e ".[dev]"

# Or using Poetry
poetry install --with dev
```
3. Install pre-commit hooks:
```bash
pre-commit install
```

### Running Tests

```bash
# Run all tests
pytest

# Run tests without warnings
pytest -p no:warnings

# Run tests inside the Poetry-managed environment
poetry run pytest
```

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🙏 Acknowledgments

- Microsoft and MarkItDown for their powerful document conversion tool
- Rich library for beautiful console output
- Click for the powerful CLI interface
[Rest of the README content remains unchanged...]
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "readium"
version = "0.1.3"
version = "0.2.0"
description = "A tool to extract and analyze documentation from repositories and directories"
authors = [
{name = "Pablo Toledo", email = "[email protected]"}
3 changes: 2 additions & 1 deletion src/readium/__init__.py
@@ -1,4 +1,5 @@
from .cli import main
from .core import ReadConfig, Readium
from .utils.error_handling import print_error

__all__ = ["ReadConfig", "Readium", "print_error"]
__all__ = ["ReadConfig", "Readium", "print_error", "main"]
11 changes: 9 additions & 2 deletions src/readium/cli.py
@@ -26,15 +26,21 @@
# Process a Git repository
readium https://github.com/username/repository

# Process a specific branch of a Git repository
readium https://github.com/username/repository -b feature-branch

# Save output to a file
readium /path/to/directory -o output.md

# Process specific subdirectory
readium /path/to/directory -t python
"""
)
@click.argument("path", type=str) # Removed the 'help' argument
@click.argument("path", type=str)
@click.option("--target-dir", "-t", help="Target subdirectory to analyze")
@click.option(
"--branch", "-b", help="Specific Git branch to clone (only for Git repositories)"
)
@click.option(
"--max-size",
"-s",
@@ -77,6 +83,7 @@
def main(
path: str,
target_dir: str,
branch: str,
max_size: int,
output: str,
exclude_dir: tuple,
@@ -100,7 +107,7 @@ def main(
)

reader = Readium(config)
summary, tree, content = reader.read_docs(path)
summary, tree, content = reader.read_docs(path, branch=branch)

if output:
with open(output, "w", encoding="utf-8") as f:
54 changes: 38 additions & 16 deletions src/readium/core.py
@@ -3,7 +3,7 @@
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Optional, Set, Tuple, Union, overload
from typing import Dict, List, Optional, Set, Tuple, Union

from markitdown import FileConversionException, MarkItDown, UnsupportedFormatException

@@ -23,9 +23,26 @@ def is_git_url(url: str) -> bool:
)


def clone_repository(url: str, target_dir: str) -> None:
"""Clone a git repository to the target directory"""
def clone_repository(url: str, target_dir: str, branch: Optional[str] = None) -> None:
"""Clone a git repository to the target directory

Parameters
----------
url : str
Repository URL
target_dir : str
Target directory for cloning
branch : Optional[str]
Specific branch to clone (default: None, uses default branch)
"""
try:
# Base command
cmd = ["git", "clone", "--depth=1"]

# Add branch specification if provided
if branch:
cmd.extend(["-b", branch])

# If the URL contains '@', it is likely to have a token
if "@" in url:
# Extract the token and reconstruct the URL
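The command assembly in this hunk can be read in isolation. The sketch below is a minimal illustration, assuming a hypothetical `build_clone_command` helper (not part of the PR) that only builds the argument list and never runs git.

```python
from typing import List, Optional


def build_clone_command(
    url: str, target_dir: str, branch: Optional[str] = None
) -> List[str]:
    """Build the argument list for a shallow clone (hypothetical helper)."""
    # Shallow clone, as in the diff: only the latest commit is fetched
    cmd = ["git", "clone", "--depth=1"]

    # `-b <branch>` is added only when a branch was requested
    if branch:
        cmd.extend(["-b", branch])

    # Repository URL and destination directory come last
    cmd.extend([url, target_dir])
    return cmd


print(build_clone_command("https://github.com/username/repo", "/tmp/repo", "develop"))
```

When `branch` is `None`, the list contains no `-b` flag and git checks out the remote's default branch. Keeping the command as a list rather than a shell string also avoids quoting problems with unusual branch names.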
@@ -37,25 +54,21 @@ def clone_repository(url: str, target_dir: str) -> None:
# Log for debugging (hiding the full token)
token_preview = f"{token[:4]}...{token[-4:]}" if len(token) > 8 else "****"
print(f"DEBUG: Attempting to clone with token: {token_preview}")
if branch:
print(f"DEBUG: Using branch: {branch}")

# Use the token as a password with an empty username
env = os.environ.copy()
env["GIT_ASKPASS"] = "echo"
env["GIT_USERNAME"] = ""
env["GIT_PASSWORD"] = token

subprocess.run(
["git", "clone", "--depth=1", repo_url, target_dir],
check=True,
capture_output=True,
env=env,
)
cmd.extend([repo_url, target_dir])
subprocess.run(cmd, check=True, capture_output=True, env=env)
else:
subprocess.run(
["git", "clone", "--depth=1", url, target_dir],
check=True,
capture_output=True,
)
cmd.extend([url, target_dir])
subprocess.run(cmd, check=True, capture_output=True)

except subprocess.CalledProcessError as e:
error_msg = e.stderr.decode()
# Hide the token in the error message if present
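The debug line in this hunk logs only a preview of the token. The sketch below isolates that masking rule in a hypothetical `mask_token` helper; the name is an assumption, since the PR inlines the expression inside `clone_repository`.

```python
def mask_token(token: str) -> str:
    """Return a preview that is safe to log (hypothetical helper)."""
    # Tokens of 8 characters or fewer are hidden entirely, since showing
    # the first 4 and last 4 characters would reveal the whole value
    if len(token) <= 8:
        return "****"
    return f"{token[:4]}...{token[-4:]}"


print(mask_token("ghp_abcdefghijkl"))
print(mask_token("short"))
```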
Expand All @@ -72,6 +85,7 @@ class Readium:
def __init__(self, config: Optional[ReadConfig] = None):
self.config = config or ReadConfig()
self.markitdown = MarkItDown() if self.config.use_markitdown else None
self.branch: Optional[str] = None # Add branch attribute

def log_debug(self, msg: str) -> None:
"""Print debug messages if debug mode is enabled"""
@@ -152,25 +166,31 @@ def should_process_file(self, file_path: Union[str, Path]) -> bool:
self.log_debug(f"Including {path} for processing")
return True

def read_docs(self, path: Union[str, Path]) -> Tuple[str, str, str]:
def read_docs(
self, path: Union[str, Path], branch: Optional[str] = None
) -> Tuple[str, str, str]:
"""
Read documentation from a directory or git repository

Parameters
----------
path : Union[str, Path]
Local path or git URL
branch : Optional[str]
Specific branch to clone for git repositories (default: None)

Returns
-------
Tuple[str, str, str]:
summary, tree structure, content
"""
self.branch = branch

# If it's a git URL, clone first
if isinstance(path, str) and is_git_url(path):
with tempfile.TemporaryDirectory() as temp_dir:
try:
clone_repository(path, temp_dir)
clone_repository(path, temp_dir, branch)
return self._process_directory(Path(temp_dir), original_path=path)
except Exception as e:
raise ValueError(f"Error processing git repository: {str(e)}")
@@ -273,5 +293,7 @@ def _process_directory(
summary += "Using MarkItDown for compatible files\n"
if self.config.markitdown_extensions:
summary += f"MarkItDown extensions: {', '.join(self.config.markitdown_extensions)}\n"
if self.branch:
summary += f"Git branch: {self.branch}\n"

return summary, tree, content
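The summary gains a `Git branch:` line only when a branch was requested. The sketch below is a minimal illustration of that behaviour, assuming a hypothetical standalone `build_summary` function. In the PR this logic lives inside `_process_directory`, and the first two summary lines here are copied from the README sample, not from the full source.

```python
from typing import Optional


def build_summary(
    path: str, files_processed: int, branch: Optional[str] = None
) -> str:
    """Assemble a summary string (hypothetical, simplified helper)."""
    summary = f"Path analyzed: {path}\n"
    summary += f"Files processed: {files_processed}\n"
    # The branch line is omitted for local paths and default-branch clones
    if branch:
        summary += f"Git branch: {branch}\n"
    return summary


print(build_summary("https://github.com/username/repo", 42, branch="develop"))
```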