Releases: pablotoledo/Readium
Minor fix 0.5.2
0.5.2
Fixes an issue with argument handling in the CLI.
0.5.1
Readium 0.5.1 Release Notes
Overview
Readium 0.5.1 introduces significant enhancements to token tree functionality, streamlines the CLI interface, updates dependencies, and simplifies the codebase. This release focuses on improving the analysis capabilities while making the tool more maintainable and user-friendly.
Key Improvements
Token Tree Enhancements
- Added comprehensive token tree functionality to both CLI and Python API
- Integrated token count display by file and directory structure
- Included token tree by default in standard output
- Implemented the `tiktoken` library for reliable tokenization of documentation content
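The per-file and per-directory token counts listed above can be pictured with a small sketch. This is not Readium's implementation: `build_token_tree` and `count_tokens` are hypothetical names, and a whitespace split stands in for the `tiktoken` encoder so the example runs without extra dependencies.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Callable, Dict


def count_tokens(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words.

    Assumption: a real implementation would call a tiktoken encoder here.
    """
    return len(text.split())


def build_token_tree(
    files: Dict[str, str],
    tokenizer: Callable[[str], int] = count_tokens,
) -> Dict[str, int]:
    """Map every file and every ancestor directory to its token total."""
    totals: Dict[str, int] = defaultdict(int)
    for path, text in files.items():
        n = tokenizer(text)
        totals[path] += n
        # Add the file's count to each ancestor directory ("." is the root).
        for parent in PurePosixPath(path).parents:
            totals[str(parent)] += n
    return dict(totals)


# Hypothetical in-memory documents, keyed by relative path.
docs = {
    "README.md": "Readium reads docs",
    "docs/guide.md": "install then run the tool",
    "docs/api/core.md": "read_docs returns three values",
}
tree = build_token_tree(docs)
for name in sorted(tree):
    print(f"{name}: {tree[name]}")
```

Passing a different `tokenizer` callable is one way such a sketch could swap the word count for a real encoder.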
CLI Interface Improvements
- Added new `--tokens` flag for displaying token metrics
- Introduced dedicated subcommand for showing only the token tree
- Simplified handling of arguments like `--exclude-dir`
- Enhanced validation for user inputs to prevent common errors
- Improved error messaging for better troubleshooting
Dependency Management
- Made `tiktoken` a core dependency rather than optional
- Removed the unnecessary `tokenizers` optional dependency group
- Upgraded `pypdf` to version 4.3.1 for better compatibility
Codebase Simplification
- Removed unused `print_error` utility and related imports
- Streamlined `__all__` definitions in core modules
- Improved code organization for better maintainability
- Enhanced type annotations for better code quality
Cleanup
- Removed the deprecated `pr.sh` script to reduce repository clutter
- Eliminated redundant code and simplified implementation patterns
Conclusion
Readium 0.5.1 significantly enhances the tool's ability to analyze documentation through token metrics while improving the developer and user experience through a more streamlined interface and simplified codebase. These improvements make Readium more powerful for extracting insights from documentation across repositories, directories, and URLs.
Release 0.4.1
This pull request introduces significant enhancements to the Readium tool, including the integration of the MarkItDown library for document conversion, improvements to the CLI for directory exclusion, and corresponding updates to the documentation and test cases. Below is a summary of the most important changes grouped by theme.
MarkItDown Integration:
- Added support for `MarkItDown` to convert various document formats (e.g., PDF, DOCX, PPTX) to Markdown. This feature can be enabled via the `--use-markitdown` CLI option or `use_markitdown=True` in the Python API.
- Updated the `README.md` with detailed instructions and examples for enabling `MarkItDown` integration, including dependency installation and supported file types.
Directory Exclusion Enhancements:
- Improved the CLI to allow specifying multiple directories for exclusion using the `-x`/`--exclude-dir` option. Added validation to ensure no empty values are provided and included a runtime display of excluded directories.
- Updated documentation with examples of using the `-x` short form for excluding directories and clarified behavior for invalid inputs.
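A repeatable exclusion option with empty-value validation, as described above, might look like the following sketch. It uses the standard library's `argparse` as a stand-in for Readium's actual CLI code; the function name `parse_exclude_dirs`, the error text, and the choice to drop duplicates are assumptions made for illustration.

```python
import argparse
from typing import List, Optional


def parse_exclude_dirs(argv: Optional[List[str]] = None) -> List[str]:
    """Collect repeated -x/--exclude-dir values, rejecting empty ones."""
    parser = argparse.ArgumentParser(prog="readium-sketch")
    parser.add_argument(
        "-x",
        "--exclude-dir",
        action="append",  # allow the option to be given several times
        default=[],
        metavar="DIR",
        help="directory to exclude (repeatable)",
    )
    args = parser.parse_args(argv)

    cleaned: List[str] = []
    for value in args.exclude_dir:
        name = value.strip()
        if not name:
            # parser.error prints a usage message and exits with status 2.
            parser.error("--exclude-dir requires a non-empty value")
        if name not in cleaned:  # assumption: duplicates are dropped
            cleaned.append(name)
    return cleaned


excluded = parse_exclude_dirs(
    ["-x", "node_modules", "--exclude-dir", "dist", "-x", "dist"]
)
# Runtime display of what will be skipped.
print("Excluding directories:", ", ".join(excluded))
```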
Codebase Updates:
- Modified the `main` function in `src/readium/cli.py` to handle the new `--use-markitdown` option and validate directory exclusion inputs.
- Updated the `ReadConfig` object to include `markitdown_extensions` when the feature is enabled.
Test Suite Enhancements:
- Added integration tests for `MarkItDown` to verify real conversion functionality with actual files.
- Added unit tests for the `-x`/`--exclude-dir` CLI option to ensure correct handling of single, multiple, duplicate, and empty values.
These updates enhance the usability and flexibility of the Readium tool, making it more powerful for document processing tasks.
Release 0.4.0
This pull request introduces a new feature to exclude specific file extensions from processing in the readium tool. The changes include updates to the CLI, configuration, core logic, and documentation, as well as the addition of comprehensive tests to ensure the feature works as expected.
New Feature: Exclude File Extensions
CLI and Configuration Updates:
- Added a new `--exclude-ext` option to the CLI, allowing users to specify file extensions to exclude from processing. This option can be used multiple times (e.g., `--exclude-ext .json --exclude-ext .yml`) (`src/readium/cli.py`, src/readium/cli.pyR87-R92).
- Updated the `ReadConfig` class to include an `exclude_extensions` attribute, which takes precedence over `include_extensions` during processing (`src/readium/config.py`, src/readium/config.pyR179).
- Documented the new `exclude_extensions` attribute in the `ReadConfig` docstring (`src/readium/config.py`, src/readium/config.pyL163-R169).
Core Logic Enhancements:
- Modified the `should_process_file` method to check for excluded extensions (case-insensitive) and skip processing files with those extensions (`src/readium/core.py`, src/readium/core.pyR240-R244).
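The precedence rule (exclusions win over inclusions, compared case-insensitively) can be sketched as a standalone function. This is an illustration only, not the code in `src/readium/core.py`; `should_process_file_sketch` is a hypothetical name, and treating an empty include list as "no include filter" is an assumption.

```python
from pathlib import Path
from typing import Iterable, Optional, Set


def _normalize(extensions: Optional[Iterable[str]]) -> Set[str]:
    """Lower-case extensions so comparisons ignore case."""
    return {ext.lower() for ext in (extensions or [])}


def should_process_file_sketch(
    path: str,
    include_extensions: Optional[Iterable[str]] = None,
    exclude_extensions: Optional[Iterable[str]] = None,
) -> bool:
    """Return True when the file's extension passes both filters."""
    suffix = Path(path).suffix.lower()
    # Exclusions are checked first, so they win over inclusions.
    if suffix in _normalize(exclude_extensions):
        return False
    included = _normalize(include_extensions)
    # Assumption: an empty include list means "no include filter".
    return not included or suffix in included


for candidate in ["README.md", "settings/config.JSON", "notes.txt"]:
    keep = should_process_file_sketch(
        candidate,
        include_extensions=[".md", ".json"],
        exclude_extensions=[".json"],
    )
    print(f"{candidate}: {'process' if keep else 'skip'}")
```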
Documentation:
- Updated the `README.md` to explain the new `--exclude-ext` option and clarify that exclusions take precedence over inclusions (`README.md`).
Testing
- Added a new test file, `tests/test_extension_exclusion.py`, with multiple test cases to validate the behavior of the `exclude_extensions` feature:
  - Basic exclusion of single and multiple extensions.
  - Interaction between `include_extensions` and `exclude_extensions`.
  - Case-insensitive matching of extensions.
  - CLI integration tests for the `--exclude-ext` option.
  - Tests for excluding all extensions and ensuring no files are processed.
  - Tests for extension exclusion in Git repositories (`tests/test_extension_exclusion.py`, tests/test_extension_exclusion.pyR1-R147).
0.3.1
Fixing dependencies
0.3.0
🚀 New Features
Web Page to Markdown Conversion
Readium now supports direct web page content extraction and conversion to Markdown! This exciting update allows users to:
- 🔗 Convert web pages to clean, readable Markdown
- 🛠 Process URLs alongside local directories and repositories
- 🎛 Configure content extraction with flexible modes:
  - `clean`: Extract only main content (default)
  - `full`: Preserve most page content
Key Enhancements
- Powerful Web Scraping: Leveraging Trafilatura for intelligent content extraction
- Configurable Processing:
- Control table, image, and link inclusion
- Choose between focused and comprehensive extraction modes
- Seamless Integration: New functionality works alongside existing Readium features
🛠 CLI and API Updates
Command Line Examples
# Convert webpage to Markdown
readium https://example.com/docs
# Full content mode
readium https://example.com/docs --url-mode full
# Save to specific output file
readium https://example.com/docs -o webpage.md
Python API
from readium import Readium, ReadConfig
config = ReadConfig(
url_mode='clean', # 'clean' or 'full'
include_tables=True,
include_images=True
)
reader = Readium(config)
summary, tree, content = reader.read_docs('https://example.com/docs')
🔍 Processing Modes
- Clean Mode (Default):
  - Focuses on main content
  - Removes menus, ads, and navigation elements
  - Ideal for documentation and technical content
- Full Mode:
  - Preserves more page structure
  - Includes additional elements
  - Useful for comprehensive content capture
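One way to picture the difference between the two modes is as a mapping from configuration to extractor options. The sketch below is an assumption about how such a mapping could look, not Readium's actual logic: `UrlOptions`, `extraction_kwargs`, and the `include_links` field are hypothetical, while the returned keys follow keyword arguments documented for `trafilatura.extract`. The library itself is not imported, so the example only builds the option dictionary.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class UrlOptions:
    """Hypothetical subset of the web-related settings shown above."""

    url_mode: str = "clean"  # 'clean' or 'full'
    include_tables: bool = True
    include_images: bool = True
    include_links: bool = True  # assumed field, not shown in the notes


def extraction_kwargs(options: UrlOptions) -> Dict[str, Any]:
    """Translate a mode into keyword arguments for a content extractor."""
    if options.url_mode not in ("clean", "full"):
        raise ValueError(f"unknown url_mode: {options.url_mode!r}")
    clean = options.url_mode == "clean"
    return {
        "include_tables": options.include_tables,
        "include_images": options.include_images,
        "include_links": options.include_links,
        # Assumed mapping: clean mode favors precision (main content only),
        # full mode favors recall and keeps comment sections.
        "include_comments": not clean,
        "favor_precision": clean,
        "favor_recall": not clean,
    }


print(extraction_kwargs(UrlOptions()))
print(extraction_kwargs(UrlOptions(url_mode="full", include_images=False)))
```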
📦 Dependencies
- Added [Trafilatura](https://github.com/adbar/trafilatura) for intelligent web content extraction
🔒 Compatibility
- Python 3.10-3.12
- Minimal impact on existing Readium workflows
- Optional web processing functionality
0.2.0
New Features
🌿 Git Branch Selection
Now you can analyze documentation from specific Git branches using the new -b/--branch option.
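Under the hood, branch selection for a remote repository generally comes down to passing a branch name to `git clone`. The following is a minimal sketch of how such a command could be assembled, assuming a shallow clone; `clone_command` is a hypothetical helper and not part of Readium's API.

```python
from typing import List, Optional


def clone_command(
    repo_url: str, dest: str, branch: Optional[str] = None
) -> List[str]:
    """Build a shallow `git clone` argument list, optionally for one branch."""
    cmd = ["git", "clone", "--depth", "1"]
    if branch:
        # --branch makes git check out the named branch instead of the default.
        cmd += ["--branch", branch]
    cmd += [repo_url, dest]
    return cmd


cmd = clone_command(
    "https://github.com/username/repo", "/tmp/repo", branch="feature-branch"
)
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)`; it is only printed here so the example runs without network access. The CLI and Python forms of the option follow.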
# Analyze a specific branch
readium https://github.com/username/repo -b feature-branch
# Analyze a private repository's branch
readium https://[email protected]/username/repo -b develop
Python API Support
reader = Readium(config)
summary, tree, content = reader.read_docs(
'https://github.com/username/repo',
branch='feature-branch'
)
0.1.3
Release Notes: Enhanced Dependency Management and Error Handling
🚀 New Features and Enhancements
- Dependencies and Configuration Updates
  - Workflow Improvements:
    - Updated `.github/workflows/test.yml` to use `pip install ".[dev]"`, streamlining the installation of development dependencies.
    - Retained `pytest` execution with `-p no:warnings` for cleaner test output.
  - Dependency Management:
    - Moved and separated dependencies into:
      - `[tool.poetry.dependencies]` for main dependencies.
      - `[tool.poetry.group.dev.dependencies]` for development-specific dependencies.
    - Adjusted dependencies like `black`, `isort`, `mypy`, `pypdf`, and others for better organization.
  - Configuration Enhancements:
    - Added `isort` configuration in `pyproject.toml` for consistent import sorting across the project.
- Code Enhancements
  - Error Handling:
    - Introduced a `print_error` function in `error_handling.py` for safer error handling with fallback support for unprintable content.
    - Integrated `print_error` across various modules for consistent error handling.
  - CLI Improvements:
    - Added detailed help text, examples, and enhanced the description of the `output` option in `cli.py`.
    - Improved error handling for unprintable content in CLI outputs.
  - Core Enhancements:
    - Refined type hinting in `core.py` with `overload` and more specific annotations for improved code clarity and safety.
    - Enhanced debug logging and error handling during file processing.
- Testing
  - New Unit Tests:
    - `test_cli.py`: Validated CLI help text, examples, and the functionality of the `output` option.
    - `test_error_handling.py`: Tested the `print_error` function under various scenarios (e.g., normal text, rich markup, fallback support).
  - Test Updates:
    - Updated `test_basic.py` by removing obsolete comments for better readability and relevance.
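The "fallback support for unprintable content" mentioned for `print_error` can be illustrated with a plain-stream sketch. This is not the function from `error_handling.py` (which, per the notes, also deals with rich markup); `print_error_sketch` is a hypothetical name, and escaping with `backslashreplace` is an assumed fallback strategy.

```python
import io
import sys
from typing import Optional, TextIO


def print_error_sketch(message: str, stream: Optional[TextIO] = None) -> None:
    """Write an error line, escaping characters the stream cannot encode."""
    out = stream if stream is not None else sys.stderr
    try:
        out.write(f"Error: {message}\n")
    except UnicodeEncodeError:
        # Fallback: replace unencodable characters with backslash escapes.
        encoding = getattr(out, "encoding", None) or "ascii"
        safe = message.encode(encoding, errors="backslashreplace").decode(encoding)
        out.write(f"Error: {safe}\n")


# Simulate a terminal that can only display ASCII.
ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii", newline="\n")
print_error_sketch("cannot read café ✓.md", ascii_stream)
ascii_stream.flush()
print(ascii_stream.buffer.getvalue().decode("ascii"), end="")
```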
📋 Key Benefits
- Streamlined dependency management for clearer separation between main and development requirements.
- Improved error handling mechanisms ensure safer and more robust handling of edge cases.
- Enhanced developer experience with better documentation, consistent configurations, and comprehensive testing.
- User experience improvements through enriched CLI help text and more intuitive output options.
0.1.1
Release Notes - v0.1.1
Patch release addressing MarkItDown integration and processing improvements.
Bug Fixes
MarkItDown Integration
- Resolved critical parsing issues with complex document conversions
- Fixed unexpected errors during document format detection
- Improved handling of edge cases in multi-format document processing
- Corrected inconsistent metadata extraction in mixed document types
Processing Stability
- Enhanced error resilience in file conversion workflows
- Mitigated potential runtime exceptions during document parsing
- Stabilized memory management for large document conversions
Minor Improvements
- Refined MarkItDown configuration detection
- Updated dependency compatibility checks
- Optimized internal logging for conversion processes
Upgrade Recommendations
- Strongly recommended for users experiencing document conversion instabilities
- No breaking changes from previous version
- Direct upgrade path from v0.1.0
Installation
pip install readium==0.1.1
0.1.0
Release Notes - v0.1.0
Initial release of Readium, a documentation extraction and analysis tool.
What's New
Core Features
- Documentation extraction from local directories and Git repositories
- Support for multiple document formats through MarkItDown integration
- Configurable file processing with size limits and exclusion patterns
- Debug mode for detailed processing information
File Support
- Documentation: `.md`, `.mdx`, `.rst`, `.txt`
- Office documents (via MarkItDown): `.pdf`, `.docx`, `.xlsx`, `.pptx`
- Source code: Multiple programming languages supported
- Configuration: `.yml`, `.toml`, `.json`, etc.
Command Line Interface
- Basic directory/repository processing
- Output file generation
- Configurable options for processing control
- Debug mode support
Python API
- `ReadConfig` class for flexible configuration
- `Readium` class for programmatic access
- Integration with MarkItDown for document conversion
Installation
pip install readium
Known Issues
- Binary files are excluded by default unless processed through MarkItDown
- Git repository processing requires git to be installed
Dependencies
- Python ≥ 3.10
- Required packages: click, rich, markitdown, black, isort