Releases: pablotoledo/Readium

Minor fix 0.5.2

17 May 11:31

0.5.2

Fixes argument handling in the CLI.

0.5.1

16 May 23:45

Readium 0.5.1 Release Notes

Overview

Readium 0.5.1 introduces significant enhancements to token tree functionality, streamlines the CLI interface, updates dependencies, and simplifies the codebase. This release focuses on improving the analysis capabilities while making the tool more maintainable and user-friendly.

Key Improvements

Token Tree Enhancements

  • Added comprehensive token tree functionality to both CLI and Python API
  • Integrated token count display by file and directory structure
  • Included token tree by default in standard output
  • Adopted the tiktoken library for reliable tokenization of documentation content
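
As a rough illustration of how per-file token counts can roll up into directory totals, here is a minimal sketch. It is not Readium's implementation: build_token_tree and count_tokens_naive are hypothetical names, and the whitespace tokenizer is only a stand-in for tiktoken so that the example runs with the standard library alone.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Callable, Dict


def count_tokens_naive(text: str) -> int:
    """Stand-in tokenizer (whitespace split); an assumption, not tiktoken."""
    return len(text.split())


def build_token_tree(
    files: Dict[str, str],
    count_tokens: Callable[[str], int] = count_tokens_naive,
) -> Dict[str, int]:
    """Map every file and every ancestor directory to a token total.

    `files` maps POSIX-style relative paths to file contents.
    The root directory is reported under the key ".".
    """
    totals: Dict[str, int] = defaultdict(int)
    for path, text in files.items():
        n = count_tokens(text)
        totals[path] += n
        # Roll the file's count up into each ancestor directory.
        for parent in PurePosixPath(path).parents:
            totals[str(parent)] += n
    return dict(totals)


example = {
    "README.md": "hello world",
    "docs/intro.md": "one two three",
    "docs/api/ref.md": "a b c d",
}
tree = build_token_tree(example)
```

To approximate real counts, a tiktoken-based counter could be passed as count_tokens, for example one built from tiktoken.get_encoding("cl100k_base"). Which encoding Readium selects is not stated in these notes.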

CLI Interface Improvements

  • Added new --tokens flag for displaying token metrics
  • Introduced dedicated subcommand for showing only the token tree
  • Simplified handling of arguments like --exclude-dir
  • Enhanced validation for user inputs to prevent common errors
  • Improved error messaging for better troubleshooting

Dependency Management

  • Made tiktoken a core dependency rather than optional
  • Removed the unnecessary tokenizers optional dependency group
  • Upgraded pypdf to version 4.3.1 for better compatibility

Codebase Simplification

  • Removed unused print_error utility and related imports
  • Streamlined __all__ definitions in core modules
  • Improved code organization for better maintainability
  • Enhanced type annotations for better code quality

Cleanup

  • Removed the deprecated pr.sh script to reduce repository clutter
  • Eliminated redundant code and simplified implementation patterns

Conclusion

Readium 0.5.1 significantly enhances the tool's ability to analyze documentation through token metrics while improving the developer and user experience through a more streamlined interface and simplified codebase. These improvements make Readium more powerful for extracting insights from documentation across repositories, directories, and URLs.

Release 0.4.1

25 Apr 22:31
1760521

This pull request introduces significant enhancements to the Readium tool, including the integration of the MarkItDown library for document conversion, improvements to the CLI for directory exclusion, and corresponding updates to the documentation and test cases. Below is a summary of the most important changes grouped by theme.

MarkItDown Integration:

  • Added support for MarkItDown to convert various document formats (e.g., PDF, DOCX, PPTX) to Markdown. This feature can be enabled via the --use-markitdown CLI option or use_markitdown=True in the Python API.
  • Updated the README.md with detailed instructions and examples for enabling MarkItDown integration, including dependency installation and supported file types.

Directory Exclusion Enhancements:

  • Improved the CLI to allow specifying multiple directories for exclusion using the -x/--exclude-dir option. Added validation to ensure no empty values are provided and included a runtime display of excluded directories.
  • Updated documentation with examples of using the -x short form for excluding directories and clarified behavior for invalid inputs.
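
The validation described above might look roughly like the following sketch. The function name validate_exclude_dirs is hypothetical, and dropping duplicates while keeping first-seen order is an assumption; the notes only say that duplicate and empty values are covered by tests.

```python
from typing import Iterable, List


def validate_exclude_dirs(values: Iterable[str]) -> List[str]:
    """Reject empty values and drop duplicates, keeping first-seen order.

    Hypothetical helper: the de-duplication policy is an assumption.
    """
    cleaned: List[str] = []
    for raw in values:
        name = raw.strip()
        if not name:
            # An empty or whitespace-only value is treated as a user error.
            raise ValueError("--exclude-dir requires a non-empty directory name")
        if name not in cleaned:
            cleaned.append(name)
    return cleaned
```

Values collected from repeated -x options would be passed through such a function before the excluded directories are displayed and applied.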

Codebase Updates:

  • Modified the main function in src/readium/cli.py to handle the new --use-markitdown option and validate directory exclusion inputs.
  • Updated the ReadConfig object to include markitdown_extensions when the feature is enabled.
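
To make the relationship between the flag and the extension set concrete, here is a small sketch under stated assumptions. SketchConfig is a hypothetical stand-in for ReadConfig that carries only the two fields named above, and the extension set is taken from the formats listed as examples in these notes, not from Readium's actual defaults.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import FrozenSet, Optional

# Formats named in the release notes as examples; the real default set may differ.
EXAMPLE_MARKITDOWN_EXTENSIONS = frozenset({".pdf", ".docx", ".pptx"})


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for ReadConfig with only the two relevant fields."""

    use_markitdown: bool = False
    markitdown_extensions: Optional[FrozenSet[str]] = None


def make_config(use_markitdown: bool) -> SketchConfig:
    """Populate markitdown_extensions only when the feature is switched on."""
    extensions = EXAMPLE_MARKITDOWN_EXTENSIONS if use_markitdown else None
    return SketchConfig(use_markitdown=use_markitdown, markitdown_extensions=extensions)


def needs_conversion(path: str, config: SketchConfig) -> bool:
    """True when the file would be routed through the document converter."""
    if not config.use_markitdown or not config.markitdown_extensions:
        return False
    return Path(path).suffix.lower() in config.markitdown_extensions
```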

Test Suite Enhancements:

  • Added integration tests for MarkItDown to verify real conversion functionality with actual files.
  • Added unit tests for the -x/--exclude-dir CLI option to ensure correct handling of single, multiple, duplicate, and empty values.

These updates enhance the usability and flexibility of the Readium tool, making it more powerful for document processing tasks.

Release 0.4.0

24 Apr 20:56
88fed49

This pull request introduces a new feature to exclude specific file extensions from processing in the readium tool. The changes include updates to the CLI, configuration, core logic, and documentation, as well as the addition of comprehensive tests to ensure the feature works as expected.

New Feature: Exclude File Extensions

CLI and Configuration Updates:

  • Added a new --exclude-ext option to the CLI, allowing users to specify file extensions to exclude from processing. This option can be used multiple times (e.g., --exclude-ext .json --exclude-ext .yml) (src/readium/cli.py, diff lines R87-R92).
  • Updated the ReadConfig class to include an exclude_extensions attribute, which takes precedence over include_extensions during processing (src/readium/config.py, diff line R179).
  • Documented the new exclude_extensions attribute in the ReadConfig docstring (src/readium/config.py, diff lines L163-R169).

Core Logic Enhancements:

  • Modified the should_process_file method to check for excluded extensions (case-insensitive) and skip processing files with those extensions (src/readium/core.py, diff lines R240-R244).
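
The precedence rule can be sketched as a standalone function. This is an illustration only, not the body of should_process_file: the name should_process, its signature, and the tolerance for extensions written without a leading dot are assumptions.

```python
from pathlib import Path
from typing import Iterable, Optional, Set


def _normalize(extensions: Iterable[str]) -> Set[str]:
    """Lower-case each extension and make sure it starts with a dot."""
    return {e.lower() if e.startswith(".") else "." + e.lower() for e in extensions}


def should_process(
    path: str,
    include_extensions: Optional[Iterable[str]] = None,
    exclude_extensions: Iterable[str] = (),
) -> bool:
    """Apply exclusions first, then inclusions, matching case-insensitively."""
    suffix = Path(path).suffix.lower()
    # Exclusions win even when the same extension is also included.
    if suffix in _normalize(exclude_extensions):
        return False
    # No include list means every remaining file is accepted.
    if include_extensions is None:
        return True
    return suffix in _normalize(include_extensions)
```

Checking the exclusion set before the inclusion set is what makes an extension listed in both end up excluded.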

Documentation:

  • Updated the README.md to explain the new --exclude-ext option and clarify that exclusions take precedence over inclusions (README.md).

Testing

  • Added a new test file, tests/test_extension_exclusion.py, with multiple test cases to validate the behavior of the exclude_extensions feature:
    • Basic exclusion of single and multiple extensions.
    • Interaction between include_extensions and exclude_extensions.
    • Case-insensitive matching of extensions.
    • CLI integration tests for the --exclude-ext option.
    • Tests for excluding all extensions and ensuring no files are processed.
    • Tests for extension exclusion in Git repositories (tests/test_extension_exclusion.py, diff lines R1-R147).

0.3.1

26 Mar 10:03

Fixing dependencies

0.3.0

26 Mar 09:36
8ee105b

🚀 New Features

Web Page to Markdown Conversion

Readium now supports direct web page content extraction and conversion to Markdown! This exciting update allows users to:

  • 🔗 Convert web pages to clean, readable Markdown
  • 🛠 Process URLs alongside local directories and repositories
  • 🎛 Configure content extraction with flexible modes:
    • clean: Extract only main content (default)
    • full: Preserve most page content

Key Enhancements

  • Powerful Web Scraping: Leveraging Trafilatura for intelligent content extraction
  • Configurable Processing:
    • Control table, image, and link inclusion
    • Choose between focused and comprehensive extraction modes
  • Seamless Integration: New functionality works alongside existing Readium features

🛠 CLI and API Updates

Command Line Examples

# Convert webpage to Markdown
readium https://example.com/docs

# Full content mode
readium https://example.com/docs --url-mode full

# Save to specific output file
readium https://example.com/docs -o webpage.md

Python API

from readium import Readium, ReadConfig

config = ReadConfig(
    url_mode='clean',      # 'clean' or 'full'
    include_tables=True,
    include_images=True
)

reader = Readium(config)
summary, tree, content = reader.read_docs('https://example.com/docs')

🔍 Processing Modes

  • Clean Mode (Default):
    • Focuses on main content
    • Removes menus, ads, and navigation elements
    • Ideal for documentation and technical content
  • Full Mode:
    • Preserves more page structure
    • Includes additional elements
    • Useful for comprehensive content capture
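
The difference between the two modes can be pictured with a deliberately simplified sketch. Readium delegates extraction to Trafilatura; the code below does not use Trafilatura and does not reproduce its heuristics. It swaps in the standard library's html.parser and a fixed list of "page chrome" tags purely to show the idea, and every name in it is hypothetical.

```python
from html.parser import HTMLParser
from typing import List

# Tags treated as page chrome in this sketch; an assumption, not Trafilatura's rules.
BOILERPLATE_TAGS = {"nav", "header", "footer", "aside"}
NEVER_TEXT_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collect text, skipping chrome elements in 'clean' mode only."""

    def __init__(self, mode: str = "clean") -> None:
        super().__init__()
        if mode not in ("clean", "full"):
            raise ValueError(f"unknown mode: {mode!r}")
        extra = BOILERPLATE_TAGS if mode == "clean" else set()
        self.skipped = NEVER_TEXT_TAGS | extra
        self.depth = 0  # greater than 0 while inside a skipped element
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skipped:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.skipped and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.depth == 0:
            self.chunks.append(text)


def extract_text(html: str, mode: str = "clean") -> str:
    """Return the page text, one chunk per line, for the given mode."""
    parser = TextExtractor(mode)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = (
    "<html><body><nav>Home | Docs</nav>"
    "<main><h1>Guide</h1><p>Install it.</p></main>"
    "<footer>Copyright</footer></body></html>"
)
clean_text = extract_text(page, "clean")
full_text = extract_text(page, "full")
```

In this toy version, clean mode keeps only the heading and paragraph, while full mode also keeps the navigation and footer text.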

📦 Dependencies

🔒 Compatibility

  • Python 3.10-3.12
  • Minimal impact on existing Readium workflows
  • Optional web processing functionality

0.2.0

19 Jan 15:35
a4ec351

New Features

🌿 Git Branch Selection

Now you can analyze documentation from specific Git branches using the new -b/--branch option.

# Analyze a specific branch
readium https://github.com/username/repo -b feature-branch

# Analyze a private repository's branch
readium https://token@github.com/username/repo -b develop

Python API Support

reader = Readium(config)
summary, tree, content = reader.read_docs(
    'https://github.com/username/repo',
    branch='feature-branch'
)
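
One plausible way for a tool to honor a branch option is to pass it through to git clone. The sketch below only builds the command line. It is an assumption about the approach, not Readium's code: build_clone_command is a hypothetical helper, and the shallow clone (--depth 1) is a choice made for the example.

```python
from typing import List, Optional


def build_clone_command(url: str, dest: str, branch: Optional[str] = None) -> List[str]:
    """Build a git clone invocation, optionally pinned to a branch.

    Hypothetical helper; --depth 1 (shallow clone) is an assumption.
    """
    cmd = ["git", "clone", "--depth", "1"]
    if branch:
        # --branch makes git check out the named branch instead of the default one.
        cmd += ["--branch", branch]
    cmd += [url, dest]
    return cmd
```

The resulting list can be handed to subprocess.run(cmd, check=True), which avoids shell quoting issues with branch names.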

0.1.3

14 Jan 23:56
22cb9d9

Release Notes: Enhanced Dependency Management and Error Handling

🚀 New Features and Enhancements

  1. Dependencies and Configuration Updates

    • Workflow Improvements:
      • Updated .github/workflows/test.yml to use pip install ".[dev]", streamlining the installation of development dependencies.
      • Retained pytest execution with -p no:warnings for cleaner test output.
    • Dependency Management:
      • Moved and separated dependencies into:
        • [tool.poetry.dependencies] for main dependencies.
        • [tool.poetry.group.dev.dependencies] for development-specific dependencies.
      • Adjusted dependencies like black, isort, mypy, pypdf, and others for better organization.
    • Configuration Enhancements:
      • Added isort configuration in pyproject.toml for consistent import sorting across the project.
  2. Code Enhancements

    • Error Handling:
      • Introduced a print_error function in error_handling.py for safer error handling with fallback support for unprintable content.
      • Integrated print_error across various modules for consistent error handling.
    • CLI Improvements:
      • Added detailed help text, examples, and enhanced the description of the output option in cli.py.
      • Improved error handling for unprintable content in CLI outputs.
    • Core Enhancements:
      • Refined type hinting in core.py with overload and more specific annotations for improved code clarity and safety.
      • Enhanced debug logging and error handling during file processing.
  3. Testing

    • New Unit Tests:
      • test_cli.py: Validated CLI help text, examples, and the functionality of the output option.
      • test_error_handling.py: Tested the print_error function under various scenarios (e.g., normal text, rich markup, fallback support).
    • Test Updates:
      • Updated test_basic.py by removing obsolete comments for better readability and relevance.
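
The fallback idea behind print_error, mentioned under Error Handling above, can be sketched without any third-party library. This is not the function from error_handling.py, whose exact behavior these notes do not show: print_error_safe is a hypothetical name, and escaping unencodable characters with backslash sequences is an assumed fallback strategy.

```python
import sys
from typing import Optional, TextIO


def print_error_safe(message: str, stream: Optional[TextIO] = None) -> None:
    """Write an error line, falling back to escaped text if encoding fails."""
    out = stream if stream is not None else sys.stderr
    try:
        out.write(f"Error: {message}\n")
    except UnicodeEncodeError:
        # Fallback: replace characters the stream cannot encode with escapes.
        encoding = getattr(out, "encoding", None) or "ascii"
        safe = message.encode(encoding, errors="backslashreplace").decode(encoding)
        out.write(f"Error: {safe}\n")
```

The point of the pattern is that reporting an error should never itself raise because of the characters in the message.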

📋 Key Benefits

  • Streamlined dependency management for clearer separation between main and development requirements.
  • Improved error handling mechanisms ensure safer and more robust handling of edge cases.
  • Enhanced developer experience with better documentation, consistent configurations, and comprehensive testing.
  • User experience improvements through enriched CLI help text and more intuitive output options.

0.1.1

11 Jan 17:34

Release Notes - v0.1.1

Patch release addressing MarkItDown integration and processing improvements.

Bug Fixes

MarkItDown Integration

  • Resolved critical parsing issues with complex document conversions
  • Fixed unexpected errors during document format detection
  • Improved handling of edge cases in multi-format document processing
  • Corrected inconsistent metadata extraction in mixed document types

Processing Stability

  • Enhanced error resilience in file conversion workflows
  • Mitigated potential runtime exceptions during document parsing
  • Stabilized memory management for large document conversions

Minor Improvements

  • Refined MarkItDown configuration detection
  • Updated dependency compatibility checks
  • Optimized internal logging for conversion processes

Upgrade Recommendations

  • Strongly recommended for users experiencing document conversion instabilities
  • No breaking changes from previous version
  • Direct upgrade path from v0.1.0

Installation

pip install readium==0.1.1

0.1.0

11 Jan 17:03

Release Notes - v0.1.0

Initial release of Readium, a documentation extraction and analysis tool.

What's New

Core Features

  • Documentation extraction from local directories and Git repositories
  • Support for multiple document formats through MarkItDown integration
  • Configurable file processing with size limits and exclusion patterns
  • Debug mode for detailed processing information

File Support

  • Documentation: .md, .mdx, .rst, .txt
  • Office documents (via MarkItDown): .pdf, .docx, .xlsx, .pptx
  • Source code: Multiple programming languages supported
  • Configuration: .yml, .toml, .json, etc.

Command Line Interface

  • Basic directory/repository processing
  • Output file generation
  • Configurable options for processing control
  • Debug mode support

Python API

  • ReadConfig class for flexible configuration
  • Readium class for programmatic access
  • Integration with MarkItDown for document conversion

Installation

pip install readium

Known Issues

  • Binary files are excluded by default unless processed through MarkItDown
  • Git repository processing requires git to be installed

Dependencies

  • Python ≥ 3.10
  • Required packages: click, rich, markitdown, black, isort