Feature: Text Cleaning Pipeline for Chunking and Extraction #18

@mubashir-oss

Description

Improve the quality of text extraction by adding a comprehensive and configurable text cleaning and normalization pipeline.

Key Features

  • Ligature replacement and Unicode normalization
  • Bullet point and ordered list cleanup
  • Paragraph grouping for broken lines
  • Whitespace and line break normalization
  • Quote normalization and MIME encoding handling
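
As a rough illustration of the features listed above, here is a minimal Python sketch with one small function per step. All function names are hypothetical and not taken from the project. It assumes NFKC normalization is acceptable for ligatures (NFKC also folds other compatibility characters, such as full-width forms) and reads "MIME encoding handling" as decoding RFC 2047 encoded words; other interpretations, such as quoted-printable bodies, are not covered.

```python
import re
import unicodedata
from email.errors import HeaderParseError
from email.header import decode_header, make_header

# Hypothetical helpers for illustration only; none of these names come from the project.
BULLET_RE = re.compile(r"^[ \t]*[•‣◦▪●·*\-–][ \t]+", re.MULTILINE)
ORDERED_RE = re.compile(r"^[ \t]*(\d+)[ \t]*[.)][ \t]+", re.MULTILINE)
ENCODED_WORD_RE = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")
QUOTE_MAP = str.maketrans({"“": '"', "”": '"', "„": '"', "‘": "'", "’": "'"})


def decode_mime_words(text: str) -> str:
    """Decode RFC 2047 encoded words such as =?UTF-8?Q?caf=C3=A9?=."""
    def _decode(match: re.Match) -> str:
        try:
            return str(make_header(decode_header(match.group(0))))
        except (LookupError, ValueError, HeaderParseError):
            return match.group(0)  # leave undecodable words untouched
    return ENCODED_WORD_RE.sub(_decode, text)


def replace_ligatures(text: str) -> str:
    """Fold compatibility characters, e.g. the "ﬁ" ligature (U+FB01) becomes "fi"."""
    return unicodedata.normalize("NFKC", text)


def normalize_quotes(text: str) -> str:
    """Map curly quotes to their straight ASCII equivalents."""
    return text.translate(QUOTE_MAP)


def clean_list_markers(text: str) -> str:
    """Rewrite bullets as "- " and ordered markers such as "1)" as "1. "."""
    text = BULLET_RE.sub("- ", text)
    return ORDERED_RE.sub(r"\1. ", text)


def normalize_whitespace(text: str) -> str:
    """Unify line endings, collapse runs of spaces, cap blank lines at one."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t\u00a0]+", " ", text)
    text = re.sub(r" *\n *", "\n", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def group_paragraphs(text: str) -> str:
    """Join lines broken mid-paragraph; each list item starts its own line."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        merged = []
        for line in (ln.strip() for ln in block.splitlines()):
            if not line:
                continue
            if merged and not re.match(r"(- |\d+\. )", line):
                merged[-1] += " " + line
            else:
                merged.append(line)
        if merged:
            paragraphs.append("\n".join(merged))
    return "\n\n".join(paragraphs)


def clean_text(text: str) -> str:
    """Run every step in a fixed order (the order is an assumption)."""
    for step in (decode_mime_words, replace_ligatures, normalize_quotes,
                 clean_list_markers, normalize_whitespace, group_paragraphs):
        text = step(text)
    return text
```

Calling `clean_text` on a string with a ligature, curly quotes, a broken line, and bullets returns text with single-line paragraphs separated by one blank line and list items prefixed with `- `. The sketch does not repair words hyphenated across line breaks, which a real implementation would probably need for PDF input.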

Implementation Notes

  • Build a modular text cleaning pipeline
  • Support configurable cleaning options
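
One possible shape for a modular, configurable pipeline is sketched below, assuming each cleaning step is a plain `str -> str` function that a boolean option switches on or off. `CleaningConfig`, `STEP_REGISTRY`, and `build_pipeline` are hypothetical names, and the three steps shown are placeholders for the full feature list.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import Callable, List, Tuple

Step = Callable[[str], str]


@dataclass(frozen=True)
class CleaningConfig:
    # Hypothetical option names; one flag per optional step.
    unicode_normalize: bool = True
    normalize_quotes: bool = True
    collapse_whitespace: bool = True


def _unicode_normalize(text: str) -> str:
    return unicodedata.normalize("NFKC", text)


def _normalize_quotes(text: str) -> str:
    return text.translate(str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'"}))


def _collapse_whitespace(text: str) -> str:
    return re.sub(r"[ \t]+", " ", text).strip()


# Ordered registry: each entry pairs a config flag with the step it enables.
STEP_REGISTRY: List[Tuple[str, Step]] = [
    ("unicode_normalize", _unicode_normalize),
    ("normalize_quotes", _normalize_quotes),
    ("collapse_whitespace", _collapse_whitespace),
]


def build_pipeline(config: CleaningConfig) -> Step:
    """Return a single callable that applies only the enabled steps, in order."""
    steps = [fn for flag, fn in STEP_REGISTRY if getattr(config, flag)]

    def run(text: str) -> str:
        for step in steps:
            text = step(text)
        return text

    return run


# Usage: keep typographic quotes but still fold ligatures and whitespace.
clean = build_pipeline(CleaningConfig(normalize_quotes=False))
cleaned = clean("“ﬁne”   text")
```

Keeping the steps in an ordered registry means a new step needs only a function and a config flag, and callers that extract from clean sources can switch off the steps they do not need.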

📈 Impact

The pipeline should improve chunking and extraction accuracy for scanned documents, PDFs, and other inconsistently formatted content, which in turn makes the extracted text more usable for downstream processing.
