Improve the quality of text extraction by adding a comprehensive and configurable text cleaning and normalization pipeline.
Key Features
- Ligature replacement and Unicode normalization
- Bullet point and ordered list cleanup
- Paragraph grouping for broken lines
- Whitespace and line break normalization
- Quote normalization and MIME encoding handling
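As a rough illustration of what each feature above could look like, here is a minimal Python sketch using only the standard library. All function names, the ligature and quote tables, and the regexes are hypothetical and deliberately small; real documents would likely need broader character coverage, and the quoted-printable step assumes the input is known to be MIME-encoded.

```python
import quopri
import re
import unicodedata

# Hypothetical lookup tables; extend as needed for real documents.
LIGATURES = {"\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl",
             "\ufb03": "ffi", "\ufb04": "ffl"}
QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'",
                        "\u201c": '"', "\u201d": '"'})
BULLET_RE = re.compile(r"^[ \t]*[\u2022\u00b7\u25aa\u25cf\u2023][ \t]+", re.MULTILINE)
ORDERED_RE = re.compile(r"^[ \t]*(\d+)[.)][ \t]+", re.MULTILINE)
LIST_ITEM_RE = re.compile(r"(?:- |\d+\. )")

def replace_ligatures(text):
    """Expand typographic ligatures such as U+FB01 into plain letters."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text

def normalize_unicode(text, form="NFC"):
    """Apply a Unicode normalization form (NFC composes combining marks)."""
    return unicodedata.normalize(form, text)

def normalize_quotes(text):
    """Map curly single and double quotes to their ASCII equivalents."""
    return text.translate(QUOTES)

def clean_bullets(text):
    """Rewrite common bullet glyphs at line start as '- '."""
    return BULLET_RE.sub("- ", text)

def clean_ordered_lists(text):
    """Rewrite '1)' or '1.' markers at line start as '1. '."""
    return ORDERED_RE.sub(r"\1. ", text)

def normalize_whitespace(text):
    """Unify line endings, collapse space runs, cap blank lines at one."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t\u00a0]+", " ", text)
    text = re.sub(r" *\n *", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

def group_paragraphs(text):
    """Join hard-wrapped lines inside a paragraph; keep list items apart."""
    paragraphs = []
    for block in re.split(r"\n{2,}", text):
        merged = []
        for line in (ln.strip() for ln in block.split("\n")):
            if not line:
                continue
            if merged and not LIST_ITEM_RE.match(line):
                merged[-1] += " " + line  # continuation of the previous line
            else:
                merged.append(line)
        if merged:
            paragraphs.append("\n".join(merged))
    return "\n\n".join(paragraphs)

def decode_quoted_printable(text):
    """Decode quoted-printable escapes; only for input known to be MIME-encoded."""
    raw = quopri.decodestring(text.encode("utf-8"))
    return raw.decode("utf-8", errors="replace")
```

Order matters when these are combined: running bullet and ordered-list cleanup before paragraph grouping lets the grouping step recognize list items and avoid merging them into the preceding paragraph.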
Implementation Notes
- Build a modular text cleaning pipeline
- Support configurable cleaning options
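One possible shape for the modular, configurable design is sketched below: each cleaning operation is a plain `str -> str` function, and an options object decides which steps are assembled into the pipeline. The `CleaningOptions` fields, `build_pipeline`, and `clean` are hypothetical names chosen for illustration, and only three steps are wired in to keep the example short.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import Callable, List

Step = Callable[[str], str]

@dataclass(frozen=True)
class CleaningOptions:
    """Hypothetical option set; each field toggles or tunes one step."""
    unicode_form: str = "NFKC"        # empty string disables normalization
    normalize_quotes: bool = True
    collapse_whitespace: bool = True

def build_pipeline(opts: CleaningOptions) -> List[Step]:
    """Assemble the enabled steps in a fixed, predictable order."""
    steps: List[Step] = []
    if opts.unicode_form:
        # NFKC also folds compatibility characters such as the 'fi' ligature.
        steps.append(lambda t: unicodedata.normalize(opts.unicode_form, t))
    if opts.normalize_quotes:
        table = str.maketrans({"\u2018": "'", "\u2019": "'",
                               "\u201c": '"', "\u201d": '"'})
        steps.append(lambda t: t.translate(table))
    if opts.collapse_whitespace:
        steps.append(lambda t: re.sub(r"[ \t]+", " ", t).strip())
    return steps

def clean(text: str, opts: CleaningOptions = CleaningOptions()) -> str:
    """Run the text through every enabled step in order."""
    for step in build_pipeline(opts):
        text = step(text)
    return text

if __name__ == "__main__":
    sample = "\u201c\ufb01ne\u201d   text"
    print(clean(sample))
    print(clean(sample, CleaningOptions(normalize_quotes=False)))
```

Keeping every step as an independent function means new operations can be added or reordered without touching the others, and callers can disable a step that is harmful for their input.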
📈 Impact
Cleaner, normalized text should make chunking and extraction more accurate for scanned documents, PDFs, and other content with inconsistent formatting, which in turn improves downstream processing and data usability.