Add open-source text extraction libraries #293

Open
wants to merge 33 commits into
base: main
from


@garrethlee garrethlee commented Sep 27, 2024

Description

Refactored extraction logic to separate HTML cleaning and text extraction into distinct steps. This allows chaining the cleaning step from one library with the extraction step from another, enhancing flexibility and interoperability.

Context

  • Most extractors follow a two-step process:
    1. Clean raw HTML into a sanitized representation (usually a stripped-down version of the HTML).
    2. Convert the cleaned HTML to plaintext.
  • Readability, for example, only provides an HTML cleaning method and lacks built-in plaintext conversion. To handle such cases, we now support chaining steps across libraries (e.g., clean_html from one library and extract from another).
  • Direct use cases remain unaffected. Trafilatura's extract function, for example, still works on its own, while its clean_html is reserved for interoperability scenarios, such as pairing it with inscriptis.

Thus, we split the extraction functionality into the two phases described above, exposed as clean_html and extract methods on each Extractor.
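The chaining idea can be sketched as follows. This is a minimal illustration built only on the standard library, not the code in this PR: `Extractor`, `ToyCleaner`, `ToyTextExtractor`, and `chain` are hypothetical names, and the toy classes stand in for real libraries such as Readability (cleaning only) and inscriptis (plaintext conversion).

```python
import re
from html.parser import HTMLParser


class Extractor:
    """Hypothetical two-phase interface; method names follow the PR description."""

    def clean_html(self, html: str) -> str:
        raise NotImplementedError

    def extract(self, html: str) -> str:
        raise NotImplementedError


class ToyCleaner(Extractor):
    """Stand-in for a library that only cleans HTML and has no plaintext step."""

    def clean_html(self, html: str) -> str:
        # Drop <script> and <style> elements; leave all other markup in place.
        return re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)


class _TextCollector(HTMLParser):
    """Collects the non-empty text nodes of a document."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.parts.append(data.strip())


class ToyTextExtractor(Extractor):
    """Stand-in for a library that converts HTML to plaintext."""

    def clean_html(self, html: str) -> str:
        # This toy extractor has no cleaning step of its own.
        return html

    def extract(self, html: str) -> str:
        collector = _TextCollector()
        collector.feed(html)
        collector.close()
        return " ".join(collector.parts)


def chain(cleaner: Extractor, extractor: Extractor, html: str) -> str:
    """Run one extractor's cleaning step, then another's plaintext step."""
    return extractor.extract(cleaner.clean_html(html))


if __name__ == "__main__":
    page = "<html><body><script>var x = 1;</script><p>Hello</p><p>world</p></body></html>"
    print(chain(ToyCleaner(), ToyTextExtractor(), page))
```

Because both phases take and return plain strings, any cleaner can be paired with any extractor without either class knowing about the other.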

Changes

  • Added clean_html as a standalone method in extractors.
  • Refactored the logic in applicable extractors to separate cleaning and extracting processes.
  • Integrated new text extraction libraries (readabilipy, readability, resiliparse) to extend functionality and improve coverage.

garrethlee and others added 27 commits September 24, 2024 17:06
… initialization

- Added a default `clean_html` method to the `BaseExtractor` class, providing a warning for extractors that do not implement their own.
- Implemented specific `clean_html` methods in `Inscriptis`, `Justext`, `ReadabiliPy`, `Readability`, and `Trafilatura` extractors to handle HTML cleaning.
- Updated the `Inscriptis` extractor to accept a preprocessor during initialization.
- Modified the `extract` methods in `ReadabiliPy` and `Readability` to utilize the new `clean_html` method.
- Adjusted the `Justext` extractor to remove the default English language parameter from `get_stoplist`.
- Updated tests to reflect changes in extractor initialization and functionality.
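
A default `clean_html` that warns, as described in the first item above, might look roughly like the sketch below. The commit message does not say what the default returns or how the warning is worded; returning the input unchanged and using `warnings.warn` are assumptions made for illustration, and `CommentStrippingExtractor` is a hypothetical subclass.

```python
import re
import warnings


class BaseExtractor:
    """Sketch of a base class with a warning-only default clean_html."""

    def clean_html(self, html: str) -> str:
        # Assumption: the fallback passes the HTML through untouched and
        # only signals that no library-specific cleaning took place.
        warnings.warn(
            f"{type(self).__name__} does not implement clean_html; "
            "returning the input HTML unchanged.",
            stacklevel=2,
        )
        return html


class CommentStrippingExtractor(BaseExtractor):
    """Hypothetical extractor that overrides the default cleaning step."""

    def clean_html(self, html: str) -> str:
        # Remove HTML comments; no warning is raised because the
        # default implementation is never called.
        return re.sub(r"(?s)<!--.*?-->", "", html)
```

Extractors that override `clean_html` never trigger the warning, so it only surfaces when a chain relies on a cleaning step that the chosen library does not provide.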
@garrethlee garrethlee marked this pull request as ready for review December 21, 2024 23:50
@garrethlee garrethlee changed the title Add several open-source text extraction libraries Add open-source text extraction libraries Dec 21, 2024
2 participants