Ingest CCSS FAQ pages as a single chunk #3

@MathyouMB

NOTE: Don't assign yourself unless you have confirmed with Matthew that you've got a working environment.

🧠 Context

When ingesting FAQ pages from the Carleton Computer Science Society (CCSS) site (i.e., URLs like https://ccss.carleton.ca/resources/faq/questions/**), the current logic creates multiple chunks, including one that contains only the footer text (© 2025 Carleton Computer Science Society), which pollutes the index.

These pages should be treated as structured, self-contained documents. Rather than splitting them up, we should ingest the entire page as a single chunk and explicitly exclude generic or boilerplate content like the footer.


🛠 Implementation Plan

  1. In WebpageIngestionService, detect if the source URL starts with https://ccss.carleton.ca/resources/faq/questions/.

  2. If it matches, bypass the default chunking logic and instead:

    • Strip out the footer and boilerplate content.
    • Store the entire cleaned-up page content as one chunk.

  3. Add a unit test to ensure:

    • The page is ingested as a single chunk.
    • The chunk does not contain the © 2025 Carleton Computer Science Society text.
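
A minimal Python sketch of steps 1 and 2, assuming the page has already been reduced to plain text before chunking. The issue does not show the internals of WebpageIngestionService, so every name below (`is_ccss_faq_url`, `strip_boilerplate`, `chunk_page`, `default_chunker`) is hypothetical, and the footer pattern is only one guess at what counts as boilerplate:

```python
import re
from typing import Callable, List

# The prefix comes from the issue; all other names here are hypothetical.
FAQ_URL_PREFIX = "https://ccss.carleton.ca/resources/faq/questions/"

# Matches a footer line such as "© 2025 Carleton Computer Science Society".
# Any four-digit year is accepted so the rule keeps working after 2025.
FOOTER_LINE = re.compile(r"^\s*©\s*\d{4}\s+Carleton Computer Science Society\s*$")


def is_ccss_faq_url(url: str) -> bool:
    """Return True when the URL falls under the CCSS FAQ questions path."""
    return url.startswith(FAQ_URL_PREFIX)


def strip_boilerplate(text: str) -> str:
    """Drop footer lines and empty lines, keeping the remaining lines in order."""
    kept = [line.strip() for line in text.splitlines() if not FOOTER_LINE.match(line)]
    return "\n".join(line for line in kept if line)


def chunk_page(url: str, text: str, default_chunker: Callable[[str], List[str]]) -> List[str]:
    """Return one cleaned chunk for FAQ pages, otherwise defer to the default chunker."""
    if is_ccss_faq_url(url):
        cleaned = strip_boilerplate(text)
        return [cleaned] if cleaned else []
    return default_chunker(text)
```

Passing the default chunker in as an argument keeps the FAQ branch easy to test in isolation; in the real service the branch would more likely sit directly in front of the existing chunking call.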

✅ Acceptance Criteria

  • If the URL matches the pattern https://ccss.carleton.ca/resources/faq/questions/**, ingest the page as a single chunk.
  • Do not split the content into multiple chunks.
  • Exclude footer content such as © 2025 Carleton Computer Science Society from the chunk.
  • The resulting chunk should contain only the meaningful FAQ content.
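
The criteria above could be checked with a test along these lines. This is only a sketch: `ingest_faq_page` is a hypothetical stand-in so the example runs on its own, and `example-question` is a made-up slug. In the real test, the stand-in would be replaced by a call into WebpageIngestionService.

```python
import unittest
from typing import List

FOOTER = "© 2025 Carleton Computer Science Society"
# "example-question" is a made-up slug used only for this sketch.
FAQ_URL = "https://ccss.carleton.ca/resources/faq/questions/example-question"


def ingest_faq_page(url: str, text: str) -> List[str]:
    """Hypothetical stand-in for the service; swap in the real entry point."""
    lines = [line.strip() for line in text.splitlines()]
    body = "\n".join(line for line in lines if line and line != FOOTER)
    return [body]


class FaqIngestionTest(unittest.TestCase):
    PAGE = "Question?\n\nFirst paragraph.\n\nSecond paragraph.\n\n" + FOOTER + "\n"

    def setUp(self):
        self.chunks = ingest_faq_page(FAQ_URL, self.PAGE)

    def test_single_chunk(self):
        # The whole page must come back as exactly one chunk.
        self.assertEqual(len(self.chunks), 1)

    def test_footer_excluded(self):
        self.assertNotIn(FOOTER, self.chunks[0])

    def test_faq_content_kept(self):
        # Content from every paragraph should survive in the single chunk.
        self.assertIn("First paragraph.", self.chunks[0])
        self.assertIn("Second paragraph.", self.chunks[0])
```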

Metadata

Assignees: No one assigned
Labels: enhancement (New feature or request)
Type: No type
Project status: Ready
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests