Open
Labels
enhancement (New feature or request)
Description
NOTE: Don't assign yourself unless you have confirmed with Matthew that you've got a working environment.
🧠 Context
When ingesting FAQ pages from the Carleton Computer Science Society (CCSS) site (i.e., URLs like https://ccss.carleton.ca/resources/faq/questions/**), the current logic creates multiple chunks, including one just for the footer text (© 2025 Carleton Computer Science Society), which pollutes the index.
These pages should be treated as structured, self-contained documents. Rather than splitting them up, we should ingest the entire page as a single chunk and explicitly exclude generic or boilerplate content like the footer.
🛠 Implementation Plan
- In WebpageIngestionService, detect if the source URL starts with https://ccss.carleton.ca/resources/faq/questions/.
- If it matches, bypass the default chunking logic and instead:
  - Strip out the footer and boilerplate content.
  - Store the entire cleaned-up page content as one chunk.
- Add a unit test to ensure:
  - The page is ingested as a single chunk.
  - The chunk does not contain the © 2025 Carleton Computer Science Society text.
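The plan above could be sketched roughly as follows. This is a minimal illustration, not the actual WebpageIngestionService code: the project's language and internals are not shown in this issue, so Python is an assumed choice and every function name here (`is_ccss_faq_url`, `strip_boilerplate`, `default_chunk`, `chunk_page`) is hypothetical. Only the URL prefix and the footer text come from the issue. Matching the footer for any four-digit year is also an assumption, made so the rule does not break when the year changes.

```python
import re
from typing import List

# URL prefix taken from the issue text.
FAQ_URL_PREFIX = "https://ccss.carleton.ca/resources/faq/questions/"

# Matches a whole line such as "© 2025 Carleton Computer Science Society".
# Accepting any four-digit year is an assumption, not part of the issue.
FOOTER_PATTERN = re.compile(
    r"^\s*©\s*\d{4}\s+Carleton Computer Science Society\s*$"
)


def is_ccss_faq_url(url: str) -> bool:
    """Return True when the source URL is a CCSS FAQ question page."""
    return url.startswith(FAQ_URL_PREFIX)


def strip_boilerplate(text: str) -> str:
    """Drop footer lines and trim surrounding whitespace."""
    kept = [line for line in text.splitlines() if not FOOTER_PATTERN.match(line)]
    return "\n".join(kept).strip()


def default_chunk(text: str) -> List[str]:
    """Hypothetical stand-in for the existing chunking logic:
    one chunk per blank-line-separated paragraph."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def chunk_page(url: str, text: str) -> List[str]:
    """Return one cleaned chunk for FAQ pages, default chunks otherwise."""
    if is_ccss_faq_url(url):
        return [strip_boilerplate(text)]
    return default_chunk(text)
```

In this sketch, "boilerplate" covers only the footer line. Any other boilerplate (navigation, headers) would need its own rule once the real page structure is confirmed.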
✅ Acceptance Criteria
- If the URL matches the pattern https://ccss.carleton.ca/resources/faq/questions/**, ingest the page as a single chunk.
- Do not split the content into multiple chunks.
- Exclude footer content such as © 2025 Carleton Computer Science Society from the chunk.
- The resulting chunk should contain only the meaningful FAQ content.
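The mechanically checkable criteria could be expressed as a small helper for a unit test to call. This is a hedged sketch: `acceptance_failures` is a hypothetical name, Python is an assumed language, and the helper only inspects a list of chunk strings, so it makes no assumption about how the real service produces them. The last criterion ("only the meaningful FAQ content") is a judgment call that a check like this cannot fully verify. It is approximated here by rejecting empty chunks.

```python
from typing import List

# Both constants are taken from the issue text.
FAQ_URL_PREFIX = "https://ccss.carleton.ca/resources/faq/questions/"
FOOTER_TEXT = "© 2025 Carleton Computer Science Society"


def acceptance_failures(url: str, chunks: List[str]) -> List[str]:
    """Return a list of violated criteria; an empty list means all pass."""
    failures: List[str] = []
    if not url.startswith(FAQ_URL_PREFIX):
        # The criteria only apply to CCSS FAQ question pages.
        return failures
    if len(chunks) != 1:
        failures.append(f"expected 1 chunk, got {len(chunks)}")
    if any(FOOTER_TEXT in chunk for chunk in chunks):
        failures.append("footer text present in a chunk")
    if any(not chunk.strip() for chunk in chunks):
        failures.append("empty chunk")
    return failures
```

A test could then assert that `acceptance_failures(url, chunks)` is empty for the chunks produced from a sample FAQ page. Returning messages rather than a boolean keeps the failure output readable.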
Status
Ready