Open
Labels: enhancement (New feature or request)
Description
NOTE: Don't assign yourself unless you have confirmed with Matthew that you've got a working environment.
🧠 Context
Pages under https://carleton.ca/scs/** follow a consistent layout where the meaningful page content is contained within a specific section of the HTML (<div id="content"> or similar). However, our current ingestion logic does not account for this, so it may pick up irrelevant navigation bars, side menus, or other layout elements.
See the green section. We don't want the navbar ingested each time.
To improve quality and consistency, we should restrict ingestion for these pages to only the main content section.
🛠 Implementation Plan
- In WebpageIngestionService, detect if the source URL starts with https://carleton.ca/scs/.
- If it matches:
  - Parse the HTML and extract only the content within the main section (typically <div id="content">).
  - Use this content for chunking instead of the full page body.
- Add a test with a sample HTML page from carleton.ca/scs to verify that only the expected content is ingested.
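The steps above could be sketched roughly as follows. This is only an illustration in Python using the standard library; the language and internals of WebpageIngestionService are not shown in this issue, so the names `MainContentExtractor`, `text_for_chunking`, `SCS_PREFIX`, and `CONTENT_ID` are hypothetical, and returning `None` to mean "use the existing full-page logic" is an assumption.

```python
from html.parser import HTMLParser

SCS_PREFIX = "https://carleton.ca/scs/"  # URL prefix named in this issue
CONTENT_ID = "content"  # assumed id of the main content container

# Void elements never have a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MainContentExtractor(HTMLParser):
    """Collects text found inside the element whose id equals content_id."""

    def __init__(self, content_id=CONTENT_ID):
        super().__init__()
        self.content_id = content_id
        self.depth = 0  # 0 means we are outside the content container
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1  # nested element inside the container
        elif dict(attrs).get("id") == self.content_id:
            self.depth = 1  # entering the container

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1  # reaches 0 when the container itself closes

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.parts.append(data.strip())


def text_for_chunking(url, html):
    """Return main-content text for SCS pages, or None to keep the default path."""
    if not url.startswith(SCS_PREFIX):
        return None
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts) or None
```

A test along the lines of the last step could feed a small sample page containing a nav, a `<div id="content">`, and a footer, then check that only the text from the content div comes back. Note that the depth counter assumes well-balanced markup: pages that omit optional closing tags (for example `</p>` or `</li>`) would need a more tolerant parser, and script or style text inside the container is not filtered out here.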
✅ Acceptance Criteria
- When ingesting pages from https://carleton.ca/scs/**, extract content only from the main content section of the page (e.g., <div id="content">).
- Exclude headers, navigation, footers, sidebars, or any boilerplate elements.
- The chunk(s) should contain only the relevant main body content.
Metadata
Projects
Status
Ready