
Feat/dom source extraction #24

Open
joaocarlos wants to merge 2 commits into PleasePrompto:main from joaocarlos:feat/dom-source-extraction
@joaocarlos

This PR adds DOM-based source extraction for ask_question, so citations are captured from NotebookLM’s response UI instead of asking the model to generate them explicitly.

What changed

  • Added include_sources?: boolean to ask_question.
  • Added global default setting always-include-sources (default true).
  • Added config support:
    • CLI: config set always-include-sources <true|false>
    • Env: `NOTEBOOKLM_ALWAYS_INCLUDE_SOURCES=true|false`
  • Updated `ask_question` response payload to include:
    • include_sources (effective value used)
    • `sources` (structured parsed references)
  • Implemented DOM citation parsing in `page-utils`:
    • Extracts from citation elements like `.citation-marker` and citation dialog-related buttons.
    • Normalizes citation text into `SourceReference` entries.
    • Filters UI noise entries such as ... and “Show additional citations”.
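
To illustrate the normalization and noise-filtering step, here is a minimal sketch. It is not the code in `page-utils`: the helper name `normalizeCitations` and the noise list are hypothetical, and the `SourceReference` shape is inferred only from the example output in this PR. It assumes the raw citation strings have already been read from the DOM.

```typescript
// Shape inferred from the example output in this PR; the real type may carry more fields.
interface SourceReference {
  title: string;
  raw_text: string;
}

// Labels treated as UI noise, taken from the PR description; the real list may differ.
const NOISE_LABELS = new Set(["...", "…", "show additional citations"]);

// Hypothetical helper: turn raw citation strings into SourceReference entries.
function normalizeCitations(rawTexts: string[]): SourceReference[] {
  const sources: SourceReference[] = [];
  for (const raw of rawTexts) {
    // Collapse whitespace so DOM line breaks do not leak into the output.
    const rawText = raw.replace(/\s+/g, " ").trim();
    if (rawText === "" || NOISE_LABELS.has(rawText.toLowerCase())) continue;
    // Strip a leading "N: " citation index, if present, to obtain the title.
    const title = rawText.replace(/^\d+\s*:\s*/, "");
    if (title === "") continue;
    sources.push({ title, raw_text: rawText });
  }
  return sources;
}
```

Keeping this step a pure function over strings would make it testable without a live browser session.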

Why implement this?

  • Extracts citations directly from the response DOM, so the model spends no tokens or prompt effort on citation formatting.
  • Uses NotebookLM’s own citation UI as the source of truth.
  • Makes source inclusion automatic and configurable both globally and per call.
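
The global/per-call behaviour could be resolved roughly as follows. This is a sketch under assumptions, not the PR's implementation: `resolveIncludeSources` and `parseBool` are hypothetical names, and the precedence of the environment variable over the stored CLI setting is a guess. Only the per-call flag, the `NOTEBOOKLM_ALWAYS_INCLUDE_SOURCES` variable, and the default of true come from the description above.

```typescript
// Hypothetical parser for boolean-like values; undefined when unset or unrecognized.
function parseBool(value: string | undefined): boolean | undefined {
  if (value === undefined) return undefined;
  const v = value.trim().toLowerCase();
  if (v === "true") return true;
  if (v === "false") return false;
  return undefined;
}

// Hypothetical resolver for the effective include_sources value.
function resolveIncludeSources(
  perCall: boolean | undefined,
  storedSetting: boolean | undefined,
  env: Record<string, string | undefined>,
): boolean {
  // An explicit per-call include_sources always wins.
  if (perCall !== undefined) return perCall;
  // Assumption: the environment variable overrides the stored CLI setting.
  const fromEnv = parseBool(env["NOTEBOOKLM_ALWAYS_INCLUDE_SOURCES"]);
  if (fromEnv !== undefined) return fromEnv;
  // Fall back to always-include-sources, which defaults to true.
  return storedSetting ?? true;
}
```

The returned value corresponds to the effective include_sources reported in the response payload.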

Validation performed

  • `npm run build` (pass)
  • Live NotebookLM E2E with local build:
    • `includeSources=true` returned non-zero source extraction (9 sources in test run).
  • MCP stdio integration (tools/call → `ask_question`):
    • confirmed include_sources: true
    • confirmed `sources_count > 0`

Implementation note

Since I'm not working on a project right now, I could only demonstrate this with an MVP. Further testing is needed to confirm it does not break other features.

Example output

  {
    "success": true,
    "data": {
      "include_sources": true,
      "sources": [
        {
          "title": "J.Puthiyidam and Joseph - 2023 - Prioritization of MQTT Messages A Novel Approach.pdf",
          "raw_text": "1: J.Puthiyidam and Joseph - 2023 - Prioritization of MQTT Messages A Novel Approach.pdf"
        },
        {
          "title": "Shahri et al. - 2022 - Extending MQTT with Real-Time Communication Services Based on SDN.pdf",
          "raw_text": "4: Shahri et al. - 2022 - Extending MQTT with Real-Time Communication Services Based on SDN.pdf"
        }
      ]
    }
  }

Branch: feat/dom-source-extraction
Commit: b898219

Issue

References #20

Add hard polling guards and browser responsiveness checks in waitForLatestAnswer to avoid CPU spin when page.waitForTimeout returns early in zombie browser states.
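
A hard polling guard of this kind might look like the sketch below. It is an illustration only, not the code in waitForLatestAnswer: the `PollGuard` class, its option names, and the "under half the interval counts as early" rule are all assumptions.

```typescript
interface PollGuardOptions {
  startedAt: number;     // timestamp (ms) when polling began
  timeoutMs: number;     // hard wall-clock limit
  intervalMs: number;    // sleep requested between polls
  maxIterations: number; // hard cap on loop iterations
  maxFastTicks: number;  // consecutive early sleeps tolerated
}

// Hypothetical guard: lets a polling loop bail out instead of spinning
// when its sleep call keeps resolving early.
class PollGuard {
  private iterations = 0;
  private fastTicks = 0;
  private lastTick: number;
  private readonly opts: PollGuardOptions;

  constructor(opts: PollGuardOptions) {
    this.opts = opts;
    this.lastTick = opts.startedAt;
  }

  // Call once per iteration, after the sleep. Returns a stop reason, or null to continue.
  tick(now: number): string | null {
    this.iterations += 1;
    // Assumption: a sleep shorter than half the requested interval "returned early".
    const early = now - this.lastTick < this.opts.intervalMs / 2;
    this.fastTicks = early ? this.fastTicks + 1 : 0;
    this.lastTick = now;
    if (now - this.opts.startedAt >= this.opts.timeoutMs) return "timeout";
    if (this.iterations >= this.opts.maxIterations) return "max-iterations";
    if (this.fastTicks >= this.opts.maxFastTicks) return "sleep-returning-early";
    return null;
  }
}
```

In a loop, one would call `guard.tick(Date.now())` after each `page.waitForTimeout(intervalMs)` and abort, or trigger a responsiveness check, as soon as it returns a reason.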

Also unify recoverable browser error detection and use it in BrowserSession recovery paths for ask/reset/newPage.
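
Unified detection could be a single predicate shared by the ask, reset, and newPage recovery paths. The sketch below is an assumption about its shape: `isRecoverableBrowserError` is a hypothetical name, and the message patterns are illustrative examples of closed-target errors, not the list used in the referenced change.

```typescript
// Illustrative patterns only; the actual set of recoverable messages is an assumption.
const RECOVERABLE_PATTERNS: RegExp[] = [
  /target (page, context or browser )?(has been )?closed/i,
  /browser has been closed/i,
];

// Hypothetical predicate: true when the error suggests the browser session
// is gone and should be recreated rather than retried as-is.
function isRecoverableBrowserError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return RECOVERABLE_PATTERNS.some((pattern) => pattern.test(message));
}
```

Routing every recovery path through one predicate keeps the decision about what counts as recoverable in a single place.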

Refs PleasePrompto#16