Add standalone MCP docs server mounting by rmusser01 · Pull Request #2565 · rmusser01/tldw_server

rmusser01 · 2026-07-01T03:28:53Z

Summary

Adds a runtime-neutral standalone docs mount/factory with locked_down, local_first, and online_capable profiles.
Adds a tldw_server docs host adapter boundary and routes DocsModule settings/scope creation through it.
Adds standalone mount, host adapter, registration, package discovery, and import-boundary regression coverage.

Test Plan

/Users/macbook-dev/Documents/GitHub/tldw_server2/.venv/bin/python -m pytest tldw_Server_API/tests/MCP_unified/docs -q --tb=short
/Users/macbook-dev/Documents/GitHub/tldw_server2/.venv/bin/python -m pytest tldw_Server_API/tests/MCP_unified tldw_Server_API/app/core/MCP_unified/tests/test_dynamic_module_catalog.py -k "docs or write_tools or validator or dynamic_module_catalog" -q --tb=short
Standalone import smoke: True False loaded_optional= []
/Users/macbook-dev/Documents/GitHub/tldw_server2/.venv/bin/python -m black --check ...
Bandit touched-scope scan: errors: [], results: []

Merge Gate

This PR is AI-authored and intentionally draft. A human requester must add the required Change summary explaining what changed and why these implementation choices were made before merge.

coderabbitai · 2026-07-01T03:29:00Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


gemini-code-assist

Code Review

This pull request introduces extensive documentation and implementation plans across several modules, including the new Audio Studio workspace, a comprehensive Data Flow Atlas, the Calendar module with CalDAV integration, and the Research Workspace migration protocol. It also updates the Chatbook API documentation to support the v1.1 format specification and adds setup guides for OmniVoice TTS. The review feedback correctly identifies a syntax error in the CalDAV smoke test documentation where JSON placeholders are unquoted, which would cause parsing errors.


qodo-code-review · 2026-07-01T03:57:59Z

PR Summary by Qodo

Add standalone MCP docs mount and tldw_server host adapter wiring

✨ Enhancement 🧪 Tests 📝 Documentation ⚙️ Configuration changes 🕐 40+ Minutes

AI Description

• Add runtime-neutral docs corpus package with SQLite store, importers, retrieval, and optional URL ingestion.
• Introduce tldw_server host adapter boundary and register DocsModule via mcp_modules.yaml.
• Add regression tests for standalone mounting, host adapter scope mapping, and import/package boundaries.

Diagram

graph TD
  A["tldw_server MCP runtime"] --> B["DocsModule"] --> C["Docs host adapter"] --> D["DocsMCPToolProvider"] --> E[("SQLite docs DB")]
  D --> F["Local import service"] --> E
  D --> G["URL acquisition (optional)"] --> E

  subgraph Legend
    direction LR
    _svc(["Service"]) ~~~ _mod["Module"] ~~~ _db[("Database")]
  end

High-Level Assessment

The following are alternative approaches to this PR:

1. Use httpx/requests + urllib3 for URL fetching

➕ Less custom HTTP parsing/transport code to maintain
➕ Battle-tested redirect and TLS handling
➖ Adds runtime dependencies (conflicts with runtime-neutral goal)
➖ Harder to guarantee DNS-rebinding/IP egress controls without deeper hooks

2. Centralize settings validation in host (ModuleConfig schema)

➕ Single validation source for server deployments
➕ Earlier failure with clearer operator-facing errors
➖ Standalone mount still needs validation logic
➖ Couples runtime-neutral package to host concerns if not carefully layered

3. Use existing document store/RAG library instead of custom SQLite schema

➕ Faster iteration for advanced retrieval (embeddings, reranking)
➕ Less bespoke SQL/FTS maintenance
➖ Introduces heavier dependencies and operational complexity
➖ May not meet strict offline/local-first constraints

Recommendation: Current approach (runtime-neutral provider + explicit host adapter boundary + stdlib-only optional web acquisition) best fits the stated goals: standalone usability, import boundary hygiene, and security controls. Keep the custom fetcher/policy for now, but consider swapping the transport layer behind a protocol to allow an optional httpx-based implementation when dependency constraints are relaxed.
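
The transport swap suggested in the recommendation could be sketched roughly as below. This is an illustration, not code from the PR: `Transport`, `FetchResponse`, `UrllibTransport`, `StaticTransport`, and `fetch_text` are hypothetical names. The stdlib implementation uses `urllib.request` only to keep the example short; unlike the PR's fetcher it follows redirects on its own and does not dial pre-validated addresses, so it carries none of the DNS/egress protections described elsewhere in this PR.

```python
from __future__ import annotations

import urllib.request
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class FetchResponse:
    """Hypothetical normalized response shared by every transport."""

    status_code: int
    headers: dict[str, str] = field(default_factory=dict)
    body: bytes = b""


class Transport(Protocol):
    """Hypothetical seam: any object with this method can back the fetcher."""

    def request(self, url: str, *, timeout: float) -> FetchResponse: ...


class UrllibTransport:
    """Stdlib-only transport, for illustration (no egress policy applied)."""

    def request(self, url: str, *, timeout: float) -> FetchResponse:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            headers = {key.lower(): value for key, value in resp.headers.items()}
            return FetchResponse(resp.status, headers, resp.read())


class StaticTransport:
    """Test double that returns a canned response without touching the network."""

    def __init__(self, response: FetchResponse) -> None:
        self._response = response

    def request(self, url: str, *, timeout: float) -> FetchResponse:
        return self._response


def fetch_text(transport: Transport, url: str, *, timeout: float = 10.0) -> str:
    """Fetch a URL through whichever transport the caller supplies."""
    response = transport.request(url, timeout=timeout)
    if response.status_code != 200:
        raise ValueError(f"unexpected status {response.status_code}")
    return response.body.decode("utf-8", errors="replace")
```

With a seam like this, an optional httpx-backed class would only need to provide the same `request` method; the calling code would not change.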

Files changed (62) +13399 / -5

Enhancement (28) +3682 / -0

__init__.py: Expose docs public API and standalone mount helpers +45/-0
• Creates the package entrypoint exporting settings/models/provider/store and standalone mount/profile helpers.
apps/mcp-unified/src/mcp_unified/docs/__init__.py

__init__.py: Add acquisition package exports +52/-0
• Defines acquisition subpackage boundary and public exports for URL ingestion support.
apps/mcp-unified/src/mcp_unified/docs/acquisition/__init__.py

extract.py: Implement optional HTML/text extraction with lazy rich extractors +150/-0
• Adds extraction that prefers trafilatura/BeautifulSoup when available but falls back to static parsing/text decoding without importing optional dependencies at import time.
apps/mcp-unified/src/mcp_unified/docs/acquisition/extract.py

fetcher.py: Implement policy-gated URL fetcher with DNS/egress validation +386/-0
• Adds a stdlib-based HTTP(S) fetcher enforcing redirect limits, content-type/size constraints, resolver-based IP checks, and a transport that dials validated addresses to mitigate DNS rebinding.
apps/mcp-unified/src/mcp_unified/docs/acquisition/fetcher.py

models.py: Add acquisition data models and protocols +114/-0
• Defines normalized URL/policy/fetch result dataclasses and resolver/transport protocols used by acquisition components.
apps/mcp-unified/src/mcp_unified/docs/acquisition/models.py

policy.py: Add URL normalization and source policy evaluation +367/-0
• Implements URL parsing/normalization, domain/prefix rules, local/legacy host denial, and redacted logging fields with fail-closed behavior for ambiguous inputs.
apps/mcp-unified/src/mcp_unified/docs/acquisition/policy.py

resolver.py: Add stdlib DNS resolver and unsafe IP detection +31/-0
• Introduces resolver utilities used by the fetcher to detect private/unsafe egress targets.
apps/mcp-unified/src/mcp_unified/docs/acquisition/resolver.py
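
As an illustration of what unsafe-egress detection of this kind can look like, here is a minimal stdlib sketch. It is not the PR's resolver: `is_unsafe_ip` and `resolve_validated` are hypothetical names, and the exact set of address classes the PR rejects is not shown in this summary, so the categories below are an assumption.

```python
from __future__ import annotations

import ipaddress
import socket


def is_unsafe_ip(address: str) -> bool:
    """Return True for addresses a docs fetcher should refuse to dial."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return True  # fail closed on anything that does not parse
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so the IPv4 checks apply.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def resolve_validated(host: str, port: int) -> list[str]:
    """Resolve host and return its addresses only if every one is safe."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses = sorted({info[4][0] for info in infos})
    if not addresses or any(is_unsafe_ip(addr) for addr in addresses):
        raise PermissionError(f"unsafe or empty resolution for {host!r}")
    return addresses
```

Connecting to one of the returned addresses directly, rather than resolving the hostname a second time at connect time, is what closes the DNS-rebinding window mentioned in the fetcher entry above.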

service.py: Add docs URL acquisition ingestion service +140/-0
• Adds ingestion orchestration that fetches, extracts, chunks, and upserts URL content into the store with status reporting and change detection via content hash.
apps/mcp-unified/src/mcp_unified/docs/acquisition/service.py

errors.py: Add DocsError for machine-readable failures +19/-0
• Introduces a structured exception type used across import/acquisition/store flows to preserve error codes and details.
apps/mcp-unified/src/mcp_unified/docs/errors.py

__init__.py: Add importers package exports +5/-0
• Defines the importers subpackage and its public surface.
apps/mcp-unified/src/mcp_unified/docs/importers/__init__.py

base.py: Add importer base models and chunking helpers +49/-0
• Defines ParsedDocument/section structures and shared chunking logic used by local and URL import paths.
apps/mcp-unified/src/mcp_unified/docs/importers/base.py

html.py: Add HTML parsing utilities +94/-0
• Implements HTML parsing into ParsedDocument, including title/section extraction for downstream storage.
apps/mcp-unified/src/mcp_unified/docs/importers/html.py

local.py: Add trusted-roots local import service +132/-0
• Implements docs.import_path with trusted-root enforcement, file type/size checks, parsing, chunking, and store upsert.
apps/mcp-unified/src/mcp_unified/docs/importers/local.py
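
Trusted-root enforcement of this sort usually comes down to resolving the candidate path and checking containment. The sketch below assumes that approach; `assert_under_trusted_roots` is a hypothetical helper, not the PR's `_assert_allowed_path`, whose body is not shown here.

```python
from __future__ import annotations

from pathlib import Path


def assert_under_trusted_roots(candidate: Path, trusted_roots: list[Path]) -> Path:
    """Return the resolved path if it lies under a trusted root, else raise."""
    # resolve() collapses ".." segments and follows symlinks, so a link that
    # points outside a root is judged by its real location.
    resolved = candidate.resolve()
    for root in trusted_roots:
        if resolved.is_relative_to(root.resolve()):
            return resolved
    raise PermissionError(f"path is outside trusted roots: {resolved}")
```

`Path.is_relative_to` requires Python 3.9 or newer. Comparing resolved paths, rather than raw strings, is what keeps inputs such as `docs/../secrets.md` from passing the check.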

markdown.py: Add Markdown/text parsing helpers +38/-0
• Adds markdown/text parsing into ParsedDocument, including basic title handling.
apps/mcp-unified/src/mcp_unified/docs/importers/markdown.py

mcp_module.py: Add DocsMCPToolProvider tool surface and execution router +331/-0
• Implements MCP tool definitions (search/get/context/resolve/list/import/management + Context7-compatible tools) and dispatches executions to retrieval/import/acquisition services.
apps/mcp-unified/src/mcp_unified/docs/mcp_module.py

models.py: Add docs domain models (scope, requests, results) +68/-0
• Defines AccessScope and request/result dataclasses used across provider, store, and retrieval layers.
apps/mcp-unified/src/mcp_unified/docs/models.py

__init__.py: Add retrieval package exports +7/-0
• Defines retrieval subpackage boundary and export surface.
apps/mcp-unified/src/mcp_unified/docs/retrieval/__init__.py

aliases.py: Add alias resolver and Context7 library-id mapping +25/-0
• Implements name resolution across collections/keywords/packages and provides Context7-style library id resolution.
apps/mcp-unified/src/mcp_unified/docs/retrieval/aliases.py

context.py: Add bounded context pack builder +81/-0
• Builds a capped context response from search results with max-chunks/documents/characters budgets and citation aggregation.
apps/mcp-unified/src/mcp_unified/docs/retrieval/context.py
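
To make the three budgets concrete, here is one way a capped context pack could be assembled from ranked hits. The `Chunk` type, the `build_context_pack` signature, and the skip-versus-stop behavior are all assumptions for illustration; the PR's builder may select and truncate differently.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical search hit; the PR's real result type may differ."""

    document_id: str
    text: str


def build_context_pack(
    hits: list[Chunk],
    *,
    max_chunks: int,
    max_documents: int,
    max_characters: int,
) -> dict[str, object]:
    """Greedily take ranked hits while all three budgets still hold."""
    selected: list[Chunk] = []
    citations: list[str] = []  # document ids in first-seen order
    used = 0
    for hit in hits:
        if len(selected) >= max_chunks:
            break  # chunk budget exhausted
        is_new_document = hit.document_id not in citations
        if is_new_document and len(citations) >= max_documents:
            continue  # would exceed the document budget
        if used + len(hit.text) > max_characters:
            continue  # too large; a shorter later hit may still fit
        selected.append(hit)
        used += len(hit.text)
        if is_new_document:
            citations.append(hit.document_id)
    return {"chunks": selected, "citations": citations, "characters": used}
```

Skipping an oversized hit instead of stopping lets shorter, lower-ranked chunks fill the remaining character budget, at the cost of no longer returning a strict prefix of the ranking.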

search.py: Add retrieval service wrapper around store search/get/list +34/-0
• Introduces a thin service layer that calls the store for chunk search, document retrieval, and listing endpoints.
apps/mcp-unified/src/mcp_unified/docs/retrieval/search.py

settings.py: Add DocsSettings with coercion and validation +147/-0
• Defines docs settings (db path, trusted roots, web acquisition/policy knobs) and strict coercion/validation via from_mapping.
apps/mcp-unified/src/mcp_unified/docs/settings.py

standalone.py: Add standalone docs mount factory and profile defaults +69/-0
• Provides create_standalone_docs_mount and profile-based defaults (locked_down/local_first/online_capable) to run docs tools outside the tldw_server runtime.
apps/mcp-unified/src/mcp_unified/docs/standalone.py

__init__.py: Add store package exports +5/-0
• Defines store subpackage boundary and exports the SQLite-backed store.
apps/mcp-unified/src/mcp_unified/docs/store/__init__.py

schema.sql: Add SQLite schema for docs corpus and FTS index +130/-0
• Introduces tables for documents, sections, chunks, collections, keywords, aliases, and an FTS5 virtual table for chunk search.
apps/mcp-unified/src/mcp_unified/docs/store/schema.sql

sqlite.py: Implement SQLite docs catalog store with migrations and retrieval APIs +1086/-0
• Adds the primary persistence layer including schema migration, scoped upserts, FTS search, and collection/keyword/alias operations.
apps/mcp-unified/src/mcp_unified/docs/store/sqlite.py

__init__.py: Export docs host adapter helpers +3/-0
• Re-exports docs settings/scope conversion helpers for the docs module shim.
tldw_Server_API/app/core/MCP_unified/adapters/docs/__init__.py

config.py: Add docs settings and scope mapping from host runtime +20/-0
• Maps ModuleConfig.settings into DocsSettings and RequestContext metadata/user_id into AccessScope.
tldw_Server_API/app/core/MCP_unified/adapters/docs/config.py

docs_module.py: Add DocsModule host shim delegating to DocsMCPToolProvider +54/-0
• Introduces the server-side module that lazily creates the provider from module config, validates basic tool args, and passes scoped execution based on RequestContext.
tldw_Server_API/app/core/MCP_unified/modules/implementations/docs_module.py

Tests (16) +2316 / -5

test_dynamic_module_catalog.py: Add docs module config and registration regression tests +61/-4
• Adds tests asserting docs module is declared in mcp_modules.yaml and that server registration exposes docs.status but not docs.ingest_url when web acquisition is disabled.
tldw_Server_API/app/core/MCP_unified/tests/test_dynamic_module_catalog.py

test_runtime_package_boundary.py: Extend runtime package boundary checks for docs subpackages +6/-0
• Updates packaging metadata assertions to include mcp_unified.docs subpackages and schema.sql as package data.
tldw_Server_API/app/core/MCP_unified/tests/test_runtime_package_boundary.py

test_write_tools_validators.py: Adjust validator coverage for docs write tools +3/-1
• Updates write-tool validation expectations to align with new docs tool categories and argument requirements.
tldw_Server_API/app/core/MCP_unified/tests/test_write_tools_validators.py

test_docs_acquisition_extract.py: Add tests for extraction fallbacks and optional extractors +79/-0
• Adds coverage for content-type based extraction behavior and lazy optional dependency loading.
tldw_Server_API/tests/MCP_unified/docs/test_docs_acquisition_extract.py

test_docs_acquisition_fetcher.py: Add URL fetcher security and redirect/content tests +295/-0
• Adds tests for policy gating, redirects, size/content-type enforcement, DNS/egress protections, and transport validation behavior.
tldw_Server_API/tests/MCP_unified/docs/test_docs_acquisition_fetcher.py

test_docs_acquisition_policy.py: Add comprehensive URL policy tests +333/-0
• Adds tests validating URL normalization, denial of local/ambiguous hosts, rule matching, and redaction behavior.
tldw_Server_API/tests/MCP_unified/docs/test_docs_acquisition_policy.py

test_docs_acquisition_service.py: Add tests for URL ingestion service results and upserts +173/-0
• Adds ingestion tests covering disabled capability handling, fetch/extract outcomes, and created/updated/unchanged semantics.
tldw_Server_API/tests/MCP_unified/docs/test_docs_acquisition_service.py

test_docs_host_adapter.py: Add tests for host adapter settings and scope mapping +45/-0
• Verifies module config mapping preserves locked-down defaults and that RequestContext maps to AccessScope correctly.
tldw_Server_API/tests/MCP_unified/docs/test_docs_host_adapter.py

test_docs_import_boundaries.py: Add import-boundary regression tests for runtime-neutral docs package +74/-0
• Ensures mcp_unified.docs imports without tldw_server coupling and does not import optional web dependencies by default.
tldw_Server_API/tests/MCP_unified/docs/test_docs_import_boundaries.py

test_docs_importers.py: Add local importer parsing and trusted-roots tests +163/-0
• Adds tests for supported suffix handling, file size enforcement, and trusted-root path restrictions.
tldw_Server_API/tests/MCP_unified/docs/test_docs_importers.py

test_docs_mcp_provider.py: Add provider tool surface and Context7 compatibility tests +274/-0
• Tests tool definitions, readOnly hints/categories, Context7 resolver routing, and ingest_url enablement/disablement behavior.
tldw_Server_API/tests/MCP_unified/docs/test_docs_mcp_provider.py

test_docs_module_shim.py: Add DocsModule shim tests for execution and validation +113/-0
• Adds tests ensuring DocsModule delegates to provider and rejects missing required arguments.
tldw_Server_API/tests/MCP_unified/docs/test_docs_module_shim.py

test_docs_retrieval_context.py: Add bounded context builder tests +103/-0
• Validates context pack budgeting and chunk selection behavior.
tldw_Server_API/tests/MCP_unified/docs/test_docs_retrieval_context.py

test_docs_schema_store.py: Add SQLite store schema and query behavior tests +389/-0
• Adds coverage for migrations, upserts, scoped uniqueness, FTS search, and collection/keyword/alias operations.
tldw_Server_API/tests/MCP_unified/docs/test_docs_schema_store.py

test_docs_settings.py: Add DocsSettings coercion and validation tests +123/-0
• Tests boolean/number coercion, trusted roots parsing, and source profile validation.
tldw_Server_API/tests/MCP_unified/docs/test_docs_settings.py

test_docs_standalone_mount.py: Add standalone mount tests and profile behavior coverage +82/-0
• Verifies standalone mount defaults, import/search/context operations, and profile-driven URL ingestion advertisement.
tldw_Server_API/tests/MCP_unified/docs/test_docs_standalone_mount.py

Documentation (15) +7366 / -0

2026-06-30-standalone-mcp-docs-corpus-stage1-plan.md: Add Stage 1 standalone docs corpus implementation plan +2468/-0
• Introduces a detailed plan covering goals, phases, and deliverables for a runtime-neutral docs corpus and MCP tool surface.
Docs/superpowers/plans/2026-06-30-standalone-mcp-docs-corpus-stage1-plan.md

2026-06-30-standalone-mcp-docs-url-acquisition-implementation-plan.md: Add URL acquisition implementation plan +2350/-0
• Documents a staged plan for URL fetching, extraction, and policy enforcement for docs ingestion.
Docs/superpowers/plans/2026-06-30-standalone-mcp-docs-url-acquisition-implementation-plan.md

2026-07-01-standalone-mcp-docs-stage3-server-mounting-plan.md: Add server mounting plan for docs module +746/-0
• Describes how the standalone docs package is mounted into the tldw_server MCP runtime, including adapter boundaries and configuration.
Docs/superpowers/plans/2026-07-01-standalone-mcp-docs-stage3-server-mounting-plan.md

2026-06-30-standalone-mcp-docs-catalog-design.md: Add docs catalog design spec +856/-0
• Specifies the docs catalog data model, indexing strategy, and tool-level behavior for corpus storage and retrieval.
Docs/superpowers/specs/2026-06-30-standalone-mcp-docs-catalog-design.md

2026-06-30-standalone-mcp-docs-url-acquisition-design.md: Add URL acquisition design spec +376/-0
• Defines URL policy, normalization, and extraction design with security-focused constraints.
Docs/superpowers/specs/2026-06-30-standalone-mcp-docs-url-acquisition-design.md

task-12085 - Rebase-standalone-MCP-docs-PR-on-dev.md: Record completion of rebase task +61/-0
• Adds backlog tracking documentation for the rebase effort.
backlog/completed/task-12085 - Rebase-standalone-MCP-docs-PR-on-dev.md

task-12071 - Design-standalone-MCP-document-corpus-and-Context7-compatible-RAG-tools.md: Add backlog task for docs corpus design +66/-0
• Tracks the design work item for the standalone docs corpus and Context7-compatible tools.
backlog/tasks/task-12071 - Design-standalone-MCP-document-corpus-and-Context7-compatible-RAG-tools.md

task-12073 - Plan-standalone-MCP-docs-corpus-Stage-1-implementation.md: Add backlog task for Stage 1 planning +52/-0
• Tracks the planning task for Stage 1 docs corpus implementation.
backlog/tasks/task-12073 - Plan-standalone-MCP-docs-corpus-Stage-1-implementation.md

task-12074 - Implement-standalone-MCP-docs-corpus-Stage-1.md: Add backlog task for Stage 1 implementation +67/-0
• Tracks the implementation task for Stage 1 docs corpus work.
backlog/tasks/task-12074 - Implement-standalone-MCP-docs-corpus-Stage-1.md

task-12076 - Design-standalone-MCP-docs-Stage-2-URL-acquisition.md: Add backlog task for Stage 2 URL acquisition design +50/-0
• Tracks the URL acquisition design task.
backlog/tasks/task-12076 - Design-standalone-MCP-docs-Stage-2-URL-acquisition.md

task-12077 - Plan-standalone-MCP-docs-Stage-2-URL-acquisition-implementation.md: Add backlog task for Stage 2 URL acquisition planning +41/-0
• Tracks the implementation planning task for URL acquisition.
backlog/tasks/task-12077 - Plan-standalone-MCP-docs-Stage-2-URL-acquisition-implementation.md

task-12078 - Implement-standalone-MCP-docs-Stage-2-URL-acquisition.md: Add backlog task for Stage 2 URL acquisition implementation +98/-0
• Tracks the implementation work item for URL acquisition.
backlog/tasks/task-12078 - Implement-standalone-MCP-docs-Stage-2-URL-acquisition.md

task-12079 - Plan-standalone-MCP-docs-Stage-3-server-mounting.md: Add backlog task for Stage 3 server mounting planning +61/-0
• Tracks the server mounting planning work item.
backlog/tasks/task-12079 - Plan-standalone-MCP-docs-Stage-3-server-mounting.md

task-12080 - Implement-standalone-MCP-docs-Stage-3-server-mounting.md: Add backlog task for Stage 3 server mounting implementation +72/-0
• Tracks the server mounting implementation work item.
backlog/tasks/task-12080 - Implement-standalone-MCP-docs-Stage-3-server-mounting.md

__init__.py: Document MCP unified host adapters package +2/-0
• Adds module docstring clarifying purpose of host adapter shims.
tldw_Server_API/app/core/MCP_unified/adapters/__init__.py

Other (3) +35 / -0

pyproject.toml: Package the new mcp_unified.docs subpackages and schema.sql +6/-0
• Adds docs subpackages to the setuptools package list and includes the SQLite schema as package data.
apps/mcp-unified/pyproject.toml

pyproject.toml: Include docs schema resource in top-level packaging +1/-0
• Adds mcp_unified docs schema.sql to package-data configuration so it is shipped with builds.
pyproject.toml

mcp_modules.yaml: Register DocsModule with locked-down defaults +28/-0
• Adds a docs module entry pointing to DocsModule and configures db_path/trusted_roots and disabled web acquisition by default.
tldw_Server_API/Config_Files/mcp_modules.yaml

qodo-code-review · 2026-07-01T04:03:58Z

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (1) 📜 Skill insights (0)

Context used

✅ Compliance rules (platform): 74 rules

1. DocsError defined outside core ✗ Dismissed 📘 Rule violation ⌂ Architecture

Description

A new custom exception DocsError is defined outside /app/core/exceptions.py, fragmenting
exception definitions and violating the project exception centralization policy. This can lead to
inconsistent error handling across modules.

Code

apps/mcp-unified/src/mcp_unified/docs/errors.py[R7-16]

+@dataclass(slots=True)
+class DocsError(Exception):
+    """Machine-readable docs corpus error."""
+
+    code: str
+    message: str
+    details: dict[str, Any] = field(default_factory=dict)
+
+    def __str__(self) -> str:
+        return f"{self.code}: {self.message}"

Evidence
The diff adds a new class DocsError(Exception) in apps/mcp-unified/.../errors.py, which is not
/app/core/exceptions.py, violating the centralization requirement for custom exception types.
Rule 224217: Centralize custom exceptions in core module
apps/mcp-unified/src/mcp_unified/docs/errors.py[7-16]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
A new custom exception class is defined outside the centralized exceptions module.

## Issue Context
Project policy requires custom exceptions to be centralized in `/app/core/exceptions.py` and imported from there when used.

## Fix Focus Areas
- apps/mcp-unified/src/mcp_unified/docs/errors.py[7-16]


2. Chunked HTTP not decoded 🐞 Bug ≡ Correctness

Description

URLFetcher reads response bodies as raw bytes until the socket closes and does not handle HTTP/1.1
Transfer-Encoding: chunked, so many real-world responses will be ingested with chunk framing still in
the body (or with an otherwise malformed body). This can break downstream extraction/parsing and
cause incorrect corpus content.

Code

apps/mcp-unified/src/mcp_unified/docs/acquisition/fetcher.py[R327-383]

+def _read_response(
+    stream: socket.socket | ssl.SSLSocket,
+    *,
+    max_body_bytes: int | None,
+) -> tuple[int, dict[str, str], list[bytes]]:
+    header_buffer = bytearray()
+    while b"\r\n\r\n" not in header_buffer:
+        chunk = stream.recv(4096)
+        if not chunk:
+            break
+        header_buffer.extend(chunk)
+        if len(header_buffer) > _MAX_HEADER_BYTES:
+            raise ValueError("response headers too large")
+    header_bytes, separator, remainder = bytes(header_buffer).partition(b"\r\n\r\n")
+    if not separator:
+        raise ValueError("malformed HTTP response")
+    status_code, headers = _parse_headers(header_bytes)
+    body_chunks = _read_body_chunks(stream, remainder, max_body_bytes=max_body_bytes)
+    return status_code, headers, body_chunks
+
+
+def _parse_headers(header_bytes: bytes) -> tuple[int, dict[str, str]]:
+    lines = header_bytes.split(b"\r\n")
+    if not lines or len(lines[0]) > _MAX_STATUS_LINE_BYTES:
+        raise ValueError("malformed HTTP status line")
+    status_parts = lines[0].decode("iso-8859-1").split()
+    if len(status_parts) < 2:
+        raise ValueError("malformed HTTP status line")
+    status_code = int(status_parts[1])
+    headers: dict[str, str] = {}
+    for raw_line in lines[1:]:
+        if b":" not in raw_line:
+            continue
+        key, value = raw_line.split(b":", 1)
+        headers[key.decode("iso-8859-1").strip().lower()] = value.decode("iso-8859-1").strip()
+    return status_code, headers
+
+
+def _read_body_chunks(
+    stream: socket.socket | ssl.SSLSocket,
+    initial: bytes,
+    *,
+    max_body_bytes: int | None,
+) -> list[bytes]:
+    body_chunks: list[bytes] = []
+    total = 0
+    cap = max_body_bytes + 1 if max_body_bytes is not None else None
+    if initial:
+        body_chunks.append(initial)
+        total += len(initial)
+    while cap is None or total < cap:
+        chunk = stream.recv(8192)
+        if not chunk:
+            break
+        body_chunks.append(chunk)
+        total += len(chunk)
+    return body_chunks

Evidence
The fetcher parses headers but then reads the body via repeated recv() with no handling for
transfer-encoding semantics; extraction then decodes and parses the bytes as if they were a normal
HTML/text payload.
apps/mcp-unified/src/mcp_unified/docs/acquisition/fetcher.py[327-383]
apps/mcp-unified/src/mcp_unified/docs/acquisition/extract.py[23-38]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The raw-socket HTTP implementation does not decode `Transfer-Encoding: chunked` and does not honor `Content-Length`; it simply reads until the server closes the connection. For chunked responses, this passes chunk metadata into the body, which can corrupt HTML/text ingestion.

## Issue Context
Downstream extraction (`extract_fetched_document`) assumes `body` is the decoded document payload.

## Fix Focus Areas
- apps/mcp-unified/src/mcp_unified/docs/acquisition/fetcher.py[327-383]

## Implementation notes
One of:
- Implement minimal chunked decoding when header `transfer-encoding` contains `chunked`.
- Or, if chunked decoding is intentionally out-of-scope for stage1, explicitly detect `transfer-encoding: chunked` and return a clear denial/failure reason (e.g. `transfer_encoding_unsupported`) instead of ingesting corrupted data.
- Optionally add `Content-Length` handling to stop reading once satisfied (still respecting `max_body_bytes`).
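
The first option in the notes above, minimal chunked decoding, might look like the sketch below. It is an assumption-laden illustration rather than the PR's fix: `decode_chunked` is a hypothetical helper that works on a body already read into memory, ignores chunk extensions and trailers, and raises `ValueError` instead of returning the fetcher's own denial reasons.

```python
from __future__ import annotations

_HEX_DIGITS = b"0123456789abcdefABCDEF"


def decode_chunked(raw: bytes, *, max_body_bytes: int | None = None) -> bytes:
    """Decode an HTTP/1.1 chunked body that has been fully buffered."""
    body = bytearray()
    pos = 0
    while True:
        line_end = raw.find(b"\r\n", pos)
        if line_end == -1:
            raise ValueError("truncated chunk size line")
        # Chunk extensions (";name=value") may follow the size; drop them.
        size_field = raw[pos:line_end].split(b";", 1)[0].strip()
        if not size_field or any(byte not in _HEX_DIGITS for byte in size_field):
            raise ValueError("malformed chunk size")
        size = int(size_field, 16)
        pos = line_end + 2
        if size == 0:
            return bytes(body)  # last chunk; any trailers are ignored
        if max_body_bytes is not None and len(body) + size > max_body_bytes:
            raise ValueError("response body too large")
        chunk = raw[pos : pos + size]
        # Each chunk must be complete and followed by its own CRLF.
        if len(chunk) != size or raw[pos + size : pos + size + 2] != b"\r\n":
            raise ValueError("truncated or malformed chunk")
        body.extend(chunk)
        pos += size + 2
```

For example, `decode_chunked(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n")` returns `b"Wikipedia"`. A streaming variant would read each size line and chunk from the socket incrementally so the size cap applies before the bytes are buffered.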


3. ~~Directory import aborts~~ ✓ Resolved 🐞 Bug ≡ Correctness

Description

DocsImportService.import_path collects every file under a directory and then _parse_file raises on
the first unsupported suffix, so importing typical docs folders (images, json, etc.) fails entirely.
This makes directory-based ingestion unreliable even when many valid markdown/html files exist.

Code

apps/mcp-unified/src/mcp_unified/docs/importers/local.py[R78-102]

+    def _iter_import_files(self, target: Path) -> list[Path]:
+        if target.is_file():
+            return [target]
+        if not target.is_dir():
+            raise DocsError(
+                code="import_path_not_found",
+                message="Import path does not exist.",
+                details={"path": str(target)},
+            )
+
+        files: list[Path] = []
+        for candidate in sorted(target.rglob("*")):
+            if not candidate.is_file():
+                continue
+            files.append(self._assert_allowed_path(candidate))
+        return files
+
+    def _parse_file(self, path: Path) -> ParsedDocument:
+        suffix = path.suffix.lower()
+        if suffix not in SUPPORTED_SUFFIXES:
+            raise DocsError(
+                code="unsupported_import_format",
+                message="Unsupported local import file type.",
+                details={"path": str(path), "suffix": suffix},
+            )

Evidence
The importer walks all files under a directory without filtering, but the parser hard-fails on any
unsupported suffix, so mixed-content directories will error out before importing valid docs files.
apps/mcp-unified/src/mcp_unified/docs/importers/local.py[78-93]
apps/mcp-unified/src/mcp_unified/docs/importers/local.py[95-102]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`DocsImportService._iter_import_files()` currently includes every file under the target directory, but `_parse_file()` raises `DocsError(code="unsupported_import_format")` for any non-supported suffix. This means a single unrelated file (e.g., `.png`) aborts the whole directory import.

## Issue Context
The tool description says it can "Import local files under configured trusted roots"; in practice, directories commonly include non-text assets.

## Fix Focus Areas
- apps/mcp-unified/src/mcp_unified/docs/importers/local.py[78-115]

## Implementation notes
- Filter candidates by `SUPPORTED_SUFFIXES` in `_iter_import_files()` (preferred), OR
- Catch `DocsError` with `code == "unsupported_import_format"` inside `import_path()` loop and `continue` (optionally collect a warning list).
- Consider also catching `UnicodeDecodeError` from `read_text()` and converting it to a structured `DocsError` so callers get consistent error codes.
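
The preferred option above, filtering the directory walk by suffix, might look like the following minimal sketch. The `SUPPORTED_SUFFIXES` values and the free function `iter_supported_files` are assumptions made for illustration; the real suffix set and the `_assert_allowed_path` check live in `local.py` and are not reproduced here.

```python
from pathlib import Path

# Assumed values for illustration; the real SUPPORTED_SUFFIXES is defined in local.py.
SUPPORTED_SUFFIXES = frozenset({".md", ".markdown", ".html", ".htm", ".txt"})


def iter_supported_files(target: Path) -> list[Path]:
    """Return importable files under target, skipping unsupported suffixes."""
    if target.is_file():
        # An explicitly named file is returned as-is, so the parser can still
        # report unsupported_import_format for a single bad file.
        return [target]
    if not target.is_dir():
        raise FileNotFoundError(f"Import path does not exist: {target}")
    return [
        candidate
        for candidate in sorted(target.rglob("*"))
        if candidate.is_file() and candidate.suffix.lower() in SUPPORTED_SUFFIXES
    ]
```

With this shape, a stray `.png` inside a docs folder is skipped during the walk, while pointing the importer directly at an unsupported file still surfaces an error from the parser.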


4. Tests missing required pytest marker 📘 Rule violation ▣ Testability

Description

New pytest tests were added without the required single marker (unit, integration,
external_api, or local_llm_service). This breaks the project’s test categorization and selection
standards.

Code

tldw_Server_API/tests/MCP_unified/docs/test_docs_standalone_mount.py[R13-18]

+def test_standalone_mount_defaults_to_docs_with_local_sqlite(tmp_path: Path) -> None:
+    mount = create_standalone_docs_mount({"db_path": str(tmp_path / "docs.db"), "trusted_roots": [str(tmp_path)]})
+
+    names = {tool["name"] for tool in mount.tool_definitions()}
+    status = mount.execute_tool("docs.status", {}, scope=AccessScope())
+

Evidence
The test functions are defined without any pytest marker decorators, violating the requirement that
each test has exactly one approved marker.
Rule 380651: Apply appropriate pytest markers to all tests
tldw_Server_API/tests/MCP_unified/docs/test_docs_standalone_mount.py[13-18]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
New pytest tests lack the required `@pytest.mark.<category>` marker.

## Issue Context
Project policy requires each test function/class to have exactly one accepted marker so suites can be selected consistently.

## Fix Focus Areas
- tldw_Server_API/tests/MCP_unified/docs/test_docs_standalone_mount.py[13-18]


5. ~~Missing docstrings in docs modules~~ ✓ Resolved 📘 Rule violation ✧ Quality

Description

New Python modules/classes/functions are introduced without required module and API docstrings,
reducing maintainability and violating the project docstring standard. This can also degrade
generated documentation and reviewer/auditor clarity.

Code

apps/mcp-unified/src/mcp_unified/docs/standalone.py[R1-69]

+from __future__ import annotations
+
+from dataclasses import dataclass
+from enum import Enum
+from typing import Any, Mapping
+
+from .mcp_module import DocsMCPToolProvider
+from .models import AccessScope
+from .settings import DocsSettings
+
+
+class StandaloneDocsProfile(str, Enum):
+    LOCKED_DOWN = "locked_down"
+    LOCAL_FIRST = "local_first"
+    ONLINE_CAPABLE = "online_capable"
+
+
+@dataclass(frozen=True)
+class StandaloneDocsMount:
+    module_id: str
+    name: str
+    settings: DocsSettings
+    provider: DocsMCPToolProvider
+
+    def tool_definitions(self) -> list[dict[str, Any]]:
+        return self.provider.tool_definitions()
+
+    def execute_tool(
+        self,
+        tool_name: str,
+        arguments: dict[str, Any] | None,
+        *,
+        scope: AccessScope | None = None,
+    ) -> Any:
+        return self.provider.execute(tool_name, arguments or {}, scope=scope or self.settings.default_scope)
+
+
+def standalone_docs_settings_for_profile(
+    profile: StandaloneDocsProfile | str = StandaloneDocsProfile.LOCKED_DOWN,
+    *,
+    overrides: Mapping[str, Any] | None = None,
+) -> DocsSettings:
+    profile_value = StandaloneDocsProfile(profile)
+    values: dict[str, Any] = {
+        "db_path": "Databases/mcp_docs.db",
+        "enable_web_acquisition": False,
+        "web_source_profile": profile_value.value,
+        "allow_arbitrary_public_domains": False,
+    }
+    if profile_value in {StandaloneDocsProfile.LOCAL_FIRST, StandaloneDocsProfile.ONLINE_CAPABLE}:
+        values["enable_web_acquisition"] = True
+    if overrides:
+        values.update(dict(overrides))
+    return DocsSettings.from_mapping(values)
+
+
+def create_standalone_docs_mount(
+    settings: DocsSettings | Mapping[str, Any] | None = None,
+    *,
+    profile: StandaloneDocsProfile | str = StandaloneDocsProfile.LOCKED_DOWN,
+    module_id: str = "docs",
+    name: str = "Docs Corpus",
+) -> StandaloneDocsMount:
+    if isinstance(settings, DocsSettings):
+        resolved_settings = settings
+    else:
+        resolved_settings = standalone_docs_settings_for_profile(profile, overrides=settings)
+    provider = DocsMCPToolProvider(settings=resolved_settings)
+    return StandaloneDocsMount(module_id=module_id, name=name, settings=resolved_settings, provider=provider)

Evidence
The referenced files begin immediately with imports (no module docstring) and define
classes/functions whose first statements are not docstrings, which violates the docstring
requirement for modules, classes, and functions/methods.
Rule 380617: Require comprehensive docstrings for modules, classes, and functions
apps/mcp-unified/src/mcp_unified/docs/standalone.py[1-69]
tldw_Server_API/app/core/MCP_unified/adapters/docs/config.py[1-20]
apps/mcp-unified/src/mcp_unified/docs/retrieval/search.py[1-34]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
New modules and public APIs were added without the required docstrings (module docstrings at top-of-file, plus docstrings for classes and functions/methods).

## Issue Context
The compliance checklist requires comprehensive docstrings for modules, classes, and functions/methods.

## Fix Focus Areas
- apps/mcp-unified/src/mcp_unified/docs/standalone.py[1-69]
- tldw_Server_API/app/core/MCP_unified/adapters/docs/config.py[1-20]
- apps/mcp-unified/src/mcp_unified/docs/retrieval/search.py[1-34]


6. ~~Raw SQL outside DB_Management~~ ✗ Dismissed 📘 Rule violation ⌂ Architecture

Description

New SQLite store code introduces extensive raw SQL outside /app/core/DB_Management/, violating the
required database access abstraction boundary. This increases risk of inconsistent DB access
patterns and weakens centralized controls/auditing.

Code

apps/mcp-unified/src/mcp_unified/docs/store/sqlite.py[R9-34]

+import sqlite3
+from typing import Any
+
+from ..errors import DocsError
+from ..models import AccessScope
+
+_STORE_PACKAGE = "mcp_unified.docs.store"
+_SCHEMA_RESOURCE = "schema.sql"
+_SCOPE_SENTINEL = ""
+_FTS_PHRASE_QUOTE = '"'
+_COUNT_SQL = {
+    "docs_documents": "SELECT COUNT(*) AS count FROM docs_documents",
+    "docs_chunks": "SELECT COUNT(*) AS count FROM docs_chunks",
+    "docs_collections": "SELECT COUNT(*) AS count FROM docs_collections",
+    "docs_keywords": "SELECT COUNT(*) AS count FROM docs_keywords",
+}
+_SCOPE_TABLE_INFO_SQL = {
+    "docs_documents": "PRAGMA table_info(docs_documents)",
+    "docs_collections": "PRAGMA table_info(docs_collections)",
+    "docs_keywords": "PRAGMA table_info(docs_keywords)",
+    "docs_aliases": "PRAGMA table_info(docs_aliases)",
+}
+_SCOPE_BACKFILL_SQL = {
+    "docs_documents": (
+        "UPDATE docs_documents SET owner_scope = ? WHERE owner_scope IS NULL",
+        "UPDATE docs_documents SET profile_scope = ? WHERE profile_scope IS NULL",

Evidence
The new DocsCatalogStore implementation imports sqlite3 and defines SQL strings (e.g., `SELECT
COUNT(*) ...`) in a non-DB_Management path, which the rule forbids.
Rule 380628: Enforce DB_Management abstractions; forbid raw SQL outside core layer
apps/mcp-unified/src/mcp_unified/docs/store/sqlite.py[9-34]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Raw SQL queries and direct sqlite access were added outside `/app/core/DB_Management/`, which violates the repository rule that DB access must be centralized.

## Issue Context
The docs corpus store currently constructs and executes SQL directly in `apps/mcp-unified/...`, rather than going through the project’s DB_Management abstraction.

## Fix Focus Areas
- apps/mcp-unified/src/mcp_unified/docs/store/sqlite.py[9-34]


7. ~~Async execute_tool calls blocking code~~ ✓ Resolved 📘 Rule violation ➹ Performance

Description

DocsModule.execute_tool is async but calls a synchronous provider that performs blocking
DB/network I/O, risking event-loop stalls and reduced concurrency. This should be executed via async
I/O or offloaded to a thread executor.

Code

tldw_Server_API/app/core/MCP_unified/modules/implementations/docs_module.py[R40-46]

+    async def execute_tool(self, tool_name: str, arguments: dict[str, Any], context: Any | None = None) -> Any:
+        args = self.sanitize_input(arguments or {})
+        try:
+            self.validate_tool_arguments(tool_name, args)
+        except (OverflowError, TypeError, ValueError) as exc:
+            raise ValueError(f"Invalid arguments for {tool_name}: {exc}") from exc
+        return self._ensure_provider().execute(tool_name, args, scope=docs_scope_from_context(context))

Evidence
The MCP host module exposes an async entrypoint (async def execute_tool) but delegates directly to
synchronous code that uses sqlite3 and socket networking, which are blocking operations in typical
async runtimes.
Rule 380613: Prefer async I/O over blocking calls in async-capable code paths
tldw_Server_API/app/core/MCP_unified/modules/implementations/docs_module.py[40-46]
apps/mcp-unified/src/mcp_unified/docs/store/sqlite.py[57-69]
apps/mcp-unified/src/mcp_unified/docs/acquisition/fetcher.py[205-224]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
An `async def` method calls synchronous code paths that perform blocking I/O, which can block the event loop.

## Issue Context
`DocsModule.execute_tool(...)` calls `DocsMCPToolProvider.execute(...)`. The provider uses synchronous SQLite (`sqlite3.connect(...)`) and the web fetch path uses synchronous sockets.

## Fix Focus Areas
- tldw_Server_API/app/core/MCP_unified/modules/implementations/docs_module.py[40-46]
- apps/mcp-unified/src/mcp_unified/docs/store/sqlite.py[57-69]
- apps/mcp-unified/src/mcp_unified/docs/acquisition/fetcher.py[205-224]
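
A minimal sketch of the thread-offload option, assuming Python 3.9+ for `asyncio.to_thread`. `BlockingProvider` is a hypothetical stand-in for the synchronous provider, and the free function `execute_tool` only mirrors the shape of the async entrypoint; neither is the project's actual code.

```python
import asyncio
import time


class BlockingProvider:
    """Hypothetical stand-in for a synchronous tool provider."""

    def execute(self, tool_name, arguments, *, scope=None):
        time.sleep(0.05)  # simulates blocking SQLite or socket I/O
        return {"tool": tool_name, "arguments": arguments, "scope": scope}


async def execute_tool(provider, tool_name, arguments, *, scope=None):
    """Run the blocking provider call in a worker thread, not on the event loop."""
    return await asyncio.to_thread(
        provider.execute, tool_name, arguments or {}, scope=scope
    )


if __name__ == "__main__":
    print(asyncio.run(execute_tool(BlockingProvider(), "docs.status", None)))
```

Offloading keeps the event loop responsive, but the provider must then be safe to call from a worker thread; for SQLite that usually means opening connections per call or per thread rather than sharing one across threads.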


8. ~~Robots flag denies all~~ ✓ Resolved 🐞 Bug ≡ Correctness

Description

When DocsSettings.respect_robots is true, URLFetcher.fetch immediately returns
denied(reason="robots_unavailable") for every URL without attempting a robots check. This turns a
policy toggle into a complete web-ingestion shutdown.

Code

apps/mcp-unified/src/mcp_unified/docs/acquisition/fetcher.py[R50-56]

+            if self.settings.respect_robots:
+                return FetchResult(
+                    status="denied",
+                    reason="robots_unavailable",
+                    final_url=normalized.redacted_url,
+                    redirects=tuple(redirects),
+                )

Evidence
The code path checks the flag and immediately returns a denial, so no URL can ever be fetched when
respect_robots is enabled.
apps/mcp-unified/src/mcp_unified/docs/acquisition/fetcher.py[36-56]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`respect_robots` currently causes an unconditional denial for all URLs. This is behaviorally incorrect for a flag named "respect_robots".

## Issue Context
If robots support is not implemented yet, the code should either:
- implement the check, or
- surface an explicit "not implemented" reason without implying a robots-based denial.

## Fix Focus Areas
- apps/mcp-unified/src/mcp_unified/docs/acquisition/fetcher.py[50-56]

## Implementation notes
- Implement robots fetching/parsing (with caching and timeouts) and deny only when robots rules disallow.
- Or remove the branch and treat `respect_robots` as no-op until implemented.
- Or rename/configure to a clearer "disable_web_acquisition_if_robots_required" semantic and adjust reason accordingly.
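
The first option, denying only when robots rules actually disallow a URL, can be sketched with the standard library's `urllib.robotparser`. The helper name `is_allowed_by_robots` is hypothetical, and fetching, caching, and timeouts for `robots.txt` are deliberately left out; the sketch assumes the robots body has already been retrieved as text.

```python
from urllib.robotparser import RobotFileParser


def is_allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate an already-fetched robots.txt body against a single URL."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines and marks the rules as loaded.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private\n"
    print(is_allowed_by_robots(rules, "*", "https://example.com/docs/intro"))
    print(is_allowed_by_robots(rules, "*", "https://example.com/private/notes"))
```

A fetcher built on this would return a robots-based denial only when the helper returns `False`, and use a separate reason when `robots.txt` itself could not be retrieved.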


9. ~~Decode errors misreported~~ ✓ Resolved 🐞 Bug ◔ Observability

Description

URLFetcher maps _decode_limited returning None (unsupported/invalid content-encoding or decode
failure) to reason="content_too_large", conflating distinct failure causes. This produces misleading
failure reporting and makes debugging and policy decisions harder.

Code

apps/mcp-unified/src/mcp_unified/docs/acquisition/fetcher.py[R164-183]

+            body = _join_limited(response.body_chunks, self.settings.max_url_body_bytes)
+            if body is None:
+                return FetchResult(
+                    status="denied",
+                    reason="content_too_large",
+                    final_url=normalized.redacted_url,
+                    status_code=response.status_code,
+                    headers=headers,
+                    redirects=tuple(redirects),
+                )
+            decoded_body = _decode_limited(body, headers.get("content-encoding"), self.settings.max_url_body_bytes)
+            if decoded_body is None:
+                return FetchResult(
+                    status="denied",
+                    reason="content_too_large",
+                    final_url=normalized.redacted_url,
+                    status_code=response.status_code,
+                    headers=headers,
+                    redirects=tuple(redirects),
+                )

Evidence
The caller returns content_too_large whenever _decode_limited returns None, but _decode_limited
returns None for unsupported encodings and decode errors unrelated to size.
apps/mcp-unified/src/mcp_unified/docs/acquisition/fetcher.py[164-183]
apps/mcp-unified/src/mcp_unified/docs/acquisition/fetcher.py[279-296]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Decode failures are currently reported as `content_too_large`, even when the body is within limits but cannot be decoded due to `content-encoding` issues.

## Issue Context
`_decode_limited()` returns `None` for unsupported encodings and decode exceptions.

## Fix Focus Areas
- apps/mcp-unified/src/mcp_unified/docs/acquisition/fetcher.py[164-183]
- apps/mcp-unified/src/mcp_unified/docs/acquisition/fetcher.py[279-296]

## Implementation notes
- Differentiate between size-limit and decode/encoding problems:
  - `content_too_large` when `_join_limited` fails or the decoded length exceeds the maximum
  - `unsupported_content_encoding` when the encoding is not in {identity, gzip, deflate}
  - `content_decode_failed` when zlib/EOF errors occur
- Consider returning `status="failed"` rather than `"denied"` for decode errors, depending on how callers interpret the taxonomy.
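
The reason taxonomy above could be implemented along these lines. This is a sketch under assumptions: the function name `decode_limited` and its `(body, reason)` return shape are illustrative and differ from the existing `_decode_limited`, which returns `None` on any failure.

```python
import zlib
from typing import Optional, Tuple


def decode_limited(
    body: bytes, encoding: Optional[str], max_bytes: int
) -> Tuple[Optional[bytes], Optional[str]]:
    """Return (decoded_body, None) on success or (None, reason) on failure."""
    normalized = (encoding or "identity").strip().lower()
    if normalized == "identity":
        if len(body) > max_bytes:
            return None, "content_too_large"
        return body, None
    if normalized == "gzip":
        wbits = 16 + zlib.MAX_WBITS  # expect a gzip header and trailer
    elif normalized == "deflate":
        wbits = zlib.MAX_WBITS  # expect a zlib-wrapped deflate stream
    else:
        return None, "unsupported_content_encoding"

    decompressor = zlib.decompressobj(wbits)
    try:
        # Ask for one byte more than allowed so oversize output is detectable
        # without inflating the whole payload.
        decoded = decompressor.decompress(body, max_bytes + 1)
    except zlib.error:
        return None, "content_decode_failed"
    if len(decoded) > max_bytes:
        return None, "content_too_large"
    if not decompressor.eof:
        # Input ran out before the compressed stream ended.
        return None, "content_decode_failed"
    return decoded, None
```

Returning the reason alongside the body lets the caller build a `FetchResult` with the matching `reason` instead of mapping every `None` to `content_too_large`.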


coderabbitai

Actionable comments posted: 18

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

apps/mcp-unified/src/mcp_unified/docs/acquisition/service.py (1)
1-141: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

No logging around fetch/denial outcomes.

ingest_url has several security-relevant failure/denial paths (policy denial, robots gate, fetch failure) but nothing is logged via Loguru anywhere in this module. A logger.info/logger.warning on non-"fetched" statuses would aid auditing of blocked/failed acquisition attempts without needing to inspect return values downstream.

As per coding guidelines, "Use Loguru for logging throughout the codebase (from loguru import logger)".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/mcp-unified/src/mcp_unified/docs/acquisition/service.py` around lines
1-141, `DocsAcquisitionService.ingest_url` returns several non-"fetched"
outcomes but does not log any of them, which makes blocked or failed
acquisitions hard to audit. Add Loguru logging in this module by importing
`logger`, and emit a `logger.info` or `logger.warning` in the `ingest_url`
early-return paths for capability disabled and any `fetched.status != "fetched"`
cases, including the URL, status/reason, and final URL/redirect context; keep
the normal success path unchanged.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/mcp-unified/src/mcp_unified/docs/acquisition/extract.py`:
- Around line 23-73: The rich extraction path in extract_fetched_document is
running for non-HTML payloads, which can mislabel decoded binary or unexpected
content as HTML. Update extract_fetched_document so _extract_with_trafilatura
and _extract_with_beautifulsoup are only attempted when media_type is in
_HTML_CONTENT_TYPES, and for anything else return the plain text fallback with
document_type="text" and extraction_method="text". This should also eliminate
the unreachable static_html or _parse_static_html(...) fallback at the end by
making the HTML-only branch in extract_fetched_document and _parse_static_html
the only path that returns HTML documents.

In `@apps/mcp-unified/src/mcp_unified/docs/acquisition/fetcher.py`:
- Around line 327-384: The response reader currently treats all bodies as raw
bytes, so `_read_response` and `_read_body_chunks` need to detect
`Transfer-Encoding: chunked` and dechunk before returning content. Update the
body-reading path to branch on the parsed headers from `_parse_headers`, keeping
normal `Content-Length`/unframed handling intact while decoding chunked framing
safely with `max_body_bytes` enforcement. Use the existing `_read_response` and
`_read_body_chunks` flow as the integration point, and add a regression test in
`test_docs_acquisition_fetcher.py` that covers a `transfer-encoding: chunked`
response and verifies the extracted body is dechunked correctly.
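
For illustration, dechunking a fully buffered `Transfer-Encoding: chunked` body with a size limit might look like the sketch below. It is not the fetcher's code: `dechunk` and `BodyTooLarge` are hypothetical names, the input is assumed to be already read into memory, chunk extensions are ignored, and trailers after the final chunk are not parsed.

```python
class BodyTooLarge(ValueError):
    """Raised when the dechunked body would exceed the configured limit."""


def dechunk(raw: bytes, max_body_bytes: int) -> bytes:
    """Decode HTTP/1.1 chunked framing from a buffered body."""
    body = bytearray()
    position = 0
    while True:
        line_end = raw.find(b"\r\n", position)
        if line_end == -1:
            raise ValueError("missing chunk-size line terminator")
        # Drop any chunk extension after ';' and parse the hex size.
        size_token = raw[position:line_end].split(b";", 1)[0].strip()
        size = int(size_token.decode("ascii"), 16)
        position = line_end + 2
        if size == 0:
            return bytes(body)  # last chunk; trailers are ignored here
        if len(body) + size > max_body_bytes:
            raise BodyTooLarge("chunked body exceeds max_body_bytes")
        chunk = raw[position:position + size]
        terminator = raw[position + size:position + size + 2]
        if len(chunk) != size or terminator != b"\r\n":
            raise ValueError("truncated or malformed chunk")
        body += chunk
        position += size + 2
```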

In `@apps/mcp-unified/src/mcp_unified/docs/acquisition/resolver.py`:
- Around line 15-28: StdlibResolver.resolve currently creates ResolvedAddress
instances without setting is_private, so the field always stays at its default.
Update StdlibResolver.resolve to determine privacy for each resolved IP and pass
that value into ResolvedAddress when appending results, using the existing
resolve flow and the ResolvedAddress field names to keep future consumers from
getting false negatives. Keep the deduping and family filtering logic unchanged.
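
A small sketch of populating `is_private` from the resolved IP using the standard `ipaddress` module. The `ResolvedAddress` shape shown here is an assumption (only the `is_private` field name comes from the comment above), and `to_resolved_address` is a hypothetical helper. Treating loopback and link-local addresses as private is a design choice of the sketch, not something the review specifies.

```python
import ipaddress
import socket
from dataclasses import dataclass


@dataclass(frozen=True)
class ResolvedAddress:
    """Assumed shape; the real dataclass lives in the acquisition package."""

    ip: str
    family: int
    is_private: bool = False


def to_resolved_address(ip_text: str) -> ResolvedAddress:
    """Build a ResolvedAddress with is_private derived from the address."""
    parsed = ipaddress.ip_address(ip_text)
    family = socket.AF_INET6 if parsed.version == 6 else socket.AF_INET
    non_public = parsed.is_private or parsed.is_loopback or parsed.is_link_local
    return ResolvedAddress(ip=str(parsed), family=family, is_private=non_public)
```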

In `@apps/mcp-unified/src/mcp_unified/docs/acquisition/service.py`:
- Around line 32-50: `DocsAcquisitionService.ingest_url()` is calling the
blocking `self.fetcher.fetch(url)` directly on the async tool path from
`DocsModule.execute_tool()`, which can stall the event loop. Refactor the
ingestion flow so the fetch work is offloaded to a thread/executor or converted
to an async fetch API, and update the `ingest_url` call site so URL ingestion no
longer performs synchronous DNS/socket I/O inline. Use the existing `ingest_url`
and `fetcher.fetch` symbols to locate the blocking section.
- Around line 32-96: Preserve existing keywords and collection memberships
during URL re-ingest in ingest_url. The current call to store.upsert_document
treats empty keywords/collection_names as replacements, so omitted values clear
existing associations. Change ingest_url and the upsert path to distinguish “not
provided” from “clear” by using sentinel defaults like None or by fetching the
current document state before updating. Update DocsCatalogStore/get_document or
add a document-level accessor if needed so upsert_document can merge existing
memberships instead of overwriting them.

In `@apps/mcp-unified/src/mcp_unified/docs/importers/base.py`:
- Around line 30-45: In chunks_from_text, add a guard for the overlap/max_chars
relationship so start always advances and the while loop cannot hang. Validate
the inputs at the top of the function (or adjust the step logic) so overlap is
strictly smaller than max_chars, and reference the chunks_from_text helper in
base.py when implementing the check.
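
The guard described above can be sketched as follows. The signature and defaults are assumptions for illustration; the real `chunks_from_text` in `base.py` may take different parameters and return richer objects than plain strings.

```python
def chunks_from_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list:
    """Split text into overlapping windows, guaranteeing forward progress."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if overlap < 0 or overlap >= max_chars:
        # Without this check the step below could be zero or negative and
        # the loop would never terminate.
        raise ValueError("overlap must be >= 0 and strictly smaller than max_chars")

    step = max_chars - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks
```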

In `@apps/mcp-unified/src/mcp_unified/docs/importers/local.py`:
- Around line 78-93: The directory import flow in _iter_import_files and
import_path is too eager and can fail the whole batch on the first unsupported
file while leaving earlier store.upsert_document writes committed. Update
LocalDocsImporter to filter candidates by SUPPORTED_SUFFIXES before parsing,
skip unsupported files (including hidden/binary/extension-less ones), and make
the batch import resilient so one bad file does not abort the entire directory
import. Keep the fix centered around _iter_import_files, _parse_file, and
import_path so the directory walk only yields supported documents and partial
writes are avoided or safely handled.

In `@apps/mcp-unified/src/mcp_unified/docs/importers/markdown.py`:
- Around line 8-29: The markdown importer in parse_markdown is treating lines
inside fenced code blocks as headings, which can add bogus ParsedSection entries
and override the document title. Update parse_markdown in the markdown importer
to track when it is inside a fenced block (for both ``` and ~~~) and skip
HEADING_RE matching while fenced, then resume heading detection only after the
fence closes. Keep the fix localized to parse_markdown and the section/title
handling logic so real headings still populate ParsedDocument correctly.
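
A minimal sketch of fence tracking during heading detection. The regular expressions and the `iter_headings` helper are assumptions for illustration and are simpler than a full CommonMark parser; the importer's actual `HEADING_RE` and `ParsedSection` handling are not reproduced.

```python
import re

# Assumed patterns; the importer's real HEADING_RE may differ.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
FENCE_RE = re.compile(r"^\s{0,3}(`{3,}|~{3,})")


def iter_headings(markdown: str) -> list:
    """Return (level, title) pairs, ignoring lines inside fenced code blocks."""
    headings = []
    open_fence = None  # the marker that opened the current fence, if any
    for line in markdown.splitlines():
        fence = FENCE_RE.match(line)
        if fence:
            marker = fence.group(1)
            if open_fence is None:
                open_fence = marker
            elif marker[0] == open_fence[0] and len(marker) >= len(open_fence):
                # Same fence character and at least as long: closes the block.
                open_fence = None
            continue
        if open_fence is not None:
            continue  # inside a code block, so '#' lines are not headings
        heading = HEADING_RE.match(line)
        if heading:
            headings.append((len(heading.group(1)), heading.group(2)))
    return headings
```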

In `@apps/mcp-unified/src/mcp_unified/docs/retrieval/context.py`:
- Around line 13-70: The retrieval packer in build() is only sampling a fixed
oversample window, so it can stop short of max_chunks when early results are too
concentrated by document. Update the search windowing logic in context.py
(around build, SearchRequest, and the chunks/seen_documents loop) to expand or
retry with a larger limit until the max_documents diversity quota or max_chunks
budget is actually satisfied. Also fix omitted to reflect the true remaining
matches rather than just len(search["results"]) - len(chunks), ideally by having
Retrieval.search expose a total-hit count or equivalent metadata and using that
when assembling the return payload.

In `@apps/mcp-unified/src/mcp_unified/docs/settings.py`:
- Around line 43-54: The numeric coercers in _coerce_positive_int and
_coerce_positive_float currently let int(value) and float(value) raise generic
ValueError messages for invalid input, which drops the field context. Update
both helpers to catch conversion failures and re-raise ValueError including
field_name in the message, consistent with the other coercers in settings.py,
while keeping the existing positive/finite validation logic intact.
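
The re-raise pattern might look like this sketch. The function names drop the leading underscore and the error wording is invented for illustration; only the idea of including `field_name` in the message comes from the comment above.

```python
import math


def coerce_positive_int(value, field_name: str) -> int:
    """Convert value to a positive int, naming the field on failure."""
    try:
        result = int(value)
    except (TypeError, ValueError, OverflowError) as exc:
        raise ValueError(f"{field_name} must be an integer, got {value!r}") from exc
    if result <= 0:
        raise ValueError(f"{field_name} must be positive, got {result}")
    return result


def coerce_positive_float(value, field_name: str) -> float:
    """Convert value to a positive finite float, naming the field on failure."""
    try:
        result = float(value)
    except (TypeError, ValueError, OverflowError) as exc:
        raise ValueError(f"{field_name} must be a number, got {value!r}") from exc
    if not math.isfinite(result) or result <= 0:
        raise ValueError(f"{field_name} must be a positive finite number, got {result}")
    return result
```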

In `@apps/mcp-unified/src/mcp_unified/docs/standalone.py`:
- Around line 44-49: The default db_path in create_standalone_docs_mount is
CWD-relative, which makes standalone behavior depend on where the process is
launched. Update the default db_path value in the values dict to use a stable
absolute or package-relative location instead of "Databases/mcp_docs.db", and
keep the existing override behavior intact for callers that pass db_path
explicitly.

In `@apps/mcp-unified/src/mcp_unified/docs/store/schema.sql`:
- Around line 38-49: The explicit scope indexes for docs_collections,
docs_keywords, and docs_aliases duplicate the UNIQUE constraint’s implicit
auto-index on the same column set. In schema.sql, remove
docs_collections_scope_idx and the matching scope indexes for docs_keywords and
docs_aliases, leaving the UNIQUE(owner_scope, profile_scope, name) constraint to
provide the needed index behavior.
- Around line 97-117: The docs_chunks_fts virtual table is only kept in sync by
_replace_document_rows, so it will not automatically follow docs_documents
deletions through the ON DELETE CASCADE path. Update the schema around
docs_chunks_fts to add an explicit trigger-based cleanup note or trigger plan,
and reference DocsCatalogStore/_replace_document_rows so future delete_document
or direct delete paths also remove matching FTS rows.

In `@apps/mcp-unified/src/mcp_unified/docs/store/sqlite.py`:
- Around line 663-686: The helper methods `_schema_version`, `_fts_available`,
and `_count` are swallowing SQLite/lookup failures and returning fallback values
without any trace, so add Loguru logging before each fallback. Update these
methods in `sqlite.py` to catch the same exceptions, log the exception with
clear context about which query/table failed, then return `0` or `False` as
appropriate. Keep the existing public behavior of `status()`, but ensure the
underlying error is visible for debugging by using the shared logger
consistently.
- Around line 108-195: The upsert logic in upsert_document is vulnerable to a
race because it does a SELECT followed by INSERT, which can conflict under
concurrent writers on the unique owner_scope/profile_scope/canonical_uri key.
Replace that flow with an atomic UPSERT using docs_documents and its conflict
target so the row is inserted or updated in one statement and the existing id is
returned/stable for _replace_document_rows, _replace_document_keywords, and
_replace_collection_memberships. If the database/API requires it, use a retry or
equivalent returning-id pattern, but keep the operation atomic around the unique
key.
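
An atomic upsert on the unique scope key could be sketched as below with the standard `sqlite3` module. The table definition is a simplified assumption (the real `docs_documents` schema has more columns), and the sketch relies on SQLite 3.24+ for `ON CONFLICT ... DO UPDATE`. It avoids `RETURNING` and reads the id back inside the same transaction instead.

```python
import sqlite3

# Simplified, assumed schema for illustration only.
SCHEMA = """
CREATE TABLE docs_documents (
    id INTEGER PRIMARY KEY,
    owner_scope TEXT NOT NULL,
    profile_scope TEXT NOT NULL,
    canonical_uri TEXT NOT NULL,
    title TEXT,
    UNIQUE (owner_scope, profile_scope, canonical_uri)
)
"""


def upsert_document(conn, owner_scope, profile_scope, canonical_uri, title) -> int:
    """Insert or update one document row and return its stable id."""
    key = (owner_scope, profile_scope, canonical_uri)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO docs_documents (owner_scope, profile_scope, canonical_uri, title) "
            "VALUES (?, ?, ?, ?) "
            "ON CONFLICT (owner_scope, profile_scope, canonical_uri) "
            "DO UPDATE SET title = excluded.title",
            key + (title,),
        )
        row = conn.execute(
            "SELECT id FROM docs_documents "
            "WHERE owner_scope = ? AND profile_scope = ? AND canonical_uri = ?",
            key,
        ).fetchone()
    return row[0]
```

Because the insert-or-update happens in one statement, a concurrent writer on the same key updates the existing row rather than raising `IntegrityError`, and the row id stays stable for the follow-up replace steps.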

In `@tldw_Server_API/app/core/MCP_unified/modules/implementations/docs_module.py`:
- Around line 40-54: Add validation in
DocsMCPToolProvider.execute_tool/validate_tool_arguments for the remaining docs
tool names so missing required inputs are rejected with ValueError before
provider execution. Extend validate_tool_arguments to cover docs.resolve,
docs.list, docs.get, docs.collections.create/update/set_membership,
docs.keywords.apply, resolve-library-id, and get-library-docs by checking each
expected field used later by execute(). Keep the existing sanitization and
error-wrapping flow in execute_tool so all invalid/missing arguments surface
through the same ValueError path instead of uncaught KeyError.

In `@tldw_Server_API/tests/MCP_unified/docs/test_docs_module_shim.py`:
- Around line 106-113: The test in DocsModule is checking source text instead of
runtime behavior, so replace the inspect.getsource assertions with a
behavior-based test that exercises DocsModule.on_initialize and/or
DocsModule.execute_tool and verifies docs_settings_from_module_config and
docs_scope_from_context are actually called with the expected arguments using
monkeypatch/spies. Keep the focus on the host adapter boundary in DocsModule and
avoid asserting on literal strings or implementation details.

In `@tldw_Server_API/tests/MCP_unified/docs/test_docs_schema_store.py`:
- Around line 165-206: Add a concurrency regression test for
DocsCatalogStore.upsert_document that exercises the TOCTOU race in the same
scope by issuing two upserts with the same canonical_uri from separate threads
or connections against the same DocsCatalogStore instance. Reuse the existing
test setup in test_default_scope_upsert_replaces_document_without_duplicates,
but drive concurrent writes and assert the outcome is either a single surviving
document with the updated content or the current IntegrityError until the
sqlite.py path is fixed.

---

Outside diff comments:
In `@apps/mcp-unified/src/mcp_unified/docs/acquisition/service.py`:
- Around line 1-141: `DocsAcquisitionService.ingest_url` returns several
non-"fetched" outcomes but does not log any of them, which makes blocked or
failed acquisitions hard to audit. Add Loguru logging in this module by
importing `logger`, and emit a `logger.info` or `logger.warning` in the
`ingest_url` early-return paths for capability disabled and any `fetched.status
!= "fetched"` cases, including the URL, status/reason, and final URL/redirect
context; keep the normal success path unchanged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1f06e500-2ad7-4cc7-9bf9-30557a27e219

📥 Commits

Reviewing files that changed from the base of the PR and between 9445a17 and 4ed8b3c.

⛔ Files ignored due to path filters (5)

Docs/superpowers/plans/2026-06-30-standalone-mcp-docs-corpus-stage1-plan.md is excluded by !docs/**
Docs/superpowers/plans/2026-06-30-standalone-mcp-docs-url-acquisition-implementation-plan.md is excluded by !docs/**
Docs/superpowers/plans/2026-07-01-standalone-mcp-docs-stage3-server-mounting-plan.md is excluded by !docs/**
Docs/superpowers/specs/2026-06-30-standalone-mcp-docs-catalog-design.md is excluded by !docs/**
Docs/superpowers/specs/2026-06-30-standalone-mcp-docs-url-acquisition-design.md is excluded by !docs/**

📒 Files selected for processing (57)

apps/mcp-unified/pyproject.toml
apps/mcp-unified/src/mcp_unified/docs/__init__.py
apps/mcp-unified/src/mcp_unified/docs/acquisition/__init__.py
apps/mcp-unified/src/mcp_unified/docs/acquisition/extract.py
apps/mcp-unified/src/mcp_unified/docs/acquisition/fetcher.py
apps/mcp-unified/src/mcp_unified/docs/acquisition/models.py
apps/mcp-unified/src/mcp_unified/docs/acquisition/policy.py
apps/mcp-unified/src/mcp_unified/docs/acquisition/resolver.py
apps/mcp-unified/src/mcp_unified/docs/acquisition/service.py
apps/mcp-unified/src/mcp_unified/docs/errors.py
apps/mcp-unified/src/mcp_unified/docs/importers/__init__.py
apps/mcp-unified/src/mcp_unified/docs/importers/base.py
apps/mcp-unified/src/mcp_unified/docs/importers/html.py
apps/mcp-unified/src/mcp_unified/docs/importers/local.py
apps/mcp-unified/src/mcp_unified/docs/importers/markdown.py
apps/mcp-unified/src/mcp_unified/docs/mcp_module.py
apps/mcp-unified/src/mcp_unified/docs/models.py
apps/mcp-unified/src/mcp_unified/docs/retrieval/__init__.py
apps/mcp-unified/src/mcp_unified/docs/retrieval/aliases.py
apps/mcp-unified/src/mcp_unified/docs/retrieval/context.py
apps/mcp-unified/src/mcp_unified/docs/retrieval/search.py
apps/mcp-unified/src/mcp_unified/docs/settings.py
apps/mcp-unified/src/mcp_unified/docs/standalone.py
apps/mcp-unified/src/mcp_unified/docs/store/__init__.py
apps/mcp-unified/src/mcp_unified/docs/store/schema.sql
apps/mcp-unified/src/mcp_unified/docs/store/sqlite.py
backlog/completed/task-12085 - Rebase-standalone-MCP-docs-PR-on-dev.md
backlog/tasks/task-12071 - Design-standalone-MCP-document-corpus-and-Context7-compatible-RAG-tools.md
backlog/tasks/task-12073 - Plan-standalone-MCP-docs-corpus-Stage-1-implementation.md
backlog/tasks/task-12074 - Implement-standalone-MCP-docs-corpus-Stage-1.md
backlog/tasks/task-12076 - Design-standalone-MCP-docs-Stage-2-URL-acquisition.md
backlog/tasks/task-12077 - Plan-standalone-MCP-docs-Stage-2-URL-acquisition-implementation.md
backlog/tasks/task-12078 - Implement-standalone-MCP-docs-Stage-2-URL-acquisition.md
backlog/tasks/task-12079 - Plan-standalone-MCP-docs-Stage-3-server-mounting.md
backlog/tasks/task-12080 - Implement-standalone-MCP-docs-Stage-3-server-mounting.md
pyproject.toml
tldw_Server_API/Config_Files/mcp_modules.yaml
tldw_Server_API/app/core/MCP_unified/adapters/__init__.py
tldw_Server_API/app/core/MCP_unified/adapters/docs/__init__.py
tldw_Server_API/app/core/MCP_unified/adapters/docs/config.py
tldw_Server_API/app/core/MCP_unified/modules/implementations/docs_module.py
tldw_Server_API/app/core/MCP_unified/tests/test_dynamic_module_catalog.py
tldw_Server_API/app/core/MCP_unified/tests/test_runtime_package_boundary.py
tldw_Server_API/app/core/MCP_unified/tests/test_write_tools_validators.py
tldw_Server_API/tests/MCP_unified/docs/test_docs_acquisition_extract.py
tldw_Server_API/tests/MCP_unified/docs/test_docs_acquisition_fetcher.py
tldw_Server_API/tests/MCP_unified/docs/test_docs_acquisition_policy.py
tldw_Server_API/tests/MCP_unified/docs/test_docs_acquisition_service.py
tldw_Server_API/tests/MCP_unified/docs/test_docs_host_adapter.py
tldw_Server_API/tests/MCP_unified/docs/test_docs_import_boundaries.py
tldw_Server_API/tests/MCP_unified/docs/test_docs_importers.py
tldw_Server_API/tests/MCP_unified/docs/test_docs_mcp_provider.py
tldw_Server_API/tests/MCP_unified/docs/test_docs_module_shim.py
tldw_Server_API/tests/MCP_unified/docs/test_docs_retrieval_context.py
tldw_Server_API/tests/MCP_unified/docs/test_docs_schema_store.py
tldw_Server_API/tests/MCP_unified/docs/test_docs_settings.py
tldw_Server_API/tests/MCP_unified/docs/test_docs_standalone_mount.py

cubic-dev-ai

27 issues found across 62 files

cubic-dev-ai

20 issues found across 64 files

gemini-code-assist Bot reviewed Jul 1, 2026

Comment thread Docs/Development/Calendar_CalDAV_Smoke_Test.md Outdated

github-advanced-security AI found potential problems Jul 1, 2026

rmusser01 force-pushed the codex/mcp-docs-stage1 branch from 95a632d to 4ed8b3c Compare July 1, 2026 03:52

rmusser01 changed the base branch from main to dev July 1, 2026 03:52

rmusser01 marked this pull request as ready for review July 1, 2026 03:54

qodo-code-review Bot reviewed Jul 1, 2026

coderabbitai Bot reviewed Jul 1, 2026

cubic-dev-ai Bot reviewed Jul 1, 2026

rmusser01 force-pushed the codex/mcp-docs-stage1 branch from 4fbee43 to 0b0838c Compare July 1, 2026 05:37

cubic-dev-ai Bot reviewed Jul 1, 2026

rmusser01 force-pushed the codex/mcp-docs-stage1 branch 3 times, most recently from a7b3a74 to 1b892d6 Compare July 1, 2026 16:15

rmusser01 added 14 commits July 2, 2026 17:58

docs: design standalone mcp document corpus

bc3f8b6

docs: tighten standalone mcp docs corpus design

60b3271

docs: make mcp docs web acquisition optional

5f88050

docs: plan standalone mcp docs stage 1

9c8d062

feat: add standalone docs package boundary

f8e0cfb

feat: add docs sqlite corpus store

438c699

feat: add docs local import service

4237acf

feat: add docs retrieval context services

072b66e

feat: add docs mcp tool provider

8ea4568

feat: register docs corpus mcp module

813e761

chore: close docs corpus backlog task

1aabc6d

docs: design mcp docs url acquisition

76b2c39

docs: harden mcp docs url acquisition design

f3b9acf

docs: plan mcp docs url acquisition

3da5ee5

rmusser01 added 25 commits July 2, 2026 17:59

feat: add docs url acquisition settings

ca47ce2

fix: harden docs url settings validation

e5257f7

feat: add docs url source policy

8199887

fix: redact docs url policy matched rules

00d0add

fix: harden docs url source policy

3f1b3c2

fix: deny legacy local hosts in docs policy

e23fea0

fix: fail closed on legacy numeric hosts

88d00a7

fix: reject ambiguous docs policy hosts

a52d7b2

fix: fail closed on ambiguous policy urls

f0d7095

feat: add docs url fetcher

4ea56bc

feat: add lazy docs url extraction

d811dba

feat: add docs url acquisition service

bfb299a

feat: expose docs url ingestion tool

a011aa7

test: harden docs url acquisition boundaries

b78a2f7

chore: close docs url acquisition task

4a8b03b

docs: plan MCP docs server mounting

1b7d5ee

feat: add standalone docs mount

634fd1b

refactor: add docs host adapter boundary

d32a356

test: guard docs module server registration

6fe8dde

test: enforce docs standalone mount boundaries

af98a1c

style: format docs server mounting tests

86db499

chore: close docs server mounting task

e682a3d

fix: align docs package with standalone mcp source tree

041ba59

fix: address standalone mcp docs review feedback

ebf2e1b

fix: address pr 2565 ci failures after rebase

de99974

rmusser01 force-pushed the codex/mcp-docs-stage1 branch from 1b892d6 to de99974 Compare July 3, 2026 01:23

rmusser01 merged commit 6eb3330 into dev Jul 3, 2026
32 of 33 checks passed

rmusser01 deleted the codex/mcp-docs-stage1 branch July 3, 2026 01:30

coderabbitai Bot mentioned this pull request Jul 3, 2026

Merge dev into main for v0.1.34 #2571

Open
