Conversation

@hongping-zh

Summary

Implements the ACM Scholar Agent feature requested in #319.

Features

  • 🔍 Search ACM papers via OpenAlex API (free, no API key required)
  • 📥 One-click ingestion - download and add papers to notebooks
  • 🎨 Frontend UI - Research Papers dialog in Sources dropdown
  • Open Access filtering - only shows freely accessible papers

Demo

[Video coming soon / Screenshots]

How It Works

  1. User clicks "+ Add" → "Research Papers" in a notebook
  2. Search for any topic (e.g., "Large Language Models")
  3. Click "Add" on any paper
  4. Paper is downloaded and processed automatically

Technical Details

  • Uses OpenAlex API with ACM Publisher + CS Concept filters
  • Integrates with existing source processing pipeline
  • No additional dependencies required
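
As a rough illustration of the first bullet, the sketch below builds the query parameters for an OpenAlex works search. It is not the code from this PR: `build_search_params` is a hypothetical helper, the filter field names should be checked against the OpenAlex documentation, and `ACM_PUBLISHER_ID` and `CS_CONCEPT_ID` are placeholders for the real OpenAlex identifiers.

```python
from urllib.parse import urlencode

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def build_search_params(query: str, limit: int, filters: dict[str, str]) -> dict[str, str]:
    """Build query parameters for an OpenAlex works search.

    OpenAlex expects filters as comma-separated key:value pairs,
    where the comma acts as a logical AND.
    """
    filter_expr = ",".join(f"{key}:{value}" for key, value in filters.items())
    return {"search": query, "filter": filter_expr, "per-page": str(limit)}


# Placeholder IDs (assumptions): substitute the real OpenAlex publisher and concept IDs.
params = build_search_params(
    "Large Language Models",
    limit=5,
    filters={
        "primary_location.source.publisher_lineage": "ACM_PUBLISHER_ID",
        "concepts.id": "CS_CONCEPT_ID",
        "open_access.is_oa": "true",
    },
)
print(f"{OPENALEX_WORKS_URL}?{urlencode(params)}")
```

Keeping the filters in a dictionary makes it easy to add or drop a constraint (for example, the Open Access filter) without touching the request code.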

Closes #319

hongping added 3 commits December 25, 2025 18:07
- Add ACM Agent service module with OpenAlex API integration
- Search ACM Digital Library papers with Open Access filtering
- Auto-download PDFs from trusted sources (arXiv, PubMed, etc.)
- Add Research Papers dialog in frontend UI
- Integrate with Open Notebook's source processing pipeline

Phase 1 MVP - Full open source implementation
- Search ACM Digital Library papers via OpenAlex API
- Filter by ACM publisher, Computer Science, Open Access
- Download and ingest papers into notebooks
- Frontend UI for searching and adding papers
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

7 issues found across 11 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="api/routers/agent.py">

<violation number="1" location="api/routers/agent.py:102">
P2: Exposing raw exception messages in API responses can leak internal implementation details. Consider using a generic error message for 500 errors while logging the full exception internally.</violation>

<violation number="2" location="api/routers/agent.py:126">
P0: **SSRF Vulnerability**: The endpoint accepts arbitrary URLs without validation, allowing attackers to make requests to internal services, cloud metadata endpoints, or scan internal networks. Add URL validation to ensure only allowed external domains/schemes are accessed.</violation>

<violation number="3" location="api/routers/agent.py:127">
P2: Missing content-type validation. The response is saved as a PDF without verifying the `Content-Type` header or checking for PDF magic bytes (`%PDF-`). This could allow arbitrary content to be stored and processed.</violation>
</file>

<file name="docs/ACM_AGENT_TESTING_GUIDE.md">

<violation number="1" location="docs/ACM_AGENT_TESTING_GUIDE.md:26">
P2: Python version requirement is incorrect. The project requires Python 3.11+ (per pyproject.toml), but this documentation states 3.10+. Users following these instructions with Python 3.10 will encounter compatibility issues.</violation>
</file>

<file name="open_notebook/acm_agent_service/core.py">

<violation number="1" location="open_notebook/acm_agent_service/core.py:14">
P2: URL parsing using `split('/')[-1]` is fragile and doesn't handle query strings, trailing slashes, or URL fragments. Use `urllib.parse.urlparse()` to properly extract the path component.</violation>
</file>

<file name="open_notebook/acm_agent_service/tools.py">

<violation number="1" location="open_notebook/acm_agent_service/tools.py:1">
P2: Using `requests` library which is not declared in project dependencies. The project uses `httpx` as the HTTP client (declared in pyproject.toml). Consider using `httpx` instead for consistency:

```python
import httpx

# In search method:
response = httpx.get(cls.BASE_URL, params=params, timeout=10)
```

Alternatively, add `requests` to the project's runtime dependencies.</violation>
</file>

<file name="frontend/src/lib/api/agent.ts">

<violation number="1" location="frontend/src/lib/api/agent.ts:33">
P1: This file uses raw `fetch()` instead of the established `apiClient` from `./client`. This bypasses authentication (Bearer token), the 10-minute timeout for slow operations, and the 401 redirect handling. Use `apiClient.get()` and `apiClient.post()` for consistency with the rest of the codebase.</violation>
</file>

Reply to cubic to teach it or ask questions. Re-run a review with @cubic-dev-ai review this PR

        )
    except Exception as e:
        logger.error(f"Error searching ACM papers: {e}")
        raise HTTPException(status_code=500, detail=str(e))
Contributor

@cubic-dev-ai cubic-dev-ai bot Dec 25, 2025

P2: Exposing raw exception messages in API responses can leak internal implementation details. Consider using a generic error message for 500 errors while logging the full exception internally.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At api/routers/agent.py, line 102:

<comment>Exposing raw exception messages in API responses can leak internal implementation details. Consider using a generic error message for 500 errors while logging the full exception internally.</comment>

<file context>
@@ -0,0 +1,195 @@
+        )
+    except Exception as e:
+        logger.error(f"Error searching ACM papers: {e}")
+        raise HTTPException(status_code=500, detail=str(e))
+
+@router.post("/agent/acm/ingest", response_model=IngestPaperResponse)
</file context>
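
One possible way to address this finding is sketched below. It uses the standard library `logging` module rather than the project's loguru logger, and `safe_error_detail` and `GENERIC_SEARCH_ERROR` are hypothetical names that do not appear in the PR.

```python
import logging

logger = logging.getLogger("acm_agent")

# Hypothetical client-facing message; the wording is an assumption.
GENERIC_SEARCH_ERROR = "Failed to search ACM papers. Please try again later."


def safe_error_detail(exc: Exception, public_message: str) -> str:
    """Log the full exception server-side and return only a generic message."""
    # exc_info accepts an exception instance, so the traceback lands in the log.
    logger.error("Internal error while handling request", exc_info=exc)
    return public_message


try:
    raise ValueError("connection string: db://user:secret@internal-host")
except Exception as e:
    detail = safe_error_detail(e, GENERIC_SEARCH_ERROR)
```

In the endpoint, the returned string would replace `str(e)` in `HTTPException(status_code=500, detail=...)`, so the traceback stays in the server log and the client sees only the generic text.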


        # Use httpx for async download
        async with httpx.AsyncClient(follow_redirects=True, timeout=30.0) as client:
            response = await client.get(request.pdf_url)
Contributor

@cubic-dev-ai cubic-dev-ai bot Dec 25, 2025

P2: Missing content-type validation. The response is saved as a PDF without verifying the Content-Type header or checking for PDF magic bytes (%PDF-). This could allow arbitrary content to be stored and processed.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At api/routers/agent.py, line 127:

<comment>Missing content-type validation. The response is saved as a PDF without verifying the `Content-Type` header or checking for PDF magic bytes (`%PDF-`). This could allow arbitrary content to be stored and processed.</comment>

<file context>
@@ -0,0 +1,195 @@
+        
+        # Use httpx for async download
+        async with httpx.AsyncClient(follow_redirects=True, timeout=30.0) as client:
+            response = await client.get(request.pdf_url)
+            response.raise_for_status()
+            
</file context>

            filename += ".pdf"

        # Use httpx for async download
        async with httpx.AsyncClient(follow_redirects=True, timeout=30.0) as client:
Contributor

@cubic-dev-ai cubic-dev-ai bot Dec 25, 2025

P0: SSRF Vulnerability: The endpoint accepts arbitrary URLs without validation, allowing attackers to make requests to internal services, cloud metadata endpoints, or scan internal networks. Add URL validation to ensure only allowed external domains/schemes are accessed.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At api/routers/agent.py, line 126:

<comment>**SSRF Vulnerability**: The endpoint accepts arbitrary URLs without validation, allowing attackers to make requests to internal services, cloud metadata endpoints, or scan internal networks. Add URL validation to ensure only allowed external domains/schemes are accessed.</comment>

<file context>
@@ -0,0 +1,195 @@
+            filename += ".pdf"
+        
+        # Use httpx for async download
+        async with httpx.AsyncClient(follow_redirects=True, timeout=30.0) as client:
+            response = await client.get(request.pdf_url)
+            response.raise_for_status()
</file context>
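
The URL validation requested here could look roughly like the sketch below. The allowlist is an assumption based on the "trusted sources" named in the commit message, and `is_allowed_pdf_url` is a hypothetical helper, not code from this PR.

```python
import ipaddress
from urllib.parse import urlparse

# Illustrative allowlist (assumption); the real list of trusted hosts is a project decision.
ALLOWED_HOSTS = frozenset({"arxiv.org", "dl.acm.org", "ncbi.nlm.nih.gov"})


def is_allowed_pdf_url(url: str, allowed_hosts: frozenset[str] = ALLOWED_HOSTS) -> bool:
    """Accept only https URLs whose host is an allowlisted domain or a subdomain of one."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return False
    host = (parsed.hostname or "").lower()
    if not host:
        return False
    try:
        # Literal IP addresses (loopback, link-local metadata, private ranges) are rejected.
        ipaddress.ip_address(host)
        return False
    except ValueError:
        pass  # Not an IP literal, so continue with the domain check.
    return host in allowed_hosts or any(host.endswith("." + h) for h in allowed_hosts)
```

Because the download uses `follow_redirects=True`, a check on the initial URL alone is not sufficient: each redirect target would need the same validation, for example by disabling automatic redirects and following them manually.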


### Prerequisites

- Python 3.10+
Contributor

@cubic-dev-ai cubic-dev-ai bot Dec 25, 2025

P2: Python version requirement is incorrect. The project requires Python 3.11+ (per pyproject.toml), but this documentation states 3.10+. Users following these instructions with Python 3.10 will encounter compatibility issues.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/ACM_AGENT_TESTING_GUIDE.md, line 26:

<comment>Python version requirement is incorrect. The project requires Python 3.11+ (per pyproject.toml), but this documentation states 3.10+. Users following these instructions with Python 3.10 will encounter compatibility issues.</comment>

<file context>
@@ -0,0 +1,205 @@
+
+### Prerequisites
+
+- Python 3.10+
+- Node.js 18+
+- SurrealDB
</file context>

        return OpenAlexACMTool.search(query, limit)

    def ingest_paper(self, paper_url: str) -> Dict[str, Any]:
        filename = paper_url.split('/')[-1]
Contributor

@cubic-dev-ai cubic-dev-ai bot Dec 25, 2025

P2: URL parsing using split('/')[-1] is fragile and doesn't handle query strings, trailing slashes, or URL fragments. Use urllib.parse.urlparse() to properly extract the path component.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At open_notebook/acm_agent_service/core.py, line 14:

<comment>URL parsing using `split('/')[-1]` is fragile and doesn't handle query strings, trailing slashes, or URL fragments. Use `urllib.parse.urlparse()` to properly extract the path component.</comment>

<file context>
@@ -0,0 +1,24 @@
+        return OpenAlexACMTool.search(query, limit)
+        
+    def ingest_paper(self, paper_url: str) -> Dict[str, Any]:
+        filename = paper_url.split('/')[-1]
+        if not filename.endswith('.pdf'):
+            filename += ".pdf"
</file context>
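
A sketch of the `urllib.parse`-based extraction this comment recommends. `filename_from_url` and the `paper.pdf` fallback are hypothetical; the PR's `ingest_paper` may need different defaults.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(url: str, default: str = "paper.pdf") -> str:
    """Derive a PDF filename from a URL, ignoring query strings and fragments."""
    path = urlparse(url).path  # Query string and fragment are parsed out separately.
    name = unquote(posixpath.basename(path.rstrip("/")))
    # Percent-decoding can reintroduce separators, so neutralize them.
    name = name.replace("/", "_").replace("\\", "_")
    if not name or name in {".", ".."}:
        return default
    if not name.lower().endswith(".pdf"):
        name += ".pdf"
    return name
```

Compared with `split('/')[-1]`, this handles a trailing slash and a URL with no path, and it keeps a `?download=1` suffix out of the filename.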

@@ -0,0 +1,69 @@
import requests
Contributor

@cubic-dev-ai cubic-dev-ai bot Dec 25, 2025

P2: Using requests library which is not declared in project dependencies. The project uses httpx as the HTTP client (declared in pyproject.toml). Consider using httpx instead for consistency:

import httpx

# In search method:
response = httpx.get(cls.BASE_URL, params=params, timeout=10)

Alternatively, add requests to the project's runtime dependencies.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At open_notebook/acm_agent_service/tools.py, line 1:

<comment>Using `requests` library which is not declared in project dependencies. The project uses `httpx` as the HTTP client (declared in pyproject.toml). Consider using `httpx` instead for consistency:

```python
import httpx

# In search method:
response = httpx.get(cls.BASE_URL, params=params, timeout=10)
```

Alternatively, add `requests` to the project's runtime dependencies.</comment>

<file context>
@@ -0,0 +1,69 @@
+import requests
+from typing import List, Dict, Any
+from loguru import logger
</file context>


export async function searchAcmPapers(query: string, limit: number = 5): Promise<SearchPapersResponse> {
  const apiUrl = await getApiUrl()
  const response = await fetch(`${apiUrl}/api/agent/acm/search?query=${encodeURIComponent(query)}&limit=${limit}`, {
Contributor

@cubic-dev-ai cubic-dev-ai bot Dec 25, 2025

P1: This file uses raw fetch() instead of the established apiClient from ./client. This bypasses authentication (Bearer token), the 10-minute timeout for slow operations, and the 401 redirect handling. Use apiClient.get() and apiClient.post() for consistency with the rest of the codebase.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At frontend/src/lib/api/agent.ts, line 33:

<comment>This file uses raw `fetch()` instead of the established `apiClient` from `./client`. This bypasses authentication (Bearer token), the 10-minute timeout for slow operations, and the 401 redirect handling. Use `apiClient.get()` and `apiClient.post()` for consistency with the rest of the codebase.</comment>

<file context>
@@ -0,0 +1,62 @@
+
+export async function searchAcmPapers(query: string, limit: number = 5): Promise<SearchPapersResponse> {
+  const apiUrl = await getApiUrl()
+  const response = await fetch(`${apiUrl}/api/agent/acm/search?query=${encodeURIComponent(query)}&limit=${limit}`, {
+    method: 'GET',
+    headers: {
</file context>

@hongping-zh
Author

Here's a demo video showing the ACM Scholar Agent in action:

🎬 Demo Video: https://drive.google.com/file/d/1xPVI2EUEtvbNSVgpg4eQ4xNrG4IyGrD2/view?usp=drive_link

The video demonstrates:

  • Searching for ACM papers via OpenAlex API
  • One-click paper ingestion into notebook
  • Viewing extracted PDF content
  • Chat with the ingested paper

Let me know if you have any questions!

Development

Successfully merging this pull request may close these issues.

[Feature Proposal] Built-in Research Agent with ReAct Architecture

1 participant