Feature/acm agent feat: Add ACM Scholar Agent for paper searchbasic #354
- Add ACM Agent service module with OpenAlex API integration
- Search ACM Digital Library papers with Open Access filtering
- Auto-download PDFs from trusted sources (arXiv, PubMed, etc.)
- Add Research Papers dialog in frontend UI
- Integrate with Open Notebook's source processing pipeline

Phase 1 MVP - Full open source implementation

- Search ACM Digital Library papers via OpenAlex API
- Filter by ACM publisher, Computer Science, Open Access
- Download and ingest papers into notebooks
- Frontend UI for searching and adding papers
7 issues found across 11 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="api/routers/agent.py">
<violation number="1" location="api/routers/agent.py:102">
P2: Exposing raw exception messages in API responses can leak internal implementation details. Consider using a generic error message for 500 errors while logging the full exception internally.</violation>
<violation number="2" location="api/routers/agent.py:126">
P0: **SSRF Vulnerability**: The endpoint accepts arbitrary URLs without validation, allowing attackers to make requests to internal services, cloud metadata endpoints, or scan internal networks. Add URL validation to ensure only allowed external domains/schemes are accessed.</violation>
<violation number="3" location="api/routers/agent.py:127">
P2: Missing content-type validation. The response is saved as a PDF without verifying the `Content-Type` header or checking for PDF magic bytes (`%PDF-`). This could allow arbitrary content to be stored and processed.</violation>
</file>
<file name="docs/ACM_AGENT_TESTING_GUIDE.md">
<violation number="1" location="docs/ACM_AGENT_TESTING_GUIDE.md:26">
P2: Python version requirement is incorrect. The project requires Python 3.11+ (per pyproject.toml), but this documentation states 3.10+. Users following these instructions with Python 3.10 will encounter compatibility issues.</violation>
</file>
<file name="open_notebook/acm_agent_service/core.py">
<violation number="1" location="open_notebook/acm_agent_service/core.py:14">
P2: URL parsing using `split('/')[-1]` is fragile and doesn't handle query strings, trailing slashes, or URL fragments. Use `urllib.parse.urlparse()` to properly extract the path component.</violation>
</file>
<file name="open_notebook/acm_agent_service/tools.py">
<violation number="1" location="open_notebook/acm_agent_service/tools.py:1">
P2: Using `requests` library which is not declared in project dependencies. The project uses `httpx` as the HTTP client (declared in pyproject.toml). Consider using `httpx` instead for consistency:
```python
import httpx
# In search method:
response = httpx.get(cls.BASE_URL, params=params, timeout=10)
```
Alternatively, add `requests` to the project's runtime dependencies.</violation>
</file>
P2: Exposing raw exception messages in API responses can leak internal implementation details. Consider using a generic error message for 500 errors while logging the full exception internally.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At api/routers/agent.py, line 102:
<comment>Exposing raw exception messages in API responses can leak internal implementation details. Consider using a generic error message for 500 errors while logging the full exception internally.</comment>
<file context>
@@ -0,0 +1,195 @@
+ )
+ except Exception as e:
+ logger.error(f"Error searching ACM papers: {e}")
+ raise HTTPException(status_code=500, detail=str(e))
+
+@router.post("/agent/acm/ingest", response_model=IngestPaperResponse)
</file context>
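One possible way to act on this comment, sketched with the standard library only: log the full exception on the server and give the client a fixed message. `ApiError` is a hypothetical stand-in for FastAPI's `HTTPException`, and `search_papers`, `failing_backend`, and `GENERIC_500` are illustrative names that do not come from the PR.

```python
import logging

logger = logging.getLogger("acm_agent")

# Assumption: wording of the client-facing message is illustrative.
GENERIC_500 = "Internal error while searching papers"


class ApiError(Exception):
    """Hypothetical stand-in for fastapi.HTTPException (status_code + detail)."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def search_papers(backend, query: str):
    """Call the backend; on failure, log the details and raise a generic error."""
    try:
        return backend(query)
    except Exception:
        # logger.exception writes the message and the full traceback to the
        # server log; none of it is copied into the response detail.
        logger.exception("Error searching ACM papers")
        raise ApiError(status_code=500, detail=GENERIC_500) from None


def failing_backend(query: str):
    # Simulates an internal failure whose message should not reach clients.
    raise RuntimeError("db password=hunter2 rejected at 10.0.0.5")
```

In the real router the same shape would presumably apply, with `HTTPException(status_code=500, detail=...)` in place of `ApiError`.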
P2: Missing content-type validation. The response is saved as a PDF without verifying the Content-Type header or checking for PDF magic bytes (%PDF-). This could allow arbitrary content to be stored and processed.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At api/routers/agent.py, line 127:
<comment>Missing content-type validation. The response is saved as a PDF without verifying the `Content-Type` header or checking for PDF magic bytes (`%PDF-`). This could allow arbitrary content to be stored and processed.</comment>
<file context>
@@ -0,0 +1,195 @@
+
+ # Use httpx for async download
+ async with httpx.AsyncClient(follow_redirects=True, timeout=30.0) as client:
+ response = await client.get(request.pdf_url)
+ response.raise_for_status()
+
</file context>
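A minimal sketch of the check this comment asks for, assuming the caller passes in the response's `Content-Type` header value and the downloaded bytes. The helper name `looks_like_pdf` is hypothetical. This version requires both the header and the magic bytes to match, which is stricter than the comment's "or".

```python
from typing import Optional

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(content_type: Optional[str], body: bytes) -> bool:
    """Check the declared media type and the leading bytes of the payload."""
    # Drop parameters such as "; charset=binary" and normalise case.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    if media_type != "application/pdf":
        return False
    # The payload must begin with the "%PDF-" marker named in the comment.
    return body.startswith(PDF_MAGIC)
```

Some servers deliver PDFs as `application/octet-stream`, so a project might reasonably relax the header check and rely on the magic bytes alone; that trade-off is left open here.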
P0: SSRF Vulnerability: The endpoint accepts arbitrary URLs without validation, allowing attackers to make requests to internal services, cloud metadata endpoints, or scan internal networks. Add URL validation to ensure only allowed external domains/schemes are accessed.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At api/routers/agent.py, line 126:
<comment>**SSRF Vulnerability**: The endpoint accepts arbitrary URLs without validation, allowing attackers to make requests to internal services, cloud metadata endpoints, or scan internal networks. Add URL validation to ensure only allowed external domains/schemes are accessed.</comment>
<file context>
@@ -0,0 +1,195 @@
+ filename += ".pdf"
+
+ # Use httpx for async download
+ async with httpx.AsyncClient(follow_redirects=True, timeout=30.0) as client:
+ response = await client.get(request.pdf_url)
+ response.raise_for_status()
</file context>
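The URL validation suggested here could look roughly like the following sketch, which uses only the standard library. The allowlist contents, `validate_pdf_url`, and `is_public_ip` are assumptions made for illustration; which hosts count as trusted is a project decision.

```python
import ipaddress
from urllib.parse import urlsplit

# Assumption: example allowlist only; the real trusted hosts are up to the project.
ALLOWED_HOSTS = {"arxiv.org", "dl.acm.org"}


def is_public_ip(address: str) -> bool:
    """True only for globally routable addresses (not loopback, private, link-local)."""
    return ipaddress.ip_address(address).is_global


def validate_pdf_url(url: str) -> str:
    """Return the URL unchanged if it passes the checks, else raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError("only https URLs are allowed")
    if parts.username or parts.password:
        raise ValueError("credentials in URL are not allowed")
    host = (parts.hostname or "").lower()
    if host not in ALLOWED_HOSTS:
        raise ValueError(f"host not in allowlist: {host!r}")
    if parts.port not in (None, 443):
        raise ValueError("non-default port is not allowed")
    return url
```

Because the client in the diff is created with `follow_redirects=True`, a redirect from an allowed host could still lead somewhere else, so each hop would need the same check. `is_public_ip` is meant for addresses obtained from DNS resolution, which this sketch does not perform.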
P2: Python version requirement is incorrect. The project requires Python 3.11+ (per pyproject.toml), but this documentation states 3.10+. Users following these instructions with Python 3.10 will encounter compatibility issues.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/ACM_AGENT_TESTING_GUIDE.md, line 26:
<comment>Python version requirement is incorrect. The project requires Python 3.11+ (per pyproject.toml), but this documentation states 3.10+. Users following these instructions with Python 3.10 will encounter compatibility issues.</comment>
<file context>
@@ -0,0 +1,205 @@
+
+### Prerequisites
+
+- Python 3.10+
+- Node.js 18+
+- SurrealDB
</file context>
P2: URL parsing using split('/')[-1] is fragile and doesn't handle query strings, trailing slashes, or URL fragments. Use urllib.parse.urlparse() to properly extract the path component.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At open_notebook/acm_agent_service/core.py, line 14:
<comment>URL parsing using `split('/')[-1]` is fragile and doesn't handle query strings, trailing slashes, or URL fragments. Use `urllib.parse.urlparse()` to properly extract the path component.</comment>
<file context>
@@ -0,0 +1,24 @@
+ return OpenAlexACMTool.search(query, limit)
+
+ def ingest_paper(self, paper_url: str) -> Dict[str, Any]:
+ filename = paper_url.split('/')[-1]
+ if not filename.endswith('.pdf'):
+ filename += ".pdf"
</file context>
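As a sketch of the `urllib.parse.urlparse()` approach mentioned above: take the filename from the path component only, so query strings, fragments, and trailing slashes no longer end up in it. The function name `filename_from_url` and the `default` parameter are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def filename_from_url(paper_url: str, default: str = "paper") -> str:
    """Derive a .pdf filename from the URL path, ignoring query and fragment."""
    path = unquote(urlparse(paper_url).path)
    # PurePosixPath ignores a trailing slash and yields the last path segment.
    name = PurePosixPath(path).name or default
    if not name.lower().endswith(".pdf"):
        name += ".pdf"
    return name
```

With the original `split('/')[-1]`, a URL ending in `1234.pdf?download=1` would produce `1234.pdf?download=1.pdf`, and a URL with a trailing slash would produce just `.pdf`.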
@@ -0,0 +1,69 @@
+import requests
P2: Using requests library which is not declared in project dependencies. The project uses httpx as the HTTP client (declared in pyproject.toml). Consider using httpx instead for consistency:
import httpx
# In search method:
response = httpx.get(cls.BASE_URL, params=params, timeout=10)
Alternatively, add requests to the project's runtime dependencies.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At open_notebook/acm_agent_service/tools.py, line 1:
<comment>Using `requests` library which is not declared in project dependencies. The project uses `httpx` as the HTTP client (declared in pyproject.toml). Consider using `httpx` instead for consistency:
```python
import httpx
# In search method:
response = httpx.get(cls.BASE_URL, params=params, timeout=10)
```
Alternatively, add `requests` to the project's runtime dependencies.</comment>
P1: This file uses raw fetch() instead of the established apiClient from ./client. This bypasses authentication (Bearer token), the 10-minute timeout for slow operations, and the 401 redirect handling. Use apiClient.get() and apiClient.post() for consistency with the rest of the codebase.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At frontend/src/lib/api/agent.ts, line 33:
<comment>This file uses raw `fetch()` instead of the established `apiClient` from `./client`. This bypasses authentication (Bearer token), the 10-minute timeout for slow operations, and the 401 redirect handling. Use `apiClient.get()` and `apiClient.post()` for consistency with the rest of the codebase.</comment>
<file context>
@@ -0,0 +1,62 @@
+
+export async function searchAcmPapers(query: string, limit: number = 5): Promise<SearchPapersResponse> {
+ const apiUrl = await getApiUrl()
+ const response = await fetch(`${apiUrl}/api/agent/acm/search?query=${encodeURIComponent(query)}&limit=${limit}`, {
+ method: 'GET',
+ headers: {
</file context>
Here's a demo video showing the ACM Scholar Agent in action:
🎬 Demo Video: https://drive.google.com/file/d/1xPVI2EUEtvbNSVgpg4eQ4xNrG4IyGrD2/view?usp=drive_link
The video demonstrates:
Let me know if you have any questions!
Summary
Implements the ACM Scholar Agent feature requested in #319.
Features
Demo
[Video coming soon / Screenshots]
How It Works
Technical Details
Closes #319