Skip to content

Conversation

@prinskumar-tigergraph
Copy link
Collaborator

@prinskumar-tigergraph prinskumar-tigergraph commented Dec 8, 2025

PR Type

Enhancement, Bug fix, Documentation, Other


Description

  • Stream JSONL ingestion with temp sessions

  • PDF->Markdown via pymupdf4llm, image cleanup

  • Session-aware delete APIs and JSONL pruning

  • Dependency updates and license addition


Diagram Walkthrough

flowchart LR
  A["Local folder (server)"] -- "process_folder()" --> B["TextExtractor"]
  B -- "write docs incrementally" --> C["JSONL in uploads/ingestion_temp/<graph>/<session>"]
  C -- "runLoadingJobWithData(JSONL)" --> D["TigerGraph"]
  E["UI endpoints"] -- "delete by filename/session" --> C
  B -. "PDF via pymupdf4llm (locked), extract images, tg:// refs" .-> C
Loading

File Walkthrough

Relevant files
Enhancement
5 files
text_extractors.py
Async JSONL writing, PDF extraction via pymupdf4llm, and temp cleanup
+335/-138
supportai.py
Use temp-session JSONL for server ingest and single-shot loading
+56/-40 
ui.py
Add session-aware deletion and ingestion temp cleanup endpoint
+97/-3   
image_data_extractor.py
Simplify image description to file-path input; remove legacy save
+31/-134
Setup.tsx
UI setup updates for ingestion temp session handling         
+662/-629
Dependencies
1 files
requirements.txt
Add pymupdf4llm and bump PyMuPDF version                                 
+2/-1     
Documentation
1 files
pymupdf4llm-AGPL-3.0.txt
Add AGPL-3.0 license for pymupdf4llm dependency                   
+661/-0 

@tg-pr-agent
Copy link

tg-pr-agent bot commented Dec 8, 2025

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 Security concerns

Path traversal:
The ingestion temp delete endpoint (and related deletions using session_id) joins unvalidated session_id into paths and can delete directories/files (shutil.rmtree, os.remove). A crafted session_id like "../../some/other/dir" could escape the base directory. Mitigations:

  • Accept only a strict UUID pattern for session_id.
  • Resolve and compare real paths (e.g., os.path.realpath) and enforce they start with the base temp folder.
  • Reject path components containing separators or "..".
    Also consider that delete_file_from_jsonl uses the same session_id-derived path; validate it similarly.
⚡ Recommended focus areas for review

Missing Import

New helpers use os.path functions but no local import of os is visible; this will raise NameError at runtime unless imported elsewhere in the module. Add import os near the top.

for m in _md_pattern.finditer(md_text):
    path = m.group(2)
    basename = os.path.basename(path)
    image_id = os.path.splitext(basename)[0]
    images.append({"path": path, "image_id": image_id})
return images
Logic Regression

Small images are no longer skipped in _extract_standalone_image_as_doc; the conditional now just passes, which proceeds to process them. If the intent was to skip, return early; otherwise expect increased noise and token usage.

if pil_image.width < 100 or pil_image.height < 100:
    pass

description = describe_image_with_llm(str(Path(file_path).absolute()))
description_lower = description.lower()
logo_indicators = ['logo:', 'icon:', 'logo', 'icon', 'branding',
                   'watermark', 'trademark', 'stylized letter',
                   'stylized text', 'word "', "word '"]
if any(indicator in description_lower for indicator in logo_indicators):
    return []
Path Traversal

Session ID is concatenated directly into filesystem paths for deletion in the ingestion temp delete API. Validate session_id (e.g., UUID pattern), normalize paths, and ensure the resolved path stays under the intended base directory to prevent directory traversal. Apply similar hardening to other places where session_id is joined into paths.

if not session_id:
    raise HTTPException(status_code=400, detail="session_id is required")

session_dir = os.path.join(base_temp_dir, session_id)

if not os.path.exists(session_dir):
    return {
        "status": "success",
        "message": f"No temp files found for session {session_id}",
        "deleted_files": [],
    }

deleted_files = []

if filename:
    # Delete specific file
    file_path = os.path.join(session_dir, filename)
    if os.path.exists(file_path) and os.path.isfile(file_path):
        os.remove(file_path)
        deleted_files.append(filename)
        logger.info(f"Deleted temp file {filename} from session {session_id}")

        # If session folder is now empty, remove it
        if not os.listdir(session_dir):
            os.rmdir(session_dir)
            logger.info(f"Removed empty session folder {session_id}")
    else:
        raise HTTPException(status_code=404, detail=f"File {filename} not found")
else:
    # Delete entire session folder
    import shutil
    for filename in os.listdir(session_dir):
        if os.path.isfile(os.path.join(session_dir, filename)):
            deleted_files.append(filename)

    shutil.rmtree(session_dir)
    logger.info(f"Deleted session folder {session_id} for graph {graphname}")

@chengbiao-jin chengbiao-jin changed the title 2011-updated code for local temp files supports and delete logic GML-2011 updated code for local temp files supports and delete logic Dec 9, 2025
- Removed temp_session_id UUID generation from supportai.py
- Temp folders now use consistent path: base_dir/ingestion_temp/graphname
- Fixed delete endpoints to remove corresponding JSONL files when raw files are deleted
}

async def _process_file_async(self, file_path, folder_path_obj, graphname):
async def _process_file_async(self, file_path, folder_path_obj, graphname, temp_folder):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prinskumar-tigergraph can you add description for the args as some of them are generic names and unclear what's the real value it will be, e.g., file_path.

- Add logo identification instruction to image LLM prompt
- Use runLoadingJobWithFile for memory-efficient data loading
- Preserve temp JSONL files after ingestion for faster re-ingestion
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants