feat: speed up with parallel call and support for vLLM deployment emb…#19

Open
wuxuedaifu wants to merge 1 commit intonikmcfly:mainfrom
wuxuedaifu:feat/parallel_version

Conversation


@wuxuedaifu wuxuedaifu commented Mar 24, 2026

PR: Parallel Execution & Local Inference Optimization (2×+ faster)

🚨 Existing Problems

  1. Synchronous Data Ingestion: Knowledge Graph text chunks were ingested sequentially, so ingestion time grew with every local LLM call and degraded badly on large inputs.
  2. I/O Thread Blocking: Profile generation blocked the main thread to write tracking logs (CSV/JSON) for every entity, creating O(N²) delays.
  3. Infinite Generation Deadlocks (vLLM): Using strict json_object endpoints without a max_tokens limit frequently caused local models to loop on whitespace indefinitely, permanently hanging execution threads.
  4. Inefficient ReACT Loops: The Report Agent strictly enforced min_tool_calls=3, forcing agents that already had sufficient context to make redundant tool calls.
  5. Hardcoded Scaling Limits: Core concurrency settings (thread counts, batch sizes) were hardcoded across backend functions, preventing per-deployment tuning.
  6. Inflexible Embeddings: The embedding service could not be pointed at local inference backends (Ollama / vLLM) without code changes.

🛠️ Key Improvements & Solutions

1. Parallel Knowledge Graph Pipeline (graph_builder.py, neo4j_storage.py)

  • Replaced the sequential chunk-ingestion loop inside add_text_batch with a ThreadPoolExecutor.
  • Concurrent graph writes remain safe against transactional Neo4j MERGE deadlocks via the existing TransientError exponential-backoff retry.
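The parallel ingestion pattern described above can be sketched as follows; `ingest_chunk` is a hypothetical stand-in for the per-chunk LLM extraction and Neo4j MERGE work, and the real `add_text_batch` in graph_builder.py will differ in its signature:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def add_text_batch(chunks, ingest_chunk, batch_size=10):
    """Ingest text chunks concurrently instead of one by one.

    batch_size mirrors the GRAPH_BUILD_BATCH_SIZE knob introduced by
    this PR; ingest_chunk is an illustrative per-chunk worker.
    """
    results = []
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        futures = [pool.submit(ingest_chunk, chunk) for chunk in chunks]
        for fut in as_completed(futures):
            results.append(fut.result())  # re-raises worker exceptions
    return results
```

Because the workers are I/O-bound (LLM calls, Neo4j round-trips), threads give near-linear speedup despite the GIL.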

2. High-Throughput Profile Generation (oasis_profile_generator.py)

  • Background I/O Pipelines: Replaced sequential main-thread file persistence with fire-and-forget threading.Thread daemon tasks.
  • Concurrent DB Queries: Split the sequential edge and node hybrid searches into two simultaneous DB queries, roughly halving query wait times.
  • Anti-Deadlocking Constraints: Capped completions with max_tokens=4000. Combined with the _fix_truncated_json parser, the system salvages truncated JSON output instead of hanging on runaway 128k-context generations.
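A minimal sketch of the salvage idea behind `_fix_truncated_json` (the PR's actual parser is not shown here, so this is an illustrative best-effort reimplementation): track unclosed braces and brackets outside of strings, drop any dangling fragment after the last complete token, and append the missing closers.

```python
import json

def fix_truncated_json(text: str) -> dict:
    """Best-effort repair of JSON cut off by a max_tokens limit."""
    stack = []          # closers still owed, innermost last
    in_string = False
    escape = False
    last_good = 0       # index just past the last complete token
    for i, ch in enumerate(text):
        if escape:
            escape = False
            continue
        if in_string and ch == "\\":
            escape = True
            continue
        if ch == '"':
            in_string = not in_string
            if not in_string:        # a string just closed cleanly
                last_good = i + 1
            continue
        if in_string:
            continue
        if ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]":
            if stack:
                stack.pop()
            last_good = i + 1
        elif ch.isdigit() or ch in "el":  # ends of numbers/true/false/null
            last_good = i + 1
    repaired = text[:last_good].rstrip().rstrip(",")
    repaired += "".join(reversed(stack))
    return json.loads(repaired)
```

An unterminated trailing string is discarded rather than guessed at, which matches the goal here: recover the complete fields instead of hanging.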

3. Streamlined Report Processing (report_agent.py)

  • Report subsections are now generated asynchronously in parallel, minimizing total report generation time.
  • Removed the min_tool_calls requirement, allowing agents with sufficient context to answer immediately instead of making redundant tool calls.
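The parallel-sections pattern can be sketched like this; `generate_section` is a hypothetical coroutine standing in for the per-section LLM call in report_agent.py, and the semaphore mirrors the REPORT_PARALLEL_SECTIONS knob:

```python
import asyncio

async def build_report(section_specs, generate_section, parallel=5):
    """Generate all report subsections concurrently, reassemble in order."""
    sem = asyncio.Semaphore(parallel)  # cap in-flight LLM calls

    async def bounded(spec):
        async with sem:
            return await generate_section(spec)

    # gather preserves input order regardless of completion order
    sections = await asyncio.gather(*(bounded(s) for s in section_specs))
    return "\n\n".join(sections)
```

With five sections and a parallelism of 5, wall-clock time drops to roughly the latency of the slowest single section.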

4. Dynamic Execution Metrics (config.py, .env)

  • Moved concurrency settings into .env so performance can be tuned per deployment:
    • GRAPH_BUILD_BATCH_SIZE=10
    • PROFILE_PARALLEL_COUNT=10
    • PROFILE_SEARCH_WORKERS=2
    • REPORT_PARALLEL_SECTIONS=5

5. Extended Local Deployment Support (embedding_service.py)

  • Abstracted embedding initialization behind configurable connection settings, so Ollama and vLLM endpoints can be targeted through unified configuration without code changes.
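One way this works in practice: both Ollama and vLLM expose OpenAI-compatible `/v1/embeddings` endpoints, so a single configurable base URL covers both. The sketch below only builds the request; `EMBEDDING_BASE_URL` and `EMBEDDING_MODEL` are assumed setting names, and the Ollama default port is used as an example fallback:

```python
import os

def build_embedding_request(texts, base_url=None, model=None):
    """Return (url, payload) for an OpenAI-compatible embeddings call."""
    base_url = (base_url
                or os.getenv("EMBEDDING_BASE_URL", "http://localhost:11434/v1")
                ).rstrip("/")
    model = model or os.getenv("EMBEDDING_MODEL", "nomic-embed-text")
    return f"{base_url}/embeddings", {"model": model, "input": texts}
```

Swapping between a local Ollama instance and a vLLM server then only requires changing `EMBEDDING_BASE_URL` in `.env`.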
