Encapsulate LlamaParse to optimize the processing of chunks and metadata information. #74

Merged
merged 2 commits into from
Jul 15, 2024
5 changes: 5 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,11 @@ __pycache__/

.DS_Store

chroma_dir
diskcache_dir
sqlite_dir
web/download_dir

# C extensions
*.so

Expand Down
12 changes: 12 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -91,13 +91,17 @@ USE_RERANKING=1
USE_DEBUG=0
USE_LLAMA_PARSE=0
LLAMA_CLOUD_API_KEY="xxxx"
USE_GPT4O=0
```

- Don't modify **`LLM_NAME`**
- Replace the **`OPENAI_API_KEY`** value with your own key. Log in to the [OpenAI website](https://platform.openai.com/api-keys) to view your API key.
- Update the **`GPT_MODEL_NAME`** setting, replacing `gpt-3.5-turbo` with `gpt-4-turbo` or `gpt-4o` if you want to use GPT-4.
- Change **`BOT_TOPIC`** to reflect your Bot's name. This is very important, as it will be used in `Prompt Construction`. Use a concise, clear word such as `OpenIM` or `LangChain`.
- Adjust **`URL_PREFIX`** to match your website's domain. It is mainly used to generate accessible URL links for uploaded local files, such as `http://127.0.0.1:7000/web/download_dir/2024_05_20/d3a01d6a-90cd-4c2a-b926-9cda12466caf/openssl-cookbook.pdf`.
- Set **`USE_LLAMA_PARSE`** to 1 if you want to use `LlamaParse`.
- Replace the **`LLAMA_CLOUD_API_KEY`** value with your own key. Log in to the [LlamaCloud website](https://cloud.llamaindex.ai/api-key) to view your API key.
- Set **`USE_GPT4O`** to 1 if you want to use `GPT-4o` mode.
- For more information about the meaning and usage of the constants, see the `server/constant` directory.
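
The switches in the list above can be read at startup roughly as follows. This is a minimal sketch, not the project's own loader: the helper names `env_flag` and `load_parse_settings` are hypothetical, and it assumes the values are exposed as environment variables and that `"1"` means enabled, as the README wording suggests.

```python
import os


def env_flag(name: str, default: str = "0") -> bool:
    """Return True when the environment variable `name` is set to "1"."""
    return os.getenv(name, default).strip() == "1"


def load_parse_settings() -> dict:
    """Collect the parsing-related switches (hypothetical helper)."""
    use_llama_parse = env_flag("USE_LLAMA_PARSE")
    api_key = os.getenv("LLAMA_CLOUD_API_KEY", "")
    # The placeholder "xxxx" from the sample env file cannot work, so fail
    # early when LlamaParse is switched on without a real key.
    if use_llama_parse and api_key in ("", "xxxx"):
        raise ValueError("USE_LLAMA_PARSE=1 requires a real LLAMA_CLOUD_API_KEY")
    return {
        "use_llama_parse": use_llama_parse,
        "llama_cloud_api_key": api_key,
        "use_gpt4o": env_flag("USE_GPT4O"),
    }
```

Failing early on the placeholder key is a design choice of this sketch; the project may handle a missing key differently.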

#### Using ZhipuAI as the LLM base
Expand Down Expand Up @@ -130,6 +134,8 @@ LLAMA_CLOUD_API_KEY="xxxx"
- Update the **`GLM_MODEL_NAME`** setting; the available models are `['glm-3-turbo', 'glm-4', 'glm-4-0520', 'glm-4-air', 'glm-4-airx', 'glm-4-flash']`.
- Change **`BOT_TOPIC`** to reflect your Bot's name. This is very important, as it will be used in `Prompt Construction`. Use a concise, clear word such as `OpenIM` or `LangChain`.
- Adjust **`URL_PREFIX`** to match your website's domain. It is mainly used to generate accessible URL links for uploaded local files, such as `http://127.0.0.1:7000/web/download_dir/2024_05_20/d3a01d6a-90cd-4c2a-b926-9cda12466caf/openssl-cookbook.pdf`.
- Set **`USE_LLAMA_PARSE`** to 1 if you want to use `LlamaParse`.
- Replace the **`LLAMA_CLOUD_API_KEY`** value with your own key. Log in to the [LlamaCloud website](https://cloud.llamaindex.ai/api-key) to view your API key.
- For more information about the meaning and usage of the constants, see the `server/constant` directory.
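
To illustrate what `URL_PREFIX` contributes, the sketch below assembles a download link with the same shape as the example URL above. The function name `build_download_url` and the `web/download_dir/<date>/<id>/<file>` layout are assumptions inferred from that single example, not taken from the project's code.

```python
import uuid
from datetime import date
from urllib.parse import quote


def build_download_url(url_prefix: str, filename: str, day: date, file_id: str) -> str:
    """Join URL_PREFIX with a dated, per-upload path (layout is an assumption)."""
    base = url_prefix.rstrip("/")          # tolerate a trailing slash in the prefix
    day_dir = day.strftime("%Y_%m_%d")     # e.g. 2024_05_20
    # quote() percent-encodes characters that are not safe inside a URL path.
    return f"{base}/web/download_dir/{day_dir}/{file_id}/{quote(filename)}"


if __name__ == "__main__":
    # A fresh UUID per upload keeps files with the same name from colliding.
    print(build_download_url("http://127.0.0.1:7000", "openssl-cookbook.pdf",
                             date.today(), str(uuid.uuid4())))
```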

#### Using DeepSeek as the LLM base
Expand Down Expand Up @@ -167,6 +173,8 @@ LLAMA_CLOUD_API_KEY="xxxx"
- Update the **`DEEPSEEK_MODEL_NAME`** setting if you want to use other DeepSeek models.
- Change **`BOT_TOPIC`** to reflect your Bot's name. This is very important, as it will be used in `Prompt Construction`. Use a concise, clear word such as `OpenIM` or `LangChain`.
- Adjust **`URL_PREFIX`** to match your website's domain. It is mainly used to generate accessible URL links for uploaded local files, such as `http://127.0.0.1:7000/web/download_dir/2024_05_20/d3a01d6a-90cd-4c2a-b926-9cda12466caf/openssl-cookbook.pdf`.
- Set **`USE_LLAMA_PARSE`** to 1 if you want to use `LlamaParse`.
- Replace the **`LLAMA_CLOUD_API_KEY`** value with your own key. Log in to the [LlamaCloud website](https://cloud.llamaindex.ai/api-key) to view your API key.
- For more information about the meaning and usage of the constants, see the `server/constant` directory.


Expand Down Expand Up @@ -205,6 +213,8 @@ LLAMA_CLOUD_API_KEY="xxxx"
- Update the **`MOONSHOT_MODEL_NAME`** setting if you want to use other Moonshot models.
- Change **`BOT_TOPIC`** to reflect your Bot's name. This is very important, as it will be used in `Prompt Construction`. Use a concise, clear word such as `OpenIM` or `LangChain`.
- Adjust **`URL_PREFIX`** to match your website's domain. It is mainly used to generate accessible URL links for uploaded local files, such as `http://127.0.0.1:7000/web/download_dir/2024_05_20/d3a01d6a-90cd-4c2a-b926-9cda12466caf/openssl-cookbook.pdf`.
- Set **`USE_LLAMA_PARSE`** to 1 if you want to use `LlamaParse`.
- Replace the **`LLAMA_CLOUD_API_KEY`** value with your own key. Log in to the [LlamaCloud website](https://cloud.llamaindex.ai/api-key) to view your API key.
- For more information about the meaning and usage of the constants, see the `server/constant` directory.


Expand Down Expand Up @@ -242,6 +252,8 @@ LLAMA_CLOUD_API_KEY="xxxx"
- If you changed the default `IP:PORT` when starting `Ollama`, update **`OLLAMA_BASE_URL`**. Note that only the IP (or domain) and port belong here; do not append a URI.
- Change **`BOT_TOPIC`** to reflect your Bot's name. This is very important, as it will be used in `Prompt Construction`. Use a concise, clear word such as `OpenIM` or `LangChain`.
- Adjust **`URL_PREFIX`** to match your website's domain. It is mainly used to generate accessible URL links for uploaded local files, such as `http://127.0.0.1:7000/web/download_dir/2024_05_20/d3a01d6a-90cd-4c2a-b926-9cda12466caf/openssl-cookbook.pdf`.
- Set **`USE_LLAMA_PARSE`** to 1 if you want to use `LlamaParse`.
- Replace the **`LLAMA_CLOUD_API_KEY`** value with your own key. Log in to the [LlamaCloud website](https://cloud.llamaindex.ai/api-key) to view your API key.
- For more information about the meaning and usage of the constants, see the `server/constant` directory.
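
The `OLLAMA_BASE_URL` rule above (host and port only, no URI) can be checked with a few lines of standard-library code. The helper `is_bare_base_url` is hypothetical, and the sketch assumes the value carries an `http` or `https` scheme.

```python
from urllib.parse import urlparse


def is_bare_base_url(value: str) -> bool:
    """True if `value` is scheme://host[:port] with no path, query or fragment."""
    parsed = urlparse(value)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    # A single trailing slash is tolerated; anything longer counts as a URI path.
    return parsed.path in ("", "/") and not parsed.query and not parsed.fragment
```

For instance, `is_bare_base_url("http://127.0.0.1:11434")` passes, while a value ending in `/api/generate` is rejected.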


Expand Down
1 change: 1 addition & 0 deletions env_of_openai
Original file line number Diff line number Diff line change
Expand Up @@ -9,3 +9,4 @@ USE_RERANKING=1
USE_DEBUG=0
USE_LLAMA_PARSE=0
LLAMA_CLOUD_API_KEY="xxxx"
+USE_GPT4O=0
3 changes: 1 addition & 2 deletions requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -29,5 +29,4 @@ onnxruntime==1.16.3
numpy==1.26.4
et-xmlfile==1.1.0
openpyxl==3.1.2
-llama-index==0.10.43
-llama-parse==0.4.4
+llama-parse==0.4.6
2 changes: 1 addition & 1 deletion server/app/account.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
import time
-from flask import Blueprint, Flask, request
+from flask import Blueprint, request
from werkzeug.security import generate_password_hash, check_password_hash
from server.app.utils.decorators import token_required
from server.app.utils.diskcache_lock import diskcache_lock
Expand Down
2 changes: 1 addition & 1 deletion server/app/auth.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
from flask import Blueprint, request
from server.logger.logger_config import my_logger as logger
from server.app.utils.token_helper import TokenHelper
from server.logger.logger_config import my_logger as logger

auth_bp = Blueprint('auth', __name__, url_prefix='/open_kf_api/auth')

Expand Down
2 changes: 1 addition & 1 deletion server/app/bot_config.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
import json
import time
-from flask import Blueprint, Flask, request
+from flask import Blueprint, request
from server.app.utils.decorators import token_required
from server.app.utils.sqlite_client import get_db_connection
from server.app.utils.diskcache_client import diskcache_client
Expand Down
2 changes: 1 addition & 1 deletion server/app/common.py
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
from datetime import datetime
import os
import uuid
-from flask import Blueprint, Flask, request
+from flask import Blueprint, request
from werkzeug.utils import secure_filename
from server.constant.constants import STATIC_DIR, MEDIA_DIR
from server.app.utils.decorators import token_required
Expand Down
10 changes: 6 additions & 4 deletions server/app/files.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,11 +6,13 @@
from threading import Thread
import time
from typing import Dict, List, Any
from urllib.parse import urlparse
import uuid
from flask import Blueprint, Flask, request
from werkzeug.utils import secure_filename
from server.constant.constants import MAX_LOCAL_FILE_BATCH_LENGTH, MAX_FILE_SIZE, LOCAL_FILE_DOWNLOAD_DIR, STATIC_DIR, FILE_LOADER_EXTENSIONS, MAX_CONCURRENT_WRITES, LOCAL_FILE_PROCESS_FAILED
from flask import Blueprint, request
from server.constant.constants import (MAX_LOCAL_FILE_BATCH_LENGTH,
MAX_FILE_SIZE, LOCAL_FILE_DOWNLOAD_DIR,
STATIC_DIR, FILE_LOADER_EXTENSIONS,
MAX_CONCURRENT_WRITES,
LOCAL_FILE_PROCESS_FAILED)
from server.app.utils.decorators import token_required
from server.app.utils.sqlite_client import get_db_connection
from server.app.utils.diskcache_lock import diskcache_lock
Expand Down
2 changes: 1 addition & 1 deletion server/app/intervention.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
import json
import time
-from flask import Blueprint, Flask, request
+from flask import Blueprint, request
from server.app.utils.decorators import token_required
from server.app.utils.sqlite_client import get_db_connection
from server.app.utils.diskcache_client import diskcache_client
Expand Down
6 changes: 4 additions & 2 deletions server/app/queries.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,9 +6,11 @@
import time
from typing import List, Dict, Any, Tuple
from urllib.parse import urlparse
-from flask import Blueprint, Flask, request, Response
+from flask import Blueprint, request, Response
from langchain.schema.document import Document
-from server.constant.constants import RECALL_TOP_K, RERANK_RECALL_TOP_K, MAX_QUERY_LENGTH, SESSION_EXPIRE_TIME, MAX_HISTORY_SESSION_LENGTH
+from server.constant.constants import (RECALL_TOP_K, RERANK_RECALL_TOP_K,
+                                       MAX_QUERY_LENGTH, SESSION_EXPIRE_TIME,
+                                       MAX_HISTORY_SESSION_LENGTH)
from server.app.utils.decorators import token_required
from server.app.utils.sqlite_client import get_db_connection
from server.app.utils.diskcache_client import diskcache_client
Expand Down
9 changes: 6 additions & 3 deletions server/app/sitemaps.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,14 +3,17 @@
import json
from threading import Thread
import time
-from typing import Callable, Dict, Any, List, Union, Set
+from typing import Callable, Dict, Any, List
from urllib.parse import urlparse
-from flask import Blueprint, Flask, request
+from flask import Blueprint, request
from server.app.utils.decorators import token_required
from server.app.utils.sqlite_client import get_db_connection
from server.app.utils.diskcache_lock import diskcache_lock
from server.app.utils.url_helper import is_valid_url
-from server.constant.constants import ADD_SITEMAP_CONTENT, DELETE_SITEMAP_CONTENT, UPDATE_SITEMAP_CONTENT, DOMAIN_PROCESSING, FROM_SITEMAP_URL
+from server.constant.constants import (ADD_SITEMAP_CONTENT,
+                                       DELETE_SITEMAP_CONTENT,
+                                       UPDATE_SITEMAP_CONTENT,
+                                       DOMAIN_PROCESSING, FROM_SITEMAP_URL)
from server.logger.logger_config import my_logger as logger
from server.rag.index.parser.html_parser.web_link_crawler import AsyncCrawlerSiteLink
from server.rag.index.parser.html_parser.web_content_crawler import AsyncCrawlerSiteContent
Expand Down
7 changes: 5 additions & 2 deletions server/app/urls.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,8 +4,11 @@
import time
from typing import Dict, Any
from urllib.parse import urlparse
-from flask import Blueprint, Flask, request
-from server.constant.constants import MAX_ISOLATED_URL_BATCH_LENGTH, FROM_ISOLATED_URL, ADD_ISOLATED_URL_CONTENT, DELETE_ISOLATED_URL_CONTENT
+from flask import Blueprint, request
+from server.constant.constants import (MAX_ISOLATED_URL_BATCH_LENGTH,
+                                       FROM_ISOLATED_URL,
+                                       ADD_ISOLATED_URL_CONTENT,
+                                       DELETE_ISOLATED_URL_CONTENT)
from server.app.utils.decorators import token_required
from server.app.utils.sqlite_client import get_db_connection
from server.app.utils.diskcache_lock import diskcache_lock
Expand Down
1 change: 0 additions & 1 deletion server/app/utils/diskcache_client.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,3 @@
-from contextlib import contextmanager
from typing import Any, Optional, List
from diskcache import Cache
from server.constant.constants import DISKCACHE_DIR
Expand Down
5 changes: 3 additions & 2 deletions server/app/utils/diskcache_lock.py
Original file line number Diff line number Diff line change
@@ -1,8 +1,9 @@
from contextlib import contextmanager
-from typing import Generator, Any
+from typing import Generator
from diskcache import Cache, Lock
from server.app.utils.diskcache_client import diskcache_client
-from server.constant.constants import DISTRIBUTED_LOCK_ID, DISTRIBUTED_LOCK_EXPIRE_TIME
+from server.constant.constants import (DISTRIBUTED_LOCK_ID,
+                                       DISTRIBUTED_LOCK_EXPIRE_TIME)


class DiskcacheLock:
Expand Down
5 changes: 4 additions & 1 deletion server/rag/index/embedder/document_embedder.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,10 @@
from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import OllamaEmbeddings
from langchain.schema.document import Document
-from server.constant.constants import OPENAI_EMBEDDING_MODEL_NAME, ZHIPUAI_EMBEDDING_MODEL_NAME, OPENAI_EMBEDDING_MODEL_NAME, CHROMA_DB_DIR, CHROMA_COLLECTION_NAME, OLLAMA_EMBEDDING_MODEL_NAME
+from server.constant.constants import (OPENAI_EMBEDDING_MODEL_NAME,
+                                       ZHIPUAI_EMBEDDING_MODEL_NAME,
+                                       CHROMA_DB_DIR, CHROMA_COLLECTION_NAME,
+                                       OLLAMA_EMBEDDING_MODEL_NAME)
from server.logger.logger_config import my_logger as logger
from server.rag.index.embedder.zhipuai_embedder import ZhipuAIEmbeddings

Expand Down
1 change: 0 additions & 1 deletion server/rag/index/parser/file_loader/table_processor.py
Original file line number Diff line number Diff line change
Expand Up @@ -122,7 +122,6 @@ def convert_table_to_markdown(

    def identify_tables(self, sheet: Worksheet) -> List[str]:
        """Scan the worksheet for tables and return a list of Markdown formatted tables."""
-        sheet = self.wb.active
        max_row = sheet.max_row
        max_col = sheet.max_column
        markdown_tables = []
Expand Down
Empty file.
27 changes: 27 additions & 0 deletions server/rag/index/parser/file_parser/llamaparse/file_handler.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
import shutil
from abc import ABC, abstractmethod


class FileHandler(ABC):
    @abstractmethod
    def download_file(self, file_path: str, destination_path: str) -> None:
        pass

    @abstractmethod
    def upload_file(self, file_path: str, destination_path: str) -> None:
        pass

    @abstractmethod
    def sync_foler(self, source: str, destination: str) -> None:
        pass


class LocalHandler(FileHandler):
    def download_file(self, file_path: str, destination_path: str) -> None:
        shutil.copy(file_path, destination_path)

    def upload_file(self, file_path: str, destination_path: str) -> None:
        shutil.copy(file_path, destination_path)

    def sync_foler(self, source: str, destination: str) -> None:
        shutil.copytree(source, destination, dirs_exist_ok=True)
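
As a rough illustration of how the new handler might be used, the sketch below repeats the two classes from the diff (including the method name `sync_foler` exactly as spelled there) so that it runs on its own, then copies a file and mirrors a folder inside a temporary directory. The `demo` function and the file names are hypothetical.

```python
import shutil
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class FileHandler(ABC):
    """Storage-agnostic interface, reproduced from the diff."""

    @abstractmethod
    def download_file(self, file_path: str, destination_path: str) -> None:
        pass

    @abstractmethod
    def upload_file(self, file_path: str, destination_path: str) -> None:
        pass

    @abstractmethod
    def sync_foler(self, source: str, destination: str) -> None:
        pass


class LocalHandler(FileHandler):
    """Local-disk implementation: every operation is a plain copy."""

    def download_file(self, file_path: str, destination_path: str) -> None:
        shutil.copy(file_path, destination_path)

    def upload_file(self, file_path: str, destination_path: str) -> None:
        shutil.copy(file_path, destination_path)

    def sync_foler(self, source: str, destination: str) -> None:
        # dirs_exist_ok=True (Python 3.8+) lets repeated syncs reuse the target.
        shutil.copytree(source, destination, dirs_exist_ok=True)


def demo() -> list:
    """Copy one file, mirror its folder, and return the mirrored file names."""
    handler = LocalHandler()
    with tempfile.TemporaryDirectory() as tmp:
        src_dir = Path(tmp) / "src"
        src_dir.mkdir()
        (src_dir / "a.txt").write_text("hello")
        handler.upload_file(str(src_dir / "a.txt"), str(src_dir / "b.txt"))
        handler.sync_foler(str(src_dir), str(Path(tmp) / "dst"))
        return sorted(p.name for p in (Path(tmp) / "dst").iterdir())
```

Because callers depend only on the abstract `FileHandler`, a handler for remote storage could presumably be swapped in by implementing the same three methods.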