Commit 5e5618b (parent: 06987c4), committed Jul 15, 2024

Encapsulate LlamaParse to optimize the processing of chunks and metadata information.

22 files changed: +277 −25 lines
 

.gitignore (+5)

@@ -7,6 +7,11 @@ __pycache__/
 
 .DS_Store
 
+chroma_dir
+diskcache_dir
+sqlite_dir
+web/download_dir
+
 # C extensions
 *.so

README.md (+12)

@@ -91,13 +91,17 @@ USE_RERANKING=1
 USE_DEBUG=0
 USE_LLAMA_PARSE=0
 LLAMA_CLOUD_API_KEY="xxxx"
+USE_GPT4O=0
 ```
 
 - Don't modify **`LLM_NAME`**
 - Modify the **`OPENAI_API_KEY`** with your own key. Please log in to the [OpenAI website](https://platform.openai.com/api-keys) to view your API Key.
 - Update the **`GPT_MODEL_NAME`** setting, replacing `gpt-3.5-turbo` with `gpt-4-turbo` or `gpt-4o` if you want to use GPT-4.
 - Change **`BOT_TOPIC`** to reflect your Bot's name. This is very important, as it will be used in `Prompt Construction`. Please try to use a concise and clear word, such as `OpenIM`, `LangChain`.
 - Adjust **`URL_PREFIX`** to match your website's domain. This is mainly for generating accessible URL links for uploaded local files. Such as `http://127.0.0.1:7000/web/download_dir/2024_05_20/d3a01d6a-90cd-4c2a-b926-9cda12466caf/openssl-cookbook.pdf`.
+- Set **`USE_LLAMA_PARSE`** to 1 if you want to use `LlamaParse`.
+- Modify the **`LLAMA_CLOUD_API_KEY`** with your own key. Please log in to the [LlamaCloud website](https://cloud.llamaindex.ai/api-key) to view your API Key.
+- Set **`USE_GPT4O`** to 1 if you want to use `GPT-4o` mode.
 - For more information about the meanings and usages of constants, you can check under the `server/constant` directory.
 
 #### Using ZhipuAI as the LLM base

@@ -130,6 +134,8 @@ LLAMA_CLOUD_API_KEY="xxxx"
 - Update the **`GLM_MODEL_NAME`** setting, the model list is `['glm-3-turbo', 'glm-4', 'glm-4-0520', 'glm-4-air', 'glm-4-airx', 'glm-4-flash']`.
 - Change **`BOT_TOPIC`** to reflect your Bot's name. This is very important, as it will be used in `Prompt Construction`. Please try to use a concise and clear word, such as `OpenIM`, `LangChain`.
 - Adjust **`URL_PREFIX`** to match your website's domain. This is mainly for generating accessible URL links for uploaded local files. Such as `http://127.0.0.1:7000/web/download_dir/2024_05_20/d3a01d6a-90cd-4c2a-b926-9cda12466caf/openssl-cookbook.pdf`.
+- Set **`USE_LLAMA_PARSE`** to 1 if you want to use `LlamaParse`.
+- Modify the **`LLAMA_CLOUD_API_KEY`** with your own key. Please log in to the [LlamaCloud website](https://cloud.llamaindex.ai/api-key) to view your API Key.
 - For more information about the meanings and usages of constants, you can check under the `server/constant` directory.
 
 #### Using DeepSeek as the LLM base

@@ -167,6 +173,8 @@ LLAMA_CLOUD_API_KEY="xxxx"
 - Update the **`DEEPSEEK_MODEL_NAME`** setting if you want to use other models of DeepSeek.
 - Change **`BOT_TOPIC`** to reflect your Bot's name. This is very important, as it will be used in `Prompt Construction`. Please try to use a concise and clear word, such as `OpenIM`, `LangChain`.
 - Adjust **`URL_PREFIX`** to match your website's domain. This is mainly for generating accessible URL links for uploaded local files. Such as `http://127.0.0.1:7000/web/download_dir/2024_05_20/d3a01d6a-90cd-4c2a-b926-9cda12466caf/openssl-cookbook.pdf`.
+- Set **`USE_LLAMA_PARSE`** to 1 if you want to use `LlamaParse`.
+- Modify the **`LLAMA_CLOUD_API_KEY`** with your own key. Please log in to the [LlamaCloud website](https://cloud.llamaindex.ai/api-key) to view your API Key.
 - For more information about the meanings and usages of constants, you can check under the `server/constant` directory.
 
 

@@ -205,6 +213,8 @@ LLAMA_CLOUD_API_KEY="xxxx"
 - Update the **`MOONSHOT_MODEL_NAME`** setting if you want to use other models of Moonshot.
 - Change **`BOT_TOPIC`** to reflect your Bot's name. This is very important, as it will be used in `Prompt Construction`. Please try to use a concise and clear word, such as `OpenIM`, `LangChain`.
 - Adjust **`URL_PREFIX`** to match your website's domain. This is mainly for generating accessible URL links for uploaded local files. Such as `http://127.0.0.1:7000/web/download_dir/2024_05_20/d3a01d6a-90cd-4c2a-b926-9cda12466caf/openssl-cookbook.pdf`.
+- Set **`USE_LLAMA_PARSE`** to 1 if you want to use `LlamaParse`.
+- Modify the **`LLAMA_CLOUD_API_KEY`** with your own key. Please log in to the [LlamaCloud website](https://cloud.llamaindex.ai/api-key) to view your API Key.
 - For more information about the meanings and usages of constants, you can check under the `server/constant` directory.
 
 

@@ -242,6 +252,8 @@ LLAMA_CLOUD_API_KEY="xxxx"
 - If you have changed the default `IP:PORT` when starting `Ollama`, please update **`OLLAMA_BASE_URL`**. Please pay special attention, only enter the IP (domain) and PORT here, without appending a URI.
 - Change **`BOT_TOPIC`** to reflect your Bot's name. This is very important, as it will be used in `Prompt Construction`. Please try to use a concise and clear word, such as `OpenIM`, `LangChain`.
 - Adjust **`URL_PREFIX`** to match your website's domain. This is mainly for generating accessible URL links for uploaded local files. Such as `http://127.0.0.1:7000/web/download_dir/2024_05_20/d3a01d6a-90cd-4c2a-b926-9cda12466caf/openssl-cookbook.pdf`.
+- Set **`USE_LLAMA_PARSE`** to 1 if you want to use `LlamaParse`.
+- Modify the **`LLAMA_CLOUD_API_KEY`** with your own key. Please log in to the [LlamaCloud website](https://cloud.llamaindex.ai/api-key) to view your API Key.
 - For more information about the meanings and usages of constants, you can check under the `server/constant` directory.
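The flags added above (`USE_LLAMA_PARSE`, `USE_GPT4O`) are plain 0/1 environment variables; the server reads them with `os.getenv` at startup. A minimal sketch of reading such a flag defensively (the `env_flag` helper is illustrative, not from the repo, which calls `int(os.getenv(...))` directly):

```python
import os

def env_flag(name: str, default: int = 0) -> bool:
    """Read a 0/1 environment flag, tolerating an unset variable."""
    raw = os.getenv(name)
    if raw is None:
        return bool(default)
    return bool(int(raw))

# Mirror the USE_GPT4O / USE_LLAMA_PARSE entries shown above.
os.environ["USE_GPT4O"] = "1"
os.environ.pop("USE_LLAMA_PARSE", None)

print(env_flag("USE_GPT4O"))        # True
print(env_flag("USE_LLAMA_PARSE"))  # False: unset, falls back to default 0
```

Falling back to a default avoids the `TypeError` that `int(os.getenv('USE_GPT4O'))` raises when the variable is missing from the environment.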

env_of_openai (+1)

@@ -9,3 +9,4 @@ USE_RERANKING=1
 USE_DEBUG=0
 USE_LLAMA_PARSE=0
 LLAMA_CLOUD_API_KEY="xxxx"
+USE_GPT4O=0

requirements.txt (+1 −2)

@@ -29,5 +29,4 @@ onnxruntime==1.16.3
 numpy==1.26.4
 et-xmlfile==1.1.0
 openpyxl==3.1.2
-llama-index==0.10.43
-llama-parse==0.4.4
+llama-parse==0.4.6

server/app/account.py (+1 −1)

@@ -1,5 +1,5 @@
 import time
-from flask import Blueprint, Flask, request
+from flask import Blueprint, request
 from werkzeug.security import generate_password_hash, check_password_hash
 from server.app.utils.decorators import token_required
 from server.app.utils.diskcache_lock import diskcache_lock

server/app/auth.py (+1 −1)

@@ -1,6 +1,6 @@
 from flask import Blueprint, request
-from server.logger.logger_config import my_logger as logger
 from server.app.utils.token_helper import TokenHelper
+from server.logger.logger_config import my_logger as logger
 
 auth_bp = Blueprint('auth', __name__, url_prefix='/open_kf_api/auth')

server/app/bot_config.py (+1 −1)

@@ -1,6 +1,6 @@
 import json
 import time
-from flask import Blueprint, Flask, request
+from flask import Blueprint, request
 from server.app.utils.decorators import token_required
 from server.app.utils.sqlite_client import get_db_connection
 from server.app.utils.diskcache_client import diskcache_client

server/app/common.py (+1 −1)

@@ -1,7 +1,7 @@
 from datetime import datetime
 import os
 import uuid
-from flask import Blueprint, Flask, request
+from flask import Blueprint, request
 from werkzeug.utils import secure_filename
 from server.constant.constants import STATIC_DIR, MEDIA_DIR
 from server.app.utils.decorators import token_required

server/app/files.py (+6 −4)

@@ -6,11 +6,13 @@
 from threading import Thread
 import time
 from typing import Dict, List, Any
-from urllib.parse import urlparse
 import uuid
-from flask import Blueprint, Flask, request
-from werkzeug.utils import secure_filename
-from server.constant.constants import MAX_LOCAL_FILE_BATCH_LENGTH, MAX_FILE_SIZE, LOCAL_FILE_DOWNLOAD_DIR, STATIC_DIR, FILE_LOADER_EXTENSIONS, MAX_CONCURRENT_WRITES, LOCAL_FILE_PROCESS_FAILED
+from flask import Blueprint, request
+from server.constant.constants import (MAX_LOCAL_FILE_BATCH_LENGTH,
+                                       MAX_FILE_SIZE, LOCAL_FILE_DOWNLOAD_DIR,
+                                       STATIC_DIR, FILE_LOADER_EXTENSIONS,
+                                       MAX_CONCURRENT_WRITES,
+                                       LOCAL_FILE_PROCESS_FAILED)
 from server.app.utils.decorators import token_required
 from server.app.utils.sqlite_client import get_db_connection
 from server.app.utils.diskcache_lock import diskcache_lock

server/app/intervention.py (+1 −1)

@@ -1,6 +1,6 @@
 import json
 import time
-from flask import Blueprint, Flask, request
+from flask import Blueprint, request
 from server.app.utils.decorators import token_required
 from server.app.utils.sqlite_client import get_db_connection
 from server.app.utils.diskcache_client import diskcache_client

server/app/queries.py (+4 −2)

@@ -6,9 +6,11 @@
 import time
 from typing import List, Dict, Any, Tuple
 from urllib.parse import urlparse
-from flask import Blueprint, Flask, request, Response
+from flask import Blueprint, request, Response
 from langchain.schema.document import Document
-from server.constant.constants import RECALL_TOP_K, RERANK_RECALL_TOP_K, MAX_QUERY_LENGTH, SESSION_EXPIRE_TIME, MAX_HISTORY_SESSION_LENGTH
+from server.constant.constants import (RECALL_TOP_K, RERANK_RECALL_TOP_K,
+                                       MAX_QUERY_LENGTH, SESSION_EXPIRE_TIME,
+                                       MAX_HISTORY_SESSION_LENGTH)
 from server.app.utils.decorators import token_required
 from server.app.utils.sqlite_client import get_db_connection
 from server.app.utils.diskcache_client import diskcache_client

server/app/sitemaps.py (+6 −3)

@@ -3,14 +3,17 @@
 import json
 from threading import Thread
 import time
-from typing import Callable, Dict, Any, List, Union, Set
+from typing import Callable, Dict, Any, List
 from urllib.parse import urlparse
-from flask import Blueprint, Flask, request
+from flask import Blueprint, request
 from server.app.utils.decorators import token_required
 from server.app.utils.sqlite_client import get_db_connection
 from server.app.utils.diskcache_lock import diskcache_lock
 from server.app.utils.url_helper import is_valid_url
-from server.constant.constants import ADD_SITEMAP_CONTENT, DELETE_SITEMAP_CONTENT, UPDATE_SITEMAP_CONTENT, DOMAIN_PROCESSING, FROM_SITEMAP_URL
+from server.constant.constants import (ADD_SITEMAP_CONTENT,
+                                       DELETE_SITEMAP_CONTENT,
+                                       UPDATE_SITEMAP_CONTENT,
+                                       DOMAIN_PROCESSING, FROM_SITEMAP_URL)
 from server.logger.logger_config import my_logger as logger
 from server.rag.index.parser.html_parser.web_link_crawler import AsyncCrawlerSiteLink
 from server.rag.index.parser.html_parser.web_content_crawler import AsyncCrawlerSiteContent

server/app/urls.py (+5 −2)

@@ -4,8 +4,11 @@
 import time
 from typing import Dict, Any
 from urllib.parse import urlparse
-from flask import Blueprint, Flask, request
-from server.constant.constants import MAX_ISOLATED_URL_BATCH_LENGTH, FROM_ISOLATED_URL, ADD_ISOLATED_URL_CONTENT, DELETE_ISOLATED_URL_CONTENT
+from flask import Blueprint, request
+from server.constant.constants import (MAX_ISOLATED_URL_BATCH_LENGTH,
+                                       FROM_ISOLATED_URL,
+                                       ADD_ISOLATED_URL_CONTENT,
+                                       DELETE_ISOLATED_URL_CONTENT)
 from server.app.utils.decorators import token_required
 from server.app.utils.sqlite_client import get_db_connection
 from server.app.utils.diskcache_lock import diskcache_lock

server/app/utils/diskcache_client.py (−1)

@@ -1,4 +1,3 @@
-from contextlib import contextmanager
 from typing import Any, Optional, List
 from diskcache import Cache
 from server.constant.constants import DISKCACHE_DIR

server/app/utils/diskcache_lock.py (+3 −2)

@@ -1,8 +1,9 @@
 from contextlib import contextmanager
-from typing import Generator, Any
+from typing import Generator
 from diskcache import Cache, Lock
 from server.app.utils.diskcache_client import diskcache_client
-from server.constant.constants import DISTRIBUTED_LOCK_ID, DISTRIBUTED_LOCK_EXPIRE_TIME
+from server.constant.constants import (DISTRIBUTED_LOCK_ID,
+                                       DISTRIBUTED_LOCK_EXPIRE_TIME)
 
 
 class DiskcacheLock:

server/rag/index/embedder/document_embedder.py (+4 −1)

@@ -6,7 +6,10 @@
 from langchain_openai import OpenAIEmbeddings
 from langchain_community.embeddings import OllamaEmbeddings
 from langchain.schema.document import Document
-from server.constant.constants import OPENAI_EMBEDDING_MODEL_NAME, ZHIPUAI_EMBEDDING_MODEL_NAME, OPENAI_EMBEDDING_MODEL_NAME, CHROMA_DB_DIR, CHROMA_COLLECTION_NAME, OLLAMA_EMBEDDING_MODEL_NAME
+from server.constant.constants import (OPENAI_EMBEDDING_MODEL_NAME,
+                                       ZHIPUAI_EMBEDDING_MODEL_NAME,
+                                       CHROMA_DB_DIR, CHROMA_COLLECTION_NAME,
+                                       OLLAMA_EMBEDDING_MODEL_NAME)
 from server.logger.logger_config import my_logger as logger
 from server.rag.index.embedder.zhipuai_embedder import ZhipuAIEmbeddings
server/rag/index/parser/file_parser/llamaparse/__init__.py

Whitespace-only changes.

server/rag/index/parser/file_parser/llamaparse/file_handler.py (new file, +27)

import shutil
from abc import ABC, abstractmethod


class FileHandler(ABC):
    @abstractmethod
    def download_file(self, file_path: str, destination_path: str) -> None:
        pass

    @abstractmethod
    def upload_file(self, file_path: str, destination_path: str) -> None:
        pass

    @abstractmethod
    def sync_foler(self, source: str, destination: str) -> None:
        pass


class LocalHandler(FileHandler):
    def download_file(self, file_path: str, destination_path: str) -> None:
        shutil.copy(file_path, destination_path)

    def upload_file(self, file_path: str, destination_path: str) -> None:
        shutil.copy(file_path, destination_path)

    def sync_foler(self, source: str, destination: str) -> None:
        shutil.copytree(source, destination, dirs_exist_ok=True)
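`LocalHandler` is a thin wrapper over `shutil`; a minimal usage sketch (the class is re-declared so the snippet runs standalone, and the file names are made up):

```python
import shutil
import tempfile
from pathlib import Path

# Re-declaration of the LocalHandler shown above, so this sketch is self-contained.
class LocalHandler:
    def download_file(self, file_path: str, destination_path: str) -> None:
        shutil.copy(file_path, destination_path)

    def upload_file(self, file_path: str, destination_path: str) -> None:
        shutil.copy(file_path, destination_path)

    def sync_foler(self, source: str, destination: str) -> None:
        # dirs_exist_ok=True lets repeated syncs overwrite an existing destination tree.
        shutil.copytree(source, destination, dirs_exist_ok=True)

src = Path(tempfile.mkdtemp())
dst = Path(tempfile.mkdtemp())
(src / "doc.pdf").write_bytes(b"fake pdf bytes")

handler = LocalHandler()
handler.download_file((src / "doc.pdf").as_posix(), (dst / "doc.pdf").as_posix())
handler.sync_foler(src.as_posix(), (dst / "mirror").as_posix())

print((dst / "mirror" / "doc.pdf").exists())  # True
```

Because the abstract `FileHandler` interface only speaks in paths, a remote backend (e.g. object storage) could later be swapped in without touching `DocParser`.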
New file (+183), defining the `DocParser` wrapper around LlamaParse:

import json
import os
import requests
import tempfile
import time
from pathlib import Path
from typing import Any, List, Optional
from llama_parse import LlamaParse
from server.rag.index.parser.file_parser.llamaparse.file_handler import FileHandler
from server.logger.logger_config import my_logger as logger

all_elements_output_file = "all_elements.json"
chunks_output_file = "chunks.json"


class DocParser:
    def __init__(self,
                 file_handler: FileHandler,
                 language: str = "en",
                 is_download_image: bool = True) -> None:
        self.file_handler = file_handler
        self.is_download_image = is_download_image
        USE_GPT4O = int(os.getenv('USE_GPT4O'))
        if USE_GPT4O:
            self.llamaparse = LlamaParse(
                api_key=os.getenv('LLAMA_CLOUD_API_KEY'),
                gpt4o_mode=True,
                gpt4o_api_key=os.getenv('OPENAI_API_KEY'),
                result_type="json",
                language=language,
                verbose=True)
        else:
            self.llamaparse = LlamaParse(
                api_key=os.getenv('LLAMA_CLOUD_API_KEY'),
                result_type="json",
                language=language,
                verbose=True)
        logger.info(
            f"Init DocParser of llamaparse, language: '{language}', is_download_image: {is_download_image}, USE_GPT4O: {USE_GPT4O}"
        )

    def parse_file(
            self,
            filepath: Path,
            destination_folder: Path,
            include_chunking: bool = True) -> tuple[list[Any], list[Any]]:
        with tempfile.TemporaryDirectory() as temp_dir:
            temp_file = Path(temp_dir) / filepath.name
            self.file_handler.download_file(filepath.as_posix(),
                                            temp_file.as_posix())

            elements_file = f"{temp_dir}/{all_elements_output_file}"

            elements, chunks = self.partition_doc_to_folder(
                temp_file,
                Path(temp_dir),
                include_chunking=include_chunking,
                all_elements_output_file=elements_file)

            self.file_handler.sync_foler(temp_dir,
                                         destination_folder.as_posix())

            return elements, chunks

    def partition_doc(
        self,
        input_file: Path,
        output_dir: Path,
        include_chunking: bool = True,
    ) -> tuple[list[Any], list[Any]]:
        elements = []
        chunks = []
        try:
            import nest_asyncio
            nest_asyncio.apply()

            json_objs = self.llamaparse.get_json_result(str(input_file))
            job_id = json_objs[0]["job_id"]
            elements = json_objs[0]["pages"]
            job_metadata = json_objs[0]["job_metadata"]
            logger.info(
                f"For input_file: '{input_file}', job_id is '{job_id}', job_metadata is {job_metadata}"
            )

            if self.is_download_image:
                """
                TODO:
                To enhance the efficiency of image downloading, the following optimizations could be considered:
                1. Handle image downloads through asynchronous tasks to improve response times.
                2. Implement concurrent downloads to make effective use of resources and accelerate the download process.
                """
                for page_item in elements:
                    images = page_item["images"]
                    for image_item in images:
                        image_name = image_item["name"]
                        logger.info(
                            f"For input_file: '{input_file}', downloading image: '{image_name}'"
                        )
                        download_image(job_id, image_name,
                                       output_dir.as_posix())

            if include_chunking:
                """
                TODO:
                The current chunking strategy treats each page as a separate chunk. Future optimizations might include:
                1. Evaluating whether adjacent pages can be merged into a single chunk.
                2. Considering whether it's necessary to split a single page into multiple chunks.
                """
                filename = input_file.name
                file_extension = input_file.suffix
                for page_item in elements:
                    page_number = page_item["page"]
                    chunk_item = {
                        "chunk_text": page_item["md"],
                        "metadata": {
                            "filename": filename,
                            "filetype": f"application/{file_extension[1:]}",
                            "last_modified_timestamp": int(time.time()),
                            "beginning_page": page_number,
                            "ending_page": page_number
                        }
                    }
                    chunks.append(chunk_item)
        except Exception as e:
            logger.error(
                f"Parsing file: '{input_file}' failed, exception: {e}")

        return elements, chunks

    def partition_doc_to_folder(
        self,
        input_file: Path,
        output_dir: Path,
        all_elements_output_file: str,
        include_chunking: bool = True,
    ) -> tuple[list[Any], list[Any]]:
        elements, chunks = self.partition_doc(input_file, output_dir,
                                              include_chunking)

        elements_output_file = output_dir / all_elements_output_file
        elements_to_json(elements, elements_output_file.as_posix())
        elements_to_json(chunks, (output_dir / chunks_output_file).as_posix())

        return elements, chunks


def elements_to_json(
    elements: List[Any],
    filename: Optional[str] = None,
    indent: int = 4,
    encoding: str = "utf-8",
) -> Optional[str]:
    """
    Saves a list of elements to a JSON file if filename is specified.
    Otherwise, return the list of elements as a string.
    """
    # -- serialize `elements` as a JSON array (str) --
    json_str = json.dumps(elements, indent=indent, sort_keys=False)
    if filename is not None:
        with open(filename, "w", encoding=encoding) as f:
            f.write(json_str)
        return None
    return json_str


def download_image(job_id: str, image_name: str, output_dir: str) -> None:
    url = f"https://api.cloud.llamaindex.ai/api/parsing/job/{job_id}/result/image/{image_name}"
    headers = {
        'Authorization': f'Bearer {os.getenv("LLAMA_CLOUD_API_KEY")}',
        'Accept': 'application/json',
        'Content-Type': 'multipart/form-data'
    }
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            with open(f'{output_dir}/{image_name}', 'wb') as f:
                f.write(response.content)
        else:
            logger.error(
                f"Failed to retrieve '{image_name}', status_code: {response.status_code}, text: {response.text}"
            )
    except Exception as e:
        logger.error(f"Download '{image_name}' failed, error: {e}")
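The page-per-chunk strategy in `partition_doc` can be exercised without calling the LlamaParse API by feeding in a `pages` list shaped like the service's JSON result. A self-contained sketch (the `pages_to_chunks` helper and the sample payload are illustrative, not from the repo):

```python
import time
from pathlib import Path
from typing import Any, Dict, List

def pages_to_chunks(pages: List[Dict[str, Any]],
                    input_file: Path) -> List[Dict[str, Any]]:
    # Mirrors DocParser.partition_doc: one chunk per page, with the page's
    # markdown as the chunk text and file-level metadata attached.
    chunks = []
    for page_item in pages:
        page_number = page_item["page"]
        chunks.append({
            "chunk_text": page_item["md"],
            "metadata": {
                "filename": input_file.name,
                "filetype": f"application/{input_file.suffix[1:]}",
                "last_modified_timestamp": int(time.time()),
                "beginning_page": page_number,
                "ending_page": page_number,
            },
        })
    return chunks

# A made-up payload with the "page"/"md" keys LlamaParse's JSON result uses.
pages = [{"page": 1, "md": "# Intro"}, {"page": 2, "md": "Body text"}]
chunks = pages_to_chunks(pages, Path("openssl-cookbook.pdf"))
print(len(chunks))                            # 2
print(chunks[0]["metadata"]["filetype"])      # application/pdf
```

Keeping `beginning_page` equal to `ending_page` is what the TODO in the source flags: merging adjacent pages into one chunk would make those two fields diverge.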

server/rag/index/parser/file_parser/markdown_parser.py (+6 −1)

@@ -4,7 +4,12 @@
 from typing import List
 from server.app.utils.diskcache_lock import diskcache_lock
 from server.logger.logger_config import my_logger as logger
-from server.constant.constants import SQLITE_DB_DIR, SQLITE_DB_NAME, MAX_CHUNK_LENGTH, CHUNK_OVERLAP, FROM_LOCAL_FILE, LOCAL_FILE_PARSING, LOCAL_FILE_PARSING_COMPLETED, LOCAL_FILE_EMBEDDED, LOCAL_FILE_PROCESS_FAILED
+from server.constant.constants import (SQLITE_DB_DIR, SQLITE_DB_NAME,
+                                       MAX_CHUNK_LENGTH, CHUNK_OVERLAP,
+                                       FROM_LOCAL_FILE, LOCAL_FILE_PARSING,
+                                       LOCAL_FILE_PARSING_COMPLETED,
+                                       LOCAL_FILE_EMBEDDED,
+                                       LOCAL_FILE_PROCESS_FAILED)
 from server.rag.index.chunk.markdown_splitter import MarkdownTextSplitter
 from server.rag.index.embedder.document_embedder import document_embedder

server/rag/index/parser/html_parser/web_content_crawler.py (+4 −1)

@@ -10,7 +10,10 @@
 from bs4 import BeautifulSoup
 import html2text
 from server.app.utils.hash import generate_md5
-from server.constant.constants import SQLITE_DB_DIR, SQLITE_DB_NAME, MAX_CRAWL_PARALLEL_REQUEST, MAX_CHUNK_LENGTH, CHUNK_OVERLAP, FROM_SITEMAP_URL
+from server.constant.constants import (SQLITE_DB_DIR, SQLITE_DB_NAME,
+                                       MAX_CRAWL_PARALLEL_REQUEST,
+                                       MAX_CHUNK_LENGTH, CHUNK_OVERLAP,
+                                       FROM_SITEMAP_URL)
 from server.logger.logger_config import my_logger as logger
 from server.rag.index.chunk.markdown_splitter import MarkdownTextSplitter
 from server.rag.index.embedder.document_embedder import document_embedder

server/rag/index/parser/html_parser/web_link_crawler.py (+5 −1)

@@ -7,7 +7,11 @@
 from bs4 import BeautifulSoup
 from server.app.utils.url_helper import is_same_domain, normalize_url
 from server.app.utils.diskcache_lock import diskcache_lock
-from server.constant.constants import SQLITE_DB_DIR, SQLITE_DB_NAME, MAX_CRAWL_PARALLEL_REQUEST, SITEMAP_URL_RECORDED, SITEMAP_URL_EXPIRED, DOMAIN_STATISTICS_GATHERING_COLLECTED
+from server.constant.constants import (SQLITE_DB_DIR, SQLITE_DB_NAME,
+                                       MAX_CRAWL_PARALLEL_REQUEST,
+                                       SITEMAP_URL_RECORDED,
+                                       SITEMAP_URL_EXPIRED,
+                                       DOMAIN_STATISTICS_GATHERING_COLLECTED)
 from server.logger.logger_config import my_logger as logger