fix(builder): fix pdf reader for normalizing text in outline #344

Merged
merged 23 commits into from
Feb 17, 2025
Commits
7be43f2
fix buidler init
northmachine Oct 31, 2024
12a3f7d
Merge branch 'master' of github.com:OpenSPG/KAG
northmachine Oct 31, 2024
af2cc50
Merge branch 'master' of github.com:OpenSPG/KAG
northmachine Nov 4, 2024
66abbf5
Merge branch 'master' of github.com:OpenSPG/KAG
northmachine Nov 4, 2024
e37f1c2
Merge branch 'master' of github.com:OpenSPG/KAG
northmachine Nov 5, 2024
976630e
Merge branch 'master' of github.com:OpenSPG/KAG
northmachine Nov 9, 2024
9524644
Merge branch 'master' of github.com:OpenSPG/KAG
northmachine Nov 19, 2024
bba0d65
Merge branch 'master' of github.com:OpenSPG/KAG
northmachine Nov 21, 2024
2824cd3
Merge branch 'master' of github.com:OpenSPG/KAG
northmachine Nov 28, 2024
e40f699
add pro commit
northmachine Nov 28, 2024
4ff4270
rename graphalgoclient to graphclient
northmachine Dec 11, 2024
498642f
Merge branch 'master' of github.com:OpenSPG/KAG
northmachine Dec 19, 2024
b7a0091
Merge branch 'master' of github.com:OpenSPG/KAG
northmachine Jan 13, 2025
b98ebc7
Merge branch 'master' of github.com:OpenSPG/KAG
northmachine Jan 13, 2025
b406a14
Merge branch 'master' of github.com:OpenSPG/KAG
northmachine Jan 14, 2025
7d5778d
Merge branch 'master' of github.com:OpenSPG/KAG
northmachine Jan 15, 2025
3d8fb1d
Merge branch 'master' of github.com:OpenSPG/KAG
northmachine Jan 16, 2025
c629d04
Merge branch 'master' of github.com:OpenSPG/KAG
northmachine Jan 17, 2025
8f05deb
Merge branch 'master' of github.com:OpenSPG/KAG
northmachine Jan 22, 2025
48b534a
Merge branch 'master' of github.com:OpenSPG/KAG
northmachine Feb 11, 2025
c52578c
fix pdf reader
northmachine Feb 13, 2025
00a051a
fix pdf reader
northmachine Feb 13, 2025
7f11d63
fix pdf reader
northmachine Feb 13, 2025
16 changes: 10 additions & 6 deletions kag/builder/component/reader/pdf_reader.py
@@ -113,15 +113,19 @@ def get_content_start(outline, page_contents):
page_contents[page_start : page_end + 1 if page_end != -1 else None]
)

# Normalize special characters in the title
def normalize_text(text):
    # Convert dash "—" to Chinese number "一"
    text = text.replace("—", "一")
    # Can add other unified conversions for Chinese and English punctuation
    text = re.sub(r"[", "[", text)
    text = re.sub(r"]", "]", text)
    text = re.sub(r"(", "(", text)
    text = re.sub(r")", ")", text)
    # Remove special characters and control characters
    text = re.sub(
        r"[\u200b\u200c\u200d\ufeff\u3000\x00-\x1f\x7f-\x9f]+", "", text
    )
    return text

outline = (normalize_text(outline[0]), outline[1], outline[2], outline[3])
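To make the effect of this hunk concrete, here is a minimal standalone sketch of the same normalization. It assumes the bracket patterns are the fullwidth forms (U+FF3B, U+FF3D, U+FF08, U+FF09) mapped to their ASCII counterparts, and it swaps `re.sub` for `str.replace` on those literals. The sample title is made up for illustration and does not come from the PR.

```python
import re

# Assumption: the bracket conversions map fullwidth forms to ASCII.
FULLWIDTH_TO_ASCII = {
    "\uff3b": "[",  # fullwidth left square bracket
    "\uff3d": "]",  # fullwidth right square bracket
    "\uff08": "(",  # fullwidth left parenthesis
    "\uff09": ")",  # fullwidth right parenthesis
}

# Zero-width characters, BOM, ideographic space, and C0/C1 control characters.
INVISIBLE = re.compile(r"[\u200b\u200c\u200d\ufeff\u3000\x00-\x1f\x7f-\x9f]+")


def normalize_text(text: str) -> str:
    """Standalone approximation of the nested helper shown in the diff."""
    # An em dash (U+2014) in a title is treated as the numeral "一" (U+4E00).
    text = text.replace("\u2014", "\u4e00")
    for fullwidth, ascii_char in FULLWIDTH_TO_ASCII.items():
        text = text.replace(fullwidth, ascii_char)
    return INVISIBLE.sub("", text)


# Hypothetical outline title with a BOM, an em dash, an ideographic space,
# fullwidth parentheses, and a trailing zero-width space.
sample_title = "\ufeff第\u2014章\u3000\uff08概述\uff09\u200b"
print(normalize_text(sample_title))
```

Because the character class also covers `\x00-\x1f`, tabs and newlines embedded in an outline title are dropped along with the zero-width characters.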
@@ -397,7 +401,7 @@ def _invoke(self, input: str, **kwargs) -> Sequence[Output]:
content = content.replace("\n", "")
page_contents.append(content)

# Using regular expressions to remove all whitespace (including spaces, tabs, newlines, etc.)
page_contents = [
    re.sub(r"\s+", "", content) for content in page_contents
]
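The whitespace stripping above matters when an outline title has to be located in page text whose spacing differs from the outline's. The sketch below shows the idea under stated assumptions: `strip_whitespace` mirrors the list comprehension in this hunk, while `find_title` is a hypothetical helper written for this example and is not the PR's `get_content_start`.

```python
import re


def strip_whitespace(pages):
    """Remove every whitespace run from each page, as the hunk above does."""
    return [re.sub(r"\s+", "", page) for page in pages]


def find_title(title, pages):
    """Hypothetical helper: return (page_index, offset) of the first match, or None."""
    needle = re.sub(r"\s+", "", title)
    for index, page in enumerate(pages):
        offset = page.find(needle)
        if offset != -1:
            return index, offset
    return None


# Made-up page text for illustration only.
pages = strip_whitespace(["Preface text", "Chapter 1  Intro\nBody text"])
print(find_title("Chapter 1 Intro", pages))
```

Note that `\s` in a Python `str` pattern matches the ideographic space U+3000 but not the zero-width space U+200B, so zero-width characters survive this step and have to be removed explicitly if they get in the way of matching.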
@@ -455,7 +459,7 @@ def _invoke(self, input: str, **kwargs) -> Sequence[Output]:
)
chunks.append(chunk)

# # Save intermediate results to file
# import pickle

# with open("debug_data.pkl", "wb") as f:
@@ -478,7 +482,7 @@ def _invoke(self, input: str, **kwargs) -> Sequence[Output]:
pdf_path = os.path.join(
os.path.dirname(__file__), "../../../../tests/builder/data/aiwen.pdf"
)
pdf_path = "/Users/zhangxinhong.zxh/Downloads/labor-law-v5.pdf"
pdf_path = "/Users/zhangxinhong.zxh/Downloads/05. 医学生物学.pdf"
# pdf_path = "/Users/zhangxinhong.zxh/Downloads/toaz.info-5dsm-5-pr_56e68a629dc4fe62699960dd5afbe362.pdf"
chunk = pdf_reader.invoke(pdf_path)
a = 1