-
Notifications
You must be signed in to change notification settings - Fork 6.6k
Open
Labels
🐞 bugSomething isn't working, pull request that fix bug.Something isn't working, pull request that fix bug.
Description
Self Checks
- I have searched for existing issues search for existing issues, including closed ones.
- I confirm that I am using English to submit this report (Language Policy).
- Non-english title submitions will be closed directly ( 非英文标题的提交将会被直接关闭 ) (Language Policy).
- Please do not modify this template :) and fill in all the required fields.
RAGFlow workspace code commit ID
RAGFlow image version
d55f446(v0.20.3)
Other environment information
Actual behavior
This commit causes no delimiter to work when using general
. Delimiter will only be used after chunk_len reaches chunk_token_num
.
Expected behavior
Use delimiter
to split and double split when chunk_token_num
is exceeded
Steps to reproduce
1. Upload excel document
2. Use the `general` method to execute `parse`
3. Modify `delimiter` to the characters that exist in other known documents and rerun `parse`
4. Check the number of chunks, there will be no changes
Additional information
def naive_merge(sections, chunk_token_num=128, delimiter="\n。;!?", overlapped_percent=0):
from deepdoc.parser.pdf_parser import RAGFlowPdfParser
if not sections:
return []
if isinstance(sections[0], type("")):
sections = [(s, "") for s in sections]
cks = [""]
tk_nums = [0]
def add_chunk(t, pos):
nonlocal cks, tk_nums, delimiter
tnum = num_tokens_from_string(t)
if not pos:
pos = ""
if tnum < 8:
pos = ""
# Ensure that the length of the merged chunk does not exceed chunk_token_num
if cks[-1] == "" or tk_nums[-1] > chunk_token_num * (100 - overlapped_percent)/100.:
if cks:
overlapped = RAGFlowPdfParser.remove_tag(cks[-1])
t = overlapped[int(len(overlapped)*(100-overlapped_percent)/100.):] + t
if t.find(pos) < 0:
t += pos
cks.append(t)
tk_nums.append(tnum)
else:
if cks[-1].find(pos) < 0:
t += pos
cks[-1] += t
tk_nums[-1] += tnum
dels = get_delimiters(delimiter)
for sec, pos in sections:
# this if ignore any delimiter
if num_tokens_from_string(sec) < chunk_token_num:
add_chunk(sec, pos)
continue
splited_sec = re.split(r"(%s)" % dels, sec, flags=re.DOTALL)
for sub_sec in splited_sec:
if re.match(f"^{dels}$", sub_sec):
continue
add_chunk(sub_sec, pos)
dosubot
Metadata
Metadata
Assignees
Labels
🐞 bugSomething isn't working, pull request that fix bug.Something isn't working, pull request that fix bug.