
[Bug]: delimiter does not work for any document when general #9857

@lidongshengxdayu

Self Checks

  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (Language Policy).
  • Non-English title submissions will be closed directly (Language Policy).
  • Please do not modify this template :) and fill in all the required fields.

RAGFlow workspace code commit ID

d9fe279

RAGFlow image version

d55f446(v0.20.3)

Other environment information

Actual behavior

This commit makes the delimiter ineffective when using the `general` method. The delimiter is only applied once chunk_len reaches chunk_token_num.

Expected behavior

Split on the delimiter first, then split again when chunk_token_num is exceeded.
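
To make the expectation concrete, here is a minimal sketch of one possible reading of it: split on the delimiter first, then cut any piece that is still over the limit. It is not RAGFlow code. `count_tokens` and `split_then_resplit` are hypothetical names, tokens are approximated by whitespace-separated words, and every character of `delimiter` is treated as a separate single-character delimiter.

```python
import re


def count_tokens(text):
    # Hypothetical stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def split_then_resplit(text, chunk_token_num=128, delimiter=";"):
    # Assumption: each character of `delimiter` is its own delimiter.
    pattern = "|".join(re.escape(d) for d in delimiter)
    chunks = []
    # First split: always cut on the delimiter, whatever the text length.
    for piece in re.split(pattern, text):
        piece = piece.strip()
        if not piece:
            continue
        if count_tokens(piece) <= chunk_token_num:
            chunks.append(piece)
            continue
        # Second split: only pieces over the limit are cut again.
        words = piece.split()
        for i in range(0, len(words), chunk_token_num):
            chunks.append(" ".join(words[i:i + chunk_token_num]))
    return chunks
```

With a limit of 3 tokens, `split_then_resplit("a b; c d e f g; h", 3, ";")` cuts at both `;` characters and then cuts the middle piece once more because it has 5 tokens.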

Steps to reproduce

1. Upload an Excel document
2. Use the `general` method to execute `parse`
3. Change `delimiter` to characters that are known to exist in the document and rerun `parse`
4. Check the number of chunks; it does not change

Additional information

def naive_merge(sections, chunk_token_num=128, delimiter="\n。;!?", overlapped_percent=0):
    from deepdoc.parser.pdf_parser import RAGFlowPdfParser
    if not sections:
        return []
    if isinstance(sections[0], type("")):
        sections = [(s, "") for s in sections]
    cks = [""]
    tk_nums = [0]

    def add_chunk(t, pos):
        nonlocal cks, tk_nums, delimiter
        tnum = num_tokens_from_string(t)
        if not pos:
            pos = ""
        if tnum < 8:
            pos = ""
        # Ensure that the length of the merged chunk does not exceed chunk_token_num  
        if cks[-1] == "" or tk_nums[-1] > chunk_token_num * (100 - overlapped_percent)/100.:
            if cks:
                overlapped = RAGFlowPdfParser.remove_tag(cks[-1])
                t = overlapped[int(len(overlapped)*(100-overlapped_percent)/100.):] + t
            if t.find(pos) < 0:
                t += pos
            cks.append(t)
            tk_nums.append(tnum)
        else:
            if cks[-1].find(pos) < 0:
                t += pos
            cks[-1] += t
            tk_nums[-1] += tnum

    dels = get_delimiters(delimiter)
    for sec, pos in sections:
        # This check bypasses the delimiter for any section shorter than chunk_token_num
        if num_tokens_from_string(sec) < chunk_token_num:
            add_chunk(sec, pos)
            continue
        splited_sec = re.split(r"(%s)" % dels, sec, flags=re.DOTALL)
        for sub_sec in splited_sec:
            if re.match(f"^{dels}$", sub_sec):
                continue
            add_chunk(sub_sec, pos)
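
The effect of that early `continue` can be shown with a stripped-down model of the loop above. This is only a sketch, not the RAGFlow implementation: overlap and position tags are left out, `count_tokens` is a hypothetical whitespace-based stand-in for `num_tokens_from_string`, the delimiter pattern assumes single-character delimiters instead of calling `get_delimiters`, and the `skip_short` flag is added here purely to switch the early check on and off.

```python
import re


def count_tokens(text):
    # Hypothetical stand-in for num_tokens_from_string: whitespace-separated words.
    return len(text.split())


def merge_sections(sections, chunk_token_num=128, delimiter=";", skip_short=True):
    # Assumption: each character of `delimiter` is its own delimiter.
    pattern = "|".join(re.escape(d) for d in delimiter)
    chunks, token_counts = [""], [0]

    def add_chunk(text):
        n = count_tokens(text)
        # Start a new chunk when the last one is empty or already over the limit.
        if chunks[-1] == "" or token_counts[-1] > chunk_token_num:
            chunks.append(text)
            token_counts.append(n)
        else:
            chunks[-1] += text
            token_counts[-1] += n

    for sec in sections:
        # The early check from the issue: short sections never reach re.split.
        if skip_short and count_tokens(sec) < chunk_token_num:
            add_chunk(sec)
            continue
        for piece in re.split(f"({pattern})", sec):
            if piece and not re.fullmatch(pattern, piece):
                add_chunk(piece)
    return [c for c in chunks if c]


sections = ["a b; c d", "e f; g h"]
with_check = merge_sections(sections, 5, ";", skip_short=True)
without_check = merge_sections(sections, 5, ";", skip_short=False)
```

In this model, with the check enabled the result is identical for `delimiter=";"` and `delimiter="#"`, because both sections are under the limit and the delimiter is never consulted. With the check disabled, the chunk boundary moves to a `;` position and the chunk count changes.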

Labels: 🐞 bug