UnboundLocalError and Loss of Data from Multiple Documents #89

Open
imene-swaan opened this issue Sep 19, 2024 · 1 comment

Description:

The multimodal export example (examples/export_multimodal.py) currently has two issues:

  1. UnboundLocalError: When no document is converted successfully, the rows list is never assigned, so normalizing the data into a DataFrame raises an UnboundLocalError.
  2. Loss of data from multiple documents: The rows list is reinitialized inside the loop that processes each document. Data from earlier documents is discarded, and only the data from the last converted document is kept.
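
Both symptoms can be reproduced without Docling. The following is a minimal sketch, not the example script itself: the dicts with "name", "ok", and "pages" keys are hypothetical stand-ins for conversion results, and collect_rows_buggy is an illustrative name.

```python
def collect_rows_buggy(docs):
    """Mimics the reported pattern: rows is created inside the document loop."""
    for doc in docs:
        if not doc["ok"]:  # hypothetical flag standing in for the status check
            continue
        rows = []  # re-created for every successfully converted document
        for page in doc["pages"]:
            rows.append({"doc": doc["name"], "page": page})
    # If no document passed the check, rows was never bound in this function,
    # so the next line raises UnboundLocalError.
    return rows


two_docs = [
    {"name": "a", "ok": True, "pages": [1, 2]},
    {"name": "b", "ok": True, "pages": [1]},
]
# Only the rows of the last converted document ("b") are returned.
last_only = collect_rows_buggy(two_docs)

try:
    collect_rows_buggy([{"name": "c", "ok": False, "pages": []}])
except UnboundLocalError as exc:
    error_message = str(exc)
```

Because rows is assigned somewhere inside the function, Python treats it as a local name, which is why the failure surfaces as UnboundLocalError rather than a plain NameError.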

Expected Behavior:

  • The rows list should accumulate the data from all successfully converted documents.
  • If no documents are successfully converted, the script should handle this gracefully and not raise an UnboundLocalError.

Suggested Fix:

  • Move the initialization of the rows list outside the loop so that it collects data from all documents.
  • Add a check before normalizing the rows into a DataFrame to ensure that the list is not empty.

Original code:

rows = []  # Bug: this line sits inside the per-document loop, so it resets for every document

for (
    content_text,
    content_md,
    content_dt,
    page_cells,
    page_segments,
    page,
) in generate_multimodal_pages(doc):
    # Rows are appended here, but this only keeps data for the current document
    ...

Suggested Fix:

# Initialize rows before the loop
rows = []

for doc in converted_docs:
    if doc.status != ConversionStatus.SUCCESS:
        continue  # Skip failed conversions (log the failure here)
    for (
        content_text,
        content_md,
        content_dt,
        page_cells,
        page_segments,
        page,
    ) in generate_multimodal_pages(doc):
        rows.append(...)  # rows now accumulates data from all documents
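
The snippet above covers the first bullet of the suggested fix. The second bullet, the check before normalizing, might look like the sketch below. It uses the same hypothetical stand-in dicts instead of real Docling results, and the write callback is an assumed placeholder for the DataFrame normalization and export step in the script.

```python
import logging

_log = logging.getLogger(__name__)


def collect_rows(docs):
    """Accumulate rows across all successfully converted documents."""
    rows = []  # initialized once, before the document loop
    for doc in docs:
        if not doc["ok"]:  # hypothetical flag standing in for the status check
            _log.warning("Skipping %s: conversion failed", doc["name"])
            continue
        for page in doc["pages"]:
            rows.append({"doc": doc["name"], "page": page})
    return rows


def export_rows(rows, write):
    """Call write(rows) only when there is data; return whether it ran."""
    if not rows:
        _log.warning("No documents were converted; nothing to export")
        return False
    write(rows)  # placeholder for the normalization and export step
    return True


docs = [
    {"name": "a", "ok": True, "pages": [1, 2]},
    {"name": "b", "ok": False, "pages": []},
    {"name": "c", "ok": True, "pages": [1]},
]
exported = []
rows = collect_rows(docs)
ran = export_rows(rows, exported.extend)
```

Returning a boolean from export_rows is one possible design choice; the script could equally log and exit early when rows is empty.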