Invalid argument error: all columns in a record batch must have the specified row count #3167

Open
ryzhyk opened this issue Jan 28, 2025 · 4 comments
Labels
bug Something isn't working mre-needed Whether an MRE needs to be provided

Comments

@ryzhyk

ryzhyk commented Jan 28, 2025

Environment

Delta-rs version: 0.22.2

Binding: Rust

Environment:

  • Cloud provider: AWS
  • OS: Linux
  • Other:

Bug

We ran into this error when reading from a managed Databricks Delta Table:

Invalid argument error: all columns in a record batch must have the specified row count

This happens when running the query select * from my_table via DataFusion. It occurs in a customer environment, so we don't have a reliable reproduction yet.

FWIW, I found a similar issue that someone ran into when reading from Iceberg, along with the explanation that the number of physical and logical records in a batch may not match:
apache/datafusion-comet#973

Is it possible that a similar issue exists with Delta?
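For context, as far as I can tell the message is raised by a length check of this kind when a record batch is built with an explicit row count. The sketch below is a simplified stand-in using only the standard library; `BatchError` and `check_row_count` are hypothetical names for illustration, not the arrow-rs API.

```rust
/// Simplified stand-in for the error type (hypothetical, for illustration only).
#[derive(Debug, PartialEq)]
pub enum BatchError {
    InvalidArgument(String),
}

/// Returns Ok(()) only when every column holds exactly `row_count` values.
/// Columns are modelled as plain vectors rather than Arrow arrays.
pub fn check_row_count(columns: &[Vec<i64>], row_count: usize) -> Result<(), BatchError> {
    if columns.iter().any(|c| c.len() != row_count) {
        return Err(BatchError::InvalidArgument(
            "all columns in a record batch must have the specified row count".to_string(),
        ));
    }
    Ok(())
}
```

If a reader reports 3 logical rows for a batch while the decoded columns physically hold only 2 values each, a check like this fails even though the columns agree with one another.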

Thanks!

@ryzhyk ryzhyk added the bug Something isn't working label Jan 28, 2025
@ion-elgreco
Collaborator

@ryzhyk please provide an MRE; it's not clear how you are reading Delta with DataFusion.

@ion-elgreco ion-elgreco added the mre-needed Whether an MRE needs to be provided label Jan 28, 2025
@ryzhyk
Author

ryzhyk commented Jan 29, 2025

Thanks for your reply @ion-elgreco. I will work with the customer on a reproduction. Like I said, we don't have access to the Delta table, so we can't reproduce it on our end. I was hoping that someone familiar with the code base might have some suggestions. Maybe there are some specific things we can ask the customer to check, e.g., could this happen if the table has deletion vectors enabled?
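One check the customer could run without sharing any data is to scan the JSON commit files under _delta_log for mentions of deletion vectors. A rough sketch, assuming the commit files are newline-delimited JSON as in the Delta protocol; the helper name is hypothetical, and this is a plain substring scan rather than a JSON parse, so it ignores Parquet checkpoints:

```rust
/// Returns true if any line of a Delta commit file mentions deletion vectors.
/// Hypothetical helper: substring matching only, no JSON parsing.
pub fn mentions_deletion_vectors(commit_json: &str) -> bool {
    commit_json.lines().any(|line| {
        // Table feature listed in the protocol action.
        line.contains("\"deletionVectors\"")
            // Per-file descriptor on add/remove actions.
            || line.contains("\"deletionVector\"")
            // Table property in the metaData action.
            || line.contains("delta.enableDeletionVectors")
    })
}
```

A positive result would only tell us the feature is in play, not that it causes the error.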

To query the Delta table using DataFusion, we use register_table to register the Delta table provider with DataFusion and then issue queries via:

let options: SQLOptions = SQLOptions::new()
    .with_allow_ddl(false)
    .with_allow_dml(false);

let df = self.datafusion.sql_with_options("select * from my_table", options).await?;

let mut stream = df.execute_stream().await?;

while let Some(batch) = stream.next().await {
    let batch = match batch {
        Ok(batch) => batch,
        Err(e) => {
            /* THE ERROR IS REPORTED HERE */
            return Err(e.into());
        }
    };
}

@ion-elgreco
Collaborator

The TableProvider uses ParquetExec underneath, so I think this is more likely an issue surfacing from DataFusion.

Have you confirmed whether you can read it properly with Polars or PyArrow datasets in Python?

@ryzhyk
Author

ryzhyk commented Jan 30, 2025

Nope, I don't have a way to do this yet. I'll try it if I can get my hands on a table that causes this error.
