Conversation

@Standing-Man
Contributor

@Standing-Man Standing-Man commented Oct 20, 2025

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

What's changed and what's your intention?

  • Add optional parameters to the COPY TABLE FROM command:
    • headers (default: false): Enforce validation that the imported CSV file matches the table’s schema, including column names and types.
    • continue_on_error (default: true): If a mismatch is detected between an imported CSV file and the table schema and continue_on_error is true, that file is skipped and the next CSV file is processed. If it is false, the import stops at the first mismatch.
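
The continue_on_error behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `import_files`, `ImportSummary`, and the `validate` callback are hypothetical names that stand in for the real per-file schema check.

```rust
/// Hypothetical summary of a multi-file import (not a GreptimeDB type).
#[derive(Debug, PartialEq)]
pub struct ImportSummary {
    pub imported: Vec<String>,
    pub skipped: Vec<String>,
}

/// Processes `files` in order. `validate` is an assumed stand-in for the
/// per-file schema check; it returns Err on a schema mismatch.
pub fn import_files<F>(
    files: &[&str],
    continue_on_error: bool,
    validate: F,
) -> Result<ImportSummary, String>
where
    F: Fn(&str) -> Result<(), String>,
{
    let mut summary = ImportSummary {
        imported: Vec::new(),
        skipped: Vec::new(),
    };
    for &file in files {
        match validate(file) {
            Ok(()) => summary.imported.push(file.to_string()),
            // Mismatch with continue_on_error = true: skip this file, keep going.
            Err(_) if continue_on_error => summary.skipped.push(file.to_string()),
            // Mismatch with continue_on_error = false: stop the whole import.
            Err(e) => return Err(format!("{file}: {e}")),
        }
    }
    Ok(summary)
}
```

With continue_on_error set to true, a mismatching file ends up in `skipped` while the remaining files are still imported; with false, the first mismatch is returned as an error.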

PR Checklist

Please convert it to a draft if some of the following conditions are not met.

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR requires documentation updates.
  • API changes are backward compatible.
  • Schema or data changes are backward compatible.

@Standing-Man Standing-Man requested a review from a team as a code owner October 20, 2025 03:02
@github-actions github-actions bot added size/XS docs-not-required This change does not impact docs. labels Oct 20, 2025
@WenyXu
Member

WenyXu commented Oct 20, 2025

Hi @Standing-Man, thank you for your contribution! Could you add some SQLness tests or integration tests under the tests-integration directory? You can find an example in tests-integration/src/test/instance_test.rs.

@Standing-Man
Contributor Author

Standing-Man commented Oct 20, 2025

Hi @WenyXu, I’m a bit confused — should header and has_header be consistent? Once the header exists, can we directly validate it? Also, how do we determine if a record is corrupted?

https://github.com/GreptimeTeam/greptimedb/blob/main/src/operator/src/statement/copy_table_from.rs#L234

So far, I see how a stream connects to a CSV source, but I’m unsure where the CSV reading and processing logic happens. Could you point me in the right direction?

@WenyXu
Member

WenyXu commented Oct 20, 2025

should header and has_header be consistent?

Yes

how do we determine if a record is corrupted?

It depends on the CSV reader’s capabilities. If users attempt to copy data from multiple files and one of them has a schema that differs from the target table, we should consider skipping data from that file. It would be best to review the CSV reader’s source code.

@fengjiachun fengjiachun requested a review from WenyXu October 21, 2025 03:31
@Standing-Man
Contributor Author

Hi @WenyXu, no matter whether has_header is true or false, the schema compatibility of the target table will be checked.

ensure_schema_compatible(&projected_file_schema, &projected_table_schema)?;

Also, the CSV reader reads rows from the source in batches, with a default batch size of DEFAULT_BATCH_SIZE.

.with_batch_size(DEFAULT_BATCH_SIZE);

I don’t think it’s necessary to add these two optional parameters to enhance the COPY TABLE FROM FILE (FORMAT='CSV') command.

@WenyXu
Member

WenyXu commented Oct 27, 2025

Hi @WenyXu, no matter whether has_header is true or false, the schema compatibility of the target table will be checked.

ensure_schema_compatible(&projected_file_schema, &projected_table_schema)?;

Yes, this is the expected behavior. The underlying CSV reader supports reading files without a header line — in such cases, the fields are named field_0, field_1, and so on. Therefore, we assume that the column order in the input CSV matches that of the target table, and align the CSV schema with the table schema before performing the compatibility check.
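
A rough sketch of the positional alignment described above, using plain structs rather than the real Arrow schema types. `Field` and `align_headerless` are hypothetical names, and rejecting a file with more fields than the table is an assumption of this sketch, not something settled in this thread.

```rust
/// Simplified stand-in for a schema field (not the Arrow `Field` type).
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: String,
}

/// Renames the positional fields of a header-less CSV (field_0, field_1, ...)
/// to the table's column names, assuming both use the same column order.
/// The file's inferred types are kept so a later check can compare them.
pub fn align_headerless(file: &[Field], table: &[Field]) -> Result<Vec<Field>, String> {
    // Assumption of this sketch: extra CSV fields cannot be mapped to a column.
    if file.len() > table.len() {
        return Err(format!(
            "CSV has {} fields but the table has only {} columns",
            file.len(),
            table.len()
        ));
    }
    Ok(file
        .iter()
        .zip(table)
        .map(|(f, t)| Field {
            name: t.name.clone(),
            data_type: f.data_type.clone(),
        })
        .collect())
}
```

After this step the aligned file schema carries the table's column names, so a name-and-type compatibility check can run on it as if the file had a header.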

@Standing-Man Standing-Man marked this pull request as draft November 10, 2025 08:42
@Standing-Man Standing-Man marked this pull request as ready for review November 11, 2025 07:22
@github-actions github-actions bot removed the size/XS label Nov 11, 2025
@Standing-Man Standing-Man force-pushed the copy-csv branch 4 times, most recently from f81849c to aa7812a Compare November 13, 2025 02:32
@Standing-Man
Contributor Author

@killme2008 and @WenyXu PTAL when you have time.

@Standing-Man Standing-Man requested a review from WenyXu November 21, 2025 12:16
Comment on lines 517 to +524
fn ensure_schema_compatible(from: &SchemaRef, to: &SchemaRef) -> Result<()> {
    if from.fields().len() != to.fields().len() {
        return error::InvalidHeaderSnafu {
            table_schema: to.to_string(),
            file_schema: from.to_string(),
        }
        .fail();
    }
Member

We can’t simply determine schema incompatibility based on field length. If the source schema is a subset of the target schema, they should be considered compatible.
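
To illustrate the point, here is a minimal sketch of a subset-aware check that looks columns up by name instead of comparing field counts. It uses `(name, type)` string pairs rather than `SchemaRef`, and `ensure_subset_compatible` is a hypothetical name, not the function in this PR.

```rust
/// Checks that every `(name, type)` pair in `from` also exists in `to` with
/// the same type. `from` may have fewer columns than `to`, and in any order.
pub fn ensure_subset_compatible(
    from: &[(&str, &str)],
    to: &[(&str, &str)],
) -> Result<(), String> {
    for &(name, ty) in from {
        match to.iter().find(|(n, _)| *n == name) {
            // Column exists in the target with a matching type.
            Some(&(_, expected)) if expected == ty => {}
            // Column exists but the types differ.
            Some(&(_, expected)) => {
                return Err(format!(
                    "column `{name}`: file type {ty} does not match table type {expected}"
                ));
            }
            // Column is unknown to the target table.
            None => return Err(format!("column `{name}` is not in the target table")),
        }
    }
    Ok(())
}
```

Under this check a file that contains only some of the table's columns passes, while an unknown column or a type mismatch is still rejected.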

Contributor Author

Should we implement a separate header checker method to verify whether the schema lengths of the file and the table match, and whether their names and types are consistent?

Member

Yes, and I suggest adding more tests to cover this feature, as there are many cases we need to handle.

Member

I think it would be better to add some test cases in tests-integration to cover the scenarios when header=false, for example:

  • The order of CSV fields differs from the table’s.
  • The number of CSV fields is fewer or greater than the table’s.

BTW, when testing with header=false, please make sure to use a CSV file without a header row.

Contributor Author

Good suggestions. If a CSV file doesn’t contain a header, can I skip checking the header?

Member

If a CSV file doesn’t contain a header, can I skip checking the header?

I think we should assume that the data fields in the CSV are in the same order as the table columns.

Labels

docs-not-required This change does not impact docs. size/S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enhance copy from csv file command

2 participants