feat: add optional parameters for copy table from command #7115

Standing-Man · 2025-10-20T03:02:12Z

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

Closes #6989 (Enhance copy from csv file command).

What's changed and what's your intention?

Add optional parameters to the COPY TABLE FROM command:
- headers (default: false): Enforce validation to ensure that the imported CSV file matches the table’s schema, including column names and types.
- continue_on_error (default: true): If a mismatch is detected between the imported CSV file and the table schema and continue_on_error is true, the file is skipped and the next CSV file is processed. Otherwise, the import stops.
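The skip-or-stop behavior described for continue_on_error can be sketched as follows. This is a minimal illustration and not the PR's code: `ImportReport`, `header_matches`, and `import_files` are hypothetical names, and the check only compares column names in order, whereas the description above also mentions types.

```rust
/// Hypothetical summary of which files were imported and which were skipped.
#[derive(Debug, PartialEq)]
pub struct ImportReport {
    pub imported: Vec<String>,
    pub skipped: Vec<String>,
}

/// Hypothetical header check: the file header must list the table's
/// column names in the same order (types are ignored in this sketch).
pub fn header_matches(file_header: &[&str], table_columns: &[&str]) -> bool {
    file_header == table_columns
}

/// Walks the files in order. A mismatching file is skipped when
/// `continue_on_error` is true; otherwise the whole import stops with an error.
pub fn import_files(
    files: &[(&str, Vec<&str>)],
    table_columns: &[&str],
    continue_on_error: bool,
) -> Result<ImportReport, String> {
    let mut report = ImportReport {
        imported: Vec::new(),
        skipped: Vec::new(),
    };
    for (name, header) in files {
        if header_matches(header, table_columns) {
            report.imported.push(name.to_string());
        } else if continue_on_error {
            report.skipped.push(name.to_string());
        } else {
            return Err(format!("header of {name} does not match the table schema"));
        }
    }
    Ok(report)
}
```

With three files where only the second one has a mismatching header, `continue_on_error = true` yields a report with two imported files and one skipped file, while `continue_on_error = false` returns an error at the second file.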

PR Checklist

Please convert it to a draft if some of the following conditions are not met.

I have written the necessary rustdoc comments.
I have added the necessary unit tests and integration tests.
This PR requires documentation updates.
API changes are backward compatible.
Schema or data changes are backward compatible.

WenyXu · 2025-10-20T06:12:28Z

Hi @Standing-Man, thank you for your contribution! Could you add some SQLness tests or integration tests under the tests-integration directory? You can find an example in tests-integration/src/test/instance_test.rs.

Standing-Man · 2025-10-20T08:59:06Z

Hi @WenyXu, I’m a bit confused — should header and has_header be consistent? Once the header exists, can we directly validate it? Also, how do we determine if a record is corrupted?

https://github.com/GreptimeTeam/greptimedb/blob/main/src/operator/src/statement/copy_table_from.rs#L234

So far, I see how a stream connects to a CSV source, but I’m unsure where the CSV reading and processing logic happens. Could you point me in the right direction?

WenyXu · 2025-10-20T09:13:37Z

> should header and has_header be consistent?

Yes

> how do we determine if a record is corrupted?

It depends on the CSV reader’s capabilities. If users attempt to copy data from multiple files and one of them has a schema that differs from the target table, we should consider skipping data from that file. It would be best to review the CSV reader’s source code.

Standing-Man · 2025-10-26T08:07:27Z

Hi @WenyXu, no matter whether has_header is true or false, the schema compatibility of the target table will be checked.

greptimedb/src/operator/src/statement/copy_table_from.rs

Line 405 in d8563ba

ensure_schema_compatible(&projected_file_schema, &projected_table_schema)?;

Also, the CSV reader reads rows from the source in batches, with a default batch size of DEFAULT_BATCH_SIZE.

greptimedb/src/operator/src/statement/copy_table_from.rs

Line 247 in d8563ba

.with_batch_size(DEFAULT_BATCH_SIZE);

I don’t think it’s necessary to add these two optional parameters to enhance the COPY TABLE FROM FILE (FORMAT='CSV') command.
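As a rough illustration of the batched reading mentioned above, the sketch below groups already-split rows into fixed-size batches. It uses only the standard library; `read_in_batches` is a hypothetical helper, and the batch size is passed in because the value of DEFAULT_BATCH_SIZE is not stated in this thread.

```rust
/// Groups rows into batches of at most `batch_size` rows each.
/// The last batch may be shorter. `batch_size` must be non-zero,
/// because `slice::chunks` panics on a chunk size of 0.
pub fn read_in_batches<'a>(rows: &'a [&'a str], batch_size: usize) -> Vec<&'a [&'a str]> {
    rows.chunks(batch_size).collect()
}
```

Five rows with a batch size of two produce three batches, the last holding a single row.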

WenyXu · 2025-10-27T02:53:04Z

> Hi @WenyXu, no matter whether has_header is true or false, the schema compatibility of the target table will be checked.
>
> greptimedb/src/operator/src/statement/copy_table_from.rs
>
> Line 405 in d8563ba
>
> ensure_schema_compatible(&projected_file_schema, &projected_table_schema)?;

Yes, this is the expected behavior. The underlying CSV reader supports reading files without a header line — in such cases, the fields are named field_0, field_1, and so on. Therefore, we assume that the column order in the input CSV matches that of the target table, and align the CSV schema with the table schema before performing the compatibility check.
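The positional alignment described above can be sketched like this. It is an assumption-laden illustration, not the code in copy_table_from.rs: `Field`, `headerless_fields`, and `align_by_position` are hypothetical, and data types are plain strings for brevity.

```rust
/// Hypothetical, simplified stand-in for a schema field.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: String,
}

/// Builds the placeholder names a headerless file gets: field_0, field_1, ...
pub fn headerless_fields(types: &[&str]) -> Vec<Field> {
    types
        .iter()
        .enumerate()
        .map(|(i, t)| Field {
            name: format!("field_{i}"),
            data_type: t.to_string(),
        })
        .collect()
}

/// Renames file fields to the table's column names by position, keeping the
/// file's data types so a later compatibility check can still compare them.
/// `zip` stops at the shorter side, so surplus fields are dropped here.
pub fn align_by_position(file: &[Field], table: &[Field]) -> Vec<Field> {
    file.iter()
        .zip(table.iter())
        .map(|(f, t)| Field {
            name: t.name.clone(),
            data_type: f.data_type.clone(),
        })
        .collect()
}
```

After alignment, the file schema carries the table's names, so a name-based compatibility check can run on it unchanged.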

src/operator/src/statement/copy_table_from.rs

src/common/datasource/src/file_format/csv.rs

Standing-Man · 2025-11-13T10:55:16Z

@killme2008 and @WenyXu PTAL when you have time.

tests/cases/standalone/common/copy/copy_from_fs_csv.sql

src/common/datasource/src/file_format/csv.rs


WenyXu · 2025-11-24T06:19:05Z

src/operator/src/statement/copy_table_from.rs

 fn ensure_schema_compatible(from: &SchemaRef, to: &SchemaRef) -> Result<()> {
+    if from.fields().len() != to.fields().len() {
+        return error::InvalidHeaderSnafu {
+            table_schema: to.to_string(),
+            file_schema: from.to_string(),
+        }
+        .fail();
+    }


We can’t simply determine schema incompatibility based on field length. If the source schema is a subset of the target schema, they should be considered compatible.

Standing-Man · 2025-11-24

Should we implement a separate header checker method to verify whether the schema lengths of the file and the table match, and whether their names and types are consistent?

WenyXu · 2025-11-24

Yes, and I suggest adding more tests to cover this feature, as there are many cases we need to handle.
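To make the subset point concrete, here is a minimal sketch of a check that accepts a file schema whose columns are a subset of the table's, instead of comparing field counts. `ensure_subset_compatible` is a hypothetical function over (name, type) string pairs and is not the project's ensure_schema_compatible.

```rust
use std::collections::HashMap;

/// Accepts `from` when every one of its columns exists in `to` with the
/// same type. `to` may contain additional columns.
pub fn ensure_subset_compatible(
    from: &[(&str, &str)],
    to: &[(&str, &str)],
) -> Result<(), String> {
    // Index the target schema by column name.
    let target: HashMap<&str, &str> = to.iter().copied().collect();
    for (name, ty) in from {
        match target.get(name) {
            Some(expected) if expected == ty => {}
            Some(expected) => {
                return Err(format!(
                    "column {name}: file has type {ty}, table has {expected}"
                ))
            }
            None => return Err(format!("column {name} is not in the table")),
        }
    }
    Ok(())
}
```

A file with only `host` and `cpu` passes against a table with `host`, `ts`, and `cpu`, while an unknown column or a type mismatch is rejected.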

WenyXu · 2025-11-24T06:23:17Z

tests/cases/standalone/common/copy/copy_from_csv_parameters.result

I think it would be better to add some test cases in tests-integration to cover the scenarios when header=false, for example:

The order of CSV fields differs from the table’s.

The number of CSV fields is fewer or greater than the table’s.

BTW, when testing with header=false, please make sure to use a CSV file without a header row.

Standing-Man · 2025-11-24

Good suggestions. If a CSV file doesn’t contain a header, can I skip checking the header?

WenyXu · 2025-11-24

> If a CSV file doesn’t contain a header, can I skip checking the header?

I think we should assume that the data fields in the CSV are in the same order as the table columns.
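For the header=false scenarios listed above, a test helper could first classify a data row by its field count relative to the table. The sketch below is hypothetical (`FieldCount` and `compare_field_count` are not from the PR) and splits naively on commas, so it ignores quoted fields. Which outcomes should be rejected is left to the PR.

```rust
use std::cmp::Ordering;

/// How a row's field count relates to the table's column count.
#[derive(Debug, PartialEq)]
pub enum FieldCount {
    Same,
    Fewer,
    Greater,
}

/// Naive comma split: quoted fields containing commas are not handled.
pub fn compare_field_count(row: &str, table_column_count: usize) -> FieldCount {
    match row.split(',').count().cmp(&table_column_count) {
        Ordering::Equal => FieldCount::Same,
        Ordering::Less => FieldCount::Fewer,
        Ordering::Greater => FieldCount::Greater,
    }
}
```

A field-order mismatch cannot be detected this way in a headerless file, since there are no names to compare. It would only surface if the swapped values fail to parse as the column types.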

Standing-Man requested a review from a team as a code owner October 20, 2025 03:02

github-actions bot added size/XS docs-not-required This change does not impact docs. labels Oct 20, 2025

fengjiachun requested a review from WenyXu October 21, 2025 03:31

Standing-Man marked this pull request as draft November 10, 2025 08:42

Standing-Man force-pushed the copy-csv branch from a10f0df to a858a14 Compare November 11, 2025 02:48

github-actions bot added the size/S label Nov 11, 2025

Standing-Man marked this pull request as ready for review November 11, 2025 07:22

github-actions bot removed the size/XS label Nov 11, 2025


Standing-Man force-pushed the copy-csv branch 4 times, most recently from f81849c to aa7812a Compare November 13, 2025 02:32


WenyXu reviewed Nov 14, 2025


github-actions bot added size/M and removed size/S labels Nov 15, 2025

Standing-Man force-pushed the copy-csv branch from 7866972 to d9d7217 Compare November 16, 2025 09:08

github-actions bot added size/S and removed size/M labels Nov 16, 2025

Standing-Man force-pushed the copy-csv branch 3 times, most recently from 9ec7db0 to c5efb9b Compare November 19, 2025 05:07

Standing-Man added 11 commits November 21, 2025 18:18

feat: add optional parameters for copy table from command

ca99158

Signed-off-by: Alan Tang <[email protected]>

feat(csv): support has_header and skip_bad_record options, and add related test cases

d5ca4fc

Signed-off-by: Alan Tang <[email protected]>

fix(tests): correct the results of test cases

1cc9a3a

Signed-off-by: Alan Tang <[email protected]>

feat: add continue_on_error option to skip files with invalid headers

07e7e34

Signed-off-by: Alan Tang <[email protected]>

refactor: update the approach for accessing optional parameters

6318c7a

Signed-off-by: Alan Tang <[email protected]>

fix: fix the integration test error

6c85d33

Signed-off-by: Alan Tang <[email protected]>

chore: revert the copy_from_fs_csv test

f298625

Signed-off-by: StandingMan <[email protected]>

chore: change the default value of header to false

369e738

Signed-off-by: StandingMan <[email protected]>

chore: revert 'fix: fix the integration test error'

abb331d

Signed-off-by: StandingMan <[email protected]>

chore: remove redundant modules

1e5c71e

Signed-off-by: StandingMan <[email protected]>

feat: remove the skip_bad_records parameter

e38c534

Signed-off-by: StandingMan <[email protected]>

Standing-Man force-pushed the copy-csv branch from c5efb9b to e38c534 Compare November 21, 2025 10:45

feat: improve JSON type validation in schema headers

721df07

Signed-off-by: StandingMan <[email protected]>

Standing-Man requested a review from WenyXu November 21, 2025 12:16

WenyXu reviewed Nov 24, 2025
