WIP: Bulk loads #163
Missing: when using strings, we must have an API to set the collation. If it's not set, we just get an error.
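For illustration, the missing API could look something like this minimal sketch; `TypeInfo::nvarchar`, `with_collation`, and `Collation` are all hypothetical names, since this is exactly the part that doesn't exist yet:

```rust
// Hypothetical sketch only: none of these collation APIs exist in the PR.
let mut meta = BulkLoadMetadata::new();

// String columns need their collation spelled out explicitly; without
// it the server rejects the column metadata with an error.
meta.add_column(
    "name",
    TypeInfo::nvarchar(255).with_collation(Collation::default()),
);
```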
Ok, so I had a bit more time to do research here. I know how to do the actual feature, but making an API that doesn't randomly give you errors that are really nasty to solve is not that easy. One example: nullable columns. First, a bulk load that works with what we have in the pull request:

```rust
let mut client = Client::connect(config, tcp.compat_write()).await?;

client
    .execute(
        "CREATE TABLE ##bulk_test1 (id INT IDENTITY PRIMARY KEY, content INT NOT NULL)",
        &[],
    )
    .await?;

// The metadata must describe every column we are about to send.
let mut meta = BulkLoadMetadata::new();
meta.add_column("content", TypeInfo::int());

let mut req = client.bulk_insert("##bulk_test1", meta).await?;
let count = 2000i32;

for i in 0..count {
    let mut row = TokenRow::new();
    row.push(i.into_sql());
    req.send(row).await?;
}

dbg!(req.finalize().await?);
```

Change the column to allow NULLs, though, and the whole thing breaks catastrophically with a cryptic error. So, then we can imagine the API to be something like:

```rust
meta.add_column("content", TypeInfo::int(), true);
```

or

```rust
meta.add_column("content", TypeInfo::int(), ColumnFlag::Nullable.into());
```

... but I can already sense how many issues will get opened for mystical errors when you forget to set the correct flags for your metadata.

In the end, what I would love to see with this API is something that the dear rustc can catch at the type level. E.g. start a bulk request with a trait-backed struct that tells the types, nullability, collation info and what not. Then we could imagine something like:

```rust
#[tiberius]
pub struct MyData {
    value: i32,
    nullable_integer: Option<i32>,
    text_value: Option<String>,
}
```

This would then be used in the client when starting a new bulk request:

```rust
let req: BulkLoadRequest<MyData> = client.bulk_insert("my_data_table").await?;
req.send(MyData { value: 3, nullable_integer: Some(32), text_value: None });
```

Behind the scenes this would then handle the correct order of columns and the correct typing and nullability info. First, though, there is a lot of work just detecting what types work and what do not. The question of collations is also something I'd like to find an answer to. Can we just use the …
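As a sketch of what such a trait-backed design could look like (nothing here exists in tiberius; the trait name and methods are assumptions, reusing the `BulkLoadMetadata` and `TokenRow` types from this PR):

```rust
// Hypothetical sketch: a trait the #[tiberius] derive macro could generate
// an implementation of. Nothing here is part of the actual PR.
pub trait IntoBulkRow {
    /// Column names, types, and nullability, in field declaration order.
    fn metadata() -> BulkLoadMetadata;

    /// Turn one value into a wire row matching `metadata()`.
    fn into_token_row(self) -> TokenRow;
}
```

With a bound like `BulkLoadRequest<T: IntoBulkRow>`, a forgotten nullability flag would become a compile error instead of a cryptic server response.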
I think that what you have here so far is a great starting point! (Though I won't be using it for a real project in the near future.)
Note that this error is about a mismatch in structure between a database table and the bulk load request's metadata. Relying on Rust's type system won't be able to help with that if one needs to load into a table that already exists. The …
I'm thinking more of something that prevents us from sending data to the bulk load that contradicts the description given in the headers, and that can be detected at compile time. Right now the PR has an API that requires a lot of tweaking in a few places every time you want to change something in the data types. What would for sure be possible is, for example, having two connections, the other one giving back a … One API that might also be worth a try would be something that queries a table and returns the metadata to be used in a bulk load. In the end the feature is not that hard to implement, but making a robust API that's worth the upcoming Tiberius 1.0 is something that I'd like to get done before merging anything...
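As a rough illustration of that "query the table, get the metadata back" idea (the catalog query and the metadata mapping are my assumptions; only `Client::query` and `into_first_result` are existing tiberius calls):

```rust
// Sketch: read column names, types, and nullability from the catalog so
// the bulk-load metadata can be derived from the actual target table.
let rows = client
    .query(
        "SELECT COLUMN_NAME, DATA_TYPE, IS_NULLABLE \
         FROM INFORMATION_SCHEMA.COLUMNS \
         WHERE TABLE_NAME = @P1 \
         ORDER BY ORDINAL_POSITION",
        &[&"my_data_table"],
    )
    .await?
    .into_first_result()
    .await?;

let mut meta = BulkLoadMetadata::new();

for row in &rows {
    let name: &str = row.get("COLUMN_NAME").unwrap();
    let nullable = row.get::<&str, _>("IS_NULLABLE") == Some("YES");
    // Mapping DATA_TYPE to a TypeInfo (and carrying `nullable` into the
    // metadata) is exactly the API question this thread is debating.
    let _ = (name, nullable);
}
```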
Yep, apparently that's how the MSSQL ODBC driver works ("When a data file is specified, bcp_init examines the structure of the database source or target table, not the data file."; here's how FreeTDS does it). I started implementing bulk load from CSV via arrow2 in nickolay@2c0e5e5#diff-705aa08e944f5e9851ee12e3ad6b491a763150b42e9cf3ecde25fc84a7e32a9eR36, but got stuck with lifetime issues (should …
@pimeys I think your implementation has most of the pieces. To your concern: in my scenario I will just get the column info from the SQL database, i.e., assume the table exists before the bulk upload, and trigger a SQL query to fill in the column metadata. Whether the column should be nullable or not is already decided by the table. I will take this PR and add some adaptor to enable loading the schema. You ok with that?
Regarding data types: why not use the existing Column/Row structures, rather than defining a new set of structures, MetaDatacolumn and TokenRow?
A bit more research on this: it seems retrieving the column metadata from the table first is a common practice (the MS C# lib follows a similar pattern). I will go ahead with the following:

…

@pimeys, please let me know if you are ok with it.
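Under that plan, usage could end up looking something like the sketch below, where `schema_for_table` is an imagined adaptor that runs the catalog query and builds the metadata; only the bulk-insert calls come from this PR:

```rust
// Hypothetical flow: derive the bulk-load metadata from the existing
// table, so nullability and types always match the target.
// `schema_for_table` does not exist; it stands in for the adaptor.
let meta = schema_for_table(&mut client, "##bulk_test1").await?;
let mut req = client.bulk_insert("##bulk_test1", meta).await?;

for i in 0..2000i32 {
    let mut row = TokenRow::new();
    row.push(i.into_sql());
    req.send(row).await?;
}

req.finalize().await?;
```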
So, here's my attempt at continuing the work of @nickolay, implementing an interface for bulk loads.
The interface here allows you to efficiently store rows without loading them all into memory.
Would people who need this feature be willing to test it and extend it? I see a need for tests and for fixing possible issues we might face. Also, I'm not completely sure about the interface, so it would be great to have some opinions on that.
Closes: #104